Merge pull request #174 from NVIDIA/reorg_conf
Reorg conf directory
TaekyungHeo authored Aug 15, 2024
2 parents 570181a + ca678b2 commit b13bafe
Showing 55 changed files with 32 additions and 28 deletions.
38 changes: 19 additions & 19 deletions README.md
@@ -68,38 +68,38 @@ Please make sure to use the correct system configuration file that corresponds t
```bash
cloudai\
--mode install\
- --system-config conf/system/example_slurm_cluster.toml\
- --test-templates-dir conf/test_template\
- --tests-dir conf/tests
+ --system-config conf/common/system/example_slurm_cluster.toml\
+ --test-templates-dir conf/common/test_template\
+ --tests-dir conf/common/tests
```

To simulate running experiments without execution, use the dry-run mode:
```bash
cloudai\
--mode dry-run\
- --system-config conf/system/example_slurm_cluster.toml\
- --test-templates-dir conf/test_template\
- --tests-dir conf/tests\
- --test-scenario conf/test_scenario/sleep.toml
+ --system-config conf/common/system/example_slurm_cluster.toml\
+ --test-templates-dir conf/common/test_template\
+ --tests-dir conf/common/tests\
+ --test-scenario conf/common/test_scenario/sleep.toml
```

To run experiments, execute CloudAI CLI in run mode:
```bash
cloudai\
--mode run\
- --system-config conf/system/example_slurm_cluster.toml\
- --test-templates-dir conf/test_template\
- --tests-dir conf/tests\
- --test-scenario conf/test_scenario/sleep.toml
+ --system-config conf/common/system/example_slurm_cluster.toml\
+ --test-templates-dir conf/common/test_template\
+ --tests-dir conf/common/tests\
+ --test-scenario conf/common/test_scenario/sleep.toml
```

To generate reports, execute CloudAI CLI in generate-report mode:
```bash
cloudai\
--mode generate-report\
- --system-config conf/system/example_slurm_cluster.toml\
- --test-templates-dir conf/test_template\
- --tests-dir conf/tests\
+ --system-config conf/common/system/example_slurm_cluster.toml\
+ --test-templates-dir conf/common/test_template\
+ --tests-dir conf/common/tests\
--output-dir /path/to/output_directory
```
In the generate-report mode, use the --output-dir argument to specify a subdirectory under the result directory.
@@ -109,13 +109,13 @@ To uninstall test templates, run CloudAI CLI in uninstall mode:
```bash
cloudai\
--mode uninstall\
- --system-config conf/system/example_slurm_cluster.toml\
- --test-templates-dir conf/test_template\
- --tests-dir conf/tests
+ --system-config conf/common/system/example_slurm_cluster.toml\
+ --test-templates-dir conf/common/test_template\
+ --tests-dir conf/common/tests
```
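
The invocations in this hunk share the same three configuration arguments and differ only in `--mode`, `--test-scenario`, and `--output-dir`. As a minimal sketch, assuming the reorganized `conf/common` layout shown in the added lines, the argument list could be assembled as follows. The helper name `build_cloudai_args` and the `CONF_ROOT` constant are hypothetical and not part of CloudAI; only the flags and paths come from the README text.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Root of the reorganized configuration tree, taken from the paths in the diff.
CONF_ROOT = PurePosixPath("conf/common")


def build_cloudai_args(
    mode: str,
    test_scenario: Optional[str] = None,
    output_dir: Optional[str] = None,
    conf_root: PurePosixPath = CONF_ROOT,
) -> List[str]:
    """Hypothetical helper: assemble the cloudai argument list for one mode."""
    args = [
        "cloudai",
        "--mode", mode,
        "--system-config", str(conf_root / "system" / "example_slurm_cluster.toml"),
        "--test-templates-dir", str(conf_root / "test_template"),
        "--tests-dir", str(conf_root / "tests"),
    ]
    # Only dry-run and run take a scenario in the README examples.
    if test_scenario is not None:
        args += ["--test-scenario", str(conf_root / "test_scenario" / test_scenario)]
    # Only generate-report takes an output directory in the README examples.
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    return args


if __name__ == "__main__":
    print(" ".join(build_cloudai_args("dry-run", test_scenario="sleep.toml")))
```

Keeping the root in one constant means a later directory move touches a single line rather than every documented command.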

- # Contributing
+ ## Contributing
Feel free to contribute to the CloudAI project. Your contributions are highly appreciated.

- # License
+ ## License
This project is licensed under Apache 2.0. See the LICENSE file for detailed information.
10 changes: 5 additions & 5 deletions USER_GUIDE.md
@@ -48,15 +48,15 @@ CloudAI allows users to package workloads as test templates to facilitate the au
```

#### Step 2: Prepare configuration files
- CloudAI is fully configurable via set of TOML configuration files. You can find examples of these files under `conf/`. In this guide, we will use the following configuration files:
+ CloudAI is fully configurable via set of TOML configuration files. You can find examples of these files under `conf/common`. In this guide, we will use the following configuration files:
1. `myconfig/test_templates/nccl_template.toml` - Describes the test template configuration.
1. `myconfig/system.toml` - Describes the system configuration.
1. `myconfig/tests/nccl_test.toml` - Describes the test to run.
1. `myconfig/scenario.toml` - Describes the test scenario configuration.


#### Step 3: Test Template
- Test template config describes all arguments of a test. Let's create a test template file for the NCCL test. You can find more examples of test templates under `conf/test_template/`. Our example will be small for demonstration purposes. Below is the `myconfig/test_templates/nccl_template.toml` file:
+ Test template config describes all arguments of a test. Let's create a test template file for the NCCL test. You can find more examples of test templates under `conf/common/test_template/`. Our example will be small for demonstration purposes. Below is the `myconfig/test_templates/nccl_template.toml` file:
```toml
name = "NcclTest"
@@ -93,7 +93,7 @@ name = "NcclTest"
Notice that `cmd_args.docker_image_url` uses `nvcr.io/nvidia/pytorch:24.02-py3`, but you can use Docker image from Step 1.
#### Step 3: System Config
- System config describes the system configuration. You can find more examples of system configs under `conf/system/`. Our example will be small for demonstration purposes. Below is the `myconfig/system.toml` file:
+ System config describes the system configuration. You can find more examples of system configs under `conf/common/system/`. Our example will be small for demonstration purposes. Below is the `myconfig/system.toml` file:
```toml
name = "my-cluster"
scheduler = "slurm"
@@ -139,7 +139,7 @@ extra_cmd_args = "--stepfactor 2"
"iters" = "5"
"warmup_iters" = "3"
```
- You can find more examples under `conf/test`. In a test schema file, you can adjust arguments as shown above. In the `cmd_args` section, you can provide different values other than the default values for each argument. In `extra_cmd_args`, you can provide additional arguments that will be appended after the NCCL test command. You can specify additional environment variables in the `extra_env_vars` section.
+ You can find more examples under `conf/common/test`. In a test schema file, you can adjust arguments as shown above. In the `cmd_args` section, you can provide different values other than the default values for each argument. In `extra_cmd_args`, you can provide additional arguments that will be appended after the NCCL test command. You can specify additional environment variables in the `extra_env_vars` section.
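
The paragraph above describes three layers: template defaults, per-test `cmd_args` overrides, and `extra_cmd_args` appended at the end. The sketch below illustrates that ordering only. It is not CloudAI's actual rendering code: the function `render_test_command`, the binary name `nccl_test_binary`, and the default values `20` and `5` are assumptions made up for illustration, while `iters`, `warmup_iters`, and `--stepfactor 2` come from the guide's example.

```python
from typing import Dict, List


def render_test_command(
    binary: str,
    default_args: Dict[str, str],
    cmd_args: Dict[str, str],
    extra_cmd_args: str = "",
) -> str:
    """Hypothetical rendering: defaults, then overrides, then extra arguments."""
    # Values from the test's cmd_args replace the template defaults key by key.
    merged = {**default_args, **cmd_args}
    parts: List[str] = [binary]
    for name, value in merged.items():
        parts.append(f"--{name} {value}")
    # extra_cmd_args is appended verbatim after the generated arguments.
    if extra_cmd_args:
        parts.append(extra_cmd_args)
    return " ".join(parts)


if __name__ == "__main__":
    command = render_test_command(
        binary="nccl_test_binary",  # placeholder name, not a real executable
        default_args={"iters": "20", "warmup_iters": "5"},  # made-up defaults
        cmd_args={"iters": "5", "warmup_iters": "3"},
        extra_cmd_args="--stepfactor 2",
    )
    print(command)
    # prints: nccl_test_binary --iters 5 --warmup_iters 3 --stepfactor 2
```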
#### Step 6: Run Experiments
Test Scenario uses Test description from the previous step. Below is the `myconfig/scenario.toml` file:
@@ -361,7 +361,7 @@ You can update the fields to adjust the behavior. For example, you can update th
### Note: For running Nemo Llama model, it is important to follow these additional steps:
1. Go to https://huggingface.co/docs/transformers/en/model_doc/llama#usage-tips.
2. Follow the instructions under 'Usage Tips' on how to download the tokenizer.
- 3. Replace "training.model.tokenizer.model=TOKENIZER_MODEL" with "training.model.tokenizer.model=YOUR_TOKENIZER_PATH" (the tokenizer should be a .model file) in conf/general/test/llama.toml.
+ 3. Replace "training.model.tokenizer.model=TOKENIZER_MODEL" with "training.model.tokenizer.model=YOUR_TOKENIZER_PATH" (the tokenizer should be a .model file) in conf/common/test/llama.toml.

## Troubleshooting
In this section, we will guide you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow the USER_GUIDE.md and README.md for installation, adding test templates, tests, and test scenarios.
File renamed without changes.
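
The path changes visible in this commit follow one pattern: a directory directly under `conf/` moves under `conf/common/`. For scripts that still reference the old layout, a small rewrite helper might look like the sketch below. The function `migrate_conf_path` and the set `MOVED_SUBDIRS` are hypothetical; the subdirectory names are only those that appear in the changed lines of this diff, so the set may be incomplete.

```python
from pathlib import PurePosixPath

# Subdirectories seen moving from conf/<name> to conf/common/<name> in this diff.
# Assumption: this list is inferred from the changed lines, not from the repository tree.
MOVED_SUBDIRS = {"system", "test", "tests", "test_template", "test_scenario"}


def migrate_conf_path(old: str) -> str:
    """Hypothetical helper: map an old conf path to the reorganized layout."""
    parts = PurePosixPath(old).parts
    if len(parts) >= 2 and parts[0] == "conf" and parts[1] in MOVED_SUBDIRS:
        return str(PurePosixPath("conf", "common", *parts[1:]))
    # Paths that are already migrated, or outside the known set, are returned unchanged.
    return old


if __name__ == "__main__":
    for path in (
        "conf/system/example_slurm_cluster.toml",
        "conf/test_scenario/sleep.toml",
        "conf/common/test_template",
    ):
        print(path, "->", migrate_conf_path(path))
```

Restricting the rewrite to a known set keeps the helper from touching unrelated paths such as `conf/common/...`, which must not gain a second `common` component.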
12 changes: 8 additions & 4 deletions tests/test_acceptance.py
@@ -27,8 +27,12 @@
  from cloudai.systems.slurm import SlurmNode, SlurmNodeState

  SLURM_TEST_SCENARIOS = [
-     {"path": Path("conf/test_scenario/sleep.toml"), "expected_dirs_number": 4, "log_file": "sleep_debug.log"},
-     {"path": Path("conf/test_scenario/ucc_test.toml"), "expected_dirs_number": 5, "log_file": "ucc_test_debug.log"},
+     {"path": Path("conf/common/test_scenario/sleep.toml"), "expected_dirs_number": 4, "log_file": "sleep_debug.log"},
+     {
+         "path": Path("conf/common/test_scenario/ucc_test.toml"),
+         "expected_dirs_number": 5,
+         "log_file": "ucc_test_debug.log",
+     },
  ]


@@ -39,8 +43,8 @@ def test_slurm(tmp_path: Path, scenario: Dict):
      log_file = scenario.get("log_file")
      log_file_path = tmp_path / str(log_file)

-     parser = Parser(Path("conf/system/example_slurm_cluster.toml"), Path("conf/test_template"))
-     system, tests, test_scenario = parser.parse(Path("conf/test"), test_scenario_path)
+     parser = Parser(Path("conf/common/system/example_slurm_cluster.toml"), Path("conf/common/test_template"))
+     system, tests, test_scenario = parser.parse(Path("conf/common/test"), test_scenario_path)
      system.output_path = str(tmp_path)
      assert test_scenario is not None, "Test scenario is None"
      setup_logging(str(log_file_path), "DEBUG")