Releases: NVIDIA/cloudai
v0.9.beta4
What's Changed
- Fix Bug in DockerImageCacheResult to Correctly Retrieve Absolute Paths by @TaekyungHeo in #212
- Add Support for Cluster Account and gpus_per_node in Command Generation by @TaekyungHeo in #210
Full Changelog: v0.9.beta3...v0.9.beta4
v0.9.beta3
What's Changed
- Extend CI for high level use cases by @amaslenn in #204
- Replace 'training.values' Key with 'training' in final_cmd_args by @TaekyungHeo in #209
- Use System Install Path Instead of Local Member Variables by @TaekyungHeo in #206
- Remove Unused env_vars From Initialization Code by @TaekyungHeo in #207
- Remove Config File Handling and TOML Dependency from SlurmInstaller by @TaekyungHeo in #211
Full Changelog: v0.9.beta2...v0.9.beta3
v0.9.beta2
Release notes
We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test configs. This is a continuation of #158.
- Test Template TOML files were replaced with Pydantic models. This enforces mandatory arguments and their types, and requires less code to maintain.
- The --test-templates-dir option was removed from all commands. All supported tests are registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.
- Test TOML files now take advantage of the standard TOML format for all known arguments.
Before:
[cmd_args]
"training" = "llama/llama2_70b"
"training.trainer.max_steps" = "120"
"training.model.global_batch_size" = "256"
"training.model.pipeline_model_parallel_size" = "1"
Now:
[cmd_args]
[cmd_args.training]
values = "llama/llama2_70b"
[cmd_args.training.trainer]
max_steps = "120"
[cmd_args.training.model]
global_batch_size = "256"
pipeline_model_parallel_size = "2"
- extra_cmd_args was converted from str to dict[str, str].
Before:
extra_cmd_args = "--stepfactor 2"
Now:
[extra_cmd_args]
"--stepfactor" = "2"
- A new mode was added to verify that Test TOMLs are valid:
cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml
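To make the two Test TOML format changes concrete, here is a minimal Python sketch of how the nested cmd_args tables relate to the old dotted keys, and how an old extra_cmd_args string could be turned into the new dict form. The helper names flatten_cmd_args and extra_args_to_dict are hypothetical and not part of CloudAI. The sketch assumes that a "values" key stands for its parent table, and that every extra flag starts with "--" and takes at most one value.

```python
import shlex


def flatten_cmd_args(nested: dict, prefix: str = "") -> dict[str, str]:
    """Flatten nested cmd_args tables into dotted keys.

    Assumption: a "values" key stands for the parent table itself, so
    cmd_args.training.values becomes the plain "training" key.
    """
    flat: dict[str, str] = {}
    for key, value in nested.items():
        if key == "values" and prefix:
            dotted = prefix
        else:
            dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_cmd_args(value, dotted))
        else:
            flat[dotted] = value
    return flat


def extra_args_to_dict(extra: str) -> dict[str, str]:
    """Convert an old-style extra_cmd_args string into flag/value pairs.

    Assumption: flags start with "--"; a flag without a value maps to "".
    """
    tokens = shlex.split(extra)
    result: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        result[flag] = tokens[i + 1] if has_value else ""
        i += 2 if has_value else 1
    return result


# Parsed form of the new-style cmd_args table from the release notes.
new_cmd_args = {
    "training": {
        "values": "llama/llama2_70b",
        "trainer": {"max_steps": "120"},
        "model": {"global_batch_size": "256", "pipeline_model_parallel_size": "2"},
    }
}

flat = flatten_cmd_args(new_cmd_args)
extra = extra_args_to_dict("--stepfactor 2")
```

Here flat holds the dotted keys ("training", "training.trainer.max_steps", and so on) and extra holds the flag-to-value mapping, which is the shape a dict[str, str] field would accept.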
Full Changelog: v0.9.beta1...v0.9.beta2
v0.9.beta1
What's Changed
- Feature group all by @ybenvidia in #173
- Fix Bugs in K8s System Class Introduced During the Adoption of Pydantic by @TaekyungHeo in #196
- Exclude abstract methods from coverage by @amaslenn in #201
- Always return an absolute path to cached docker image by @amaslenn in #202
Full Changelog: v0.9.dev1...v0.9.beta1
v0.9.dev1
Highlights
We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for System configs.
Added a new command for verifying the configs: cloudai --mode verify-systems. The --system-config argument can be a file, or a directory to verify all configs in that directory.
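The file-or-directory behavior can be pictured with a short Python sketch. This is an illustration only, not CloudAI's implementation: the function name collect_system_configs is hypothetical, and it assumes that only *.toml files directly inside the directory are picked up.

```python
from pathlib import Path
import tempfile


def collect_system_configs(path: Path) -> list[Path]:
    """Return the TOML files to verify for a file-or-directory argument."""
    if path.is_file():
        return [path]
    if path.is_dir():
        # Assumption: only *.toml files directly inside the directory count.
        return sorted(path.glob("*.toml"))
    raise FileNotFoundError(f"No such file or directory: {path}")


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.toml").write_text('name = "a"\n')
    (root / "b.toml").write_text('name = "b"\n')
    (root / "notes.txt").write_text("not a config\n")

    from_dir = [p.name for p in collect_system_configs(root)]
    from_file = [p.name for p in collect_system_configs(root / "a.toml")]
```

Each returned path would then be parsed and validated on its own, so one broken config does not hide problems in the others.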
Slurm system config format was updated to take advantage of TOML features:
[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]
[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]
is now
[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]
[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]
The same applies to groups inside partitions.
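For existing configs, the change from named sub-tables to an array of tables amounts to turning a mapping into a list. Below is a minimal Python sketch of that conversion on already-parsed data. The helper tables_to_array is hypothetical, and the sketch assumes that groups used the same table-of-tables layout under a "groups" key.

```python
def tables_to_array(section: dict, nested_keys: tuple[str, ...] = ("groups",)) -> list[dict]:
    """Convert {"partition_1": {...}, ...} into [{...}, ...], preserving order.

    Assumption: nested sections named in nested_keys (here "groups") used
    the same table-of-tables layout and are converted the same way.
    """
    converted = []
    for entry in section.values():
        item = dict(entry)  # shallow copy so the input is left untouched
        for key in nested_keys:
            if isinstance(item.get(key), dict):
                item[key] = tables_to_array(item[key], nested_keys)
        converted.append(item)
    return converted


# Parsed form of the old [partitions.partition_N] layout.
old_config = {
    "partitions": {
        "partition_1": {"name": "partition_1", "nodes": ["node-[001-100]"]},
        "partition_2": {"name": "partition_2", "nodes": ["node-[101-200]"]},
    }
}

# Parsed form of the new [[partitions]] layout.
new_config = {"partitions": tables_to_array(old_config["partitions"])}
```

Because each entry already carries its own name field, dropping the table key loses no information.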
System parser objects were removed; this functionality is now handled by Pydantic.
What's Changed
- Add Pre-test for long LLMs run on Coreweave by @srivatsankrishnan in #170
- Redefine Job, Runner, and System Responsibilities; Refactor Classes by @TaekyungHeo in #176
- Move Common Variables to System class and TestTemplateStrategy by @TaekyungHeo in #179
- Refactor gen_exec_command flow by @amaslenn in #144
- Add Kubernetes Support by @TaekyungHeo in #180
- Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs by @srivatsankrishnan in #171
- Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies by @TaekyungHeo in #186
- Make GradingStrategy independent from TestTemplateStrategy by @amaslenn in #187
- add test for the reservation feature by @jeffnvidia in #189
- Add missing k8s package into package dependencies. by @amaslenn in #190
- Fix test directory name typo in README.md by @srinivas212 in #191
- Fix bug in generating NCCL test reports on K8s by @TaekyungHeo in #188
- Log Runner Initialization with Run Mode Indication in BaseRunner by @TaekyungHeo in #192
- Use Host Network for Kubernetes NCCL Tests by @TaekyungHeo in #193
- Use Pydantic to verify System schemas by @amaslenn in #158
Full Changelog: v0.9.dev0...v0.9.dev1
v0.9.dev0
What's Changed
- Refactor to Use pathlib.Path for Path-Related Variables by @TaekyungHeo in #183
Full Changelog: v0.8.1...v0.9.dev0
v0.8.1
v0.8.0
CloudAI v0.8 release notes
Compatibility
CloudAI v0.8 has been tested with: PyTorch/JAX NGC Container 24.05, NCCL 2.19/2.21, and SPC-X 1.1.
Key Features and Enhancements:
- Applied the registry pattern to enhance the flexibility and scalability of CloudAI.
- Added an extensive unit and integration testing framework using PyTest.
- Enhanced error messages and user guide to improve user experience, helping users troubleshoot issues swiftly.
- Enhanced the installation feature, focusing on Slurm systems.
What’s next
- Improve schema for easier validation
- Support automated grading mechanism
- Support K8S scheduler
- Support preflight and post-flight tests
v0.8.rc1
v0.8.rc0
What's Changed
- update container version by @jeffnvidia in #122
- Update copyright format by @amaslenn in #155
- Enhance job status error message for clarity and user guidance by @TaekyungHeo in #156
- Upgrade pyright, fix warning by @amaslenn in #157
- Allow extra srun args set in system config by @amaslenn in #160
- Automatically define version from the git tags by @amaslenn in #162
- add support for NeMo Launcher in the reservation by @jeffnvidia in #161
- Add job_status_check flag to disable checks for specific tests by @TaekyungHeo in #146
- Defer docker image URL accessibility check to srun when not caching locally by @TaekyungHeo in #164
- Check TOMLs formatting with taplo by @amaslenn in #163
- Hierarchical Test template for support Grok/GPT via PAXML by @srivatsankrishnan in #141
Full Changelog: v0.7.14...v0.8.rc0