27 Sep 21:59

TaekyungHeo

b305b86

v0.9.beta4 Pre-release

What's Changed

Fix Bug in DockerImageCacheResult to Correctly Retrieve Absolute Paths by @TaekyungHeo in #212
Add Support for Cluster Account and gpus_per_node in Command Generation by @TaekyungHeo in #210

Full Changelog: v0.9.beta3...v0.9.beta4

Contributors

TaekyungHeo

27 Sep 14:21

TaekyungHeo

v0.9.beta3

85d4276

v0.9.beta3 Pre-release

What's Changed

Extend CI for high level use cases by @amaslenn in #204
Replace 'training.values' Key with 'training' in final_cmd_args by @TaekyungHeo in #209
Use System Install Path Instead of Local Member Variables by @TaekyungHeo in #206
Remove Unused env_vars From Initialization Code by @TaekyungHeo in #207
Remove Config File Handling and TOML Dependency from SlurmInstaller by @TaekyungHeo in #211

Full Changelog: v0.9.beta2...v0.9.beta3

Contributors

amaslenn and TaekyungHeo

25 Sep 16:36

amaslenn

v0.9.beta2

953f04b

v0.9.beta2 Pre-release

Release notes

We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. This release enables verification for Test configs, continuing the work started in #158.

Test Template TOML files were replaced with Pydantic models. This enforces mandatory arguments and their types, and requires less code to maintain.
The --test-templates-dir option was removed from all commands. All supported tests are registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.

Test TOML files now take advantage of the standard TOML format for all known arguments.
Before:

[cmd_args]
"training" = "llama/llama2_70b"
"training.trainer.max_steps" = "120"
"training.model.global_batch_size" = "256"
"training.model.pipeline_model_parallel_size" = "1"

Now:

[cmd_args]
  [cmd_args.training]
  values = "llama/llama2_70b"
    [cmd_args.training.trainer]
    max_steps = "120"
    [cmd_args.training.model]
    global_batch_size = "256"
    pipeline_model_parallel_size = "1"
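
The migration from dotted keys to nested tables shown above can be sketched in Python. This is only an illustration: nest_cmd_args is a hypothetical helper and not part of CloudAI, and it assumes that a key which is both a value and a prefix of other keys (such as "training") keeps its value under a "values" entry, as in the example.

```python
from typing import Any


def nest_cmd_args(flat: dict[str, str]) -> dict[str, Any]:
    """Convert dotted keys such as "a.b.c" into nested dictionaries.

    Hypothetical helper for illustration; not a CloudAI function.
    """
    nested: dict[str, Any] = {}
    # Handle keys with fewer dots first so that a parent value is stored
    # before any of its children are processed.
    for key in sorted(flat, key=lambda k: k.count(".")):
        node = nested
        parts = key.split(".")
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # The parent was stored as a plain value; move it to "values".
                child = {"values": child}
                node[part] = child
            node = child
        node[parts[-1]] = flat[key]
    return nested


old_cmd_args = {
    "training": "llama/llama2_70b",
    "training.trainer.max_steps": "120",
    "training.model.global_batch_size": "256",
    "training.model.pipeline_model_parallel_size": "1",
}
new_cmd_args = nest_cmd_args(old_cmd_args)
```

Here new_cmd_args["training"]["values"] holds the model name, while the trainer and model settings become nested tables.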

extra_cmd_args was converted from str to dict[str, str]:
Before:

extra_cmd_args = "--stepfactor 2"

Now:

[extra_cmd_args]
"--stepfactor" = "2"
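
Existing string values can be converted to the new mapping mechanically. The sketch below is an assumption-laden illustration rather than CloudAI code: parse_extra_cmd_args is a hypothetical name, and mapping a flag without a value to an empty string is a choice made here, not documented behavior.

```python
import shlex


def parse_extra_cmd_args(raw: str) -> dict[str, str]:
    """Turn "--flag value --other value" into {"--flag": "value", ...}.

    Hypothetical helper for illustration. A flag followed by another flag
    (or by nothing) maps to an empty string. Values that start with "-",
    such as negative numbers, are treated as flags by this simple rule.
    """
    tokens = shlex.split(raw)
    result: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        if not flag.startswith("-"):
            raise ValueError(f"expected a flag, got {flag!r}")
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        result[flag] = tokens[i + 1] if has_value else ""
        i += 2 if has_value else 1
    return result


converted = parse_extra_cmd_args("--stepfactor 2")
```

The resulting dictionary can then be written under [extra_cmd_args] with one key per flag.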

Added a new mode to verify that Test TOMLs are valid: cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml

Full Changelog: v0.9.beta1...v0.9.beta2

24 Sep 15:13

amaslenn

v0.9.beta1

c3542c7

v0.9.beta1 Pre-release

What's Changed

Feature group all by @ybenvidia in #173
Fix Bugs in K8s System Class Introduced During the Adoption of Pydantic by @TaekyungHeo in #196
Exclude abstract methods from coverage by @amaslenn in #201
Always return an absolute path to cached docker image by @amaslenn in #202

Full Changelog: v0.9.dev1...v0.9.beta1

Contributors

amaslenn, TaekyungHeo, and ybenvidia

16 Sep 08:19

amaslenn

v0.9.dev1

11c5592

v0.9.dev1 Pre-release

Highlights

Added a new command for verifying system configs: cloudai --mode verify-systems. --system-config can be a single file, or a directory to verify all configs in that directory.
The Slurm system config format was updated to take advantage of TOML features:

[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]

[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]

is now

[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]

[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]

The same applies to groups inside partitions.
System parser objects were removed; this functionality is now handled by Pydantic.

What's Changed

Add Pre-test for long LLMs run on Coreweave by @srivatsankrishnan in #170
Redefine Job, Runner, and System Responsibilities; Refactor Classes by @TaekyungHeo in #176
Move Common Variables to System class and TestTemplateStrategy by @TaekyungHeo in #179
Refactor gen_exec_command flow by @amaslenn in #144
Add Kubernetes Support by @TaekyungHeo in #180
Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs by @srivatsankrishnan in #171
Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies by @TaekyungHeo in #186
Make GradingStrategy independant from TestTemplateStrategy by @amaslenn in #187
add test for the reservation feature by @jeffnvidia in #189
Add missing k8s package into package dependencies. by @amaslenn in #190
Fix test directory name typo in README.md by @srinivas212 in #191
Fix bug in generating NCCL test reports on K8s by @TaekyungHeo in #188
Log Runner Initialization with Run Mode Indication in BaseRunner by @TaekyungHeo in #192
Use Host Network for Kubernetes NCCL Tests by @TaekyungHeo in #193
Use Pydantic to verify System schemas by @amaslenn in #158

Full Changelog: v0.9.dev0...v0.9.dev1

Contributors

amaslenn, TaekyungHeo, and 3 other contributors

28 Aug 18:37

TaekyungHeo

v0.9.dev0

e8a959a

v0.9.dev0 Pre-release

What's Changed

Refactor to Use pathlib.Path for Path-Related Variables by @TaekyungHeo in #183

Full Changelog: v0.8.1...v0.9.dev0

Contributors

TaekyungHeo

27 Aug 06:12

srinivas212

v0.8.1

35d1489

v0.8.1 Latest

Minor enhancements to the v0.8.0 release. Improves NCCL test HTML report generation and Slurm reservation features.

19 Aug 15:53

srinivas212

v0.8.0

b13bafe

v0.8.0

CloudAI v0.8 release notes

Compatibility

CloudAI v0.8 has been tested with: PyTorch/JAX NGC Container 24.05, NCCL 2.19/2.21, and SPC-X 1.1.

Key Features and Enhancements:

Applied the registry pattern to enhance the flexibility and scalability of CloudAI.
Added an extensive unit and integration testing framework using pytest.
Enhanced error messages and user guide to improve user experience, helping users troubleshoot issues swiftly.
Enhanced the installation feature, focusing on Slurm systems.
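
For readers unfamiliar with the registry pattern mentioned in the list above, here is a minimal generic sketch. All names (PluginRegistry, register, create, SlurmRunner, StandaloneRunner) are hypothetical and chosen for illustration; this is not CloudAI's Registry class or its API.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


class PluginRegistry:
    """Maps string names to classes so new types can be added without editing callers."""

    def __init__(self) -> None:
        self._items: dict[str, type] = {}

    def register(self, name: str) -> Callable[[type[T]], type[T]]:
        """Return a class decorator that stores the class under the given name."""
        def decorator(cls: type[T]) -> type[T]:
            if name in self._items:
                raise ValueError(f"{name!r} is already registered")
            self._items[name] = cls
            return cls
        return decorator

    def create(self, name: str, *args: Any, **kwargs: Any) -> Any:
        """Instantiate the class registered under the given name."""
        if name not in self._items:
            raise KeyError(f"unknown name {name!r}; known: {sorted(self._items)}")
        return self._items[name](*args, **kwargs)


runners = PluginRegistry()


@runners.register("slurm")
class SlurmRunner:
    """Placeholder class standing in for a scheduler-specific implementation."""


@runners.register("standalone")
class StandaloneRunner:
    """Placeholder class standing in for a local implementation."""


runner = runners.create("slurm")
```

Callers look implementations up by name, so supporting a new system type only requires registering one more class.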

What’s next

Improve schema for easier validation
Support automated grading mechanism
Support K8S scheduler
Support preflight and post-flight tests

15 Aug 15:30

TaekyungHeo

v0.8.rc1

b13bafe

v0.8.rc1 Pre-release

What's Changed

Reorg conf directory by @srinivas212 in #174

Full Changelog: v0.8.rc0...v0.8.rc1

Contributors

srinivas212

02 Aug 19:41

TaekyungHeo

v0.8.rc0

570181a

v0.8.rc0 Pre-release

What's Changed

update container version by @jeffnvidia in #122
Update copyright format by @amaslenn in #155
Enhance job status error message for clarity and user guidance by @TaekyungHeo in #156
Upgrade pyright, fix warning by @amaslenn in #157
Allow extra srun args set in system config by @amaslenn in #160
Automatically define version from the git tags by @amaslenn in #162
add support for NeMo Launcher in the reservation by @jeffnvidia in #161
Add job_status_check flag to disable checks for specific tests by @TaekyungHeo in #146
Defer docker image URL accessibility check to srun when not caching locally by @TaekyungHeo in #164
Check TOMLs formatting with taplo by @amaslenn in #163
Hierarchical Test template for support Grok/GPT via PAXML by @srivatsankrishnan in #141

Full Changelog: v0.7.14...v0.8.rc0

Contributors

amaslenn, TaekyungHeo, and 2 other contributors

Releases: NVIDIA/cloudai
