Releases: NVIDIA/cloudai

v0.9.beta4

27 Sep 21:59
b305b86
Pre-release

What's Changed

  • Fix Bug in DockerImageCacheResult to Correctly Retrieve Absolute Paths by @TaekyungHeo in #212
  • Add Support for Cluster Account and gpus_per_node in Command Generation by @TaekyungHeo in #210

Full Changelog: v0.9.beta3...v0.9.beta4

v0.9.beta3

27 Sep 14:21
85d4276
Pre-release

What's Changed

  • Extend CI for high level use cases by @amaslenn in #204
  • Replace 'training.values' Key with 'training' in final_cmd_args by @TaekyungHeo in #209
  • Use System Install Path Instead of Local Member Variables by @TaekyungHeo in #206
  • Remove Unused env_vars From Initialization Code by @TaekyungHeo in #207
  • Remove Config File Handling and TOML Dependency from SlurmInstaller by @TaekyungHeo in #211

Full Changelog: v0.9.beta2...v0.9.beta3

v0.9.beta2

25 Sep 16:36
953f04b
Pre-release

Release notes

We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test configs. This is a continuation of #158.

  1. Test Template TOML files were replaced with Pydantic models. This enforces mandatory arguments and their types, and requires less code to maintain.
  2. The --test-templates-dir option was removed from all commands. All supported tests are now registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.
  3. Test TOML files now take advantage of the standard TOML format for all known arguments.
    Before:
    [cmd_args]
    "training" = "llama/llama2_70b"
    "training.trainer.max_steps" = "120"
    "training.model.global_batch_size" = "256"
    "training.model.pipeline_model_parallel_size" = "1"
    Now:
    [cmd_args]
      [cmd_args.training]
      values = "llama/llama2_70b"
        [cmd_args.training.trainer]
        max_steps = "120"
        [cmd_args.training.model]
        global_batch_size = "256"
        pipeline_model_parallel_size = "1"
  4. extra_cmd_args converted from str to dict[str, str]:
    Before:
    extra_cmd_args = "--stepfactor 2"
    Now:
    [extra_cmd_args]
    "--stepfactor" = "2"
  5. Added a new mode to verify that Test TOMLs are valid: cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml
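
The two format changes in items 3 and 4 can be illustrated with a small migration sketch. This is not part of CloudAI: the helper names nest_cmd_args and split_extra_cmd_args are hypothetical, and mapping a value-less flag to an empty string is an assumption made only for this example.

```python
import shlex
from typing import Any


def nest_cmd_args(flat: dict[str, str]) -> dict[str, Any]:
    """Turn dotted keys such as "training.trainer.max_steps" into nested tables."""
    nested: dict[str, Any] = {}
    # Handle keys with fewer dots first, so a bare parent key ("training")
    # is stored before any of its children are processed.
    for dotted_key in sorted(flat, key=lambda key: key.count(".")):
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # The parent already held a plain value; keep it under "values",
                # mirroring the "Now" example above.
                child = {"values": child}
                node[part] = child
            node = child
        node[leaf] = flat[dotted_key]
    return nested


def split_extra_cmd_args(extra: str) -> dict[str, str]:
    """Split an old-style string such as "--stepfactor 2" into a flag-to-value dict."""
    tokens = shlex.split(extra)
    args: dict[str, str] = {}
    index = 0
    while index < len(tokens):
        flag = tokens[index]
        has_value = index + 1 < len(tokens) and not tokens[index + 1].startswith("-")
        # Assumption: a flag without a value maps to an empty string.
        args[flag] = tokens[index + 1] if has_value else ""
        index += 2 if has_value else 1
    return args


old_cmd_args = {
    "training": "llama/llama2_70b",
    "training.trainer.max_steps": "120",
    "training.model.global_batch_size": "256",
    "training.model.pipeline_model_parallel_size": "1",
}
new_cmd_args = nest_cmd_args(old_cmd_args)
new_extra_cmd_args = split_extra_cmd_args("--stepfactor 2")
```

Here new_cmd_args holds a single "training" table with "values", "trainer", and "model" entries, and new_extra_cmd_args holds the "--stepfactor" flag mapped to "2".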

Full Changelog: v0.9.beta1...v0.9.beta2

v0.9.beta1

24 Sep 15:13
c3542c7
Pre-release

What's Changed

Full Changelog: v0.9.dev1...v0.9.beta1

v0.9.dev1

16 Sep 08:19
11c5592
Pre-release

Highlights

We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for System configs.

Added a new command for verifying system configs: cloudai --mode verify-systems. The --system-config argument can be a single file, or a directory, in which case all configs in that directory are verified.
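
As a rough illustration of the file-or-directory behavior, the sketch below resolves a path into the list of TOML files to check. It is not CloudAI's implementation: the function name collect_system_configs is hypothetical, and matching only top-level *.toml files is an assumption.

```python
from pathlib import Path


def collect_system_configs(path: Path) -> list[Path]:
    """Return the config files to verify for a file or directory argument."""
    if path.is_dir():
        # Assumption: only top-level *.toml files in the directory are checked.
        return sorted(path.glob("*.toml"))
    if path.is_file():
        return [path]
    raise FileNotFoundError(f"No such file or directory: {path}")
```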
The Slurm system config format was updated to take advantage of TOML features. The previous format:

[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]

[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]

is now

[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]

[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]

The same applies to groups inside partitions.
System parser objects were removed; this functionality is now handled by Pydantic.
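
For existing configs, the partition change amounts to turning a table of named sub-tables into an array of tables. The sketch below works on the dictionaries a TOML parser would produce for each format. The helper name tables_to_array is hypothetical, and the "groups" key name and its nesting under each partition are assumptions based on the note above.

```python
from typing import Any


def tables_to_array(named_tables: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert {"partition_1": {...}, ...} into [{...}, ...], keeping order."""
    converted: list[dict[str, Any]] = []
    for table in named_tables.values():
        entry = dict(table)  # copy so the input is left untouched
        groups = entry.get("groups")
        if isinstance(groups, dict):
            # Assumption: groups used the same named-table layout as partitions.
            entry["groups"] = tables_to_array(groups)
        converted.append(entry)
    return converted


# What a parser yields for the old [partitions.partition_N] layout.
old_partitions = {
    "partition_1": {"name": "partition_1", "nodes": ["node-[001-100]"]},
    "partition_2": {"name": "partition_2", "nodes": ["node-[101-200]"]},
}
# Matches what a parser yields for the new [[partitions]] layout.
new_partitions = tables_to_array(old_partitions)
```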

What's Changed

Full Changelog: v0.9.dev0...v0.9.dev1

v0.9.dev0

28 Aug 18:37
e8a959a
Pre-release

What's Changed

  • Refactor to Use pathlib.Path for Path-Related Variables by @TaekyungHeo in #183

Full Changelog: v0.8.1...v0.9.dev0

v0.8.1

27 Aug 06:12
35d1489

Minor enhancements to the v0.8.0 release. Improves NCCL test HTML generation and Slurm reservation features.

v0.8.0

19 Aug 15:53
b13bafe

CloudAI v0.8 release notes

Compatibility

CloudAI v0.8 has been tested with: PyTorch/JAX NGC Container 24.05, NCCL 2.19/2.21, and SPC-X 1.1.

Key Features and Enhancements:

  • Applied the registry pattern to enhance the flexibility and scalability of CloudAI.
  • Added an extensive unit and integration testing framework using pytest.
  • Enhanced error messages and user guide to improve user experience, helping users troubleshoot issues swiftly.
  • Enhanced the installation feature, focusing on Slurm systems.
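
For readers unfamiliar with the registry pattern mentioned in the first bullet, here is a generic sketch of the idea: a single shared object maps names to implementations, so new entries can be added without changing the code that looks them up. The class SketchRegistry and its add and get methods are hypothetical and are not CloudAI's Registry API.

```python
from typing import Callable


class SketchRegistry:
    """A minimal singleton registry mapping names to handler callables."""

    _instance = None

    def __new__(cls) -> "SketchRegistry":
        # Every SketchRegistry() call returns the same shared instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.handlers = {}
        return cls._instance

    def add(self, name: str, handler: Callable[[str], str]) -> None:
        if name in self.handlers:
            raise ValueError(f"Duplicate registration: {name}")
        self.handlers[name] = handler

    def get(self, name: str) -> Callable[[str], str]:
        if name not in self.handlers:
            raise KeyError(f"Nothing registered under: {name}")
        return self.handlers[name]


# Registration and lookup can happen in unrelated modules.
SketchRegistry().add("nccl_test", lambda system: f"nccl_test on {system}")
description = SketchRegistry().get("nccl_test")("slurm")
```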

What’s next

  • Improve schema for easier validation
  • Support automated grading mechanism
  • Support K8S scheduler
  • Support preflight and post-flight tests

v0.8.rc1

15 Aug 15:30
b13bafe
Pre-release

What's Changed

Full Changelog: v0.8.rc0...v0.8.rc1

v0.8.rc0

02 Aug 19:41
570181a
Pre-release

What's Changed

Full Changelog: v0.7.14...v0.8.rc0