v0.9.dev1
Pre-release
Highlights
We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling this for System configs.
Added a new command for verifying the configs: cloudai --mode verify-systems. The --system-config argument can be a file or a directory; when it is a directory, all configs in it are verified.
Slurm system config format was updated to take advantage of TOML features:
[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]
[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]
is now
[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]
[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]
The same applies to groups inside partitions.
System parser objects were removed; this functionality is now handled by Pydantic.
What's Changed
- Add Pre-test for long LLMs run on Coreweave by @srivatsankrishnan in #170
- Redefine Job, Runner, and System Responsibilities; Refactor Classes by @TaekyungHeo in #176
- Move Common Variables to System class and TestTemplateStrategy by @TaekyungHeo in #179
- Refactor gen_exec_command flow by @amaslenn in #144
- Add Kubernetes Support by @TaekyungHeo in #180
- Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs by @srivatsankrishnan in #171
- Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies by @TaekyungHeo in #186
- Make GradingStrategy independent from TestTemplateStrategy by @amaslenn in #187
- add test for the reservation feature by @jeffnvidia in #189
- Add missing k8s package into package dependencies. by @amaslenn in #190
- Fix test directory name typo in README.md by @srinivas212 in #191
- Fix bug in generating NCCL test reports on K8s by @TaekyungHeo in #188
- Log Runner Initialization with Run Mode Indication in BaseRunner by @TaekyungHeo in #192
- Use Host Network for Kubernetes NCCL Tests by @TaekyungHeo in #193
- Use Pydantic to verify System schemas by @amaslenn in #158
Full Changelog: v0.9.dev0...v0.9.dev1