Releases: NVIDIA/cloudai

v0.7.4

07 Jun 22:46
63a4655
v0.7.4 Pre-release

What's Changed

  • Adding unit test for checking the copyright headers and removing the check_copyright_headers in ci_tools by @srivatsankrishnan in #87
  • Update docstring to comply with coding style and max column length by @TaekyungHeo in #89
  • Fix parsing logic and add unit tests for NCCL test report generation by @TaekyungHeo in #90
  • Bump version to v0.7.4 by @TaekyungHeo in #91

Full Changelog: v0.7.3...v0.7.4

v0.7.3

06 Jun 18:15
89f0445
v0.7.3 Pre-release

What's Changed

  • Update docstring to comply with coding style and max column length by @TaekyungHeo in #61
  • Llama by @jeffnvidia in #23
  • Support empty queues in sinfo by @TaekyungHeo in #64
  • Update docstring to comply with coding style and max column length by @TaekyungHeo in #62
  • Enhance installation process with detailed error handling and Docker image caching by @TaekyungHeo in #57
  • Pass install_path to SlurmInstallStrategy by @TaekyungHeo in #65
  • Enhance error reporting and formatting in installation by @TaekyungHeo in #66
  • Replace 'Cloud AI' with 'CloudAI' by @TaekyungHeo in #70
  • Ensure proper conversion of 'cache_docker_images_locally' to boolean by @TaekyungHeo in #67
  • Ensure docker image URL has valid scheme in _check_docker_image_accessibility by @TaekyungHeo in #68
  • Updated BaseInstaller to correctly handle DockerImageCacheResult by @TaekyungHeo in #71
  • Handle missing subdirectories and pass partition name dynamically in DockerImageCacheManager by @TaekyungHeo in #72
  • Handle boolean and string values for cache_docker_images_locally by @TaekyungHeo in #76
  • Use DockerImageCacheManager in CommandGenStrategy by @TaekyungHeo in #69
  • Update _clone_repository to handle existing repository path by @TaekyungHeo in #75
  • Remove redundant installation check from run mode by @TaekyungHeo in #77
  • Fix bug in _check_docker_image_accessibility by @TaekyungHeo in #78
  • Update CI tool to include tests directory and add copyright notices by @TaekyungHeo in #74
  • Add credentials for NGC to README by @lappazos in #81
  • Update NCCL test_template container by @lappazos in #39
  • Update UCC test_template container by @lappazos in #37
  • Use logging.config and do not pass CLI args as a parameter by @amaslenn in #73
  • Bump version to v0.7.2 by @TaekyungHeo in #84
  • Fix bug in check_docker_image_exists to correctly prioritize cached files by @TaekyungHeo in #85
  • Bump version to v0.7.3 by @TaekyungHeo in #86
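
Several entries above (#67, #76) concern reading cache_docker_images_locally as either a boolean or a string. The pitfall is that `bool("False")` is `True` in Python, so a string value needs explicit normalization. The following is a minimal sketch of that idea only; the helper name `to_bool` and the accepted spellings are assumptions for illustration, not CloudAI's actual code.

```python
from typing import Union


def to_bool(value: Union[bool, str]) -> bool:
    """Coerce a config value that may be a bool or a string into a bool.

    Hypothetical helper: the accepted spellings below are an assumption.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in {"true", "1", "yes"}:
            return True
        if normalized in {"false", "0", "no"}:
            return False
    # Fail loudly instead of silently treating unknown input as truthy.
    raise ValueError(f"Cannot interpret {value!r} as a boolean")
```

Raising on unrecognized input, rather than falling back to `bool(value)`, keeps a typo in a config file from silently enabling caching.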

Full Changelog: v0.7.1...v0.7.3

v0.7.1

03 Jun 16:22
v0.7.1 Pre-release

Highlights

  1. Added NcclTestJobStatusRetrievalStrategy and JaxToolboxJobStatusRetrievalStrategy objects.
  2. Moved num_nodes from Test Template to Test Scenario.
  3. Bugfixes and refactoring.

What's Changed

  • Update docstring to comply with coding style and max column length by @TaekyungHeo in #45
  • Refactor test scenario directory structure to remove unnecessary indirection by @TaekyungHeo in #44
  • Bump package version to v0.7 by @amaslenn in #47
  • Add tests for _write_sbatch_script() by @amaslenn in #48
  • fix updating nodes states by @jeffnvidia in #31
  • Fix bug in parse_node_list to handle multiple node ranges correctly by @TaekyungHeo in #49
  • Update docstring to comply with coding style and max column length by @TaekyungHeo in #51
  • Lint imports and high level design doc by @amaslenn in #50
  • Remove Optional for correct type annotation in test.py by @TaekyungHeo in #52
  • Move num_nodes from test template to test scenario by @TaekyungHeo in #42
  • Fix ci_tools/check_copyright_headers.py and add missing copyright headers by @TaekyungHeo in #55
  • Identify job completion status after each test by @TaekyungHeo in #46
  • Implement NcclTestJobStatusRetrievalStrategy and add corresponding tests by @TaekyungHeo in #53
  • Implement JaxToolboxJobStatusRetrievalStrategy and add corresponding tests by @TaekyungHeo in #54
  • Cast num_nodes to int in TestScenarioParser to ensure correct data type by @TaekyungHeo in #58
  • Bump version to v0.7.1 by @amaslenn in #59
  • use official online container by @jeffnvidia in #21
  • Check number of created directories in acceptance test by @x41lakazam in #60
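
For context on #49, Slurm reports nodes in a compressed form such as `node[01-03,07],gpu5`, where one string can hold several bracketed ranges. The sketch below shows one way such a string could be expanded. It reuses the name parse_node_list from the entry above, but the body is an assumption written for illustration and is not taken from the CloudAI source.

```python
import re
from typing import List


def parse_node_list(node_list: str) -> List[str]:
    """Expand a Slurm-style node list into individual node names.

    Illustrative sketch only; it handles a single bracket group per host prefix.
    """
    nodes: List[str] = []
    # Each group is a prefix optionally followed by one [...] block,
    # so commas inside brackets do not split the group.
    for group in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", node_list):
        group = group.strip()
        if not group:
            continue
        match = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", group)
        if match is None:
            nodes.append(group)  # plain host name without ranges
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            if "-" in part:
                start, end = part.split("-")
                width = len(start)  # preserve zero padding, e.g. 01..03
                nodes.extend(
                    f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)
                )
            else:
                nodes.append(f"{prefix}{part}")
    return nodes


print(parse_node_list("node[01-03,07],gpu5"))
```

With the input shown, the call expands two ranges under one prefix plus a plain host name, which is the multi-range case the fix refers to.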

Full Changelog: v0.7.0...v0.7.1

v0.7.0

29 May 06:14
503e2d4
v0.7.0 Pre-release

Highlights

  1. Import changes: we are restructuring project internals, so some public names are now accessible only via an absolute import from a specific module.
  2. Expanded tests.
  3. Migrated to Registry object for all specific implementations.
  4. Internal code reorganization and bug fixes.

Full Changelog: v0.6...v0.7.0

v0.6

09 May 00:03

Compatibility

CloudAI v0.6 has been tested with: PyTorch NGC Container 24.02, CUDA 12.4, NCCL 2.19, and SPC-X 1.0.1.

Key Features and Enhancements:

  • Designed and implemented an extensible software architecture with support for defining test templates, test scenarios, and system schemas.
  • Defined test templates and test scenarios for NeMo Megatron, JAX Toolbox/PAXML, NCCL tests, UCC tests, and Chakra replay.
  • Added support for launching jobs and checking their status through Slurm or directly (for testing purposes).
  • Added the ability to install, uninstall, dry-run, and execute test scenarios, and to generate reports.

What’s next

  • Use CloudAI for benchmarking upcoming systems.
  • Improve the user experience through better engineering (e.g. handling job scheduling failures) and add new features (e.g. K8S support).