Releases: NVIDIA/cloudai
v0.7.4
What's Changed
- Adding unit test for checking the copyright headers and removing the check_copyright_headers in ci_tools by @srivatsankrishnan in #87
- Update docstring to comply with coding style and max column length by @TaekyungHeo in #89
- Fix parsing logic and add unit tests for NCCL test report generation by @TaekyungHeo in #90
- Bump version to v0.7.4 by @TaekyungHeo in #91
New Contributors
- @srivatsankrishnan made their first contribution in #87
Full Changelog: v0.7.3...v0.7.4
v0.7.3
What's Changed
- Update docstring to comply with coding style and max column length by @TaekyungHeo in #61
- Llama by @jeffnvidia in #23
- Support empty queues in sinfo by @TaekyungHeo in #64
- Update docstring to comply with coding style and max column length by @TaekyungHeo in #62
- Enhance installation process with detailed error handling and Docker image caching by @TaekyungHeo in #57
- Pass install_path to SlurmInstallStrategy by @TaekyungHeo in #65
- Enhance error reporting and formatting in installation by @TaekyungHeo in #66
- Replace 'Cloud AI' with 'CloudAI' by @TaekyungHeo in #70
- Ensure proper conversion of 'cache_docker_images_locally' to boolean by @TaekyungHeo in #67
- Ensure docker image URL has valid scheme in _check_docker_image_accessibility by @TaekyungHeo in #68
- Updated BaseInstaller to correctly handle DockerImageCacheResult by @TaekyungHeo in #71
- Handle missing subdirectories and pass partition name dynamically in DockerImageCacheManager by @TaekyungHeo in #72
- Handle boolean and string values for cache_docker_images_locally by @TaekyungHeo in #76
- Use DockerImageCacheManager in CommandGenStrategy by @TaekyungHeo in #69
- Update _clone_repository to handle existing repository path by @TaekyungHeo in #75
- Remove redundant installation check from run mode by @TaekyungHeo in #77
- Fix bug in _check_docker_image_accessibility by @TaekyungHeo in #78
- Update CI tool to include tests directory and add copyright notices by @TaekyungHeo in #74
- Add credentials for NGC to README by @lappazos in #81
- Update NCCL test_template container by @lappazos in #39
- Update UCC test_template container by @lappazos in #37
- Use logging.config and do not pass CLI args as a parameter by @amaslenn in #73
- Bump version to v0.7.2 by @TaekyungHeo in #84
- Fix bug in check_docker_image_exists to correctly prioritize cached files by @TaekyungHeo in #85
- Bump version to v0.7.3 by @TaekyungHeo in #86
New Contributors
Full Changelog: v0.7.1...v0.7.3
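Several v0.7.3 entries concern accepting cache_docker_images_locally as either a boolean or a string. As a rough illustration of that kind of coercion, here is a minimal sketch. The helper name to_bool and the accepted spellings are assumptions made for this example, not CloudAI's actual code.

```python
def to_bool(value):
    """Coerce a bool, or a common string spelling of one, to a bool.

    Hypothetical helper for illustration; the accepted spellings are an
    assumption, not taken from CloudAI.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "1", "yes"}:
            return True
        if text in {"false", "0", "no"}:
            return False
    # Anything else is rejected rather than guessed at.
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


# Example: a config value may arrive as a real bool or as text.
config = {"cache_docker_images_locally": "True"}
cache_locally = to_bool(config["cache_docker_images_locally"])
```

Rejecting unknown values, instead of falling back on Python truthiness, avoids the classic pitfall where the non-empty string "False" evaluates as true.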
v0.7.1
Highlights
- Added NcclTestJobStatusRetrievalStrategy and JaxToolboxJobStatusRetrievalStrategy objects.
- Moved num_nodes from Test Template to Test Scenario.
- Bugfixes and refactoring.
What's Changed
- Update docstring to comply with coding style and max column length by @TaekyungHeo in #45
- Refactor test scenario directory structure to remove unnecessary indirection by @TaekyungHeo in #44
- Bump package version to v0.7 by @amaslenn in #47
- Add tests for _write_sbatch_script() by @amaslenn in #48
- fix updating nodes states by @jeffnvidia in #31
- Fix bug in parse_node_list to handle multiple node ranges correctly by @TaekyungHeo in #49
- Update docstring to comply with coding style and max column length by @TaekyungHeo in #51
- Lint imports and high level design doc by @amaslenn in #50
- Remove Optional for correct type annotation in test.py by @TaekyungHeo in #52
- Move num_nodes from test template to test scenario by @TaekyungHeo in #42
- Fix ci_tools/check_copyright_headers.py and add missing copyright headers by @TaekyungHeo in #55
- Identify job completion status after each test by @TaekyungHeo in #46
- Implement NcclTestJobStatusRetrievalStrategy and add corresponding tests by @TaekyungHeo in #53
- Implement JaxToolboxJobStatusRetrievalStrategy and add corresponding tests by @TaekyungHeo in #54
- Cast num_nodes to int in TestScenarioParser to ensure correct data type by @TaekyungHeo in #58
- Bump version to v0.7.1 by @amaslenn in #59
- use official online container by @jeffnvidia in #21
- Check number of created directories in acceptance test by @x41lakazam in #60
New Contributors
- @jeffnvidia made their first contribution in #31
- @x41lakazam made their first contribution in #60
Full Changelog: v0.7.0...v0.7.1
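One of the fixes in v0.7.1 concerns parse_node_list and node lists that contain multiple ranges. The sketch below shows one way such Slurm-style lists can be expanded. It is an independent illustration written for these notes, assuming bracketed ranges such as node[01-03,07], and is not the implementation from the repository.

```python
import re


def parse_node_list(node_list):
    """Expand a Slurm-style node list, e.g. 'node[01-03,07],login1'.

    Illustrative sketch only; the real CloudAI function may behave differently.
    """
    nodes = []
    # Each group is a name optionally followed by one bracketed range list,
    # so commas inside brackets do not split a group.
    for group in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", node_list):
        group = group.strip()
        match = re.fullmatch(r"(.*?)\[([^\]]*)\]", group)
        if match is None:
            if group:
                nodes.append(group)  # plain host name without a range
            continue
        prefix, body = match.groups()
        for part in body.split(","):
            if "-" in part:
                start, end = part.split("-")
                width = len(start)  # preserve zero padding such as '01'
                for index in range(int(start), int(end) + 1):
                    nodes.append(f"{prefix}{index:0{width}d}")
            else:
                nodes.append(f"{prefix}{part}")
    return nodes


expanded = parse_node_list("node[01-03,07],login1")
```

Here expanded holds node01, node02, node03, node07, and login1, in that order.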
v0.7.0
Highlights
- A number of changes in imports: we are restructuring project internals, so some public names are now accessible only via absolute import from a specific module.
- Expanded tests.
- Migrated to Registry object for all specific implementations.
- Internal code reorganization and bug fixes.
What's Changed
- Add CI by @amaslenn in #1
- Use src layout for packaging by @amaslenn in #2
- Add ISSUE_TEMPLATE and PULL_REQUEST_TEMPLATE by @TaekyungHeo in #3
- Add vulture into requirements-dev.txt by @amaslenn in #7
- Enhance JaxToolbox report generation and update stats collection by @TaekyungHeo in #4
- Allow overriding of runner mappings in Runner class by @TaekyungHeo in #9
- Remove NCCL Miner Test Template by @TaekyungHeo in #10
- Rename report file names to include 'cloudai' prefix by @TaekyungHeo in #12
- Rename default sbatch script to include project-specific prefix by @TaekyungHeo in #11
- Refactor SlurmSystem methods for bug fix and add unit tests by @TaekyungHeo in #14
- Acceptance tests by @amaslenn in #6
- Disable grading until further improvements to grading policy by @TaekyungHeo in #16
- Introduce a Registry class to hold implementation mapping by @amaslenn in #13
- Flat tests structure, fix warnings by @amaslenn in #17
- Update docker support for Nemo/Nccl testing by @amaslenn in #18
- Enable pydocstyle by @amaslenn in #26
- Remove requests from requirements.txt by @TaekyungHeo in #28
- Register runners by @amaslenn in #27
- Add CloudAI prefix to plots by @TaekyungHeo in #32
- Add test name in reports by @TaekyungHeo in #33
- Handle training.model.tokenizer.model CLI arg by @amaslenn in #30
- Register strategies via Registry by @amaslenn in #29
- Generate CSV reports in NCCL-tests by @TaekyungHeo in #34
- Use Registry for test templates by @amaslenn in #35
- Remove SystemObjectUpdater functionality by @amaslenn in #36
- Use Registry for Installers by @amaslenn in #38
- Remove docstring from gen_exec_command in SleepStandaloneCommandGenStrategy to comply with docstring conventions by @TaekyungHeo in #43
- Reorganize files in repo by @amaslenn in #40
- Enhance job submission error handling with custom exceptions by @TaekyungHeo in #41
New Contributors
- @amaslenn made their first contribution in #1
- @TaekyungHeo made their first contribution in #3
Full Changelog: v0.6...v0.7.0
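The v0.7.0 highlights mention moving specific implementations behind a Registry object. For readers unfamiliar with the pattern, here is a minimal sketch of a name-to-implementation registry. The method names register and get, the "runner" kind, and the SlurmRunner class are hypothetical and chosen only for this example; CloudAI's Registry API may look different.

```python
class Registry:
    """Minimal mapping from (kind, name) to an implementation class.

    Hypothetical illustration of the registry pattern, not CloudAI's class.
    """

    def __init__(self):
        self._items = {}

    def register(self, kind, name, implementation):
        key = (kind, name)
        if key in self._items:
            # Refuse silent overrides so a mapping cannot change unnoticed.
            raise ValueError(f"{kind} '{name}' is already registered")
        self._items[key] = implementation

    def get(self, kind, name):
        try:
            return self._items[(kind, name)]
        except KeyError:
            raise KeyError(f"No {kind} registered under '{name}'") from None


class SlurmRunner:
    """Placeholder implementation used only for this example."""


registry = Registry()
registry.register("runner", "slurm", SlurmRunner)
runner_cls = registry.get("runner", "slurm")
```

Keeping the lookup in one object means new runners, strategies, or installers can be added by registering them, without editing the code that consumes them.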
v0.6
Compatibility
CloudAI v0.6 has been tested with: PyTorch NGC Container 24.02, CUDA 12.4, NCCL 2.19, and SPC-X 1.0.1.
Key Features and Enhancements:
- Designed and implemented an extensible software architecture with support for defining test templates, test scenarios, and system schemas.
- Described test templates and test scenarios for NeMo Megatron, JAX Toolbox/PAXML, NCCL tests, UCC tests, and Chakra replay.
- Added support for launching jobs and checking their status, both through Slurm and directly (for testing purposes).
- Added the ability to install, uninstall, dry-run, and execute test scenarios, and to generate reports.
What’s next
- Use CloudAI for benchmarking upcoming systems.
- Engineering improvements focused on user experience (e.g. handling job scheduling failures) and new features (e.g. K8S support).