Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171
Conversation
This reverts commit dbc4841.
src/cloudai/schema/test_template/jax_toolbox/grok_slurm_command_gen_strategy.py
Looks like this PR depends on #170, so it is a bit hard to understand what is unique to this PR itself, especially in the context of testing. Please try to keep the coverage at least at the same level as it is now.
src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py
* Long-running jobs in CW were slowing down (by ~4X) due to serialization of the profiling stderr generation
* Created rank-specific profiling stderr generation during the profiling stage
* Modified the job_status_retrieval_strategy to parse rank-specific profiling stderr files
* Todo: Have not changed the unit tests yet, so CI/CD will still fail
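The per-rank change above can be sketched roughly as follows. This is a minimal illustration, not the actual CloudAI implementation: the file-naming scheme (`profile_stderr_{rank}.txt`) and the completion marker (`"[PAX STATUS]"`) are hypothetical placeholders for whatever the job status retrieval strategy actually looks for. The point is that each rank writes its own stderr file, so the status parser scans files independently instead of one serialized stream.

```python
from pathlib import Path
from typing import Dict


def rank_stderr_path(output_dir: Path, rank: int) -> Path:
    """Per-rank profiling stderr file (hypothetical naming scheme)."""
    return output_dir / f"profile_stderr_{rank}.txt"


def parse_rank_stderr(output_dir: Path, num_ranks: int) -> Dict[int, str]:
    """Check each rank's profiling stderr independently and report
    whether that rank finished the profiling stage."""
    status: Dict[int, str] = {}
    for rank in range(num_ranks):
        path = rank_stderr_path(output_dir, rank)
        if not path.exists():
            # Rank never produced output (e.g. node failure).
            status[rank] = "missing"
            continue
        text = path.read_text()
        # "[PAX STATUS]" stands in for the real completion marker.
        status[rank] = "done" if "[PAX STATUS]" in text else "incomplete"
    return status
```

Because each rank's file is parsed on its own, a slow or failed rank no longer blocks status retrieval for the others, which is what removes the ~4X serialization slowdown at large scale.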
This PR adds a lot of features required for running 1k-4k GPU jobs. Updated the PR summary to reflect this.
Summary
This is an umbrella PR for supporting large GPU runs in CW. Though it was originally created just to add Nemotron, a bunch of feature requests and changes were needed to establish resilient, best-known methods for scaling beyond 1k GPUs. As of today, we have successful runs up to 2K GPUs. The following features were added via this PR:
Test Plan
CI/CD should pass; the existing unit tests should cover the new architecture.
Test on internal systems to make sure GPT/Grok/Nemotron train correctly and meet existing performance targets.
More details will be updated later. We were able to successfully scale to 2K GPU runs for Grok-1 via Paxml.