Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

Merged on Sep 5, 2024 (33 commits)


@srivatsankrishnan srivatsankrishnan commented Aug 13, 2024

Summary

This is an umbrella PR for supporting large GPU runs in CW. Although it was originally created to add only Nemotron, a number of feature requests and changes for resilience and best-known methods were needed to scale beyond 1K GPUs. As of today, we have successful runs on up to 2K GPUs. The following features were added via this PR:

  • Nemotron models via PAXML.
  • XLA flags specific to perf/profile stages
  • Enable/Disable PGLE flow through a flag.
  • Start the container and launch the training through container name instead of sqsh file
  • JAX training does not need the MPI flag in srun. For the NCCL pre-test, the flag is still fine.
  • Performance fix for the profiling stage in CloudAI. CloudAI jobs were ~4x slower (measured on 1K-GPU jobs) than standalone bash scripts.
  • Nsys profiling only on rank 0 (it was slowing down jobs at 1K+ GPU scale)
  • Remove the Nsys trace-to-SQLite converter (not needed as per Haixin)
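
As a rough illustration of the rank-0-only profiling item above, the sketch below prefixes a training command with `nsys profile` only when the rank is 0. The function name `wrap_with_nsys`, the `output_prefix` parameter, and the example command are hypothetical and are not taken from CloudAI's code. In a Slurm job, the rank would typically be read from the `SLURM_PROCID` environment variable.

```python
import os


def wrap_with_nsys(train_cmd, rank, output_prefix="profile"):
    """Return the command to launch for one rank.

    Only rank 0 is wrapped with `nsys profile`; every other rank runs the
    training command unchanged, so profiling overhead is paid once.
    All names here are illustrative assumptions, not CloudAI's API.
    """
    if rank != 0:
        return list(train_cmd)
    # `nsys profile -o <name>` writes the report to the given output name.
    return ["nsys", "profile", "-o", f"{output_prefix}_rank{rank}", *train_cmd]


if __name__ == "__main__":
    # SLURM_PROCID is set by Slurm for each task; default to 0 outside Slurm.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    print(wrap_with_nsys(["python", "train.py"], rank))
```

Keeping the decision in one small function makes it easy to unit-test the command generation without launching a job.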

Test Plan

  • CI/CD should pass. The existing unit tests should support the new architecture.
    More details will be updated later.

  • Test on internal systems to make sure it trains GPT/Grok/Nemotron correctly while meeting existing performance targets.
    More details will be updated later. We were able to successfully scale to 2K-GPU runs for Grok-1 via PAXML.

$ python cloudaix.py --mode run --system-config conf/internal/jax_toolbox/system/israel_1.toml --test-templates-dir conf/internal/jax_toolbox/test_template/ --tests-dir conf/internal/jax_toolbox/test --test-scenario conf/internal/jax_toolbox/test_scenario/nemo_proxy/test_scenario_israel_1.toml
[INFO] System configuration file: conf/internal/jax_toolbox/system/israel_1.toml
[INFO] Test templates directory: conf/internal/jax_toolbox/test_template
[INFO] Tests directory: conf/internal/jax_toolbox/test
[INFO] Test scenario file: conf/internal/jax_toolbox/test_scenario/nemo_proxy/test_scenario_israel_1.toml
[INFO] Output directory: None
[INFO] System Name: Israel-1
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: jax_toolbox
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: jax_toolbox

Section Name: Tests.1
  Test Name: sanity_fp8_nemo340b_1
  Description: sanity-fp8-nemo340b-1
  No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /results/jax_toolbox_2024-08-17_00-23-43/Tests.1/0/cloudai_sbatch_script.sh

@srivatsankrishnan changed the title from "Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy)" to "[Draft]: Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy)" on Aug 13, 2024
@srivatsankrishnan changed the title from "[Draft]: Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy)" to "[Draft]: Add Nemotron model via PAXML to CloudAI" on Aug 16, 2024
@srivatsankrishnan changed the title from "[Draft]: Add Nemotron model via PAXML to CloudAI" to "Add Nemotron model via PAXML to CloudAI" on Aug 17, 2024
@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review August 17, 2024 02:21

@amaslenn amaslenn left a comment

Looks like this PR depends on #170, so it is a bit hard to understand what is unique to this PR itself, especially in the context of testing. Please try to keep the coverage at least at its current level.

* Long-running jobs in CW were slowing down (by ~4x) due to serialization of the profiling stderr generation
* Created rank-specific profiling stderr generation during the profiling stage
* Modified the job_status_retrieval_strategy to parse rank-specific profiling stderr files
* TODO: Unit tests have not been changed yet, so CI/CD will still fail
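
The rank-specific stderr approach described in the commit message above can be sketched roughly as follows: each rank writes to its own file, and the status check scans every per-rank file. The file naming scheme (`profile_stderr_<rank>.txt`), the function names, and the `"error"` marker are assumptions made for illustration; they do not reflect CloudAI's actual `job_status_retrieval_strategy`.

```python
import re
from pathlib import Path


def rank_stderr_path(output_dir, rank):
    """Hypothetical per-rank stderr file, so ranks never share one file."""
    return Path(output_dir) / f"profile_stderr_{rank}.txt"


def find_failed_ranks(output_dir, marker="error"):
    """Return the sorted ranks whose stderr file contains `marker`.

    The match is case-insensitive. Files that do not follow the assumed
    naming scheme are ignored.
    """
    failed = []
    for path in Path(output_dir).glob("profile_stderr_*.txt"):
        match = re.fullmatch(r"profile_stderr_(\d+)\.txt", path.name)
        if match is None:
            continue
        text = path.read_text(errors="replace").lower()
        if marker.lower() in text:
            failed.append(int(match.group(1)))
    return sorted(failed)
```

Because each rank owns its file, writes do not contend on a single shared stderr stream, which is the serialization the commit message identifies as the cause of the slowdown.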
@srivatsankrishnan changed the title from "Add Nemotron model via PAXML to CloudAI" to "Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs" on Sep 5, 2024
@srivatsankrishnan

> Looks like this PR depends on #170, so it is a bit hard to understand what is unique to this PR itself, especially in the context of testing. Please try to keep the coverage at least at its current level.

This PR adds a lot of features required for running 1K-4K GPU jobs. I updated the PR summary to reflect this.

@TaekyungHeo added the "feature" and "Oct24" (Oct'24 release feature) labels on Sep 5, 2024
@srinivas212 srinivas212 merged commit 2181e22 into NVIDIA:main Sep 5, 2024
2 checks passed