Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

srivatsankrishnan · 2024-08-13T18:02:01Z

Summary

This is an umbrella PR for supporting large GPU runs in CW. Though this PR was originally created for adding just Nemotron, there were bunch of feature request and changes for having resilient/best known method s that were needed to scale beyond 1k GPUs. As of today, we have successful runs upto 2K GPUs. The following features were added to via this PR

Nemotron models via PAXML.
XLA flags specific to perf/profile stages
Enable/Disable PGLE flow through a flag.
Start the container and launch the training through container name instead of sqsh file
Jax training does not need the mpi flag in srun. For NCCL pre-test it is okay
Perfomance fix for profiling stage in CloudAI. CloudAI jobs were ~4x slower (measured at 1k gpus jobs) compared to standalone bash scripts.
Nsys profiling only on rank-0 jobs (was slowing up jobs at 1k+ gpu scale)
Remove the nsys trace to sqlite converter (not needed as per Haixin)

Test Plan

CI/CD should pass. The existing unit test should support the new architecture.
More details will be updated later.
Test on internal systems to make sure it trains GPT/Grok/Nemotron correctly meeting existing performance targets
More details will be updated later. We were able to successfully scale to 2K gpu runs for Grok-1 via Paxml.

$ python cloudaix.py --mode run --system-config conf/internal/jax_toolbox/system/israel_1.toml --test-templates-dir conf/internal/jax_toolbox/test_template/ --tests-dir conf/internal/jax_toolbox/test --test-scenario conf/internal/jax_toolbox/test_scenario/nemo_proxy/test_scenario_israel_1.toml
[INFO] System configuration file: conf/internal/jax_toolbox/system/israel_1.toml
[INFO] Test templates directory: conf/internal/jax_toolbox/test_template
[INFO] Tests directory: conf/internal/jax_toolbox/test
[INFO] Test scenario file: conf/internal/jax_toolbox/test_scenario/nemo_proxy/test_scenario_israel_1.toml
[INFO] Output directory: None
[INFO] System Name: Israel-1
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: jax_toolbox
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: jax_toolbox

Section Name: Tests.1
  Test Name: sanity_fp8_nemo340b_1
  Description: sanity-fp8-nemo340b-1
  No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /results/jax_toolbox_2024-08-17_00-23-43/Tests.1/0/cloudai_sbatch_script.sh

…/Grok or more

This reverts commit dbc4841.

src/cloudai/schema/test_template/jax_toolbox/grok_slurm_command_gen_strategy.py

amaslenn

Looks like this PR depends on #170, it is a bit hard to understand for is uniq for this PR itself. Especially in the context of testing. Please try to keep the coverage at least on the same level as it is.

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py

… and perf runs

* Long runs jobs in CW slowing down due to serialization (by ~4X) of the profiling stderr generation * created rank specific profiling stderr generation duing profiling stage * Modified the job_status_retrieval_strategy to parse rank specific profiling stderr files * Todo: Not chaned the unit tests yet. CI/CD will fail still

srivatsankrishnan · 2024-09-05T00:45:40Z

Looks like this PR depends on #170, it is a bit hard to understand for is uniq for this PR itself. Especially in the context of testing. Please try to keep the coverage at least on the same level as it is.

This PR adds lot of features required for running 1k-4k gpu jobs. Updated the the PR summary to reflect this.

src/cloudai/schema/test_template/jax_toolbox/job_status_retrieval_strategy.py

tests/test_job_status_retrieval_strategy.py

Seperate Slurm comand gen class for GPT in jaxtoolbox

2770e9a

srivatsankrishnan changed the title ~~Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy)~~ [Draft]: Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy) Aug 13, 2024

srivatsankrishnan added 13 commits August 13, 2024 21:16

Seperate Slurm command gen class for Grok in jaxtoolbox

dbc4841

Create a selection strategy method in base class to select beteen GPT…

4aa27ad

…/Grok or more

Make conditional import to avoid circulaar imports for pytest

802c556

Adding support for enabling registry with multiple strategies.

487a5c5

Revert "Seperate Slurm command gen class for Grok in jaxtoolbox"

7cb10b5

This reverts commit dbc4841.

Add the grok command gen class back

fc0fb31

remove gork slurm gen

1a0bb4c

Fix the imports in init

1d968f7

remove create strategy from slurm comand gen

405778c

keep extract test name for the pytest

687dde8

Adding nemotron 340b support existing approach

3dd6fb2

lint fix

40026ab

Functional script generation for nemotron340b in Paxml

b460246

srivatsankrishnan changed the title ~~[Draft]: Add Nemotron model via PAXML to CloudAI (including refactor to Jaxtoolbox Command Gen Strategy)~~ [Draft]: Add Nemotron model via PAXML to CloudAI Aug 16, 2024

srivatsankrishnan added 3 commits August 17, 2024 03:01

Pushing the seperate class for gork and nemotron

0e06184

removing seperate classes for grok/gpt/nemo

04da262

Final working version with nemotron/grok/gpt + cleaning pytest

6e21956

srivatsankrishnan changed the title ~~[Draft]: Add Nemotron model via PAXML to CloudAI~~ Add Nemotron model via PAXML to CloudAI Aug 17, 2024

srivatsankrishnan added 2 commits August 17, 2024 05:17

Added new unit test for the env handling refactor

2a79ef1

Merge branch 'main' into nemotron

a92e753

srivatsankrishnan requested review from TaekyungHeo and srinivas212 August 17, 2024 02:20

srivatsankrishnan marked this pull request as ready for review August 17, 2024 02:21

amaslenn reviewed Aug 19, 2024

View reviewed changes

src/cloudai/schema/test_template/jax_toolbox/grok_slurm_command_gen_strategy.py Outdated Show resolved Hide resolved

Add number of nodes in srun for pre-test

757c819

amaslenn reviewed Aug 21, 2024

View reviewed changes

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py Show resolved Hide resolved

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py Outdated Show resolved Hide resolved

srivatsankrishnan added 2 commits August 22, 2024 14:55

remove mpi for jax runs and nsys to sqlite db conversion

0c208b9

modify to have only nsys profile for node and rank 0 for both profile…

212491c

… and perf runs

srivatsankrishnan added 10 commits August 24, 2024 00:14

modify the slurm commans to start and load container

f770573

Add command generation for XLA flags for perf and profile stages

2c01152

Disable PGLE path for bf16 runs

854e06f

disabling pgle based on a flag (bf16 doesn't need pgle)

4b4f342

Merge branch 'main' into nemotron

c01b010

ruff fixes + merge conflict fixes

ee4fdcb

fix pytest with job_status_retrival_stratergy

2d97509

ruffing

d25d85e

fix the test for large run optimization (container pre-loads etc)

b1c53c0

srivatsankrishnan changed the title ~~Add Nemotron model via PAXML to CloudAI~~ Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs Sep 5, 2024

TaekyungHeo reviewed Sep 5, 2024

View reviewed changes

src/cloudai/schema/test_template/jax_toolbox/job_status_retrieval_strategy.py Outdated Show resolved Hide resolved

TaekyungHeo reviewed Sep 5, 2024

View reviewed changes

tests/test_job_status_retrieval_strategy.py Show resolved Hide resolved

Fixing Taekyung's comments

c69f1f3

TaekyungHeo added feature Oct24 Oct'24 release feature labels Sep 5, 2024

TaekyungHeo requested a review from artemry-nv September 5, 2024 11:03

TaekyungHeo approved these changes Sep 5, 2024

View reviewed changes

srinivas212 approved these changes Sep 5, 2024

View reviewed changes

srinivas212 merged commit 2181e22 into NVIDIA:main Sep 5, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

srivatsankrishnan commented Aug 13, 2024 •

edited

Loading

amaslenn left a comment

srivatsankrishnan commented Sep 5, 2024

Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

Add Nemotron model via PAXML to CloudAI + optimization for large GPU runs #171

Conversation

srivatsankrishnan commented Aug 13, 2024 • edited Loading

Summary

Test Plan

amaslenn left a comment

Choose a reason for hiding this comment

srivatsankrishnan commented Sep 5, 2024

srivatsankrishnan commented Aug 13, 2024 •

edited

Loading