
JUPITER Benchmark Suite: Megatron-LM


This benchmark is part of the JUPITER Benchmark Suite. See the repository of the suite for some general remarks.

This repository contains the Megatron-LM NLP/LLM benchmark. DESCRIPTION.md contains details for compilation, execution, and evaluation.

The required source code (Megatron-LM, Apex) is included in the ./src/ subdirectory as submodules of the upstream repositories: github.com/NVIDIA/Megatron-LM for Megatron-LM and github.com/NVIDIA/apex for Apex. Sample data files are also included.
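If the submodules are not checked out yet, they can be fetched with standard git commands. A minimal sketch, assuming the repository is cloned from the URL implied by its name:

```bash
# Clone the benchmark and fetch the Megatron-LM and Apex submodules in ./src/
git clone https://github.com/FZJ-JSC/jubench-megatron-lm.git
cd jubench-megatron-lm
git submodule update --init --recursive
```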

Overview of Benchmark

Description Of Folder Structure

  • benchmark
    • aux
      • tokenizers
      • script used for getting data and tokenizers: get_shrink_data_and_tokenizers.sh
      • script used for preprocessing data: job_preprocess_data.sbatch
      • sample 10MB OSCAR dataset obtained using get_shrink_data_and_tokenizers.sh
    • env
      • script for activating the python virtual env; activate.bash
      • script to set up python virtual env; setup_venv.sh
    • slurm
      • sbatch scripts for 13B and 175B model to be used when running without JUBE
    • jube
      • contains accompanying files for JUBE run and the JUBE yaml file
  • src
    • data : contains the preprocessed data (*.idx and *.bin files)
    • compile_build.sh : script to build the software dependencies
    • variables.bash : file that sets important paths
    • prebuild_kernels.py : script to prebuild fused kernels

Workflow Without JUBE:

Getting Data and Tokenizers

The following steps are needed if data and tokenizers are not already present in this repository; a combined shell sketch follows the list:

  • Step 1: Set NLP_BENCH_ROOT variable as export NLP_BENCH_ROOT=<rootdir path of this benchmark> in your bash shell
  • Step 2: cd benchmark/aux/
  • Step 3: Run bash get_shrink_data_and_tokenizers.sh to get the tokenizers and shrink the raw data oscar-1GB.jsonl.xz down to oscar-10MB.jsonl.xz
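Combined, the three steps look like this (the checkout path below is a placeholder for the root directory of this benchmark):

```bash
export NLP_BENCH_ROOT=$HOME/jubench-megatron-lm   # placeholder: root directory of this benchmark
cd "$NLP_BENCH_ROOT/benchmark/aux"
bash get_shrink_data_and_tokenizers.sh            # gets tokenizers, shrinks oscar-1GB.jsonl.xz to oscar-10MB.jsonl.xz
```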

Preprocessing Data

If your src/data folder does not contain preprocessed data (*.idx and *.bin files), then execute sbatch job_preprocess_data.sbatch from the benchmark/aux directory after completing Step 5 of "Workflow With Preprocessed Data And Tokenizers Available" below.

The job_preprocess_data.sbatch script in benchmark/aux/ preprocesses oscar-10MB.jsonl.xz and places the result in src/data/. The script can be modified to preprocess any dataset of choice.
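A minimal sketch of the submission, assuming NLP_BENCH_ROOT is already set and the build steps below have been completed:

```bash
# Submit the preprocessing job from benchmark/aux
cd "$NLP_BENCH_ROOT/benchmark/aux"
sbatch job_preprocess_data.sbatch
```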

Workflow With Preprocessed Data And Tokenizers Available

  • Step 1: cd into the folder of this benchmark
  • Step 2: Set NLP_BENCH_ROOT variable as export NLP_BENCH_ROOT=<rootdir path of this benchmark> in your bash shell
  • Step 3: Set TORCH_CUDA_ARCH_LIST according to the GPU's compute capability in benchmark/env/activate.bash
  • Step 4: Run bash benchmark/env/setup_venv.sh
  • Step 5: Run bash src/compile_build.sh
  • Step 6: Run sbatch benchmark/slurm/jobscript_13B.sbatch or sbatch benchmark/slurm/jobscript_175B.sbatch
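The six steps as one shell sketch (the checkout path is a placeholder, and the compute capability value is only an example; on NVIDIA Hopper GPUs it is 9.0):

```bash
cd <rootdir path of this benchmark>
export NLP_BENCH_ROOT=$PWD
# Step 3: edit TORCH_CUDA_ARCH_LIST in benchmark/env/activate.bash first, e.g. "9.0" for Hopper GPUs
bash benchmark/env/setup_venv.sh              # Step 4: set up the Python virtual environment
bash src/compile_build.sh                     # Step 5: build the software dependencies
sbatch benchmark/slurm/jobscript_13B.sbatch   # Step 6: or jobscript_175B.sbatch for the 175B model
```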

The *.out output file will contain result logs of the following form; these are the important lines:

[default3]: iteration       10/  292968 | consumed samples:        10240 | elapsed time per iteration (s): 35.8651 | learning rate: 4.734E-06 | global batch size:  1024 | lm loss: 1.332803E+01 | loss scale: 4096.0 | grad norm: 42.627 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 28.551 | TFLOPs: 199.03 |
[default3]: iteration       20/  292968 | consumed samples:        20480 | elapsed time per iteration (s): 34.9991 | learning rate: 9.467E-06 | global batch size:  1024 | lm loss: 1.010884E+01 | loss scale: 4096.0 | grad norm: 13.038 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 29.258 | TFLOPs: 203.96 |
[default3]: iteration       30/  292968 | consumed samples:        30720 | elapsed time per iteration (s): 34.8709 | learning rate: 1.420E-05 | global batch size:  1024 | lm loss: 9.072961E+00 | loss scale: 4096.0 | grad norm: 26.640 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 29.365 | TFLOPs: 204.71 |
[default3]: iteration       40/  292968 | consumed samples:        40960 | elapsed time per iteration (s): 35.3346 | learning rate: 1.893E-05 | global batch size:  1024 | lm loss: 8.486469E+00 | loss scale: 4096.0 | grad norm: 3.441 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 28.980 | TFLOPs: 202.02 |
[default3]: iteration       50/  292968 | consumed samples:        51200 | elapsed time per iteration (s): 35.3357 | learning rate: 2.367E-05 | global batch size:  1024 | lm loss: 8.

The metric tokens_per_sec should be calculated as (1.0/$elapsed_time_per_iteration)*$global_batch_size*$sequence_length, with the elapsed time per iteration and global batch size taken from the *.out file.
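As a sketch, the first log line above can be turned into tokens_per_sec as follows; the sequence length of 2048 is an assumption here, so take the actual value from the jobscript (see the hint below):

```bash
elapsed_time_per_iteration=35.8651   # seconds, from the log line above
global_batch_size=1024               # from the log line above
sequence_length=2048                 # assumption; read the actual value from the jobscript

awk -v t="$elapsed_time_per_iteration" -v b="$global_batch_size" -v s="$sequence_length" \
    'BEGIN { printf "tokens_per_sec = %.2f\n", (1.0 / t) * b * s }'
```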

For submission, the throughput tokens_per_sec is converted into the time a hypothetical training run would require. This conversion assumes a training with 20 million tokens, using the formula

[time_to_report_in_seconds] = [tokens] / [tokens/second]

Example: for a 13B model result of 59463.14 tokens/sec, we obtain a duration of 20,000,000 / 59463.14 = 336.34 seconds.
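The same conversion as a one-liner, using the example throughput above:

```bash
tokens_per_sec=59463.14   # example 13B throughput
awk -v tps="$tokens_per_sec" 'BEGIN { printf "time_to_report_in_seconds = %.2f\n", 20000000 / tps }'
```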

Hint: sequence_length can be found in the jobscript.

Workflow With JUBE:

  • Step 1: cd into the folder of this benchmark
  • Step 2: Set TORCH_CUDA_ARCH_LIST according to GPU's compute capability in benchmark/env/activate.bash
  • Step 3: Execute either jube run benchmark/jube/nlp_benchmark.yaml --tag 175 for the 175B model or jube run benchmark/jube/nlp_benchmark.yaml --tag 13 for the 13B model
  • Step 4: Wait for the benchmark to run, then run jube continue nlp_benchmark_run -i last until no steps remain in the "wait" state
  • Step 5: After the benchmark finishes, run jube result -a nlp_benchmark_run -i last to print the benchmark results
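The same steps as plain commands (13B model shown; swap the tag for the 175B model):

```bash
jube run benchmark/jube/nlp_benchmark.yaml --tag 13   # or --tag 175 for the 175B model

# Repeat until no steps remain in the "wait" state
jube continue nlp_benchmark_run -i last

# Once all steps have finished, print the result table
jube result -a nlp_benchmark_run -i last
```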

Example result from JUBE:

|        system | version |   queue |    JobID |   Job_Time | Model_Size (Billion Param) | Nodes | Batch_Size | Pipeline_Parallel | Tensor_Parallel | Iterations | Avg_TFLOPs/GPU | Tokens/sec | time_to_report_in_seconds |
|---------------|---------|---------|----------|------------|----------------------------|-------|------------|-------------------|-----------------|------------|----------------|------------|---------------------------|
| juwelsbooster | 2024.01 | booster | 10011638 | "00:30:00" |                         13 |     8 |       1024 |                 4 |               2 |         20 |        206.885 |   60777.68 |                    329.07 |