Describe the bug
I'm trying to follow the SFT tutorial on a Llama-3-8b LLM, and the process fails with the error:
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Failed to CUDA calloc async 608 bytes
Steps/Code to reproduce bug
Within the NeMo container, I followed the tutorial to convert the Llama-3-8b weights from Hugging Face to .nemo format, then I ran the SFT training command shown below.
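Reconstructed from the Hydra overrides captured in the stack trace further down (the plain python launcher and the line breaks are assumptions; the overrides themselves are verbatim from the error output), the command was equivalent to:
python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.num_nodes=1 \
   trainer.devices=1 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.megatron_amp_O2=True \
   model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=/results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss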
Expected behavior
I expected SFT to succeed.
Environment overview (please complete the following information)
Environment location: VMware vSphere 8
Method of NeMo-Aligner install: I used container nemo:24.05.01
If method of install is [Docker], provide docker pull & docker run commands used
docker pull nvcr.io/nvidia/nemo:24.05.01
docker run --runtime nvidia --gpus all \
-v ~/HF:/workspace/huggingface \
-v ~/nemo:/workspace/nemo \
--name my_nemo -td nvcr.io/nvidia/nemo:24.05.01
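The error message suggests rerunning with NCCL_DEBUG=INFO for details. A sketch of a container start that would capture that extra logging (the --ipc/--ulimit flags follow the general NGC multi-GPU container recommendations and were not part of the original docker run above):
docker run --runtime nvidia --gpus all \
-e NCCL_DEBUG=INFO \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/HF:/workspace/huggingface \
-v ~/nemo:/workspace/nemo \
--name my_nemo_debug -td nvcr.io/nvidia/nemo:24.05.01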
Environment details
nvidia-smi output from within the container
nvidia-smi
Sat Jul 13 17:38:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100XM-80C On | 00000000:03:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100XM-80C On | 00000000:03:01.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Additional context
I'm using 2 x H100 GPUs. These work normally: I have already used them on the same VM to serve Llama-3-70b with vLLM without any issues. Of course, the vLLM container was stopped before I tried to run SFT in the NeMo-Aligner container.
Stack trace:
[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:118]
************** Experiment configuration ***********
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:119]
name: megatron_gpt_sft
trainer:
  num_nodes: 1
  devices: 1
  accelerator: gpu
  precision: bf16
  sft:
    max_epochs: 1
    max_steps: -1
    val_check_interval: 1000
    save_interval: ${.val_check_interval}
    limit_train_batches: 1.0
    limit_val_batches: 40
    gradient_clip_val: 1.0
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_time: null
  max_epochs: ${.sft.max_epochs}
  max_steps: ${.sft.max_steps}
exp_manager:
  explicit_log_dir: /results
  exp_dir: null
  name: ${name}
  create_wandb_logger: true
  wandb_logger_kwargs:
    project: sft_run
    name: dolly_sft_run
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 5
    mode: min
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{${.monitor}:.3f}-{step}-{consumed_samples}-{epoch}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: false
model:
  seed: 1234
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  restore_from_path: /workspace/nemo/models/llama3-8b/mcore_gpt.nemo
  resume_from_checkpoint: null
  save_nemo_on_validation_end: true
  sync_batch_comm: false
  megatron_amp_O2: true
  encoder_seq_length: 4096
  sequence_parallel: false
  activations_checkpoint_granularity: null
  activations_checkpoint_method: null
  activations_checkpoint_num_layers: null
  activations_checkpoint_layers_per_pipeline: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  steerlm2:
    forward_micro_batch_size: 1
    micro_batch_size: 1
  peft:
    peft_scheme: none
    restore_from_path: null
    lora_tuning:
      target_modules:
      - attention_qkv
      adapter_dim: 32
      adapter_dropout: 0.0
      column_init_method: xavier
      row_init_method: zero
      layer_selection: null
      weight_tying: false
      position_embedding_strategy: null
  data:
    chat: false
    chat_prompt_tokens:
      system_turn_start: "\0"
      turn_start: "\x11"
      label_start: "\x12"
      end_of_turn: '
        '
      end_of_name: '
        '
    sample: false
    num_workers: 0
    dataloader_type: single
    train_ds:
      file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: true
      memmap_workers: null
      max_seq_length: ${model.encoder_seq_length}
      min_seq_length: 1
      drop_last: true
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: false
      truncation_field: input
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      hf_dataset: false
      truncation_method: right
    validation_ds:
      file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: false
      memmap_workers: ${model.data.train_ds.memmap_workers}
      max_seq_length: ${model.data.train_ds.max_seq_length}
      min_seq_length: 1
      drop_last: true
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      hf_dataset: false
      truncation_method: right
      output_original_text: true
  optim:
    name: distributed_fused_adam
    lr: 5.0e-06
    weight_decay: 0.01
    betas:
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 10
      constant_steps: 1000
      min_lr: 9.0e-07
[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-07-13 17:26:31 exp_manager:708] Exp_manager is logging to /results, but it already exists.
[NeMo W 2024-07-13 17:26:31 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/results/checkpoints. Training from scratch.
[NeMo I 2024-07-13 17:26:31 exp_manager:396] Experiments will be logged at /results
[NeMo I 2024-07-13 17:26:31 exp_manager:856] TensorboardLogger has been set up
[NeMo I 2024-07-13 17:26:31 exp_manager:871] WandBLogger has been set up
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-07-13 17:26:42 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:310] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:311] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:331] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:343] Rank 0 has embedding group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:349] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:350] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-07-13 17:26:42 megatron_init:351] All embedding group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:352] Rank 0 has embedding rank: 0
24-07-13 17:26:42 - PID:928 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-07-13 17:26:42 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B
[NeMo W 2024-07-13 17:26:42 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[NeMo I 2024-07-13 17:26:42 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:498] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: activation_func_fp8_input_store in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: qk_layernorm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: test_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: calculate_per_token_loss in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_dot_product_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_multi_head_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_load_balancing_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_topk in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_grouped_gemm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_aux_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_z_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_input_jitter_eps in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dropping in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dispatcher_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_per_layer_logging in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_expert_capacity_factor in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_pad_expert_input_to_capacity in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_drop_policy in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_layer_recompute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: disable_parameter_transpose_cache in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: enable_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: rotary_percent in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
[NeMo I 2024-07-13 17:26:56 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.sft.max_steps=-1', 'trainer.sft.limit_val_batches=40', 'trainer.sft.val_check_interval=1000', 'model.megatron_amp_O2=True', 'model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo', 'model.optim.lr=5e-6', 'model.answer_only_loss=True', 'model.data.num_workers=0', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=128', 'model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'exp_manager.create_wandb_logger=True', 'exp_manager.explicit_log_dir=/results', 'exp_manager.wandb_logger_kwargs.project=sft_run', 'exp_manager.wandb_logger_kwargs.name=dolly_sft_run', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss']
Traceback (most recent call last):
File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 243, in <module>
main()
File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 129, in main
ptl_model, updated_cfg = load_from_nemo(
File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 98, in load_from_nemo
model = cls.restore_from(
File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
return super().restore_from(
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 464, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 51, in restore_from
return super().restore_from(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1172, in restore_from
checkpoint = checkpoint_io.load_checkpoint(tmp_model_weights_dir, sharded_state_dict=checkpoint)
File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 78, in load_checkpoint
return dist_checkpointing.load(
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 133, in load
validate_sharding_integrity(nested_values(sharded_state_dict))
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 425, in validate_sharding_integrity
torch.distributed.all_gather_object(all_sharding, sharding)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2310, in all_gather_object
all_gather(object_size_list, local_size, group=group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2724, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 608 bytes
Same issue here with H100 and NeMo.
Have you found any solutions?
Edit: There are other errors labelled as ncclUnhandledCudaError, but as far as I've searched online, the error that is exactly the same as mine is the one in this repo:
Last error:
Failed to CUDA calloc async 608 bytes
My error even shows the exact same number of bytes, 608.