Describe the bug
I'm trying to follow the SFT tutorial on a Llama-3-8b LLM, and the process fails with the error:
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Failed to CUDA calloc async 608 bytes
Steps/Code to reproduce bug
Within the NeMo container, I followed the tutorial to convert the Llama-3-8b weights from Hugging Face to .nemo format, then I ran the SFT training command shown below.
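Reconstructed from the Hydra overrides captured in the stack trace further down (the plain python launcher and the line breaks are assumptions; the overrides themselves are verbatim from the error output), the command was equivalent to:
python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.num_nodes=1 \
   trainer.devices=1 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.megatron_amp_O2=True \
   model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=/results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss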
Expected behavior
I expected SFT to succeed.
Environment overview (please complete the following information)
Environment location: VMware vSphere 8
Method of NeMo-Aligner install: I used container nemo:24.05.01
If method of install is [Docker], provide docker pull & docker run commands used
docker pull nvcr.io/nvidia/nemo:24.05.01
docker run --runtime nvidia --gpus all \
-v ~/HF:/workspace/huggingface \
-v ~/nemo:/workspace/nemo \
--name my_nemo -td nvcr.io/nvidia/nemo:24.05.01
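The error message suggests rerunning with NCCL_DEBUG=INFO for details. A sketch of a container start that would capture that extra logging (the --ipc/--ulimit flags follow the general NGC multi-GPU container recommendations and were not part of the original docker run above):
docker run --runtime nvidia --gpus all \
-e NCCL_DEBUG=INFO \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/HF:/workspace/huggingface \
-v ~/nemo:/workspace/nemo \
--name my_nemo_debug -td nvcr.io/nvidia/nemo:24.05.01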
Environment details
nvidia-smi output from within the container
nvidia-smi
Sat Jul 13 17:38:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100XM-80C On | 00000000:03:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100XM-80C On | 00000000:03:01.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Additional context
I'm using 2 x H100 GPUs. These work normally: I have already used them on the same VM to serve Llama-3-70b with vLLM without any issues. Of course, the vLLM container was stopped before I tried to run SFT in the NeMo-Aligner container.
Stack trace:
[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:118]
************** Experiment configuration ***********
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:119]
name: megatron_gpt_sft
trainer:
  num_nodes: 1
  devices: 1
  accelerator: gpu
  precision: bf16
  sft:
    max_epochs: 1
    max_steps: -1
    val_check_interval: 1000
    save_interval: ${.val_check_interval}
    limit_train_batches: 1.0
    limit_val_batches: 40
    gradient_clip_val: 1.0
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_time: null
  max_epochs: ${.sft.max_epochs}
  max_steps: ${.sft.max_steps}
exp_manager:
  explicit_log_dir: /results
  exp_dir: null
  name: ${name}
  create_wandb_logger: true
  wandb_logger_kwargs:
    project: sft_run
    name: dolly_sft_run
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 5
    mode: min
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{${.monitor}:.3f}-{step}-{consumed_samples}-{epoch}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: false
model:
  seed: 1234
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  restore_from_path: /workspace/nemo/models/llama3-8b/mcore_gpt.nemo
  resume_from_checkpoint: null
  save_nemo_on_validation_end: true
  sync_batch_comm: false
  megatron_amp_O2: true
  encoder_seq_length: 4096
  sequence_parallel: false
  activations_checkpoint_granularity: null
  activations_checkpoint_method: null
  activations_checkpoint_num_layers: null
  activations_checkpoint_layers_per_pipeline: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  steerlm2:
    forward_micro_batch_size: 1
    micro_batch_size: 1
  peft:
    peft_scheme: none
    restore_from_path: null
    lora_tuning:
      target_modules:
      - attention_qkv
      adapter_dim: 32
      adapter_dropout: 0.0
      column_init_method: xavier
      row_init_method: zero
      layer_selection: null
      weight_tying: false
      position_embedding_strategy: null
  data:
    chat: false
    chat_prompt_tokens:
      system_turn_start: "\0"
      turn_start: "\x11"
      label_start: "\x12"
      end_of_turn: '
        '
      end_of_name: '
        '
    sample: false
    num_workers: 0
    dataloader_type: single
    train_ds:
      file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: true
      memmap_workers: null
      max_seq_length: ${model.encoder_seq_length}
      min_seq_length: 1
      drop_last: true
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: false
      truncation_field: input
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      hf_dataset: false
      truncation_method: right
    validation_ds:
      file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: false
      memmap_workers: ${model.data.train_ds.memmap_workers}
      max_seq_length: ${model.data.train_ds.max_seq_length}
      min_seq_length: 1
      drop_last: true
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      hf_dataset: false
      truncation_method: right
      output_original_text: true
  optim:
    name: distributed_fused_adam
    lr: 5.0e-06
    weight_decay: 0.01
    betas:
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 10
      constant_steps: 1000
      min_lr: 9.0e-07
[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-07-13 17:26:31 exp_manager:708] Exp_manager is logging to /results, but it already exists.
[NeMo W 2024-07-13 17:26:31 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/results/checkpoints. Training from scratch.
[NeMo I 2024-07-13 17:26:31 exp_manager:396] Experiments will be logged at /results
[NeMo I 2024-07-13 17:26:31 exp_manager:856] TensorboardLogger has been set up
[NeMo I 2024-07-13 17:26:31 exp_manager:871] WandBLogger has been set up
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-07-13 17:26:42 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:310] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:311] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:331] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:343] Rank 0 has embedding group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:349] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:350] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-07-13 17:26:42 megatron_init:351] All embedding group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:352] Rank 0 has embedding rank: 0
24-07-13 17:26:42 - PID:928 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-07-13 17:26:42 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B
[NeMo W 2024-07-13 17:26:42 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[NeMo I 2024-07-13 17:26:42 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:498] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: activation_func_fp8_input_store in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: qk_layernorm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: test_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: calculate_per_token_loss in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_dot_product_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_multi_head_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_load_balancing_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_topk in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_grouped_gemm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_aux_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_z_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_input_jitter_eps in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dropping in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dispatcher_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_per_layer_logging in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_expert_capacity_factor in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_pad_expert_input_to_capacity in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_drop_policy in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_layer_recompute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: disable_parameter_transpose_cache in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: enable_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: rotary_percent in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
[NeMo I 2024-07-13 17:26:56 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.sft.max_steps=-1', 'trainer.sft.limit_val_batches=40', 'trainer.sft.val_check_interval=1000', 'model.megatron_amp_O2=True', 'model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo', 'model.optim.lr=5e-6', 'model.answer_only_loss=True', 'model.data.num_workers=0', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=128', 'model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'exp_manager.create_wandb_logger=True', 'exp_manager.explicit_log_dir=/results', 'exp_manager.wandb_logger_kwargs.project=sft_run', 'exp_manager.wandb_logger_kwargs.name=dolly_sft_run', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss']
Traceback (most recent call last):
File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 243, in <module>
main()
File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 129, in main
ptl_model, updated_cfg = load_from_nemo(
File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 98, in load_from_nemo
model = cls.restore_from(
File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
return super().restore_from(
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 464, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 51, in restore_from
return super().restore_from(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1172, in restore_from
checkpoint = checkpoint_io.load_checkpoint(tmp_model_weights_dir, sharded_state_dict=checkpoint)
File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 78, in load_checkpoint
return dist_checkpointing.load(
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 133, in load
validate_sharding_integrity(nested_values(sharded_state_dict))
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 425, in validate_sharding_integrity
torch.distributed.all_gather_object(all_sharding, sharding)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2310, in all_gather_object
all_gather(object_size_list, local_size, group=group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2724, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 608 bytes
Same issue here with H100 and NeMo.
Have you found any solutions?
Edit: There are other errors labelled as ncclUnhandledCudaError, but as far as I've searched online, the error that is exactly the same as mine is the one in this repo:
Last error:
Failed to CUDA calloc async 608 bytes
My error even shows the exact same number of bytes, 608.