Describe the bug
I am using the train_gpt3_175b_distributed.sh script to launch training on a single node with 4 A100 80GB GPUs. Training runs fine with tensor parallelism or pipeline parallelism, but fails as soon as I enable context parallelism. The output log is:
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1718685696
WARNING: could not find the metadata file /temp/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
/pytorch/torch/distributed/c10d_logger.py:83: FutureWarning: `torch.distributed._all_gather_base` is a private function and will be deprecated. Please use `torch.distributed.all_gather_into_tensor` instead.
return func(*args, **kwargs)
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.73, 0.78)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-19 15:56:43
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 500000
validation: 5010
test: 10
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-09-19 15:56:43
done with setup ...
training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (300.04, 309.25)
train/valid/test-data-iterators-setup ..........: (288.56, 334.41)
[before the start of training step] datetime: 2024-09-19 15:56:43
WARNING:megatron.core.utils:NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank2]: Traceback (most recent call last):
[rank2]: File "/Megatron-LM/pretrain_gpt.py", line 264, in <module>
[rank2]: pretrain(
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 355, in pretrain
[rank2]: iteration, num_floating_point_operations_so_far = train(
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 1234, in train
[rank2]: train_step(forward_step_func,
[rank2]: File "/Megatron-LM/megatron/training/training.py", line 718, in train_step
[rank2]: losses_reduced = forward_backward_func(
[rank2]: File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 468, in forward_backward_no_pipelining
[rank2]: output_tensor, num_tokens = forward_step(
[rank2]: File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 273, in forward_step
[rank2]: output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank2]: File "/Megatron-LM/pretrain_gpt.py", line 192, in forward_step
[rank2]: output_tensor = model(tokens, position_ids, attention_mask,
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 305, in forward
[rank2]: return self.module(*inputs, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/legacy/model/module.py", line 189, in forward
[rank2]: outputs = self.module(*inputs, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 217, in forward
[rank2]: hidden_states = self.decoder(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_block.py", line 496, in forward
[rank2]: hidden_states, context = layer(
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 377, in __call__
[rank2]: return super(MegatronModule, self).__call__(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 281, in forward
[rank2]: attention_output_with_bias = self.self_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 291, in forward
[rank2]: core_attn_out = self._checkpointed_attention_forward(
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 143, in _checkpointed_attention_forward
[rank2]: hidden_states = tensor_parallel.checkpoint(
[rank2]: File "/Megatron-LM/megatron/core/tensor_parallel/random.py", line 308, in checkpoint
[rank2]: return CheckpointFunction.apply(function, distribute_saved_activations, *args)
[rank2]: File "/pytorch/torch/autograd/function.py", line 575, in apply
[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank2]: File "/Megatron-LM/megatron/core/tensor_parallel/random.py", line 247, in forward
[rank2]: outputs = run_function(*args)
[rank2]: File "/Megatron-LM/megatron/core/transformer/attention.py", line 130, in custom_forward
[rank2]: output_ = self.core_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 589, in forward
[rank2]: core_attn_out = super().forward(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 6854, in forward
[rank2]: return self.flash_attention(
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/pytorch/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 4328, in forward
[rank2]: output = attn_forward_func_with_cp(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 3379, in attn_forward_func_with_cp
[rank2]: out = AttnFuncWithCPAndKVP2P.apply(
[rank2]: File "/pytorch/torch/autograd/function.py", line 575, in apply
[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1473, in forward
[rank2]: send_recv_reqs[i % 2] = flash_attn_p2p_communicate(
[rank2]: File "/miniconda3/envs/megatron_lm/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1249, in flash_attn_p2p_communicate
[rank2]: send_op = torch.distributed.isend(send_tensor, send_dst, cp_group)
[rank2]: File "/pytorch/torch/distributed/distributed_c10d.py", line 2071, in isend
[rank2]: return pg.send([tensor], dst, tag)
[rank2]: RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
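The failing call is torch.distributed.isend on the context-parallel group inside transformer_engine's flash_attn_p2p_communicate. To rule out a general NCCL point-to-point problem outside Megatron, I put together the minimal standalone check below (my own sketch, not part of the repo; the file name is hypothetical and the 2-rank subgroup layout is an assumption mirroring a context-parallel size of 2). Running it with NCCL_DEBUG=WARN, as the error message suggests, should show whether plain P2P traffic works on this node:

```python
# p2p_check.py -- hypothetical standalone test, launched with:
#   NCCL_DEBUG=WARN torchrun --nproc_per_node=4 p2p_check.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # CP-like subgroups of size 2: ranks {0, 1} and {2, 3}.
    # new_group must be called by all ranks, in the same order.
    subgroups = [dist.new_group([0, 1]), dist.new_group([2, 3])]
    group = subgroups[rank // 2]
    peer = rank ^ 1  # global rank of the partner inside the subgroup

    send_tensor = torch.full((8,), float(rank), device="cuda")
    recv_tensor = torch.empty(8, device="cuda")

    # Pair the send and recv like transformer_engine's P2P exchange;
    # batch_isend_irecv avoids ordering deadlocks between the two ranks.
    ops = [
        dist.P2POp(dist.isend, send_tensor, peer, group=group),
        dist.P2POp(dist.irecv, recv_tensor, peer, group=group),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize()

    print(f"rank {rank}: got {recv_tensor[0].item()} from rank {peer}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```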
To Reproduce
Run the train_gpt3_175b_distributed.sh example on a single node with 4 A100 80GB GPUs and context parallelism enabled (--context-parallel-size greater than 1); a sanity check of the decomposition is sketched below. The failure occurs on the first training step. The same run with only tensor or pipeline parallelism completes normally.
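For reference, Megatron-LM derives the data-parallel size from the other parallel sizes, so on 4 GPUs a context-parallel run decomposes roughly as follows (the sizes here are assumptions for illustration, not my exact flag values):

```python
# Sanity check of the parallel decomposition Megatron-LM expects.
world_size = 4        # one node, 4x A100 80GB
tp, pp, cp = 1, 1, 2  # hypothetical --tensor/--pipeline/--context-parallel-size

# Megatron-LM requires world_size to be divisible by tp * pp * cp and
# derives the data-parallel size from the rest of the world.
assert world_size % (tp * pp * cp) == 0
dp = world_size // (tp * pp * cp)
print(f"data-parallel size = {dp}")  # -> 2
```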
Expected behavior
Training proceeds with context parallelism enabled, just as it does when only tensor or pipeline parallelism is used.
Stack trace/logs
See the output log and the rank 2 traceback above; the run fails with NCCL Error 5: invalid usage inside transformer_engine's flash_attn_p2p_communicate.
Environment (please complete the following information):
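To collect the versions involved, something like the snippet below can be run (a sketch using standard attributes; it assumes transformer_engine is importable in the training environment):

```python
# Print the environment details requested above.
import torch
import transformer_engine

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", ".".join(map(str, torch.cuda.nccl.version())))
print("transformer_engine:", transformer_engine.__version__)
# The Megatron-LM commit can be reported with `git rev-parse HEAD`
# from the repository checkout.
```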