
overlapping issue about backward of LayerNormLinear #1353

Open
cos120 opened this issue Dec 3, 2024 · 3 comments

@cos120

cos120 commented Dec 3, 2024

Hi, folks.

I am using 4 A100-SXM4 GPUs with PyTorch 2.4.0, Megatron-Core 0.9.0, and Transformer Engine (0.11.0+fc03478), with TP=2/PP=2 and sequence parallelism.

I found that if I set TORCH_NCCL_ENABLE_TIMING=1 to time all NCCL operations, the all-gather/reduce-scatter of sequence parallelism in LayerNormLinear and LayerNormMLP no longer overlaps with dgrad/wgrad.
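(For reference, a minimal sketch of how the flag is presumably enabled; the exact launch setup isn't shown in the issue. As far as I know the variable is read when the NCCL process group is constructed, so it has to be set before that point or exported before torchrun.)

```python
# Sketch only: the actual launch command/config is not shown in the issue.
# TORCH_NCCL_ENABLE_TIMING is read when the NCCL process group is created,
# so set it before init_process_group (or export it before torchrun).
import os
os.environ["TORCH_NCCL_ENABLE_TIMING"] = "1"

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # NCCL work events are now created with the timing flag
```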

Timeline without TORCH_NCCL_ENABLE_TIMING=1:

There are 4 cudaEventRecord calls, but those events should be created without the timing flag.

[timeline screenshots]

Timeline with TORCH_NCCL_ENABLE_TIMING=1:

There are 5 cudaEventRecord calls, and PyTorch creates two of them with the timing flag.

[timeline screenshots]

Why would recording an event with the timing flag break the overlap? The NCCL kernels use 24 SMs, so the matmul has enough SMs available to launch.

[timeline screenshot]
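A minimal standalone sketch (not from the issue attachments; the stream setup and matmul sizes are my own assumptions) that could help isolate whether a single timing-enabled cudaEventRecord on a side stream is enough to hurt overlap. ProcessGroupNCCL's event handling is more involved than this, so it may or may not reproduce the behavior:

```python
# Standalone sketch (assumption: not part of the original issue) comparing overlap of
# work on two streams when an extra CUDA event is recorded with vs. without the timing
# flag, loosely mimicking the event ProcessGroupNCCL records before a collective.
import torch

def timed_run(enable_timing: bool) -> float:
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()

    with torch.cuda.stream(s1):
        # Extra event record, with the timing flag toggled by the argument.
        torch.cuda.Event(enable_timing=enable_timing).record(s1)
        for _ in range(20):
            a @ b
    with torch.cuda.stream(s2):
        for _ in range(20):
            b @ a

    # Make the default stream wait for both side streams before stopping the timer.
    torch.cuda.current_stream().wait_stream(s1)
    torch.cuda.current_stream().wait_stream(s2)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

timed_run(False)  # warm-up
print("with timing-enabled event:  %.2f ms" % timed_run(True))
print("with timing-disabled event: %.2f ms" % timed_run(False))
```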

Here are the timeline files: torch_record.json (with TORCH_NCCL_ENABLE_TIMING=1) and no_record.json (TORCH_NCCL_ENABLE_TIMING not set).
timeline.tar.gz

@timmoon10
Collaborator

It's strange that TORCH_NCCL_ENABLE_TIMING=1 has this effect. I haven't been able to fully dig into what it does in PyTorch, but as far as I can tell the only extra thing is that it records a CUDA event before each NCCL collective: https://github.com/pytorch/pytorch/blob/e499b46465bc6e5f1a95f158e44bbf0f8356a220/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2997
However, it's possible the profiler could have more complicated interactions.

Can you provide a timeline of the LayerNormLinear backward with TORCH_NCCL_ENABLE_TIMING=1? The provided timeline shows Linear, which does not overlap its tensor-parallel communication.

@cos120
Author

cos120 commented Dec 3, 2024


Thanks for your reply. I have corrected the images and uploaded two timeline files: torch_record.json with TORCH_NCCL_ENABLE_TIMING=1 and no_record.json without TORCH_NCCL_ENABLE_TIMING set.

@cos120
Author

cos120 commented Dec 12, 2024

@timmoon10 any update?😭
