Overlapping issue in the backward of LayerNormLinear #1353
Comments
- "It's strange that … Can you provide a timeline of the …"
- "Thanks for your reply, I corrected the image and uploaded two timeline files, …"
- "@timmoon10 any update? 😭"
Hi, folks.
I am using 4 A100-SXM4 GPUs with PyTorch 2.4.0, Megatron-Core (mcore) 0.9.0, and Transformer Engine (0.11.0+fc03478), with TP=2/PP=2 and sequence parallelism.
I found that if I set `TORCH_NCCL_ENABLE_TIMING=1` to time all NCCL operations, the sequence-parallel all-gather/reduce-scatter (AG/RS) in LayerNormLinear and LayerNormMLP no longer overlaps with dgrad/wgrad.

In the timeline without `TORCH_NCCL_ENABLE_TIMING=1` there are 4 `cudaEventRecord` calls, and all of those events are created without the timing flag. In the timeline with `TORCH_NCCL_ENABLE_TIMING=1` there are 5 `cudaEventRecord` calls, and torch creates two of them with the timing flag.

Why does recording events with the timing flag break the overlapping? The NCCL operators use only 24 SMs, so the matmul should have enough SMs left to launch.
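For context, here is a minimal sketch of the two event flavors involved, using a toy stream in place of the real ProcessGroupNCCL path: `torch.cuda.Event(enable_timing=False)` corresponds to `cudaEventCreateWithFlags(cudaEventDisableTiming)` (the lightweight flavor seen in the no-timing trace), while `enable_timing=True` creates the timestamp-recording events that `TORCH_NCCL_ENABLE_TIMING=1` switches on. The shapes and stream setup below are illustrative, not from the report.

```python
import torch

# Minimal sketch (a toy comm stream, not the real ProcessGroupNCCL path):
# enable_timing=False maps to cudaEventCreateWithFlags(cudaEventDisableTiming),
# the lightweight flavor; enable_timing=True records timestamps, which is
# what TORCH_NCCL_ENABLE_TIMING=1 turns on for the NCCL work events.
comm_stream = torch.cuda.Stream()
x = torch.randn(4096, 4096, device="cuda")

ev_light = torch.cuda.Event(enable_timing=False)  # no timing flag
ev_timed = torch.cuda.Event(enable_timing=True)   # timing flag set

with torch.cuda.stream(comm_stream):
    y = x + 1                     # stand-in for an NCCL collective
    ev_light.record(comm_stream)  # cheap marker, safe for overlap
    ev_timed.record(comm_stream)  # the extra timed cudaEventRecord

z = x @ x  # compute-stream matmul that should overlap with comm_stream
torch.cuda.synchronize()
```

If a timed record alone is enough to serialize the two streams in a trace of this toy, that would reproduce the symptom without Transformer Engine in the picture.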
Here are the timeline files: `torch_record.json` (captured with `TORCH_NCCL_ENABLE_TIMING=1`) and `no_record.json` (captured without `TORCH_NCCL_ENABLE_TIMING`).
timeline.tar.gz
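For anyone who wants to capture comparable traces, here is a minimal sketch using `torch.profiler`, with a toy single-GPU model standing in for the real TP/SP Transformer Engine layers (the model, shapes, and output filename are illustrative, not from the report):

```python
import os
import torch
from torch.profiler import ProfilerActivity, profile

# Sketch only: a toy single-GPU model stands in for the real TP/SP layers.
# TORCH_NCCL_ENABLE_TIMING must be set in the environment before the NCCL
# process group is created, so export it in the launcher, not mid-script.
print("TORCH_NCCL_ENABLE_TIMING =", os.environ.get("TORCH_NCCL_ENABLE_TIMING"))

model = torch.nn.Linear(4096, 4096).cuda()
batch = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(batch).sum().backward()

# Produces a Chrome trace comparable to torch_record.json / no_record.json.
prof.export_chrome_trace("timeline.json")
```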