Composer profiler not supporting multi-node settings #1270
Labels: bug (Something isn't working)
Comments
@YilunKuang thanks for reporting and digging into this issue. We'll get this fixed in the next Composer release. Are you running off of
@bandish-shah I am using the 0.8.0 release. Thanks! Look forward to the new changes.
ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this issue on Aug 3, 2022
mosaicml#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps. Did not add test cases as we do not support multi-node testing. Closes mosaicml#1270 https://mosaicml.atlassian.net/browse/CO-674
ravi-mosaicml added a commit that referenced this issue on Aug 4, 2022
#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps. Did not add test cases as we do not support multi-node testing. Closes #1270 Closes https://mosaicml.atlassian.net/browse/CO-674
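The idea in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Composer patch: `clock_offsets_for_node` is a hypothetical helper, and it assumes ranks are laid out contiguously per node, so that the local rank zero of a node has global rank `global_rank - local_rank`. The timestamps are the ones reported in this issue for the second node.

```python
def clock_offsets_for_node(ranks_to_clock_sync, global_rank, local_rank):
    """Hypothetical helper: offsets relative to this node's local rank zero.

    Assumes a contiguous rank layout, so the local rank zero of the
    current node has global rank `global_rank - local_rank`.
    """
    local_rank_zero = global_rank - local_rank
    reference = ranks_to_clock_sync[local_rank_zero]
    return {rank: sync - reference for rank, sync in ranks_to_clock_sync.items()}


# Clock syncs reported in the issue for the second node (global ranks 4-7).
node1_syncs = {
    4: 1657466255542351,
    5: 1657466255636291,
    6: 1657466255665198,
    7: 1657466255576674,
}

# Called from process 4, which is local rank 0 on the second node.
offsets = clock_offsets_for_node(node1_syncs, global_rank=4, local_rank=0)
print(offsets)
```

Because the reference is looked up by the node's own local rank zero, the lookup succeeds on every node regardless of which global ranks that node hosts.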
Note @YilunKuang we have fixed this and the fix will go out in the next release!
Thank you @hanlint ! Look forward to the new release.
Environment
Setting
I was running the Composer trainer for ResNet50 training with the profiler in a multi-node environment (two nodes, each with 4 GPUs) using PyTorch with SLURM. All of the environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `LOCAL_WORLD_SIZE`, `NODE_RANK`, `MASTER_ADDR`, `MASTER_PORT`) are properly set, and the code runs smoothly without the profiler.

I am able to get a merged trace file from processes 0-3 on the first node, saved as `merged_trace.json`, but process 4 on the second node fails with an error.

The value of `ranks_to_clock_sync` on process 4 (local_rank = 0 on the second node) is `ranks_to_clock_sync={4: 1657466255542351, 5: 1657466255636291, 6: 1657466255665198, 7: 1657466255576674}`, and the value of `ranks_to_clock_sync` on process 0 is `ranks_to_clock_sync={0: 1657466255521662, 1: 1657466255676957, 2: 1657466255537099, 3: 1657466255521094}`.

It looks like the `ranks_to_clock_sync` dictionary is generated by the `_get_rank_to_clock_syncs` function in json_trace_merger.py, and `_get_rank_to_clock_syncs` explicitly sets the keys of `ranks_to_clock_sync` to global ranks. So it looks like the Composer profiler is not multi-node compatible right now?
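To make the mismatch concrete, here is a small sketch using the values reported above. It assumes, without confirmation from the issue text, that the merger uses global rank 0 as the reference clock; `reference_clock` is a hypothetical helper and not a Composer function.

```python
# Values reported in the issue: each node only holds the clock syncs of its own ranks.
node0_syncs = {0: 1657466255521662, 1: 1657466255676957, 2: 1657466255537099, 3: 1657466255521094}
node1_syncs = {4: 1657466255542351, 5: 1657466255636291, 6: 1657466255665198, 7: 1657466255576674}


def reference_clock(ranks_to_clock_sync, reference_rank=0):
    """Hypothetical helper: return the clock sync of the reference rank."""
    return ranks_to_clock_sync[reference_rank]


# On the first node, global rank 0 is present, so the lookup succeeds.
print(reference_clock(node0_syncs))

# On the second node the keys are 4-7, so a lookup of global rank 0 fails.
try:
    reference_clock(node1_syncs)
    second_node_ok = True
except KeyError as err:
    second_node_ok = False
    print(f"second node has no entry for global rank {err}")
```

Under that assumption, any node other than the first would fail at the lookup, which is consistent with only the first node producing `merged_trace.json`.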