
Composer profiler not supporting multi-node settings #1270

Closed
YilunKuang opened this issue Jul 10, 2022 · 4 comments · Fixed by #1358
Labels
bug Something isn't working

Comments

@YilunKuang (Contributor)

Environment

  • OS: CentOS Linux 7 (Core)
  • Hardware: 8× NVIDIA A100-SXM4-40GB (two nodes with four GPUs each)

Setting
I was running the Composer trainer for ResNet-50 training with the profiler enabled in a multi-node environment (two nodes, each with 4 GPUs) using PyTorch with SLURM. All of the environment variables (RANK, WORLD_SIZE, LOCAL_RANK, LOCAL_WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT) are properly set, and the code runs smoothly without the profiler.
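
For context, the distributed environment variables were derived from SLURM along these lines (a minimal sketch; the actual launcher script is not shown, and the exact mapping below is illustrative):

import os

# Illustrative mapping from SLURM job variables to the torch/Composer
# distributed variables listed above; the real job script may differ.
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']                 # 8 (2 nodes x 4 GPUs)
os.environ['RANK'] = os.environ['SLURM_PROCID']                       # 0-7, global rank
os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']                # 0-3 on each node
os.environ['LOCAL_WORLD_SIZE'] = os.environ['SLURM_NTASKS_PER_NODE']  # 4
os.environ['NODE_RANK'] = os.environ['SLURM_NODEID']                  # 0 or 1
os.environ['MASTER_ADDR'] = 'node-0.example.com'                      # hostname of the first node (placeholder)
os.environ['MASTER_PORT'] = '29500'                                   # any free port (placeholder)

The profiler was configured as: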

from composer.profiler import JSONTraceHandler, Profiler, cyclic_schedule

# comprof_folder and torch_prof_folder are output directories defined earlier in the script.
composer_profiler = Profiler(
    trace_handlers=JSONTraceHandler(folder=comprof_folder, overwrite=True),
    schedule=cyclic_schedule(
        wait=0,
        warmup=1,
        active=4,
        repeat=1,
    ),
    torch_prof_folder=torch_prof_folder,
    torch_prof_overwrite=True,
)
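
The profiler object is then handed to the trainer. A minimal sketch of that wiring (placeholder model/dataloader names, not from the original script, assuming the profiler is passed via the Trainer's profiler argument):

from composer import Trainer

# Sketch only: `model` and `train_dataloader` stand in for the actual ResNet-50
# ComposerModel and dataloader used in the report.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',  # placeholder duration
    profiler=composer_profiler,
)
trainer.fit()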

I am able to get a merged trace file from processes 0-3 on the first node, saved as merged_trace.json, but process 4 on the second node gives the following error:

Traceback (most recent call last):
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/mnt/ceph/users/ykuang/MCMC/scripts/train_model.py", line 32, in __call__
    train(None, self.args)
  File "/mnt/ceph/users/ykuang/MCMC/src/train.py", line 387, in train
    trainer.fit()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1348, in fit
    self._train_loop()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1564, in _train_loop
    self.engine.run_event(Event.BATCH_END)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/engine.py", line 249, in run_event
    self._run_callbacks(event)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/engine.py", line 374, in _run_callbacks
    cb.run_event(event, self.state, self.logger)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/callback.py", line 96, in run_event
    return event_cb(state, logger)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/profiler/json_trace_handler.py", line 332, in batch_end
    merge_traces(merged_trace_filename, *trace_files_to_merge)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/profiler/json_trace_merger.py", line 84, in merge_traces
    rank_zero_clock_sync = ranks_to_clock_sync[0]
KeyError: 0

The value of ranks_to_clock_sync on process 4 (local_rank = 0 on the second node) is
ranks_to_clock_sync = {4: 1657466255542351, 5: 1657466255636291, 6: 1657466255665198, 7: 1657466255576674}
and on process 0 it is
ranks_to_clock_sync = {0: 1657466255521662, 1: 1657466255676957, 2: 1657466255537099, 3: 1657466255521094}.

It looks like the ranks_to_clock_sync dictionary is generated by the _get_rank_to_clock_syncs function in json_trace_merger.py, and _get_rank_to_clock_syncs explicitly sets the keys of ranks_to_clock_sync to global ranks. Since global rank 0 never appears among the trace files merged on the second node, the ranks_to_clock_sync[0] lookup fails there. So it looks like the Composer profiler is not multi-node compatible right now?

def _get_rank_to_clock_syncs(trace_files: Tuple[Union[str, pathlib.Path], ...]) -> Dict[int, int]:
    rank_to_clock_sync: Dict[int, int] = {}
    for filename in trace_files:
        rank = _get_global_rank_from_file(filename)
        trace_json = _load_trace(filename)
        if isinstance(trace_json, list):
            for event in trace_json:
                if event['ph'] == 'M' and event['name'] == 'clock_sync_timestamp_us':
                    clock_sync = event['args']['value']
                    rank_to_clock_sync[rank] = clock_sync
                    break
        else:
            assert isinstance(trace_json, dict)
            if trace_json.get('clock_sync_timestamp_us') is not None:
                rank_to_clock_sync[rank] = trace_json['clock_sync_timestamp_us']

    return rank_to_clock_sync
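
To make the failure concrete (a small sketch built from the values reported above, not taken from the library code):

# On the second node, only global ranks 4-7 appear among the trace files being
# merged, so a lookup that assumes global rank 0 is present raises KeyError.
ranks_to_clock_sync = {4: 1657466255542351, 5: 1657466255636291,
                       6: 1657466255665198, 7: 1657466255576674}
rank_zero_clock_sync = ranks_to_clock_sync[0]  # KeyError: 0, matching the traceback above
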
@YilunKuang YilunKuang added the bug Something isn't working label Jul 10, 2022
@bandish-shah bandish-shah self-assigned this Jul 11, 2022
@bandish-shah (Contributor)

@YilunKuang thanks for reporting and digging into this issue. We'll get this fixed in the next Composer release. Are you running off of dev or from a specific Composer release?

@YilunKuang (Contributor, Author)

@bandish-shah I am using the 0.8.0 release. Thanks! Looking forward to the new changes.

ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this issue Aug 3, 2022
mosaicml#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps.

Did not add test cases as we do not support multi-node testing.

Closes mosaicml#1270
https://mosaicml.atlassian.net/browse/CO-674
ravi-mosaicml added a commit that referenced this issue Aug 4, 2022
#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps.

Did not add test cases as we do not support multi-node testing.

Closes #1270
Closes https://mosaicml.atlassian.net/browse/CO-674
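
The commit above synchronizes against local rank zero rather than global rank zero. One way to realize that when merging the trace files available on a single node is sketched below; this is illustrative only and may differ from the actual patch in #1358:

from typing import Dict

def _get_reference_clock_sync(rank_to_clock_sync: Dict[int, int]) -> int:
    # Anchor the merge on the lowest global rank present among the traces being
    # merged (i.e. this node's local rank zero) instead of assuming that global
    # rank 0 is always among them.
    if not rank_to_clock_sync:
        raise RuntimeError('No clock_sync_timestamp_us entries found in the trace files')
    reference_rank = min(rank_to_clock_sync)  # 0 on the first node, 4 on the second
    return rank_to_clock_sync[reference_rank]
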
@hanlint (Contributor)

hanlint commented Aug 4, 2022

Note @YilunKuang we have fixed this and it will go out in the next release!

@YilunKuang (Contributor, Author)

Thank you @hanlint! Looking forward to the new release.
