
Composer profiler not supporting multi-node settings #1270

Closed
YilunKuang opened this issue Jul 10, 2022 · 4 comments · Fixed by #1358
Labels
bug Something isn't working

Comments

@YilunKuang (Contributor)

Environment

  • OS: CentOS Linux 7 (Core)
  • Hardware: 8× NVIDIA A100-SXM4-40GB (two nodes with four GPUs each)

Setting
I was running the Composer trainer for ResNet-50 training with the profiler enabled in a multi-node environment (two nodes, each with 4 GPUs) using PyTorch with SLURM. All of the environment variables (RANK, WORLD_SIZE, LOCAL_RANK, LOCAL_WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT) are properly set, and the code runs smoothly without the profiler.
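
For context, the distributed environment variables were derived from SLURM along these lines (a minimal sketch; the actual launcher script is not shown, and the exact mapping below is illustrative):

import os

# Illustrative mapping from SLURM job variables to the torch/Composer
# distributed variables listed above; the real job script may differ.
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']                 # 8 (2 nodes x 4 GPUs)
os.environ['RANK'] = os.environ['SLURM_PROCID']                       # 0-7, global rank
os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']                # 0-3 on each node
os.environ['LOCAL_WORLD_SIZE'] = os.environ['SLURM_NTASKS_PER_NODE']  # 4
os.environ['NODE_RANK'] = os.environ['SLURM_NODEID']                  # 0 or 1
os.environ['MASTER_ADDR'] = 'node-0.example.com'                      # hostname of the first node (placeholder)
os.environ['MASTER_PORT'] = '29500'                                   # any free port (placeholder)

The profiler was configured as: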

from composer.profiler import JSONTraceHandler, Profiler, cyclic_schedule

# comprof_folder and torch_prof_folder are output directories defined earlier in the script.
composer_profiler = Profiler(
    trace_handlers=JSONTraceHandler(folder=comprof_folder, overwrite=True),
    schedule=cyclic_schedule(
        wait=0,
        warmup=1,
        active=4,
        repeat=1,
    ),
    torch_prof_folder=torch_prof_folder,
    torch_prof_overwrite=True,
)
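
The profiler object is then handed to the trainer. A minimal sketch of that wiring (placeholder model/dataloader names, not from the original script, assuming the profiler is passed via the Trainer's profiler argument):

from composer import Trainer

# Sketch only: `model` and `train_dataloader` stand in for the actual ResNet-50
# ComposerModel and dataloader used in the report.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',  # placeholder duration
    profiler=composer_profiler,
)
trainer.fit()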

I am able to get a merged trace file from processes 0-3 on the first node, saved as merged_trace.json, but process 4 on the second node gives the following error:

Traceback (most recent call last):
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/mnt/ceph/users/ykuang/MCMC/scripts/train_model.py", line 32, in __call__
    train(None, self.args)
  File "/mnt/ceph/users/ykuang/MCMC/src/train.py", line 387, in train
    trainer.fit()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1348, in fit
    self._train_loop()
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1564, in _train_loop
    self.engine.run_event(Event.BATCH_END)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/engine.py", line 249, in run_event
    self._run_callbacks(event)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/engine.py", line 374, in _run_callbacks
    cb.run_event(event, self.state, self.logger)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/core/callback.py", line 96, in run_event
    return event_cb(state, logger)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/profiler/json_trace_handler.py", line 332, in batch_end
    merge_traces(merged_trace_filename, *trace_files_to_merge)
  File "/mnt/home/ykuang/miniconda3/envs/mcmc2/lib/python3.10/site-packages/composer/profiler/json_trace_merger.py", line 84, in merge_traces
    rank_zero_clock_sync = ranks_to_clock_sync[0]
KeyError: 0

The value of ranks_to_clock_sync on process 4 (local_rank = 0 on the second node) is
ranks_to_clock_sync = {4: 1657466255542351, 5: 1657466255636291, 6: 1657466255665198, 7: 1657466255576674}
and on process 0 it is
ranks_to_clock_sync = {0: 1657466255521662, 1: 1657466255676957, 2: 1657466255537099, 3: 1657466255521094}.

It looks like the ranks_to_clock_sync dictionary is generated by the _get_rank_to_clock_syncs function in json_trace_merger.py, and _get_rank_to_clock_syncs explicitly sets the keys of ranks_to_clock_sync to global ranks. Since global rank 0 never appears among the trace files merged on the second node, the ranks_to_clock_sync[0] lookup fails there. So it looks like the Composer profiler is not multi-node compatible right now?

def _get_rank_to_clock_syncs(trace_files: Tuple[Union[str, pathlib.Path], ...]) -> Dict[int, int]:
    rank_to_clock_sync: Dict[int, int] = {}
    for filename in trace_files:
        rank = _get_global_rank_from_file(filename)
        trace_json = _load_trace(filename)
        if isinstance(trace_json, list):
            for event in trace_json:
                if event['ph'] == 'M' and event['name'] == 'clock_sync_timestamp_us':
                    clock_sync = event['args']['value']
                    rank_to_clock_sync[rank] = clock_sync
                    break
        else:
            assert isinstance(trace_json, dict)
            if trace_json.get('clock_sync_timestamp_us') is not None:
                rank_to_clock_sync[rank] = trace_json['clock_sync_timestamp_us']

    return rank_to_clock_sync
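
To make the failure concrete (a small sketch built from the values reported above, not taken from the library code):

# On the second node, only global ranks 4-7 appear among the trace files being
# merged, so a lookup that assumes global rank 0 is present raises KeyError.
ranks_to_clock_sync = {4: 1657466255542351, 5: 1657466255636291,
                       6: 1657466255665198, 7: 1657466255576674}
rank_zero_clock_sync = ranks_to_clock_sync[0]  # KeyError: 0, matching the traceback above
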
@YilunKuang YilunKuang added the bug Something isn't working label Jul 10, 2022
@bandish-shah bandish-shah self-assigned this Jul 11, 2022
@bandish-shah (Contributor)

@YilunKuang thanks for reporting and digging into this issue. We'll get this fixed in the next Composer release. Are you running off of dev or from a specific Composer release?

@YilunKuang (Contributor, Author)

@bandish-shah I am using the 0.8.0 release. Thanks! Looking forward to the new changes.

ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this issue Aug 3, 2022
mosaicml#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps.

Did not add test cases as we do not support multi-node testing.

Closes mosaicml#1270
https://mosaicml.atlassian.net/browse/CO-674
ravi-mosaicml added a commit that referenced this issue Aug 4, 2022
#1270 identified how the trace merger did not work on multi-node training. This PR fixes that by using the local rank zero, rather than the global rank zero, to synchronize timestamps.

Did not add test cases as we do not support multi-node testing.

Closes #1270
Closes https://mosaicml.atlassian.net/browse/CO-674
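
The commit above synchronizes against local rank zero rather than global rank zero. One way to realize that when merging the trace files available on a single node is sketched below; this is illustrative only and may differ from the actual patch in #1358:

from typing import Dict

def _get_reference_clock_sync(rank_to_clock_sync: Dict[int, int]) -> int:
    # Anchor the merge on the lowest global rank present among the traces being
    # merged (i.e. this node's local rank zero) instead of assuming that global
    # rank 0 is always among them.
    if not rank_to_clock_sync:
        raise RuntimeError('No clock_sync_timestamp_us entries found in the trace files')
    reference_rank = min(rank_to_clock_sync)  # 0 on the first node, 4 on the second
    return rank_to_clock_sync[reference_rank]
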
@hanlint (Contributor)

hanlint commented Aug 4, 2022

Note @YilunKuang we have fixed this and it will go out in the next release!

@YilunKuang (Contributor, Author)

Thank you @hanlint! Looking forward to the new release.
