busy wait for the rank 0 download
dakinggg committed Mar 15, 2023
1 parent f674dcc commit 448d8b7
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions composer/utils/checkpoint.py
@@ -323,10 +323,22 @@ def download_checkpoint(

    finally:
        # Wait for all checkpoints on the node to finish downloading
        # First we wait for the local rank 0 to finish its download. This prevents timeouts
        # in cases where the local rank 0 is downloading a monolithic checkpoint, and so takes
        # much longer than the other ranks, which have nothing to download
        # Putting the barrier in a finally so the rank will always block on the barrier,
        # even if it has an exception.
        # Any exception will be re-raised after the barrier passes. The launcher script
        # will detect the process crash and terminate the other ranks

        signal_file_path = os.path.join(node_checkpoint_folder, '.local_rank0_completed')
        if dist.get_local_rank() == 0:
            with open(signal_file_path, 'wb') as f:
                f.write(b'local_rank0_completed')
        dist.local_rank_zero_download_and_wait(signal_file_path)
        if dist.get_local_rank() == 0:
            os.remove(signal_file_path)

        dist.barrier()

    return composer_states_filepath, extracted_checkpoint_folder, extracted_rank_n
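
The signal-file pattern in this hunk can be illustrated with a small standalone sketch. It is not Composer's implementation of `dist.local_rank_zero_download_and_wait`. The helper names `wait_for_signal_file` and `simulate_node` are hypothetical, threads stand in for the local ranks, and a `time.sleep` stands in for the slow monolithic download. The sketch also assumes a barrier before the file is removed, so that every waiter has observed the file first.

```python
import os
import tempfile
import threading
import time


def wait_for_signal_file(path: str, poll_interval: float = 0.05, timeout: float = 10.0) -> None:
    """Hypothetical helper: poll until `path` exists, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f'signal file {path} did not appear within {timeout}s')
        time.sleep(poll_interval)


def simulate_node(num_local_ranks: int = 4, download_seconds: float = 0.2) -> list:
    """Run one thread per simulated local rank; return the ranks that got past the wait."""
    passed = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_local_ranks)

    with tempfile.TemporaryDirectory() as node_checkpoint_folder:
        signal_file_path = os.path.join(node_checkpoint_folder, '.local_rank0_completed')

        def run(local_rank: int) -> None:
            if local_rank == 0:
                # Stand-in for the long download done only by local rank 0.
                time.sleep(download_seconds)
                with open(signal_file_path, 'wb') as f:
                    f.write(b'local_rank0_completed')
            else:
                # Other ranks busy wait on the file instead of a collective
                # that could time out while rank 0 is still downloading.
                wait_for_signal_file(signal_file_path)
            with lock:
                passed.append(local_rank)
            # Assumption of this sketch: synchronize before cleanup so the
            # file is not removed while another rank is still polling for it.
            barrier.wait(timeout=30)
            if local_rank == 0:
                os.remove(signal_file_path)

        threads = [threading.Thread(target=run, args=(rank,)) for rank in range(num_local_ranks)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    return sorted(passed)


if __name__ == '__main__':
    print(simulate_node())
```

Polling a file rather than entering a collective right away means the waiting ranks are bounded only by the polling timeout chosen here, not by the communication backend's own timeout, which appears to be the motivation described in the commit's comments.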