busy wait for the rank 0 download (#2071)
dakinggg authored Mar 15, 2023
1 parent 0ce9a13 commit a7f33cb
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions composer/utils/checkpoint.py
@@ -323,10 +323,22 @@ def download_checkpoint(

    finally:
        # Wait for all checkpoints on the node to finish downloading
        # First we wait for the local rank 0 to finish its download. This prevents timeouts
        # in cases where the local rank 0 is downloading a monolithic checkpoint, and so takes
        # much longer than the other ranks, which have nothing to download
        # Putting the barrier in a finally so the rank will always block on the barrier,
        # even if it has an exception.
        # Any exception will be re-raised after the barrier passes. The launcher script
        # will detect the process crash and terminate the other ranks

        signal_file_path = os.path.join(node_checkpoint_folder, '.local_rank0_completed')
        if dist.get_local_rank() == 0:
            with open(signal_file_path, 'wb') as f:
                f.write(b'local_rank0_completed')
        dist.local_rank_zero_download_and_wait(signal_file_path)
        if dist.get_local_rank() == 0:
            os.remove(signal_file_path)

        dist.barrier()

    return composer_states_filepath, extracted_checkpoint_folder, extracted_rank_n
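
The signal-file pattern in this hunk can be illustrated outside of a distributed job. The sketch below is not composer's implementation: it simulates the local ranks of one node with threads, and `wait_for_signal_file` and `simulate_node` are hypothetical helpers written for this example. It assumes the waiting ranks simply poll the filesystem until the file written by local rank 0 appears, which may differ from what `dist.local_rank_zero_download_and_wait` actually does internally.

```python
import os
import tempfile
import threading
import time


def wait_for_signal_file(path, poll_interval=0.05, timeout=5.0):
    """Poll until `path` exists; raise TimeoutError after `timeout` seconds.

    Hypothetical helper for illustration only, not a composer API.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f'signal file {path} did not appear within {timeout}s')
        time.sleep(poll_interval)


def simulate_node(num_local_ranks=4, download_seconds=0.2):
    """Simulate one node: local rank 0 "downloads" slowly, the others poll."""
    finished = []
    lock = threading.Lock()

    with tempfile.TemporaryDirectory() as node_checkpoint_folder:
        signal_file_path = os.path.join(node_checkpoint_folder, '.local_rank0_completed')

        def run_rank(local_rank):
            if local_rank == 0:
                # Stand-in for a long monolithic checkpoint download.
                time.sleep(download_seconds)
                with open(signal_file_path, 'wb') as f:
                    f.write(b'local_rank0_completed')
            # Every rank (including rank 0) passes this point only once the file exists.
            wait_for_signal_file(signal_file_path)
            with lock:
                finished.append(local_rank)

        threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(num_local_ranks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # The signal file is cleaned up with the temporary directory, after
        # all simulated ranks have observed it.

    return finished


if __name__ == '__main__':
    print(sorted(simulate_node()))
```

Polling a file avoids holding the other ranks inside a collective operation while local rank 0 is still downloading, which is the timeout scenario the commit's comments describe. In the sketch the file is removed only after every simulated rank has seen it, so no rank can miss the signal.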