
Bug/4319 ddp checkpoint #4323

Merged
10 commits merged into master from bug/4319-ddp-checkpoint on Oct 24, 2020

Conversation

Contributor

@SeanNaren SeanNaren commented Oct 23, 2020

What does this PR do?

Fixes #4319 and is related to #4223. We now broadcast the rank 0 checkpoint path to all ranks, and wait for saving to finish before ending DDP training.
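
For readers skimming the thread, here is a condensed sketch of the two pieces, stitched together and slightly adapted from the diffs quoted later in the conversation; it is not the exact merged code.

def sync_processes_best_model_path(self):
    # Rank 0 owns checkpoint saving, so every rank adopts rank 0's view of the path.
    best_model_path = self.broadcast(self.trainer.checkpoint_callback.best_model_path)
    self.trainer.checkpoint_callback.best_model_path = best_model_path

def train(self):
    results = self.ddp_train(process_idx=self.task_idx, model=self.trainer.model)
    # No rank leaves DDP training until every process (in particular rank 0,
    # which writes the checkpoint files) has reached this point.
    self.barrier('ddp_end_train')
    return results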

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team October 23, 2020 15:47
@SeanNaren SeanNaren added the bug Something isn't working label Oct 23, 2020
@@ -188,6 +189,13 @@ def get_device_ids(self):
        device_ids = [self.trainer.root_gpu]
        return device_ids

    def sync_processes_best_model_path(self):
        best_model_path = self.broadcast(self.trainer.checkpoint_callback.best_model_path)
Contributor

I'm curious why the broadcast is needed. The issue in #4323 seems to be that metrics aren't synced globally, leading to inconsistent paths across ranks.

If we want to cover this case, do we also need to broadcast other properties like best_model_score from rank 0?

Contributor Author

I don't think #4323 is really covered well in this PR, and I was questioning whether to include the broadcast at all... the primary thing we have to do for #4319 is wait for all processes to finish saving. Forcing the same best model path onto all ranks lets them correctly pick the best model, since rank 0 is the only rank that controls checkpoint saving.

If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the Metrics API, which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html

I'm still curious whether there's a better way to natively handle the differing val scores, but I'm unsure if it's required for this PR. Maybe omitting the broadcast and just doing a barrier would suffice; thoughts @ananthsub?
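
For context, a hedged sketch of what syncing the monitored value through the logging/metrics machinery can look like, assuming a Lightning version where self.log accepts sync_dist (argument names may differ across releases); the loss helper below is hypothetical.

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # hypothetical helper standing in for real model code
        # Reduce val_loss across ranks so every process monitors the same value and the
        # checkpoint callback agrees on the same "best" path everywhere.
        self.log("val_loss", loss, sync_dist=True)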

Contributor

@ananthsub ananthsub Oct 23, 2020

> If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the Metrics API, which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html

I totally agree

If we really need the broadcasts, can they move to the model checkpoint callback? The callback could implement the on_fit_end hook and do the broadcasts inside it, so we keep these properties in sync in a single location (a sketch follows the notes below).

Other notes:

  • Do we need these on other accelerators? It looks like the same bug could affect the DDP torchelastic accelerator.
  • We can skip this barrier if the checkpoint callback doesn't exist.
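
A rough sketch of what that could look like. The on_fit_end hook exists on callbacks; reaching the accelerator through trainer.accelerator_backend and reusing its broadcast helper are assumptions here, and this is not the code that ended up being merged.

from pytorch_lightning.callbacks import ModelCheckpoint

class SyncedModelCheckpoint(ModelCheckpoint):
    def on_fit_end(self, trainer, pl_module):
        accelerator = trainer.accelerator_backend  # assumed attribute name
        if accelerator is None:
            return
        # Keep rank 0's view of the checkpoint state authoritative on all ranks.
        self.best_model_path = accelerator.broadcast(self.best_model_path)
        self.best_model_score = accelerator.broadcast(self.best_model_score)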

Contributor Author

Yeah, good shout, will do!

Contributor Author

@SeanNaren SeanNaren Oct 23, 2020

@ananthsub let me know if you agree with the changes I just made :)

EDIT: a bit early, it seems I still need to fix the integration...

EDIT 2: works fine; my test was deleting the temp dir a bit too early, because it runs across procs :)

Contributor

Looks good! I'm debugging what I think is the same race condition in DDP torchelastic. I think we need the same barrier across the DDP accelerators

Contributor Author

@SeanNaren SeanNaren Oct 23, 2020

Want me to add them across the accelerators in the same place? In particular the DDP torchelastic accelerator; I'm unsure what else.

Contributor

@williamFalcon is this needed across all DDP accelerators? ddp/ddp_cpu/ddp2 + slurm and torchelastic variants?

Contributor

Yeah. I think any dist-type call needs to be implemented in each accelerator, however that accelerator does it, and then called through the accelerator.

Contributor

@williamFalcon williamFalcon Oct 23, 2020

i.e. change

if tpu:
    x()

if gpu:
    y()

...

to

TPU:

def shared_fx():
    x()

GPU:

def shared_fx():
    y()

then you use
accelerator.shared_fx()

@SeanNaren
Contributor Author

SeanNaren commented Oct 23, 2020

Windows tests are failing because torch distributed isn't available here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bug/4319-ddp-checkpoint/pytorch_lightning/callbacks/model_checkpoint.py#L242

Any ideas on how to guard this check, @ananthsub? This is starting to bleed into what may be the accelerator's role... I could gate the check on 'ddp', but that seems a bit hacky.
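
One hedged option is to make the check degrade gracefully when torch.distributed is unavailable or uninitialised, as on the Windows workers; the helper name below is made up for illustration and may not match what was merged.

import torch.distributed as torch_distrib

def _is_global_rank_zero(trainer) -> bool:
    # torch.distributed may be missing entirely (e.g. some Windows builds) or simply
    # not initialised; in both cases fall back to the trainer's notion of global rank.
    if torch_distrib.is_available() and torch_distrib.is_initialized():
        return torch_distrib.get_rank() == 0
    return trainer.global_rank == 0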

@@ -145,6 +145,7 @@ def train(self):
        results = self.ddp_train(process_idx=self.task_idx, model=model)
        if 'WORLD_SIZE' in os.environ:
            del os.environ['WORLD_SIZE']
        self.barrier('ddp_end_train')
Contributor

@ananthsub ananthsub Oct 24, 2020

IMO we should land the barriers across DDP accelerators first as that'll solve the primary race condition. We're now getting some crashes from this bug so I want to mitigate as quickly as possible. cc @williamFalcon

We can address the possible rank inconsistency separately and whether/how to broadcast checkpoint state across ranks. This scenario seems inevitable right now:

  • the checkpoint callback running on rank 0 thinks the best model path is A due to the metrics it sees
  • the checkpoint callback running on rank 1 thinks the best model path is B
  • the callback is configured with save_top_k > 1 so not all checkpoints are saved
  • The top-K checkpoints fall out of sync across ranks due to differences in local metrics
  • We delete checkpoints outside the top-K from rank 0
  • Later, rank 1 tries to load the checkpoint (based on its local state), but it turns out it's been deleted

I'm conflicted: broadcasting from rank 0 would make it deterministic, but it might hide more issues in the user's code. I think we'd need to document it clearly and have a giant warning that metrics need to be synced across ranks in distributed training when checkpoints are saved based on that monitored value. The same could apply to early stopping too.

Contributor

Should the barrier be done per accelerator, or should it be added to the trainer itself, right after this spot? https://github.com/PyTorchLightning/pytorch-lightning/blob/f6efb712edccfa83d630db3e97493cd36ab71497/pytorch_lightning/trainer/trainer.py#L439-L440

Contributor Author

@SeanNaren SeanNaren Oct 24, 2020

Yeah, I agree; currently the logic is severely broken because there is no barrier. From your report it seems this issue is larger than just DDP (which was my worry), so I think enforcing a barrier in the trainer is important.

EDIT:

The alternative I see is incorporating it into the teardown of each accelerator that overrides teardown, but that makes its importance less obvious. However, the test made me realise the logic within the TPU accelerator doesn't parallelise teardown or any further communication. I'll make changes to the branch; let me know what you think @ananthsub.
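
For reference, a rough sketch of what a trainer-level barrier after training could look like; the hook point, attribute name and teardown call are assumptions, not the merged code.

def _finish_training(trainer):
    # Every process waits here until rank 0 has finished writing checkpoints;
    # only then does teardown begin on any rank. A no-op barrier on
    # single-process accelerators keeps this safe for CPU runs too.
    accelerator = trainer.accelerator_backend  # assumed attribute name
    if accelerator is not None:
        accelerator.barrier('end_of_fit')
        accelerator.teardown()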

@SeanNaren SeanNaren force-pushed the bug/4319-ddp-checkpoint branch 2 times, most recently from 7f1a01a to 473a3c5 Compare October 24, 2020 10:20
@Lightning-AI Lightning-AI deleted a comment from pep8speaks Oct 24, 2020
@codecov

codecov bot commented Oct 24, 2020

Codecov Report

Merging #4323 into master will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4323   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         111     111           
  Lines        8024    8021    -3     
======================================
- Hits         7462    7459    -3     
  Misses        562     562           

Contributor

@ananthsub ananthsub left a comment

much simpler!

@awaelchli awaelchli added checkpointing Related to checkpointing distributed Generic distributed-related topic labels Oct 24, 2020
Contributor

@awaelchli awaelchli left a comment

Nice!
I can confirm the issue for DDP on master, and this PR fixes it.

@mergify mergify bot requested a review from a team October 24, 2020 17:03
@SeanNaren SeanNaren added this to the 1.0.x milestone Oct 24, 2020
@williamFalcon
Contributor

Wow, yes, very elegant fix.

I missed the question about whether we need to do a barrier on each accelerator. The answer to that is yes. Anytime we need a barrier we should call accelerator.barrier(). In accelerators without a barrier (like CPU), that's a no-op anyhow.
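
A minimal sketch of that pattern; the real Accelerator classes carry far more state, so treat the class names here as illustrative only.

import torch.distributed as torch_distrib

class Accelerator:
    def barrier(self, name=None):
        # Single-process accelerators (e.g. CPU) have nothing to wait for.
        pass

class DDPAccelerator(Accelerator):
    def barrier(self, name=None):
        if torch_distrib.is_available() and torch_distrib.is_initialized():
            torch_distrib.barrier()

# call sites stay accelerator-agnostic, e.g. accelerator.barrier('ddp_end_train')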

@williamFalcon williamFalcon merged commit 5641b26 into master Oct 24, 2020
@williamFalcon
Contributor

Thanks for the quick fix @SeanNaren and the great review @ananthsub !

@SeanNaren SeanNaren deleted the bug/4319-ddp-checkpoint branch October 24, 2020 21:02
@edenlightning edenlightning modified the milestones: 1.0.x, 1.1 Oct 24, 2020
@ananthsub
Contributor

This was a fun one :) It's pretty amazing how much restructuring there's been as a result of #3545, especially for DDP. Awesome fix @SeanNaren!

Labels
bug (Something isn't working), checkpointing (Related to checkpointing), distributed (Generic distributed-related topic)
Development

Successfully merging this pull request may close these issues.

After DDP train processes have different best val paths
6 participants