Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Object Store Logger Race Condition + EMA Fix #1552

Merged

Conversation

mvpatel2000
Copy link
Contributor

@mvpatel2000 mvpatel2000 commented Sep 21, 2022

Fixes CO-1097. Object logger calls join in post_close. At this point, the interpreter has shutdown so new futures cannot be scheduled. So, if the worker for the last checkpoint save has not scheduled the future yet, the upload will error out. To avoid this, we need to wait for all workers to finish after each Trainer function call to ensure all futures are scheduled.

This is a partial fix because anything logged in close by another callback won't (potentially) be logged. I don't think this is possible to support though since once you're in the close phase, you can't create futures

Note that we can't really add a test for this since it's a race condition.

@mvpatel2000 mvpatel2000 requested a review from eracah as a code owner September 21, 2022 23:15
@mvpatel2000 mvpatel2000 requested a review from hanlint September 21, 2022 23:17
@mvpatel2000
Copy link
Contributor Author

mvpatel2000 commented Sep 23, 2022

Update: It looks like there were actually two (!!) bugs going on here. The previous implementation of checkpoint_save_interval was not idempotent, so when EMA was modified last week to move the EMA weights into the default weight slot on a checkpoint event, it broke checkpointing for end of training. Fixed checkpoint_save_interval to be idempotent.

@mvpatel2000 mvpatel2000 changed the title Object Store Logger Race Condition Fix Object Store Logger Race Condition + EMA Fix Sep 23, 2022
Copy link
Contributor

@hanlint hanlint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for tracing down this nasty race condition!

composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
@mvpatel2000 mvpatel2000 merged commit 4e68bdf into mosaicml:dev Sep 23, 2022
@mvpatel2000
Copy link
Contributor Author

Red button merging because Evan is on vacation.

@mvpatel2000 mvpatel2000 deleted the mvpatel2000/object-store-logger-race branch September 23, 2022 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants