Object Store Logger Race Condition + EMA Fix #1552
Update: It looks like there were actually two (!!) bugs going on here. The previous implementation of
LGTM, thanks for tracing down this nasty race condition!
Red button merging because Evan is on vacation.
Fixes CO-1097. The object logger calls join in `post_close`. At that point the interpreter has shut down, so new futures cannot be scheduled. If the worker for the last checkpoint save has not scheduled its future yet, the upload will error out. To avoid this, we need to wait for all workers to finish after each Trainer function call, which ensures all futures are scheduled.

This is a partial fix: anything logged during close by another callback may still not be logged. I don't think that case can be supported, though, since once you're in the close phase you can't create futures.
Note that we can't really add a test for this since it's a race condition.
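For readers following along, here is a minimal sketch of the "wait until everything is scheduled" idea, assuming a worker thread that pulls names off a queue and submits uploads to a `ThreadPoolExecutor`. It is not the code in this PR: `SketchUploader`, `enqueue`, and `wait_until_scheduled` are hypothetical names used only for illustration.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class SketchUploader:
    """Hypothetical stand-in for an object store logger (not the real class)."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._futures = []
        self._lock = threading.Lock()
        self.uploaded = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Worker loop: turn each queued name into a scheduled future.
        while True:
            name = self._tasks.get()
            try:
                if name is None:  # sentinel: stop the worker
                    return
                future = self._executor.submit(self._upload, name)
                with self._lock:
                    self._futures.append(future)
            finally:
                # Mark the item done only after the future has been scheduled.
                self._tasks.task_done()

    def _upload(self, name):
        # Placeholder for the real upload.
        self.uploaded.append(name)
        return name

    def enqueue(self, name):
        self._tasks.put(name)

    def wait_until_scheduled(self):
        # Blocks until the worker has submitted a future for every queued item.
        # Calling this after each trainer call means nothing is left to be
        # scheduled once shutdown begins.
        self._tasks.join()

    def close(self):
        self.wait_until_scheduled()
        self._tasks.put(None)
        self._worker.join()
        with self._lock:
            pending = list(self._futures)
        wait(pending)
        self._executor.shutdown(wait=True)


uploader = SketchUploader()
uploader.enqueue("checkpoint-final")
uploader.wait_until_scheduled()  # the future now exists before any teardown
uploader.close()
```

The point of the sketch is the ordering: `task_done()` runs only after `submit()`, so `wait_until_scheduled()` returning guarantees the future was created while scheduling was still allowed.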