Tyler/protected #2666
Thanks @siriuslee for this PR!
Can you add some tests here as well: https://github.com/mosaicml/composer/blob/dev/tests/loggers/test_mosaicml_logger.py#L143
Also, not sure if I'm missing something, but does the current logic append the same future, f = mcli.update_run_metadata(self.run_name, self.buffered_metadata, future=True, protect=True), multiple times?
try:
    mcli.update_run_metadata(self.run_name, self.buffered_metadata)
try:
    f = mcli.update_run_metadata(self.run_name, self.buffered_metadata, future=True, protect=True)
qq: what does protect=True do?
protect=True will attempt to protect the call, including its retries, from being interrupted by a SIGTERM. This allows metadata updates to go through when a run is terminating via a SIGTERM.
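To make the idea concrete, here is a minimal sketch of one generic way to hold back a SIGTERM while a short critical call finishes. It is an illustration only and not how mcli implements protect=True; the defer_sigterm helper is a hypothetical name, and the approach assumes the code runs on the main thread, since Python only allows signal handlers to be installed there.

```python
import signal
from contextlib import contextmanager


@contextmanager
def defer_sigterm():
    """Hypothetical helper: hold any SIGTERM received inside the block and
    re-raise it once the block has finished."""
    received = []
    # Temporarily swap in a handler that only records the signal.
    previous = signal.signal(
        signal.SIGTERM, lambda signum, frame: received.append(signum)
    )
    try:
        yield received
    finally:
        # Restore the original handler, then replay the deferred signal so
        # that the process still terminates as requested.
        signal.signal(signal.SIGTERM, previous)
        if received:
            signal.raise_signal(signal.SIGTERM)
```

A metadata update wrapped in `with defer_sigterm():` would run to completion before the restored handler sees the signal. A SIGKILL cannot be deferred this way.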
self.allowed_fails_left += 1
self.time_failed_count_adjusted = time.time()
self._futures.append(f)
done, incomplete = wait(self._futures, timeout=0.01)
Is this timeout too small? I could see instances with large metadata timing out.
The goal here is to catch any futures that are already done while staying nearly non-blocking. Timing out is totally fine: the incomplete futures should be picked up on the next go-around.
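The polling pattern described in this reply can be sketched with the standard library alone. This is a simplified stand-in, not the logger's actual code: FutureTracker, submit, and poll are hypothetical names, and plain concurrent.futures.Future objects stand in for whatever mcli.update_run_metadata(..., future=True) returns. Each future is appended once when it is created, and every poll keeps only the incomplete ones, so nothing is waited on or reported twice.

```python
from concurrent.futures import Future, wait


class FutureTracker:
    """Hypothetical helper that reaps finished futures without blocking for long."""

    def __init__(self):
        self._futures = []

    def submit(self, future):
        # Append each future exactly once, at the point where it is created.
        self._futures.append(future)

    def poll(self, timeout=0.01):
        """Return exceptions from finished futures; keep the rest for next time."""
        if not self._futures:
            return []
        # A short timeout means this returns almost immediately. Futures that
        # are still running land in `incomplete` and are checked on the next poll.
        done, incomplete = wait(self._futures, timeout=timeout)
        # Preserve submission order for the futures that are still pending.
        self._futures = [f for f in self._futures if f in incomplete]
        errors = []
        for f in done:
            if f.cancelled():
                continue
            exc = f.exception()
            if exc is not None:
                errors.append(exc)
        return errors


if __name__ == "__main__":
    tracker = FutureTracker()
    pending, failed = Future(), Future()
    tracker.submit(pending)
    tracker.submit(failed)
    failed.set_exception(RuntimeError("update failed"))
    print(tracker.poll())          # the failed future is reported once
    print(len(tracker._futures))   # the pending future is kept for the next poll
```

With this shape, a large metadata update that outlasts the 0.01 s window is not lost; it simply stays in the list until a later poll finds it done.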
What does this PR do?
What issue(s) does this change relate to?
Before submitting
pre-commit on your change? (see the pre-commit section of prerequisites)