Tyler/protected #2666

Closed
wants to merge 6 commits into from

Conversation

siriuslee
Contributor

What does this PR do?

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@j316chuck self-requested a review November 7, 2023 19:21
Contributor

@j316chuck left a comment

Thanks @siriuslee for this PR!

Can you add some tests here as well: https://github.com/mosaicml/composer/blob/dev/tests/loggers/test_mosaicml_logger.py#L143

Also, I may be missing something, but does the current logic append the same future, f = mcli.update_run_metadata(self.run_name, self.buffered_metadata, future=True, protect=True), multiple times?

-    try:
-        mcli.update_run_metadata(self.run_name, self.buffered_metadata)
+    try:
+        f = mcli.update_run_metadata(self.run_name, self.buffered_metadata, future=True, protect=True)
Contributor

qq: what does protect=True do?

Contributor Author

protect=True attempts to shield the call, including its retries, from being interrupted by a SIGTERM. This lets metadata updates go through while a run is terminating via SIGTERM.
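
For readers unfamiliar with the idea, here is a minimal sketch of one way a call could be shielded from SIGTERM. It is not how mcli implements protect=True, which this thread does not show. The helper names defer_sigterm and simulate_protected_update are hypothetical, and the sketch assumes it runs on the main thread, since Python only allows signal handlers to be installed there.

```python
import signal
from contextlib import contextmanager


@contextmanager
def defer_sigterm(redeliver=True):
    """Hold any SIGTERM that arrives inside the block until the block exits.

    Hypothetical helper for illustration only; must be used on the main thread.
    """
    received = []
    # Temporarily record SIGTERM instead of letting it terminate the process.
    previous = signal.signal(signal.SIGTERM, lambda signum, frame: received.append(signum))
    try:
        yield received
    finally:
        # Restore the earlier handler (None means it was not set from Python).
        signal.signal(signal.SIGTERM, previous if previous is not None else signal.SIG_DFL)
        if received and redeliver:
            # Pass the deferred signal on now that the protected work is finished.
            signal.raise_signal(signal.SIGTERM)


def simulate_protected_update():
    """Simulate a SIGTERM arriving in the middle of a metadata update."""
    update_sent = False
    # redeliver=False keeps this demonstration from terminating the interpreter.
    with defer_sigterm(redeliver=False) as received:
        signal.raise_signal(signal.SIGTERM)  # termination request arrives mid-call
        update_sent = True  # stand-in for the real update call completing
    return update_sent, list(received)
```

With redeliver=True, the deferred signal is raised again after the block, so the process still terminates, only later than it otherwise would.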

self.allowed_fails_left += 1
self.time_failed_count_adjusted = time.time()
self._futures.append(f)
done, incomplete = wait(self._futures, timeout=0.01)
Contributor

Is this timeout too small? I could see updates with large metadata timing out.

Contributor Author

The goal here is to collect any futures that are already done while staying nearly non-blocking. Timing out is totally fine; the incomplete futures should be picked up on the next go-around.
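
The polling pattern described in this reply can be sketched as follows. This is a standalone illustration using only the standard library, not the logger's actual code. poll_futures and demo are hypothetical names, and a threading.Event stands in for a slow metadata update.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List, Tuple


def poll_futures(futures: List[Future], timeout: float = 0.01) -> Tuple[List[Future], List[Future]]:
    """Wait at most `timeout` seconds and return (done, still_pending).

    Hitting the timeout is not an error: pending futures are simply returned
    so the caller can check them again on the next pass.
    """
    done, incomplete = wait(futures, timeout=timeout)
    return list(done), list(incomplete)


def demo():
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast = pool.submit(lambda: "ok")
        slow = pool.submit(release.wait)  # blocks until the event is set
        fast.result()  # make sure the fast task has finished

        # First pass: the slow future outlives the short timeout and stays pending.
        done, pending = poll_futures([fast, slow])
        first = (len(done), len(pending))

        release.set()
        slow.result()

        # Next pass: only the carried-over future is polled, and it is now done.
        done, pending = poll_futures(pending)
        second = (len(done), len(pending))
    return first, second
```

Carrying forward only the pending list means each future is tracked once, so a finished future is not waited on again.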

3 participants