Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update file exists checkpointing error messages to be more helpful #2668

Merged
merged 9 commits into from
Nov 16, 2023

Conversation

irenedea
Copy link
Contributor

@irenedea irenedea commented Oct 24, 2023

What does this PR do?

The parameter in Trainer is called save_overwrite, so I'm assuming these error messages should be updated to reflect that. There are no other usages of allow_overwrite as parameters, variables, or error messages.

I ran into this issue because I launched a run via foundry/mcli yaml, received one of these error messages, set add allow_overwrite: true, and my run still failed with a warning that allow_overwrite was unused 😄

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg
Copy link
Contributor

The parameter name to the RemoteUploaderDownloader.upload_file is actually just overwrite. So maybe the error message there should say "overwrite" but maybe add an extra note that the top level Trainer arg is called save_overwrite?

@irenedea
Copy link
Contributor Author

@dakinggg Added a better error message at checkpoint uploading and renamed the checkpoint saver param to match the top level trainer param.

@mvpatel2000
Copy link
Contributor

@irenedea is this PR still supposed to be worked on or will it be closed

composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/trainer/trainer.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
@irenedea
Copy link
Contributor Author

@mvpatel2000 making some updates right now.

@irenedea irenedea changed the title Update error messages to use "save_overwrite" instead of "allow_overwrite" Update overwrite error messages to be more helpful Nov 14, 2023
@irenedea irenedea changed the title Update overwrite error messages to be more helpful Update file exists checkpointing error messages to be more helpful Nov 14, 2023
@irenedea irenedea requested a review from dakinggg November 15, 2023 22:23
@irenedea irenedea enabled auto-merge (squash) November 16, 2023 21:58
@irenedea irenedea merged commit 15d4c34 into mosaicml:dev Nov 16, 2023
16 checks passed
Comment on lines +387 to +390
except FileExistsError as e:
raise FileExistsError(
f'Uploading checkpoint failed with error: {e}. overwrite was set to {self.overwrite}. To overwrite checkpoints with Trainer, set save_overwrite to True.'
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@irenedea I think you are missing a raise ... from e here. As is, this will discard the stack trace. Can you open a new PR adding in the trace?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants