Can automatic save checkpoint when crashed or press Ctrl+C? #2461

Open
BoyuanJiang opened this issue Aug 22, 2023 · 5 comments
Labels
enhancement New (engineering) enhancements, such as features or API changes.

Comments

@BoyuanJiang

🚀 Feature Request

Is it possible to automatically save the latest checkpoint when training crashes or when I press Ctrl+C?


@BoyuanJiang BoyuanJiang added the enhancement New (engineering) enhancements, such as features or API changes. label Aug 22, 2023
@bcui19
Contributor

bcui19 commented Aug 23, 2023

@BoyuanJiang, hey, I think we do support this, but it's not very well documented. In particular, I think that if you press Ctrl+C once, the model should checkpoint. Let me know if that doesn't work.

@BoyuanJiang
Author

@bcui19 It seems that when I press Ctrl+C, this code is executed, and it just kills the program without saving the latest state.

@mvpatel2000
Contributor

@BoyuanJiang In that code snippet, the following lines should keep the processes alive for the length of the timeout: https://github.com/mosaicml/composer/blob/dev/composer/cli/launcher.py#L406-L416

Do you see all ranks die immediately if you hit Ctrl + C once?
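
For readers following along, here is a minimal sketch of the general "signal, wait for a timeout, then force-kill" pattern described above. It is not the Composer launcher code linked in this comment; the function name `terminate_with_grace` and its parameters are hypothetical, and the 30-second default only mirrors the duration mentioned in this thread.

```python
import subprocess
import sys
import time


def terminate_with_grace(processes, timeout=30.0):
    """Ask each child to exit, wait up to `timeout` seconds in total,
    then force-kill any child that is still running.

    Hypothetical helper for illustration only; returns the number of
    children that had to be force-killed.
    """
    # Step 1: politely ask every live child to stop (SIGTERM on POSIX).
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()

    # Step 2: give all children a shared grace period to clean up,
    # for example to write a final checkpoint.
    deadline = time.monotonic() + timeout
    killed = 0
    for proc in processes:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # Step 3: the grace period is over, so force-kill the straggler.
            proc.kill()
            proc.wait()
            killed += 1
    return killed
```

The grace period only helps if each rank actually uses it: a child that does nothing on the first signal simply sits idle until it is force-killed, which would match the behaviour reported in this thread.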

@BoyuanJiang
Author

Yes, it holds for 30 seconds and then all ranks are killed. But I am not sure which line of code saves the latest state to a checkpoint during those 30 seconds.

@mvpatel2000
Contributor

https://github.com/mosaicml/composer/blob/dev/composer/callbacks/checkpoint_saver.py#L307-L313 On close, it should try to flush a checkpoint. This hasn't been extensively tested, though.
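
As a rough illustration of what "flush a checkpoint on close" can look like in general, here is a minimal, self-contained sketch. It does not reproduce Composer's `checkpoint_saver.py`; the class `ToyCheckpointer`, the function `train`, and the `interrupt_at` parameter are all hypothetical names made up for this example, and the state is a plain dictionary written as JSON.

```python
import json
import os
import tempfile


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpoint callback (not Composer's API)."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        # Write to a temporary file first, then rename it into place, so an
        # interrupted write never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)

    def close(self, state):
        # "Flush on close": persist whatever state exists at shutdown.
        self.save(state)


def train(checkpointer, num_steps, interrupt_at=None):
    """Toy training loop; `interrupt_at` simulates Ctrl+C at that step."""
    state = {"step": 0}
    try:
        for step in range(1, num_steps + 1):
            if interrupt_at is not None and step == interrupt_at:
                raise KeyboardInterrupt  # what Ctrl+C raises in the main thread
            state["step"] = step
    finally:
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt.
        checkpointer.close(state)
    return state
```

The `finally` block is the key design choice in this sketch: it runs whether the loop finishes, crashes, or is interrupted, but only if the process is still alive, so it cannot help once a process has been force-killed.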


3 participants