
Add train finished run event #2714

Merged: 5 commits merged into dev from jane/add-train-finished on Nov 16, 2023

Conversation

@jjanezhang (Contributor) commented Nov 14, 2023

Add train finished run event

  • Adds a timestamp to the run metadata when the model finishes training (see the sketch below)
  • Will create a new TRAIN_FINISHED event in MAPI
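
For a rough sense of the mechanism, here is a minimal sketch (not this PR's actual diff, which touches composer/trainer/trainer.py): stamp the run metadata once training ends and hand it to the MosaicML platform logger. The callback class, the metadata key name, and the assumption that MosaicMLLogger exposes a log_metadata(dict) method are all illustrative.

```python
# Hypothetical sketch only: stamp the run metadata when training finishes.
# The callback name and the 'train_finished_time' key are assumptions.
import datetime

from composer.core import Callback, State
from composer.loggers import Logger, MosaicMLLogger


class TrainFinishedMetadata(Callback):
    """Records a train-finished timestamp in the run metadata at fit_end."""

    def fit_end(self, state: State, logger: Logger) -> None:
        finish_time = datetime.datetime.now(datetime.timezone.utc).isoformat()
        # Forward the timestamp to the MosaicML platform logger, if one is attached.
        for destination in logger.destinations:
            if isinstance(destination, MosaicMLLogger):
                destination.log_metadata({'train_finished_time': finish_time})
```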

Testing

  1. Added an assert in the unit test to make sure the train finished event is logged.
  2. Confirmed that the train finished time is added to the run metadata with the same timestamp as when the last batch finishes.

@eracah (Contributor) commented Nov 14, 2023

The code looks good! But I am curious whether we want to log when the actual training finishes or when everything is finished and about to shut down. The reason is that, as the code stands now, the train_finish_time wouldn't get logged until the last checkpoint was successfully uploaded to the cloud, which can take an extra 30 minutes to an hour after the last batch_end, depending on the size of the checkpoint.

Perhaps running this log event on the last batch_end or making sure MosaicMLLogger comes before RemoteUploaderDownloader in the list of loggers would ensure we log after training is done, but before all checkpoints are uploaded. @dakinggg, wdyt?
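
For reference, a minimal sketch of the second option (ordering the logger destinations so MosaicMLLogger sits ahead of the RemoteUploaderDownloader); the bucket URI is a placeholder, and whether ordering alone is sufficient is exactly the open question here:

```python
from composer.loggers import MosaicMLLogger, RemoteUploaderDownloader

# Order matters at shutdown: with MosaicMLLogger ahead of the RUD, the run
# metadata (including the train-finished timestamp) is reported before the
# RUD starts waiting on checkpoint uploads. 's3://my-bucket' is a placeholder.
loggers = [
    MosaicMLLogger(),
    RemoteUploaderDownloader(bucket_uri='s3://my-bucket'),
]
# ...then pass loggers=loggers to composer.Trainer as usual.
```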

@dakinggg (Contributor) commented:

Yeah @eracah I think we want to log this before the wait for checkpoint upload. Is it simplest to make sure the mosaic logger runs before the RUD?

@eracah (Contributor) commented Nov 15, 2023

> Yeah @eracah I think we want to log this before the wait for checkpoint upload. Is it simplest to make sure the mosaic logger runs before the RUD?

Yup, I think so! And any other RUDs that callbacks create (but I think those will always be added to the end of the list of callbacks).

@jjanezhang jjanezhang requested a review from eracah November 16, 2023 00:01
@jjanezhang jjanezhang self-assigned this Nov 16, 2023
Resolved review thread on composer/trainer/trainer.py
@eracah (Contributor) commented Nov 16, 2023

LGTM. Thanks, @jjanezhang!

Maybe add a comment describing why RUD has to be last?
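
Something like the following, perhaps (a hypothetical sketch of the requested comment and reordering, not the actual trainer.py change from this PR):

```python
from composer.loggers import RemoteUploaderDownloader


def _move_ruds_last(loggers):
    """Keep every RemoteUploaderDownloader at the end of the logger list.

    The RUD blocks on shutdown until all pending checkpoint uploads finish,
    so any destination placed after it (e.g. MosaicMLLogger) would not report
    the train-finished metadata until those uploads complete, which can be
    30 minutes to an hour after the last batch for large checkpoints.
    """
    ruds = [d for d in loggers if isinstance(d, RemoteUploaderDownloader)]
    others = [d for d in loggers if not isinstance(d, RemoteUploaderDownloader)]
    return others + ruds
```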

@jjanezhang jjanezhang merged commit 3cf73cc into dev Nov 16, 2023
16 checks passed
@jjanezhang jjanezhang deleted the jane/add-train-finished branch November 16, 2023 21:12