Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime estimator #1991

Merged

Conversation

mvpatel2000
Copy link
Contributor

@mvpatel2000 mvpatel2000 commented Feb 22, 2023

What does this PR do?

Estimates remaining runtime. Uses a simple timer interpolation + correction for eval. Graph shows a "normal" run and a corrected run with eval.

image

Note that this doesn't always work super smoothly, e.g. resnet estimates are not great because first epoch is a lot slower due to dataloader issues.

image

We will probably hit more issues as well later on, so this will require corrections.

Import Questions:

  • is there a different place to log this? wall_clock/ is intuitive, but everything else is in seconds so we need to report in seconds (whereas we would prefer to probably report hours?)

What issue(s) does this change relate to?

CO-1664

@mvpatel2000
Copy link
Contributor Author

mvpatel2000 commented Feb 23, 2023

Note: The weird spikes in the wandb graph are because wandb associates the last logged time for a given step, which makes graph weird when x-axis is time. I spent a while investigating -- happy to discuss offline since it's a little too complicated to writeup. As a side effect, @eracah and I discovered where timestamp is advanced for batches vs. epochs compared to when metrics are calculated is inconsistent -- along for the ride I reorder the epoch one to be consistent.

Before and after for metrics not changing
image

Copy link

@Glabred Glabred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say sorry to me

@eracah eracah requested review from Glabred and removed request for Glabred February 24, 2023 02:28
@Glabred Glabred self-requested a review February 24, 2023 02:42
Copy link

@Glabred Glabred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jk

@mvpatel2000 mvpatel2000 dismissed Glabred’s stale review February 24, 2023 05:23

Let's discuss offline... we should get you through a starter task and maybe a training project before you're ready

@mvpatel2000 mvpatel2000 requested a review from bcui19 February 24, 2023 05:53
Copy link
Contributor

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some comments

composer/trainer/trainer.py Show resolved Hide resolved
composer/callbacks/runtime_estimator.py Outdated Show resolved Hide resolved
composer/callbacks/runtime_estimator.py Outdated Show resolved Hide resolved
composer/callbacks/runtime_estimator.py Outdated Show resolved Hide resolved
composer/callbacks/runtime_estimator.py Outdated Show resolved Hide resolved
composer/callbacks/runtime_estimator.py Outdated Show resolved Hide resolved
@dakinggg
Copy link
Contributor

Also, I'm willing to merge this with just manual tests, but please do add some unit tests in a follow on pr

Copy link
Contributor

@bcui19 bcui19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, after addressing comments. +1 to adding tests in a separate PR

@mvpatel2000 mvpatel2000 requested a review from a team as a code owner February 24, 2023 18:46
@mvpatel2000
Copy link
Contributor Author

Added a test

@mvpatel2000 mvpatel2000 merged commit 850e34d into mosaicml:dev Feb 24, 2023
@mvpatel2000 mvpatel2000 deleted the mvpatel2000/bottom-up-runtime-estimator branch February 24, 2023 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants