
Add distributed sampler error #1598

Merged: 20 commits into mosaicml:dev from mvpatel20000/dist-sampling-2, Oct 13, 2022

Conversation

@mvpatel2000 (Contributor)

What does this PR do?

Adds error if dataloader is used without distributed sampler.
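For context, a check like the one this PR describes might look roughly like the following. This is a minimal sketch only, with an assumed helper name and error text; the actual Composer implementation may differ.

```python
import torch.utils.data
from torch.utils.data.distributed import DistributedSampler


def check_dataloader_sampler(dataloader: torch.utils.data.DataLoader, world_size: int) -> None:
    """Hypothetical sketch: raise if a DataLoader is used for multi-GPU training
    without a DistributedSampler, since each rank would otherwise iterate over
    the full dataset instead of its own shard."""
    if world_size > 1 and not isinstance(dataloader.sampler, DistributedSampler):
        raise ValueError(
            'A DataLoader used with distributed training should be constructed with a '
            'torch.utils.data.distributed.DistributedSampler so that each rank sees a '
            'distinct shard of the dataset.')
```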

What issue(s) does this change relate to?

CO-1066

@mvpatel2000 (Contributor, Author)

I'm very unsure why the grad accum check is failing. Would love fresh eyes.

Review comment on composer/core/data_spec.py (outdated, resolved)
@mvpatel2000 mvpatel2000 requested a review from dakinggg October 11, 2022 19:03
@mvpatel2000 mvpatel2000 requested a review from bcui19 October 11, 2022 19:58
@mvpatel2000 (Contributor, Author)

@eracah @dakinggg @bcui19 I'd love some help on this failing test... I have no idea what's going on :(. Suggestions welcome.

@eracah (Contributor) commented Oct 11, 2022

> @eracah @dakinggg @bcui19 I'd love some help on this failing test... I have no idea what's going on :(. Suggestions welcome.

Make the tolerance higher?

@mvpatel2000 (Contributor, Author) commented Oct 11, 2022

> @eracah @dakinggg @bcui19 I'd love some help on this failing test... I have no idea what's going on :(. Suggestions welcome.

> Make the tolerance higher?

@hanlint thinks there might be a genuine failure. It's not clear to me how the PR changes would affect the values we see... so it's weird to see failures starting now. Maybe just a randomness change?

@eracah (Contributor) commented Oct 11, 2022

> @eracah @dakinggg @bcui19 I'd love some help on this failing test... I have no idea what's going on :(. Suggestions welcome.

> Make the tolerance higher?

> @hanlint thinks there might be a genuine failure. It's not clear to me how the PR changes would affect the values we see... so it's weird to see failures starting now. Maybe just a randomness change?

What sources of stochasticity do you need to fix? Just the rank_zero_seed?

@bcui19 (Contributor) commented Oct 12, 2022

I checked this last night, and I also wasn't sure why it was failing. I talked with @bandish-shah: mixed precision won't always produce the same results, so it could've been the random seed.

To repro it, it should be possible to re-run with the same seed that made the test fail.

Also, tests are passing now, so it could've been the rank_zero_seed?
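For reference, re-running with a pinned seed could be done with plain PyTorch seeding along these lines. This is a generic sketch, not Composer's own reproducibility utilities, and the seed value is a placeholder.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    # Pin the common sources of randomness so a failing run can be replayed.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)  # placeholder; substitute the seed from the failing CI run
```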

@mvpatel2000 (Contributor, Author)

> I checked this last night, and I also wasn't sure why it was failing. I talked with @bandish-shah: mixed precision won't always produce the same results, so it could've been the random seed. To repro it, it should be possible to re-run with the same seed that made the test fail.
It seems to always be running with the same seed. Each time the tests run, the amount it fails by is the same. So in this case, I don't think it's due to stochasticity @eracah @bcui19.

Greatest absolute difference: 1.691281795501709e-06 at index (1, 0) (up to 1e-08 allowed)
Greatest relative difference: 2.489731502748192e-05 at index (1, 0) (up to 1e-05 allowed)

It could just be that the seed we used before worked, and the new seed doesn't...

> Also, tests are passing now, so it could've been the rank_zero_seed?

I think it was marked as passed because Jenkins failed before it ran the tests.

@mvpatel2000 (Contributor, Author)

On batch 12, we get different gradients because we get different forward-pass outputs. The parameters are the same, and the grad_accum=2 value is closer to being correct (the value you get with full precision) than the reference. So I'm going to raise the thresholds and call it a day.
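As an illustration, "raising the thresholds" amounts to loosening the comparison tolerances in the test. The sketch below uses torch.testing.assert_close with placeholder tensors and assumed tolerance values; it is not the exact test change.

```python
import torch

# Placeholder tensors standing in for parameters from the grad_accum=2 run and
# the reference run; in the real test these come from two training runs.
params_grad_accum_2 = torch.tensor([[1.000000], [0.999998]])
params_reference = torch.tensor([[1.000000], [1.000000]])

# Loosen tolerances relative to the limits reported in the failing test
# (1e-08 absolute, 1e-05 relative), since mixed precision can differ at ~1e-06.
torch.testing.assert_close(params_grad_accum_2, params_reference, atol=1e-5, rtol=1e-4)
```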

@bcui19 (Contributor) left a comment


Based on offline discussion around the error, LGTM! Thanks for the deep dive.

@mvpatel2000 mvpatel2000 merged commit 07428c4 into mosaicml:dev Oct 13, 2022
@mvpatel2000 mvpatel2000 deleted the mvpatel20000/dist-sampling-2 branch October 13, 2022 22:29