-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distributed sampler error #1598
Add distributed sampler error #1598
Conversation
…/composer into mvpatel20000/dist-sampling-2
I'm very unsure why the grad accum check is failing. Would love fresh eyes |
…/composer into mvpatel20000/dist-sampling-2
@hanlint thinks there might be a genuine failure. It's not clear to me how the PR changes would affect the values we see... so its weird to see failures starting now. Maybe just randomness change? |
what sources of stochasticity do you need to fix? just the |
I checked this last night, I also wasn't sure why it was failing. Talked with @bandish-shah, mixed precision won't always produce the same results, so it could've been the random seed. to repo it would be possible to re-run w/ the same seed that made the test fail. Also now tests are passing, so it could've been the |
It could just be the seed we used before worked, and the new seed doesn't work....
I think it marked as passed because jenkins failed before it ran the tests. |
…/composer into mvpatel20000/dist-sampling-2
on batch 12, we get different gradients because we get different forward pass outputs. The parameters are the same, and the grad_accum 2 value is closer to being correct (value you get with full precision) than the reference. So, I'm going to raise thresholds and call it a day |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offline discussion around the error, LGTM! Thanks for deep dive.
What does this PR do?
Adds error if dataloader is used without distributed sampler.
What issue(s) does this change relate to?
CO-1066