Fix drop_first checking in partitioning to account for world_size divisibility #706
Conversation
…resumption_bugs merging main
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
lgtm, wouldn't mind another pair of eyes since I haven't looked at this code in a long time
LGTM
LGTM. Thank You!
Description of changes:
This PR loosens restrictions on checking epoch_size during the sample partition creation function. If we attempt to create a partition with epoch_size number of samples, but epoch_size is not divisible by the world size, then we repeat samples to make epoch_size divisible by the world size. However, the current epoch_size checking did not take this into account. An explanation of why we have to repeat a few samples is below.

When we partition all samples over nodes/ranks/workers, we need to repeat some samples to make sure that the number of samples in the epoch is divisible by the world size. This need arises because:
1. epoch_size may not be divisible by the world size, so we must either drop some samples or repeat some samples.
2. Every GPU must be assigned the same number of samples, so that every GPU sees the same number of batches.

For 1., after talking with stakeholders and customers, repeating just a few samples instead of dropping them entirely has been seen as the better tradeoff. So, suppose someone sets epoch_size to 500 with a world size of 32. Then 12 samples will be repeated in the epoch, taking the epoch size to 512 and making epoch_size divisible by the world size, meaning that we assign the same number of samples to every GPU.

We need 2. because: imagine a training step where some GPUs have a full batch of samples, but others have only a partial batch. If drop_last=True for the dataloader, then the GPUs with a full batch will attempt to train, but the GPUs with a partial batch will drop their last batch, causing an error during training since some GPUs try to train while others don't. Repeating samples to make epoch_size divisible by the world size takes care of this because each GPU will have the same number of samples in every single global batch: either all GPUs have a full batch, or all GPUs have a partial batch. As a result, we never run into a case where some GPUs attempt to train while others don't.

Manual test run (resumption-issue-testing-500ep-WoeCrv) successfully resumed, addressing the bug report that raised this issue in the first place. Originally reported here.
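The padding arithmetic described above can be sketched as follows. Note that pad_epoch_size is a hypothetical helper written for illustration, not the actual function in this repo:

```python
import math

def pad_epoch_size(epoch_size: int, world_size: int) -> int:
    # Round epoch_size up to the nearest multiple of world_size.
    # The difference (padded - epoch_size) is the number of samples
    # that get repeated within the epoch.
    return math.ceil(epoch_size / world_size) * world_size

# The example from the description: epoch_size=500, world_size=32.
padded = pad_epoch_size(500, 32)   # 512, so 12 samples are repeated
per_gpu = padded // 32             # 16 samples per GPU, identical on every rank
```

Because the padded epoch size is an exact multiple of the world size, every rank draws the same number of samples, so all ranks agree on whether each global batch is full or partial.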
Issue #, if available:
Merge Checklist:
Put an x (without spaces) in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
I have run pre-commit on my change. (check out the pre-commit section of prerequisites)