
Avoid using infiniteloop in train_cifar10_ddp.py #145

Closed

Conversation

Xiaoming-Zhao
Contributor

What does this PR do?

This PR avoids the infinite generator provided by infiniteloop and directly iterates over the dataloader instead, as discussed in #144. The change follows the structure of the official PyTorch DDP example.

I tested the change locally and made sure that it runs smoothly.

I also added the --standalone command-line argument to the README, without which I could not get the script to run. This argument is also used in the official example for single-node usage.
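
For illustration, the new structure follows the standard epoch-based DDP pattern sketched below (a simplified sketch with stand-in data and epoch count, not the exact diff):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(8, 3, 32, 32), torch.zeros(8))  # stand-in for CIFAR-10
num_epochs = 2  # stand-in for the real epoch count

sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized (e.g., via torchrun)
dataloader = DataLoader(dataset, batch_size=4, sampler=sampler, drop_last=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-seeds shuffling, so every epoch differs per rank
    for x1, _ in dataloader:
        ...  # one training step, as in the diff below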

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines +163 to +198
for x1, _ in tqdm.tqdm(dataloader, total=len(dataloader)):
    global_step += 1

    optim.zero_grad()
    x1 = x1.to(rank)
    x0 = torch.randn_like(x1)
    t, xt, ut = FM.sample_location_and_conditional_flow(x0, x1)
    vt = net_model(t, xt)
    loss = torch.mean((vt - ut) ** 2)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        net_model.parameters(), FLAGS.grad_clip
    )  # new
    optim.step()
    sched.step()
    ema(net_model, ema_model, FLAGS.ema_decay)  # new

    # sample and save the weights
    if FLAGS.save_step > 0 and global_step % FLAGS.save_step == 0:
        generate_samples(
            net_model, FLAGS.parallel, savedir, global_step, net_="normal"
        )
        generate_samples(
            ema_model, FLAGS.parallel, savedir, global_step, net_="ema"
        )
        torch.save(
            {
                "net_model": net_model.state_dict(),
                "ema_model": ema_model.state_dict(),
                "sched": sched.state_dict(),
                "optim": optim.state_dict(),
                "step": global_step,
            },
            savedir + f"{FLAGS.model}_cifar10_weights_step_{global_step}.pt",
        )
Xiaoming-Zhao
Contributor Author

This chunk looks like a large change, but it only modifies where x1 is read. The rest of the diff comes from an indentation change, since we no longer need step_pbar.


codecov bot commented Nov 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.22%. Comparing base (29c0ed6) to head (db65caa).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #145       +/-   ##
===========================================
+ Coverage   34.23%   45.22%   +10.98%     
===========================================
  Files          55       12       -43     
  Lines        6268     1130     -5138     
===========================================
- Hits         2146      511     -1635     
+ Misses       4122      619     -3503     


@ImahnShekhzadeh
Contributor

ImahnShekhzadeh commented Nov 17, 2024

Hi @Xiaoming-Zhao,

Thanks for this PR!

1.) I'm pretty sure that using infiniteloop is correct, and hence does not need to be replaced (though that doesn't mean we cannot do it). In issue #144, you wrote:

I also noticed the use of sampler.set_epoch(epoch). Based on my previous experience, this is crucial to ensure randomness across epochs. However, with the current generator provided by infiniteloop, I am not sure whether set_epoch will actually affect the dataloader; I need to double-check.
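
For context, the infiniteloop helper follows the common restart-the-dataloader pattern sketched below; the repo's actual utils_cifar implementation may differ in details:

def infiniteloop(dataloader):
    # Yield training batches forever, restarting the dataloader whenever
    # it is exhausted; note there is no epoch counter for set_epoch here.
    while True:
        for x, _ in iter(dataloader):
            yield x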

I wrote a small script, uploaded on my website (https://imahnshekhzadeh.github.io/#Blog), that uses an infiniteloop dataloader and with which I experimented extensively. I ran it like this:

torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE test_inf_loop.py --master_addr [...] --master_port [...]
# e.g.: `torchrun --nproc_per_node=2 test_inf_loop.py --master_addr [...] --master_port [...]`

When sampler.set_epoch(epoch) is used, we observe:

Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-0.1752,  0.6990],
        [-0.2350,  0.0937]])]

So clearly, sampler.set_epoch(epoch) does its job! However, when the two sampler.set_epoch(epoch) lines are commented out, we see this in the output:

Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Clearly, no shuffling happened!
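
One way to reconcile the infinite generator with per-epoch shuffling (a sketch of the general idea, not necessarily what #116 implements; it assumes the DistributedSampler is reachable as dataloader.sampler) is to call set_epoch inside the generator before every fresh pass:

def infiniteloop(dataloader):
    epoch = 0
    while True:
        # Re-seed the DistributedSampler before every fresh pass so each
        # "virtual epoch" is shuffled differently on every rank.
        dataloader.sampler.set_epoch(epoch)
        for x, _ in iter(dataloader):
            yield x
        epoch += 1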

2.) About the --standalone flag: Can you please open an issue for this, since it is unrelated to the infiniteloop discussion? Thanks!

@Xiaoming-Zhao
Contributor Author

Thanks for the detailed example, @ImahnShekhzadeh! This is incredibly helpful. Lessons learned.

I will close this PR for now, as it seems all the required changes have already been implemented in #116.

Regarding the torchrun command, I created a separate PR: #149.
