
Avoid using infiniteloop in train_cifar10_ddp.py #145

Closed

Conversation

Xiaoming-Zhao
Contributor

What does this PR do?

This PR avoids the infinite generator provided by infiniteloop and directly iterates over the dataloader instead, as discussed in #144. The change follows the structure of the official PyTorch DDP example.

I tested the change locally and made sure that it runs smoothly.

I also added the --standalone command-line argument to the README, without which I could not get the script to run. This argument is also used in the official example for single-node usage.
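
For illustration, the new structure follows the standard epoch-based DDP pattern sketched below (a simplified sketch with stand-in data and epoch count, not the exact diff):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(8, 3, 32, 32), torch.zeros(8))  # stand-in for CIFAR-10
num_epochs = 2  # stand-in for the real epoch count

sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized (e.g., via torchrun)
dataloader = DataLoader(dataset, batch_size=4, sampler=sampler, drop_last=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-seeds shuffling, so every epoch differs per rank
    for x1, _ in dataloader:
        ...  # one training step, as in the diff below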

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines +163 to +198
for x1, _ in tqdm.tqdm(dataloader, total=len(dataloader)):
    global_step += 1

    optim.zero_grad()
    x1 = x1.to(rank)
    x0 = torch.randn_like(x1)
    t, xt, ut = FM.sample_location_and_conditional_flow(x0, x1)
    vt = net_model(t, xt)
    loss = torch.mean((vt - ut) ** 2)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        net_model.parameters(), FLAGS.grad_clip
    )  # new
    optim.step()
    sched.step()
    ema(net_model, ema_model, FLAGS.ema_decay)  # new

    # sample and save the weights
    if FLAGS.save_step > 0 and global_step % FLAGS.save_step == 0:
        generate_samples(
            net_model, FLAGS.parallel, savedir, global_step, net_="normal"
        )
        generate_samples(
            ema_model, FLAGS.parallel, savedir, global_step, net_="ema"
        )
        torch.save(
            {
                "net_model": net_model.state_dict(),
                "ema_model": ema_model.state_dict(),
                "sched": sched.state_dict(),
                "optim": optim.state_dict(),
                "step": global_step,
            },
            savedir + f"{FLAGS.model}_cifar10_weights_step_{global_step}.pt",
        )
Xiaoming-Zhao
Contributor Author

This chunk looks like a large change, but it only modifies where x1 is read. The rest of the diff comes from an indentation change, since we no longer need step_pbar.


codecov bot commented Nov 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.22%. Comparing base (29c0ed6) to head (db65caa).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #145       +/-   ##
===========================================
+ Coverage   34.23%   45.22%   +10.98%     
===========================================
  Files          55       12       -43     
  Lines        6268     1130     -5138     
===========================================
- Hits         2146      511     -1635     
+ Misses       4122      619     -3503     


@ImahnShekhzadeh
Contributor

ImahnShekhzadeh commented Nov 17, 2024

Hi @Xiaoming-Zhao,

Thanks for this PR!

1.) I'm pretty sure that using infiniteloop is correct, and hence does not need to be replaced (though that doesn't mean we cannot do it). In issue #144, you wrote:

I also noticed the use of sampler.set_epoch(epoch). Based on my previous experience, this is crucial to ensure randomness across epochs. However, with the current generator provided by infiniteloop, I am not sure whether set_epoch will actually affect the dataloader; I need to double-check.
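
For context, the infiniteloop helper follows the common restart-the-dataloader pattern sketched below; the repo's actual utils_cifar implementation may differ in details:

def infiniteloop(dataloader):
    # Yield training batches forever, restarting the dataloader whenever
    # it is exhausted; note there is no epoch counter for set_epoch here.
    while True:
        for x, _ in iter(dataloader):
            yield x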

I wrote a small script, uploaded on my website (https://imahnshekhzadeh.github.io/#Blog), that uses an infiniteloop dataloader and with which I experimented extensively. I ran it like this:

torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE test_inf_loop.py --master_addr [...] --master_port [...]
# e.g.: `torchrun --nproc_per_node=2 test_inf_loop.py --master_addr [...] --master_port [...]`

When sampler.set_epoch(epoch) is used, we observe:

Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-0.1752,  0.6990],
        [-0.2350,  0.0937]])]

So clearly, sampler.set_epoch(epoch) does its job! However, when the two sampler.set_epoch(epoch) lines are commented out, we see this in the output:

Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Clearly, no shuffling happened!
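
One way to reconcile the infinite generator with per-epoch shuffling (a sketch of the general idea, not necessarily what #116 implements; it assumes the DistributedSampler is reachable as dataloader.sampler) is to call set_epoch inside the generator before every fresh pass:

def infiniteloop(dataloader):
    epoch = 0
    while True:
        # Re-seed the DistributedSampler before every fresh pass so each
        # "virtual epoch" is shuffled differently on every rank.
        dataloader.sampler.set_epoch(epoch)
        for x, _ in iter(dataloader):
            yield x
        epoch += 1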

2.) About the --standalone flag: Can you please open an issue for this, since it is unrelated to the infiniteloop discussion? Thanks!

@Xiaoming-Zhao
Contributor Author

Thanks for the detailed example, @ImahnShekhzadeh! This is incredibly helpful. Lessons learned.

I will close this PR for now, as it seems all the required changes have already been implemented in #116.

Regarding the torchrun command, I created a separate PR: #149.
