add support for distributed data parallel training #116
Conversation
…e on multiple GPUs
…k` for device setting
Hi, thank you for your contribution! I had an internal implementation with Fabric from Lightning, but I would like to rely only on PyTorch for this example. I need some time to review it (a few days/weeks). I will come back to it soon.
Thank you for this nice contribution! My main concern is about the data shuffling: I think we should keep shuffling the data. Maybe I am missing something, and I am happy to learn about it.
```diff
@@ -81,7 +89,8 @@ def train(argv):
     dataloader = torch.utils.data.DataLoader(
         dataset,
         batch_size=FLAGS.batch_size,
-        shuffle=True,
+        sampler=DistributedSampler(dataset) if FLAGS.parallel else None,
+        shuffle=False if FLAGS.parallel else True,
```
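For context, the change above amounts to something like the following minimal sketch (`build_dataloader` is a hypothetical helper, not code from the PR):

```python
from torch.utils.data import DataLoader, Dataset, DistributedSampler


def build_dataloader(dataset: Dataset, batch_size: int, parallel: bool) -> DataLoader:
    # With a DistributedSampler, the DataLoader itself must not shuffle;
    # the sampler shuffles within each rank's shard (reshuffled per epoch
    # once set_epoch() is called).
    sampler = DistributedSampler(dataset) if parallel else None
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        shuffle=sampler is None,
    )
```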
Hmm, I am rather unsure about this. Where do you shuffle the data then?
Fair point. I found a warning in the PyTorch docs for `DistributedSampler` (paraphrasing): in distributed mode, `set_epoch()` has to be called at the beginning of each epoch, before creating the `DataLoader` iterator, otherwise the same ordering is used in every epoch.

So in my current implementation, `train_sampler.set_epoch(epoch)` is missing, which I will add now.
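For what it's worth, here is a small self-contained sketch of why `set_epoch` matters; the explicit `num_replicas`/`rank` arguments are only there so it runs without `init_process_group`:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))  # toy dataset, just for illustration
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

print(list(sampler))  # some permutation of this rank's shard
print(list(sampler))  # same permutation again: the epoch (and thus the seed) is unchanged

sampler.set_epoch(1)  # reseeds the shuffle with seed + epoch
print(list(sampler))  # a different permutation (in general)
```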
Perfect. Once you have finished your change, I will run the code myself. Once I get it working, I will merge the PR.
Final question: can you try to load and run the existing checkpoints? I just want to be sure that people can reproduce our results. Thx.
Ok, I refactored the training loop to use `num_epochs` instead of `FLAGS.total_steps`, since `sampler.set_epoch(epoch)` expects an epoch count. However, I think we need to change more than this. The PyTorch warning I pasted above mentions that we need to call `sampler.set_epoch(epoch)` "before creating the `DataLoader` iterator", but right now, the data loader iterator is created once before the training loop:
```python
from utils_cifar import infiniteloop

datalooper = infiniteloop(dataloader)
```
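For context, `infiniteloop` in `utils_cifar` is roughly a generator that cycles over the `DataLoader` forever, so the iterator is never recreated and `set_epoch` never takes effect. A sketch of such a helper (paraphrased, not the verbatim repo code):

```python
def infiniteloop(dataloader):
    # Cycle over the DataLoader indefinitely, yielding only the images.
    while True:
        for x, _y in iter(dataloader):
            yield x
```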
The way I would change this is by having a training loop like this:

```python
# datalooper = infiniteloop(dataloader)
with trange(num_epochs, dynamic_ncols=True) as epoch_pbar:
    for epoch in epoch_pbar:
        epoch_pbar.set_description(f"Epoch {epoch + 1}/{num_epochs}")
        if sampler is not None:
            sampler.set_epoch(epoch)
        for batch_idx, data in enumerate(dataloader):
            # step += 1  # tricky
            optim.zero_grad()
            x1 = data.to(rank)  # old: `x1 = next(datalooper)`
            [...]
```
Is this fine by you? IMO, what is a bit tricky is handling the `step` counter correctly (it determines when checkpoints are saved and when samples are generated during training). In a distributed setup, we'll have several processes running in parallel, so we would probably save checkpoints and images multiple times (once per process/GPU). However, since the filenames do not reflect the process ID, one process would also overwrite the files of the others. What do you think?
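One common way to handle this is to let only rank 0 write checkpoints and sample images. A sketch (not code from this PR; `save_step`, `ckpt_path`, and `model` are placeholders):

```python
import torch.distributed as dist


def is_main_process() -> bool:
    # Rank 0 only (or single-process runs without an initialized process group).
    return not dist.is_initialized() or dist.get_rank() == 0


# Hypothetical guard inside the training loop:
# if step % save_step == 0 and is_main_process():
#     torch.save({"model": model.state_dict(), "step": step}, ckpt_path)
```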
About your question: when you say "existing checkpoints", which ones do you mean? I had once run the training and generation of samples on one GPU and got an FID of 3.8 (which is only slightly worse than the 3.5 you report).
* change pytorch lightning version
* fix pip version
* fix pip in code cov
…eps, rewrite training loop to use epochs instead of steps
…in distributed mode
I like the new changes. @atong01, do you mind having a look? I also think it would be great to keep the original `train_cifar10.py`. While I like this code, it is slightly more complicated than the previous one, so I would keep both. The idea of this package is that any master's student can easily understand it in one hour. @ImahnShekhzadeh, can you rename this file to `train_cifar10_ddp.py`, please, and re-add the previous file? Thanks
Done
LGTM. Thanks for the contribution @ImahnShekhzadeh
This PR adds support for distributed data parallel (DDP) training and replaces `DataParallel` with `DistributedDataParallel` in `train_cifar.py`, which can be used via the flag `parallel`. To achieve this, the code is refactored, and the flags `master_addr` and `master_port` are added. I tested the changes: on a single GPU, I get an FID of 3.74 (with the OT-CFM method); on two GPUs with DDP, I get an FID of 3.81.
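To illustrate how the `master_addr`/`master_port` flags typically come together, here is a minimal sketch of DDP process-group setup; the helper name `ddp_setup` and the spawn-based launch are assumptions, not necessarily how the PR wires it up:

```python
import os

import torch
import torch.distributed as dist


def ddp_setup(rank: int, world_size: int, master_addr: str, master_port: str) -> None:
    # Rendezvous info comes from the (assumed) flags master_addr / master_port.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


# Typical launch: one process per GPU, e.g.
# torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
# where `train` calls ddp_setup(rank, ...) first and wraps the model in
# DistributedDataParallel(model, device_ids=[rank]).
```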
Before submitting
- Did you run all the tests with the `pytest` command?
- Did you run the `pre-commit run -a` command?