
Fixed global_step in train_cifar10_ddp.py #144

Closed
wants to merge 1 commit into from

Conversation

Xiaoming-Zhao
Contributor

What does this PR do?

The current global_step in train_cifar10_ddp.py is not correct. The global step should only increase by one at a time instead of accumulating the current step number.
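
A minimal sketch of the intended counting behavior (the loop bounds and variable names here are illustrative, not the actual ones in train_cifar10_ddp.py):

```python
total_epochs, steps_per_epoch = 3, 5

accumulated = 0  # mimics accumulating the in-epoch step index (the bug)
global_step = 0  # the fix: one increment per training step

for epoch in range(total_epochs):
    for step in range(steps_per_epoch):
        accumulated += step  # grows by 0 + 1 + 2 + 3 + 4 = 10 each epoch
        global_step += 1     # counts actual optimizer updates

print(accumulated, global_step)  # 30 vs. 15
```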

Before submitting

  • Did you make sure title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

@ImahnShekhzadeh
Contributor

Yes, it's correct that the script train_cifar10_ddp.py currently does not handle the training loop correctly. In this context, I'll point out the following post of mine: #116 (comment)

Can you explain why global_step += 1 is correct? I still think that switching from steps to epochs would be easiest (@kilianFatras).

@Xiaoming-Zhao
Contributor Author

Thanks for the pointer.

I was blindly running the script to check whether I could reproduce the results but realized the saved checkpoints did not have the correct iteration indicator.

Switching from steps to epochs sounds good to me. For the change in this PR, I was mainly trying to stay consistent with the repo's existing convention of using the step count as the training-progress indicator.

I also noticed the use of sampler.set_epoch(epoch). Based on my previous experience, this is crucial to ensure different shuffling across epochs. However, with the current generator provided by infiniteloop, I am not sure whether set_epoch will actually affect the dataloader; I need to double-check.

But I think it is easy to change from the datalooper to the dataloader to ensure randomness. It merely changes the following

with trange(steps_per_epoch, dynamic_ncols=True) as step_pbar:

to

for batch in tqdm.tqdm(dataloader, total=len(dataloader)):

and in this way, I am sure that sampler.set_epoch will work as expected.
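
For concreteness, here is a rough, self-contained sketch of such an epoch-based DDP loop with sampler.set_epoch (the dummy dataset, model, and hyperparameters are placeholders, not the actual CIFAR-10 flow-matching training code; launch with torchrun):

```python
import torch
import torch.distributed as dist
import tqdm
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group("gloo")  # use "nccl" on GPUs
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Dummy regression data standing in for CIFAR-10.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 32))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler, drop_last=True)

    model = DDP(torch.nn.Linear(32, 32))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    global_step = 0
    for epoch in range(5):
        # Without set_epoch, every epoch would reuse the same shuffling order.
        sampler.set_epoch(epoch)
        for x, y in tqdm.tqdm(dataloader, total=len(dataloader), disable=rank != 0):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            global_step += 1  # one increment per optimizer update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```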

@Xiaoming-Zhao
Contributor Author

Xiaoming-Zhao commented Nov 16, 2024

@ImahnShekhzadeh Added a working example in #145

@atong01
Owner

atong01 commented Nov 17, 2024

Closing. Superseded by #145
