
Fixed global_step in train_cifar10_ddp.py #144

Closed
wants to merge 1 commit into from

Conversation

Xiaoming-Zhao
Contributor

What does this PR do?

The current global_step in train_cifar10_ddp.py is not correct. The global step should only increase by one at a time instead of accumulating the current step number.
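
A minimal sketch of the intended counting behavior (the loop bounds and variable names here are illustrative, not the actual ones in train_cifar10_ddp.py):

```python
total_epochs, steps_per_epoch = 3, 5

accumulated = 0  # mimics accumulating the in-epoch step index (the bug)
global_step = 0  # the fix: one increment per training step

for epoch in range(total_epochs):
    for step in range(steps_per_epoch):
        accumulated += step  # grows by 0 + 1 + 2 + 3 + 4 = 10 each epoch
        global_step += 1     # counts actual optimizer updates

print(accumulated, global_step)  # 30 vs. 15
```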

Before submitting

  • Did you make sure title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

@ImahnShekhzadeh
Contributor

Yes, it's correct that the script train_cifar10_ddp.py currently does not handle the training loop correctly. In this context, I'll point out the following post of mine: #116 (comment)

Can you explain why global_step += 1 is correct? I still think that switching from steps to epochs would be easiest (@kilianFatras).

@Xiaoming-Zhao
Contributor Author

Thanks for the pointer.

I was blindly running the script to check whether I could reproduce the results but realized the saved checkpoints did not have the correct iteration indicator.

Switching from steps to epochs sounds good to me. For the change in this PR, I was mainly trying to stay consistent with the repo's existing convention of using the step count as the training-progress indicator.

I also noticed the use of sampler.set_epoch(epoch). Based on my previous experience, this is crucial to ensure different shuffling across epochs. However, with the current generator provided by infiniteloop, I am not sure whether set_epoch will actually affect the dataloader; I need to double-check.

But I think it is easy to change from the datalooper to the dataloader to ensure randomness. It merely changes the following

with trange(steps_per_epoch, dynamic_ncols=True) as step_pbar:

to

for batch in tqdm.tqdm(dataloader, total=len(dataloader)):

and in this way, I am sure that sampler.set_epoch will work as expected.
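
For concreteness, here is a rough, self-contained sketch of such an epoch-based DDP loop with sampler.set_epoch (the dummy dataset, model, and hyperparameters are placeholders, not the actual CIFAR-10 flow-matching training code; launch with torchrun):

```python
import torch
import torch.distributed as dist
import tqdm
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group("gloo")  # use "nccl" on GPUs
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Dummy regression data standing in for CIFAR-10.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 32))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler, drop_last=True)

    model = DDP(torch.nn.Linear(32, 32))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    global_step = 0
    for epoch in range(5):
        # Without set_epoch, every epoch would reuse the same shuffling order.
        sampler.set_epoch(epoch)
        for x, y in tqdm.tqdm(dataloader, total=len(dataloader), disable=rank != 0):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            global_step += 1  # one increment per optimizer update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```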

@Xiaoming-Zhao
Contributor Author

Xiaoming-Zhao commented Nov 16, 2024

@ImahnShekhzadeh Added a working example in #145

@atong01
Owner

atong01 commented Nov 17, 2024

Closing. Superseded by #145
