
Enable AdamW Optimizer #6153

Closed
bilzard opened this issue Jan 2, 2022 · 8 comments · Fixed by #6152
Labels
enhancement New feature or request

Comments

@bilzard
Contributor

bilzard commented Jan 2, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar feature requests.

Description

When we use Adam, we have to tune the learning rate along with the batch size.
It is cumbersome; with AdamW, we don't have to re-tune the learning rate even if we change the batch size.
So it would be nice to be able to use this option.

I have created a PR to enable the AdamW optimizer. Please check it out:
#6152
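
For context, a minimal sketch of the kind of change being proposed (the helper name and defaults here are illustrative only; the actual implementation lives in PR #6152):

import torch.nn as nn
import torch.optim as optim

def build_optimizer(model: nn.Module, name: str = "SGD", lr: float = 0.01,
                    momentum: float = 0.937, weight_decay: float = 5e-4):
    # Illustrative helper: select an optimizer by name. Parameter-group handling
    # (e.g. skipping decay on BatchNorm weights and biases) is omitted for brevity.
    params = model.parameters()
    if name == "SGD":
        return optim.SGD(params, lr=lr, momentum=momentum, nesterov=True,
                         weight_decay=weight_decay)
    if name == "Adam":
        return optim.Adam(params, lr=lr, betas=(momentum, 0.999),
                          weight_decay=weight_decay)
    if name == "AdamW":
        # Decoupled-weight-decay variant of Adam (Loshchilov & Hutter)
        return optim.AdamW(params, lr=lr, betas=(momentum, 0.999),
                           weight_decay=weight_decay)
    raise NotImplementedError(f"Optimizer {name} not implemented.")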

Use case

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@bilzard bilzard added the enhancement New feature or request label Jan 2, 2022
@bilzard bilzard mentioned this issue Jan 2, 2022
@glenn-jocher glenn-jocher linked a pull request Jan 2, 2022 that will close this issue
@glenn-jocher
Member

glenn-jocher commented Jan 2, 2022

@bilzard great, thank you! So the main difference between AdamW and Adam is that AdamW scales with batch-size automatically?

We automatically scale the loss by the batch size for SGD/Adam in YOLOv5 here. Does this mean AdamW will be double-scaled, or does it automatically compensate so it's not a problem?

yolov5/utils/loss.py (lines 165 to 167 at d95978a):

    bs = tobj.shape[0]  # batch size
    return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()
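
For intuition on the double-scaling question, a rough standalone check (not YOLOv5 code) shows the difference: multiplying the loss by a constant k scales an SGD step by k, while an Adam/AdamW step is roughly unchanged, because the update m_hat / (sqrt(v_hat) + eps) divides out a common scale factor in the gradient.

import torch

def first_step_size(opt_cls, k):
    # Toy check: how large is the first update when the loss is multiplied
    # by a constant k (a stand-in for "loss * batch_size")?
    w = torch.nn.Parameter(torch.ones(3))
    opt = opt_cls([w], lr=0.01)
    loss = k * (w ** 2).sum()
    loss.backward()
    before = w.detach().clone()
    opt.step()
    return (w.detach() - before).abs().mean().item()

for k in (1, 4, 16):  # stand-in for different batch sizes
    print(k, first_step_size(torch.optim.SGD, k),    # grows linearly with k
             first_step_size(torch.optim.AdamW, k))  # roughly constant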

@bilzard
Contributor Author

bilzard commented Jan 2, 2022

@glenn-jocher

Hi, thank you for the fast response.

So the main difference between AdamW and Adam is that AdamW scales with batch-size automatically?

That's what I initially thought, but since I couldn't find any articles to support it, I concluded it was my misunderstanding.
In other words, I couldn't find any evidence that the learning rate doesn't need to be re-tuned along with the batch size just because the optimizer is changed to AdamW.

My understanding is that AdamW is an improved version of Adam with respect to the weight decay update rule [1].
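
Roughly, the difference described in [1] can be sketched as follows (single step, bias correction omitted; an illustration, not the torch implementation):

import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with classic L2 regularization: the decay term is folded into the
    # gradient, so it gets rescaled by 1/sqrt(v) along with everything else.
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return w - lr * m / (v.sqrt() + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # AdamW: decoupled weight decay, applied directly to the weights and left
    # untouched by the adaptive 1/sqrt(v) scaling [1].
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    return w - lr * (m / (v.sqrt() + eps) + wd * w), m, v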

However, I don't think it is necessary to change my PR; some people may still want to adopt AdamW for the reason above.

I didn't know about the loss scaling in SGD. Thank you for letting me know.
However, my understanding is that the current code scales the learning rate proportionally to the batch size even when using Adam/AdamW as well as SGD.

I found an article [3] on a web forum [2] that suggests scaling the learning rate proportionally to the batch size in the case of SGD. However, I could not find any evidence that the learning rate should be scaled by such a simple rule for adaptive optimizers such as Adam/AdamW.
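
For reference, the linear scaling rule for SGD discussed in [2][3] amounts to the following (illustrative helper, not something in the repo):

def linearly_scaled_lr(base_lr: float, batch_size: int, base_batch_size: int = 64) -> float:
    # Linear scaling rule for SGD: scale the learning rate in proportion to the
    # batch size. Whether this carries over to adaptive optimizers like
    # Adam/AdamW is exactly the open question in this thread.
    return base_lr * batch_size / base_batch_size

# e.g. linearly_scaled_lr(0.01, 128) -> 0.02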

I don't think many people who use YOLOv5 will be aware of such scaling at first.
I think it would be better if, when the Adam/AdamW option is used, the learning rate were not multiplied by the batch size.
What do you think?


@bilzard
Contributor Author

bilzard commented Jan 3, 2022

P.S. Looking at some of the recent issue responses, it seems that sometimes a PR is merged in the middle of a discussion.

It's good to see a quick response, but isn't it a bit risky?
Merging changes that haven't been well discussed may introduce new bugs and problems. What do you think?

@glenn-jocher
Member

@bilzard yes, the idea is that users can change --batch-size without worrying about anything else. We've done a --batch-size study in #2452 to confirm that results are essentially independent of batch size with SGD, due to the scaling we have in place in loss.py above.

With Adam and AdamW I'm not sure.

Note I also applied PR #6152 changes to the YOLOv5 classifier branch here to maintain consistency:
https://github.com/ultralytics/yolov5/blob/classifier/classifier.py

Thanks for the feedback! Usually we get criticism in the other direction, that PRs stay open too long.

@bilzard
Contributor Author

bilzard commented Jan 3, 2022

@glenn-jocher Thank you for sharing the experiment results.
Great, it has been verified for SGD. (All important changes should be verified.)
So, if it is uncertain for Adam/AdamW, then I think it is reasonable to disable learning rate scaling with respect to batch size for these optimizers. I think Adam/AdamW users will be confused if the learning rate varies with the batch size (what do you think?).

I created a PR for this change on my forked repository. Would you mind merging it into the original repository?
https://github.com/bilzard/yolov5/pull/2

Alternatively, a user could control this behavior via a command-line option like --disable-loss-scaling.
This option would give the user more freedom, since the default behavior could then be changed even when using SGD.


Note: the PR has been updated with the latter choice.
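
For clarity, an opt-out of this kind could look roughly like the sketch below (illustrative only; the flag name comes from the proposal above, and the merged implementation may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", choices=["SGD", "Adam", "AdamW"], default="SGD")
parser.add_argument("--disable-loss-scaling", action="store_true",
                    help="do not multiply the loss by the batch size")
opt = parser.parse_args()

# Later, where the loss is reduced (mirroring the return in loss.py):
#   scale = 1 if opt.disable_loss_scaling else bs
#   return (lbox + lobj + lcls) * scale, torch.cat((lbox, lobj, lcls)).detach()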

@glenn-jocher
Member

@bilzard wait there are two topics here:

  • LR scaling. This is NOT done anywhere by default.
  • Loss scaling. This is done automatically by YOLOv5 in loss.py.

The LR not adjusting automatically may be an issue, as someone will need to pair --optimizer Adam with a hyp.yaml file that has a much lower learning rate to get similar results, e.g. if lr0=0.1 for SGD then they may want to start with lr0=0.01 for Adam.
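
A hypothetical guard along these lines (not part of YOLOv5) could at least warn users about the mismatch:

def check_initial_lr(optimizer_name: str, lr0: float) -> None:
    # Hypothetical helper: warn when an Adam-family optimizer is paired with an
    # SGD-scale initial learning rate, following the rough 10x rule of thumb above.
    if optimizer_name in ("Adam", "AdamW") and lr0 >= 0.1:
        print(f"WARNING: lr0={lr0} is likely too high for {optimizer_name}; "
              f"consider starting around lr0={lr0 / 10:g}.")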

Regarding modifying the loss scaling, we'd need to repeat a few points of the batch-size study using Adam and AdamW to see the real-world results.

@bilzard
Contributor Author

bilzard commented Jan 3, 2022

@glenn-jocher O.K. I agree with the idea of repeating the same study for Adam and AdamW. I will wait for that.

@glenn-jocher
Member

@bilzard the SGD/Adam/AdamW batch-size study results are here: https://wandb.ai/glenn-jocher/study-Adam. I divided the SGD LR by 10 manually for the Adam/AdamW runs.

# VOC
for b, m in zip([16, 64, 16, 64, 16, 64], ['SGD', 'SGD', 'Adam', 'Adam', 'AdamW', 'AdamW']):  # zip(batch_size, model)
  hyp = 'hyp.finetune.yaml' if m.startswith('SGD') else 'hyp.finetuneAdam.yaml'
  !python train.py --batch {b} --weights yolov5s.pt --data VOC.yaml --epochs 50 --cache --img 512 --nosave --hyp {hyp} --project study-Adam --name {m}-{b} --optimizer {m}

Adam seems to handle batch-size changes without issue, so it seems like no changes are required:
[Screenshot: batch-size study results chart, 2022-01-03]
