
Enable AdamW Optimizer #6153

Closed
bilzard opened this issue Jan 2, 2022 · 8 comments · Fixed by #6152
Labels
enhancement New feature or request

Comments

@bilzard
Contributor

bilzard commented Jan 2, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar feature requests.

Description

When we use Adam, we have to tune the learning rate along with the batch size.
It is cumbersome; with AdamW, we don't have to re-tune the learning rate even if we change the batch size.
So it would be nice to be able to use this option.

I have created a PR to enable the AdamW optimizer. Please check it out:
#6152
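
For context, a minimal sketch of the kind of change being proposed (the helper name and defaults here are illustrative only; the actual implementation lives in PR #6152):

import torch.nn as nn
import torch.optim as optim

def build_optimizer(model: nn.Module, name: str = "SGD", lr: float = 0.01,
                    momentum: float = 0.937, weight_decay: float = 5e-4):
    # Illustrative helper: select an optimizer by name. Parameter-group handling
    # (e.g. skipping decay on BatchNorm weights and biases) is omitted for brevity.
    params = model.parameters()
    if name == "SGD":
        return optim.SGD(params, lr=lr, momentum=momentum, nesterov=True,
                         weight_decay=weight_decay)
    if name == "Adam":
        return optim.Adam(params, lr=lr, betas=(momentum, 0.999),
                          weight_decay=weight_decay)
    if name == "AdamW":
        # Decoupled-weight-decay variant of Adam (Loshchilov & Hutter)
        return optim.AdamW(params, lr=lr, betas=(momentum, 0.999),
                           weight_decay=weight_decay)
    raise NotImplementedError(f"Optimizer {name} not implemented.")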

Use case

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@bilzard bilzard added the enhancement New feature or request label Jan 2, 2022
@bilzard bilzard mentioned this issue Jan 2, 2022
@glenn-jocher glenn-jocher linked a pull request Jan 2, 2022 that will close this issue
@glenn-jocher
Member

glenn-jocher commented Jan 2, 2022

@bilzard great, thank you! So the main difference between AdamW and Adam is that AdamW scales with batch-size automatically?

We automatically scale the loss by the batch size for SGD/Adam in YOLOv5 here. Does this mean AdamW will be double-scaled, or does it automatically compensate so it's not a problem?

yolov5/utils/loss.py (lines 165 to 167 at d95978a):

    bs = tobj.shape[0]  # batch size
    return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()
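
For intuition on the double-scaling question, a rough standalone check (not YOLOv5 code) shows the difference: multiplying the loss by a constant k scales an SGD step by k, while an Adam/AdamW step is roughly unchanged, because the update m_hat / (sqrt(v_hat) + eps) divides out a common scale factor in the gradient.

import torch

def first_step_size(opt_cls, k):
    # Toy check: how large is the first update when the loss is multiplied
    # by a constant k (a stand-in for "loss * batch_size")?
    w = torch.nn.Parameter(torch.ones(3))
    opt = opt_cls([w], lr=0.01)
    loss = k * (w ** 2).sum()
    loss.backward()
    before = w.detach().clone()
    opt.step()
    return (w.detach() - before).abs().mean().item()

for k in (1, 4, 16):  # stand-in for different batch sizes
    print(k, first_step_size(torch.optim.SGD, k),    # grows linearly with k
             first_step_size(torch.optim.AdamW, k))  # roughly constant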

@bilzard
Contributor Author

bilzard commented Jan 2, 2022

@glenn-jocher

Hi, thank you for the fast response.

So the main difference between AdamW and Adam is that AdamW scales with batch-size automatically?

That's what I initially thought, but since I couldn't find any articles to support it, I concluded it was my misunderstanding.
In other words, I couldn't find any evidence that the learning rate doesn't need to be re-tuned along with the batch size just because the optimizer is changed to AdamW.

My understanding is that AdamW is an improved version of Adam with respect to the weight decay update rule [1].
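
Roughly, the difference described in [1] can be sketched as follows (single step, bias correction omitted; an illustration, not the torch implementation):

import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with classic L2 regularization: the decay term is folded into the
    # gradient, so it gets rescaled by 1/sqrt(v) along with everything else.
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return w - lr * m / (v.sqrt() + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # AdamW: decoupled weight decay, applied directly to the weights and left
    # untouched by the adaptive 1/sqrt(v) scaling [1].
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    return w - lr * (m / (v.sqrt() + eps) + wd * w), m, v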

However, I don't think it is necessary to change my PR; some people may still want to adopt AdamW for the reason above.

I didn't know about the loss scaling in SGD. Thank you for letting me know.
However, my understanding is that the current code scales the learning rate proportionally to the batch size even when using Adam/AdamW as well as SGD.

I found an article [3] on a web forum [2] that suggests scaling the learning rate proportionally to the batch size in the case of SGD. However, I could not find any evidence that the learning rate should be scaled by such a simple rule for adaptive optimizers such as Adam/AdamW.
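
For reference, the linear scaling rule for SGD discussed in [2][3] amounts to the following (illustrative helper, not something in the repo):

def linearly_scaled_lr(base_lr: float, batch_size: int, base_batch_size: int = 64) -> float:
    # Linear scaling rule for SGD: scale the learning rate in proportion to the
    # batch size. Whether this carries over to adaptive optimizers like
    # Adam/AdamW is exactly the open question in this thread.
    return base_lr * batch_size / base_batch_size

# e.g. linearly_scaled_lr(0.01, 128) -> 0.02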

I don't think many people who use YOLOv5 will be aware of such scaling at first.
I think it would be better if, when the Adam/AdamW option is used, the learning rate were not multiplied by the batch size.
What do you think?


@bilzard
Contributor Author

bilzard commented Jan 3, 2022

P.S. Looking at some of the recent issue responses, it seems that sometimes a PR is merged in the middle of a discussion.

It's good to see a quick response, but isn't it a bit risky?
Merging changes that haven't been well discussed may introduce new bugs and problems. What do you think?

@glenn-jocher
Member

@bilzard yes, the idea is that users can change --batch-size without worrying about anything else. We've done a --batch-size study in #2452 to confirm that results are essentially independent of batch size with SGD, due to the scaling we have in place in loss.py above.

With Adam and AdamW I'm not sure.

Note I also applied PR #6152 changes to the YOLOv5 classifier branch here to maintain consistency:
https://github.com/ultralytics/yolov5/blob/classifier/classifier.py

Thanks for the feedback! Usually we get criticism in the other direction, that PRs stay open too long.

@bilzard
Contributor Author

bilzard commented Jan 3, 2022

@glenn-jocher Thank you for sharing the experiment results.
Great, it has been verified for SGD. (All important changes should be verified.)
So, if it is uncertain for Adam/AdamW, then I think it is reasonable to disable learning rate scaling with respect to batch size for these optimizers. I think Adam/AdamW users will be confused if the learning rate varies with the batch size (what do you think?).

I created a PR for this change on my forked repository. Would you mind merging it into the original repository?
https://github.com/bilzard/yolov5/pull/2

Alternatively, a user could control this behavior via a command-line option like --disable-loss-scaling.
This option would give the user more freedom, since the default behavior could then be changed even when using SGD.


Note: the PR has been updated with the latter choice.
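
For clarity, an opt-out of this kind could look roughly like the sketch below (illustrative only; the flag name comes from the proposal above, and the merged implementation may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", choices=["SGD", "Adam", "AdamW"], default="SGD")
parser.add_argument("--disable-loss-scaling", action="store_true",
                    help="do not multiply the loss by the batch size")
opt = parser.parse_args()

# Later, where the loss is reduced (mirroring the return in loss.py):
#   scale = 1 if opt.disable_loss_scaling else bs
#   return (lbox + lobj + lcls) * scale, torch.cat((lbox, lobj, lcls)).detach()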

@glenn-jocher
Member

@bilzard wait there are two topics here:

  • LR scaling. This is NOT done anywhere by default.
  • Loss scaling. This is done automatically by YOLOv5 in loss.py.

The LR not adjusting automatically may be an issue, as someone will need to pair --optimizer Adam with a hyp.yaml file that has a much lower learning rate to get similar results, e.g. if lr0=0.1 for SGD then they may want to start with lr0=0.01 for Adam.
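
A hypothetical guard along these lines (not part of YOLOv5) could at least warn users about the mismatch:

def check_initial_lr(optimizer_name: str, lr0: float) -> None:
    # Hypothetical helper: warn when an Adam-family optimizer is paired with an
    # SGD-scale initial learning rate, following the rough 10x rule of thumb above.
    if optimizer_name in ("Adam", "AdamW") and lr0 >= 0.1:
        print(f"WARNING: lr0={lr0} is likely too high for {optimizer_name}; "
              f"consider starting around lr0={lr0 / 10:g}.")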

Regarding modifying the loss scaling, we'd need to repeat a few points of the batch-size study using Adam and AdamW to see the real-world results.

@bilzard
Contributor Author

bilzard commented Jan 3, 2022

@glenn-jocher O.K. I agree with the idea of repeating the same study for Adam and AdamW. I will wait for that.

@glenn-jocher
Member

@bilzard the SGD/Adam/AdamW batch-size study results are here: https://wandb.ai/glenn-jocher/study-Adam. I divided the SGD LR by 10 manually for the Adam/AdamW runs.

# VOC
for b, m in zip([16, 64, 16, 64, 16, 64], ['SGD', 'SGD', 'Adam', 'Adam', 'AdamW', 'AdamW']):  # zip(batch_size, model)
  hyp = 'hyp.finetune.yaml' if m.startswith('SGD') else 'hyp.finetuneAdam.yaml'
  !python train.py --batch {b} --weights yolov5s.pt --data VOC.yaml --epochs 50 --cache --img 512 --nosave --hyp {hyp} --project study-Adam --name {m}-{b} --optimizer {m}

Adam seems to handle batch-size changes without issue, so it seems like no changes are required:
[Screenshot: batch-size study results chart, 2022-01-03]
