The PyTorch Implementation of Scheduled (Stable) Weight Decay.
The algorithms were first proposed in our arXiv paper. A formally revised version with the theoretical mechanism, "On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective", was accepted at NeurIPS 2023.

We propose the Scheduled (Stable) Weight Decay (SWD) method to mitigate the overlooked large-gradient-norm pitfalls of weight decay in modern deep learning libraries:

- SWD can penalize large gradient norms in the final phase of training (see the sketch after this list).
- SWD usually yields significant improvements over both L2 regularization and decoupled weight decay.
- Simply fixing weight decay in Adam via SWD, with no extra hyperparameter, usually outperforms complex Adam variants that have more hyperparameters.
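To make the idea concrete, below is a minimal sketch of the kind of update an SWD-style optimizer performs: an Adam-like step followed by decoupled weight decay whose strength is rescaled each step by a global statistic of the second moments, so the effective decay adapts to the overall gradient scale. The function name and the specific normalizer (the mean of the second-moment buffers) are illustrative assumptions, not necessarily the exact AdamS formula; see `swd_optim` for the actual optimizers.

```python
import torch

@torch.no_grad()
def adam_step_with_scaled_decay(params, exp_avgs, exp_avg_sqs, step,
                                lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                                weight_decay=5e-4):
    """Illustrative sketch only: Adam-style step plus decoupled weight decay
    rescaled by a global second-moment statistic (assumed normalizer).
    Assumes p.grad is already populated for every parameter."""
    beta1, beta2 = betas
    for p, m, v in zip(params, exp_avgs, exp_avg_sqs):
        g = p.grad
        m.mul_(beta1).add_(g, alpha=1 - beta1)         # first-moment estimate
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second-moment estimate
    # Global statistic over all second-moment buffers; the effective decay
    # strength lr * weight_decay / denom then tracks the overall gradient scale.
    denom = torch.mean(torch.cat([v.flatten() for v in exp_avg_sqs])).sqrt()
    denom = denom.clamp_min(eps)
    for p, m, v in zip(params, exp_avgs, exp_avg_sqs):
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # Adam update
        p.mul_(1 - lr * weight_decay / denom)                 # rescaled decoupled decay
```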
Requirements:

- Python 3.7.3
- PyTorch >= 1.4.0
You may use it as a standard PyTorch optimizer:

```python
import swd_optim

optimizer = swd_optim.AdamS(net.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-08, weight_decay=5e-4, amsgrad=False)
```
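For context, the sketch below shows how the optimizer drops into an ordinary training loop; `net`, `train_loader`, and `num_epochs` are assumed to be defined elsewhere, and the cosine learning-rate schedule is just one common choice.

```python
import torch
import torch.nn as nn
import swd_optim

# Assumes net, train_loader, and num_epochs are defined elsewhere.
criterion = nn.CrossEntropyLoss()
optimizer = swd_optim.AdamS(net.parameters(), lr=1e-3, betas=(0.9, 0.999),
                            eps=1e-08, weight_decay=5e-4, amsgrad=False)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # step the learning-rate schedule once per epoch
```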
Test errors (%) on CIFAR-10 and CIFAR-100 (mean ± std):

Dataset | Model | AdamS | SGD M | Adam | AMSGrad | AdamW | AdaBound | Padam | Yogi | RAdam |
---|---|---|---|---|---|---|---|---|---|---|
CIFAR-10 | ResNet18 | 4.91±0.04 | 5.01±0.03 | 6.53±0.03 | 6.16±0.18 | 5.08±0.07 | 5.65±0.08 | 5.12±0.04 | 5.87±0.12 | 6.01±0.10 |
CIFAR-10 | VGG16 | 6.09±0.11 | 6.42±0.02 | 7.31±0.25 | 7.14±0.14 | 6.48±0.13 | 6.76±0.12 | 6.15±0.06 | 6.90±0.22 | 6.56±0.04 |
CIFAR-100 | DenseNet121 | 20.52±0.26 | 19.81±0.33 | 25.11±0.15 | 24.43±0.09 | 21.55±0.14 | 22.69±0.15 | 21.10±0.23 | 22.15±0.36 | 22.27±0.22 |
CIFAR-100 | GoogLeNet | 21.05±0.18 | 21.21±0.29 | 26.12±0.33 | 25.53±0.17 | 21.29±0.17 | 23.18±0.31 | 21.82±0.17 | 24.24±0.16 | 22.23±0.15 |
If you use Scheduled (Stable) Weight Decay in your work, please cite "On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective".
```
@inproceedings{xie2023onwd,
  title={On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective},
  author={Xie, Zeke and Xu, Zhiqiang and Zhang, Jingzhao and Sato, Issei and Sugiyama, Masashi},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}
```