Inconsistent computation of weight_decay and grad_residual among pytorch versions #56
Hi,

I was looking at the various versions you have in the `pypi_packages` folder and noticed that the order of computation of weight decay (which for some options modifies `grad`) and of `grad_residual` (which uses `grad`) differs for the different versions. In `adabelief_pytorch0.0.5`, `adabelief_pytorch0.2.0`, and `adabelief_pytorch0.2.1`, weight decay is done before computing `grad_residual`, but in `adabelief_pytorch0.1.0` it is done afterwards. It seems that `adabelief_pytorch0.1.0` more closely follows what your paper describes as the second-order momentum computation. Shouldn't the others be changed to align with `adabelief_pytorch0.1.0`?
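To make the ordering difference concrete, here is a minimal sketch assuming coupled (non-decoupled) weight decay; the standalone tensors are placeholders for `p.data`, `p.grad`, and the optimizer state, not the package code verbatim:

```python
import torch

# Placeholder tensors standing in for per-parameter optimizer state.
p = torch.randn(4)            # parameter
grad = torch.randn(4)         # its gradient g_t
exp_avg = torch.zeros(4)      # first moment m_t (EMA of gradients)
weight_decay = 1e-2

# Ordering in adabelief_pytorch 0.0.5 / 0.2.0 / 0.2.1: coupled weight decay
# is folded into grad first, so the decay term also enters grad_residual
# and hence the second-moment estimate.
grad_a = grad.add(p, alpha=weight_decay)
residual_a = grad_a - exp_avg

# Ordering in adabelief_pytorch 0.1.0: the residual is taken from the raw
# gradient, matching s_t = beta2*s_{t-1} + (1-beta2)*(g_t - m_t)^2 from the
# paper; decay is only folded in afterwards.
residual_b = grad - exp_avg
grad_b = grad.add(p, alpha=weight_decay)

# Nonzero whenever weight_decay != 0, i.e. the two orderings really diverge.
print((residual_a - residual_b).abs().max())
```

With `weight_decay = 0`, or with decoupled decay, the two orderings coincide, so the discrepancy only shows up when coupled decay is enabled.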
Thanks for pointing it out. This part is a bit tricky when weight decay is not decoupled. Currently, the Adam and AdamW implementations in PyTorch also do the weight decay before the update; I have not gotten into the details and just followed the convention. But I think it would be very interesting to perform a careful comparison.
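For reference, a sketch of the convention this reply points to, paraphrasing the behavior of `torch.optim.Adam` and `torch.optim.AdamW` rather than quoting their source:

```python
import torch

p = torch.randn(4)          # parameter
grad = torch.randn(4)       # gradient
lr, weight_decay = 1e-3, 1e-2

# Coupled decay (torch.optim.Adam with weight_decay > 0): the decay term is
# folded into the gradient before the moment updates, so it flows through
# the adaptive scaling -- and, in AdaBelief, into grad_residual.
grad_coupled = grad.add(p, alpha=weight_decay)

# Decoupled decay (torch.optim.AdamW): the parameter is shrunk directly and
# the raw gradient feeds the moment estimates, so the ordering question in
# this issue does not arise.
p_decayed = p.mul(1 - lr * weight_decay)
```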
Thanks for the speedy reply. I had looked through your project page, and in the presentation there was a comment from a user that gradient clipping caused problems for the algorithm, but that it worked well when he turned gradient clipping off. So I was thinking that you don't want to affect the values stored for the gradients in any way. Also, a bit of a newbie question WRT PyTorch: I noticed that you had a number of …
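On the clipping anecdote: the standard PyTorch clipping utility rescales gradients in place, which would explain why it interferes with anything computed from the stored gradients. A small sketch with a toy model:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
loss = model(torch.randn(3, 8)).sum()
loss.backward()

# clip_grad_norm_ mutates every p.grad in place, so the optimizer -- and
# AdaBelief's grad - exp_avg residual -- only ever sees the rescaled values.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```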
Hi, thanks for the comments. The gradient clip could be problematic, say, it could suppress the difference `grad - exp_avg`.

For the `eps`: the in-place add of eps is slightly different from Adam, and also the default is eps=1e-16 for AdaBelief but 1e-8 for Adam. In AdaBelief, eps is added both inside and outside the sqrt, but the one outside the sqrt can be safely ignored (sqrt(1e-16) >> 1e-16). Because the add is in place, that eps gets added to the stored `exp_avg_var` at every step. Currently, I would suggest keeping …

PS: I noticed your comments on the Newton-style modification. It looks very interesting to me. I'll take a careful look and perhaps test it on larger NLP tasks later, but for now I'm a bit busy with other stuff and cannot pursue it very deeply. But you are very welcome to post a comment here.
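To spell out the `eps` comparison, a sketch of the two denominators, paraphrasing the published AdaBelief update; `bias_correction2` here is a placeholder value rather than one computed from a real step count:

```python
import math
import torch

eps = 1e-16                      # AdaBelief default; Adam's default is 1e-8
exp_avg_var = torch.rand(4)      # s_t, the EMA of (g_t - m_t)^2
bias_correction2 = 0.99          # placeholder for 1 - beta2**t

# Adam-style denominator: eps appears only outside the sqrt.
denom_adam = (exp_avg_var.sqrt() / math.sqrt(bias_correction2)).add(1e-8)

# AdaBelief-style denominator: eps is added in place to exp_avg_var (inside
# the sqrt) and again outside. With eps = 1e-16, sqrt(1e-16) = 1e-8 dominates
# the outer 1e-16 term, which is why the outer add can be safely ignored.
denom_belief = (exp_avg_var.add_(eps).sqrt() / math.sqrt(bias_correction2)).add(eps)
```

Note that the inner `add_` is in place, so when it runs on the persistent state tensor the eps accumulates into the stored variance across steps, which appears to be the in-place behavior the reply flags.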
Thanks for helping me understand the application of `eps`. WRT the Newton-style modification in my other reply and my question about the initialization of …