
Adam: A method for stochastic optimization #2

standing-o opened this issue Dec 12, 2021 · 0 comments

Adam: A method for stochastic optimization

Abstract

  • Adam is...
    ➔ An algorithm for first-order, gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
    ➔ Straightforward to implement, computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.
    ➔ Appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.
    ➔ Its hyperparameters have intuitive interpretations and typically require little tuning.

Introduction

  • If an objective function to be maximized or minimized with respect to its parameters is differentiable, gradient descent is a relatively efficient optimization method, since computing the first-order partial derivatives has the same computational complexity as just evaluating the function.
  • Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
    ➔ Designed to combine the advantages of AdaGrad (which handles sparse gradients well) and RMSProp (which handles non-stationary objectives well).
    Advantages: the magnitudes of parameter updates are invariant to rescaling of the gradient (see the note after this list), and its stepsizes are approximately bounded by the stepsize hyperparameter.
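    ➔ A quick way to see the rescaling invariance, using the moment estimates defined in the algorithm below: scaling the gradients g by a factor c scales m̂t by c and v̂t by c², which cancel in the update, (c · m̂t) / √(c² · v̂t) = m̂t / √v̂t (ignoring ε).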

Adam algorithm pseudo-code
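
The Algorithm 1 pseudo-code itself isn't reproduced above, so here is a minimal NumPy sketch of it, using the paper's default settings α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸; `grad_fn` is a hypothetical callable standing in for the stochastic gradient of the objective.

```python
import numpy as np

def adam(theta, grad_fn, num_steps, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of the Adam update loop (Algorithm 1 in the paper)."""
    m = np.zeros_like(theta)  # exponential moving average of the gradient (1st moment)
    v = np.zeros_like(theta)  # exponential moving average of the squared gradient (2nd moment)
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                    # stochastic gradient at timestep t
        m = beta1 * m + (1 - beta1) * g       # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2  # biased 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias-corrected 1st moment estimate
        v_hat = v / (1 - beta2 ** t)          # bias-corrected 2nd moment estimate
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta

# Toy usage: minimize f(theta) = ||theta||^2 from noisy gradient evaluations.
rng = np.random.default_rng(0)
noisy_grad = lambda th: 2 * th + 0.1 * rng.standard_normal(th.shape)
print(adam(np.array([1.0, -2.0]), noisy_grad, num_steps=5000, alpha=0.05))
```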

Adam algorithm

  • The algorithm updates exponential moving averages of the gradient (m) and the squared gradient (v), where the hyperparameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages.
  • These moving averages are initialized as 0, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e., the βs are close to 1).

Initialization bias correction

  • Let us initialize the exponential moving average as v0 = 0; then vt can be written as a function of the gradients at all previous timesteps:

        vt = (1 − β2) · Σ_{i=1..t} β2^(t−i) · gi²

    Taking expectations gives E[vt] = E[gt²] · (1 − β2^t) + ζ,
    where ζ = 0 if the true second moment E[gi²] is stationary (otherwise ζ can be kept small).
  • We divide by (1 − β2^t) to correct the initialization bias.
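
A small numeric check of this correction (a sketch, assuming a constant gradient g so that the true second moment is exactly g² and the bias from the zero initialization is easy to see):

```python
beta2, g = 0.999, 3.0              # decay rate and a constant (stationary) gradient
v = 0.0                            # v0 = 0, as in the algorithm
for t in range(1, 6):
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)   # divide by (1 - beta2^t)
    print(t, round(v, 5), round(v_hat, 5))
# The raw average v stays close to 0 in the first steps (biased towards its zero init),
# while the bias-corrected v_hat equals the true second moment g^2 = 9.0 at every step.
```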

Experiment

  • Logistic regression training negative log likelihood on MNIST images
  • Training of multilayer neural networks on MNIST images
  • Convolutional neural networks training cost on CIFAR-10