Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Abstract

  • Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
  • This slows down training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
    ➔ This phenomenon is called internal covariate shift.
  • Batch normalization: normalize layer inputs for each training mini-batch.

Introduction

SGD Optimizer

  • The gradient of the loss over a mini-batch is used as an estimate of the gradient over the full training set (see the formulas after this list).

  • It requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values.

  • For a sigmoid activation z = g(Wu + b), as |x| increases, g'(x) tends to zero (the saturation regime of the nonlinearity).
    ➔ The gradient flowing down to u will vanish and the model will train slowly.

  • Batch normalization: if the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer is less likely to get stuck in the saturated regime and training accelerates.
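
For reference, two formulas behind the bullets above: the mini-batch gradient that SGD uses as an estimate of the gradient over the full training set of N examples (m is the mini-batch size), and the sigmoid nonlinearity whose derivative vanishes in the saturation regime.

```math
\frac{1}{m}\sum_{i=1}^{m}\frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
\;\approx\;
\frac{\partial}{\partial \Theta}\left[\frac{1}{N}\sum_{i=1}^{N}\ell(x_i, \Theta)\right]
```

```math
g(x) = \frac{1}{1+e^{-x}}, \qquad g'(x) = g(x)\bigl(1-g(x)\bigr) \to 0 \quad \text{as } |x| \to \infty
```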

Internal Covariate Shift

  • The change in the distributions of internal nodes of a deep network, in the course of training.
  • Batch normalization takes a step towards reducing internal covariate shift.
    ➔ Accelerating the training of deep neural nets
    ➔ Reducing the dependence of gradients on the scale of the parameters or of their initial values, i.e. allowing us to use much higher learning rates
    ➔ Regularizing the model and reducing the need for Dropout
    ➔ Using saturating nonlinearities by preventing the network from getting stuck in the saturated modes

Normalization via mini-batch statistics

  1. Normalize each scalar feature independently, making it have zero mean and unit variance.
  • Normalize each dimension as x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)], where the expectation and variance are computed over the training data set.
  • Simply normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity, which may change what the layer can represent.
    • For each activation x(k), a pair of trainable parameters γ(k), β(k) scales and shifts the normalized value: y(k) = γ(k) x̂(k) + β(k).
    • These parameters can recover the original activations, if that were the optimal thing to do.
  2. Since mini-batches are used in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation (see the sketch after this list).
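
A minimal NumPy sketch of the training-time Batch Normalizing Transform described above (the names `batchnorm_forward`, `gamma`, `beta` are mine, not from the paper): normalize each activation with the mini-batch mean and variance, then scale and shift with the trainable γ, β.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a mini-batch x of shape (m, d).

    gamma, beta: trainable per-activation scale and shift, shape (d,).
    """
    mu = x.mean(axis=0)                      # mini-batch mean per activation
    var = x.var(axis=0)                      # mini-batch variance per activation
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    y = gamma * x_hat + beta                 # scale and shift (can recover the original activation)
    return y, mu, var

# Example: a mini-batch of 60 examples with 100 activations
x = np.random.randn(60, 100)
gamma, beta = np.ones(100), np.zeros(100)
y, mu, var = batchnorm_forward(x, gamma, beta)
```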

Training and Inference with Batch-Normalized Networks

  • The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference.
  • We want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization with the population statistics E[x] and Var[x] rather than mini-batch statistics.
    ➔ Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation (see the sketch below).
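
A sketch of the inference-time transform under the same assumptions; here `pop_mean` and `pop_var` stand for the population statistics E[x] and Var[x] estimated during training (e.g. as moving averages over mini-batches), so BN collapses into a fixed per-activation linear transform.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a deterministic linear transform per activation."""
    scale = gamma / np.sqrt(pop_var + eps)   # fixed per-activation scale
    shift = beta - scale * pop_mean          # fixed per-activation shift
    return scale * x + shift
```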

Batch-normalized convolutional networks

  • We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b; the bias b can be omitted since its effect is subsumed by β.
  • For convolutional layers, normalization is applied per feature map, jointly over all locations in the mini-batch, so there is one pair of parameters γ, β per feature map (see the sketch below).
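
A sketch of the convolutional case, assuming feature maps stored as (N, C, H, W): statistics are computed per feature map, jointly over the batch and all spatial locations.

```python
import numpy as np

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: one scale/shift per feature map, shape (C,)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # mean over batch and spatial locations
    var = x.var(axis=(0, 2, 3), keepdims=True)    # variance over batch and spatial locations
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```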

Batch Normalization enables higher learning rates

  • Batch Normalization makes training more resilient to the parameter scale.
  • Back-propagation through a layer is unaffected by the scale of its parameters; moreover, larger weights lead to smaller gradients, so Batch Normalization stabilizes the parameter growth (see the formulas below).
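
The scale-invariance argument from the paper, for a scalar a:

```math
\mathrm{BN}(Wu) = \mathrm{BN}((aW)u), \qquad
\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\,\mathrm{BN}(Wu)}{\partial u}, \qquad
\frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}
```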

Experiment

  • Batch normalization makes the distributions of activations more stable over the course of training and reduces the internal covariate shift.
  • Test accuracy on MNIST: the batch-normalized network trains faster and reaches higher test accuracy than the same network without batch normalization.