Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
➔ Internal covariate shift
Batch normalization: normalizing layer inputs for each training mini-batch
Introduction
SGD Optimizer
The gradient of the loss is approximated by its gradient over a mini-batch of training examples.
It requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values.
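Roughly in the paper's notation, with a training set of N examples and loss ℓ, a mini-batch of size m gives an estimate of the gradient of the full training loss:

```latex
\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta),
\qquad
\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
\;\approx\;
\frac{\partial}{\partial \Theta} \left[ \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta) \right]
```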
Consider a layer with the sigmoid activation function z = g(Wu + b), where g(x) = 1/(1 + exp(−x)). As |x| increases, g′(x) tends to zero (the saturation regime of the nonlinearity).
➔ The gradient flowing down to u will vanish and the model will train slowly
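A quick numerical illustration of this saturation (a NumPy sketch, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# g'(x) shrinks rapidly as |x| grows, so pre-activations that drift into
# the saturated regime pass almost no gradient back to u.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}   g'(x) = {sigmoid_grad(x):.6f}")
```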
Batch normalization: if the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer is less likely to get stuck in the saturated regime, and training accelerates.
Internal Covariate Shift
The change in the distributions of internal nodes of a deep network, in the course of training.
Batch normalization takes a step towards reducing internal covariate shift.
➔ Accelerating the training of deep neural nets
➔ Reducing the dependence of gradients on the scale of the parameters or of their initial values, i.e. allowing us to use much higher learning rates
➔ Regularizing the model and reducing the need for Dropout
➔ Using saturating nonlinearities by preventing the network from getting stuck in the saturated modes
Normalization via mini-batch statistics
Normalize each scalar feature independently, by making it have zero mean and unit variance.
Normalize each dimension: x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)], where the expectation and variance are computed over the training data set.
Note that simply normalizing each input of a layer may change what the layer can represent; for instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
To address this, for each activation x(k) we introduce a pair of trainable parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k) x̂(k) + β(k).
These parameters can recover the original activations, if that were the optimal thing to do (e.g. γ(k) = √Var[x(k)] and β(k) = E[x(k)]).
Since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
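A minimal NumPy sketch of the training-time transform for one mini-batch, following Algorithm 1 of the paper (the function name and ε value are my own choices):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalizing transform for a mini-batch x of shape (m, d).

    Each of the d activations is normalized using its own mini-batch
    mean/variance, then scaled and shifted by the learned gamma/beta.
    """
    mu = x.mean(axis=0)                        # per-dimension mini-batch mean
    var = x.var(axis=0)                        # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize to ~zero mean, unit variance
    y = gamma * x_hat + beta                   # scale and shift: y(k) = γ(k) x̂(k) + β(k)
    return y, mu, var

# Example: a mini-batch of 32 examples with 4 activations each.
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))           # ≈ 0 and ≈ 1 per dimension
```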
Training and inference with batch-normalized networks
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference.
We want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization with the population statistics (e.g. moving averages of the mini-batch means and variances collected during training) instead of the mini-batch statistics.
➔ Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
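A matching inference-time sketch; the running-statistics names are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def batch_norm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a fixed linear transform applied to each activation.

    pop_mean / pop_var are population estimates accumulated during training
    (e.g. moving averages of the mini-batch statistics), so the output
    depends only on the input, deterministically.
    """
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# Equivalently, fold everything into a single scale and shift per activation:
#   y = a * x + c,  with a = gamma / sqrt(pop_var + eps) and c = beta - a * pop_mean.
```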
Batch-normalized convolutional networks
We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b (the bias b can be dropped, since its effect is subsumed by the learned shift β). For convolutional layers, each feature map is normalized jointly over all locations in the mini-batch, with one pair of parameters γ(k), β(k) per feature map, so that the normalization obeys the convolutional property.
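A sketch of the per-feature-map variant for a convolutional activation of shape (N, C, H, W); the NCHW layout is an assumption:

```python
import numpy as np

def batch_norm_conv_train(x, gamma, beta, eps=1e-5):
    """BN for conv feature maps, x of shape (N, C, H, W).

    Statistics are computed per channel, jointly over the batch and all
    spatial locations, giving one gamma/beta per feature map.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)      # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```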
Batch Normalization enables higher learning rates
Batch Normalization makes training more resilient to the parameter scale.
Back-propagation through a batch-normalized layer is unaffected by the scale of its parameters: BN(Wu) = BN((aW)u), so the scale does not change the layer Jacobian. Moreover, larger weights lead to smaller gradients with respect to those weights, and batch normalization will stabilize the parameter growth.
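A quick numerical check of this scale invariance (a self-contained NumPy sketch; names and sizes are mine):

```python
import numpy as np

def bn(x, eps=1e-5):
    # Normalize each activation over the mini-batch (gamma = 1, beta = 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a = 10.0                          # arbitrary rescaling of the weights
W = np.random.randn(4, 8)         # 8 inputs -> 4 activations
u = np.random.randn(32, 8)        # mini-batch of 32 input vectors

# BN(Wu) == BN((aW)u): the factor a cancels in the normalization
# (up to the tiny eps added for numerical stability).
print(np.allclose(bn(u @ W.T), bn(u @ (a * W).T), atol=1e-5))   # True
```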
Experiment
Batch normalization makes the distribution more stable and reduces the internal covariate shift.
Test accuracy on MNIST: the batch-normalized network trains faster and reaches higher test accuracy than the baseline.