Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
➔ Internal covariate shift
Batch normalization: normalizing layer inputs for each training mini-batch
Introduction
SGD Optimizer
The gradient of the loss is approximated by its gradient over a mini-batch of training examples.
It requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values.
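Roughly in the paper's notation, with a training set of N examples and loss ℓ, a mini-batch of size m gives an estimate of the gradient of the full training loss:

```latex
\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta),
\qquad
\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
\;\approx\;
\frac{\partial}{\partial \Theta} \left[ \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta) \right]
```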
Consider a layer with the sigmoid activation function z = g(Wu + b), where g(x) = 1/(1 + exp(−x)). As |x| increases, g′(x) tends to zero (the saturation regime of the nonlinearity).
➔ The gradient flowing down to u will vanish and the model will train slowly
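A quick numerical illustration of this saturation (a NumPy sketch, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# g'(x) shrinks rapidly as |x| grows, so pre-activations that drift into
# the saturated regime pass almost no gradient back to u.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}   g'(x) = {sigmoid_grad(x):.6f}")
```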
Batch normalization: if the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer is less likely to get stuck in the saturated regime, and training accelerates.
Internal Covariate Shift
The change in the distributions of internal nodes of a deep network, in the course of training.
Batch normalization takes a step towards reducing internal covariate shift.
➔ Accelerating the training of deep neural nets
➔ Reducing the dependence of gradients on the scale of the parameters or of their initial values, i.e. allowing us to use much higher learning rates
➔ Regularizing the model and reducing the need for Dropout
➔ Using saturating nonlinearities by preventing the network from getting stuck in the saturated modes
Normalization via mini-batch statistics
Normalize each scalar feature independently, by making it have zero mean and unit variance.
Normalize each dimension: x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)], where the expectation and variance are computed over the training data set.
Note that simply normalizing each input of a layer may change what the layer can represent; for instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
To address this, for each activation x(k) we introduce a pair of trainable parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k) x̂(k) + β(k).
These parameters can recover the original activations, if that were the optimal thing to do (e.g. γ(k) = √Var[x(k)] and β(k) = E[x(k)]).
Since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
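A minimal NumPy sketch of the training-time transform for one mini-batch, following Algorithm 1 of the paper (the function name and ε value are my own choices):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalizing transform for a mini-batch x of shape (m, d).

    Each of the d activations is normalized using its own mini-batch
    mean/variance, then scaled and shifted by the learned gamma/beta.
    """
    mu = x.mean(axis=0)                        # per-dimension mini-batch mean
    var = x.var(axis=0)                        # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize to ~zero mean, unit variance
    y = gamma * x_hat + beta                   # scale and shift: y(k) = γ(k) x̂(k) + β(k)
    return y, mu, var

# Example: a mini-batch of 32 examples with 4 activations each.
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))           # ≈ 0 and ≈ 1 per dimension
```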
Training and inference with batch-normalized networks
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference.
We want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization with the population statistics (e.g. moving averages of the mini-batch means and variances collected during training) instead of the mini-batch statistics.
➔ Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
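A matching inference-time sketch; the running-statistics names are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def batch_norm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: a fixed linear transform applied to each activation.

    pop_mean / pop_var are population estimates accumulated during training
    (e.g. moving averages of the mini-batch statistics), so the output
    depends only on the input, deterministically.
    """
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# Equivalently, fold everything into a single scale and shift per activation:
#   y = a * x + c,  with a = gamma / sqrt(pop_var + eps) and c = beta - a * pop_mean.
```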
Batch-normalized convolutional networks
We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b (the bias b can be dropped, since its effect is subsumed by the learned shift β). For convolutional layers, each feature map is normalized jointly over all locations in the mini-batch, with one pair of parameters γ(k), β(k) per feature map, so that the normalization obeys the convolutional property.
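A sketch of the per-feature-map variant for a convolutional activation of shape (N, C, H, W); the NCHW layout is an assumption:

```python
import numpy as np

def batch_norm_conv_train(x, gamma, beta, eps=1e-5):
    """BN for conv feature maps, x of shape (N, C, H, W).

    Statistics are computed per channel, jointly over the batch and all
    spatial locations, giving one gamma/beta per feature map.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)      # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```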
Batch Normalization enables higher learning rates
Batch Normalization makes training more resilient to the parameter scale.
Back-propagation through a batch-normalized layer is unaffected by the scale of its parameters: BN(Wu) = BN((aW)u), so the scale does not change the layer Jacobian. Moreover, larger weights lead to smaller gradients with respect to those weights, and batch normalization will stabilize the parameter growth.
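A quick numerical check of this scale invariance (a self-contained NumPy sketch; names and sizes are mine):

```python
import numpy as np

def bn(x, eps=1e-5):
    # Normalize each activation over the mini-batch (gamma = 1, beta = 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a = 10.0                          # arbitrary rescaling of the weights
W = np.random.randn(4, 8)         # 8 inputs -> 4 activations
u = np.random.randn(32, 8)        # mini-batch of 32 input vectors

# BN(Wu) == BN((aW)u): the factor a cancels in the normalization
# (up to the tiny eps added for numerical stability).
print(np.allclose(bn(u @ W.T), bn(u @ (a * W).T), atol=1e-5))   # True
```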
Experiment
Batch normalization makes the distribution more stable and reduces the internal covariate shift.
Test accuracy on MNIST: the batch-normalized network trains faster and reaches higher test accuracy than the baseline.