Batch Norm: Further Documentation and Simplified Definition #4704
Conversation
A bias/scaling can be applied wherever desired by defining the respective layers, and `ScaleLayer` can handle both as a memory optimization.
Force-pushed from 8588b94 to 2e61ef1
@shelhamer Thanks a lot for these clarifications. When I started using Caffe's BatchNorm a few weeks ago, it took me quite some time to get all these things clear by myself. Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance. This commit switched the calculation of Var(X) from EX^2 - (EX)^2 to E(X-EX)^2. Unfortunately, in each forward pass, the estimator m_b for EX is based only on the values of the current mini-batch (a rather bad estimator compared to the moving mean accumulated in the layer's mean blob). In contrast, @cdoersch's original implementation stored the moving average of EX^2 in the variance blob, from which an accurate global variance can be recovered. This becomes especially an issue if you want to compute high-accuracy estimators of mean and variance for further finetuning or inference.
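A toy illustration of the point above (plain Python, not Caffe code, and using simple averages rather than Caffe's exponential moving averages): averaging per-batch central variances E[(X - m_b)^2] misses the between-batch variance, while averaging EX and EX^2 separately and combining them recovers the pooled variance.

```python
# Two mini-batches with very different means.
batches = [[0.0, 1.0], [10.0, 11.0]]

def mean(xs):
    return sum(xs) / len(xs)

# Scheme 1 (post-0ad1d8a): average the per-batch central variances.
per_batch_var = mean([mean([(x - mean(b)) ** 2 for x in b]) for b in batches])

# Scheme 2 (original): average E[X] and E[X^2] separately, then combine.
avg_mean = mean([mean(b) for b in batches])
avg_sq = mean([mean([x * x for x in b]) for b in batches])
combined_var = avg_sq - avg_mean ** 2

# Ground truth: variance over all samples pooled together.
all_x = [x for b in batches for x in b]
true_var = mean([(x - mean(all_x)) ** 2 for x in all_x])

print(per_batch_var)  # 0.25  -- only the within-batch variance survives
print(combined_var)   # 25.25 -- matches the pooled variance
print(true_var)       # 25.25
```

The gap between 0.25 and 25.25 is exactly the between-batch variance that the E[(X - m_b)^2] scheme can never accumulate.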
* use_global_stats option. IMPORTANT: for this feature to work, you MUST set
* the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
* times in the layer definition. For reference, these three blobs are (0)
* mean, (1) variance and (2) m, the correction for the batch size.
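For concreteness, a sketch of the old-style layer definition this comment describes (the layer and blob names here are made up for illustration):

```
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # Old-style masking of the statistics from the solver:
  # one param message per blob (mean, variance, moving-average factor).
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
```

This PR makes the three `param { lr_mult: 0 }` messages unnecessary (and in fact disallowed), since the layer now masks its statistics itself.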
Isn't blob[2] the cumulative moving-average factor rather than a correction for the batch size?
It is updated as b(n+1) = alpha*b(n) + 1 with b(0) = 0.
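The update above is a geometric sum: after n updates, b(n) = 1 + alpha + ... + alpha^(n-1), i.e. the total weight accumulated by the exponential moving average, which the stored statistics are divided by at inference time. A quick Python sketch (the value 0.999 is Caffe's default moving_average_fraction, assumed here):

```python
alpha = 0.999  # moving_average_fraction; 0.999 is Caffe's documented default

b = 0.0
for n in range(1, 6):
    b = alpha * b + 1.0
    # closed form of the geometric sum: (1 - alpha^n) / (1 - alpha)
    assert abs(b - (1 - alpha ** n) / (1 - alpha)) < 1e-9

print(b)  # after 5 updates: 1 + alpha + alpha^2 + alpha^3 + alpha^4
```

As n grows, b approaches 1 / (1 - alpha), so with alpha = 0.999 the factor saturates near 1000.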
Right, it is. Comment updated accordingly.
batch norm statistics are not learnable parameters subject to solver updates, so they must be shielded from the solver. `BatchNorm` layer now masks its statistics for itself by zeroing parameter learning rates instead of relying on the layer definition. n.b. declaring `param`s for batch norm layers is no longer allowed.
automatically strip old batch norm layer definitions including `param` messages. the batch norm layer used to require manually masking its state from the solver by setting `param { lr_mult: 0 }` messages for each of its statistics. this is now handled automatically by the layer.
Force-pushed from 2e61ef1 to a8ec123
Re: #3299 I'll try to review this once I have time for another round of batch norm reform. Note that it was addressing a numerical instability issue, so the solution isn't as simple as just reverting it.
@shelhamer This kind of numerical issue only occurs if N is extremely large and, at the same time, the variance is very small. Btw: the epsilon correction is missing in MVNLayer (where this kind of numerical problem had its origin, see #3162)
@shelhamer I think there is still an issue with the batch norm upgrade: if we pass an upgraded net_param into the Net constructor, we receive back a net which needs the batch norm upgrade again (because the first round of LayerSetUp will add 3 params, and a second upgrade will refuse to work unless all params are removed). Unless this is intended behaviour, I would suggest changing those lines into something like:
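A hypothetical Python sketch of the idempotency property this comment asks for (not Caffe's actual C++ upgrade code; the dict-based layer schema and function names are invented for illustration): the upgrade only fires when there is something to strip, so running it on an already-upgraded net is a no-op.

```python
import copy

def needs_batch_norm_upgrade(layer):
    # Only old-style definitions carry explicit param messages.
    return layer["type"] == "BatchNorm" and len(layer.get("param", [])) > 0

def upgrade_batch_norm(net):
    for layer in net:
        if needs_batch_norm_upgrade(layer):
            layer["param"] = []  # strip the old lr_mult: 0 masks
    return net

# An old-style BatchNorm layer with three param messages.
net = [{"type": "BatchNorm", "param": [{"lr_mult": 0}] * 3}]
upgraded = upgrade_batch_norm(copy.deepcopy(net))
again = upgrade_batch_norm(copy.deepcopy(upgraded))
print(again == upgraded)  # True: the second upgrade changed nothing
```

The key design choice is that the "needs upgrade" predicate is checked per layer and becomes false after the first pass, rather than the upgrade erroring out on params it added itself.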
@bwilbertz thanks for raising this, I will look into it soon.
The current state of batch norm is somewhat unclear and requires a tedious specification of learning rates for correctness. This PR documents the batch norm layer in more detail, clarifying its blobs and how to handle the bias and shift, and ensures that the batch norm statistics are not mistakenly mangled by the solver. As the mean, variance, and bias correction are not learnable parameters to be optimized, they should not be updated by the solver, and this PR enforces this exclusion.
A further PR could optionally fold the scale and shift into the batch norm layer (for now they are handled by a separate layer), which would align with the cuDNN interface, but this PR is helpful in itself to avoid accidents.
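The split described here can be sketched in plain Python over a single channel (a toy model, not Caffe code: `batch_norm` stands in for the BatchNorm layer's normalization and `scale_shift` for a Scale layer with `bias_term: true`; eps plays the role of the layer's epsilon parameter):

```python
def batch_norm(xs, eps=1e-5):
    # Normalize to zero mean, unit variance (training-mode batch statistics).
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / (v + eps) ** 0.5 for x in xs]

def scale_shift(xs, gamma, beta):
    # The learnable affine transform, kept in a separate layer.
    return [gamma * x + beta for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
ys = scale_shift(batch_norm(xs), gamma=2.0, beta=0.5)
# The normalized output has zero mean, so after scale/shift the mean is beta.
print(sum(ys) / len(ys))  # 0.5
```

Folding `scale_shift` into `batch_norm` as optional learnable parameters is exactly what the cuDNN interface does, which is why a later PR merging the two would align with it.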