
Batch Norm: Further Documentation and Simplified Definition #4704

Merged
shelhamer merged 4 commits into BVLC:master from the groom-batch-norm branch on Sep 16, 2016

Conversation

shelhamer
Member

@shelhamer shelhamer commented Sep 10, 2016

The current state of batch norm is somewhat unclear and requires a tedious specification of learning rates for correctness. This PR documents the batch norm layer in more detail, clarifying its blobs and how to handle the scale and shift, and ensures that the batch norm statistics are not mistakenly mangled by the solver. As the mean, variance, and bias correction are not learnable parameters to be optimized, they should not be updated by the solver, and this PR enforces this exclusion.

A further PR could optionally fold the scale and shift into the batch norm layer itself (for now they are handled by a separate layer), which would align with the cuDNN interface, but this PR is useful on its own to avoid accidents.
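
For illustration, here is a minimal pycaffe NetSpec sketch of the definition pattern this leaves us with (the layer names, shapes, and hyperparameters below are only examples, not part of this PR): a `BatchNorm` layer followed by a `Scale` layer with `bias_term: true`, and no `param { lr_mult: 0 }` entries for the statistics.

    # Minimal sketch (pycaffe NetSpec); layer names and shapes are illustrative only.
    import caffe
    from caffe import layers as L

    n = caffe.NetSpec()
    n.data = L.Input(shape=[dict(dim=[1, 3, 224, 224])])
    n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1)
    # BatchNorm now keeps its mean/variance/moving-average-factor blobs
    # out of the solver by itself; no param { lr_mult: 0 } entries are needed.
    n.bn1 = L.BatchNorm(n.conv1)
    # The learnable scale and shift live in a separate ScaleLayer;
    # bias_term folds the shift into the same layer as a memory optimization.
    n.scale1 = L.Scale(n.bn1, bias_term=True)
    print(n.to_proto())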

A bias/scaling can be applied wherever desired by defining the
respective layers, and `ScaleLayer` can handle both as a memory
optimization.
@bwilbertz
Contributor

@shelhamer Thanks a lot for these clarifications. When I started using Caffe's BatchNorm a few weeks ago, it took me quite some time to work all of this out by myself.

Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance.

This commit switched the calculation of Var(X) from E[X^2] - (E[X])^2 to E[(X - E[X])^2]. Unfortunately, in each forward pass the estimator m_b for E[X] is based only on the values of the current mini-batch (a rather poor estimator compared to the moving mean in blob[0]), and the empirical mean of the non-linear transformation (X - m_b)^2 is then added to the global stats.

In contrast, @cdoersch 's original implementation stored E[X^2] in blob[1], so that the final estimator for Var(X) was computed as E[X^2] - m^2, where m is the global estimator for E[X] computed over all batches (modulo the moving average factor).

This becomes especially an issue if you want to compute high-accuracy estimates of the mean and variance for further fine-tuning or inference and set moving_average_fraction = 1 (similar to what Kaiming He does for his ResNets, cf. point 5 in "Disclaimer and known issues").
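
To make the difference concrete, here is a NumPy-only sketch, not the actual Caffe code path: it uses a single scalar channel and a deliberately tiny batch size so the bias is visible, and treats moving_average_fraction = 1 as plain sums later divided by the batch count.

    # Illustrative sketch: accumulate statistics over many mini-batches as if
    # moving_average_fraction = 1, i.e. plain sums divided by the batch count.
    import numpy as np

    rng = np.random.RandomState(0)
    true_var = 4.0
    batch_size, num_batches = 4, 100000

    sum_mean = 0.0   # running sum of per-batch means (blob[0]-style)
    sum_sqdev = 0.0  # current scheme: sum of per-batch E[(X - m_b)^2]
    sum_sq = 0.0     # original scheme: sum of per-batch E[X^2]

    for _ in range(num_batches):
        x = rng.normal(loc=10.0, scale=np.sqrt(true_var), size=batch_size)
        m_b = x.mean()                        # mini-batch mean, a noisy estimator
        sum_mean += m_b
        sum_sqdev += np.mean((x - m_b) ** 2)  # uses the noisy m_b inside the square
        sum_sq += np.mean(x ** 2)

    m = sum_mean / num_batches                  # global mean estimate over all batches
    var_current = sum_sqdev / num_batches       # biased low by (batch_size - 1) / batch_size
    var_original = sum_sq / num_batches - m**2  # E[X^2] - m^2 with the global mean

    print(var_current)   # ~3.0 for true_var = 4.0 and batch_size = 4
    print(var_original)  # ~4.0

In Caffe the per-batch statistics are taken over N*H*W values per channel, so the bias is smaller in practice, but the accumulated E[(X - m_b)^2] still never sees the global mean.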

* use_global_stats option. IMPORTANT: for this feature to work, you MUST set
* the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
* times in the layer definition. For reference, these three blobs are (0)
* mean, (1) variance and (2) m, the correction for the batch size.
Contributor

Isn't blob[2] the cumulative moving-average factor rather than a correction for the batch size?

It is updated as b(n+1) = alpha*b(n) + 1 with b(0) = 0.
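
For concreteness, a short Python sketch of this recursion (not the layer code), assuming blob[0] and blob[1] are accumulated with the same factor and divided by blob[2] when the global stats are used:

    # b(n+1) = alpha * b(n) + 1 with b(0) = 0, where alpha = moving_average_fraction.
    def cumulative_factor(alpha, n):
        b = 0.0
        for _ in range(n):
            b = alpha * b + 1.0
        return b

    # For alpha < 1 the factor converges to 1 / (1 - alpha), the total weight of the
    # exponentially averaged statistics it is meant to normalize.
    print(cumulative_factor(0.999, 10000))  # ~999.95, i.e. about 1 / (1 - 0.999)
    # For alpha = 1 it simply counts the batches seen, so dividing the accumulated
    # sums by it yields a plain average over all batches.
    print(cumulative_factor(1.0, 10000))    # 10000.0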

Member Author

Right, it is. Comment updated accordingly.

batch norm statistics are not learnable parameters subject to solver
updates, so they must be shielded from the solver. `BatchNorm` layer now
masks its statistics for itself by zeroing parameter learning rates
instead of relying on the layer definition.

n.b. declaring `param`s for batch norm layers is no longer allowed.
automatically strip old batch norm layer definitions including `param`
messages. the batch norm layer used to require manually masking its
state from the solver by setting `param { lr_mult: 0 }` messages for
each of its statistics. this is now handled automatically by the layer.
@shelhamer
Member Author

@bwilbertz

> Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance.
>
> This commit switched the calculation of Var(X) from E[X^2] - (E[X])^2 to E[(X - E[X])^2]. Unfortunately, in each forward pass the estimator m_b for E[X] is based only on the values of the current mini-batch (a rather poor estimator compared to the moving mean in blob[0]), and the empirical mean of the non-linear transformation (X - m_b)^2 is then added to the global stats.

Re: #3299 I'll try to review this once I have time for another round of batch norm reform. Note that it was addressing a numerical instability issue, so the solution isn't as simple as just reverting it.

@bwilbertz
Contributor

@shelhamer This kind of numerical issue only occurs when N is extremely large and, at the same time, the variance is very small.
But for that case the epsilon parameter was introduced into batch norm, so one can easily avoid any numerical catastrophe by increasing epsilon (in case someone is really piping more or less constant data into the net).

Btw: the epsilon correction is missing in MVNLayer (where this kind of numerical problem had its origin, see #3162).
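
As a NumPy-only illustration of the cancellation in question (not Caffe code; the offset, sample count, and epsilon value are arbitrary): with a large mean and tiny variance in float32, E[X^2] - (E[X])^2 drowns in rounding noise and can even come out negative, while E[(X - E[X])^2] stays non-negative, and epsilon keeps the normalization denominator away from zero.

    # float32 data with a large mean and tiny variance: the naive formula cancels.
    import numpy as np

    rng = np.random.RandomState(0)
    x = (1000.0 + 1e-3 * rng.randn(1 << 20)).astype(np.float32)  # Var(X) ~ 1e-6

    ex = np.float32(x.mean(dtype=np.float32))
    ex2 = np.float32(np.mean(np.square(x), dtype=np.float32))

    var_naive = ex2 - ex * ex                 # E[X^2] - (E[X])^2: lost in rounding noise
    var_stable = np.mean(np.square(x - ex))   # E[(X - E[X])^2]: stays non-negative

    eps = 1e-5
    print(var_naive, var_stable)              # naive value may be ~0 or even negative
    print(np.sqrt(max(var_naive, 0.0) + eps)) # epsilon keeps the denominator non-zero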

@shelhamer shelhamer merged commit 25422de into BVLC:master Sep 16, 2016
@shelhamer shelhamer deleted the groom-batch-norm branch September 16, 2016 18:51
@bwilbertz
Contributor

@shelhamer I think there is still an issue with the batch norm upgrade:

The operation `new Net<Dtype>(net_param)->ToProto(...)` is no longer idempotent.

That means that if we pass an upgraded net_param into the Net constructor, we get back a net which needs batchNormUpgrade again (because the first round of LayerSetUp will add 3 params and the second round will refuse to work unless all params are removed).

Unless this is intended behaviour, I would suggest changing those lines into something like:

  // Add any missing ParamSpecs with lr_mult fixed to 0, but accept ones that
  // are already present (and check they are 0) so repeated setup stays valid.
  for (int i = 0; i < this->blobs_.size(); ++i) {
    if (this->layer_param_.param_size() == i) {
      ParamSpec* fixed_param_spec = this->layer_param_.add_param();
      fixed_param_spec->set_lr_mult(0.f);
    } else {
      CHECK_EQ(this->layer_param_.param(i).lr_mult(), 0.f)
          << "Cannot configure batch normalization statistics as layer parameters.";
    }
  }

@shelhamer
Member Author

@bwilbertz thanks for raising this, I will look into it soon.
