Batch Norm: Further Documentation and Simplified Definition #4704
Conversation
A bias/scaling can be applied wherever desired by defining the respective layers, and `ScaleLayer` can handle both as a memory optimization.
Force-pushed from 8588b94 to 2e61ef1
@shelhamer Thanks a lot for these clarifications. When I started using Caffe's BatchNorm a few weeks ago, it took me quite some time to get all these things clear by myself. Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance. This commit switched the calculation of Var(X) from EX^2 - (EX)^2 to E(X-EX)^2. Unfortunately, in each forward pass, the estimator m_b for EX is based only on the values of the current mini-batch (a rather bad estimator compared to the moving mean accumulated in the layer's mean blob). In contrast, @cdoersch's original implementation stored the moving average of EX^2 in the variance blob, from which an accurate global variance can be recovered. This becomes especially an issue if you want to compute high-accuracy estimators of mean and variance for further finetuning or inference.
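A toy illustration of the point above (plain Python, not Caffe code, and using simple averages rather than Caffe's exponential moving averages): averaging per-batch central variances E[(X - m_b)^2] misses the between-batch variance, while averaging EX and EX^2 separately and combining them recovers the pooled variance.

```python
# Two mini-batches with very different means.
batches = [[0.0, 1.0], [10.0, 11.0]]

def mean(xs):
    return sum(xs) / len(xs)

# Scheme 1 (post-0ad1d8a): average the per-batch central variances.
per_batch_var = mean([mean([(x - mean(b)) ** 2 for x in b]) for b in batches])

# Scheme 2 (original): average E[X] and E[X^2] separately, then combine.
avg_mean = mean([mean(b) for b in batches])
avg_sq = mean([mean([x * x for x in b]) for b in batches])
combined_var = avg_sq - avg_mean ** 2

# Ground truth: variance over all samples pooled together.
all_x = [x for b in batches for x in b]
true_var = mean([(x - mean(all_x)) ** 2 for x in all_x])

print(per_batch_var)  # 0.25  -- only the within-batch variance survives
print(combined_var)   # 25.25 -- matches the pooled variance
print(true_var)       # 25.25
```

The gap between 0.25 and 25.25 is exactly the between-batch variance that the E[(X - m_b)^2] scheme can never accumulate.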
* use_global_stats option. IMPORTANT: for this feature to work, you MUST set
* the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
* times in the layer definition. For reference, these three blobs are (0)
* mean, (1) variance and (2) m, the correction for the batch size.
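For concreteness, a sketch of the old-style layer definition this comment describes (the layer and blob names here are made up for illustration):

```
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # Old-style masking of the statistics from the solver:
  # one param message per blob (mean, variance, moving-average factor).
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
```

This PR makes the three `param { lr_mult: 0 }` messages unnecessary (and in fact disallowed), since the layer now masks its statistics itself.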
Isn't blob[2] the cumulative moving-average factor rather than a correction for the batch size?
It is updated as b(n+1) = alpha*b(n) + 1 with b(0) = 0.
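The update above is a geometric sum: after n updates, b(n) = 1 + alpha + ... + alpha^(n-1), i.e. the total weight accumulated by the exponential moving average, which the stored statistics are divided by at inference time. A quick Python sketch (the value 0.999 is Caffe's default moving_average_fraction, assumed here):

```python
alpha = 0.999  # moving_average_fraction; 0.999 is Caffe's documented default

b = 0.0
for n in range(1, 6):
    b = alpha * b + 1.0
    # closed form of the geometric sum: (1 - alpha^n) / (1 - alpha)
    assert abs(b - (1 - alpha ** n) / (1 - alpha)) < 1e-9

print(b)  # after 5 updates: 1 + alpha + alpha^2 + alpha^3 + alpha^4
```

As n grows, b approaches 1 / (1 - alpha), so with alpha = 0.999 the factor saturates near 1000.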
Right, it is. Comment updated accordingly.
batch norm statistics are not learnable parameters subject to solver updates, so they must be shielded from the solver. `BatchNorm` layer now masks its statistics for itself by zeroing parameter learning rates instead of relying on the layer definition. n.b. declaring `param`s for batch norm layers is no longer allowed.
automatically strip old batch norm layer definitions including `param` messages. the batch norm layer used to require manually masking its state from the solver by setting `param { lr_mult: 0 }` messages for each of its statistics. this is now handled automatically by the layer.
Force-pushed from 2e61ef1 to a8ec123
Re: #3299 I'll try to review this once I have time for another round of batch norm reform. Note that it was addressing a numerical instability issue, so the solution isn't as simple as just reverting it.
@shelhamer This kind of numerical issue only occurs if N is extremely large and, at the same time, the variance is very small. Btw: the epsilon correction is missing in MVNLayer (where this kind of numerical problem had its origin, see #3162)
@shelhamer I think there is still an issue with the batch norm upgrade: if we pass an upgraded net_param into the Net constructor, we receive back a net which needs the batch norm upgrade again (because the first round of LayerSetUp will add 3 params, and a second upgrade will refuse to work unless all params are removed). Unless this is intended behaviour, I would suggest changing those lines into something like:
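A hypothetical Python sketch of the idempotency property this comment asks for (not Caffe's actual C++ upgrade code; the dict-based layer schema and function names are invented for illustration): the upgrade only fires when there is something to strip, so running it on an already-upgraded net is a no-op.

```python
import copy

def needs_batch_norm_upgrade(layer):
    # Only old-style definitions carry explicit param messages.
    return layer["type"] == "BatchNorm" and len(layer.get("param", [])) > 0

def upgrade_batch_norm(net):
    for layer in net:
        if needs_batch_norm_upgrade(layer):
            layer["param"] = []  # strip the old lr_mult: 0 masks
    return net

# An old-style BatchNorm layer with three param messages.
net = [{"type": "BatchNorm", "param": [{"lr_mult": 0}] * 3}]
upgraded = upgrade_batch_norm(copy.deepcopy(net))
again = upgrade_batch_norm(copy.deepcopy(upgraded))
print(again == upgraded)  # True: the second upgrade changed nothing
```

The key design choice is that the "needs upgrade" predicate is checked per layer and becomes false after the first pass, rather than the upgrade erroring out on params it added itself.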
@bwilbertz thanks for raising this, I will look into it soon.
The current state of batch norm is somewhat unclear and requires a tedious specification of learning rates for correctness. This PR documents the batch norm layer in more detail, clarifying its blobs and how to handle the bias and shift, and ensures that the batch norm statistics are not mistakenly mangled by the solver. As the mean, variance, and bias correction are not learnable parameters to be optimized, they should not be updated by the solver, and this PR enforces this exclusion.
A further PR could optionally fold the scale and shift into the batch norm layer (for now they are handled by a separate layer), which would align with the cuDNN interface, but this PR is helpful in itself to avoid accidents.
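The split described here can be sketched in plain Python over a single channel (a toy model, not Caffe code: `batch_norm` stands in for the BatchNorm layer's normalization and `scale_shift` for a Scale layer with `bias_term: true`; eps plays the role of the layer's epsilon parameter):

```python
def batch_norm(xs, eps=1e-5):
    # Normalize to zero mean, unit variance (training-mode batch statistics).
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / (v + eps) ** 0.5 for x in xs]

def scale_shift(xs, gamma, beta):
    # The learnable affine transform, kept in a separate layer.
    return [gamma * x + beta for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
ys = scale_shift(batch_norm(xs), gamma=2.0, beta=0.5)
# The normalized output has zero mean, so after scale/shift the mean is beta.
print(sum(ys) / len(ys))  # 0.5
```

Folding `scale_shift` into `batch_norm` as optional learnable parameters is exactly what the cuDNN interface does, which is why a later PR merging the two would align with it.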