batch norm: hide statistics from solver, simplifying layer definition
batch norm statistics are not learnable parameters subject to solver
updates, so they must be shielded from the solver. The `BatchNorm` layer now
masks its statistics itself by zeroing the parameter learning rates,
instead of relying on the layer definition to do so.

n.b. declaring `param`s for batch norm layers is no longer allowed.
shelhamer committed Sep 13, 2016
1 parent 3b6fd1d commit c8f446f
Showing 2 changed files with 10 additions and 4 deletions.
6 changes: 2 additions & 4 deletions include/caffe/layers/batch_norm_layer.hpp
@@ -22,10 +22,8 @@ namespace caffe {
  * mean/variance statistics via a running average, which is then used at test
  * time to allow deterministic outputs for each input. You can manually toggle
  * whether the network is accumulating or using the statistics via the
- * use_global_stats option. IMPORTANT: for this feature to work, you MUST set
- * the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
- * times in the layer definition. For reference, these three blobs are (0)
- * mean, (1) variance, and (2) the moving average factor.
+ * use_global_stats option. For reference, these statistics are kept in the
+ * layer's three blobs: (0) mean, (1) variance, and (2) moving average factor.
  *
  * Note that the original paper also included a per-channel learned bias and
  * scaling factor. To implement this in Caffe, define a `ScaleLayer` configured
8 changes: 8 additions & 0 deletions src/caffe/layers/batch_norm_layer.cpp
@@ -34,6 +34,14 @@ void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
           this->blobs_[i]->mutable_cpu_data());
     }
   }
+  // Mask statistics from optimization by setting local learning rates
+  // for mean, variance, and the bias correction to zero.
+  CHECK_EQ(this->layer_param_.param_size(), 0)
+      << "Cannot configure batch normalization statistics as layer parameters.";
+  for (int i = 0; i < this->blobs_.size(); ++i) {
+    ParamSpec* fixed_param_spec = this->layer_param_.add_param();
+    fixed_param_spec->set_lr_mult(0.);
+  }
 }
 
 template <typename Dtype>

6 comments on commit c8f446f

@matthieu637 commented on c8f446f Sep 21, 2016

Since this commit, I can no longer copy a network.

caffe::Net<double>* old_network;  // network to copy; it contains a batch norm layer

caffe::NetParameter net_param;
old_network->ToProto(&net_param);   // serializes the three zero-lr_mult param specs added in LayerSetUp
new caffe::Net<double>(net_param);  // fails here with:
// batch_norm_layer.cpp:39] Check failed: this->layer_param_.param_size() == 0 (3 vs. 0) Cannot configure batch normalization statistics as layer parameters.

What is the new way to do it?

@matthieu637

Apparently calling clear_param() on LayerParameter is enough.
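
For reference, a minimal sketch of that workaround, assuming `old_network` is the existing net from the snippet above; the `CopyNetDroppingBatchNormParams` helper name is only illustrative, and restricting `clear_param()` to `"BatchNorm"` layers leaves other layers' param specs untouched:

  #include "caffe/net.hpp"
  #include "caffe/proto/caffe.pb.h"

  // Sketch: rebuild a net from an existing one, dropping the per-blob param
  // specs that BatchNorm's LayerSetUp now adds to its own layer_param_, so the
  // param_size() == 0 check passes when the copy is constructed.
  caffe::Net<double>* CopyNetDroppingBatchNormParams(
      caffe::Net<double>* old_network) {
    caffe::NetParameter net_param;
    old_network->ToProto(&net_param);
    for (int i = 0; i < net_param.layer_size(); ++i) {
      if (net_param.layer(i).type() == "BatchNorm") {
        net_param.mutable_layer(i)->clear_param();  // drops the lr_mult: 0 specs
      }
    }
    return new caffe::Net<double>(net_param);
  }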

@shelhamer (member, author)

Note that Caffe has an automatic definition upgrade path that is used by the caffe binary and by interfaces like pycaffe, so you don't have to do this manually. This is the part that handles batch norm: https://github.com/BVLC/caffe/blob/master/src/caffe/util/upgrade_proto.cpp#L1003-L1024
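
A rough sketch of invoking that upgrade path on an in-memory definition, assuming the helpers declared in caffe/util/upgrade_proto.hpp; the `CopyNetViaUpgrade` helper name and the label string are only illustrative, and the first argument to `UpgradeNetAsNeeded` is just used in its log messages:

  #include "caffe/net.hpp"
  #include "caffe/util/upgrade_proto.hpp"

  // Sketch: run the automatic upgrade path over a serialized definition; among
  // other fixes it adapts obsolete BatchNorm param specs so the layer's check
  // passes when the net is reconstructed.
  caffe::Net<double>* CopyNetViaUpgrade(caffe::Net<double>* old_network) {
    caffe::NetParameter net_param;
    old_network->ToProto(&net_param);
    caffe::UpgradeNetAsNeeded("in-memory net copy", &net_param);
    return new caffe::Net<double>(net_param);
  }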

@D-X-Y commented on c8f446f Oct 19, 2016

Does this mean I don't need to specify

  param {
    lr_mult: 0
    decay_mult: 0
  }

in the BatchNorm layer when training, since this commit?
Thanks.

@jspark1105

Shouldn't decay_mult also be set to zero? I'm seeing a very large regularization term when using batch normalization.

@shaibagon (member)

What about

  param { name: "want_to_share_this" }

How can one share BatchNorm params (e.g., for a Siamese network, or for the following `Scale` layer)?

related to #5171
