Update BatchNormLayer #3299
Conversation
I think you mean E(X)^2 for parts of what you're saying, but yes, this is a good point. This is incorrect in the current implementation. I agree with merging the first two points in this PR. @ronghanghu, can you mark this as a bugfix as well as an enhancement?

However, I don't agree with merging the PR as-is. The problem with using S = alpha * Xt + (1-alpha)*S is that it will be biased if the number of iterations is small relative to 1/alpha. It's actually possible that some users will want to set alpha=0 if they want to integrate over an entire dataset. Hence, I would recommend against making this change.

Also, while we're doing this, can we add two more simple changes:

instead of the current CHECK. There's really no good reason why this shouldn't be allowed, and arguably people will sometimes want to turn batch normalization off halfway through training.
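As an aside (a sketch, not code from the PR), the bias cdoersch describes is easy to see in a few lines of Python: with a constant stream of samples, an EMA seeded at zero stays far from the true mean when the number of iterations is small relative to 1/alpha.

```python
# Sketch (not Caffe code): an EMA S = alpha*x_t + (1-alpha)*S stays biased
# toward its initial value when the sample count is small relative to 1/alpha.
def ema(samples, alpha, init=0.0):
    s = init
    for x in samples:
        s = alpha * x + (1 - alpha) * s
    return s

samples = [1.0] * 10                 # true mean is 1.0
print(ema(samples, alpha=0.01))      # ~0.096: far from 1.0 after only 10 steps
print(ema(samples, alpha=0.5))       # ~0.999: converges quickly
```

With constant input x, the EMA after n steps is exactly 1 - (1-alpha)^n times x, which makes the dependence on alpha*n explicit.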
(force-pushed from 641e3b4 to b0707be)
@cdoersch I miswrote it. As you said, it is E(X)^2. Thanks for the clarification.
It is true that the exponential moving average (EMA) has a problem with setting initial values. The formula (S = alpha * Xt + (1-alpha)*S) is actually a standard EMA, and it can be used (although not always) with better init tricks. The init value problem can be ameliorated by using some number of starting samples to calculate the mean, and using that as an initial seed value. I think the current implementation can have a bias towards recent samples, since with a small alpha a few new samples dominate the sum.

What you are describing is probably the cumulative moving average (CMA). CMA and the new init method have been implemented in commits on another branch of mine. I am thinking about bringing these commits here. Can you please check them? c1884c1 18ef30f
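For illustration (a hypothetical sketch, not code from either branch), the difference between the two averages can be written as follows: the CMA weights every sample equally, while the EMA weights recent samples more heavily.

```python
# Sketch: cumulative moving average (CMA) vs. exponential moving average (EMA).
def cma(samples):
    s, n = 0.0, 0
    for x in samples:
        n += 1
        s += (x - s) / n          # running mean of all samples seen so far
    return s

def ema(samples, alpha):
    s = samples[0]                # seed with the first sample
    for x in samples[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

data = [1.0, 2.0, 3.0, 4.0]
print(cma(data))                  # 2.5: unweighted mean of all samples
print(ema(data, 0.5))             # 3.125: weighted toward recent samples
```

Seeding the EMA with the first sample, as in this sketch, is one of the "init tricks" mentioned above: it avoids the bias toward an arbitrary initial value.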
I made the changes with a new commit. Please check this.
I still disagree with these changes. It's a lot of extra complexity--not to mention breaking backwards compatibility--for essentially no benefit. You are certainly correct that the current initialization scheme is biased toward more recent samples, but it's exactly the same bias that would be there in the moving average. In general, it's better to use such an exponential weighting of the samples even in the beginning, since the user explicitly told us how much more to trust recent samples than old ones. It's common to want this if you don't expect the statistics to be stationary.

A side note from reading your other commits: we assign numbers to fields in prototxts because in binary files, the fields are stored by number, not by name. Hence, for backward compatibility, you should never change the number associated with a field; new fields should get the next higher number.

Finally, why did you take out the line that copies the top diff into
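To illustrate the field-numbering side note, here is a hypothetical protobuf message (made-up names and numbers, not Caffe's actual caffe.proto):

```protobuf
// Hypothetical example; field names and numbers are made up.
message ExampleParameter {
  optional float old_field = 1;         // existing field: its number must never change
  optional bool another_old_field = 2;  // existing field
  // A new field takes the next unused number. Renumbering or reusing an old
  // number breaks compatibility with previously serialized binary models.
  optional float new_field = 3;
}
```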
OK. The extra complexity was the reason I did not bring in those changes at first. I agree that there would be little benefit from the two separate moving averages.

Thanks for this. I did not know about it.

I did not catch that. I will revert the code. Maybe it is better to add an in-place test to the test suite. I will update the moving average part in the first commit.
(force-pushed from 0d50137 to daa8860)
The update is done.
@cdoersch could you review the latest update? Thanks.
I'd say it's almost there. However, @kkhoot, you also removed the m/(m-1) correction, and I don't see any replacement for it in your code. This is needed to get the unbiased estimate of the variance. What happened to it?
@cdoersch Please see the above lines.
Ok, I see it now. That's not where I expected it. If you do it there, the output won't be exactly normalized: when you're doing batch normalization at training time, the empirical variance--i.e. (1/m) \sum_i X_i^2--of the output should be 1. As far as I understand, you're not supposed to do the bias correction until you're using global statistics at test time.
The above lines are executed only when use_global_stats_ is true.
Oh I see--I thought it was if (!use_global_stats_). If that's the case, then it's wrong for a different reason. There's no reason that m shouldn't change from training time to testing time; it's common for people to deploy networks with a different batch size than the one they trained with. This correction needs to be done if and only if use_global_stats_ is false, but the corrected value must only be stored in the parameter, and not used for actually performing batch normalization. I guess this isn't obvious either; you should add a comment.
OK. Good point. I moved the correction to the !use_global_stats_ part.
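A sketch (plain Python, not the Caffe code) of the behavior being agreed on here: the m/(m-1) correction is applied only to the value that gets stored as the running statistic, while the batch itself is normalized with the uncorrected variance.

```python
# Sketch of the point under discussion: apply the m/(m-1) correction only to
# the value stored as the running (global) statistic; the normalization itself
# uses the uncorrected (biased) batch variance.
def batch_variance(xs):
    m = len(xs)
    mean = sum(xs) / m
    var_biased = sum((x - mean) ** 2 for x in xs) / m   # used to normalize the batch
    var_stored = var_biased * m / (m - 1)               # unbiased estimate, stored only
    return var_biased, var_stored

vb, vs = batch_variance([0.0, 2.0])
print(vb, vs)   # 1.0 2.0
```

Normalizing with the biased variance is what makes the output of the batch have empirical variance exactly 1, which is the property cdoersch pointed out above.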
I used BN layers in my model and the GPU memory consumption increased significantly. Any idea about this?
@kkhoot @shelhamer LGTM. Evan, please merge this when you get a chance. Thanks @kkhoot for all your hard work!

@cysin we try to keep conversation on PRs relevant to the PR itself, so I'm not going to answer here unless it's specifically this PR that increases your memory usage (and I don't think it does). You can ask your question on caffe-users or look at the code to see memory requirements.
Update BatchNormLayer: more numerical stability and backward with global stats
@cdoersch Thanks for the reviews. BatchNormLayer helps to train my model a lot. Thanks again. |
This PR updates BatchNormLayer.
change the moving average formula to S = alpha * Xt + (1-alpha)*S, which does not need rescaling.
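For illustration (hypothetical Python, assuming the old scheme kept an unnormalized sum plus a separate scale factor), the difference is that the new rule keeps S directly usable, with no division at read time:

```python
# Sketch contrasting the two update schemes (assumed forms, not the exact
# Caffe code). The old scheme accumulates an unnormalized sum plus a scale
# factor that must be divided out before use; the new rule keeps S directly
# as the current estimate.
def old_scheme(samples, fraction):
    s, scale = 0.0, 0.0
    for x in samples:
        s = s * fraction + x
        scale = scale * fraction + 1.0
    return s / scale              # extra division needed at read time

def new_scheme(samples, alpha):
    s = 0.0
    for x in samples:
        s = alpha * x + (1 - alpha) * s
    return s                      # already a usable estimate

data = [1.0, 2.0, 3.0]
print(old_scheme(data, 0.9))
print(new_scheme(data, 0.1))
```

With fraction = 1 - alpha, the two schemes differ only by the debiasing factor 1 - (1-alpha)^n, so the old scheme's division is exactly what corrects the initialization bias discussed earlier in the thread.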