This is a quick evaluation of BatchNorm layer (BVLC/caffe#3229) performance on ImageNet-2012.

Other on-going evaluations:

The architecture is similar to CaffeNet, but has differences:

  1. Images are resized so that the smaller side is 128 pixels, for speed reasons.
  2. The fc6 and fc7 layers have 2048 neurons instead of 4096.
  3. Networks are initialized with LSUV-init.
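
For readers unfamiliar with LSUV-init, here is a minimal NumPy sketch of the idea for a single fully connected layer: start from an orthonormal weight matrix, then rescale it until the layer output has unit variance on a mini-batch. The helper names (`orthonormal`, `lsuv_linear`), the tolerance, and the input size of 512 are assumptions made for illustration; this is not the code used in these experiments.

```python
import numpy as np

def orthonormal(shape, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns (QR of a Gaussian)."""
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)              # q has orthonormal columns
    return q if rows >= cols else q.T

def lsuv_linear(weights, batch, tol=0.01, max_iter=10):
    """Rescale weights until the output variance on `batch` is within `tol` of 1."""
    w = weights.copy()
    for _ in range(max_iter):
        var = (batch @ w).var()
        if abs(var - 1.0) < tol:
            break
        w /= np.sqrt(var)
    return w

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512)) * 3.0   # hypothetical mini-batch of layer inputs
w0 = orthonormal((512, 2048), rng)          # 2048 outputs, like fc6/fc7 here
w = lsuv_linear(w0, x)
print("output variance before:", (x @ w0).var())
print("output variance after: ", (x @ w).var())
```

In a real network the same rescaling is applied layer by layer, feeding each layer the output of the already-initialized layers before it.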

Because LRN layers add nothing to accuracy, they were removed for speed reasons in further experiments.

Batch normalization

BN-paper, caffe-PR

Note that the results are obtained without the additional y=kx+b layer mentioned in the paper.

BN -- before or after ReLU?

| Name | Accuracy | LogLoss | Comments |
|------|----------|---------|----------|
| Before | 0.474 | 2.35 | As in paper |
| Before + scale&bias layer | 0.478 | 2.33 | As in paper |
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |

So in all subsequent experiments, BN is placed after the non-linearity.
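
The two placements compared above can be sketched in NumPy as follows. This is an illustration only, using training-mode batch statistics; the `batch_norm` helper, its optional `scale`/`bias` arguments (standing in for the scale&bias layer), and the random input are assumptions, not the Caffe layers used in the experiments.

```python
import numpy as np

def batch_norm(x, eps=1e-5, scale=None, bias=None):
    """Normalize each feature over the batch axis; optionally apply y = k*x + b."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if scale is not None:
        x_hat = x_hat * scale
    if bias is not None:
        x_hat = x_hat + bias
    return x_hat

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16)) * 2.0 + 1.0   # hypothetical pre-activations

bn_before = relu(batch_norm(x))   # "Before": layer -> BN -> ReLU, as in the paper
bn_after = batch_norm(relu(x))    # "After":  layer -> ReLU -> BN

print("Before: mean=%.3f var=%.3f" % (bn_before.mean(), bn_before.var()))
print("After:  mean=%.3f var=%.3f" % (bn_after.mean(), bn_after.var()))
```

With BN after the non-linearity, the next layer receives inputs that are zero-mean and unit-variance per feature; with BN before it, the ReLU output is non-negative and so no longer centered.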

BN and activations

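| Name | Accuracy | LogLoss | Comments |
|------|----------|---------|----------|
| ReLU | 0.499 | 2.21 | |
| RReLU | 0.500 | 2.20 | |
| PReLU | 0.503 | 2.19 | |
| ELU | 0.498 | 2.23 | |
| Maxout | 0.487 | 2.28 | |
| Sigmoid | 0.475 | 2.35 | |
| TanH | 0.448 | 2.50 | |
| No | 0.384 | 2.96 | |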

BN and dropout

ReLU non-linearity; dropout is applied to the fc6 and fc7 layers only.

| Name | Accuracy | LogLoss | Comments |
|------|----------|---------|----------|
| Dropout = 0.5 | 0.499 | 2.21 | |
| Dropout = 0.2 | 0.527 | 2.09 | |
| Dropout = 0 | 0.513 | 2.19 | |
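
As a reminder of what the dropout ratio in the table controls, here is a small NumPy sketch assuming the common inverted-dropout convention (drop each unit with probability `ratio` at training time and rescale the survivors). The `dropout` helper and the all-ones stand-in for fc6 activations are hypothetical and only illustrate the three ratios compared above.

```python
import numpy as np

def dropout(x, ratio, rng, train=True):
    """Inverted dropout: zero units with probability `ratio`, rescale the rest."""
    if not train or ratio == 0.0:
        return x
    keep = 1.0 - ratio
    mask = rng.random(x.shape) < keep   # True for units that are kept
    return x * mask / keep

rng = np.random.default_rng(0)
fc6 = np.ones((1000, 2048))             # stand-in for fc6 activations
for ratio in (0.5, 0.2, 0.0):
    out = dropout(fc6, ratio, rng)
    print("ratio=%.1f zeroed=%.3f mean=%.3f" % (ratio, (out == 0).mean(), out.mean()))
```

The rescaling keeps the expected activation unchanged, so a lower ratio only reduces the amount of noise injected, which is the regularization strength varied in this experiment.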

Prototxt, logs

BN-arch-init

| Name | Accuracy | LogLoss | Comments |
|------|----------|---------|----------|
| Caffenet | 0.471 | 2.36 | |
| Caffenet BN Before + scale&bias layer LSUV | 0.478 | 2.33 | |
| Caffenet BN Before + scale&bias layer Ortho | 0.482 | 2.31 | |
| Caffenet BN After LSUV | 0.499 | 2.21 | |
| Caffenet BN After Ortho | 0.500 | 2.20 | |

| Name | Accuracy | LogLoss | Comments |
|------|----------|---------|----------|
| GoogLeNet128 | 0.619 | 1.61 | |
| GoogLeNet BN Before + scale&bias layer LSUV | 0.603 | 1.68 | |
| GoogLeNet BN Before + scale&bias layer Ortho | 0.607 | 1.67 | |
| GoogLeNet BN After LSUV | 0.596 | 1.70 | |
| GoogLeNet BN After Ortho | 0.584 | 1.77 | |

CaffeNet128 test accuracy

CaffeNet128 test loss

CaffeNet128 train loss

GoogleNet128 test accuracy

GoogleNet128 test loss

GoogleNet128 train loss

Prototxt, logs

BatchNorm evaluation ReLU

CaffeNet128 test accuracy

CaffeNet128 test loss

CaffeNet128 train loss

Different activations plus BN

As one can see, BN makes the difference between ReLU, ELU and PReLU negligible. This may confirm that the main source of the VLReLU and ELU advantage is that their output is closer to mean=0, var=1 than that of the standard ReLU.
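
The claim about output statistics can be checked with a quick NumPy sketch, assuming standard normal inputs to each activation. The VLReLU negative slope of 0.333 and the ELU alpha of 1.0 are assumptions for illustration, not values taken from the experiments above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def vlrelu(x, slope=0.333):
    # Very leaky ReLU; the slope value is an assumption.
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)      # inputs with mean=0, var=1

stats = {name: (f(x).mean(), f(x).var())
         for name, f in [("ReLU", relu), ("VLReLU", vlrelu), ("ELU", elu)]}
for name, (m, v) in stats.items():
    print(f"{name:7s} mean={m:.3f} var={v:.3f}")
```

Under these assumptions, the VLReLU and ELU outputs sit closer to zero mean and unit variance than the ReLU output; a following BN layer removes that gap for all three.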

CaffeNet128 test accuracy

CaffeNet128 test loss

CaffeNet128 train loss

Batch Normalization and Dropout

BN+Dropout = 0.5 is too much regularization. Dropout=0.2 is just enough :)

CaffeNet128 test accuracy

CaffeNet128 test loss

CaffeNet128 train loss

Do we need EltwiseAffine layer?

CaffeNet128 test accuracy

P.S. Logs are merged from many "save-resume" runs, because the networks were trained at night, so an "Anything vs. seconds" plot will give weird results.