
Exponential Linear Units #3388

Merged 1 commit into BVLC:master on Jan 22, 2016

Conversation

mohomran
Contributor

Implementation of the Exponential Linear Units proposed in:

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). http://arxiv.org/abs/1511.07289

I made one minor modification to the formula from the paper: f(x) = x if x > 0, rather than if x >= 0, with the corresponding change to the gradient (restated after the list below). I did this for two reasons:

  1. This way, when alpha = 0, an ELU reduces exactly to a ReLU as implemented in Caffe; in particular, f'(0) = 0 instead of 1 as specified in the paper.
  2. With the original formula, the loss would also diverge during MNIST training when alpha = 0. I would be happy to receive additional verification and to revise this change if necessary.
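For reference, the modified definition and its gradient can be restated as follows (my restatement of the above, not quoted from the paper or the PR):

$$
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha \left( e^{x} - 1 \right) & \text{if } x \le 0 \end{cases}
\qquad
f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha \, e^{x} = f(x) + \alpha & \text{if } x \le 0 \end{cases}
$$

At x = 0 both branches give f(0) = 0, but the gradient from the exponential branch is alpha rather than 1, which is why alpha = 0 makes the layer behave exactly like Caffe's ReLU.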

mohomran changed the title from "ELU layer with basic tests" to "Exponential Linear Units" on Nov 26, 2015
@beniz

beniz commented Nov 26, 2015

Great job! I was actually coming to check on ELU and found this :) Will report on performance when I can.

@f0k

f0k commented Dec 1, 2015

> I made one minor modification to the formula from the paper: f(x) = x, if x > 0 rather than if x >= 0, with the corresponding change to the gradient.

It seems this is actually what they did for the paper as well:
untom/binet@2c8a6bd
@untom, you might want to change the formula in the paper accordingly!

@untom

untom commented Dec 1, 2015

Thanks for the heads-up :)

Note that mathematically, as long as alpha == 1, this doesn't make a difference: since exp(0) == 1, both the transfer function and the gradient output the same thing regardless of > vs >=. Also, given the shape of the ELU, it's pretty hard for an activation to hit 0 precisely anyhow. But you're right, we used > 0 during our own experiments, both in the binet code and in our own Caffe fork. If we make another paper revision, we will definitely include that change.
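Spelling out the boundary case (my addition, not part of the original comment): at x = 0 the exponential branch gives

$$ f(0) = \alpha \left( e^{0} - 1 \right) = 0, \qquad f'(0) = \alpha \, e^{0} = \alpha, $$

so the two branches always agree in value at 0, and agree in gradient there exactly when alpha = 1.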

@shelhamer
Member

Thanks for this @mohomran! That was quick.

I'm sorry that this was caught by the switch to layer headers in #3315, but could you update the PR to reflect the new arrangement? See the new ReLU header for an example.
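For reference, under the per-layer header arrangement introduced in #3315, the declaration moves into its own header modelled on relu_layer.hpp. A minimal sketch of what that file might look like (path, comments, and exact declarations are illustrative, not necessarily the merged code):

    // include/caffe/layers/elu_layer.hpp -- illustrative sketch modelled on relu_layer.hpp
    #ifndef CAFFE_ELU_LAYER_HPP_
    #define CAFFE_ELU_LAYER_HPP_

    #include <vector>

    #include "caffe/blob.hpp"
    #include "caffe/layer.hpp"
    #include "caffe/proto/caffe.pb.h"
    #include "caffe/layers/neuron_layer.hpp"

    namespace caffe {

    // ELU non-linearity: y = x for x > 0, y = alpha * (exp(x) - 1) otherwise.
    template <typename Dtype>
    class ELULayer : public NeuronLayer<Dtype> {
     public:
      explicit ELULayer(const LayerParameter& param)
          : NeuronLayer<Dtype>(param) {}
      virtual inline const char* type() const { return "ELU"; }

     protected:
      virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
          const vector<Blob<Dtype>*>& top);
      virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
          const vector<Blob<Dtype>*>& top);
      virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
          const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
      virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
          const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
    };

    }  // namespace caffe

    #endif  // CAFFE_ELU_LAYER_HPP_

The change in #3315 is purely organisational: each layer declaration gets its own header under include/caffe/layers/ instead of living in a shared monolithic header.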

@mohomran
Contributor Author

mohomran commented Dec 3, 2015

@beniz: Thanks. :) So far, I've only tested it on MNIST and CIFAR-10 ("quick"), but neither network is deep enough to show significant gains according to the paper. The updated CIFAR-10 network did seem to converge a bit faster, though.

@f0k, @untom: Thanks, good to know! As mentioned, I encountered problems when alpha was set to 0, which prompted the change.

@shelhamer: Rebased and ready to go. :)

  ELUForward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
      count, bottom_data, top_data, alpha);
  CUDA_POST_KERNEL_CHECK;
  // << " count: " << count << " bottom_data: "
Drop commented code.
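This refers to the leftover debug line at the end of the hunk above. For context, the ELUForward kernel launched there follows Caffe's usual element-wise pattern; a sketch under the modified > 0 formula (illustrative, not necessarily the merged code character for character):

    template <typename Dtype>
    __global__ void ELUForward(const int n, const Dtype* in, Dtype* out,
        Dtype alpha) {
      // One thread per element: identity for positive inputs,
      // alpha * (exp(x) - 1) for non-positive inputs.
      CUDA_KERNEL_LOOP(index, n) {
        out[index] = in[index] > 0 ? in[index]
            : alpha * (exp(in[index]) - 1);
      }
    }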

@shelhamer
Member

@jeffdonahue, when Leaky ReLU was added it was incorporated into ReLU in #740. Do you have an opinion on a separate ELU layer?

@jeffdonahue
Contributor

I'd be fine with incorporating it into ReLU if there's near-zero performance impact, but this feels to me more like it should be a separate layer than leaky ReLU did (which seemed like a more natural generalization, since it's still piecewise linear).

@beniz

beniz commented Dec 4, 2015

@mohomran so I've tested on GoogLeNet, and even with BN activated, just for the sake of it. It appears to work fine, though the memory requirement grows significantly, which translates into smaller batches. The typical memory error (or so I guess) happens on the CUDA_POST_KERNEL_CHECK in elu_layer.cu. FTR, I had cuDNN activated, though of course ELU is not using it. I have some GPU time to kill over the next few days if more experiments or reports would help.
EDIT: the memory bump is likely due to ELU not using cuDNN, whereas ReLU does.

vchuravy added a commit to oist/mxnet that referenced this pull request Dec 7, 2015
Following the discussion in [1] and the original implementation in [2]: in the original implementation, > 0 was used rather than >= 0 as reported in the paper.

[1] BVLC/caffe#3388
[2]
untom/binet@2c8a6bd
@vchuravy

vchuravy commented Dec 7, 2015

@untom It does make a difference for the gradient, for any a != 0

@untom

untom commented Dec 16, 2015

Is there anything I can do to help move this PR forward?

shelhamer added a commit that referenced this pull request Jan 22, 2016
shelhamer merged commit a7ac8bc into BVLC:master on Jan 22, 2016
@shelhamer
Member

Thanks for the non-linearity @mohomran and thanks for checking in regarding the paper details @untom!
