Exponential Linear Units #3388
Conversation
Great job! Was actually coming to check on ELU, found this :) Will report on performance when I can.
It seems this is actually what they did for the paper as well:
Thanks for the heads-up :) Note that mathematically, as long as alpha == 1, this doesn't make a difference: since exp(0) == 1, both the transfer function and the gradient output the same thing regardless of > vs >=. Also, given the way ELUs look, it's pretty hard for an activation to hit 0 precisely anyhow. But you're right, we used > 0 during our own experiments, both in the binet code and in our own Caffe fork. If we make another paper revision, we will definitely include that change.
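To make the comparison concrete, here is a small standalone illustration (not code from this PR; the function names are made up for the example) of the two conventions evaluated at x == 0:

```cpp
// At x == 0 the two conventions agree on the forward value for any alpha
// (both branches give 0), but the gradients only coincide when alpha == 1.
#include <cmath>
#include <cstdio>

// "strict" selects the x > 0 convention; otherwise x >= 0 counts as linear.
double elu(double x, double alpha, bool strict) {
  const bool linear = strict ? (x > 0.0) : (x >= 0.0);
  return linear ? x : alpha * (std::exp(x) - 1.0);
}

double elu_grad(double x, double alpha, bool strict) {
  const bool linear = strict ? (x > 0.0) : (x >= 0.0);
  return linear ? 1.0 : alpha * std::exp(x);  // d/dx[alpha*(exp(x)-1)] = alpha*exp(x)
}

int main() {
  const double alphas[] = {1.0, 0.5};
  for (double a : alphas) {
    std::printf("alpha=%.1f  f(0): %g vs %g   f'(0): %g vs %g\n", a,
                elu(0.0, a, true), elu(0.0, a, false),
                elu_grad(0.0, a, true), elu_grad(0.0, a, false));
  }
  return 0;
}
```

With alpha == 1 the two conventions agree on both values; with alpha == 0.5 the forward values still match (both 0), but the gradients at 0 differ (0.5 vs 1).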
Thanks for this @mohomran! That was quick. I'm sorry that this was caught by the switch to layer headers in #3315, but could you update this to reflect the new arrangement? See the new ReLU header for an example.
(Force-pushed from cf55322 to 03c9846.)
@beniz: Thanks. :) So far, I've only tested it on MNIST and CIFAR-10 ("quick"), but neither network is deep enough to result in significant gains according to the paper. The updated CIFAR-10 network did seem to converge a bit faster, though. @f0k, @untom: Thanks, good to know! As mentioned, I encountered problems when alpha was set to 0, which prompted the change. @shelhamer: Rebased and ready to go. :)
ELUForward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
    count, bottom_data, top_data, alpha);
CUDA_POST_KERNEL_CHECK;
// << " count: " << count << " bottom_data: "
Drop commented code.
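For context, here is a sketch of a forward kernel consistent with the launch above; it's an illustration assuming the usual Caffe grid-stride pattern (which is what the CUDA_KERNEL_LOOP macro expands to), not necessarily the exact kernel in this PR:

```cuda
// Sketch of an ELU forward kernel matching the launch shown above.
template <typename Dtype>
__global__ void ELUForward(const int n, const Dtype* in, Dtype* out,
                           const Dtype alpha) {
  // Grid-stride loop over the n elements of the bottom blob.
  for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
       index += blockDim.x * gridDim.x) {
    // f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise.
    out[index] = in[index] > 0 ? in[index]
                               : alpha * (exp(in[index]) - Dtype(1));
  }
}
```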
@jeffdonahue when leaky ReLU was added, it was incorporated into ReLU in #740. Do you have an opinion on a separate ELU layer?
I'd be fine with incorporating it into ReLU if there's a near-zero performance impact, but this feels to me more like it should be a separate layer than leaky ReLU did (which felt like a more natural generalization to me, still being piecewise linear).
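For contrast, a minimal scalar sketch of the leaky ReLU generalization (assuming the negative_slope parameterization from #740), which stays piecewise linear and so folds naturally into the existing ReLU kernel, unlike ELU's exponential negative branch:

```cpp
// Leaky ReLU: one extra scalar on the negative branch, still piecewise linear.
// negative_slope == 0 recovers the plain ReLU.
double leaky_relu(double x, double negative_slope) {
  return x > 0.0 ? x : negative_slope * x;
}
```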
@mohomran so I've tested on GoogleNet, and even with BN activated, just for the sake of it. It appears to work fine, though the memory requirement appears to grow significantly, which translates into smaller batches. The typical memory error (or so I guess) happens on the CUDA_POST_KERNEL_CHECK in elu_layer.cu. FTR, I had cuDNN activated, though of course ELU is not using it. I have some GPU time to kill over the next few days if more experiments or reports can help.
(Force-pushed from 03c9846 to a668194.)
Following the discussion in [1] and the original implementation in [2]: in the original implementation, > 0 was used, not >= 0 as reported in the paper. [1] BVLC/caffe#3388 [2] untom/binet@2c8a6bd
@untom It does make a difference for the gradient, for any …
Is there anything I can do to help move this PR forward?
Implementation of the Exponential Linear Units proposed in:
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). http://arxiv.org/abs/1511.07289
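For reference, the transfer function and gradient as described in this thread, written with the strict > 0 convention adopted here (the alpha * exp(x) form of the gradient follows from differentiating alpha * (exp(x) - 1)):

```latex
f(x) =
\begin{cases}
  x & \text{if } x > 0 \\
  \alpha \, (e^{x} - 1) & \text{otherwise}
\end{cases}
\qquad
f'(x) =
\begin{cases}
  1 & \text{if } x > 0 \\
  \alpha \, e^{x} & \text{otherwise}
\end{cases}
```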
I made one minor modification to the formula from the paper: f(x) = x if x > 0 rather than if x >= 0, with the corresponding change to the gradient. I did this for two reasons: