Adam solver #2856
Conversation
@philkr could you review this if you have a chance?
optional float delta = 31 [default = 1e-8];
// parameters for the Adam solver
optional float beta1 = 37 [default = 0.9];
Why not use momentum here, to be consistent with other solvers?
@PatWie thanks for the solver! All SGD solvers need gradient checks. See for instance the AdaGrad tests https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_gradient_based_solver.cpp#L431-L483
const Dtype beta1 = this->param_.beta1();
const Dtype beta2 = this->param_.beta2();

const int t = this->iter_ / this->param_.stepsize() + 1;
Why divide by stepsize here?
I think t here is the epoch rather than the iteration, by Caffe's definition.
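For reference, the paper's bias correction assumes a per-update timestep, which is why t = iter_ + 1 (rather than a stepsize-scaled epoch) is the natural choice. A hypothetical plain-C++ sketch of the bias-corrected step size, not Caffe code:

```cpp
#include <cmath>

// Hypothetical sketch (not Caffe code): Adam's bias-corrected step size.
// The paper defines alpha_t = alpha * sqrt(1 - beta2^t) / (1 - beta1^t),
// where t counts individual parameter updates, i.e. t = iter_ + 1.
double EffectiveRate(double alpha, double beta1, double beta2, int t) {
  return alpha * std::sqrt(1.0 - std::pow(beta2, t)) /
         (1.0 - std::pow(beta1, t));
}
```

With the defaults beta1 = 0.9 and beta2 = 0.999, the correction factor is about 0.32 at t = 1 and approaches 1 as t grows, so alpha_t converges to base_lr.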
@shelhamer Ah, I didn't realize that there is already a unit test. I will add one, of course.
caffe_add(N,
    this->val_t_[param_id]->cpu_data(),
    this->val_m_[param_id]->cpu_data(),
    this->val_m_[param_id]->mutable_cpu_data());
The three commands above can be written as a single caffe_cpu_axpby using beta1 instead of 0 and val_m_ instead of val_t_.
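The suggested fusion can be illustrated with a plain-C++ stand-in for caffe_cpu_axpby (hypothetical; the real Caffe helper wraps a BLAS axpby routine, and this sketch only mirrors its Y <- a*X + b*Y semantics):

```cpp
#include <vector>

// Hypothetical stand-in for caffe_cpu_axpby (the real helper wraps BLAS):
// computes Y <- a*X + b*Y elementwise.
void axpby(int n, double a, const double* x, double b, double* y) {
  for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + b * y[i];
  }
}
```

With a = 1 - beta1, X the gradient diff, b = beta1, and Y = val_m_, this single call performs m <- beta1*m + (1-beta1)*g, replacing the three separate calls above.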
Ah, I see. BLAS is completely new to me. I will change this tomorrow in all the places in my code.
I think there's some confusion in the code due to the usage of the stepsize param, which is made worse by using lr_policy "step" in the MNIST example. The alpha/stepsize from the paper should be set via base_lr and used with lr_policy: "fixed", as I don't see any recommendations for changing alpha during training. This way you can also get rid of gamma and power in the prototxt (the latter wasn't being used anyway). The stepsize param should only be used together with lr_policy "step", and if we already set alpha via base_lr it is not needed at all. t can just be iter_ + 1, as it is, afaik, not needed to compute the effective stepsize. This also removes the need for the stepsize > 0 check in the header. Moreover, it makes sense to change the MNIST example to use the paper's recommended value for base_lr (0.001) and to set momentum and momentum2 explicitly to 0.9 and 0.999 respectively, rather than relying on the default values.
I applied all changes in memory usage, solver_mnist proto,
const Dtype beta2 = this->param_.momentum2();

// we create an alias for memory from the SGD for convenience
shared_ptr<Blob<Dtype> > &val_m = this->history_[param_id];
A reference to a shared pointer is never a good idea; just copy the pointer.
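A minimal sketch of the suggested fix, with invented names (not the PR's actual code): copying the shared_ptr bumps the reference count and stays valid even if the owning container is later resized, whereas a reference into the vector can dangle.

```cpp
#include <memory>
#include <vector>

// Hypothetical sketch (names invented, not the PR's code): return a copy
// of the shared_ptr instead of binding a reference to the vector slot.
// The copy holds its own reference count, so it remains valid even if
// the owning container (e.g. history_) is resized; a reference could dangle.
std::shared_ptr<std::vector<double> > GetHistory(
    const std::vector<std::shared_ptr<std::vector<double> > >& history,
    int param_id) {
  return history[param_id];  // cheap: one atomic refcount increment
}
```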
The solver looks good to me now. I haven't tested it though. I do think that caffe_ipow should be its own PR if we really want to add it. Currently it just bloats this PR and doesn't add any benefit (see https://en.wikipedia.org/wiki/Amdahl%27s_law).
// update v <- \beta_2 v_{t-1} + (1-\beta_2)g_t^2
caffe_mul(N,
net_params[param_id]->cpu_diff(),
please fix indentation -- use 4 space indents when continuing from previous lines (https://google-styleguide.googlecode.com/svn/trunk/cppguide.html#Spaces_vs._Tabs)
@PatWie Thank you for this great PR! I just added some comments on the code. Please fix the indentation to 4-space indents when continuing from previous lines, and add more test cases. After that, squash into one commit and I can merge.
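For readers following the math, one full Adam update from the Kingma & Ba paper can be sketched in plain C++ (a hypothetical stand-in, not the PR's Caffe code; it uses the paper's alpha_t reformulation of the bias correction, with eps added after the square root):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical plain-C++ sketch of one Adam update (Kingma & Ba); this is
// not the PR's Caffe code.
void AdamStep(std::vector<double>& w, const std::vector<double>& g,
              std::vector<double>& m, std::vector<double>& v,
              double alpha, double beta1, double beta2, double eps, int t) {
  // bias-corrected step size for timestep t (t starts at 1)
  const double alpha_t = alpha * std::sqrt(1.0 - std::pow(beta2, t)) /
                         (1.0 - std::pow(beta1, t));
  for (std::size_t i = 0; i < w.size(); ++i) {
    m[i] = beta1 * m[i] + (1.0 - beta1) * g[i];          // first moment
    v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i];   // second moment
    w[i] -= alpha_t * m[i] / (std::sqrt(v[i]) + eps);    // parameter update
  }
}
```

Note that with the default hyperparameters, the very first step for a unit gradient has magnitude close to alpha, matching the paper's bounded-step property.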
template <typename Dtype>
void AdamSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<shared_ptr<Blob<Dtype> > >& net_params = this->net_->params();
To be consistent with #2866, use const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params(); instead.
This commit implements the Adam solver by Kingma et al. for CPU and GPU. All solver parameters are defined in caffe.proto. This also adds an example for the MNIST dataset.
I rebased onto the latest master branch commit and fixed up the conflicts with #2836 and #2866. It seems to be difficult for me to write the other tests (although I am pretty sure that the implementation is correct) without rewriting a fairly large amount of code (mostly duplicating it). In addition, it is not clear what the favored way is, nor some details such as whether the code should prevent the usage of weight decay, regularization, and other parameters. I think only the following tests make sense (I checked the existing tests):
One has to deal with the following issues:
Possible solutions would be:
or
This seems to be a serious design issue. Implementing the solver is fairly easy, but writing nearly the same code again and squeezing it into the current testing class needs hacky solutions. To refactor the unit test cases, one could use the curiously recurring template pattern to put everything from the solvers into its derived classes and just refer to the members in the base class. But then again, the testing method should use the solver as a template parameter, not just as a member field. Since this needs profound changes in the code, I would suggest letting a BVLC maintainer decide the next steps, or how to rewrite the solver interface w.r.t. these issues. I am glad to help, but don't want to rewrite mostly everything in reference to #2890.
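The CRTP idea mentioned above can be sketched as follows (invented names, not a proposal for the actual Caffe test classes): the shared driver logic lives in the base template and reaches solver-specific behavior through the derived type at compile time, avoiding duplicated driver code.

```cpp
#include <string>

// Hypothetical CRTP sketch (invented names, not the Caffe test classes):
// the base template owns the shared driver logic and dispatches to the
// concrete solver type at compile time, so per-solver state and behavior
// live in the derived class without duplicating the driver.
template <typename Derived>
class SolverTestBase {
 public:
  std::string RunAndDescribe() {
    // static dispatch into the derived class; no virtual call needed
    return static_cast<Derived*>(this)->SolverName();
  }
};

class AdamSolverTest : public SolverTestBase<AdamSolverTest> {
 public:
  std::string SolverName() const { return "Adam"; }
};
```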
@PatWie Thanks for your update! I checked the math and am quite confident in your solver implementation. However, at this point the PR is still not working, because snapshotting currently won't work for AdamSolver.
Since right now all the weight decay for AdamSolver is handled in
The original design is to use https://github.com/matthiasplappert/caffe/blob/adadelta/src/caffe/solver.cpp#L937-L947
The test implementation can have access to the solver's history; you may also take a look at the AdaDelta #2782 implementation: https://github.com/matthiasplappert/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L299-L307
You should use
I'll handle #2890 after merging the Adam and AdaDelta solvers, and I would like to refactor these solvers and look into the curiously recurring template pattern. For now, to get this PR merged, we need to make the snapshot work. So let's simply put val_m and val_v into
Rather than putting both vectors into
@PatWie after a private discussion with @jeffdonahue, I still feel using existing history (and expanding it, like what is done in #2782) would be easier to implement.
For the indexing, you can create a reference variable like
After merging this PR and AdaDelta, I can address #2890 and refactor the solvers afterwards.
This commit implements the Adam solver by Kingma et al. for CPU and
GPU. All solver parameters are defined in the caffe.proto. This also
adds an example for the MNIST dataset.
see issue #2827
Before merging, please review the code. I will push changes to this branch (and rebase) if something needs to change.