Conversation
python/mxnet/optimizer.py (outdated)
sigma_t = d_t - self.beta1 * prev_d
z_t = self.beta1 * prev_z + (1 - self.beta1) * grad - sigma_t * weight
# update weight
weight[:] = - z_t / d_t - lr * wd * weight
I think we should merge the wd term into the gradient. @szhengac could you help check this?
The rest of the formulas look good.
Testing the optimizer is a difficult problem and we haven't found a good solution. Currently I think this kind of test, which optimizes a simple problem and checks the error, should be enough: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L648-L672. Would you also add the C++ version? If C++ is added, we can test it against the Python version.
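To make that suggestion concrete, here is a minimal sketch of such a test in the spirit of the one linked above; the helper name, step count, and tolerance are illustrative, and the 'ftml' registration name assumes the class added in this PR:

import mxnet as mx
import numpy as np

def check_optimizer_converges(opt, n_steps=2000):
    # Minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w,
    # using the generic Optimizer.create_state/update interface.
    weight = mx.nd.array(np.random.uniform(-1, 1, (10,)))
    state = opt.create_state(0, weight)
    for _ in range(n_steps):
        grad = weight.copy()  # gradient of 0.5 * ||w||^2 at the current weight
        opt.update(0, weight, grad, state)
    # The minimizer is w = 0, so the weight norm should have shrunk substantially.
    assert mx.nd.norm(weight).asscalar() < 1e-1

check_optimizer_converges(mx.optimizer.create('ftml', learning_rate=0.01))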
Thanks for your comments.
@@ -529,6 +529,55 @@ def update_multi_precision(self, index, weight, grad, state):
        self._update_impl(index, weight, grad, state,
                          multi_precision=use_multi_precision)


@register
class FTML(Optimizer):
I thought we already had this.
We have FTRL (Follow the regularized leader). This PR adds FTML (Follow the moving leader).
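For readers comparing the two, here is a rough NumPy sketch of a single FTML step, reconstructed from the snippet above and the FTML paper; the d_t formula and the default values of beta1, beta2, and eps are assumptions taken from the paper and the keras-contrib port, not this PR's final code:

import numpy as np

def ftml_step(weight, grad, state, t, lr, beta1=0.6, beta2=0.999, eps=1e-8):
    # state holds the previous (d, v, z), all initialised to zeros_like(weight);
    # t is the 1-based step count. Weight decay and gradient clipping are
    # omitted here; see the discussion below about how they are folded in.
    prev_d, prev_v, prev_z = state
    v_t = beta2 * prev_v + (1 - beta2) * grad * grad
    d_t = (1 - beta1 ** t) / lr * (np.sqrt(v_t / (1 - beta2 ** t)) + eps)
    sigma_t = d_t - beta1 * prev_d
    z_t = beta1 * prev_z + (1 - beta1) * grad - sigma_t * weight
    weight = -z_t / d_t  # "follow the moving leader" step
    return weight, (d_t, v_t, z_t)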
grad = grad * self.rescale_grad
if self.clip_gradient is not None:
    grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)
grad += wd * weight
We should clip after adding the gradient of L2. This is consistent with other optimizers.
It seems that the L2 term is outside the clip in other optimizers, such as SGD in https://github.com/apache/incubator-mxnet/blob/master/src/operator/optimizer_op-inl.h#L76-L78.
In Adam, it’s clipped outside. So our current optimizers have this kind of inconsistent behavior. I think clipping the gradient without wd is wrong.
I mean without the WD part.
Got it, thanks. The WD part is now added into the gradient.
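To make the resolution concrete, here is an illustrative sketch of the two orderings discussed above (the function names are mine, not part of the PR):

import mxnet as mx

def clip_then_add_wd(grad, weight, wd, rescale_grad, clip_gradient):
    # SGD-style: clip the rescaled gradient first, then add the wd term.
    g = mx.nd.clip(grad * rescale_grad, -clip_gradient, clip_gradient)
    return g + wd * weight

def add_wd_then_clip(grad, weight, wd, rescale_grad, clip_gradient):
    # The ordering the reviewer asks for: fold the wd term in before clipping.
    return mx.nd.clip(grad * rescale_grad + wd * weight,
                      -clip_gradient, clip_gradient)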
src/operator/optimizer_op-inl.h (outdated)
using namespace mshadow_op;
const DType grad_i = clip_grad >= 0.0f
                     ? (clip::Map(rescale_grad * grad[i], clip_grad) + wd * weight[i])
                     : (rescale_grad * grad[i] + wd * weight[i]);
We should clip after adding the gradient of L2. This is consistent with other optimizers.
* ftml implementation
* c++ version and test
* merge WD into gradients
Description
FTML optimizer implementation, requested in #9182
The default values of beta1, beta2, and epsilon are the same as in keras-team/keras-contrib#110. How should I add a test to verify the correctness of the implementation? @sxjscience
Here is the test for FTML in keras-contrib. Is that OK?
I have done only one experiment: FTML (val acc 0.756210 at the 10th epoch) converges faster than momentum SGD (val acc 0.684095 at the 10th epoch) on CIFAR-10, both using lr = 0.001, wd = 0 and resnet18_v1.
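For context, a minimal Gluon-style sketch of how such a comparison could be wired up (data loading and the training loop are omitted; the 'ftml' name assumes the registration added in this PR):

import mxnet as mx
from mxnet import gluon
from mxnet.gluon.model_zoo import vision

net = vision.resnet18_v1(classes=10)   # CIFAR-10 has 10 classes
net.initialize(mx.init.Xavier())
# Swap 'ftml' for 'sgd' (with a 'momentum' entry) to reproduce the baseline run.
trainer = gluon.Trainer(net.collect_params(),
                        'ftml',
                        {'learning_rate': 0.001, 'wd': 0})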
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments