
FTML optimizer implementation #9262

Merged
Merged 3 commits into apache:master from the ftml branch on Jan 3, 2018

Conversation

@ZiyueHuang (Member) commented Dec 30, 2017

Description

FTML optimizer implementation, requested in #9182

The default values of beta1, beta2, and epsilon are the same as in keras-team/keras-contrib#110.

How should I add a test to verify the correctness of the implementation? @sxjscience

Here is the test for FTML in keras-contrib. Is that OK?

I have done only one experiment so far: FTML (val acc 0.756210 at the 10th epoch) converges faster than momentum SGD (val acc 0.684095 at the 10th epoch) on CIFAR-10, both using lr = 0.001, wd = 0, and resnet18_v1.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, API doc string has been updated.
    • For new C++ functions in header files, their functionalities and arguments are documented.
    • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
    • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

sigma_t = d_t - self.beta1 * prev_d
z_t = self.beta1 * prev_z + (1 - self.beta1) * grad - sigma_t * weight
# update weight
weight[:] = - z_t / d_t - lr * wd * weight
Member:
I think we should merge the wd term into the gradient. @szhengac could you help check this?

Member:
The rest of the formulas look good.
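
For readers following the formulas above, here is a minimal NumPy sketch of one full FTML step with the wd term folded into the gradient as suggested. It is only an illustration: the function name, state layout, and the default hyperparameters (taken from keras-contrib's FTML) are assumptions, not the code in this PR.

import numpy as np

def ftml_step(weight, grad, prev_v, prev_d, prev_z, t, lr,
              beta1=0.6, beta2=0.999, epsilon=1e-8, wd=0.0):
    # Fold the L2 / weight-decay term into the gradient, per the review comment.
    g = grad + wd * weight
    # Exponentially weighted second moment, as in Adam.
    v_t = beta2 * prev_v + (1. - beta2) * g * g
    # Per-coordinate "leader" weight with bias correction.
    d_t = (1. - beta1 ** t) / lr * (np.sqrt(v_t / (1. - beta2 ** t)) + epsilon)
    sigma_t = d_t - beta1 * prev_d
    z_t = beta1 * prev_z + (1. - beta1) * g - sigma_t * weight
    # Follow the moving leader.
    new_weight = -z_t / d_t
    return new_weight, v_t, d_t, z_t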

@sxjscience (Member):

Testing the optimizer is a difficult problem and we haven't found a good solution. Currently I think this kind of test, which optimizes a simple problem and checks the error, should be enough: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L648-L672. Also, would you add the C++ version as well? If it is added, we can test it against the Python version.
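
A minimal sketch of such a test, assuming the optimizer is registered under the name 'ftml' (the problem, learning rate, and tolerance below are illustrative, not the test that was actually added): run the Python optimizer on a simple quadratic and check that the error shrinks.

import mxnet as mx
import numpy as np

def test_ftml_on_quadratic():
    # Minimize f(w) = ||w - w_star||^2 with the new optimizer.
    w_star = mx.nd.array(np.random.randn(10))
    weight = mx.nd.zeros(10)
    opt = mx.optimizer.create('ftml', learning_rate=0.1)
    state = opt.create_state(0, weight)
    init_err = np.linalg.norm((weight - w_star).asnumpy())
    for _ in range(500):
        grad = 2 * (weight - w_star)  # exact gradient of the quadratic
        opt.update(0, weight, grad, state)
    final_err = np.linalg.norm((weight - w_star).asnumpy())
    assert final_err < 0.1 * init_err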

@szhengac (Contributor):

For weight decay, it may be correct, but for an l2 regularizer, we can either incorporate the grad w.r.t. the l2 regularizer into the complete grad or use the following formula:
[screenshot: closed-form update with the l2 regularization term]
where \lambda_2 is the regularization parameter. If elastic net is considered, the following one can be used:
[screenshot: closed-form update for the elastic-net case]
where \lambda_1 is the regularization parameter for the $\ell_1$ part.

Also, I think it is more efficient to update the powers of beta1 and beta2 iteratively.
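
The two screenshots above do not survive in this text-only archive. Purely as a hedged guess at their content (not a reproduction), FTRL-style closed-form solutions of the FTML per-coordinate subproblem, with a penalty (\lambda_2 / 2) * theta^2 (and optionally \lambda_1 * |theta|) kept out of the gradient, typically take the form:

theta_t = - z_t / (d_t + \lambda_2)                                      (l2 only)
theta_t = - sign(z_t) * max(|z_t| - \lambda_1, 0) / (d_t + \lambda_2)    (elastic net)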

@ZiyueHuang changed the title from "[WIP] FTML optimizer implementation" to "FTML optimizer implementation" on Dec 31, 2017
@ZiyueHuang (Member Author):

Thanks for your comments.

  • The C++ version is added.
  • I think weight decay corresponds to l2 regularization in MXNet. The l2 regularizer is now incorporated into the gradients.
  • Right, it would be more efficient to update the powers of beta1 and beta2 iteratively, but t = self._index_update_count[index] is not always strictly increasing from 0 (see the sketch below).
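
A sketch of the pattern being referred to (the temporary names coef1/coef2 are illustrative, not from the PR):

t = self._index_update_count[index]
# t is tracked per parameter index and need not start at 0 or advance by
# exactly 1 between calls, so the bias-correction powers are recomputed
# from t on every update rather than accumulated iteratively.
coef1 = 1. - self.beta1 ** t
coef2 = 1. - self.beta2 ** t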

@@ -529,6 +529,55 @@ def update_multi_precision(self, index, weight, grad, state):
        self._update_impl(index, weight, grad, state,
                          multi_precision=use_multi_precision)


@register
class FTML(Optimizer):
Contributor:
I thought we already have this

Member:
We have FTRL (Follow the regularized leader). This PR adds FTML (Follow the moving leader).

grad = grad * self.rescale_grad
if self.clip_gradient is not None:
    grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)
grad += wd * weight
Member:
We should clip after adding the gradient of L2. This is consistent with other optimizers.

Member Author (@ZiyueHuang):
It seems that the L2 term is outside the clip in other optimizers, such as SGD in https://github.com/apache/incubator-mxnet/blob/master/src/operator/optimizer_op-inl.h#L76-L78.

Member:
In Adam, it's clipped outside. So our current optimizers have this kind of inconsistent behavior. I think clipping the gradient without wd is wrong.

Member:
I mean without the WD part.

Member Author (@ZiyueHuang):
Got it. Thanks. Now the WD part is added into the gradients.
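
To make the two orderings being debated concrete, an illustrative sketch (not the final merged code):

# (a) clip only the rescaled gradient, then add the weight-decay term
#     (the SGD order linked above):
grad = mx.nd.clip(grad * self.rescale_grad,
                  -self.clip_gradient, self.clip_gradient) + wd * weight
# (b) fold the weight-decay term into the gradient first, then clip
#     (clipping after adding the L2 gradient, as requested here):
grad = mx.nd.clip(grad * self.rescale_grad + wd * weight,
                  -self.clip_gradient, self.clip_gradient)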

using namespace mshadow_op;
const DType grad_i = clip_grad >= 0.0f
    ? (clip::Map(rescale_grad * grad[i], clip_grad) + wd * weight[i])
    : (rescale_grad * grad[i] + wd * weight[i]);
Member:
We should clip after adding the gradient of L2. This is consistent with other optimizers.

@piiswrong piiswrong merged commit 12cb0d2 into apache:master Jan 3, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients
@ZiyueHuang ZiyueHuang deleted the ftml branch January 30, 2018 11:31
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients