fuse L2Decay and momentum when param.regularizer is set #32845

Merged: 5 commits merged into PaddlePaddle:develop on Jun 10, 2021

Conversation

@zhangting2020 (Contributor) commented May 11, 2021

PR types

Performance optimization

PR changes

Others

Describe

fuse L2Decay and momentum when param.regularizer is set

before

Paddle currently supports fusing momentum + L2Decay:

  • When weight_decay is set to a float or to L2Decay, py_regular is set to None so that weight decay is not applied separately in the regularization pass:
    predicate = lambda regular: isinstance(regular, (L2DecayRegularizer, float))
    py_regular = None if predicate(weight_decay) else weight_decay
    super(Momentum, self).__init__(
        learning_rate=learning_rate,
        parameters=parameters,
        weight_decay=py_regular,
        grad_clip=grad_clip,
        name=name)
  • Then, in _append_optimize_op, the following momentum op attributes are set, so that both the weight decay and the momentum update are computed inside the momentum op, which achieves the fusion:
    'use_nesterov', self._use_nesterov, 'regularization_method',
    self._regularization_method, 'regularization_coeff',

However, if the model sets a global regularizer=L2Decay through Momentum's weight_decay argument, while some layers also set their own regularizer through paddle.ParamAttr, the following happens (a minimal configuration that triggers this case is sketched after this list):

  • (1) First, in the Momentum API, the fusion condition is met, so self.regularization is set to None, while self._regularization_method and self._regularization_coeff are recorded; they are used later in _append_optimize_op to set the momentum op attributes and achieve the fusion.
  • (2) The backward pass then has two important steps: append_regularization_ops(params_grads, self.regularization) and self._create_optimization_pass(params_grads).
    • In append_regularization_ops, _create_regularization_of_grad is called to apply weight decay. As the code below shows, the parameter's own regularizer is executed:
      def _create_regularization_of_grad(param, grad, regularization=None):
          """ Create and add backward regularization Operators

          Function helper of append_regularization_ops.
          """
          # If no gradient or no regularization is specified, then we don't need to do anything
          if grad is None or ((not hasattr(param, 'regularizer') or (
                  hasattr(param, 'regularizer') and param.regularizer is None)) and
                              regularization is None):
              return grad
          regularization_term = None
          if hasattr(param, 'regularizer') and param.regularizer is not None:
              # Add variable for regularization term in grad block
              regularization_term = param.regularizer(param, grad, grad.block)
          elif regularization is not None:
              regularization_term = regularization(param, grad, grad.block)
    • Next, self._create_optimization_pass calls Momentum's _append_optimize_op. Because step (1) already set self._regularization_method and self._regularization_coeff, the momentum op applies weight decay a second time.
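
A minimal sketch of a configuration that triggers this case (assumed paddle 2.x dygraph API; the layer sizes and coefficients below are made up for illustration):

    import paddle
    from paddle.regularizer import L2Decay

    # fc1 sets its own L2Decay through ParamAttr; fc2 relies on the global setting
    attr = paddle.ParamAttr(regularizer=L2Decay(coeff=1e-4))
    fc1 = paddle.nn.Linear(8, 8, weight_attr=attr)
    fc2 = paddle.nn.Linear(8, 2)
    model = paddle.nn.Sequential(fc1, fc2)

    opt = paddle.optimizer.Momentum(
        learning_rate=0.1,
        momentum=0.9,
        parameters=model.parameters(),
        weight_decay=L2Decay(1e-4))  # global L2Decay -> Momentum takes the fusion path

    x = paddle.randn([4, 8])
    loss = paddle.mean(model(x))
    loss.backward()
    opt.step()       # before this PR, fc1's weight was regularized twice in this step
    opt.clear_grad()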

after

append_regularization_ops(params_grads, self.regularization) iterates over all parameters and applies each parameter's regularization. When Momentum is used, we need to check, while iterating, whether a parameter's regularizer is L2Decay; if so, its regularization is skipped there, and the momentum op's regularization_method attribute is set later in _append_optimize_op. This PR therefore makes the following changes:

  • append_regularization_ops and _create_regularization_of_grad are removed from /python/paddle/fluid/regularizer.py and moved into optimizer.py as instance methods of the Optimizer class, so that other optimizers are not affected.
    def _create_regularization_of_grad(param, grad, regularization=None):
        """ Create and add backward regularization Operators

        Function helper of append_regularization_ops.
        """
        # If no gradient or no regularization is specified, then we don't need to do anything
        if grad is None or ((not hasattr(param, 'regularizer') or (
                hasattr(param, 'regularizer') and param.regularizer is None)) and
                            regularization is None):
            return grad
        regularization_term = None
        if hasattr(param, 'regularizer') and param.regularizer is not None:
            # Add variable for regularization term in grad block
            regularization_term = param.regularizer(param, grad, grad.block)
        elif regularization is not None:
            regularization_term = regularization(param, grad, grad.block)
        assert regularization_term is not None

        new_grad = grad
        if grad.type == core.VarDesc.VarType.SELECTED_ROWS:
            # FIXME(zcd): If the grad is SELECTED_ROWS, after regularization,
            # the grad's type and name will be changed. But the gradient's name
            # is used in ParallelExecutor Reduce mode, so I add a flag for
            # the new_grad here.
            new_grad = grad.block.create_var(
                name=grad.name + core.kNewGradSuffix(),
                dtype=param.dtype,
                shape=param.shape,
                lod_level=param.lod_level,
                type=core.VarDesc.VarType.LOD_TENSOR)

        inputs = {"X": [grad, regularization_term]}
        outputs = {"Out": [new_grad]}
        if in_dygraph_mode():
            new_grad = core.ops.sum([grad, regularization_term])
        else:
            grad.block.append_op(type='sum', inputs=inputs, outputs=outputs)

        return new_grad

    def append_regularization_ops(parameters_and_grads, regularization=None):
        r"""Create and add backward regularization Operators

        Creates and adds backward regularization operators in the BlockDesc.
        This will add gradients of the regularizer function to the gradients
        of the parameters and return these modified gradients. This is the
        same as implementing weight decay in optimizers for regularization.

        Args:
            parameters_and_grads: A list of (parameters, gradients) pairs
                that need to be regularized.
            regularization: A global regularizer. If the parameter is not
                set. It will be applied with regularizer.

        Returns:
            list[(Variable, Variable)]: list of (parameters, gradients) \
            pair with the regularized gradient

        Raises:
            Exception: Unknown regularization type
        """
        params_and_grads = []
        if in_dygraph_mode():
            for param, grad in parameters_and_grads:
                new_grad = _create_regularization_of_grad(param, grad,
                                                          regularization)
                params_and_grads.append((param, new_grad))
        else:
            repeate_regularizer = False
            with framework.name_scope('regularization'):
                for param, grad in parameters_and_grads:
                    if not repeate_regularizer and param.regularizer is not None and regularization is not None:
                        repeate_regularizer = True
                        logging.info(
                            "If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. "
                            "The Regularization[%s] in Optimizer will not take effect, and it will only be applied to other Parameters!"
                            % regularization.__str__())
                    with param.block.program._optimized_guard([param, grad]):
                        new_grad = _create_regularization_of_grad(param, grad,
                                                                  regularization)
                        params_and_grads.append((param, new_grad))
        return params_and_grads
  • _create_regularization_of_grad is overridden for Momentum. The only difference from the parent class's method is that when a parameter's regularizer is L2Decay, the parameter's regularization is skipped directly. See the changes to momentum.py in this PR:
    def _create_regularization_of_grad(self, param, grad, regularization=None):
        """ Create and add backward regularization Operators
    
        Function helper of append_regularization_ops.
        """
        # If ParamAttr is set to L2Decay, we skip doing regularization here. And then we fused
        # L2Decay with momentum which can refer to _append_optimize_op below.
        if hasattr(param, 'regularizer') and isinstance(param.regularizer,
                                                        L2DecayRegularizer):
            return grad
        return super(Momentum, self)._create_regularization_of_grad(
            param, grad, regularization)
  • _append_optimize_op: before appending the optimize op for each parameter, the following code is added, which gives a parameter's own L2DecayRegularizer the highest priority. If a parameter does not set its own regularizer, the global regularizer setting is still used.
        if hasattr(param, 'regularizer'):
            # we skip param's l2decay before, so fuse it with momentum here.
            if isinstance(param.regularizer, L2DecayRegularizer):
                self._regularization_method = "l2_decay"
                self._regularization_coeff = param.regularizer._regularization_coeff
            # the param's regularization has been done before, we avoid do l2decay in momentum.
            elif param.regularizer is not None:
                self._regularization_method = ""
                self._regularization_coeff = 0

In summary, whenever a parameter's own regularizer is L2Decay, that regularizer replaces the global setting, which avoids regularizing the parameter twice while still keeping the fusion.
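
A rough way to observe this in static graph mode (a sketch under assumed paddle 2.x static APIs, not part of this PR; layer size and coefficients are illustrative): build a tiny network where one parameter carries its own L2Decay while Momentum also gets a global L2Decay, then count the scale/sum ops that the regularization pass appends to the program.

    import paddle
    from paddle.regularizer import L2Decay

    paddle.enable_static()
    main_prog = paddle.static.Program()
    startup_prog = paddle.static.Program()
    with paddle.static.program_guard(main_prog, startup_prog):
        x = paddle.static.data(name='x', shape=[None, 8], dtype='float32')
        attr = paddle.ParamAttr(regularizer=L2Decay(1e-4))      # per-parameter L2Decay
        y = paddle.static.nn.fc(x, size=4, weight_attr=attr)
        loss = paddle.mean(y)
        opt = paddle.optimizer.Momentum(learning_rate=0.1, momentum=0.9,
                                        weight_decay=L2Decay(1e-4))  # global L2Decay
        opt.minimize(loss)

    ops = [op.type for op in main_prog.global_block().ops]
    # With the fix, the fc weight is no longer regularized by separate scale/sum ops;
    # its L2Decay is folded into the momentum op's regularization_method attribute.
    print(ops.count('scale'), ops.count('sum'))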

performance

TSM was used for testing. This model sets its own regularizer=L2Decay on some parameters, so before the fix those parameters were regularized twice; the profile report shows the resulting extra scale and sum calls.

  • Before the fix
-------------------------       Event Summary       -------------------------

Event                                    Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
BufferedReader:MemoryCopy                11          1840.16     1840.155303 (1.000000)  0.000000 (0.000000)     87.699      914.545     167.287     0.256218
GpuMemcpySync:GPU->CPU                   50          1298.04     1297.940395 (0.999920)  0.103584 (0.000080)     0.028043    133.824     25.9609     0.180736
conv2d_grad                              530         803.236     184.263699 (0.229402)   618.971983 (0.770598)   1.02905     2.70643     1.51554     0.11184
conv2d                                   530         504.971     176.378244 (0.349284)   328.593083 (0.650716)   0.564997    8.46123     0.952776    0.0703108
  cast                                   540         42.9775     37.497626 (0.872495)    5.479826 (0.127505)     0.036599    5.4637      0.0795879   0.00598406
softmax_with_cross_entropy               10          480.287     480.170577 (0.999758)   0.116033 (0.000242)     36.9285     58.1924     48.0287     0.0668738
  GpuMemcpySync:CUDAPinned->GPU          10          478.828     478.809623 (0.999963)   0.017889 (0.000037)     36.7578     58.0672     47.8828     0.0666706
batch_norm_grad                          530         353.873     37.233877 (0.105218)    316.638753 (0.894782)   0.130019    2.42401     0.667684    0.0492722
batch_norm                               530         300.522     77.403224 (0.257563)    223.118328 (0.742437)   0.17553     3.34711     0.567022    0.0418438
temporal_shift_grad                      160         287.278     6.675565 (0.023237)     280.602745 (0.976763)   0.518944    4.18037     1.79549     0.0399998
reshape2                                 40          249.519     132.272389 (0.530109)   117.246915 (0.469891)   0.018984    36.5348     6.23798     0.0347424
  GpuMemcpySync:CUDAPinned->GPU          10          243.271     126.024500 (0.518041)   117.246915 (0.481959)   23.4653     31.7538     24.3271     0.0338724
relu_grad                                490         180.565     12.798285 (0.070879)    167.766331 (0.929121)   0.068427    1.43153     0.368499    0.0251413
relu                                     490         137.57      19.896579 (0.144628)    117.673761 (0.855372)   0.069271    1.06388     0.280756    0.0191549
temporal_shift                           160         125.477     6.973246 (0.055574)     118.503274 (0.944426)   0.268912    1.84501     0.784228    0.017471
elementwise_add_grad                     170         106.375     5.121596 (0.048147)     101.253481 (0.951853)   0.05269     1.50784     0.625736    0.0148114
  GpuMemcpyAsync(same_gpu):GPU->GPU      10          0.145483    0.132651 (0.911797)     0.012832 (0.088203)     0.013715    0.01677     0.0145483   2.02566e-05
elementwise_add                          170         104.823     10.197299 (0.097282)    94.625259 (0.902718)    0.096113    1.46551     0.616603    0.0145952
  cast                                   10          0.388427    0.370507 (0.953865)     0.017920 (0.046135)     0.035419    0.049115    0.0388427   5.40835e-05
reduce_sum                               1620        81.8967     77.897542 (0.951168)    3.999151 (0.048832)     0.034907    0.21187     0.0505535   0.0114031
elementwise_mul                          1620        52.9484     48.119899 (0.908807)    4.828500 (0.091193)     0.025545    0.077848    0.0326842   0.00737239
momentum                                 1610        52.8795     44.607367 (0.843567)    8.272096 (0.156433)     0.024489    0.101409    0.0328444   0.00736279
square                                   1610        46.2763     42.074616 (0.909204)    4.201711 (0.090796)     0.021599    0.229078    0.0287431   0.00644339
pool2d_grad                              20          42.9491     1.158003 (0.026962)     41.791107 (0.973038)    0.990633    3.42645     2.14746     0.00598011
scale                                    1120        31.4837     29.965698 (0.951785)    1.517989 (0.048215)     0.022827    0.112202    0.0281104   0.0043837
sum                                      1080        30.8782     29.404994 (0.952289)    1.473219 (0.047711)     0.023598    0.115157    0.0285909   0.0042994
ClearGradient                            1610        19.811      16.890283 (0.852570)    2.920743 (0.147430)     0.009895    0.061578    0.012305    0.00275843
cast                                     550         15.0069     12.385047 (0.825290)    2.621865 (0.174710)     0.019099    0.063333    0.0272853   0.00208952
pool2d                                   20          9.58054     1.805983 (0.188505)     7.774558 (0.811495)     0.178302    0.846014    0.479027    0.00133397
check_finite_and_unscale                 10          5.66584     2.835464 (0.500449)     2.830379 (0.499551)     0.532772    0.611612    0.566584    0.000788896
  GpuMemcpyAsync:CPU->GPU                20          0.460414    0.422398 (0.917431)     0.038016 (0.082569)     0.012238    0.037567    0.0230207   6.41068e-05
top_k                                    20          4.05014     2.910747 (0.718678)     1.139395 (0.281322)     0.162751    0.25443     0.202507    0.00056393
matmul                                   10          2.26218     2.049406 (0.905945)     0.212769 (0.094055)     0.205148    0.268485    0.226218    0.000314979
  cast                                   10          0.428644    0.402468 (0.938933)     0.026176 (0.061067)     0.040432    0.051356    0.0428644   5.96832e-05
concat                                   10          2.19238     2.143129 (0.977536)     0.049249 (0.022464)     0.170437    0.323567    0.219238    0.000305261
  GpuMemcpyAsync:CPU->GPU                10          0.365593    0.347161 (0.949583)     0.018432 (0.050417)     0.027336    0.055756    0.0365593   5.09041e-05
matmul_grad                              10          1.5975      1.266170 (0.792594)     0.331332 (0.207406)     0.149483    0.180775    0.15975     0.000222432
fill_constant                            20          1.4446      1.420055 (0.983010)     0.024544 (0.016990)     0.052769    0.108241    0.07223     0.000201142
accuracy                                 20          1.08109     0.943103 (0.872365)     0.137984 (0.127635)     0.041539    0.072848    0.0540543   0.000150528
reshape2_grad                            30          0.993977    0.943513 (0.949230)     0.050464 (0.050770)     0.028249    0.043074    0.0331326   0.000138399
  GpuMemcpyAsync(same_gpu):GPU->GPU      30          0.586984    0.536520 (0.914028)     0.050464 (0.085972)     0.015645    0.029971    0.0195661   8.173e-05
softmax_with_cross_entropy_grad          10          0.949492    0.892500 (0.939976)     0.056992 (0.060024)     0.083984    0.113048    0.0949492   0.000132205
  GpuMemcpyAsync(same_gpu):GPU->GPU      10          0.37421     0.360642 (0.963742)     0.013568 (0.036258)     0.033563    0.042212    0.037421    5.2104e-05
elementwise_max                          10          0.813877    0.799477 (0.982307)     0.014400 (0.017693)     0.066026    0.124007    0.0813877   0.000113322
reduce_mean                              10          0.722829    0.650829 (0.900391)     0.072000 (0.099609)     0.066224    0.085834    0.0722829   0.000100645
dropout                                  10          0.682767    0.621263 (0.909919)     0.061504 (0.090081)     0.061739    0.093977    0.0682767   9.50666e-05
mean                                     10          0.6813      0.657172 (0.964585)     0.024128 (0.035415)     0.06315     0.084201    0.06813     9.48623e-05
elementwise_mul_grad                     10          0.551161    0.537145 (0.974570)     0.014016 (0.025430)     0.042076    0.099413    0.0551161   7.67421e-05
elementwise_div                          10          0.438527    0.424031 (0.966944)     0.014496 (0.033056)     0.035509    0.068166    0.0438527   6.10593e-05
reduce_mean_grad                         10          0.407826    0.369522 (0.906078)     0.038304 (0.093922)     0.037561    0.046573    0.0407826   5.67845e-05
dropout_grad                             10          0.39087     0.356566 (0.912237)     0.034304 (0.087763)     0.035845    0.043486    0.039087    5.44236e-05
sqrt                                     10          0.37579     0.357038 (0.950100)     0.018752 (0.049900)     0.031394    0.059225    0.037579    5.23239e-05
mean_grad                                10          0.282389    0.268181 (0.949686)     0.014208 (0.050314)     0.024482    0.033846    0.0282389   3.93191e-05
  • After the fix: the extra scale and sum calls are eliminated
-------------------------       Event Summary       -------------------------

Event                                    Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
BufferedReader:MemoryCopy                11          2039.99     2039.987536 (1.000000)  0.000000 (0.000000)     87.3453     1077.69     185.453     0.281856
GpuMemcpySync:GPU->CPU                   50          1279.59     1279.485086 (0.999919)  0.103712 (0.000081)     0.02676     130.373     25.5918     0.176795
conv2d_grad                              530         806.911     191.328553 (0.237112)   615.582408 (0.762888)   1.02098     2.85402     1.52247     0.111487
conv2d                                   530         501.536     174.536693 (0.348004)   326.999217 (0.651996)   0.563483    6.97771     0.946294    0.0692949
  cast                                   540         33.4735     28.018821 (0.837045)    5.454658 (0.162955)     0.037195    3.43293     0.0619879   0.00462488
softmax_with_cross_entropy               10          439.398     439.334647 (0.999856)   0.063232 (0.000144)     12.0012     50.9567     43.9398     0.0607096
  GpuMemcpySync:CUDAPinned->GPU          10          438.112     438.093984 (0.999958)   0.018464 (0.000042)     11.8627     50.8194     43.8112     0.060532
batch_norm_grad                          530         354.622     38.316624 (0.108049)    316.305055 (0.891951)   0.129126    2.42289     0.669098    0.0489964
batch_norm                               530         304.848     82.266444 (0.269861)    222.581451 (0.730139)   0.173148    1.83929     0.575185    0.0421194
temporal_shift_grad                      160         287.232     6.774842 (0.023587)     280.457419 (0.976413)   0.517822    4.14435     1.7952      0.0396855
reshape2                                 40          258.072     140.727420 (0.545302)   117.344955 (0.454698)   0.018812    38.3899     6.45181     0.0356567
  GpuMemcpySync:CUDAPinned->GPU          10          247.936     130.591178 (0.526713)   117.344955 (0.473287)   23.4667     30.0521     24.7936     0.0342562
relu_grad                                490         178.291     13.779154 (0.077285)    164.511777 (0.922715)   0.068098    1.40939     0.363859    0.0246336
relu                                     490         137.099     22.466188 (0.163868)    114.632853 (0.836132)   0.068474    1.07078     0.279794    0.0189423
temporal_shift                           160         125.693     7.633702 (0.060733)     118.059392 (0.939267)   0.267909    1.89513     0.785582    0.0173664
elementwise_add_grad                     170         106.368     5.216390 (0.049041)     101.152051 (0.950959)   0.052676    1.50915     0.625697    0.0146964
  GpuMemcpyAsync(same_gpu):GPU->GPU      10          0.159193    0.146297 (0.918991)     0.012896 (0.081009)     0.013933    0.023808    0.0159193   2.1995e-05
elementwise_add                          170         106.054     11.438158 (0.107852)    94.616143 (0.892148)    0.100335    1.51075     0.623849    0.014653
  cast                                   10          0.421136    0.403664 (0.958512)     0.017472 (0.041488)     0.035507    0.063357    0.0421136   5.81864e-05
reduce_sum                               1620        68.1046     64.061045 (0.940627)    4.043587 (0.059373)     0.032251    0.217129    0.0420399   0.0094097
momentum                                 1610        46.9955     38.774473 (0.825068)    8.221030 (0.174932)     0.0237      0.100193    0.0291898   0.00649315
elementwise_mul                          1620        46.2526     41.393091 (0.894936)    4.859490 (0.105064)     0.024591    0.092522    0.028551    0.0063905
pool2d_grad                              20          42.7475     1.207247 (0.028241)     41.540218 (0.971759)    0.992029    3.2843      2.13737     0.00590622
square                                   1610        38.5644     34.540008 (0.895646)    4.024354 (0.104354)     0.020296    0.132276    0.023953    0.00532826
ClearGradient                            1610        17.6084     14.661259 (0.832630)    2.947104 (0.167370)     0.009408    0.035958    0.0109369   0.00243287
cast                                     550         15.4529     12.836667 (0.830695)    2.616257 (0.169305)     0.018939    0.067145    0.0280962   0.00213506
pool2d                                   20          9.62612     1.848928 (0.192074)     7.777188 (0.807926)     0.175864    0.893863    0.481306    0.00133
check_finite_and_unscale                 10          5.56346     2.737282 (0.492011)     2.826177 (0.507989)     0.529087    0.619342    0.556346    0.000768677
  GpuMemcpyAsync:CPU->GPU                20          0.45365     0.415570 (0.916059)     0.038080 (0.083941)     0.011739    0.03796     0.0226825   6.26787e-05
top_k                                    20          4.18323     3.251266 (0.777213)     0.931969 (0.222787)     0.161114    0.304403    0.209162    0.000577978
matmul                                   10          2.35787     2.144747 (0.909613)     0.213120 (0.090387)     0.207042    0.348031    0.235787    0.000325776
  cast                                   10          0.472581    0.447557 (0.947048)     0.025024 (0.052952)     0.040221    0.068483    0.0472581   6.52943e-05
concat                                   10          1.91381     1.869490 (0.976842)     0.044320 (0.023158)     0.179366    0.210768    0.191381    0.000264422
  GpuMemcpyAsync:CPU->GPU                10          0.325034    0.306922 (0.944277)     0.018112 (0.055723)     0.028022    0.037014    0.0325034   4.49084e-05
scale                                    40          1.61485     1.562497 (0.967581)     0.052352 (0.032419)     0.024309    0.077732    0.0403712   0.000223116
matmul_grad                              10          1.58093     1.251904 (0.791879)     0.329024 (0.208121)     0.149025    0.185475    0.158093    0.000218429
fill_constant                            20          1.33589     1.311374 (0.981651)     0.024512 (0.018349)     0.057007    0.077949    0.0667943   0.000184573
accuracy                                 20          1.11247     0.974321 (0.875822)     0.138144 (0.124178)     0.040128    0.081296    0.0556232   0.000153704
reshape2_grad                            30          1.05939     1.008763 (0.952214)     0.050624 (0.047786)     0.028512    0.051742    0.0353129   0.000146371
  GpuMemcpyAsync(same_gpu):GPU->GPU      30          0.625437    0.574813 (0.919058)     0.050624 (0.080942)     0.014734    0.032356    0.0208479   8.64137e-05
softmax_with_cross_entropy_grad          10          0.798398    0.757342 (0.948577)     0.041056 (0.051423)     0.068995    0.118012    0.0798398   0.000110311
  GpuMemcpyAsync(same_gpu):GPU->GPU      10          0.360293    0.346021 (0.960388)     0.014272 (0.039612)     0.029698    0.052553    0.0360293   4.978e-05
reduce_mean                              10          0.780475    0.709627 (0.909225)     0.070848 (0.090775)     0.06818     0.111653    0.0780475   0.000107835
dropout                                  10          0.728231    0.666599 (0.915368)     0.061632 (0.084632)     0.063023    0.112013    0.0728231   0.000100616
mean                                     10          0.690836    0.665652 (0.963546)     0.025184 (0.036454)     0.059973    0.088384    0.0690836   9.54496e-05
elementwise_max                          10          0.683114    0.667818 (0.977608)     0.015296 (0.022392)     0.064246    0.077212    0.0683114   9.43827e-05
elementwise_mul_grad                     10          0.4944      0.480640 (0.972168)     0.013760 (0.027832)     0.044396    0.064752    0.04944     6.8309e-05
reduce_mean_grad                         10          0.43625     0.395898 (0.907503)     0.040352 (0.092497)     0.037647    0.056163    0.043625    6.02746e-05
dropout_grad                             10          0.379695    0.344239 (0.906620)     0.035456 (0.093380)     0.034048    0.046717    0.0379695   5.24607e-05
elementwise_div                          10          0.352212    0.337652 (0.958661)     0.014560 (0.041339)     0.034394    0.037972    0.0352212   4.86635e-05
sqrt                                     10          0.30973     0.291586 (0.941420)     0.018144 (0.058580)     0.029357    0.039883    0.030973    4.2794e-05
mean_grad                                10          0.276216    0.262040 (0.948678)     0.014176 (0.051322)     0.024579    0.037008    0.0276216   3.81635e-05

The bug may also have affected convergence speed and accuracy. Comparing the training logs before and after the fix:

  • Before the fix: 0.7018 is reached after 40 epochs (training log screenshot)
  • After the fix: around epoch 27 the accuracy already reaches 0.7036 (training log screenshot)

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot-old

Sorry to inform you that e88475d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

if framework.in_dygraph_mode():
    new_grad = core.ops.sum([grad, regularization_term])
else:
    grad.block.append_op(type='sum', inputs=inputs, outputs=outputs)
Contributor:

Could L270 - L297 be written directly as a call to the base class's _create_regularization_of_grad function?

Contributor Author (zhangting2020):

done

Xreki previously approved these changes Jun 4, 2021

@Xreki (Contributor) left a comment:

LGTM

@zhiqiu (Contributor) left a comment:

LGTM

@zhangting2020 merged commit a526b3e into PaddlePaddle:develop Jun 10, 2021
zhangting2020 added a commit to zhangting2020/Paddle that referenced this pull request Jun 10, 2021
lanxianghit pushed a commit that referenced this pull request Jun 10, 2021:
fuse L2Decay and momentum when param.regularizer is set (#32845) (#32881), a cherry-pick of #32845