
All gradients are zero (v2 API) #4381

Closed
liuyuuan opened this issue Sep 26, 2017 · 7 comments

Comments

@liuyuuan
Contributor

liuyuuan commented Sep 26, 2017

I used the following training procedure:

    def train(args):
        paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)

        optimizer = paddle.optimizer.Momentum(
            momentum=0.9,
            learning_rate=1e-3,
            regularization=paddle.optimizer.L2Regularization(rate=1e-3))
        feeding = {...}
        reader = some_dataset.create_reader()
        train_batch_reader = paddle.batch(reader=reader, batch_size=args.batch_size)
        cost = my_network(args)
        parameters = paddle.parameters.create(cost)
        # trainer construction, presumably elided from the original snippet
        trainer = paddle.trainer.SGD(cost=cost,
                                     parameters=parameters,
                                     update_equation=optimizer)

        def event_handler(event):
            """print logs"""
            ...

        trainer.train(reader=train_batch_reader,
                      event_handler=event_handler,
                      num_passes=10,
                      feeding=feeding)

The cost uses sum_cost, simplified as follows:

    label_probs = paddle.layer.scaling(input=neg_log_probs, weight=labels)
    cost = paddle.layer.sum_cost(input=label_probs)

where neg_log_probs and labels are both sequences of dense_vector(1).

However, all the gradients produced by training are zero, while the parameter values look normal and the cost value is also normal. It looks as if only the forward pass were executed. What could be causing this?
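
For reference, a minimal sketch of how the gradients can be inspected from the event handler, assuming the v2 Parameters.get_grad API is available during training; the parameter name "___fc_layer_0__.w0" is illustrative only:

    import numpy as np

    def event_handler(event):
        """Print the cost and the gradient magnitude of one parameter
        after every batch ('___fc_layer_0__.w0' is a hypothetical name)."""
        if isinstance(event, paddle.event.EndIteration):
            grad = parameters.get_grad("___fc_layer_0__.w0")
            print("pass %d, batch %d, cost %f, |grad| %f" %
                  (event.pass_id, event.batch_id, event.cost,
                   np.abs(grad).sum()))

With this handler, the |grad| column stays at 0.0 for every batch while the cost column looks reasonable.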

@guoshengCS
Contributor

I suspect paddle.layer.scaling here is not what you expect. The computation done by ScalingLayer is

    y.row[i] = w[i] * x.row[i]

where x is the input of size dataDim and w is a weight of size 1. If you need an element-wise multiplication, dotmul_operator, which computes

    out.row[i] += scale * (a.row[i] .* b.row[i])

is probably the operation you want.
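
For example, a minimal sketch of that replacement, assuming the v2 API convention that operators such as dotmul_operator must be wrapped in a paddle.layer.mixed layer (neg_log_probs and labels are the sequences from the snippet above):

    # Element-wise product of the two dense_vector(1) sequences,
    # followed by the same sum_cost as before.
    elementwise = paddle.layer.mixed(
        input=paddle.layer.dotmul_operator(
            a=neg_log_probs, b=labels, scale=1.0))
    cost = paddle.layer.sum_cost(input=elementwise)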

@liuyuuan
Contributor Author

@guoshengCS Thanks, but I tried dotmul_operator and the gradients are still all zero.

@lcy-seso
Contributor

Try switching to CPU.

@liuyuuan
Contributor Author

@lcy-seso Thanks, it works on CPU. Is the backward pass of sum_cost not implemented on GPU? Ultimately I'd still like to train on GPU; CPU is too slow.

@lcy-seso
Contributor

A related issue: #3714

@lcy-seso
Contributor

sum_cost can be used on GPU; for this problem please refer to #3714. I suggest tuning the hyperparameters until training is stable, then switching to GPU training.

@liuyuuan
Contributor Author

OK, thanks. I'll close this issue.
