Paddle job running on multiple GPUs in a GPU cluster reports "an illegal memory access was encountered" #3089

JayEworld · 2017-07-27T12:12:48Z

Pass 0, Batch 41, Cost 50.020385, {'__auc_evaluator_0__': 0.5005420446395874, 'classification_error_evaluator': 0.7870000004768372}
F0727 19:42:26.472332 5626 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
F0727 19:42:26.472338 5639 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
F0727 19:42:26.472342 5641 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
F0727 19:42:26.472340 5631 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
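
When several trainer threads abort at the same moment, glog can write their FATAL records onto a single line, which makes it hard to see how many threads failed. The following is a minimal sketch, not part of Paddle, that pulls the thread id and the CUDA status code out of such output. The names `FATAL_RE` and `parse_fatal_records` are hypothetical, and the pattern assumes the record layout seen in this log.

```python
import re

# One glog FATAL record from hl_cuda_device.cc, for example:
# F0727 19:42:26.472332 5626 hl_cuda_device.cc:565] Check failed:
#   cudaSuccess == cudaStat (0 vs. 77) Cuda Error: ...
FATAL_RE = re.compile(
    r"F(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<src>[\w.]+):(?P<line>\d+)\] "
    r"Check failed: cudaSuccess == cudaStat \(0 vs\. (?P<code>\d+)\)"
)


def parse_fatal_records(text):
    """Return (thread_id, cuda_status_code) for every FATAL record in text.

    finditer scans the whole string, so records that were fused onto
    one line are still found individually.
    """
    return [
        (int(m.group("thread")), int(m.group("code")))
        for m in FATAL_RE.finditer(text)
    ]


# Two records fused on one line, as in the log above.
log = (
    "F0727 19:42:26.472338 5639 hl_cuda_device.cc:565] Check failed: "
    "cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access "
    "was encounteredF0727 19:42:26.472342 5641 hl_cuda_device.cc:565] "
    "Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal "
    "memory access was encountered"
)
records = parse_fatal_records(log)
print(records)  # [(5639, 77), (5641, 77)]
```

Every record carries the same status code, and the message text is the CUDA error string for `cudaErrorIllegalAddress`. That is consistent with a single faulting kernel whose error is then observed by every thread sharing the device, though the log alone does not prove it.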

typhoonzero · 2017-07-27T12:14:44Z

Maybe related to: #1399

JayEworld · 2017-07-27T12:15:52Z

Additional information:

Version: paddle.v2; cluster: P40

import os
import paddle.v2 as paddle

paddle.init(use_gpu=True,
        trainer_count=8,
        port=int(os.getenv("PADDLE_PORT", "7164")),
        ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
        num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
        trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
        ports_num_for_sparse=int(os.getenv("PADDLE_PORTS_NUM_FOR_SPARSE", "1")),
        pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1"))
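
When debugging a cluster job it can help to check which values the environment lookups above actually resolve to before they reach `paddle.init`. The sketch below mirrors those lookups without importing Paddle. `read_cluster_settings` is a hypothetical helper, not a Paddle API; only the variable names and defaults come from the snippet above.

```python
import os


def read_cluster_settings(env=None):
    """Resolve the cluster settings the same way the init call does.

    env is any mapping of variable name to string; it defaults to the
    real process environment.
    """
    env = os.environ if env is None else env
    return {
        "port": int(env.get("PADDLE_PORT", "7164")),
        "ports_num": int(env.get("PADDLE_PORTS_NUM", "1")),
        "num_gradient_servers": int(env.get("PADDLE_NUM_GRADIENT_SERVERS", "1")),
        "trainer_id": int(env.get("PADDLE_TRAINER_ID", "0")),
        "ports_num_for_sparse": int(env.get("PADDLE_PORTS_NUM_FOR_SPARSE", "1")),
        "pservers": env.get("PADDLE_PSERVERS", "127.0.0.1"),
    }


# With no variables set, every setting falls back to its default.
settings = read_cluster_settings({})
print(settings)

# A scheduler-provided value overrides only the setting it names.
overridden = read_cluster_settings({"PADDLE_TRAINER_ID": "3"})
print(overridden["trainer_id"])  # 3
```

Printing these values at job start shows whether the scheduler injected the variables or the job silently fell back to a single local parameter server on 127.0.0.1.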

JayEworld · 2017-07-27T12:18:33Z

@typhoonzero I increased batch_size, but the same error is still reported. It doesn't seem to address the root cause.

Yancey1989 · 2017-07-27T12:31:11Z

@JayEworld Do you get the same error when using a single GPU?

Yancey1989 · 2017-07-28T05:38:39Z

I tried a single machine with two GPUs (P40) on PaddleCloud, and it reports the same error:

I0728 05:23:43.030844    19 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
I0728 05:25:51.030596    84 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
F0728 05:25:55.374768    75 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f7a78e4bb6d  google::LogMessage::Fail()
    @     0x7f7a78e4deb8  google::LogMessage::SendToLog()
    @     0x7f7a78e4b67b  google::LogMessage::Flush()
    @     0x7f7a78e4ed8e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f7a78df8467  hl_stream_synchronize()
    @     0x7f7a78e05d1f  hl_matrix_csr_mul_dense()
    @     0x7f7a78c11b59  paddle::GpuMatrix::mul()
    @     0x7f7a78c16939  paddle::GpuMatrix::mul()
    @     0x7f7a78a92fee  paddle::FullyConnectedLayer::forward()
    @     0x7f7a78b2874f  paddle::NeuralNetwork::forward()
    @     0x7f7a78b35b3c  paddle::TrainerThread::forward()
    @     0x7f7a78b394e8  paddle::TrainerThread::computeThread()
    @     0x7f7a77c43c80  (unknown)
    @     0x7f7a7be596ba  start_thread
    @     0x7f7a7bb8f3dd  clone
    @              (nil)  (unknown)
/usr/bin/paddle_k8s: line 31:    19 Aborted                 (core dumped) ${ENTRY}
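
The stack trace above is easier to read once the glog frames are skipped. The following is a minimal sketch, with hypothetical helper names `frame_symbols` and `first_application_frame`, that extracts the symbol from each `@ address symbol` line and returns the first frame that does not belong to the logging library. It assumes the frame layout shown in this trace.

```python
import re

# A frame looks like "    @     0x7f7a78df8467  hl_stream_synchronize()"
# or "    @              (nil)  (unknown)".
FRAME_RE = re.compile(r"@\s+(?:0x[0-9a-f]+|\(nil\))\s+(?P<symbol>.+?)\s*$")


def frame_symbols(trace):
    """Return the symbol of every stack frame, outermost call last."""
    symbols = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            symbols.append(match.group("symbol"))
    return symbols


def first_application_frame(symbols, skip_prefixes=("google::",)):
    """Return the first resolved frame that is not a logging frame."""
    for symbol in symbols:
        if symbol != "(unknown)" and not symbol.startswith(skip_prefixes):
            return symbol
    return None


# Excerpt of the trace above.
trace = """
    @     0x7f7a78e4bb6d  google::LogMessage::Fail()
    @     0x7f7a78e4ed8e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f7a78df8467  hl_stream_synchronize()
    @     0x7f7a78e05d1f  hl_matrix_csr_mul_dense()
    @     0x7f7a78c11b59  paddle::GpuMatrix::mul()
    @     0x7f7a78a92fee  paddle::FullyConnectedLayer::forward()
"""
symbols = frame_symbols(trace)
print(first_application_frame(symbols))  # hl_stream_synchronize()
```

The check fails during stream synchronization, and the caller is `hl_matrix_csr_mul_dense()` inside `paddle::FullyConnectedLayer::forward()`. That points at a sparse (CSR) matrix multiplied by a dense matrix on the GPU as the operation in flight, which is a reading of the trace rather than a confirmed root cause.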

typhoonzero · 2017-07-28T08:11:34Z

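Could you test a single machine with multiple GPUs, running the sparse embedding layer on the CPU and the other layers on the GPU? That requires the following configuration changes:

Add to the init call: paddle.init(parallel_nn=1, ...)
For each embedding_layer whose input data_layer is sparse, add layer_attr=paddle.attr.Extra(device=-1), for example: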

emb1_1 = paddle.layer.embedding(input=data1, size=128, param_attr=paddle.attr.Param(
                                 initial_std=emb_layer_init_std[0]),
                                 layer_attr=paddle.attr.Extra(device=-1))

JayEworld · 2017-08-01T11:09:01Z

@typhoonzero When running in mixed GPU+CPU mode, GPU utilization stays at 0. Could you look into why?

Yancey1989 · 2017-08-01T23:07:39Z

@JayEworld We are looking into the relevant code, but it may take some time.

typhoonzero · 2017-11-10T02:30:32Z

Duplicate: #3040, please refer to that issue instead.
Closing this due to low activity, feel free to reopen.

Yancey1989 added the "User 用于标记用户问题" (marks user questions) label Jul 28, 2017

typhoonzero mentioned this issue Jul 28, 2017

CRF layer core dumps when training on GPU in the SRL task #3091

Closed

typhoonzero closed this as completed Nov 10, 2017
