Paddle job running on multiple GPUs in a GPU cluster reports "an illegal memory access was encountered" #3089

Closed
JayEworld opened this issue Jul 27, 2017 · 9 comments
Labels: User (used to tag user questions)

@JayEworld

JayEworld commented Jul 27, 2017

Pass 0, Batch 41, Cost 50.020385, {'__auc_evaluator_0__': 0.5005420446395874, 'classification_error_evaluator': 0.7870000004768372} 
F0727 19:42:26.472332 5626 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered 
*** Check failure stack trace: *** 
F0727 19:42:26.472338 5639 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
F0727 19:42:26.472342 5641 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
F0727 19:42:26.472340 5631 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: *** 
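
Several trainer threads wrote their fatal records at the same moment, so the records can end up fused on one line. A minimal sketch for pulling the thread id and CUDA error code out of such output is shown here. The regular expression and the helper name `parse_cuda_failures` are assumptions made for illustration and are not part of Paddle. The value 77 is simply what the log prints; in CUDA runtime versions of that period it should correspond to `cudaErrorIllegalAddress`.

```python
import re

# Matches one glog fatal record of the form seen in this issue.
# The lookahead ends the message either at the start of the next
# fused record ("Fmmdd hh:") or at the end of the line.
FATAL_CUDA = re.compile(
    r"F(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) "
    r"(?P<file>[\w.]+):(?P<line>\d+)\] Check failed: cudaSuccess == cudaStat "
    r"\(0 vs\. (?P<code>\d+)\) Cuda Error: "
    r"(?P<message>.*?)(?=F\d{4} \d{2}:|\s*$)",
    re.MULTILINE,
)


def parse_cuda_failures(text):
    """Return one dict per fatal CUDA record found in `text`."""
    return [
        {
            "thread": int(m.group("thread")),
            "code": int(m.group("code")),
            "message": m.group("message"),
        }
        for m in FATAL_CUDA.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "F0727 19:42:26.472338 5639 hl_cuda_device.cc:565] Check failed: "
        "cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory "
        "access was encountered"
        "F0727 19:42:26.472342 5641 hl_cuda_device.cc:565] Check failed: "
        "cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory "
        "access was encountered "
    )
    for record in parse_cuda_failures(sample):
        print(record)
```

Counting the distinct thread ids gives a quick view of how many trainer threads hit the error in the same batch.
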
@typhoonzero
Contributor

Maybe related to: #1399

@JayEworld
Author

JayEworld commented Jul 27, 2017

Additional information:

Version: paddle.v2. Cluster: P40.

paddle.init(use_gpu=True,
        trainer_count=8,
        port=int(os.getenv("PADDLE_PORT", "7164")),
        ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
        num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
        trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
        ports_num_for_sparse=int(os.getenv("PADDLE_PORTS_NUM_FOR_SPARSE", "1")),
        pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1"))
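
When comparing runs (single GPU versus multiple GPUs, for example), it can help to collect the environment-driven settings into a dict first, so they can be printed before they reach `paddle.init`. This is only a sketch under assumptions: the helper name `paddle_init_kwargs` is hypothetical, and the keys and defaults are copied from the call above rather than taken from any Paddle documentation.

```python
import os


def paddle_init_kwargs(env=None, trainer_count=8, use_gpu=True):
    """Build the keyword arguments used in the paddle.init call above.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    if env is None:
        env = os.environ
    return {
        "use_gpu": use_gpu,
        "trainer_count": trainer_count,
        "port": int(env.get("PADDLE_PORT", "7164")),
        "ports_num": int(env.get("PADDLE_PORTS_NUM", "1")),
        "num_gradient_servers": int(env.get("PADDLE_NUM_GRADIENT_SERVERS", "1")),
        "trainer_id": int(env.get("PADDLE_TRAINER_ID", "0")),
        "ports_num_for_sparse": int(env.get("PADDLE_PORTS_NUM_FOR_SPARSE", "1")),
        "pservers": env.get("PADDLE_PSERVERS", "127.0.0.1"),
    }


if __name__ == "__main__":
    kwargs = paddle_init_kwargs()
    print(kwargs)  # record the effective settings alongside the training log
    # In the training script this would then be passed on as:
    #   paddle.init(**kwargs)
```

Setting `trainer_count=1` through the same helper gives a single-GPU run with otherwise identical settings.
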

@JayEworld
Author

@typhoonzero I increased batch_size, but the same error is still reported. It does not seem to solve the underlying problem.

@Yancey1989
Contributor

@JayEworld Does the same error also occur with a single GPU?

@Yancey1989 Yancey1989 added the User (used to tag user questions) label Jul 28, 2017
@Yancey1989
Contributor

Yancey1989 commented Jul 28, 2017

I tried a single machine with two GPUs (P40) on PaddleCloud, and the same error is reported:

I0728 05:23:43.030844    19 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
I0728 05:25:51.030596    84 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
F0728 05:25:55.374768    75 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f7a78e4bb6d  google::LogMessage::Fail()
    @     0x7f7a78e4deb8  google::LogMessage::SendToLog()
    @     0x7f7a78e4b67b  google::LogMessage::Flush()
    @     0x7f7a78e4ed8e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f7a78df8467  hl_stream_synchronize()
    @     0x7f7a78e05d1f  hl_matrix_csr_mul_dense()
    @     0x7f7a78c11b59  paddle::GpuMatrix::mul()
    @     0x7f7a78c16939  paddle::GpuMatrix::mul()
    @     0x7f7a78a92fee  paddle::FullyConnectedLayer::forward()
    @     0x7f7a78b2874f  paddle::NeuralNetwork::forward()
    @     0x7f7a78b35b3c  paddle::TrainerThread::forward()
    @     0x7f7a78b394e8  paddle::TrainerThread::computeThread()
    @     0x7f7a77c43c80  (unknown)
    @     0x7f7a7be596ba  start_thread
    @     0x7f7a7bb8f3dd  clone
    @              (nil)  (unknown)
/usr/bin/paddle_k8s: line 31:    19 Aborted                 (core dumped) ${ENTRY}
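
The trace above ends in `hl_matrix_csr_mul_dense`, called from `FullyConnectedLayer::forward`, which suggests a sparse (CSR) times dense product. As a debugging aid only, and not a diagnosis confirmed in this thread, the sketch below is a plain-Python CPU reference for that kind of product with explicit bounds checks. The function name `csr_mul_dense_checked` and its argument layout are hypothetical and do not mirror Paddle's internal API. Running a suspect batch through a check like this would at least rule out column indices that fall outside the declared input dimension.

```python
def csr_mul_dense_checked(values, col_indices, row_offsets, dense, num_cols):
    """Multiply a CSR sparse matrix by a dense matrix (list of rows).

    Raises instead of reading out of bounds, which is what a GPU kernel
    given the same bad index could do silently or fatally.
    """
    if len(dense) != num_cols:
        raise ValueError(
            "dense has %d rows, expected num_cols=%d" % (len(dense), num_cols)
        )
    if len(values) != len(col_indices):
        raise ValueError("values and col_indices differ in length")
    out_width = len(dense[0]) if dense else 0
    result = []
    for r in range(len(row_offsets) - 1):
        start, end = row_offsets[r], row_offsets[r + 1]
        if not 0 <= start <= end <= len(values):
            raise IndexError("row %d has invalid offsets (%d, %d)" % (r, start, end))
        row = [0.0] * out_width
        for k in range(start, end):
            c = col_indices[k]
            if not 0 <= c < num_cols:
                raise IndexError(
                    "row %d: column index %d outside [0, %d)" % (r, c, num_cols)
                )
            for j in range(out_width):
                row[j] += values[k] * dense[c][j]
        result.append(row)
    return result


if __name__ == "__main__":
    dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Two sparse rows: row 0 has entries in columns 0 and 2, row 1 in column 1.
    print(csr_mul_dense_checked([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], dense, 3))
```
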

@typhoonzero
Contributor

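Could you test a single machine with multiple GPUs, placing the sparse embedding layer on the CPU and the other layers on the GPU? The following configuration needs to be added:

  1. Add the setting: paddle.init(parallel_nn=1, ...)
  2. For each embedding_layer whose data_layer is sparse, add layer_attr=paddle.attr.Extra(device=-1), for example: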
emb1_1 = paddle.layer.embedding(input=data1, size=128, param_attr=paddle.attr.Param(
                                 initial_std=emb_layer_init_std[0]),
                                 layer_attr=paddle.attr.Extra(device=-1))

@JayEworld
Author

@typhoonzero When running in mixed GPU+CPU mode, GPU usage shows 0. Could you look into the cause?

@Yancey1989
Contributor

@JayEworld We are looking into the relevant code, but it may take some time.

@typhoonzero
Contributor

Duplicate: #3040, please refer to that issue instead.
Closing this due to low activity, feel free to reopen.
