Paddle job running on multiple GPUs in a GPU cluster reports "an illegal memory access was encountered" #3089
Labels
User
Used to mark user questions
Comments
Maybe related to: #1399 |
Additional information: version: paddle.v2; cluster: P40
paddle.init(use_gpu=True,
trainer_count=8,
port=int(os.getenv("PADDLE_PORT", "7164")),
ports_num=int(os.getenv("PADDLE_PORTS_NUM", "1")),
num_gradient_servers=int(os.getenv("PADDLE_NUM_GRADIENT_SERVERS", "1")),
trainer_id=int(os.getenv("PADDLE_TRAINER_ID", "0")),
ports_num_for_sparse=int(os.getenv("PADDLE_PORTS_NUM_FOR_SPARSE", "1")),
pservers=os.getenv("PADDLE_PSERVERS", "127.0.0.1")) |
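Since most arguments in the call above come from environment variables, it can help to print the resolved values on each trainer before initialization. A minimal sketch, assuming only the variable names and defaults shown in the call; the helper `read_paddle_env` and the table `PADDLE_ENV_DEFAULTS` are hypothetical and not part of Paddle:

```python
import os

# Keyword name -> (environment variable, default), copied from the call above.
# This table and read_paddle_env are illustrative helpers, not Paddle APIs.
PADDLE_ENV_DEFAULTS = {
    "port": ("PADDLE_PORT", "7164"),
    "ports_num": ("PADDLE_PORTS_NUM", "1"),
    "num_gradient_servers": ("PADDLE_NUM_GRADIENT_SERVERS", "1"),
    "trainer_id": ("PADDLE_TRAINER_ID", "0"),
    "ports_num_for_sparse": ("PADDLE_PORTS_NUM_FOR_SPARSE", "1"),
    "pservers": ("PADDLE_PSERVERS", "127.0.0.1"),
}


def read_paddle_env(environ=None):
    """Resolve the settings from the environment, falling back to defaults."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for key, (var, default) in PADDLE_ENV_DEFAULTS.items():
        raw = environ.get(var, default)
        # pservers is an address string; every other value is an integer.
        resolved[key] = raw if key == "pservers" else int(raw)
    return resolved


if __name__ == "__main__":
    for key, value in sorted(read_paddle_env().items()):
        print("%s=%r" % (key, value))
```

The returned dictionary uses the same keyword names as the call above, so it could be logged first and then passed along as keyword arguments.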
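@typhoonzero I increased batch_size, but the same error is still reported; this does not seem to solve the underlying problem. |
@JayEworld Does the same error occur when using a single GPU? |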
I tried a single machine with two GPUs (P40) on PaddleCloud, and the same error is reported:
I0728 05:23:43.030844 19 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
I0728 05:25:51.030596 84 ParameterClient2.cpp:114] pserver 0 10.1.14.13:7164
F0728 05:25:55.374768 75 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f7a78e4bb6d google::LogMessage::Fail()
@ 0x7f7a78e4deb8 google::LogMessage::SendToLog()
@ 0x7f7a78e4b67b google::LogMessage::Flush()
@ 0x7f7a78e4ed8e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f7a78df8467 hl_stream_synchronize()
@ 0x7f7a78e05d1f hl_matrix_csr_mul_dense()
@ 0x7f7a78c11b59 paddle::GpuMatrix::mul()
@ 0x7f7a78c16939 paddle::GpuMatrix::mul()
@ 0x7f7a78a92fee paddle::FullyConnectedLayer::forward()
@ 0x7f7a78b2874f paddle::NeuralNetwork::forward()
@ 0x7f7a78b35b3c paddle::TrainerThread::forward()
@ 0x7f7a78b394e8 paddle::TrainerThread::computeThread()
@ 0x7f7a77c43c80 (unknown)
@ 0x7f7a7be596ba start_thread
@ 0x7f7a7bb8f3dd clone
@ (nil) (unknown)
/usr/bin/paddle_k8s: line 31: 19 Aborted (core dumped) ${ENTRY} |
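When comparing logs from several runs, the fatal line above can be picked out mechanically, since it carries the source location and the numeric CUDA status next to the message. A minimal sketch, assuming the line format is exactly the one shown in this log; the pattern and the function name `parse_cuda_check_failure` are hypothetical:

```python
import re

# Matches only the "Check failed" format seen in the log above (an assumption).
CUDA_CHECK_RE = re.compile(
    r"(?P<source>[\w.]+):(?P<line>\d+)\] "
    r"Check failed: cudaSuccess == cudaStat "
    r"\((?P<expected>\d+) vs\. (?P<actual>\d+)\) "
    r"Cuda Error: (?P<message>.*)"
)


def parse_cuda_check_failure(log_line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = CUDA_CHECK_RE.search(log_line)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "line": int(match.group("line")),
        "expected": int(match.group("expected")),
        "actual": int(match.group("actual")),
        "message": match.group("message").strip(),
    }


if __name__ == "__main__":
    sample = (
        "F0728 05:25:55.374768 75 hl_cuda_device.cc:565] Check failed: "
        "cudaSuccess == cudaStat (0 vs. 77) Cuda Error: "
        "an illegal memory access was encountered"
    )
    print(parse_cuda_check_failure(sample))
```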
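Could you test on a single machine with multiple GPUs, running the sparse embedding layer on CPU and the other layers on GPU? This requires adding the following configuration: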
|
@typhoonzero When running in mixed GPU+CPU mode, GPU utilization is 0. Could you look into the cause? |
@JayEworld I am looking into the relevant code, but it may take some time. |
Duplicate: #3040, please refer to that issue instead. |