Segmentation fault when training with RecordIO and ParallelExecutor #13809
Comments
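Do you have code that reproduces this?
I'll post it in a moment. I'm training on a single machine with a single GPU.
What is your runtime environment? @zzhzz
@chengduoZH CentOS release 6.9 (Final) with paddlepaddle-gpu 0.14.0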
*** Aborted at 1539834486 (unix time) try "date -d @1539834486" if you are using GNU date *** @chengduoZH
The core dump information is as follows:
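@zzhzz Does the error occur after several iterations, or on the very first iteration?
@chengduoZH The error occurs during the first round of training.
@zzhzz So it works fine if you don't use RecordIO, is that right?
@chengduoZH I haven't tested that. But it does run if I use neither ParallelExecutor nor RecordIO.
Please first check whether the problem still occurs without RecordIO.
@chengduoZH You mean using the py_reader interface, right? I'll try it.
You can also simply use the feed approach.
@chengduoZH Using py_reader + ParallelExecutor does not cause the problem.
@zzhzz Have you found the cause of this problem?
Not yet. My GPU is currently needed for training jobs, so I don't have spare resources for debugging right now.
OK, I'll follow up on this.
Paddle no longer maintains RecordIO; we recommend reading data with py_reader instead. Using RecordIO requires users to convert their data to the RecordIO format, and for most models the benefit of RecordIO is small.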
While using RecordIO and ParallelExecutor to speed up training, a segmentation fault occurred. The error message is as follows:
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)
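The "Aborted at" line carries a Unix timestamp and suggests `date -d @...` to decode it. For readers without GNU date, a minimal Python sketch of the same conversion follows. The helper name `glog_abort_time` is hypothetical, and the output is formatted in UTC as an assumption about what is most useful for comparison with the log.

```python
from datetime import datetime, timezone


def glog_abort_time(epoch_seconds: int) -> str:
    """Format the Unix timestamp from an 'Aborted at' line as a UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Timestamp taken from the error message in this issue.
print(glog_abort_time(1539160971))  # 2018-10-10 08:42:51
```

The decoded time, 08:42:51 on October 10, coincides with the `I1010 08:42:51` prefixes of the last log lines before the crash, which suggests (but does not prove) that the host logs in UTC.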
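The neural network is a word-embedding model. Paddle's log output was enabled by setting an environment variable; the portion of the log just before the error is as follows: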
I1010 08:42:51.656551 51287 operator.cc:130] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float[173]({ })], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.656599 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.656657 51287 operator.cc:142] CUDAPlace(0) Op(adam), inputs:{Beta1Pow[beta1_pow_acc_3:float1], Beta2Pow[beta2_pow_acc_3:float1], Grad[fc_1.b_0@GRAD:float173], LearningRate[learning_rate_0:float1], Moment1[moment1_3:float[173]({ })], Moment2[moment2_3:float173], Param[fc_1.b_0:float173]}, outputs:{Moment1Out[moment1_3173], Moment2Out[moment2_3173], ParamOut[fc_1.b_0173]}.
I1010 08:42:51.660423 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660465 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660521 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta2_pow_acc_3:float1]}, outputs:{Out[beta2_pow_acc_31]}.
I1010 08:42:51.660552 51287 operator.cc:130] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.660575 51287 operator.cc:663] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1010 08:42:51.660604 51287 operator.cc:142] CUDAPlace(0) Op(scale), inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.
I1010 08:42:51.663774 51288 tensor_util.cu:107] TensorCopySync 1 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.700305 51288 tensor_util.cu:25] TensorCopy 1 from CPUPlace to CPUPlace
I1010 08:42:51.700296 51286 tensor_util.cu:107] TensorCopySync 21639, 200 from CUDAPlace(0) to CPUPlace
I1010 08:42:51.703213 51286 tensor_util.cu:25] TensorCopy 21639, 200 from CPUPlace to CPUPlace
*** Aborted at 1539160971 (unix time) try "date -d @1539160971" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***
@ 0x7f305bb7e7e0 (unknown)
@ 0x7f3000000002 (unknown)
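Since the stack trace contains no symbols, one way to narrow things down is to see what each logging thread did last before the abort. The following is a rough sketch that parses the glog prefix (severity, MMDD, time, thread id, source file and line) and records the last source location per thread. The names `GLOG_LINE`, `OP_NAME`, and `last_activity` are hypothetical, and the regular expression is an assumption based only on the lines shown in this issue.

```python
import re
from collections import Counter

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
# An optional leading number is tolerated in case the log was pasted with
# editor line numbers.
GLOG_LINE = re.compile(
    r"^(?:\d+\s+)?(?P<sev>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) +(?P<tid>\d+) "
    r"(?P<src>[\w./]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
OP_NAME = re.compile(r"Op\((\w+)\)")


def last_activity(lines):
    """Return (last source location per thread id, operator name counts)."""
    last_by_thread = {}
    ops = Counter()
    for raw in lines:
        match = GLOG_LINE.match(raw.strip())
        if match is None:
            continue  # skip non-glog lines such as the SIGSEGV banner
        last_by_thread[match.group("tid")] = (
            match.group("src") + ":" + match.group("line")
        )
        op = OP_NAME.search(match.group("msg"))
        if op:
            ops[op.group(1)] += 1
    return last_by_thread, ops


# Lines copied from the log in this issue.
sample = [
    "I1010 08:42:51.660604 51287 operator.cc:142] CUDAPlace(0) Op(scale), "
    "inputs:{X[beta1_pow_acc_3:float1]}, outputs:{Out[beta1_pow_acc_31]}.",
    "I1010 08:42:51.700305 51288 tensor_util.cu:25] TensorCopy 1 from CPUPlace to CPUPlace",
    "I1010 08:42:51.703213 51286 tensor_util.cu:25] TensorCopy 21639, 200 from CPUPlace to CPUPlace",
    "*** SIGSEGV (@0x7f3000000002) received by PID 51269 (TID 0x7f305c3ac700) from PID 2; stack trace: ***",
]
threads, op_counts = last_activity(sample)
print(threads)
print(op_counts)
```

On these lines, threads 51286 and 51288 were last seen in `tensor_util.cu` copying tensors to the CPU. Note that the TID in the SIGSEGV banner appears to be a different kind of identifier from the thread id in the log prefix, so the crashing thread cannot be matched to a log thread from this output alone.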