Cannot run stably for 20 hours: Paddle Inference C++ multi-threaded prediction frequently crashes with GPU-related errors #44323
Comments
Hi! We've received your issue; please be patient while we arrange technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also consult the official API docs, FAQ, historical issues, and the AI community for answers. Have a nice day!
This error cannot be reproduced under light load. Under heavy load (multiple models, hundreds of predictors running in multiple threads) it always occurs.
The error is not always the same. Sometimes it looks like this:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ZeroCopyRun()

Error Message Summary:
ExternalError: CUDA error(1), invalid argument.
The options used are as follows (see the illustrative sketch below):
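The actual option listing is not preserved in the thread. Purely as an illustration (model paths and memory pool size are placeholders, not the author's real values), a GPU configuration with memory optimization enabled, which the author confirms later in the thread, might look like this:

```cpp
#include "paddle_inference_api.h"

paddle_infer::Config MakeConfig() {
  paddle_infer::Config config;
  // Placeholder model files; the real paths are not shown in the issue.
  config.SetModel("model.pdmodel", "model.pdiparams");
  // GPU inference: initial memory pool of 500 MB on device 0 (placeholder values).
  config.EnableUseGpu(500, 0);
  // GPU memory reuse; the author confirms this option is enabled.
  config.EnableMemoryOptim();
  return config;
}
```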
Another error that appears:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ClearIntermediateTensor()

Error Message Summary:
NotFoundError: The memory block is not found in cache ...............

Exiting after fatal event (FATAL_SIGNAL). Fatal type: SIGABRT
A question: in the multi-threaded case, does the error only occur when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur if ClearIntermediateTensor and TryShrinkMemory are not called?
If we don't call them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough. The problem remains.
There is an interface for GPU memory reuse; you could try whether it helps with your model.
Let me first see whether I can reproduce the problem locally.
config.EnableMemoryOptim() has always been set (see above). Regarding the locking you mentioned: it only needs to protect the call to paddle::memory::Release(place_), right? I modified the Paddle code in paddle\fluid\inference\api\analysis_predictor.cc by wrapping that call with std::lock_guard<std::mutex> lk(memrel_mutex_);.
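For readability, the described modification would look roughly like this; the lock_guard line and mutex name come from the comment above, while the surrounding function body and the declaration of memrel_mutex_ as a member are assumptions based on the Paddle 2.2 source (where TryShrinkMemory calls paddle::memory::Release(place_)):

```cpp
// Sketch of the described change in paddle/fluid/inference/api/analysis_predictor.cc.
// memrel_mutex_ is assumed to be declared as a std::mutex member of AnalysisPredictor.
bool AnalysisPredictor::TryShrinkMemory() {
  ClearIntermediateTensor();
  std::lock_guard<std::mutex> lk(memrel_mutex_);  // serialize the release across threads
  return paddle::memory::Release(place_);
}
```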
In the application, a lock is also taken every time TryShrinkMemory() is called; a sketch of this pattern is shown below.
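The application-side snippet itself did not survive in the thread; a minimal sketch of the described pattern, assuming a process-wide mutex named g_shrink_mutex and a helper RunOnce (both hypothetical names), could be:

```cpp
#include <mutex>
#include "paddle_inference_api.h"

// Hypothetical process-wide mutex shared by all worker threads.
static std::mutex g_shrink_mutex;

void RunOnce(paddle_infer::Predictor* predictor) {
  // ... feed inputs ...
  predictor->Run();
  // ... fetch outputs ...
  predictor->ClearIntermediateTensor();
  {
    // Serialize the shrink call across threads, as described in the comment.
    std::lock_guard<std::mutex> lock(g_shrink_mutex);
    predictor->TryShrinkMemory();
  }
}
```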
But the same problem persists. Log:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ZeroCopyRun()

Error Message Summary:
NotFoundError: The memory block is not found in cache
Sometimes it is a SEGV error:
Could you share a reproduction environment? Running the demo model locally, I cannot reproduce this result.
Please give me a moment to put together a reproduction program.
I hit this problem too. I modified cpp_infer directly, and multi-threaded continuous GPU prediction fails: on a desktop (CUDA 11.6, cuDNN 8.4, TensorRT 8.4.15, NVIDIA A10) it takes 1-4 hours of continuous prediction to reproduce, while on a laptop (CUDA 10.1, cuDNN 7.6.5, no TensorRT, NVIDIA GTX 1660 Ti) it reproduces within ten-odd seconds. Operating system: Paddle Inference version 2.3.2 is used. At first I thought it was a problem in my code, but continuous CPU prediction never fails; only GPU prediction does. I then suspected a mismatch between the CUDA version and the GPU driver, but they actually correspond (verified via Control Panel - System Information - [Display | Components]). Full code: a directly runnable DLL for reproduction, together with a ready-made Python test script, is available:
I have opened a new issue in PaddleOCR; the problem is still unsolved: #7757
Since you haven't replied for more than a year, we have closed this issue/pr. |
Describe the Bug
Multi-threaded inference is run with the Paddle Inference C++ API. Following the sample, each thread has its own predictor, and to save GPU memory, ClearIntermediateTensor and TryShrinkMemory are called after every predictor.Run() (related: #43346). A minimal sketch of this pattern is shown below.
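In this sketch the model paths, input name/shape, thread count, and iteration count are placeholders rather than values from the actual setup; it only illustrates the one-predictor-per-thread pattern with the memory-saving calls described above:

```cpp
#include <thread>
#include <vector>
#include "paddle_inference_api.h"

// Each worker builds its own config and predictor, then runs repeatedly,
// calling ClearIntermediateTensor/TryShrinkMemory after every Run.
void Worker(int device_id, int iterations) {
  paddle_infer::Config config;
  config.SetModel("model.pdmodel", "model.pdiparams");  // placeholder paths
  config.EnableUseGpu(500, device_id);                  // placeholder pool size
  config.EnableMemoryOptim();
  auto predictor = paddle_infer::CreatePredictor(config);

  std::vector<float> input(1 * 3 * 224 * 224, 0.f);     // placeholder input
  for (int i = 0; i < iterations; ++i) {
    auto in = predictor->GetInputHandle(predictor->GetInputNames()[0]);
    in->Reshape({1, 3, 224, 224});
    in->CopyFromCpu(input.data());
    predictor->Run();
    // Memory-saving calls described in the issue (related #43346).
    predictor->ClearIntermediateTensor();
    predictor->TryShrinkMemory();
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 8; ++t) threads.emplace_back(Worker, 0, 1000);
  for (auto& th : threads) th.join();
  return 0;
}
```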
Various abnormal exits frequently occur while running:
Paddle and related versions:
cuda:10.2
cudnn:7.6.5.32
paddle:2.2.2
Brief error message and stack trace:
Error Message Summary:
ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804
The detailed error message is given below.
Additional Supplementary Information
Error message and stack trace:
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():
Compile Traceback (most recent call last):
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\Scripts\x2paddle-script.py", line 33, in
sys.exit(load_entry_point('x2paddle==1.3.5', 'console_scripts', 'x2paddle')())
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 373, in main
lite_model_type=args.lite_model_type)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 234, in onnx2paddle
mapper.paddle_graph.gen_model(save_dir)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 296, in gen_model
self.dygraph2static(save_dir, input_shapes, input_types)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 580, in dygraph2static
osp.join(save_dir, "inference_model/model"))
File "", line 2, in save
C++ Traceback (most recent call last):
0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 void paddle::platform::CudnnWorkspaceHandle::RunFunc<paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&>(paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&, unsigned long)
8 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
9 paddle::platform::GetCurrentTraceBackString[abi:cxx11]
Error Message Summary:
ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804
***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6) PID: 30571
***** SIGNAL SIGABRT(6)
******* STACKDUMP *******
stack dump [1] /usr/local/lib/libg3log.so.2.1.0-0+0x1465a [0x7fa3a64e865a]
stack dump [2] /lib/x86_64-linux-gnu/libpthread.so.0+0x12980 [0x7fa3c173c980]
stack dump [3] /lib/x86_64-linux-gnu/libc.so.6gsignal+0xc7 [0x7fa3a5b80e87]
stack dump [4] /lib/x86_64-linux-gnu/libc.so.6abort+0x141 [0x7fa3a5b827f1]
stack dump [5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8c957 [0x7fa3a61d7957]
stack dump [6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92ae6 [0x7fa3a61ddae6]
stack dump [7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92b21 [0x7fa3a61ddb21]
stack dump [8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92d54 [0x7fa3a61ddd54]
stack dump [9] /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ebe224 [0x7fa3ac59d224]