Cannot run stably for 20 hours: Paddle Inference C++ multi-threaded prediction frequently crashes with GPU-related errors #44323
Comments
Hi! We've received your issue; please be patient while we arrange technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also consult the official API docs, FAQ, historical issues, and the AI community for answers. Have a nice day!
This error cannot be reproduced under light load. Under heavy load (multiple models, hundreds of predictors running in multiple threads) it always occurs.
The error is not always the same. Sometimes it looks like this:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ZeroCopyRun()

Error Message Summary:
ExternalError: CUDA error(1), invalid argument.
The options used are as follows (see the illustrative sketch below):
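The actual option listing is not preserved in the thread. Purely as an illustration (model paths and memory pool size are placeholders, not the author's real values), a GPU configuration with memory optimization enabled, which the author confirms later in the thread, might look like this:

```cpp
#include "paddle_inference_api.h"

paddle_infer::Config MakeConfig() {
  paddle_infer::Config config;
  // Placeholder model files; the real paths are not shown in the issue.
  config.SetModel("model.pdmodel", "model.pdiparams");
  // GPU inference: initial memory pool of 500 MB on device 0 (placeholder values).
  config.EnableUseGpu(500, 0);
  // GPU memory reuse; the author confirms this option is enabled.
  config.EnableMemoryOptim();
  return config;
}
```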
Another error that appears:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ClearIntermediateTensor()

Error Message Summary:
NotFoundError: The memory block is not found in cache ...............

Exiting after fatal event (FATAL_SIGNAL). Fatal type: SIGABRT
A question: in the multi-threaded case, does the error only occur when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur if ClearIntermediateTensor and TryShrinkMemory are not called?
If we don't call them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough. The problem remains.
There is an interface for GPU memory reuse; you could try whether it helps with your model.
Let me first see whether I can reproduce the problem locally.
config.EnableMemoryOptim() has always been set (see above). Regarding the locking you mentioned: it only needs to protect the call to paddle::memory::Release(place_), right? I modified the Paddle code in paddle\fluid\inference\api\analysis_predictor.cc by wrapping that call with std::lock_guard<std::mutex> lk(memrel_mutex_);.
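For readability, the described modification would look roughly like this; the lock_guard line and mutex name come from the comment above, while the surrounding function body and the declaration of memrel_mutex_ as a member are assumptions based on the Paddle 2.2 source (where TryShrinkMemory calls paddle::memory::Release(place_)):

```cpp
// Sketch of the described change in paddle/fluid/inference/api/analysis_predictor.cc.
// memrel_mutex_ is assumed to be declared as a std::mutex member of AnalysisPredictor.
bool AnalysisPredictor::TryShrinkMemory() {
  ClearIntermediateTensor();
  std::lock_guard<std::mutex> lk(memrel_mutex_);  // serialize the release across threads
  return paddle::memory::Release(place_);
}
```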
In the application, a lock is also taken every time TryShrinkMemory() is called; a sketch of this pattern is shown below.
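The application-side snippet itself did not survive in the thread; a minimal sketch of the described pattern, assuming a process-wide mutex named g_shrink_mutex and a helper RunOnce (both hypothetical names), could be:

```cpp
#include <mutex>
#include "paddle_inference_api.h"

// Hypothetical process-wide mutex shared by all worker threads.
static std::mutex g_shrink_mutex;

void RunOnce(paddle_infer::Predictor* predictor) {
  // ... feed inputs ...
  predictor->Run();
  // ... fetch outputs ...
  predictor->ClearIntermediateTensor();
  {
    // Serialize the shrink call across threads, as described in the comment.
    std::lock_guard<std::mutex> lock(g_shrink_mutex);
    predictor->TryShrinkMemory();
  }
}
```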
But the same problem persists. Log:

C++ Traceback (most recent call last):
0   paddle::AnalysisPredictor::ZeroCopyRun()

Error Message Summary:
NotFoundError: The memory block is not found in cache
Sometimes it is a SEGV error:
Could you share a reproduction environment? Running the demo model locally, I cannot reproduce this result.
Please give me a moment to put together a reproduction program.
I hit this problem too. I modified cpp_infer directly, and multi-threaded continuous GPU prediction fails: on a desktop (CUDA 11.6, cuDNN 8.4, TensorRT 8.4.15, NVIDIA A10) it takes 1-4 hours of continuous prediction to reproduce, while on a laptop (CUDA 10.1, cuDNN 7.6.5, no TensorRT, NVIDIA GTX 1660 Ti) it reproduces within ten-odd seconds. Operating system: Paddle Inference version 2.3.2 is used. At first I thought it was a problem in my code, but continuous CPU prediction never fails; only GPU prediction does. I then suspected a mismatch between the CUDA version and the GPU driver, but they actually correspond (verified via Control Panel - System Information - [Display | Components]). Full code: a directly runnable DLL for reproduction, together with a ready-made Python test script, is available:
I have opened a new issue in PaddleOCR; the problem is still unsolved: #7757
Since you haven't replied for more than a year, we have closed this issue/pr. |
Describe the Bug
Multi-threaded inference is run with the Paddle Inference C++ API. Following the sample, each thread has its own predictor, and to save GPU memory, ClearIntermediateTensor and TryShrinkMemory are called after every predictor.Run() (related: #43346). A minimal sketch of this pattern is shown below.
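In this sketch the model paths, input name/shape, thread count, and iteration count are placeholders rather than values from the actual setup; it only illustrates the one-predictor-per-thread pattern with the memory-saving calls described above:

```cpp
#include <thread>
#include <vector>
#include "paddle_inference_api.h"

// Each worker builds its own config and predictor, then runs repeatedly,
// calling ClearIntermediateTensor/TryShrinkMemory after every Run.
void Worker(int device_id, int iterations) {
  paddle_infer::Config config;
  config.SetModel("model.pdmodel", "model.pdiparams");  // placeholder paths
  config.EnableUseGpu(500, device_id);                  // placeholder pool size
  config.EnableMemoryOptim();
  auto predictor = paddle_infer::CreatePredictor(config);

  std::vector<float> input(1 * 3 * 224 * 224, 0.f);     // placeholder input
  for (int i = 0; i < iterations; ++i) {
    auto in = predictor->GetInputHandle(predictor->GetInputNames()[0]);
    in->Reshape({1, 3, 224, 224});
    in->CopyFromCpu(input.data());
    predictor->Run();
    // Memory-saving calls described in the issue (related #43346).
    predictor->ClearIntermediateTensor();
    predictor->TryShrinkMemory();
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 8; ++t) threads.emplace_back(Worker, 0, 1000);
  for (auto& th : threads) th.join();
  return 0;
}
```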
Various abnormal exits frequently occur while running:
Paddle and related versions:
cuda:10.2
cudnn:7.6.5.32
paddle:2.2.2
Brief error message and stack trace:
Error Message Summary:
ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804
The detailed error message is given below.
Additional Supplementary Information
Error message and stack trace:
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():
Compile Traceback (most recent call last):
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\Scripts\x2paddle-script.py", line 33, in
sys.exit(load_entry_point('x2paddle==1.3.5', 'console_scripts', 'x2paddle')())
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 373, in main
lite_model_type=args.lite_model_type)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 234, in onnx2paddle
mapper.paddle_graph.gen_model(save_dir)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 296, in gen_model
self.dygraph2static(save_dir, input_shapes, input_types)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 580, in dygraph2static
osp.join(save_dir, "inference_model/model"))
File "", line 2, in save
C++ Traceback (most recent call last):
0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 void paddle::platform::CudnnWorkspaceHandle::RunFunc<paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&>(paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&, unsigned long)
8 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
9 paddle::platform::GetCurrentTraceBackString[abi:cxx11]
Error Message Summary:
ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804
***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6) PID: 30571
***** SIGNAL SIGABRT(6)
******* STACKDUMP *******
stack dump [1] /usr/local/lib/libg3log.so.2.1.0-0+0x1465a [0x7fa3a64e865a]
stack dump [2] /lib/x86_64-linux-gnu/libpthread.so.0+0x12980 [0x7fa3c173c980]
stack dump [3] /lib/x86_64-linux-gnu/libc.so.6gsignal+0xc7 [0x7fa3a5b80e87]
stack dump [4] /lib/x86_64-linux-gnu/libc.so.6abort+0x141 [0x7fa3a5b827f1]
stack dump [5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8c957 [0x7fa3a61d7957]
stack dump [6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92ae6 [0x7fa3a61ddae6]
stack dump [7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92b21 [0x7fa3a61ddb21]
stack dump [8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92d54 [0x7fa3a61ddd54]
stack dump [9] /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ebe224 [0x7fa3ac59d224]