
Cannot run stably for 20 hours: Paddle Inference C++ multi-threaded prediction frequently exits with GPU-related errors #44323

Closed
kouhinn opened this issue Jul 14, 2022 · 19 comments
Assignees

Comments

@kouhinn

kouhinn commented Jul 14, 2022

Describe the Bug

We run multi-threaded inference through the Paddle Inference C++ API. Following the official samples, each thread owns its own predictor, and to save GPU memory we call ClearIntermediateTensor and TryShrinkMemory after every predictor.Run(). (Related: #43346)

Various abnormal exits occur frequently at runtime.
Paddle and related versions:
cuda: 10.2
cudnn: 7.6.5.32
paddle: 2.2.2

A summary of the error message and stack trace:

Error Message Summary:

ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804

The full error message is given below.

Additional Supplementary Information

Error message and stack trace:
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():

Compile Traceback (most recent call last):
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\Scripts\x2paddle-script.py", line 33, in <module>
sys.exit(load_entry_point('x2paddle==1.3.5', 'console_scripts', 'x2paddle')())
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 373, in main
lite_model_type=args.lite_model_type)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 234, in onnx2paddle
mapper.paddle_graph.gen_model(save_dir)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 296, in gen_model
self.dygraph2static(save_dir, input_shapes, input_types)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 580, in dygraph2static
osp.join(save_dir, "inference_model/model"))
File "", line 2, in save

File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
  return wrapped_func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\base.py", line 51, in __impl__
  return func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\jit.py", line 744, in save
  inner_input_spec)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 517, in concrete_program_specify_input_spec
  *desired_input_spec)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 427, in get_concrete_program
  concrete_program, partial_program_layer = self._program_cache[cache_key]
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 723, in __getitem__
  self._caches[item] = self._build_once(item)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 714, in _build_once
  **cache_key.kwargs)
File "<decorator-gen-99>", line 2, in from_func_spec
  
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
  return wrapped_func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\base.py", line 51, in __impl__
  return func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 662, in from_func_spec
  outputs = static_func(*inputs)
File "personbasemodelonnx2paddle\x2paddle_code.py", line 315, in forward
  x2paddle_convolution_output96 = self.conv1(x2paddle_convolution_output96_paded)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 917, in __call__
  return self._dygraph_call_func(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 907, in _dygraph_call_func
  outputs = self.forward(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\layer\conv.py", line 677, in forward
  use_cudnn=self._use_cudnn)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\functional\conv.py", line 148, in _conv_nd
  type=op_type, inputs=inputs, outputs=outputs, attrs=attrs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\layer_helper.py", line 43, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 3184, in append_op
  attrs=kwargs.get("attrs", None))
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 2224, in __init__
  for frame in traceback.extract_stack():

C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 void paddle::platform::CudnnWorkspaceHandle::RunFunc<paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&>(paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&, unsigned long)
8 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
9 paddle::platform::GetCurrentTraceBackString[abi:cxx11](bool)


Error Message Summary:

ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804

***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6) PID: 30571

***** SIGNAL SIGABRT(6)

******* STACKDUMP *******
stack dump [1] /usr/local/lib/libg3log.so.2.1.0-0+0x1465a [0x7fa3a64e865a]
stack dump [2] /lib/x86_64-linux-gnu/libpthread.so.0+0x12980 [0x7fa3c173c980]
stack dump [3] /lib/x86_64-linux-gnu/libc.so.6 : gsignal+0xc7 [0x7fa3a5b80e87]
stack dump [4] /lib/x86_64-linux-gnu/libc.so.6 : abort+0x141 [0x7fa3a5b827f1]
stack dump [5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8c957 [0x7fa3a61d7957]
stack dump [6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92ae6 [0x7fa3a61ddae6]
stack dump [7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92b21 [0x7fa3a61ddb21]
stack dump [8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92d54 [0x7fa3a61ddd54]
stack dump [9] /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ebe224 [0x7fa3ac59d224]

stack dump [10]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::NaiveExecutor::Run()+0x130 [0x7fa3acce05d0]

stack dump [11]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::AnalysisPredictor::ZeroCopyRun()+0x293 [0x7fa3ac98be73]

stack dump [12]  ./xxxx : doInference(paddle_infer::Predictor&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&)+0x10d [0x563a23f08aad]
@paddle-bot

paddle-bot bot commented Jul 14, 2022

Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also check the API documentation, FAQ, GitHub issues, and the AI community to find an answer. Have a nice day!

@kouhinn
Author

kouhinn commented Jul 14, 2022

These errors cannot be reproduced under light load. Under heavy load (multiple models, hundreds of predictors running across threads) they occur every time.

@kouhinn
Author

kouhinn commented Jul 14, 2022

The error is not always the same; sometimes it looks like the following:

  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::AnalysisPredictor::ZeroCopyRun()
1   paddle::framework::NaiveExecutor::Run()
2   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 5ul, paddle::operators::ShapeKernel<bool>, paddle::operators::ShapeKernel<int>, paddle::operators::ShapeKernel<signed char>, paddle::operators::ShapeKernel<unsigned char>, paddle::operators::ShapeKernel<long>, paddle::operators::ShapeKernel<float>, paddle::operators::ShapeKernel<double>, paddle::operators::ShapeKernel<paddle::platform::float16>, paddle::operators::ShapeKernel<paddle::platform::complex<float> >, paddle::operators::ShapeKernel<paddle::platform::complex<double> > >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
7   std::_Sp_counted_deleter<paddle::memory::allocation::Allocation*, paddle::memory::allocation::Allocator::AllocationDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
8   paddle::memory::allocation::RetryAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
9   paddle::memory::allocation::NaiveBestFitAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
10  paddle::memory::detail::BuddyAllocator::Free(void*)
11  paddle::memory::detail::MetadataCache::LoadDesc(paddle::memory::detail::MemoryBlock*)
12  paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
13  paddle::platform::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
NotFoundError: The memory block is not found in cache
  [Hint: Expected iter != cache_.end(), but received iter == cache_.end().] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/memory/detail/meta_cache.cc:30)

***** FATAL SIGNAL RECEIVED ******* 
Received fatal signal: SIGABRT(6)	PID: 23645

***** SIGNAL SIGABRT(6)

*******	STACKDUMP *******
	stack dump [1]  /usr/local/lib/libg3log.so.2.1.0-0+0x1465a [0x7f857d85365a]
	stack dump [2]  /lib/x86_64-linux-gnu/libpthread.so.0+0x12980 [0x7f8598aa7980]
	stack dump [3]  /lib/x86_64-linux-gnu/libc.so.6 : gsignal+0xc7 [0x7f857ceebe87]
	stack dump [4]  /lib/x86_64-linux-gnu/libc.so.6 : abort+0x141 [0x7f857ceed7f1]
	stack dump [5]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8c957 [0x7f857d542957]
	stack dump [6]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92ae6 [0x7f857d548ae6]
	stack dump [7]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x91b49 [0x7f857d547b49]
	stack dump [8]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6 : __gxx_personality_v0+0x2a8 [0x7f857d5484b8]
	stack dump [9]  /lib/x86_64-linux-gnu/libgcc_s.so.1+0x10573 [0x7f857d2ae573]
	stack dump [10]  /lib/x86_64-linux-gnu/libgcc_s.so.1 : _Unwind_Resume+0x125 [0x7f857d2aedf5]
	stack dump [11]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ecc378 [0x7f8583916378]

	stack dump [12]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::memory::allocation::NaiveBestFitAllocator::FreeImpl(paddle::memory::allocation::Allocation*)+0xc5 [0x7f8589b53c95]

	stack dump [13]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::memory::allocation::RetryAllocator::FreeImpl(paddle::memory::allocation::Allocation*)+0x41 [0x7f8589b66e31]

	stack dump [14]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : std::_Sp_counted_deleter<paddle::memory::allocation::Allocation*, paddle::memory::allocation::Allocator::AllocationDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25 [0x7f8584921f85]
	stack dump [15]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x227e757 [0x7f8583cc8757]

	stack dump [16]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)+0xc5 [0x7f8583fdafa5]

	stack dump [17]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 5ul, paddle::operators::ShapeKernel<bool>, paddle::operators::ShapeKernel<int>, paddle::operators::ShapeKernel<signed char>, paddle::operators::ShapeKernel<unsigned char>, paddle::operators::ShapeKernel<long>, paddle::operators::ShapeKernel<float>, paddle::operators::ShapeKernel<double>, paddle::operators::ShapeKernel<paddle::platform::float16>, paddle::operators::ShapeKernel<paddle::platform::complex<float> >, paddle::operators::ShapeKernel<paddle::platform::complex<double> > >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)+0x12f [0x7f85885f5ddf]

	stack dump [18]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const+0x312 [0x7f8589a35dd2]

	stack dump [19]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const+0x148 [0x7f8589a36628]

	stack dump [20]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)+0x1c7 [0x7f8589a324c7]

	stack dump [21]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::NaiveExecutor::Run()+0x130 [0x7f858404b5d0]

	stack dump [22]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::AnalysisPredictor::ZeroCopyRun()+0x293 [0x7f8583cf6e73]

@kouhinn
Author

kouhinn commented Jul 15, 2022

+1:
read frame put queue ret: 0
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():

Compile Traceback (most recent call last):
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\Scripts\x2paddle-script.py", line 33, in <module>
sys.exit(load_entry_point('x2paddle==1.3.5', 'console_scripts', 'x2paddle')())
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 373, in main
lite_model_type=args.lite_model_type)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\convert.py", line 234, in onnx2paddle
mapper.paddle_graph.gen_model(save_dir)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 296, in gen_model
self.dygraph2static(save_dir, input_shapes, input_types)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\x2paddle-1.3.5-py3.7.egg\x2paddle\core\program.py", line 580, in dygraph2static
osp.join(save_dir, "inference_model/model"))
File "", line 2, in save

File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
  return wrapped_func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\base.py", line 51, in __impl__
  return func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\jit.py", line 744, in save
  inner_input_spec)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 517, in concrete_program_specify_input_spec
  *desired_input_spec)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 427, in get_concrete_program
  concrete_program, partial_program_layer = self._program_cache[cache_key]
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 723, in __getitem__
  self._caches[item] = self._build_once(item)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 714, in _build_once
  **cache_key.kwargs)
File "<decorator-gen-99>", line 2, in from_func_spec
  
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
  return wrapped_func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\base.py", line 51, in __impl__
  return func(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\dygraph_to_static\program_translator.py", line 662, in from_func_spec
  outputs = static_func(*inputs)
File "personbasemodelonnx2paddle\x2paddle_code.py", line 401, in forward
  x2paddle_convolution_output36_paded = self.pad2(x2paddle_mish_17_mul_0)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 917, in __call__
  return self._dygraph_call_func(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 907, in _dygraph_call_func
  outputs = self.forward(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\layer\common.py", line 1103, in forward
  name=self._name)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\functional\common.py", line 1334, in pad
  x = unsqueeze(x, axis=unsqueezed_dim)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\tensor\manipulation.py", line 1229, in unsqueeze
  return layers.unsqueeze(x, axis, name)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\layers\nn.py", line 6422, in unsqueeze
  "XShape": x_shape})
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\layer_helper.py", line 43, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 3184, in append_op
  attrs=kwargs.get("attrs", None))
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 2224, in __init__
  for frame in traceback.extract_stack():

C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, bool>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, int>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, unsigned char>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, signed char>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, paddle::platform::complex<float> >, paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, paddle::platform::complex<double> > >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::UnsqueezeKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
7 paddle::framework::TensorCopy(paddle::framework::Tensor const&, paddle::platform::Place const&, paddle::platform::DeviceContext const&, paddle::framework::Tensor*)
8 void paddle::memory::Copy<paddle::platform::CUDAPlace, paddle::platform::CUDAPlace>(paddle::platform::CUDAPlace, void*, paddle::platform::CUDAPlace, void const*, unsigned long, CUstream_st*)
9 paddle::platform::GpuMemcpyAsync(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*)
10 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
11 paddle::platform::GetCurrentTraceBackString[abi:cxx11](bool)


Error Message Summary:

ExternalError: CUDA error(1), invalid argument.
[Hint: Please search for the error code(1) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/platform/gpu_info.cc:429)
[operator < unsqueeze2 > error]

@kouhinn
Author

kouhinn commented Jul 16, 2022

The config options used are as follows:

// First argument: initial GPU memory pool size in MB; second argument: device ID.
config.EnableUseGpu(200, 0);

// Enable memory / GPU memory reuse.
config.EnableMemoryOptim();

// When set to false, this disables all graph optimizations in the
// graph-analysis phase; inference then runs the same forward code as training.
config.SwitchIrOptim(true);

// Enable cuDNN to speed up prediction.
config.EnableCUDNN();
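Put together, the fragments above correspond to a setup routine roughly like the following (not compiled here; the `MakePredictor` name and the `SetModel` paths are placeholders for illustration, not taken from the report):

```cpp
#include "paddle_inference_api.h"

#include <memory>

// Build one predictor with the options quoted above; each worker thread
// should call this once and keep its own instance.
std::shared_ptr<paddle_infer::Predictor> MakePredictor(int device_id) {
    paddle_infer::Config config;
    // Placeholder model paths.
    config.SetModel("model_dir/model.pdmodel", "model_dir/model.pdiparams");
    config.EnableUseGpu(200, device_id);  // 200 MB initial GPU pool on device_id
    config.EnableMemoryOptim();           // memory / GPU memory reuse
    config.SwitchIrOptim(true);           // keep IR graph optimizations on
    config.EnableCUDNN();                 // cuDNN acceleration
    return paddle_infer::CreatePredictor(config);
}
```
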

@kouhinn
Author

kouhinn commented Jul 17, 2022

+1:
what():


C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ClearIntermediateTensor()
1 std::_Sp_counted_deleter<paddle::memory::allocation::Allocation*, paddle::memory::allocation::Allocator::AllocationDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
2 paddle::memory::allocation::RetryAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
3 paddle::memory::allocation::NaiveBestFitAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
4 paddle::memory::detail::BuddyAllocator::Free(void*)
5 paddle::memory::detail::MetadataCache::LoadDesc(paddle::memory::detail::MemoryBlock*)
6 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
7 paddle::platform::GetCurrentTraceBackString[abi:cxx11](bool)


Error Message Summary:

NotFoundError: The memory block is not found in cache
[Hint: Expected iter != cache_.end(), but received iter == cache_.end().] (at /xxxxxxx/Paddle_2.2.2/Paddle/paddle/fluid/memory/detail/meta_cache.cc:30)

...............
stack dump [11] /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ecc378 [0x7ff8d7961378]

stack dump [12]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::memory::allocation::NaiveBestFitAllocator::FreeImpl(paddle::memory::allocation::Allocation*)+0xc5 [0x7ff8ddb9ec95]

stack dump [13]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::memory::allocation::RetryAllocator::FreeImpl(paddle::memory::allocation::Allocation*)+0x41 [0x7ff8ddbb1e31]

stack dump [14]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : std::_Sp_counted_deleter<paddle::memory::allocation::Allocation*, paddle::memory::allocation::Allocator::AllocationDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x25 [0x7ff8d896cf85]
stack dump [15]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x227e757 [0x7ff8d7d13757]

stack dump [16]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::AnalysisPredictor::ClearIntermediateTensor()+0x238 [0x7ff8d7d3db68]

stack dump [17]  ./anniwo_c.bin : doInference(paddle_infer::Predictor&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&)+0x1e5 [0x55732eba4a85]

stack dump [18]  ./anniwo_c.bin : BasePersonDetection::detect(int, cv::Mat&, std::vector<Object, std::allocator<Object> >&)+0x56a [0x55732eae11fa]

stack dump [19]  ./anniwo_c.bin : PERSONBASE_DET::do_detect(int, cv::Mat&)+0x1b9 [0x55732e9d9a13]
stack dump [20]  ./anniwo_c.bin+0x57f871 [0x55732e9ac871]

stack dump [21]  ./anniwo_c.bin : detectFunc(void*)+0x1f6e [0x55732e9af75f]

stack dump [22]  ./anniwo_c.bin : void* std::__invoke_impl<void*, void* (*)(void*), void*>(std::__invoke_other, void* (*&&)(void*), void*&&)+0x34 [0x55732e9fb4c9]

stack dump [23]  ./anniwo_c.bin : std::__invoke_result<void* (*)(void*), void*>::type std::__invoke<void* (*)(void*), void*>(void* (*&&)(void*), void*&&)+0x46 [0x55732e9e86fa]

stack dump [24]  ./anniwo_c.bin : decltype (__invoke((_S_declval<0ul>)(), (_S_declval<1ul>)())) std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>)+0x43 [0x55732ea8fbef]

stack dump [25]  ./anniwo_c.bin : std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::operator()()+0x27 [0x55732ea8ea5b]

stack dump [26]  ./anniwo_c.bin : std::thread::_State_impl<std::thread::_Invoker<std::tuple<void* (*)(void*), void*> > >::_M_run()+0x1c [0x55732ea8b58a]
stack dump [27]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xbd6df [0x7ff8d15be6df]
stack dump [28]  /lib/x86_64-linux-gnu/libpthread.so.0+0x76db [0x7ff8ecae76db]
stack dump [29] /lib/x86_64-linux-gnu/libc.so.6 : clone+0x3f [0x7ff8d101961f]

Exiting after fatal event (FATAL_SIGNAL). Fatal type: SIGABRT
Log content flushed sucessfully to sink

@kouhinn
Author

kouhinn commented Jul 19, 2022

+1:
C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 void paddle::platform::CudnnWorkspaceHandle::RunFunc<paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&>(paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&, unsigned long)
8 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
9 paddle::platform::GetCurrentTraceBackString[abi:cxx11](bool)


Error Message Summary:

ExternalError: CUDNN error(7), CUDNN_STATUS_MAPPING_ERROR.
[Hint: Please search for the error code(7) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]

@jiweibo
Contributor

jiweibo commented Jul 19, 2022

File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 917, in __call__
  return self._dygraph_call_func(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\dygraph\layers.py", line 907, in _dygraph_call_func
  outputs = self.forward(*inputs, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\layer\conv.py", line 677, in forward
  use_cudnn=self._use_cudnn)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\nn\functional\conv.py", line 148, in _conv_nd
  type=op_type, inputs=inputs, outputs=outputs, attrs=attrs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\layer_helper.py", line 43, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 3184, in append_op
  attrs=kwargs.get("attrs", None))
File "C:\Users\admin\anaconda3\envs\py37_tensorflow1_14\lib\site-packages\paddle\fluid\framework.py", line 2224, in __init__
  for frame in traceback.extract_stack():

C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 void paddle::platform::CudnnWorkspaceHandle::RunFunc<paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&>(paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const::{lambda(void*)#2}&, unsigned long)
8 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
9 paddle::platform::GetCurrentTraceBackStringabi:cxx11

Error Message Summary:

ExternalError: CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /home/xiangbin_train_workspace/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/operators/fused/conv_fusion_op.cu:381)
[operator < conv2d_fusion > error]
2022/07/13 08:15:47 454804

***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6) PID: 30571

***** SIGNAL SIGABRT(6)

******* STACKDUMP *******
stack dump [1]  /usr/local/lib/libg3log.so.2.1.0-0+0x1465a [0x7fa3a64e865a]
stack dump [2]  /lib/x86_64-linux-gnu/libpthread.so.0+0x12980 [0x7fa3c173c980]
stack dump [3]  /lib/x86_64-linux-gnu/libc.so.6 : gsignal+0xc7 [0x7fa3a5b80e87]
stack dump [4]  /lib/x86_64-linux-gnu/libc.so.6 : abort+0x141 [0x7fa3a5b827f1]
stack dump [5]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8c957 [0x7fa3a61d7957]
stack dump [6]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92ae6 [0x7fa3a61ddae6]
stack dump [7]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92b21 [0x7fa3a61ddb21]
stack dump [8]  /usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x92d54 [0x7fa3a61ddd54]
stack dump [9]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so+0x1ebe224 [0x7fa3ac59d224]

stack dump [10]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::NaiveExecutor::Run()+0x130 [0x7fa3acce05d0]

stack dump [11]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::AnalysisPredictor::ZeroCopyRun()+0x293 [0x7fa3ac98be73]

stack dump [12]  ./xxxx : doInference(paddle_infer::Predictor&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&)+0x10d [0x563a23f08aad]

A quick question: in the multi-threaded case, does the error occur only when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur without calling them?

@kouhinn
Author

kouhinn commented Jul 20, 2022

A quick question: in the multi-threaded case, does the error occur only when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur without calling them?

Without calling them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough.
I also tried removing the following setting:
// Enable cuDNN for inference acceleration
//config.EnableCUDNN();

The problem persists.

@jiweibo
Contributor

jiweibo commented Jul 21, 2022

A quick question: in the multi-threaded case, does the error occur only when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur without calling them?

Without calling them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough. I also tried removing the following setting: // Enable cuDNN for inference acceleration //config.EnableCUDNN();

The problem persists.

There is a memory-reuse interface; you could try whether config.EnableMemoryOptim() helps with your model.
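For reference, wiring EnableMemoryOptim into a GPU config usually looks like the sketch below. This is a minimal outline only, not a tested setup: the model file names are placeholders, the header path depends on how your paddle_inference package is laid out, and building it requires linking libpaddle_inference.

```cpp
#include "paddle_inference_api.h"  // path varies by paddle_inference package layout

#include <memory>

std::shared_ptr<paddle_infer::Predictor> MakePredictor() {
  paddle_infer::Config config;
  config.SetModel("model.pdmodel", "model.pdiparams");  // placeholder paths
  config.EnableUseGpu(256 /* initial GPU pool in MB */, 0 /* device id */);
  // Reuse intermediate activation memory across ops within a Run(),
  // instead of shrinking the pool after every Run().
  config.EnableMemoryOptim();
  return paddle_infer::CreatePredictor(config);
}
```

Each thread would then hold its own predictor created this way, matching the one-predictor-per-thread setup described above.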

@jiweibo
Contributor

jiweibo commented Jul 21, 2022

A quick question: in the multi-threaded case, does the error occur only when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur without calling them?

Without calling them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough. I also tried removing the following setting: // Enable cuDNN for inference acceleration //config.EnableCUDNN();
The problem persists.

There is a memory-reuse interface; you could try whether config.EnableMemoryOptim() helps with your model.

TryShrinkMemory indeed does not look thread safe. You could first try whether memory reuse works for you; if it does not, add a lock around this interface and see: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/analysis_predictor.cc#L1809

I will see whether I can reproduce the problem locally.

@paddle-bot paddle-bot bot added status/following-up 跟进中 and removed status/new-issue 新建 labels Jul 21, 2022
@kouhinn
Author

kouhinn commented Jul 21, 2022

A quick question: in the multi-threaded case, does the error occur only when ClearIntermediateTensor and TryShrinkMemory are called after Run? Does it still occur without calling them?

Without calling them, GPU memory simply blows up; usage grows so large that even 16 GB is not enough. I also tried removing the following setting: // Enable cuDNN for inference acceleration //config.EnableCUDNN();
The problem persists.

There is a memory-reuse interface; you could try whether config.EnableMemoryOptim() helps with your model.

TryShrinkMemory indeed does not look thread safe. You could first try whether memory reuse works for you; if it does not, add a lock around this interface and see: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/analysis_predictor.cc#L1809

I will see whether I can reproduce the problem locally.

config.EnableMemoryOptim() has always been set (as mentioned above).

As for the lock you mentioned: is it only the paddle::memory::Release(place_) call that needs protecting? I modified the Paddle code like this:

paddle\fluid\inference\api\analysis_predictor.h:

    // A mutex to make memory release thread safe.
    std::mutex memrel_mutex_;

paddle\fluid\inference\api\analysis_predictor.cc:

    uint64_t AnalysisPredictor::TryShrinkMemory() {
      ClearIntermediateTensor();

      std::lock_guard<std::mutex> lk(memrel_mutex_);
      return paddle::memory::Release(place_);
    }

@kouhinn
Author

kouhinn commented Jul 23, 2022

In the application, a lock is now taken around every call to TryShrinkMemory(); the code is as follows:

    ANNIWOCHECK( m_id_predictors[camID]->Run() );

    if (true)
    {
      std::lock_guard<std::mutex> lk(anniwo_memrel_mutex);
      m_id_predictors[camID]->TryShrinkMemory();
    }

But the same problem still occurs.

The log is as follows:

terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():


C++ Traceback (most recent call last):

0 paddle::AnalysisPredictor::ZeroCopyRun()
1 paddle::framework::NaiveExecutor::Run()
2 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel, paddle::operators::CUDNNConvFusionOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6 paddle::operators::CUDNNConvFusionOpKernel::Compute(paddle::framework::ExecutionContext const&) const
7 paddle::memory::allocation::RetryAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
8 paddle::memory::allocation::NaiveBestFitAllocator::FreeImpl(paddle::memory::allocation::Allocation*)
9 paddle::memory::detail::BuddyAllocator::Free(void*)
10 paddle::memory::detail::MetadataCache::LoadDesc(paddle::memory::detail::MemoryBlock*)
11 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
12 paddle::platform::GetCurrentTraceBackStringabi:cxx11


Error Message Summary:

NotFoundError: The memory block is not found in cache
[Hint: Expected iter != cache_.end(), but received iter == cache_.end().] (at /home/xxx/PaddlePaddleWorkspace/Paddle_2.2.2/Paddle/paddle/fluid/memory/detail/meta_cache.cc:30)

@kouhinn
Author

kouhinn commented Jul 23, 2022

Sometimes it is a SEGV error instead:
stack dump [10]  /usr/local/cuda/lib64/libcudnn.so : cudnnGetConvolutionForwardAlgorithm+0x659 [0x7f49a87ee5d9]

stack dump [11]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : decltype (cudnnGetConvolutionForwardAlgorithm({parm#1}...)) paddle::platform::dynload::DynLoad__cudnnGetConvolutionForwardAlgorithm::operator()<cudnnContext*, cudnnTensorStruct*, cudnnFilterStruct*, cudnnConvolutionStruct*, cudnnTensorStruct*, cudnnConvolutionFwdPreference_t, unsigned long, cudnnConvolutionFwdAlgo_t*>(cudnnContext*, cudnnTensorStruct*, cudnnFilterStruct*, cudnnConvolutionStruct*, cudnnTensorStruct*, cudnnConvolutionFwdPreference_t, unsigned long, cudnnConvolutionFwdAlgo_t*)+0xd3 [0x7f4e80105653]

stack dump [12]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::operators::CUDNNConvFusionOpKernel<float>::Compute(paddle::framework::ExecutionContext const&) const+0x1185 [0x7f4e8010bf75]

stack dump [13]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvFusionOpKernel<float>, paddle::operators::CUDNNConvFusionOpKernel<double> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)+0x33 [0x7f4e8010dfe3]

stack dump [14]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const+0x312 [0x7f4e846a0dd2]

stack dump [15]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const+0x148 [0x7f4e846a1628]

stack dump [16]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)+0x1c7 [0x7f4e8469d4c7]

stack dump [17]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::framework::NaiveExecutor::Run()+0x130 [0x7f4e7ecb65d0]

stack dump [18]  /opt/paddle_lib/paddle_inference/paddle/lib/libpaddle_inference.so : paddle::AnalysisPredictor::ZeroCopyRun()+0x293 [0x7f4e7e961e73]

stack dump [19]  ./xxx : doInference(paddle_infer::Predictor&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&, std::vector<float, std::allocator<float> >&)+0x10d [0x55fef893393d]

@jiweibo
Contributor

jiweibo commented Aug 3, 2022

Could you share a reproduction environment? I could not reproduce this locally with the model from the demo.

@kouhinn
Author

kouhinn commented Aug 11, 2022

Please give me a bit of time to put together the reproduction program.

@Fire-Star

Fire-Star commented Sep 28, 2022

I hit this problem as well. I modified cpp_infer directly; multi-threaded continuous GPU inference crashes. On a desktop machine (CUDA 11.6, cuDNN 8.4, TensorRT 8.4.15, NVIDIA A10) it takes 1–4 hours of continuous inference to reproduce; on a laptop (CUDA 10.1, cuDNN 7.6.5, no TensorRT, NVIDIA GTX 1660 Ti) it reproduces within ten-odd seconds.

Operating systems:
the laptop runs Windows 11
the desktop runs Windows Server 2019 Datacenter

Paddle Inference version: 2.3.2

[screenshot of the error]

At first I assumed it was a problem in my code, but continuous CPU inference never crashes; only GPU inference does.

I then suspected a mismatch between the CUDA version and the GPU, but they in fact correspond; I verified this in Control Panel → System Information → [Display | Components].
Below is the desktop machine's information:
[screenshots of the system information]

Complete code:
cpp_infer.zip

A DLL that can be run directly to reproduce the bug, with a ready-to-run Python test script:
Due to the upload size limit it is split into three archives; extract them into the same folder.
GPU_Recurring_BUG_1部分.zip
GPU_Recurring_BUG_2部分.zip
GPU_Recurring_BUG_3部分.zip

@Fire-Star

Hi all, I opened a new issue in PaddleOCR; the problem is still unsolved: #7757

@paddle-bot paddle-bot bot closed this as completed Oct 3, 2023
@paddle-bot

paddle-bot bot commented Oct 3, 2023

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
