
Support trt cuda graph. #53406

Merged (10 commits) May 9, 2023

Conversation

@jiweibo (Contributor) commented Apr 27, 2023

PR types

Others

PR changes

Others

Description

Background: for some models whose entire graph is offloaded to TRT, the enqueue call itself is very expensive; this PR uses CUDA Graph to reduce that enqueue cost.
The implementation mainly follows trtexec. The integration flow is:

  1. The EnableTensorRtEngine API gains a new bool use_cuda_graph option, defaulting to false. (A log reminds the user that if CUDA Graph is enabled, the input shapes must stay fixed.)
  2. The pass stage checks whether the entire graph is offloaded to TRT; the feature is enabled only in that case.
  3. The TRT engine build stage needs no changes.
  4. The run stage adds a trial run. If it succeeds (not all ops under TRT support CUDA Graph, so capture may fail), all subsequent runs go through the CUDA Graph path; otherwise execution falls back to the original path.

TODO:

  • Verify whether enabling CUDA Graph conflicts with memory sharing.
  • Add a CUDA version restriction in the code. The feature was introduced in CUDA 10.0, and the minimum CUDA in released packages is 10.2, so no macro is needed to guard compilation.

Test data from two representative models:

| Model | enqueue time | total run time |
| --- | --- | --- |
| pnc1 + no cudagraph | 12.88 ms | 14.64 ms |
| pnc1 + cudagraph | 0.12 ms | 14.44 ms |
| a vision model + no cudagraph | 0.29 ms | 6.50 ms |
| a vision model + cudagraph | 0.04 ms | 6.50 ms |

CUDA Graph reduces the enqueue cost and improves the asynchrony of the API, but has almost no effect on the actual kernel execution time.

@paddle-bot bot commented Apr 27, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot bot commented Apr 27, 2023

❌ The PR was not created using the PR template. You can refer to this Demo.
Please use the PR template; it helps save our maintainers' time so that more developers get help.

@qingqing01 (Contributor) left a comment:

  1. Need to add a UT.
  2. If use_cuda_graph is set, is the enqueue latency reduced? Provide timing data.
  3. Add docs to explain when to use it.

    if (use_cuda_graph && !all_nodes_offload_to_trt) {
      LOG_FIRST_N(WARNING, 1)
          << "You have enabled CudaGraph, but not the entire graph offload to "
             "trt, now return to normal mode.";

Contributor commented:

Should use_cuda_graph be set to false here?

@jiweibo (Contributor, Author) replied:

Done.

@paddle-ci-bot bot commented May 6, 2023

Sorry to inform you that d296778's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@zhangjun (Contributor) commented May 8, 2023

Is there any test data? I have a question here: with the whole graph in TRT there is only a single enqueue() call, which amounts to a single launch submission; this does not seem to match the scenario CUDA Graph optimizes, namely amortizing the overhead of launching many kernels.

@jiweibo (Contributor, Author) commented May 8, 2023

> Is there any test data? I have a question here: with the whole graph in TRT there is only a single enqueue() call, which amounts to a single launch submission; this does not seem to match the scenario CUDA Graph optimizes, namely amortizing the overhead of launching many kernels.

Added; see the data in the description.

@jiweibo (Contributor, Author) commented May 8, 2023

> Need to add a UT.

Done.

> If use_cuda_graph is set, is the enqueue latency reduced? Provide timing data.

Done; listed in the description.

> Add docs to explain when to use it.

Added comments in paddle_analysis_config.h.

Comment on lines +169 to +177

    bool TensorRTEngine::Enqueue(nvinfer1::IExecutionContext *context,
                                 std::vector<void *> *buffers,
                                 int batch_size,
                                 cudaStream_t stream) {
      if (cudagraph_inited_) {
        VLOG(1) << "cuda_graph init success, so we will use cuda graph launch the "
                   "entire graph.";
        return cuda_graph_.Launch(stream);
      }

Contributor commented:

TODO: shape check and error reporting

@zhangjun (Contributor) left a comment:

LGTM

@jiweibo jiweibo merged commit ea0abf9 into PaddlePaddle:develop May 9, 2023
@jiweibo jiweibo deleted the idg_test branch May 9, 2023 08:20