Support trt cuda graph. #53406
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
❌ The PR was not created using the PR template. You can refer to this Demo.
- Need to add UT.
- If use_cuda_graph is set, is the enqueue latency actually reduced? Please give the test timings.
- Add docs to explain when to use it.
if (use_cuda_graph && !all_nodes_offload_to_trt) {
  LOG_FIRST_N(WARNING, 1)
      << "You have enabled CudaGraph, but the entire graph is not offloaded "
         "to TRT; falling back to normal mode.";
Do we need to set use_cuda_graph to false here?
Done.
Sorry to inform you that the CIs for d296778 have been passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Is there any test data? I have a doubt here: when the whole graph runs in TRT there is only one enqueue() call, which looks like a single kernel launch; that does not match the scenario cudaGraph targets, namely reducing the overhead of launching many kernels.
Added.
Done.
Done, listed in description.
Add comments in
bool TensorRTEngine::Enqueue(nvinfer1::IExecutionContext *context,
                             std::vector<void *> *buffers,
                             int batch_size,
                             cudaStream_t stream) {
  if (cudagraph_inited_) {
    VLOG(1) << "CUDA graph initialized successfully, so the entire graph "
               "will be launched via the CUDA graph.";
    return cuda_graph_.Launch(stream);
  }
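For context, the capture-then-replay pattern this PR follows (as in trtexec) can be sketched roughly as below. This is an illustrative sketch, not the PR's exact code: `CaptureCudaGraph` and the `bindings` parameter are hypothetical names, and the code assumes CUDA >= 10.2 with fixed input shapes after capture.

```cpp
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Sketch: record one TensorRT enqueue into a CUDA graph, then instantiate
// it so later runs can replay all kernels with a single launch.
bool CaptureCudaGraph(nvinfer1::IExecutionContext *context,
                      void **bindings,
                      cudaStream_t stream,
                      cudaGraphExec_t *graph_exec) {
  cudaGraph_t graph;
  // Begin capturing all work submitted to `stream`.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  // enqueueV2 records its kernel launches into the capture instead of
  // executing them immediately; bindings must already be set up.
  bool ok = context->enqueueV2(bindings, stream, nullptr);
  cudaStreamEndCapture(stream, &graph);
  if (!ok) {
    cudaGraphDestroy(graph);
    return false;
  }
  // Instantiate once; afterwards a single cudaGraphLaunch replays the
  // whole captured kernel sequence, avoiding per-kernel launch overhead.
  cudaGraphInstantiate(graph_exec, graph, nullptr, nullptr, 0);
  cudaGraphDestroy(graph);
  return true;
}

// Replay on each inference run (shapes must not change):
//   cudaGraphLaunch(*graph_exec, stream);
//   cudaStreamSynchronize(stream);
```

This is why a single enqueue() call still benefits: the one enqueue() internally launches many kernels, and the replay collapses all of those launches into one cudaGraphLaunch.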
TODO: shape check and error reporting.
LGTM
PR types
Others
PR changes
Others
Description
Background: for some models whose entire graph is offloaded to TRT, enqueue takes a long time; cudaGraph is used to reduce this enqueue latency.
The implementation mainly follows trtexec. The integration works as follows:
- The EnableTensorRtEngine interface gains a new bool use_cuda_graph option, defaulting to false. (A log message reminds the user that if cudaGraph is enabled, the input shapes must stay unchanged.)
- TODO: add a CUDA version guard in the code. The feature was introduced in CUDA 10.0, and the minimum CUDA version we release with is 10.2, so the code does not need to be gated by compile-time macros.
- Test data was collected on two typical models: cudaGraph reduces enqueue latency and improves the asynchrony of the interface, but has almost no effect on the actual kernel execution time.
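A user would enable the new option roughly as follows. This is a hypothetical usage sketch: the exact parameter order and the position of the new flag in EnableTensorRtEngine are illustrative, not verbatim from the PR.

```cpp
#include "paddle/include/paddle_inference_api.h"

paddle_infer::Config config;
config.SetModel("model.pdmodel", "model.pdiparams");
config.EnableUseGpu(256 /*initial GPU memory, MB*/, 0 /*device id*/);
// Assumed: the PR adds a trailing `bool use_cuda_graph` flag (default
// false). CUDA graph only takes effect when the whole graph is offloaded
// to TRT, and requires input shapes to stay fixed across runs.
config.EnableTensorRtEngine(1 << 30 /*workspace*/, 1 /*max_batch*/,
                            3 /*min_subgraph_size*/,
                            paddle_infer::PrecisionType::kFloat32,
                            false /*use_static*/, false /*use_calib_mode*/,
                            true /*use_cuda_graph*/);
auto predictor = paddle_infer::CreatePredictor(config);
```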