[Hackathon No.32] Optimize the GPU compute performance of the expand_as forward & backward ops for Paddle #51510
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
"expand_as_v2 op must be less than or equal to %d.", | ||
target_rank, | ||
MAX_RANK_SUPPORTED)); | ||
|
Remove these two checks; the corresponding validation logic already exists in ExpandAsInferMeta.
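For reference, the check in ExpandAsInferMeta that makes the kernel-side duplicate unnecessary looks roughly like the following. This is a sketch reconstructed from the error message quoted above, not the verbatim Paddle source:

```cpp
// Sketch of the rank validation ExpandAsInferMeta already performs, so the
// GPU kernel does not need to repeat it. Details may differ from the real
// Paddle source; reconstructed from the quoted error message.
PADDLE_ENFORCE_LE(
    target_shape.size(),
    MAX_RANK_SUPPORTED,
    phi::errors::InvalidArgument(
        "The rank of the input 'target_tensor' for expand_as_v2 op must be "
        "less than or equal to %d.",
        MAX_RANK_SUPPORTED));
```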
Done.
Good work! Please also register the phi::dtype::float16 type in Paddle/paddle/phi/kernels/gpu/expand_as_kernel.cu (lines 22 to 29 at 6bd5b7c), add the corresponding performance test data to the Describe section of this PR, and add the fp16 performance data to the table in the PaddlePaddle/community#414 document.
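A sketch of what the registration block in expand_as_kernel.cu would look like with float16 added; the exact dtype list in the file is an assumption, not verified against the referenced lines:

```cpp
// GPU registration of the expand_as kernel with phi::dtype::float16 appended.
// The surrounding dtype list is assumed from typical phi kernel registrations.
PD_REGISTER_KERNEL(expand_as,
                   GPU,
                   ALL_LAYOUT,
                   phi::ExpandAsKernel,
                   float,
                   double,
                   int,
                   int64_t,
                   phi::dtype::float16) {}
```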
@JamesLim-sy @luotao1 The fp16 data type has been registered for the expand_as op, and the corresponding performance test data has been added. We also tried adding test cases for fp16 (…)
@luotao1 The unit tests fail after registering fp16; could you please take a look?
Paddle has a built-in unit test system; please add the corresponding build option when running cmake, then check the unit test precision.
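The comment above elides the exact option; assuming it refers to Paddle's standard WITH_TESTING switch, the build-and-test flow would look roughly like this:

```bash
# Assumed flow: WITH_TESTING enables building Paddle's unit tests; the exact
# flag intended by the reviewer was elided in the comment above.
cmake .. -DWITH_GPU=ON -DWITH_TESTING=ON
make -j$(nproc)
ctest -R test_expand_as_v2_op -V
```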
Co-authored-by: Hanchiao <ye_hanqiao@163.com>
Co-authored-by: BrianQian1999 <brianqianhitsz@gmail.com>
We are sorry: after repeated discussion, this PR does not yet meet the merge criteria. Please read the PaddlePaddle native operator development specification; you are welcome to submit a new PR. We will close this PR for now. Thank you for your contribution.
PR types
Performance optimization
PR changes
OPs
Describe
Currently, the GPU implementation of the expand_as forward and backward ops in Paddle is composed from Eigen primitives and lacks a dedicated GPU kernel, so its performance is relatively poor. The goal is to implement a high-performance GPU compute kernel, optimizing the compute performance of the expand_as op on GPU for Paddle. [Operator performance optimization design document]
Since the forward pass of expand_as behaves like broadcasting and the backward pass behaves like a sum reduction, the expand_as op is optimized by directly reusing Paddle's internal BroadcastKernel and ReduceKernel (see the sketch at the end of this description). After the optimization, the performance of Paddle (Optimized) compared with the pre-optimization Paddle (Baseline) is as follows:
Across the nine cases above, the optimized op is consistently faster, and the speedup grows with the number of tensor elements to be expanded; on case 5 the optimized op's runtime drops to 1/814 of the baseline's.
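To illustrate the idea (this is not Paddle's actual BroadcastKernel/ReduceKernel, whose internal signatures are version-specific), here is a minimal self-contained CUDA sketch: the forward pass is a broadcast copy that maps each output index back to a source index, and the backward pass sums every output gradient into its source slot. A real reduction kernel would use a tree reduction rather than atomics.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Forward: expand x (n elements) to out (n * repeat elements) by broadcasting
// along the leading dimension: out[i] = x[i % n].
__global__ void ExpandAsFwd(const float* x, float* out, int n, int repeat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n * repeat) out[i] = x[i % n];
}

// Backward: reduce the expanded gradient back to x's shape by summation,
// dx[i % n] += dout[i]. Paddle's ReduceKernel would use a tree reduction;
// atomicAdd keeps this sketch short.
__global__ void ExpandAsBwd(const float* dout, float* dx, int n, int repeat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n * repeat) atomicAdd(&dx[i % n], dout[i]);
}

int main() {
  const int n = 4, repeat = 3, m = n * repeat;
  float hx[n] = {1.f, 2.f, 3.f, 4.f}, hdx[n];
  float *d_x, *d_out, *d_dx;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMalloc(&d_out, m * sizeof(float));
  cudaMalloc(&d_dx, n * sizeof(float));
  cudaMemcpy(d_x, hx, n * sizeof(float), cudaMemcpyHostToDevice);
  ExpandAsFwd<<<1, 64>>>(d_x, d_out, n, repeat);   // out = expand(x)
  cudaMemset(d_dx, 0, n * sizeof(float));
  // Reuse the forward output as the incoming gradient for this demo.
  ExpandAsBwd<<<1, 64>>>(d_out, d_dx, n, repeat);  // dx = sum over copies
  cudaMemcpy(hdx, d_dx, n * sizeof(float), cudaMemcpyDeviceToHost);
  // Each dx[i] is `repeat` copies of x[i] summed: expect 3, 6, 9, 12.
  for (int i = 0; i < n; ++i) printf("dx[%d] = %.1f\n", i, hdx[i]);
  cudaFree(d_x); cudaFree(d_out); cudaFree(d_dx);
  return 0;
}
```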