
[Hackathon No.32] Optimize the GPU performance of Paddle's expand_as forward & backward ops #51510

Closed
wants to merge 8 commits

Conversation

@Timber-Ye (Contributor) commented Mar 11, 2023

PR types

Performance optimization

PR changes

OPs

Describe

Paddle's current GPU implementation of the expand_as forward and backward operators is composed from Eigen primitives and lacks a dedicated GPU kernel, so its performance is relatively poor. This PR implements a high-performance GPU compute kernel to optimize the GPU performance of the expand_as op for Paddle.
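For reference, a minimal example of what the op computes at the Python level, using the standard paddle.expand_as API (the tensor values are illustrative only):

```python
import paddle

# expand_as broadcasts x to the shape of y.
x = paddle.to_tensor([[1.0], [2.0], [3.0]])  # shape [3, 1]
y = paddle.zeros([3, 4])                     # target shape [3, 4]

out = paddle.expand_as(x, y)
print(out.shape)  # [3, 4]; each row of x is repeated along the last axis
```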

  • Development environment
  1. Device: Tesla V100-32G
  2. CUDA 11.2, cuDNN v8.1.1
  • Optimization method

[Operator performance optimization design document]

Since the forward pass of expand_as behaves like a broadcast and the backward pass behaves like a sum reduction, the operator is optimized by directly reusing Paddle's internal BroadcastKernel and ReduceKernel, as sketched below.
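A minimal NumPy sketch (illustrative only, not the PR's CUDA code) of the equivalence this optimization relies on: the forward pass is a plain broadcast, and the backward pass sums the upstream gradient over the expanded axes.

```python
import numpy as np

src = np.random.rand(5, 1, 1).astype(np.float32)  # src_shape from cases 1/4
dst_shape = (5, 128, 128)                         # dst_shape from cases 1/4

# Forward: expand_as is exactly a broadcast of src to dst_shape.
out = np.broadcast_to(src, dst_shape)

# Backward: the gradient w.r.t. src sums grad_out over every axis that was
# expanded (size 1 in src, larger in dst), keeping dims so shapes line up.
grad_out = np.ones(dst_shape, dtype=np.float32)
expanded_axes = tuple(
    i for i, (s, d) in enumerate(zip(src.shape, dst_shape)) if s == 1 and d > 1)
grad_src = grad_out.sum(axis=expanded_axes, keepdims=True)

assert out.shape == dst_shape
assert grad_src.shape == src.shape  # each entry equals 128 * 128
```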

  • Optimization results

Performance comparison between Paddle after optimization (Optimized) and before optimization (Baseline):

| Case | Data type | src_shape | dst_shape | Paddle Baseline (ms) | Optimized (ms) | Diff |
|------|-----------|-----------|-----------|----------------------|----------------|------|
| 0 | float16 | [1785, 1] | [1785, 128] | 0.1971 | 0.1166 | 40.835% faster |
| 1 | float16 | [5, 1, 1] | [5, 128, 128] | 3.0909 | 0.1269 | 95.895% faster |
| 2 | float16 | [32, 807, 1] | [32, 807, 807] | 1.3884 | 0.3940 | 71.620% faster |
| 3 | float32 | [1785, 1] | [1785, 128] | 0.2244 | 0.1150 | 48.760% faster |
| 4 | float32 | [5, 1, 1] | [5, 128, 128] | 3.6155 | 0.1179 | 96.738% faster |
| 5 | float32 | [32, 807, 1] | [32, 807, 807] | 1.4826 | 0.6428 | 56.646% faster |
| 6 | float64 | [32, 1, 1] | [32, 807, 807] | 288.7776 | 1.2293 | 99.570% faster |
| 7 | float64 | [1, 1, 64, 5] | [64, 128, 64, 5] | 3.1326 | 0.2746 | 91.645% faster |
| 8 | float64 | [5, 1, 1, 1, 1] | [5, 1, 713, 1, 889] | 240.8861 | 0.2960 | 99.877% faster |

Across the nine cases above, the optimized operator is consistently faster, and the more tensor elements there are to expand, the larger the speedup; on case 8 the optimized runtime even drops to roughly 1/814 of the baseline (240.8861 ms → 0.2960 ms).
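One way such a row could be timed is the minimal stand-alone sketch below, assuming a CUDA build of Paddle (this is not the official op-benchmark harness, so absolute numbers will differ from the table):

```python
import time
import paddle

x = paddle.rand([5, 1, 1], dtype="float32")      # case 4 src_shape
y = paddle.rand([5, 128, 128], dtype="float32")  # case 4 dst_shape

# Warm up so one-time initialization is excluded, then synchronize the
# device around the timed region so kernels are measured, not just enqueued.
for _ in range(10):
    paddle.expand_as(x, y)
paddle.device.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    paddle.expand_as(x, y)
paddle.device.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 100 * 1000:.4f} ms per call")
```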

@paddle-bot (bot) commented Mar 11, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

"expand_as_v2 op must be less than or equal to %d.",
target_rank,
MAX_RANK_SUPPORTED));

A reviewer (Contributor) commented:
Remove these two checks; the corresponding validation logic already exists in ExpandAsInferMeta.

@Timber-Ye (Author) replied:
Done.

@JamesLim-sy (Contributor) commented Mar 15, 2023

Good work!
Since this is an IO-bound computation, please register the phi::dtype::float16 type in

```cpp
PD_REGISTER_KERNEL(expand_as,  // opening line reconstructed from context
                   GPU,
                   ALL_LAYOUT,
                   phi::ExpandAsKernel,
                   float,
                   double,
                   int,
                   int64_t,
                   bool) {}
```

add the corresponding performance test data to the Describe comment, and also add the fp16 performance numbers to the table in the PaddlePaddle/community#414 design document.

@Timber-Ye (Author) commented Mar 17, 2023

> Good work! Since this is an IO-bound computation, please register the phi::dtype::float16 type in the kernel registration, add the corresponding performance test data to the Describe comment, and add the fp16 performance numbers to the table in PaddlePaddle/community#414…

@JamesLim-sy @luotao1 The fp16 data type has been registered for the expand_as operator, and the corresponding performance test data has been added above. We also tried adding an fp16 test case (TestExpandAsOpRank4FP16), but it currently fails the CI tests. Could we ask for your advice?
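For illustration, a hypothetical fp16 sanity check (this is not the PR's TestExpandAsOpRank4FP16, whose source is not shown here; it assumes the fp16 kernel is registered) comparing float16 output against a float32 reference on GPU:

```python
import numpy as np
import paddle

x32 = paddle.rand([32, 807, 1], dtype="float32")     # case 2 src_shape
y16 = paddle.zeros([32, 807, 807], dtype="float16")  # case 2 dst_shape

out16 = paddle.expand_as(x32.astype("float16"), y16)
ref32 = paddle.expand_as(x32, y16.astype("float32"))

# expand_as only replicates values, so the fp16 output should match the
# fp32 reference up to the precision lost in the initial fp16 cast.
np.testing.assert_allclose(
    out16.astype("float32").numpy(), ref32.numpy(), rtol=1e-3, atol=1e-3)
print("fp16 expand_as matches fp32 reference")
```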

@Timber-Ye (Author) commented:
@luotao1 After registering fp16 the unit tests fail; could you please take a look?

@JamesLim-sy (Contributor) commented:

> @luotao1 After registering fp16 the unit tests fail; could you please take a look?

Paddle ships with a built-in unit-test system. Add the build option -DWITH_TESTING=ON when running cmake; after building and installing, run

```
ctest -V -R test_expand_as
```

to check the unit-test accuracy.

Timber-Ye and others added 4 commits March 28, 2023 20:27
Co-authored-by: Hanchiao <ye_hanqiao@163.com>
Co-authored-by: BrianQian1999 <brianqianhitsz@gmail.com>
@paddle-bot (bot) commented Mar 28, 2023

Sorry to inform you that after our discussion, your PR does not yet meet the merging standard (reference: Paddle Custom Operator Design Doc). Please read the Paddle native operator development guidelines; you may submit a new PR, and we are closing this one for now. Thank you for your contribution.
