【Hackathon No.32】Optimize the GPU performance of the expand_as forward & backward ops for Paddle #52684

Timber-Ye · 2023-04-08T10:07:19Z

PR types

Performance optimization

PR changes

OPs

Describe

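The current GPU implementation of the expand_as forward and backward operators in Paddle is composed from Eigen expressions and has no dedicated GPU kernel, so its performance is relatively poor. The goal is to implement high-performance GPU kernels that improve the GPU performance of the expand_as op in Paddle.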

Development environment

Device: Tesla V100-32G
CUDA 11.2, cuDNN v8.1.1

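Optimization method

【Operator performance optimization design document】

The forward pass of expand_as behaves like broadcasting, and the backward pass behaves like a sum reduction, so the operator is optimized by directly reusing Paddle's internal BroadcastKernel and ReduceKernel.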
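The broadcast/reduce correspondence can be illustrated with a small reference sketch. It is plain Python over flat row-major lists, not Paddle's BroadcastKernel or ReduceKernel, and every name in it (`expand_as_forward`, `expand_as_backward`, and the helpers) is hypothetical and chosen only for illustration. It assumes the usual broadcasting rule: the source shape is left-padded with 1s, and each source dimension must either equal the target dimension or be 1.

```python
from itertools import product
from math import prod


def _strides(shape):
    """Row-major strides for a dense tensor of the given shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return strides[::-1]


def _aligned(src_shape, dst_shape):
    """Left-pad src_shape with 1s so both shapes have the same rank."""
    padded = [1] * (len(dst_shape) - len(src_shape)) + list(src_shape)
    for s, d in zip(padded, dst_shape):
        if s != d and s != 1:
            raise ValueError(f"cannot expand {src_shape} to {dst_shape}")
    return padded


def _src_offset(idx, padded, src_strides):
    """Map a target index to a source offset; size-1 dims always use index 0."""
    return sum((i if s != 1 else 0) * st
               for i, s, st in zip(idx, padded, src_strides))


def expand_as_forward(x, src_shape, dst_shape):
    """Forward pass: every target element reads one source element (broadcast)."""
    padded = _aligned(src_shape, dst_shape)
    src_strides = _strides(padded)
    return [x[_src_offset(idx, padded, src_strides)]
            for idx in product(*(range(d) for d in dst_shape))]


def expand_as_backward(grad_out, src_shape, dst_shape):
    """Backward pass: gradients are summed over the broadcast dims (reduction)."""
    padded = _aligned(src_shape, dst_shape)
    src_strides = _strides(padded)
    grad_x = [0.0] * prod(padded)
    # itertools.product iterates with the last index fastest, i.e. row-major.
    for flat, idx in enumerate(product(*(range(d) for d in dst_shape))):
        grad_x[_src_offset(idx, padded, src_strides)] += grad_out[flat]
    return grad_x


# Expand shape [2, 1] to [2, 3], then push a gradient back through it.
print(expand_as_forward([1.0, 2.0], [2, 1], [2, 3]))
# [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
print(expand_as_backward([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 1], [2, 3]))
# [6.0, 15.0]
```

The forward loop is a pure gather and the backward loop is a scatter-add over the same index mapping, which is why a broadcast kernel and a sum-reduction kernel are natural building blocks for the two passes.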

Results

Performance of the optimized operator, Paddle (Optimized), compared with the pre-optimization Paddle (Baseline):

Case	Data type	src_shape	dst_shape	Baseline (ms)	Optimized (ms)	Diff
0	float16	[1785, 1]	[1785, 128]	0.1971	0.1166	40.835% faster
1	float16	[5, 1, 1]	[5, 128, 128]	3.0909	0.1269	95.895% faster
2	float16	[32, 807, 1]	[32, 807, 807]	1.3884	0.3940	71.620% faster
3	float32	[1785, 1]	[1785, 128]	0.2244	0.1150	48.760% faster
4	float32	[5, 1, 1]	[5, 128, 128]	3.6155	0.1179	96.738% faster
5	float32	[32, 807, 1]	[32, 807, 807]	1.4826	0.6428	56.646% faster
6	float64	[32, 1, 1]	[32, 807, 807]	288.7776	1.2293	99.570% faster
7	float64	[1, 1, 64, 5]	[64, 128, 64, 5]	3.1326	0.2746	91.645% faster
8	float64	[5, 1, 1, 1, 1]	[5, 1, 713, 1, 889]	240.8861	0.2960	99.877% faster

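Performance improves in all 9 cases above, and the more elements the expanded Tensor has, the more pronounced the improvement. In case 8, the optimized operator's time drops to about 1/814 of the baseline.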
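As a quick check on how the Diff column and the 1/814 figure relate to the timings, the sketch below recomputes both for the row with the largest gain (case 8 in the table). Reading Diff as the share of baseline time eliminated is an assumption inferred from the numbers, not something the PR states, and the function names are hypothetical.

```python
def time_reduction_percent(baseline_ms, optimized_ms):
    """Share of the baseline time that was eliminated, in percent."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0


def speedup_ratio(baseline_ms, optimized_ms):
    """How many times shorter the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# Case 8 timings copied from the table above.
baseline_ms, optimized_ms = 240.8861, 0.2960
print(f"{time_reduction_percent(baseline_ms, optimized_ms):.3f}%")  # 99.877%
print(f"{speedup_ratio(baseline_ms, optimized_ms):.1f}x")           # 813.8x
```

A 99.877% reduction and a roughly 814x speedup are therefore two views of the same measurement.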

Co-authored-by: Timber-Ye <ye_hanqiao@163.com> Co-authored-by: BrianQian1999 <brianqianhitsz@gmail.com>

paddle-bot · 2023-04-08T10:07:24Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


paddle-bot · 2023-04-08T14:27:50Z

Sorry to inform you that, after repeated discussion, your PR does not yet meet the merging standard (Reference: Paddle Custom Operator Design Doc). We are closing this PR for now; you can submit a new one. Thank you for your contribution.

Timber-Ye and others added 5 commits March 11, 2023 14:27

Implemented optimized kernel for OP-expand_as.

2930983

removed two checks for the input validation

837509a

the macro 'MAX_RANK_SUPPORTED' removed

05777cc

Merge branch 'PaddlePaddle:develop' into expand_as_perf

0f3e554

add fp16 support

b01766c


paddle-bot bot added contributor External developers status: proposed labels Apr 8, 2023

Support 0D Tensor input.

5718888


Timber-Ye closed this Apr 8, 2023

Timber-Ye deleted the expand_as_perf branch April 8, 2023 14:27

paddle-bot bot added status: not progressed and removed status: proposed labels Apr 8, 2023
