[Hackathon No.32] Optimize the GPU compute performance of the expand_as forward & backward ops for Paddle #51510
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
"expand_as_v2 op must be less than or equal to %d.", | ||
target_rank, | ||
MAX_RANK_SUPPORTED)); | ||
|
Remove these two checks; the corresponding validation logic already exists in ExpandAsInferMeta.
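For reference, the check in ExpandAsInferMeta that makes the kernel-side duplicate unnecessary looks roughly like the following. This is a sketch reconstructed from the error message quoted above, not the verbatim Paddle source:

```cpp
// Sketch of the rank validation ExpandAsInferMeta already performs, so the
// GPU kernel does not need to repeat it. Details may differ from the real
// Paddle source; reconstructed from the quoted error message.
PADDLE_ENFORCE_LE(
    target_shape.size(),
    MAX_RANK_SUPPORTED,
    phi::errors::InvalidArgument(
        "The rank of the input 'target_tensor' for expand_as_v2 op must be "
        "less than or equal to %d.",
        MAX_RANK_SUPPORTED));
```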
Done.
Good work! Please also register the phi::dtype::float16 type in Paddle/paddle/phi/kernels/gpu/expand_as_kernel.cu (lines 22 to 29 at 6bd5b7c), add the corresponding performance test data to the Describe section of this PR, and add the fp16 performance data to the table in the PaddlePaddle/community#414 document.
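A sketch of what the registration block in expand_as_kernel.cu would look like with float16 added; the exact dtype list in the file is an assumption, not verified against the referenced lines:

```cpp
// GPU registration of the expand_as kernel with phi::dtype::float16 appended.
// The surrounding dtype list is assumed from typical phi kernel registrations.
PD_REGISTER_KERNEL(expand_as,
                   GPU,
                   ALL_LAYOUT,
                   phi::ExpandAsKernel,
                   float,
                   double,
                   int,
                   int64_t,
                   phi::dtype::float16) {}
```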
@JamesLim-sy @luotao1 The fp16 data type has been registered for the expand_as op, and the corresponding performance test data has been added. We also tried adding test cases for fp16 (…)
@luotao1 The unit tests fail after registering fp16; could you please take a look?
Paddle has a built-in unit test system; please add the corresponding build option when running cmake, then check the unit test precision.
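The comment above elides the exact option; assuming it refers to Paddle's standard WITH_TESTING switch, the build-and-test flow would look roughly like this:

```bash
# Assumed flow: WITH_TESTING enables building Paddle's unit tests; the exact
# flag intended by the reviewer was elided in the comment above.
cmake .. -DWITH_GPU=ON -DWITH_TESTING=ON
make -j$(nproc)
ctest -R test_expand_as_v2_op -V
```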
Co-authored-by: Hanchiao <ye_hanqiao@163.com>
Co-authored-by: BrianQian1999 <brianqianhitsz@gmail.com>
We are sorry: after repeated discussion, this PR does not yet meet the merge criteria. Please read the PaddlePaddle native operator development specification; you are welcome to submit a new PR. We will close this PR for now. Thank you for your contribution.
PR types
Performance optimization
PR changes
OPs
Describe
Currently, the GPU implementation of the expand_as forward and backward ops in Paddle is composed from Eigen primitives and lacks a dedicated GPU kernel, so its performance is relatively poor. The goal is to implement a high-performance GPU compute kernel, optimizing the compute performance of the expand_as op on GPU for Paddle. [Operator performance optimization design document]
Since the forward pass of expand_as behaves like broadcasting and the backward pass behaves like a sum reduction, the expand_as op is optimized by directly reusing Paddle's internal BroadcastKernel and ReduceKernel (see the sketch at the end of this description). After the optimization, the performance of Paddle (Optimized) compared with the pre-optimization Paddle (Baseline) is as follows:
Across the nine cases above, the optimized op is consistently faster, and the speedup grows with the number of tensor elements to be expanded; on case 5 the optimized op's runtime drops to 1/814 of the baseline's.
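To illustrate the idea (this is not Paddle's actual BroadcastKernel/ReduceKernel, whose internal signatures are version-specific), here is a minimal self-contained CUDA sketch: the forward pass is a broadcast copy that maps each output index back to a source index, and the backward pass sums every output gradient into its source slot. A real reduction kernel would use a tree reduction rather than atomics.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Forward: expand x (n elements) to out (n * repeat elements) by broadcasting
// along the leading dimension: out[i] = x[i % n].
__global__ void ExpandAsFwd(const float* x, float* out, int n, int repeat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n * repeat) out[i] = x[i % n];
}

// Backward: reduce the expanded gradient back to x's shape by summation,
// dx[i % n] += dout[i]. Paddle's ReduceKernel would use a tree reduction;
// atomicAdd keeps this sketch short.
__global__ void ExpandAsBwd(const float* dout, float* dx, int n, int repeat) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n * repeat) atomicAdd(&dx[i % n], dout[i]);
}

int main() {
  const int n = 4, repeat = 3, m = n * repeat;
  float hx[n] = {1.f, 2.f, 3.f, 4.f}, hdx[n];
  float *d_x, *d_out, *d_dx;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMalloc(&d_out, m * sizeof(float));
  cudaMalloc(&d_dx, n * sizeof(float));
  cudaMemcpy(d_x, hx, n * sizeof(float), cudaMemcpyHostToDevice);
  ExpandAsFwd<<<1, 64>>>(d_x, d_out, n, repeat);   // out = expand(x)
  cudaMemset(d_dx, 0, n * sizeof(float));
  // Reuse the forward output as the incoming gradient for this demo.
  ExpandAsBwd<<<1, 64>>>(d_out, d_dx, n, repeat);  // dx = sum over copies
  cudaMemcpy(hdx, d_dx, n * sizeof(float), cudaMemcpyDeviceToHost);
  // Each dx[i] is `repeat` copies of x[i] summed: expect 3, 6, 9, 12.
  for (int i = 0; i < n; ++i) printf("dx[%d] = %.1f\n", i, hdx[i]);
  cudaFree(d_x); cudaFree(d_out); cudaFree(d_dx);
  return 0;
}
```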