
Optimize layer norm forward when cols is 1024. #39167

Merged (11 commits) · Jan 26, 2022

Conversation

@limin2021 (Contributor) commented on Jan 24, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the performance of the layer_norm forward kernel used by the layer_norm op and the fused_dropout_residual_layer_norm op.

Performance results (kernel times collected with nsys):
(1) layer_norm op:

Time (ns):

| batch_size*seq_len | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| apex_fast_layer_norm | 74399.931 | 41035.3163 | 20555.8185 | 11111.16 | 8395.97375 | 6855.50875 | 6275.20375 | 5652.597 | 5469.5585 | 5408.00875 |
| paddle | 367119.365 | 187225.566 | 101543.132 | 54855.2878 | 29603.3463 | 18901.4535 | 13006.04 | 9529.146 | 8568.81325 | 8007.138 |
| paddle_opt | 74155.0323 | 40487.696 | 21343.357 | 11008.9173 | 8566.515 | 6866.36975 | 6272.02075 | 5826.07175 | 5712.10125 | 5598.60925 |

Speedup:

| batch_size*seq_len | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| apex/paddle | 0.20265869 | 0.21917582 | 0.20243436 | 0.20255404 | 0.2836157 | 0.36269744 | 0.48248381 | 0.5931903 | 0.63830992 | 0.67539847 |
| apex/paddle_opt | 1.00330252 | 1.0135256 | 0.96310147 | 1.00928727 | 0.98009211 | 0.99841823 | 1.00050749 | 0.97022441 | 0.95753879 | 0.96595574 |

Conclusion: after the optimization the kernel is 2-5x faster than before, and is essentially on par with the competing implementation. In the few cases where it is slightly slower than the competitor, the reason is that when computing scale * (x - mean) / var + bias, the competitor does the whole computation in fp16, whereas Paddle converts the data to fp32 and computes in fp32, which carries some extra cost compared with the all-fp16 path.
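For reference, the fp32 path described above looks roughly like the sketch below. This is illustrative only, not the actual kernel code merged in this PR; the helper name and the use of an inverse standard deviation rather than a raw variance are assumptions.

```cpp
#include <cuda_fp16.h>

// Illustrative sketch of the fp32 compute path described above: the fp16 input
// is converted to fp32, scale * (x - mean) * inv_var + bias is computed in
// fp32, and the result is converted back to fp16. The extra conversions are
// the overhead mentioned above relative to an all-fp16 implementation.
__device__ __forceinline__ __half NormalizeOneElement(__half x, float mean,
                                                       float inv_var,
                                                       float scale,
                                                       float bias) {
  float x_f32 = __half2float(x);                          // fp16 -> fp32
  float y_f32 = scale * (x_f32 - mean) * inv_var + bias;  // compute in fp32
  return __float2half(y_f32);                             // fp32 -> fp16
}
```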

(2) fused_dropout_residual_layer_norm op:

Time (ns):

| batch_size*seq_len | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv-mlperf-1.1 | 193148.0253 | 101705.2163 | 46974.2088 | 31105.4253 | 17486.40775 | 14277.5693 | 12606.103 | 12650.391 | 10847.596 | 10536.81 |
| paddle | 180305.741 | 93799.7805 | 51218.6403 | 31803.7753 | 21651.55675 | 16031.1558 | 13858.423 | 12667.184 | 12474.513 | 12245.71 |
| paddle-opt | 154923.7505 | 80434.30725 | 41667.2393 | 25594.6238 | 17437.93475 | 12359.7128 | 10222.886 | 9237.2238 | 9086.0273 | 8898.729 |

Speedup:

| batch_size*seq_len | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv/paddle | 1.071225044 | 1.084279896 | 0.91713112 | 0.97804192 | 0.807628197 | 0.89061384 | 0.9096348 | 0.9986743 | 0.8695807 | 0.860449 |
| nv/paddle-opt | 1.246729598 | 1.264450702 | 1.12736552 | 1.2153109 | 1.002779744 | 1.15516999 | 1.2331257 | 1.3695014 | 1.1938767 | 1.184081 |

Conclusion: after the optimization, the fused_dropout_residual_layer_norm module is roughly 10%-20% faster than the competing implementation.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@@ -19,6 +19,9 @@ limitations under the License. */
namespace paddle {
namespace operators {

#define DIVUP(x, y) (((x) + ((y)-1)) / (y))
Collaborator:
I prefer using a function instead of a macro.

Contributor Author:

Done, by using the std::ceil function.
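For illustration, replacing the DIVUP macro with a function could look like the sketch below; the names are placeholders, not necessarily what was merged. The first version follows the std::ceil route mentioned in the reply above, the second is a pure-integer equivalent.

```cpp
#include <cmath>

// std::ceil-based ceiling division, as described in the reply above.
static inline int CeilDiv(int x, int y) {
  return static_cast<int>(std::ceil(static_cast<float>(x) / y));
}

// Pure-integer alternative with the same result for positive operands,
// avoiding the float round trip.
constexpr int CeilDivInt(int x, int y) { return (x + y - 1) / y; }
```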

@@ -19,6 +19,9 @@ limitations under the License. */
namespace paddle {
namespace operators {

#define DIVUP(x, y) (((x) + ((y)-1)) / (y))
#define COLS_ 1024
Collaborator:
How about giving this a more meaningful name? COLS_ is too terse to convey its exact meaning and can easily conflict with other macros.

Contributor Author:
Done.
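As an illustration, a more descriptive replacement for the COLS_ macro might look like the line below; the actual name chosen in the PR may differ.

```cpp
// Column count that this specialized forward kernel handles (was COLS_).
constexpr int kLayerNormFwdFastKernelCols = 1024;
```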

int ELTS_PER_ROW_PER_CTA = THREADS_PER_ROW * VecSize,
int LDGS = ELTS_PER_ROW / ELTS_PER_ROW_PER_CTA>
__global__ __launch_bounds__(THREADS_PER_CTA) void fused_ln_fwd_1024_kernel(
void *__restrict__ y_, void *__restrict__ residual_out_,
Collaborator:
The void * is too hard to read. Try to just write T *, U * or anything else.

Contributor Author:
Done.

void *__restrict__ y_, void *__restrict__ mean_out_,
void *__restrict__ var_out_, const void *__restrict__ x_,
const void *__restrict__ gamma_, const void *__restrict__ beta_,
const float epsilon, int rows, int cols) {
Collaborator:
Avoid using void *. Same as above.

Contributor Author:
Done.
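Putting the two excerpts above together, a typed version of the signature requested in these reviews might look like the following sketch. The template parameters are taken from the excerpts; T (input/output element type) and U (mean/variance type) are illustrative placeholders, and the exact merged signature may differ.

```cpp
template <typename T, typename U, int VecSize, int THREADS_PER_ROW,
          int THREADS_PER_CTA, int ELTS_PER_ROW,
          int ELTS_PER_ROW_PER_CTA = THREADS_PER_ROW * VecSize,
          int LDGS = ELTS_PER_ROW / ELTS_PER_ROW_PER_CTA>
__global__ __launch_bounds__(THREADS_PER_CTA) void fused_ln_fwd_1024_kernel(
    T *__restrict__ y_, T *__restrict__ residual_out_,
    U *__restrict__ mean_out_, U *__restrict__ var_out_,
    const T *__restrict__ x_, const T *__restrict__ gamma_,
    const T *__restrict__ beta_, const float epsilon, int rows, int cols);
```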

@sneaxiy (Collaborator) left a comment:

LGTM.
