
improve performance of DepthwiseConv(NHWC) #31677

Merged 11 commits into PaddlePaddle:develop on Apr 7, 2021

Conversation

OuyangChao (Contributor) commented Mar 16, 2021

PR types

Performance optimization

PR changes

OPs

Describe

improve performance of DepthwiseConv(NHWC)

Forward of DepthwiseConv(NHWC)

```python
import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1,
                 groups=1024, data_format='NHWC')
y_var = conv(x_var)
```
  • Before: Input transpose + NCHW kernel + Output transpose
  • This PR: Filter transpose + NHWC kernel
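
The before/after bullets above suggest a rough cost intuition: the old path copies every activation element twice (input to NCHW, output back to NHWC), while the new path copies only the small filter. A hedged back-of-the-envelope sketch (the counting model and variable names are mine, not from the PR):

```python
# Hedged sketch (my own element-count model, not a measurement from the PR),
# using the first benchmark case below.
N, H, W, C = 8, 64, 64, 1024   # input shape, NHWC
KH, KW = 3, 3                  # filter spatial size

# Before: transpose the whole input to NCHW, run the NCHW kernel, then
# transpose the whole output back to NHWC -> two activation-sized copies.
before_moved = 2 * N * H * W * C

# This PR: transpose only the filter from CHW to HWC -> one tiny copy.
after_moved = C * KH * KW

assert before_moved // after_moved == 7281  # thousands of times fewer elements moved
```

This counts only the transpose traffic that the PR removes; it says nothing about the relative speed of the NCHW and NHWC kernels themselves.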

Tested on a GeForce GTX Titan X:

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0 | (8, 64, 64, 1024) | (1024, 3, 3) | 1 | 1 | 1 | 1024 | 4.52 ms | 2.00 ms | +55.75% |
| 1 | (8, 64, 64, 2048) | (2048, 3, 3) | 1 | 1 | 1 | 2048 | 9.21 ms | 4.09 ms | +55.59% |
| 2 | (8, 64, 64, 1024) | (2048, 3, 3) | 1 | 1 | 1 | 1024 | 9.41 ms | 3.65 ms | +61.21% |
| 3 | (8, 64, 64, 1024) | (1024, 3, 3) | 2 | 1 | 1 | 1024 | 2.72 ms | 0.86 ms | +68.38% |
| 4 | (8, 64, 64, 1024) | (1024, 5, 5) | 1 | 1 | 1 | 1024 | 14.47 ms | 7.24 ms | +49.97% |
| 5 | (8, 256, 256, 64) | (64, 3, 3) | 1 | 1 | 1 | 64 | 4.51 ms | 2.09 ms | +53.66% |
| 6 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 12 | 12 | 2048 | 17.44 ms | 10.65 ms | +38.93% |
| 7 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 24 | 24 | 2048 | 17.02 ms | 9.27 ms | +45.53% |
| 8 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 36 | 36 | 2048 | 15.91 ms | 7.05 ms | +55.69% |

Backward of DepthwiseConv(NHWC)

```python
import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
x_var.stop_gradient = False
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1,
                 groups=1024, data_format='NHWC')
y_var = conv(x_var)
paddle.grad(y_var, x_var)
```
  • Before: Input transpose + NCHW kernel + Output transpose
  • This PR: Filter transpose + NHWC kernel

Tested on a GeForce GTX Titan X:

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0 | (8, 64, 64, 1024) | (1024, 3, 3) | 1 | 1 | 1 | 1024 | 9.11 ms | 6.03 ms | +33.81% |
| 1 | (8, 64, 64, 2048) | (2048, 3, 3) | 1 | 1 | 1 | 2048 | 18.48 ms | 11.90 ms | +35.61% |
| 2 | (8, 64, 64, 1024) | (2048, 3, 3) | 1 | 1 | 1 | 1024 | 29.35 ms | 14.29 ms | +51.31% |
| 3 | (8, 64, 64, 1024) | (1024, 3, 3) | 2 | 1 | 1 | 1024 | 5.74 ms | 3.33 ms | +41.99% |
| 4 | (8, 64, 64, 1024) | (1024, 5, 5) | 1 | 1 | 1 | 1024 | 21.47 ms | 18.20 ms | +15.23% |
| 5 | (8, 256, 256, 64) | (64, 3, 3) | 1 | 1 | 1 | 64 | 8.81 ms | 6.33 ms | +28.15% |
| 6 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 12 | 12 | 2048 | 34.35 ms | 21.86 ms | +36.36% |
| 7 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 24 | 24 | 2048 | 33.46 ms | 19.09 ms | +42.95% |
| 8 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 36 | 36 | 2048 | 31.62 ms | 17.00 ms | +46.24% |

paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

CLAassistant commented Mar 16, 2021:

CLA assistant check: all committers have signed the CLA.

OuyangChao changed the title from "improve performance of DepthwiseConv(NWHC)" to "improve performance of DepthwiseConv(NHWC)" on Mar 18, 2021
zhangting2020 (Contributor) commented Mar 31, 2021:

Tested the above cases on a V100:

  • before:

    • forward: KernelDepthwiseConvSp + 2 * TilingSwapDim1And2
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TilingSwapDim1And2
      [profiler screenshot]
  • after:

    • forward: KernelDepthwiseConvSp + TransposeNormalKernel
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TransposeNormalKernel
      [profiler screenshot]

```diff
@@ -142,13 +141,14 @@ __device__ __inline__ void KernelDepthwiseConvNHWC(
     for (int w_in = w_in_start; w_in < w_in_end; w_in += dilate_width) {
       if (h_in >= h_start && h_in < h_end && w_in >= w_start && w_in < w_end) {
         int offset = ((batch * input_height + h_in) * input_width + w_in) *
-                         output_channels +
+                         input_channels +
```
Contributor:

The original code here seems to cause an error when input_channels is not equal to output_channels. We will add a case to the unit tests.

OuyangChao (Contributor, Author):

Yes, it should be input_channels here.
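
A minimal sketch of the bug discussed above (the tiny shapes are hypothetical, chosen only for illustration): in NHWC layout the input offset must stride by input_channels, so striding by output_channels reads out of bounds whenever the channel multiplier is greater than 1.

```python
# Hedged sketch (hypothetical tiny shapes, not from the PR): the NHWC input
# offset must stride by input_channels. When output_channels is a multiple
# of input_channels (channel multiplier > 1), striding by output_channels
# indexes past the end of the input tensor.
input_channels, output_channels = 3, 6   # channel multiplier 2
input_height, input_width = 2, 2
batch, h_in, w_in, c_in = 0, 1, 1, 2     # last valid input element

total = input_height * input_width * input_channels  # input elements per batch

ok = ((batch * input_height + h_in) * input_width + w_in) * input_channels + c_in
bad = ((batch * input_height + h_in) * input_width + w_in) * output_channels + c_in

assert ok == 11 and ok < total      # correct stride stays in bounds
assert bad == 20 and bad >= total   # wrong stride runs past the input
```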

```diff
       } else {
-        value += weight[weight_offset] * in_data;
+        value += weight[0] * in_data;
```
Contributor:

Could you describe why this change was made?

OuyangChao (Contributor, Author) commented Apr 1, 2021:

To improve gld_efficiency, filter_data is transposed from CHW to HWC in this PR. So the weight pointer at (h_f, w_f, c_out) becomes const T* weight = filter_data + weight_offset * output_channels + c_out, where weight_offset equals h_f * filter_width + w_f.
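
The indexing described in the reply above can be checked with a small NumPy sketch (toy shapes; NumPy's transpose stands in for the in-kernel CHW-to-HWC filter transpose): after the transpose, consecutive threads reading consecutive c_out values touch consecutive memory addresses, which is what improves gld_efficiency.

```python
import numpy as np

# Hedged sketch (toy shapes, not from the PR): verify that after a
# CHW -> HWC filter transpose, the flattened index
#   weight_offset * output_channels + c_out, weight_offset = h_f * filter_width + w_f
# reads the same filter element as filter_chw[c_out, h_f, w_f].
output_channels, filter_height, filter_width = 4, 3, 3
filter_chw = np.arange(output_channels * filter_height * filter_width,
                       dtype=np.float32).reshape(output_channels,
                                                 filter_height, filter_width)
filter_hwc = filter_chw.transpose(1, 2, 0).ravel()  # CHW -> HWC, flattened

checked = 0
for c_out in range(output_channels):
    for h_f in range(filter_height):
        for w_f in range(filter_width):
            weight_offset = h_f * filter_width + w_f
            assert filter_hwc[weight_offset * output_channels + c_out] \
                   == filter_chw[c_out, h_f, w_f]
            checked += 1
```

Note that adjacent c_out values land adjacent in filter_hwc, so a warp of threads indexed by c_out reads a contiguous, coalescable span.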

luotao1 (Contributor) left a review:

LGTM

zhangting2020 merged commit 363b25a into PaddlePaddle:develop on Apr 7, 2021