
improve performance of DepthwiseConv(NHWC) #31677

Merged 11 commits into PaddlePaddle:develop on Apr 7, 2021

Conversation

OuyangChao (Contributor) commented Mar 16, 2021

PR types

Performance optimization

PR changes

OPs

Describe

improve performance of DepthwiseConv(NHWC)

Forward of DepthwiseConv(NHWC)

```python
import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1,
                 groups=1024, data_format='NHWC')
y_var = conv(x_var)
```
  • Before: Input transpose + NCHW kernel + Output transpose
  • This PR: Filter transpose + NHWC kernel
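
The before/after bullets above suggest a rough cost intuition: the old path copies every activation element twice (input to NCHW, output back to NHWC), while the new path copies only the small filter. A hedged back-of-the-envelope sketch (the counting model and variable names are mine, not from the PR):

```python
# Hedged sketch (my own element-count model, not a measurement from the PR),
# using the first benchmark case below.
N, H, W, C = 8, 64, 64, 1024   # input shape, NHWC
KH, KW = 3, 3                  # filter spatial size

# Before: transpose the whole input to NCHW, run the NCHW kernel, then
# transpose the whole output back to NHWC -> two activation-sized copies.
before_moved = 2 * N * H * W * C

# This PR: transpose only the filter from CHW to HWC -> one tiny copy.
after_moved = C * KH * KW

assert before_moved // after_moved == 7281  # thousands of times fewer elements moved
```

This counts only the transpose traffic that the PR removes; it says nothing about the relative speed of the NCHW and NHWC kernels themselves.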

Tested on a GeForce GTX Titan X:

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0 | (8, 64, 64, 1024) | (1024, 3, 3) | 1 | 1 | 1 | 1024 | 4.52 ms | 2.00 ms | +55.75% |
| 1 | (8, 64, 64, 2048) | (2048, 3, 3) | 1 | 1 | 1 | 2048 | 9.21 ms | 4.09 ms | +55.59% |
| 2 | (8, 64, 64, 1024) | (2048, 3, 3) | 1 | 1 | 1 | 1024 | 9.41 ms | 3.65 ms | +61.21% |
| 3 | (8, 64, 64, 1024) | (1024, 3, 3) | 2 | 1 | 1 | 1024 | 2.72 ms | 0.86 ms | +68.38% |
| 4 | (8, 64, 64, 1024) | (1024, 5, 5) | 1 | 1 | 1 | 1024 | 14.47 ms | 7.24 ms | +49.97% |
| 5 | (8, 256, 256, 64) | (64, 3, 3) | 1 | 1 | 1 | 64 | 4.51 ms | 2.09 ms | +53.66% |
| 6 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 12 | 12 | 2048 | 17.44 ms | 10.65 ms | +38.93% |
| 7 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 24 | 24 | 2048 | 17.02 ms | 9.27 ms | +45.53% |
| 8 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 36 | 36 | 2048 | 15.91 ms | 7.05 ms | +55.69% |

Backward of DepthwiseConv(NHWC)

```python
import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
x_var.stop_gradient = False
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1,
                 groups=1024, data_format='NHWC')
y_var = conv(x_var)
paddle.grad(y_var, x_var)
```
  • Before: Input transpose + NCHW kernel + Output transpose
  • This PR: Filter transpose + NHWC kernel

Tested on a GeForce GTX Titan X:

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0 | (8, 64, 64, 1024) | (1024, 3, 3) | 1 | 1 | 1 | 1024 | 9.11 ms | 6.03 ms | +33.81% |
| 1 | (8, 64, 64, 2048) | (2048, 3, 3) | 1 | 1 | 1 | 2048 | 18.48 ms | 11.90 ms | +35.61% |
| 2 | (8, 64, 64, 1024) | (2048, 3, 3) | 1 | 1 | 1 | 1024 | 29.35 ms | 14.29 ms | +51.31% |
| 3 | (8, 64, 64, 1024) | (1024, 3, 3) | 2 | 1 | 1 | 1024 | 5.74 ms | 3.33 ms | +41.99% |
| 4 | (8, 64, 64, 1024) | (1024, 5, 5) | 1 | 1 | 1 | 1024 | 21.47 ms | 18.20 ms | +15.23% |
| 5 | (8, 256, 256, 64) | (64, 3, 3) | 1 | 1 | 1 | 64 | 8.81 ms | 6.33 ms | +28.15% |
| 6 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 12 | 12 | 2048 | 34.35 ms | 21.86 ms | +36.36% |
| 7 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 24 | 24 | 2048 | 33.46 ms | 19.09 ms | +42.95% |
| 8 | (8, 64, 128, 2048) | (2048, 3, 3) | 1 | 36 | 36 | 2048 | 31.62 ms | 17.00 ms | +46.24% |

paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

CLAassistant commented Mar 16, 2021:

CLA assistant check: all committers have signed the CLA.

OuyangChao changed the title from "improve performance of DepthwiseConv(NWHC)" to "improve performance of DepthwiseConv(NHWC)" on Mar 18, 2021
zhangting2020 (Contributor) commented Mar 31, 2021:

Tested the above cases on a V100:

  • before:

    • forward: KernelDepthwiseConvSp + 2 * TilingSwapDim1And2
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TilingSwapDim1And2
      [profiler screenshot]
  • after:

    • forward: KernelDepthwiseConvSp + TransposeNormalKernel
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TransposeNormalKernel
      [profiler screenshot]

```diff
@@ -142,13 +141,14 @@ __device__ __inline__ void KernelDepthwiseConvNHWC(
     for (int w_in = w_in_start; w_in < w_in_end; w_in += dilate_width) {
       if (h_in >= h_start && h_in < h_end && w_in >= w_start && w_in < w_end) {
         int offset = ((batch * input_height + h_in) * input_width + w_in) *
-                         output_channels +
+                         input_channels +
```
Contributor:

The original code here seems to cause an error when input_channels is not equal to output_channels. We will add a case to the unit tests.

OuyangChao (Contributor, Author):

Yes, it should be input_channels here.
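
A minimal sketch of the bug discussed above (the tiny shapes are hypothetical, chosen only for illustration): in NHWC layout the input offset must stride by input_channels, so striding by output_channels reads out of bounds whenever the channel multiplier is greater than 1.

```python
# Hedged sketch (hypothetical tiny shapes, not from the PR): the NHWC input
# offset must stride by input_channels. When output_channels is a multiple
# of input_channels (channel multiplier > 1), striding by output_channels
# indexes past the end of the input tensor.
input_channels, output_channels = 3, 6   # channel multiplier 2
input_height, input_width = 2, 2
batch, h_in, w_in, c_in = 0, 1, 1, 2     # last valid input element

total = input_height * input_width * input_channels  # input elements per batch

ok = ((batch * input_height + h_in) * input_width + w_in) * input_channels + c_in
bad = ((batch * input_height + h_in) * input_width + w_in) * output_channels + c_in

assert ok == 11 and ok < total      # correct stride stays in bounds
assert bad == 20 and bad >= total   # wrong stride runs past the input
```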

```diff
       } else {
-        value += weight[weight_offset] * in_data;
+        value += weight[0] * in_data;
```
Contributor:

Could you describe why this change was made?

OuyangChao (Contributor, Author) commented Apr 1, 2021:

To improve gld_efficiency, filter_data is transposed from CHW to HWC in this PR. So the weight pointer at (h_f, w_f, c_out) becomes const T* weight = filter_data + weight_offset * output_channels + c_out, where weight_offset equals h_f * filter_width + w_f.
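
The indexing described in the reply above can be checked with a small NumPy sketch (toy shapes; NumPy's transpose stands in for the in-kernel CHW-to-HWC filter transpose): after the transpose, consecutive threads reading consecutive c_out values touch consecutive memory addresses, which is what improves gld_efficiency.

```python
import numpy as np

# Hedged sketch (toy shapes, not from the PR): verify that after a
# CHW -> HWC filter transpose, the flattened index
#   weight_offset * output_channels + c_out, weight_offset = h_f * filter_width + w_f
# reads the same filter element as filter_chw[c_out, h_f, w_f].
output_channels, filter_height, filter_width = 4, 3, 3
filter_chw = np.arange(output_channels * filter_height * filter_width,
                       dtype=np.float32).reshape(output_channels,
                                                 filter_height, filter_width)
filter_hwc = filter_chw.transpose(1, 2, 0).ravel()  # CHW -> HWC, flattened

checked = 0
for c_out in range(output_channels):
    for h_f in range(filter_height):
        for w_f in range(filter_width):
            weight_offset = h_f * filter_width + w_f
            assert filter_hwc[weight_offset * output_channels + c_out] \
                   == filter_chw[c_out, h_f, w_f]
            checked += 1
```

Note that adjacent c_out values land adjacent in filter_hwc, so a warp of threads indexed by c_out reads a contiguous, coalescable span.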

luotao1 (Contributor) left a review:

LGTM

zhangting2020 merged commit 363b25a into PaddlePaddle:develop on Apr 7, 2021