
Optimization of pool2d grad #35389

Merged
merged 19 commits into PaddlePaddle:develop on Sep 19, 2021

Conversation

@JamesLim-sy (Contributor) commented Sep 2, 2021

PR types

Performance optimization

PR changes

OPs

Describe

  • Feature:

    • Replace the div and mod operations with the fast_divmod operation (see the sketch after this list).
    • Split the AvgPool2dGrad and MaxPool2dGrad functors with template specialization, to avoid the useless operations below:
      • Useless IO operations in AvgPool2dGrad
      • Useless inverse operations in MaxPool2dGrad
  • Performance (taking resnet50 as an example):

    • Before optimization, the original execution time is about 1273us.
    • After implementing fast_divmod, the AvgPool2dGrad execution time shrinks from 1273us to 600us.
    • After erasing the useless IO in the kernel, the AvgPool2dGrad execution time shrinks from 600us to 490us.
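
For background, the fast_divmod trick replaces runtime integer division and modulo, which are expensive on GPUs, with a multiply and shift against constants precomputed on the host. The struct below is a minimal, self-contained sketch of that idea; the name FastDivModSketch and its members are illustrative and not the exact Paddle API, and dividends are assumed to stay below 2^31 (which holds for the tensor indices in these kernels).

#include <cassert>
#include <cstdint>

struct FastDivModSketch {
  uint32_t divisor, shift, multiplier;

  explicit FastDivModSketch(uint32_t d) : divisor(d) {
    assert(d >= 1);
    // Smallest shift with 2^shift >= d.
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;
    }
    uint64_t pow2 = uint64_t{1} << shift;
    multiplier =
        static_cast<uint32_t>(((uint64_t{1} << 32) * (pow2 - d)) / d + 1);
  }

  // Quotient via the high 32 bits of a 32x32->64 multiply; in a CUDA kernel
  // the high multiply would typically be __umulhi(n, multiplier).
  uint32_t Div(uint32_t n) const {
    uint32_t hi = static_cast<uint32_t>((uint64_t{n} * multiplier) >> 32);
    return (hi + n) >> shift;
  }

  // One pass yields both quotient and remainder, replacing '/' and '%'.
  void Divmod(uint32_t n, uint32_t* q, uint32_t* r) const {
    *q = Div(n);
    *r = n - *q * divisor;
  }
};

Because divisors such as output_width, output_height, and channels are fixed for a given launch, the constructor runs once on the host and each thread only pays for the cheap multiply/shift.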

paddle-bot-old (bot) commented Sep 2, 2021

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

auto channel_divmod = divmods.channel.Divmod(input_height_divmod.val[0]);
w_offset = input_width_divmod.val[1] + padding_width;
h_offset = input_height_divmod.val[1] + padding_height;
offsetC = channel_divmod.val[1];
Contributor:

This naming is not consistent with the other variables.

Contributor Author:

The naming here will be changed to channel_offset.

namespace paddle {
namespace operators {
namespace math {

struct FastDivModOfPool {
Contributor:

Generally, only concepts like a thread pool are called XxxPool; this name is not quite appropriate.

Contributor Author:

Will rename it to FastDivModForPoolingGrad as suggested.

#include "paddle/fluid/platform/gpu_launch_config.h"

#ifdef __HIPCC__
#define BLOCK_SIZE 256
Contributor:

Add a qualifier to the macro name, something like POOL_BLOCK_SIZE.

Contributor Author:

Will modify as suggested.


inline DEVICE void ParameterUpdate(int tid, int output_stride) {
  input = input_data[tid];
  output_data += output_stride;
Contributor:

Isn't output_data defined as const? Can it still be modified with +=? Also, this approach does not feel very safe.

Contributor Author:

The purpose of += here is to advance the pointer address that output_data refers to.

Contributor:

It feels like this type forcibly wraps things just to avoid the memory accesses of input_data and output_data; by itself it does not have complete semantics or interpretability.

};

template <typename T, typename PoolProcess, typename Enable = void>
struct PoolingFunctor {
Contributor:

This functor looks like it is used for the backward computation, and the name does not convey its actual meaning. Also, what is each member function for?

Contributor Author:

Will rename it to PoolingGradProcess as suggested, and add comments for the internal member methods.

}

inline HOSTDEVICE void operator()(const T* __restrict__ output_grad,
                                  T* __restrict__ gradient, int pool_size,
Contributor:

Which gradient is gradient? This function does not look like a typical operator(); replacing it with a concrete function name would be more appropriate.

Contributor Author:

gradient will be renamed to input_grad_data, and operator() will be renamed to Compute.

template <typename T, typename PoolProcess>
struct PoolingFunctor<T, PoolProcess,
                      typename std::enable_if<std::is_same<
                          PoolProcess, math::AvgPoolGrad<T>>::value>::type> {
Contributor:

PoolProcess will not actually be used inside the kernel anymore, right? This PoolProcess is only used to distinguish the Avg and Max pooling definitions and does not take part in the computation, so it seems unnecessary.

Contributor Author:

Right, the PoolProcess inside the kernel can be removed.
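
For illustration, the specialization quoted above relies on std::enable_if selecting a different struct body when PoolProcess is the average-pool grad type, so the kernel can drop work at compile time. A stripped-down sketch with hypothetical names (not the PR's actual code):

#include <type_traits>

template <typename T> struct AvgPoolGradTag {};
template <typename T> struct MaxPoolGradTag {};

// Primary template: used for MaxPool2dGrad, which must read the forward
// output to test x == y.
template <typename T, typename PoolProcess, typename Enable = void>
struct PoolingGradSketch {
  static constexpr bool kNeedsForwardOutput = true;
};

// Specialization picked by SFINAE when PoolProcess is the average-pool grad
// tag; it skips the forward-output read entirely.
template <typename T, typename PoolProcess>
struct PoolingGradSketch<
    T, PoolProcess,
    typename std::enable_if<
        std::is_same<PoolProcess, AvgPoolGradTag<T>>::value>::type> {
  static constexpr bool kNeedsForwardOutput = false;
};

static_assert(PoolingGradSketch<float, MaxPoolGradTag<float>>::kNeedsForwardOutput,
              "max-pool grad reads the forward output");
static_assert(!PoolingGradSketch<float, AvgPoolGradTag<float>>::kNeedsForwardOutput,
              "avg-pool grad does not");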

PoolProcess pool_process, bool exclusive, bool adaptive, T* input_grad,
bool channel_last = false) {
const int nthreads, const T* __restrict__ output_grad,
const int output_height, const int output_width, const int input_width,
Contributor:

Unify the order of height and width in the parameter list.

Contributor Author:

Will modify as suggested.

int w_offset, h_offset, c_offset;
int phstart, phend, pwstart, pwend;
int output_stride;

if (!channel_last) { /* NCHW */
Contributor:

The index calculations for NHWC and NCHW seem like they might be fairly common? They could be encapsulated, for example by defining an IndexCalculator4d and providing some basic calculation functions for NHWC and NCHW.

Contributor Author:

Will encapsulate it as suggested.
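
As a rough sketch of that suggestion (the name IndexCalculator4dSketch and its interface are hypothetical, and plain '/' and '%' are used for clarity where the PR uses fast_divmod):

// Decomposes a flat element index into (n, c, h, w) for either data layout.
struct IndexCalculator4dSketch {
  int channels, height, width;

  void DecomposeNCHW(int index, int* n, int* c, int* h, int* w) const {
    *w = index % width;
    *h = (index / width) % height;
    *c = (index / (width * height)) % channels;
    *n = index / (width * height * channels);
  }

  void DecomposeNHWC(int index, int* n, int* c, int* h, int* w) const {
    *c = index % channels;
    *w = (index / channels) % width;
    *h = (index / (channels * width)) % height;
    *n = index / (channels * width * height);
  }
};

With such a helper, the channel_last branch collapses into a single call at the top of the kernel and the rest of the code works with the recovered offsets.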

auto pool_divmod =
    FastDivModOfPool(input_channels, input_width, input_height, ksize_width,
                     ksize_height, stride_width, stride_height);
auto pool_functor = PoolingFunctor<T, PoolProcess>(input_data, output_data);
Xreki (Contributor) commented Sep 12, 2021:

Is this functor introduced to reduce IO? It feels like it could be reworked based on the original PoolProcess.

JamesLim-sy (Contributor Author) commented Sep 12, 2021:

I tried modifying the original PoolProcess, but it seemed to require adding too many members to the class, so I switched to an implementation specialized for the CUDA computation. The original PoolProcess is the class from the linked code:

template <class T>
class MaxPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += dy * static_cast<T>(x == y);
  }
};

template <class T>
class AvgPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += (scale * dy);
  }
};

The class has to support both CPU and CUDA computation, and the CPU logic differs quite a bit from the CUDA logic. In the CPU computation, input_data is read first and the offset of the output_data pointer is applied afterwards, and the output_data pointer has to keep advancing along with the loop, as follows:
float scale = 1.0 / pool_size;
for (int d = dstart; d < dend; ++d) {
  for (int h = hstart; h < hend; ++h) {
    for (int w = wstart; w < wend; ++w) {
      int input_idx = (d * input_height + h) * input_width + w;
      int output_idx =
          (pd * output_height + ph) * output_width + pw;
      pool_grad_process.compute(
          input_data[input_idx], output_data[output_idx],
          output_grad_data[output_idx], static_cast<T>(scale),
          input_grad_data + input_idx);
    }
  }
}
}
}
}
input_data += input_stride;
output_data += output_stride;
input_grad_data += input_stride;
output_grad_data += output_stride;

This differs from the CUDA logic, where the data read and the pointer offset are done only once, as follows:
output_data += output_stride;
output_grad += output_stride;
for (int ph = phstart; ph < phend; ++ph) {
  for (int pw = pwstart; pw < pwend; ++pw) {
    int pool_size;
    if (adaptive) {
      pool_size = static_cast<int>(ceil(static_cast<double>(input_height) /
                                        ksize_height)) *
                  static_cast<int>(
                      ceil(static_cast<double>(input_width) / ksize_width));
    } else {
      int hstart = ph * stride_height - padding_height;
      int wstart = pw * stride_width - padding_width;
      int hend = min(hstart + ksize_height, input_height);
      int wend = min(wstart + ksize_width, input_width);
      hstart = max(hstart, 0);
      wstart = max(wstart, 0);
      pool_size = exclusive ? (hend - hstart) * (wend - wstart)
                            : ksize_height * ksize_width;
    }
    int output_sub_idx = channel_last
                             ? (ph * output_width + pw) * channels + offsetC
                             : ph * output_width + pw;
    pool_process.compute(input, output_data[output_sub_idx],
                         output_grad[output_sub_idx],
                         static_cast<T>(1.0 / pool_size), &gradient);



batch_idx = index / channels / output_width / output_height;
}
int hstart, hend, wstart, wend;
int pw, ph, c, input_stride;
Contributor:

  • input_stride -> input_offset
  • What are pw and ph abbreviations for?

Contributor Author:

pw and ph stand for w_offset and h_offset respectively; this non-standard naming will be fixed in the next commit.

T input_grad_data = static_cast<T>(0);
int phstart, phend, pwstart, pwend;
int w_offset, h_offset, c_offset, output_stride;
ParamPreparationByDatalayout<>(index, channel_last, divmods, padding_width,
Contributor:

Function names should describe what the function does; isn't the purpose of this function to compute 4D coordinates?

T* dx) {
*dx += dy * static_cast<T>(x == y);
static constexpr bool use_x = true;
DEVICE inline void compute(const T& x, const T* y, const T* dy, int out_idx,
Contributor:

Why were y and dy changed to pointer types?

Contributor Author:

  • The main point is changing y to a pointer. If it were passed with the original data type, AvgPool would also have to read output_data[output_index] from global memory and pass it into the compute method, but AvgPool does not need output_data[output_index]; passing a pointer avoids that cost (see the sketch below).
  • Since dx is a pointer type, dy was changed to a pointer type accordingly.
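
A minimal sketch of that point, with illustrative names rather than the PR's final code: when y and dy are pointers, the max-pool grad functor dereferences them, while the average-pool grad functor never touches y, so the caller can skip loading output_data[output_index] from global memory in the AvgPool case. DEVICE here mirrors the macro used in the quoted code.

#ifdef __CUDACC__
#define DEVICE __host__ __device__
#else
#define DEVICE
#endif

template <class T>
struct MaxPoolGradPtrSketch {
  // Needs the forward output: reads *y to find which input produced the max.
  DEVICE void compute(const T& x, const T* y, const T* dy, T /*scale*/,
                      T* dx) const {
    *dx += (*dy) * static_cast<T>(x == *y);
  }
};

template <class T>
struct AvgPoolGradPtrSketch {
  // Never dereferences y, so the caller can pass the address without ever
  // issuing the global-memory load for output_data[output_index].
  DEVICE void compute(const T& /*x*/, const T* /*y*/, const T* dy, T scale,
                      T* dx) const {
    *dx += scale * (*dy);
  }
};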

Xreki (Contributor) left a comment:

LGTM

@JamesLim-sy JamesLim-sy merged commit 8668519 into PaddlePaddle:develop Sep 19, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021
* Optimization of pool2d grad, first commit.

* remove useless print codes

* refine codes

* refine codes

* seal more operation into template specialization

* fix template struct error in MaxPool2dGrad.

* Fix header including error

* refine code with comment

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation into funciton and make it common for other kernels

* polish code and erase useless template speicalization

* Rerun triger

* rerun trigger