
Optimization of pool2d grad #35389

Merged
merged 19 commits into PaddlePaddle:develop on Sep 19, 2021

Conversation

@JamesLim-sy (Contributor) commented Sep 2, 2021

PR types

Performance optimization

PR changes

OPs

Describe

  • Feature:

    • Replace the div and mod operations with the fast_divmod operation (see the sketch after this list).
    • Split the AvgPool2dGrad and MaxPool2dGrad functors with template specialization, to avoid the useless operations below:
      • Useless IO operations in AvgPool2dGrad
      • Useless inverse operations in MaxPool2dGrad
  • Performance (taking resnet50 as an example):

    • Before optimization, the original execution time is about 1273us.
    • After implementing fast_divmod, the AvgPool2dGrad execution time shrinks from 1273us to 600us.
    • After erasing the useless IO in the kernel, the AvgPool2dGrad execution time shrinks from 600us to 490us.
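
For background, the fast_divmod trick replaces runtime integer division and modulo, which are expensive on GPUs, with a multiply and shift against constants precomputed on the host. The struct below is a minimal, self-contained sketch of that idea; the name FastDivModSketch and its members are illustrative and not the exact Paddle API, and dividends are assumed to stay below 2^31 (which holds for the tensor indices in these kernels).

#include <cassert>
#include <cstdint>

struct FastDivModSketch {
  uint32_t divisor, shift, multiplier;

  explicit FastDivModSketch(uint32_t d) : divisor(d) {
    assert(d >= 1);
    // Smallest shift with 2^shift >= d.
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;
    }
    uint64_t pow2 = uint64_t{1} << shift;
    multiplier =
        static_cast<uint32_t>(((uint64_t{1} << 32) * (pow2 - d)) / d + 1);
  }

  // Quotient via the high 32 bits of a 32x32->64 multiply; in a CUDA kernel
  // the high multiply would typically be __umulhi(n, multiplier).
  uint32_t Div(uint32_t n) const {
    uint32_t hi = static_cast<uint32_t>((uint64_t{n} * multiplier) >> 32);
    return (hi + n) >> shift;
  }

  // One pass yields both quotient and remainder, replacing '/' and '%'.
  void Divmod(uint32_t n, uint32_t* q, uint32_t* r) const {
    *q = Div(n);
    *r = n - *q * divisor;
  }
};

Because divisors such as output_width, output_height, and channels are fixed for a given launch, the constructor runs once on the host and each thread only pays for the cheap multiply/shift.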

paddle-bot-old (bot) commented Sep 2, 2021

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

auto channel_divmod = divmods.channel.Divmod(input_height_divmod.val[0]);
w_offset = input_width_divmod.val[1] + padding_width;
h_offset = input_height_divmod.val[1] + padding_height;
offsetC = channel_divmod.val[1];
Contributor:

This naming is not consistent with the other variables.

Contributor Author:

The naming here will be changed to channel_offset.

namespace paddle {
namespace operators {
namespace math {

struct FastDivModOfPool {
Contributor:

Generally, only concepts like a thread pool are called XxxPool; this name is not quite appropriate.

Contributor Author:

Will rename it to FastDivModForPoolingGrad as suggested.

#include "paddle/fluid/platform/gpu_launch_config.h"

#ifdef __HIPCC__
#define BLOCK_SIZE 256
Contributor:

Add a qualifier to the macro name, something like POOL_BLOCK_SIZE.

Contributor Author:

Will modify as suggested.


inline DEVICE void ParameterUpdate(int tid, int output_stride) {
  input = input_data[tid];
  output_data += output_stride;
Contributor:

Isn't output_data defined as const? Can it still be modified with +=? Also, this approach does not feel very safe.

Contributor Author:

The purpose of += here is to advance the pointer address that output_data refers to.

Contributor:

It feels like this type forcibly wraps things just to avoid the memory accesses of input_data and output_data; by itself it does not have complete semantics or interpretability.

};

template <typename T, typename PoolProcess, typename Enable = void>
struct PoolingFunctor {
Contributor:

This functor looks like it is used for the backward computation, and the name does not convey its actual meaning. Also, what is each member function for?

Contributor Author:

Will rename it to PoolingGradProcess as suggested, and add comments for the internal member methods.

}

inline HOSTDEVICE void operator()(const T* __restrict__ output_grad,
                                  T* __restrict__ gradient, int pool_size,
Contributor:

Which gradient is gradient? This function does not look like a typical operator(); replacing it with a concrete function name would be more appropriate.

Contributor Author:

gradient will be renamed to input_grad_data, and operator() will be renamed to Compute.

template <typename T, typename PoolProcess>
struct PoolingFunctor<T, PoolProcess,
                      typename std::enable_if<std::is_same<
                          PoolProcess, math::AvgPoolGrad<T>>::value>::type> {
Contributor:

PoolProcess will not actually be used inside the kernel anymore, right? This PoolProcess is only used to distinguish the Avg and Max pooling definitions and does not take part in the computation, so it seems unnecessary.

Contributor Author:

Right, the PoolProcess inside the kernel can be removed.
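
For illustration, the specialization quoted above relies on std::enable_if selecting a different struct body when PoolProcess is the average-pool grad type, so the kernel can drop work at compile time. A stripped-down sketch with hypothetical names (not the PR's actual code):

#include <type_traits>

template <typename T> struct AvgPoolGradTag {};
template <typename T> struct MaxPoolGradTag {};

// Primary template: used for MaxPool2dGrad, which must read the forward
// output to test x == y.
template <typename T, typename PoolProcess, typename Enable = void>
struct PoolingGradSketch {
  static constexpr bool kNeedsForwardOutput = true;
};

// Specialization picked by SFINAE when PoolProcess is the average-pool grad
// tag; it skips the forward-output read entirely.
template <typename T, typename PoolProcess>
struct PoolingGradSketch<
    T, PoolProcess,
    typename std::enable_if<
        std::is_same<PoolProcess, AvgPoolGradTag<T>>::value>::type> {
  static constexpr bool kNeedsForwardOutput = false;
};

static_assert(PoolingGradSketch<float, MaxPoolGradTag<float>>::kNeedsForwardOutput,
              "max-pool grad reads the forward output");
static_assert(!PoolingGradSketch<float, AvgPoolGradTag<float>>::kNeedsForwardOutput,
              "avg-pool grad does not");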

PoolProcess pool_process, bool exclusive, bool adaptive, T* input_grad,
bool channel_last = false) {
const int nthreads, const T* __restrict__ output_grad,
const int output_height, const int output_width, const int input_width,
Contributor:

Unify the order of height and width in the parameter list.

Contributor Author:

Will modify as suggested.

int w_offset, h_offset, c_offset;
int phstart, phend, pwstart, pwend;
int output_stride;

if (!channel_last) { /* NCHW */
Contributor:

The index calculations for NHWC and NCHW seem like they might be fairly common? They could be encapsulated, for example by defining an IndexCalculator4d and providing some basic calculation functions for NHWC and NCHW.

Contributor Author:

Will encapsulate it as suggested.
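
As a rough sketch of that suggestion (the name IndexCalculator4dSketch and its interface are hypothetical, and plain '/' and '%' are used for clarity where the PR uses fast_divmod):

// Decomposes a flat element index into (n, c, h, w) for either data layout.
struct IndexCalculator4dSketch {
  int channels, height, width;

  void DecomposeNCHW(int index, int* n, int* c, int* h, int* w) const {
    *w = index % width;
    *h = (index / width) % height;
    *c = (index / (width * height)) % channels;
    *n = index / (width * height * channels);
  }

  void DecomposeNHWC(int index, int* n, int* c, int* h, int* w) const {
    *c = index % channels;
    *w = (index / channels) % width;
    *h = (index / (channels * width)) % height;
    *n = index / (channels * width * height);
  }
};

With such a helper, the channel_last branch collapses into a single call at the top of the kernel and the rest of the code works with the recovered offsets.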

auto pool_divmod =
    FastDivModOfPool(input_channels, input_width, input_height, ksize_width,
                     ksize_height, stride_width, stride_height);
auto pool_functor = PoolingFunctor<T, PoolProcess>(input_data, output_data);
Xreki (Contributor) commented Sep 12, 2021:

Is this functor introduced to reduce IO? It feels like it could be reworked based on the original PoolProcess.

JamesLim-sy (Contributor Author) commented Sep 12, 2021:

I tried modifying the original PoolProcess, but it seemed to require adding too many members to the class, so I switched to an implementation specialized for the CUDA computation. The original PoolProcess is the class from the linked code:

template <class T>
class MaxPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += dy * static_cast<T>(x == y);
  }
};

template <class T>
class AvgPoolGrad {
 public:
  DEVICE inline void compute(const T& x, const T& y, const T& dy, T scale,
                             T* dx) {
    *dx += (scale * dy);
  }
};

The class has to support both CPU and CUDA computation, and the CPU logic differs quite a bit from the CUDA logic. In the CPU computation, input_data is read first and the offset of the output_data pointer is applied afterwards, and the output_data pointer has to keep advancing along with the loop, as follows:
float scale = 1.0 / pool_size;
for (int d = dstart; d < dend; ++d) {
  for (int h = hstart; h < hend; ++h) {
    for (int w = wstart; w < wend; ++w) {
      int input_idx = (d * input_height + h) * input_width + w;
      int output_idx =
          (pd * output_height + ph) * output_width + pw;
      pool_grad_process.compute(
          input_data[input_idx], output_data[output_idx],
          output_grad_data[output_idx], static_cast<T>(scale),
          input_grad_data + input_idx);
    }
  }
}
}
}
}
input_data += input_stride;
output_data += output_stride;
input_grad_data += input_stride;
output_grad_data += output_stride;

This differs from the CUDA logic, where the data read and the pointer offset are done only once, as follows:
output_data += output_stride;
output_grad += output_stride;
for (int ph = phstart; ph < phend; ++ph) {
  for (int pw = pwstart; pw < pwend; ++pw) {
    int pool_size;
    if (adaptive) {
      pool_size = static_cast<int>(ceil(static_cast<double>(input_height) /
                                        ksize_height)) *
                  static_cast<int>(
                      ceil(static_cast<double>(input_width) / ksize_width));
    } else {
      int hstart = ph * stride_height - padding_height;
      int wstart = pw * stride_width - padding_width;
      int hend = min(hstart + ksize_height, input_height);
      int wend = min(wstart + ksize_width, input_width);
      hstart = max(hstart, 0);
      wstart = max(wstart, 0);
      pool_size = exclusive ? (hend - hstart) * (wend - wstart)
                            : ksize_height * ksize_width;
    }
    int output_sub_idx = channel_last
                             ? (ph * output_width + pw) * channels + offsetC
                             : ph * output_width + pw;
    pool_process.compute(input, output_data[output_sub_idx],
                         output_grad[output_sub_idx],
                         static_cast<T>(1.0 / pool_size), &gradient);



batch_idx = index / channels / output_width / output_height;
}
int hstart, hend, wstart, wend;
int pw, ph, c, input_stride;
Contributor:

  • input_stride -> input_offset
  • What are pw and ph abbreviations for?

Contributor Author:

pw and ph stand for w_offset and h_offset respectively; this non-standard naming will be fixed in the next commit.

T input_grad_data = static_cast<T>(0);
int phstart, phend, pwstart, pwend;
int w_offset, h_offset, c_offset, output_stride;
ParamPreparationByDatalayout<>(index, channel_last, divmods, padding_width,
Contributor:

Function names should describe what the function does; isn't the purpose of this function to compute 4D coordinates?

T* dx) {
*dx += dy * static_cast<T>(x == y);
static constexpr bool use_x = true;
DEVICE inline void compute(const T& x, const T* y, const T* dy, int out_idx,
Contributor:

Why were y and dy changed to pointer types?

Contributor Author:

  • The main point is changing y to a pointer. If it were passed with the original data type, AvgPool would also have to read output_data[output_index] from global memory and pass it into the compute method, but AvgPool does not need output_data[output_index]; passing a pointer avoids that cost (see the sketch below).
  • Since dx is a pointer type, dy was changed to a pointer type accordingly.
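
A minimal sketch of that point, with illustrative names rather than the PR's final code: when y and dy are pointers, the max-pool grad functor dereferences them, while the average-pool grad functor never touches y, so the caller can skip loading output_data[output_index] from global memory in the AvgPool case. DEVICE here mirrors the macro used in the quoted code.

#ifdef __CUDACC__
#define DEVICE __host__ __device__
#else
#define DEVICE
#endif

template <class T>
struct MaxPoolGradPtrSketch {
  // Needs the forward output: reads *y to find which input produced the max.
  DEVICE void compute(const T& x, const T* y, const T* dy, T /*scale*/,
                      T* dx) const {
    *dx += (*dy) * static_cast<T>(x == *y);
  }
};

template <class T>
struct AvgPoolGradPtrSketch {
  // Never dereferences y, so the caller can pass the address without ever
  // issuing the global-memory load for output_data[output_index].
  DEVICE void compute(const T& /*x*/, const T* /*y*/, const T* dy, T scale,
                      T* dx) const {
    *dx += scale * (*dy);
  }
};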

Xreki (Contributor) left a comment:

LGTM

@JamesLim-sy JamesLim-sy merged commit 8668519 into PaddlePaddle:develop Sep 19, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021
* Optimization of pool2d grad, first commit.

* remove useless print codes

* refine codes

* refine codes

* seal more operation into template specialization

* fix template struct error in MaxPool2dGrad.

* Fix header including error

* refine code with comment

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation into funciton and make it common for other kernels

* polish code and erase useless template speicalization

* Rerun triger

* rerun trigger