Support multi-outputs feature for broadcast ops #38329
Conversation
✅ This PR's description meets the template requirements!
Thanks for your contribution!
Force-pushed from 3d77a4c to 474b13c
Force-pushed from 474b13c to ffccae7
@@ -208,6 +231,7 @@ __device__ void ElementwiseBroadcastKernelImpl(
     Functor func) {
   InT args[Arity][VecSize];
   OutT result[VecSize];
+  ScalarType<OutT> vec_result[NumOuts][VecSize];
This `vec_result` could be moved into the `WriteData` function at L199 and defined there.
Done, modified as requested.
@@ -441,8 +475,8 @@ __device__ __forceinline__ void WriteData(T* dst, T* __restrict__ src,
     }
   } else {
     // Vector type
-    const int kVectorSize = (NX % 4 == 0) ? 4 : (NX % 2 == 0) ? 2 : 1;
-    const int kVectorsPerThread = NX / kVectorSize;
+    constexpr int kVectorSize = (NX % 4 == 0) ? 4 : (NX % 2 == 0) ? 2 : 1;
Does `constexpr` have any performance impact (it is supported since C++14)? In theory the compiler evaluates a `const` like this at compile time as well.
This change is made from a readability standpoint: switching to `constexpr` makes it explicit that the value is computed at compile time.
@@ -170,11 +170,11 @@ struct DimensionsTransform {
   }
 };

-template <typename T, int VecSize, int Rank, bool IsBoundary = false>
+template <typename T, int VecSize, int Rank, bool IsBoundary>
Why remove the default template argument?
While making the changes I noticed that `LoadData` is only called from `ElementwiseBroadcastKernelImpl`, which already sets `bool IsBoundary = false`, so I removed the default template argument here.
Reverted to the original state.
__device__ __forceinline__ void LoadData(
    T *dst,
    const T *__restrict__ src,
-   uint32_t block_offset,
+   int block_offset,
Why change from `uint32_t` to `int`?
Same reason as the previous comment: `LoadData` is only called from `ElementwiseBroadcastKernelImpl`, whose parameter list declares `int block_offset`, so I changed this to match. However, since KP uses `uint32_t`, I will push another commit to change it back.
Reverted to the original `uint32_t`.
@@ -428,6 +428,40 @@ __device__ __forceinline__ void ReadDataReduce(
 * src: The register pointer, the size is NX * NY.
 * size: The current block needs to load size elements continuously.
 */

+#if defined(__NVCC__)
Are these two newly added? Was something incomplete in the previous functionality?
The functionality was already complete. The purpose of this change was to separate, at compile time, the compute branches taken by the two `IsBoundary` cases; essentially it emulates the C++17 construct:
if constexpr (condition) {
}
After checking, the partial-specialization change here caused a performance regression, so it has been reverted to the original implementation.
… support_multi-output_for_broadcast
LGTM
PR types
New features
PR changes
OPs
Describe
This PR reduces the launch overhead incurred by `LaunchBroadcastElementwiseCudaKernel` when a broadcast op produces multiple outputs. The dimensions of `out_1` and `out_2` must be the same, and the `axis` settings of the two functors must be identical. `paddle::framework::Array` is adopted as the input data type, and the functor is designed around it.
When calling `LaunchBroadcastElementwiseCudaKernel`, the template parameters change from `<InT, OutT, functor>` to `<InT, OutT, functor, NumOuts>`, where `NumOuts` expresses the number of outputs of the functor (2 in this example). Its default value is 1, so the existing single-output functor usage remains compatible. Before and after adding multi-output support, `elementwise` compute performance is essentially unchanged:
[Fig. 1: before multi-output support]
[Fig. 2: after multi-output support]
`NumOuts` is used to implement the multi-output case, but in that case the `ReturnType` template parameter in `Function_traits` is itself of type `paddle::framework::Array<OutT, NumOuts>`. I have been trying to obtain `NumOuts` from `ReturnType`, but have not yet found an effective way to do so.