Add Slice Tensor for int64 index in ElemenetwiseKernel #57313

Merged: 22 commits into PaddlePaddle:develop, Oct 10, 2023

Conversation

@AnnaTrainingG (Contributor) commented Sep 14, 2023

PR types

Others

PR changes

Others

Description

Others
Pcard-70459
Shapes 5120×4×384×384 and 1×1×384×384, bfloat16 precision verification.
At this size this PR runs in 9.5 ms; nsys could not collect performance data for the baseline.

With sync, this PR: paddle.add runtime (with synchronization): 0.0124376 s
With sync, baseline: paddle.add runtime (with synchronization): 0.10285941 s

Without sync, this PR: runtime 0.021558 s
Without sync, baseline: runtime 0.007517 s

Note: the baseline timings were taken before commit 3474e09, because that commit's implementation is incorrect and causes out-of-bounds memory access.

import datetime

import numpy as np
import paddle

tensor_a = paddle.rand(shape=[5120, 4, 384, 384], dtype="float16")
tensor_b = paddle.rand(shape=[5120, 1, 384, 384], dtype="float16")

# Time 10 runs of the broadcast add, synchronizing around each run.
elapsed_time = datetime.timedelta()
for _ in range(10):
    paddle.device.cuda.synchronize()
    start_time = datetime.datetime.now()
    tensor_z = paddle.add(tensor_a, tensor_b)
    paddle.device.cuda.synchronize()
    end_time = datetime.datetime.now()
    elapsed_time += end_time - start_time

# Verify the broadcast add against per-slice adds.
a0, a1 = paddle.split(tensor_z, num_or_sections=2, axis=1)
in0, in1 = paddle.split(tensor_a, num_or_sections=2, axis=1)

r0 = paddle.add(tensor_b, in0)
r1 = paddle.add(tensor_b, in1)

result1 = paddle.any(paddle.equal(a0, r0), [0, 1, 2, 3])
result2 = paddle.any(paddle.equal(a1, r1), [0, 1, 2, 3])
np.testing.assert_equal(result1.numpy(), True)
np.testing.assert_equal(result2.numpy(), True)

# Print the average runtime.
print(f"Average runtime: {elapsed_time.total_seconds() / 10} s")

Test script shown above.

The operator cases in op_benchmark for this PR are small, so the reported difference is due to machine fluctuation. (op_benchmark screenshot omitted.)

@paddle-bot bot commented Sep 14, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

#ifndef PADDLE_WITH_XPU_KP
constexpr bool kEnabledInt64IndexKernel = (NumOuts == 1 && kArity <= 3);
auto loader_classifier =
BroadcastTypeClassifier<OutT, Functor, kArity, NumOuts>(ins, outs, axis);
Contributor:

BroadcastTypeClassifier has already been constructed once in the outer layer, so the constructed loader_classifier could be passed as an argument to BroadcastKernelForDifferentVecSize, avoiding a second construction inside the function and the extra CPU overhead.

Contributor Author:

The extra construction has been removed. The two tensors' base addresses and dims differ, so the outer classifier cannot be passed in directly.

auto compute_size = std::numeric_limits<int32_t>::max();
bool use_int64_index_kernel = kEnabledInt64IndexKernel &&
(*outs)[0]->numel() >= compute_size &&
(!loader_classifier.all_elementwise);
Contributor:

all_elementwise means the branch that needs no broadcasting; it also needs to support large-Tensor computation.

Contributor Author:

Fixed.


if (use_int64_index_kernel) { // use_int64_index_kernel
const auto dims_simplifier =
Contributor:

dims_simplifier is also computed inside BroadcastTypeClassifier; if it is needed outside, it could be saved as a member.

Contributor Author:

The two dims are different, so it cannot be reused directly.

int all_rank = dims_simplifier.rank;
auto old_in_dims = dims_simplifier.in_dims;
auto old_out_dims = dims_simplifier.out_dims;
auto old_in_strides = dims_simplifier.in_dims;
Contributor:

Rename old_ -> origin_.

Contributor Author:

done

auto old_out_dims = dims_simplifier.out_dims;
auto old_in_strides = dims_simplifier.in_dims;

old_out_strides.resize(all_rank);
Contributor:

The code as a whole is too long and somewhat unstructured; consider splitting it into several sub-functions by functionality. Also consider whether this logic can be reused elsewhere, for example in the reduce computation.

Contributor Author:

Fixed.

@Xreki (Contributor) left a comment:

  • From the PR description, the computation time in the large-Tensor case does not look very long; add a unit test to guarantee the correctness of the computation.
  • Since nsys cannot profile it, run an end-to-end time comparison instead; paste the test code into the PR description and cover a few more configurations.

Consider encapsulating the whole logic in a separate header file (e.g. tensor_slicer.h) so it can be reused, for example:

class TensorSlicer {
 public:
  using ArgumentsTuple =
      std::tuple<std::vector<const DenseTensor *>, std::vector<DenseTensor *>>;

  TensorSlicer(const std::vector<const DenseTensor *> &ins,
               std::vector<DenseTensor *> *outs) {
    // initialization
  }

  int size() const { return num_splits_; }
  ArgumentsTuple operator[](int i) {
    ...
  }

 private:
  const std::vector<const DenseTensor *> *ins_;  // not owned
  std::vector<DenseTensor *> *outs_;             // not owned
  int num_splits_{0};
  std::vector<int64_t> strides_;  // slicing-related information
};

The outer calling logic would be:

if (numel > compute_size) {
  auto slicer = TensorSlicer(ins, outs);
  for (int i = 0; i < slicer.size(); ++i) {
    auto args = slicer[i];
    BroadcastKernelForDifferentVecSize(std::get<0>(args), std::get<1>(args), ...);
  }
  return;
}
BroadcastKernelForDifferentVecSize(ins, outs, ...);
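
For intuition, a minimal Python-level sketch of the same slicing idea (hypothetical small shapes; paddle.split and paddle.concat stand in for the in-kernel slicing a TensorSlicer would do):

import paddle

# Hypothetical small shapes; the real motivation is tensors whose numel
# exceeds the int32 range, which is too large to materialize here.
x = paddle.rand(shape=[8, 4, 64, 64], dtype="float32")
y = paddle.rand(shape=[8, 1, 64, 64], dtype="float32")

num_splits = 2
x_slices = paddle.split(x, num_or_sections=num_splits, axis=0)
y_slices = paddle.split(y, num_or_sections=num_splits, axis=0)

# Run the regular (int32-indexed) broadcast add on each slice, then stitch
# the partial results back together along the split axis.
out = paddle.concat(
    [paddle.add(xs, ys) for xs, ys in zip(x_slices, y_slices)], axis=0
)
assert out.shape == [8, 4, 64, 64]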

@@ -950,6 +897,215 @@ BroadcastKernelForDifferentVecSize(const KPDevice &ctx,
}
}

static void initDims(std::vector<int64_t> *dims, int size, int64_t value) {
Contributor:

std::vector has a corresponding constructor that directly builds a vector of N elements all equal to value.

Contributor Author:

done

std::reverse(dims->begin(), dims->end());
}

static void UpdateTensor(DenseTensor *x,
Contributor:

SliceTensor would be a more appropriate name for this function.

Contributor Author:

done

int axis,
Functor func,
const int64_t compute_size) {
const auto dims_simplifier =
Contributor:

Simplifying the dims here is not really meaningful; computing the strides of the tensor slices is cheaper than simplifying the dims.

Contributor Author:

This is not only simplifying the dims but also extending the input dims, ensuring that, apart from the 0-D case, the inputs and outputs have the same dims size.
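
A rough Python illustration of that dims extension (hypothetical helper, not the actual C++ code): lower-rank inputs are left-padded with 1s so that, apart from 0-D tensors, every input has the same dims size as the output.

def align_dims(in_dims, out_rank):
    # Hypothetical helper: left-pad an input's dims with 1s so its rank
    # matches the output's rank; broadcasting semantics are unchanged.
    return [1] * (out_rank - len(in_dims)) + list(in_dims)

print(align_dims([384, 384], 4))            # [1, 1, 384, 384]
print(align_dims([5120, 1, 384, 384], 4))   # unchanged: [5120, 1, 384, 384]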

BroadcastKernelForDifferentVecSize<OutT, Functor, kArity, NumOuts>(
ctx, new_ins, &new_outs, axis, func);
}
return;
Contributor:

This line can be removed.

Contributor Author:

Deleted.

}

// compute
DenseTensor tmp_in[kArity];
Contributor:

The Tensor defined here does not appear to be used, because a Tensor with the same name is defined again at L1023.

Contributor Author:

It was a duplicate definition and has been removed.

phi::Array<_ptr_ OutT *, NumOuts> outs_data;
for (int i = 0; i < NumOuts; ++i) {
outs_data[i] = (_ptr_ OutT *)(ctx.Alloc<OutT>((*outs)[i]));
}
Contributor:

outs_data is not used in this function.

Contributor Author:

This was here to allocate the output memory; fixed.

Contributor:

The upper-level BroadcastKernel has already allocated that memory.

Contributor Author:

done

@AnnaTrainingG changed the title from "Broadcast" to "Add Slice Tensor for int64 index in ElemenetwiseKernel" on Sep 25, 2023
@@ -88,6 +88,37 @@ def init_dtype(self):
or not core.is_bfloat16_supported(core.CUDAPlace(0)),
Contributor:

Too few CI machines support bfloat16; the unit test can test only float16.

Contributor Author:

Fixed.

class TestTensorAddSplit(unittest.TestCase):
def _split_compute(self, dtype):
paddle.disable_static()
tensor_a = paddle.rand(shape=[5120, 4, 384, 384], dtype=dtype)
Contributor:

Watch the GPU memory usage: this single tensor is about 5.6 GB. Would it be better to add a separate test file? If it runs alongside other unit tests, could GPU memory easily run out?

Contributor Author:

A new test file has been added.
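
For reference, a quick back-of-the-envelope check of that figure, and of why the int64 index path is needed at all (float16 is 2 bytes per element):

numel = 5120 * 4 * 384 * 384      # elements in one [5120, 4, 384, 384] tensor
print(numel * 2 / 1024**3)        # ~5.63 GiB of GPU memory at float16
print(numel > 2**31 - 1)          # True: numel exceeds the int32 range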

from paddle.base import core


class TestElementwiseOp(OpTest):
Contributor:

This new unit test should not inherit from OpTest, because it is not tested the OpTest way; add it as an ordinary unittest instead.

Contributor Author:

Fixed.
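
A minimal sketch of what such a plain-unittest check might look like (hypothetical, with reduced shapes; the actual test added by this PR lives in its own file and uses the full-size tensors):

import unittest

import numpy as np
import paddle
from paddle.base import core


@unittest.skipIf(
    not core.is_compiled_with_cuda(), "the large-tensor add test requires CUDA"
)
class TestTensorAddSplit(unittest.TestCase):
    def test_float16_add(self):
        paddle.disable_static()
        # Reduced shapes for illustration; the real test uses tensors whose
        # numel exceeds the int32 range to exercise the int64-index path.
        a = paddle.rand(shape=[8, 4, 64, 64], dtype="float16")
        b = paddle.rand(shape=[8, 1, 64, 64], dtype="float16")
        out = paddle.add(a, b)
        expected = a.numpy().astype("float32") + b.numpy().astype("float32")
        np.testing.assert_allclose(
            out.numpy().astype("float32"), expected, rtol=1e-3, atol=1e-3
        )


if __name__ == "__main__":
    unittest.main()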

@paddle-ci-bot bot commented Oct 6, 2023

Sorry to inform you that db6165c's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@Xreki (Contributor) commented Oct 9, 2023

> At this size this PR runs in 9.5 ms; nsys could not collect performance data for the baseline.

Now that the cause of the error on develop has been found, please measure the performance data again.

if (use_int64_index_kernel) {
switch (vec_size) {
case VecSizeL: {
LaunchBroadcastKernelWithInt64IndexHelper<OutT,
Contributor:

The other int64-implementation-related source code can also be removed.

Contributor Author:

Will remove it in a separate follow-up PR.

Contributor Author: (screenshot attached; not reproduced here)

@Xreki (Contributor) left a comment:
LGTM and great work~

@AnnaTrainingG merged commit f147f4b into PaddlePaddle:develop on Oct 10, 2023
27 checks passed
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 16, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023