Add Slice Tensor for int64 index in ElementwiseKernel #57313
Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
Force-pushed from fda92a7 to 55c3c7c
#ifndef PADDLE_WITH_XPU_KP
constexpr bool kEnabledInt64IndexKernel = (NumOuts == 1 && kArity <= 3);
auto loader_classifier =
    BroadcastTypeClassifier<OutT, Functor, kArity, NumOuts>(ins, outs, axis);
Since a BroadcastTypeClassifier has already been constructed in the outer layer, the constructed loader_classifier could be passed as an argument to BroadcastKernelForDifferentVecSize, avoiding a second construction inside the function and the extra CPU overhead it causes.
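A minimal, self-contained sketch of the suggested pattern, using stand-in types rather than the actual Paddle classes (Classifier, Caller, and Callee below are hypothetical names): construct the classifier once in the caller and pass it down by const reference instead of rebuilding it in the callee.

#include <vector>

struct Classifier {
  // Stand-in for BroadcastTypeClassifier; construction is assumed to be the
  // expensive step that the suggestion wants to avoid repeating.
  explicit Classifier(const std::vector<int> &ins)
      : num_inputs(static_cast<int>(ins.size())) {}
  int num_inputs;
};

void Callee(const std::vector<int> &ins, const Classifier &classifier) {
  // Reuses the caller-provided classifier; no second construction here.
  (void)ins;
  (void)classifier;
}

void Caller(const std::vector<int> &ins) {
  Classifier classifier(ins);  // constructed exactly once
  Callee(ins, classifier);     // passed down instead of rebuilt
}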
Removed. The two tensors' base addresses and dims differ, so it cannot be passed directly.
auto compute_size = std::numeric_limits<int32_t>::max();
bool use_int64_index_kernel = kEnabledInt64IndexKernel &&
                              (*outs)[0]->numel() >= compute_size &&
                              (!loader_classifier.all_elementwise);
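For reference, with the shapes quoted in the PR description the output has 5120 × 4 × 384 × 384 = 3,019,898,880 elements, which exceeds std::numeric_limits<int32_t>::max() = 2,147,483,647, so numel() >= compute_size holds and the int64-index path is taken.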
all_elementwise, i.e. the case that needs no broadcast branch, also needs to support large-Tensor computation.
Fixed.
(!loader_classifier.all_elementwise);

if (use_int64_index_kernel) {  // use_int64_index_kernel
const auto dims_simplifier = |
dims_simplifier is also computed inside BroadcastTypeClassifier; if it is needed outside, it could be saved as a member instead.
The two dims are not the same, so it cannot be used directly.
int all_rank = dims_simplifier.rank;
auto old_in_dims = dims_simplifier.in_dims;
auto old_out_dims = dims_simplifier.out_dims;
auto old_in_strides = dims_simplifier.in_dims;
old_ -> origin_
done
auto old_out_dims = dims_simplifier.out_dims;
auto old_in_strides = dims_simplifier.in_dims;

old_out_strides.resize(all_rank);
The whole block of code is too long and somewhat lacking in structure. I suggest wrapping it into several sub-functions by functionality. Also consider whether this logic could be reused, for example in the reduce computation.
Fixed.
- Judging from the PR description, the computation time in the large-Tensor case does not look very long; please add a unit test to guarantee the correctness of the computation.
- If nsys cannot produce data, then run an end-to-end time comparison instead; paste the test code into the PR description and test a few more configurations.
Consider encapsulating the whole logic and putting it in a separate header file (e.g. tensor_slicer.h) for easier reuse, for example:
class TensorSlicer {
 public:
  using ArgumentsTuple = std::pair<std::vector<const DenseTensor *>,
                                   std::vector<DenseTensor *>>;
  TensorSlicer(const std::vector<const DenseTensor *> &ins,
               std::vector<DenseTensor *> *outs)
      : ins_(&ins), outs_(outs) {
    // Initialization: compute num_splits_ and strides_ from ins/outs.
  }
  int size() const { return num_splits_; }
  ArgumentsTuple operator[](int i) {
    // Build and return the sliced inputs/outputs for the i-th split.
    ...
  }

 private:
  const std::vector<const DenseTensor *> *ins_;  // not owned
  std::vector<DenseTensor *> *outs_;             // not owned
  int num_splits_{0};
  std::vector<int64_t> strides_;  // slicing-related information
};
The outer calling logic would be:
if (numel > compute_size) {
  auto slicer = TensorSlicer(ins, outs);
  for (int i = 0; i < slicer.size(); ++i) {
    auto args = slicer[i];
    BroadcastKernelForDifferentVecSize(args.first, &args.second, ...);
  }
  return;
}
BroadcastKernelForDifferentVecSize(ins, outs, ...);
@@ -950,6 +897,215 @@ BroadcastKernelForDifferentVecSize(const KPDevice &ctx,
}
}
static void initDims(std::vector<int64_t> *dims, int size, int64_t value) { |
std::vector has a corresponding constructor that directly initializes a vector of N elements, each equal to value.
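A minimal sketch of that point (the helper name below is illustrative, not from the PR): the std::vector fill constructor already builds N elements that all equal value, so a hand-written initialization helper is unnecessary.

#include <cstdint>
#include <vector>

// Fill constructor: N copies of `value`, no explicit loop needed.
std::vector<int64_t> MakeFilledDims(int size, int64_t value) {
  return std::vector<int64_t>(static_cast<size_t>(size), value);
}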
done
std::reverse(dims->begin(), dims->end());
}
static void UpdateTensor(DenseTensor *x, |
SliceTensor would be a more appropriate function name.
done
int axis,
Functor func,
const int64_t compute_size) {
const auto dims_simplifier = |
Simplifying the dims here is actually pointless; computing the strides of the Tensor slices has lower complexity than simplifying the dims.
This does not only simplify the dims; it also extends the input dims, guaranteeing that, except for the 0-D case, the input and output dims_size are the same.
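A minimal, self-contained sketch of what "extending the input dims" means here (PadToOutputRank is an illustrative name, not Paddle's actual helper), using the shapes from the PR description as an example:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pad a lower-rank input with leading 1s so its rank matches the output
// before broadcasting, e.g. {384, 384} with out_rank = 4 -> {1, 1, 384, 384}.
// Assumes out_rank >= in_dims.size().
std::vector<int64_t> PadToOutputRank(const std::vector<int64_t> &in_dims,
                                     size_t out_rank) {
  std::vector<int64_t> padded(out_rank, 1);
  std::copy(in_dims.begin(), in_dims.end(),
            padded.begin() + (out_rank - in_dims.size()));
  return padded;
}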
BroadcastKernelForDifferentVecSize<OutT, Functor, kArity, NumOuts>(
    ctx, new_ins, &new_outs, axis, func);
}
return;
This line can be deleted.
Deleted.
}

// compute
DenseTensor tmp_in[kArity]; |
The Tensor defined here does not seem to be used, because a Tensor with the same name is defined again at L1023.
It was a duplicate definition and has been removed.
phi::Array<_ptr_ OutT *, NumOuts> outs_data;
for (int i = 0; i < NumOuts; ++i) {
  outs_data[i] = (_ptr_ OutT *)(ctx.Alloc<OutT>((*outs)[i]));
}
outs_data is not used in this function.
This was done to allocate memory; it has been changed.
The upper-level BroadcastKernel has already allocated the memory.
done
Force-pushed from 601b21c to 85e7012
@@ -88,6 +88,37 @@ def init_dtype(self):
or not core.is_bfloat16_supported(core.CUDAPlace(0)),
Too few CI machines support bfloat16; the unit test can test only float16.
Fixed.
class TestTensorAddSplit(unittest.TestCase):
    def _split_compute(self, dtype):
        paddle.disable_static()
        tensor_a = paddle.rand(shape=[5120, 4, 384, 384], dtype=dtype)
Pay attention to GPU memory usage: this single tensor is already 5.6 GB. It might be better to add a separate test file? If it runs together with the other unit tests, could GPU memory easily be exhausted?
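(As a quick check: 5120 × 4 × 384 × 384 = 3,019,898,880 elements; at 2 bytes per float16/bfloat16 element that is roughly 6.04 GB, i.e. about 5.6 GiB, matching the figure above.)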
A new test file has been added.
from paddle.base import core
class TestElementwiseOp(OpTest): |
This new unit test should not inherit from OpTest, because it is not tested the OpTest way. Just add it as an ordinary unittest.
Fixed.
Sorry to inform you that db6165c's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Force-pushed from db6165c to 19db945
Now that the cause of the error on develop has been found, please also collect the performance statistics.
if (use_int64_index_kernel) {
  switch (vec_size) {
    case VecSizeL: {
LaunchBroadcastKernelWithInt64IndexHelper<OutT, |
The other source code related to the int64 implementation can also be deleted.
Will delete it in a separate follow-up PR.
LGTM and great work~
PR types
Others
PR changes
Others
Description
Others
Pcard-70459
Precision verification with shapes 5120 × 4 × 384 × 384 and 1 × 1 × 384 × 384, bfloat16.
At this scale, this PR's performance is 9.5 ms; nsys cannot produce performance data for the baseline.
With sync, this PR: paddle.Add with sync added, program run time: 0.0124376 s
With sync, baseline: paddle.Add with sync added, program run time: 0.10285941 s
Without sync, this PR: program run time: 0.021558000000000003 s
Without sync, old: program run time: 0.007517 s
Note: the baseline timing is the timing measured before commit 3474e09, because the implementation in commit 3474e09 is incorrect and causes out-of-bounds memory access.
Test script
The operator cases in op_benchmark for this PR are small in scale; the differences are caused by machine fluctuation.