
Added oneDNN reduce_op GRAD kernel #32280

Merged 43 commits · Apr 21, 2021
Changes from 37 commits
1ecc4cf
added external reorder to profiler
Dec 2, 2020
d4f9ad4
resolved conflicts
jakpiase Mar 8, 2021
f85e7a3
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
jakpiase Mar 9, 2021
5c02f89
added mkldnn reduce op kernel
jakpiase Mar 22, 2021
7c3b736
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
jakpiase Mar 22, 2021
4147b25
refactored reduce op
jakpiase Mar 23, 2021
726846f
reverted old file
jakpiase Mar 23, 2021
6763404
added clang formatting
jakpiase Mar 23, 2021
f2555e5
removed unnecessary imports and comments
jakpiase Mar 23, 2021
8f80eb5
minor change
jakpiase Mar 23, 2021
539fe3c
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
jakpiase Mar 25, 2021
3dfabd9
merged with develop
jakpiase Mar 25, 2021
895f948
Revert "merged with develop"
jakpiase Mar 25, 2021
cd9d2f3
minor change
jakpiase Mar 25, 2021
87fc5a1
fixed mispelling
jakpiase Mar 25, 2021
a75ee12
Minor refactoring
jakpiase Mar 26, 2021
b442889
minor change
jakpiase Mar 26, 2021
27dec3a
importet necessary modules
jakpiase Mar 26, 2021
71089fe
minor change
jakpiase Mar 26, 2021
29097ce
minor formatting change
jakpiase Mar 26, 2021
164043a
excluded cuda from bf test
jakpiase Mar 29, 2021
be36f94
fixed static mode in test_resnet_v2
jakpiase Mar 29, 2021
424083f
added formatting
jakpiase Mar 29, 2021
87b5b38
added support for edge case
jakpiase Apr 7, 2021
94e4ace
added files for reduce grad
jakpiase Apr 13, 2021
9ae1005
added grad tests for onednn reduce
jakpiase Apr 14, 2021
cfa2519
resolved conflicts
jakpiase Apr 14, 2021
7d3797f
added formatting
jakpiase Apr 14, 2021
782e25c
minor changes
jakpiase Apr 14, 2021
bd69270
minor change
jakpiase Apr 14, 2021
ffe6156
minor formatting change
jakpiase Apr 14, 2021
27f8bb7
minor change
jakpiase Apr 14, 2021
2e4ce07
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
jakpiase Apr 14, 2021
27355d0
changed test
jakpiase Apr 14, 2021
1445bd6
minor changes
jakpiase Apr 14, 2021
aa5dccd
added formatting
jakpiase Apr 14, 2021
9f9eea9
minor change
jakpiase Apr 14, 2021
996b81e
added suggested changes
jakpiase Apr 15, 2021
fce4eb4
added formatting
jakpiase Apr 15, 2021
24af4d3
removed doubled memset
jakpiase Apr 16, 2021
02dc16d
added suggested changes
jakpiase Apr 19, 2021
6464442
reverted one change
jakpiase Apr 19, 2021
7387593
changed formatting
jakpiase Apr 19, 2021
30 changes: 30 additions & 0 deletions paddle/fluid/operators/reduce_ops/mkldnn/reduce_mean_mkldnn_op.cc
@@ -25,10 +25,40 @@ class ReduceMeanMKLDNNKernel : public ReduceMKLDNNKernel<T> {
}
};

template <typename T>
class ReduceMeanGradMKLDNNKernel : public ReduceGradMKLDNNKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
const auto* input_x = ctx.Input<Tensor>("X");
auto input_dims = framework::vectorize(input_x->dims());
auto reduce_dims = ctx.Attr<std::vector<int>>("dim");

int number_of_elements = 1;
if (!ctx.Attr<bool>("reduce_all")) {
for (size_t i = 0; i < reduce_dims.size(); ++i) {
reduce_dims[i] = (reduce_dims[i] >= 0)
? reduce_dims[i]
: input_dims.size() + reduce_dims[i];
number_of_elements *= input_dims[reduce_dims[i]];
}
} else {
for (size_t i = 0; i < input_dims.size(); ++i)
number_of_elements *= input_dims[i];
}

this->RunKernel(ctx, dnnl::algorithm::binary_add, 0.0f,
1.0L / number_of_elements);
}
};

} // namespace operators
} // namespace paddle

namespace ops = paddle::operators;
REGISTER_OP_KERNEL(reduce_mean, MKLDNN, paddle::platform::CPUPlace,
ops::ReduceMeanMKLDNNKernel<float>,
ops::ReduceMeanMKLDNNKernel<paddle::platform::bfloat16>);

REGISTER_OP_KERNEL(reduce_mean_grad, MKLDNN, paddle::platform::CPUPlace,
ops::ReduceMeanGradMKLDNNKernel<float>,
ops::ReduceMeanGradMKLDNNKernel<paddle::platform::bfloat16>);
60 changes: 60 additions & 0 deletions paddle/fluid/operators/reduce_ops/mkldnn/reduce_mkldnn_op.h
@@ -121,5 +121,65 @@ class ReduceMKLDNNKernel : public framework::OpKernel<T> {
}
};

template <typename T>
class ReduceGradMKLDNNKernel : public framework::OpKernel<T> {
public:
void RunKernel(const framework::ExecutionContext& ctx,
dnnl::algorithm binary_type, float scale_x,
float scale_y) const {
auto& dev_ctx =
Reviewer (Contributor): Since you're not modifying dev_ctx:
    Suggested change: auto& dev_ctx =  ->  const auto& dev_ctx =
ctx.template device_context<platform::MKLDNNDeviceContext>();
const auto& onednn_engine = dev_ctx.GetEngine();

auto dims = ctx.Attr<std::vector<int>>("dim");
auto* input_dy = ctx.Input<Tensor>(framework::GradVarName("Out"));

Reviewer (Contributor): Please remove this blank line.
auto* output_dx = ctx.Output<Tensor>(framework::GradVarName("X"));

output_dx->mutable_data<T>(ctx.GetPlace());

Reviewer (Contributor): Suggested removing the blank line above.
output_dx->set_format(getPlainFormatTag(output_dx));
output_dx->set_layout(input_dy->layout());

platform::BinaryReductionGradMKLDNNHandler<T> handler(
binary_type, dev_ctx, onednn_engine, ctx.GetPlace(), output_dx,
input_dy, scale_x, scale_y,
ctx.InputName(framework::GradVarName("Out")));

auto src_dx_memory = handler.AcquireSrcMemory(output_dx);
Reviewer (Contributor):
    Suggested change: auto src_dx_memory = handler.AcquireSrcMemory(output_dx);  ->  const auto src_dx_memory = handler.AcquireSrcMemory(output_dx);

const auto src_dy_memory = handler.AcquireSecondSrcMemory(input_dy);

Reviewer (Contributor): Suggested removing the blank line above.
memset(output_dx->data<T>(), 0, src_dx_memory->get_desc().get_size());

const auto binary_prim = handler.AcquireForwardPrimitive();

const std::unordered_map<int, dnnl::memory> args = {
{DNNL_ARG_SRC_0, *src_dx_memory},
{DNNL_ARG_SRC_1, *src_dy_memory},
{DNNL_ARG_DST, *src_dx_memory}};

auto& astream = platform::MKLDNNDeviceContext::tls().get_stream();

Reviewer (Contributor): Suggested removing the blank line above.
binary_prim->execute(astream, args);
astream.wait();
}

protected:
mkldnn::memory::format_tag getPlainFormatTag(const Tensor* tensor) const {
switch (tensor->dims().size()) {
case 1:
return mkldnn::memory::format_tag::a;
case 2:
return mkldnn::memory::format_tag::ab;
case 3:
return mkldnn::memory::format_tag::abc;
case 4:
return mkldnn::memory::format_tag::abcd;
default:
return mkldnn::memory::format_tag::abcde;
Reviewer (Contributor): Why is a 5-dim tensor the default case?

Author (jakpiase): I have made a restriction in GetExpectedKernelType that dims must be in the range <1,5>. I had to assure the compiler that there will always be a return value from this function. I can delete the default statement and just leave the instruction outside the switch block. What do you think?

Reviewer (Contributor): I'd add case 5: and throw an error in the default statement that an invalid argument was passed.

}
}
};

} // namespace operators
} // namespace paddle
12 changes: 12 additions & 0 deletions paddle/fluid/operators/reduce_ops/mkldnn/reduce_sum_mkldnn_op.cc
@@ -25,10 +25,22 @@ class ReduceSumMKLDNNKernel : public ReduceMKLDNNKernel<T> {
}
};

template <typename T>
class ReduceSumGradMKLDNNKernel : public ReduceGradMKLDNNKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
this->RunKernel(ctx, dnnl::algorithm::binary_add, 0.0f, 1.0f);
}
};

} // namespace operators
} // namespace paddle

namespace ops = paddle::operators;
REGISTER_OP_KERNEL(reduce_sum, MKLDNN, paddle::platform::CPUPlace,
ops::ReduceSumMKLDNNKernel<float>,
ops::ReduceSumMKLDNNKernel<paddle::platform::bfloat16>);

REGISTER_OP_KERNEL(reduce_sum_grad, MKLDNN, paddle::platform::CPUPlace,
ops::ReduceSumGradMKLDNNKernel<float>,
ops::ReduceSumGradMKLDNNKernel<paddle::platform::bfloat16>);
34 changes: 31 additions & 3 deletions paddle/fluid/operators/reduce_ops/reduce_op.h
@@ -559,15 +559,43 @@ class ReduceGradOp : public framework::OperatorWithKernel {
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
auto input_data_type = OperatorWithKernel::IndicateVarDataType(
ctx, framework::GradVarName("Out"));

#ifdef PADDLE_WITH_MKLDNN
auto CanMKLDNNReduceGradBeUsed = [&]() {
auto dx_dims = ctx.Input<Tensor>("X")->dims();
if (ctx.Attr<bool>("reduce_all") ||
((int)ctx.Attr<std::vector<int>>("dim").size() == dx_dims.size()))
return true;

if (dx_dims.size() > 5) return false; // max 5D tensor is supported
Reviewer (Contributor): With this second condition Attr<std::vector<int>>("dim").size() == dx_dims.size() you may pass a reducing tensor with rank > 5 if one manually requests reduction of all dims in the "dim" attribute. Please check for 5 dims first.

auto dy_dims = ctx.Input<Tensor>(framework::GradVarName("Out"))->dims();

// Subtensor must be on rightmost part of the bigger tensor
for (int i = 0; i < dy_dims.size(); ++i) {
if (dx_dims[dx_dims.size() - dy_dims.size() + i] != dy_dims[i]) {
return false;
}
}
return true;
};
if (this->CanMKLDNNBeUsed(ctx, input_data_type) &&
CanMKLDNNReduceGradBeUsed()) {
return framework::OpKernelType(input_data_type, ctx.GetPlace(),
framework::DataLayout::kMKLDNN,
framework::LibraryType::kMKLDNN);
}
#endif

int in_dtype = ctx.Attr<int>("in_dtype");
if (in_dtype >= 0) {
return framework::OpKernelType(
static_cast<framework::proto::VarType::Type>(in_dtype),
ctx.GetPlace());
}
return framework::OpKernelType(OperatorWithKernel::IndicateVarDataType(
ctx, framework::GradVarName("Out")),
ctx.GetPlace());
return framework::OpKernelType(input_data_type, ctx.GetPlace());
}
};

63 changes: 63 additions & 0 deletions paddle/fluid/platform/mkldnn_reuse.h
@@ -630,6 +630,69 @@ class BinaryMKLDNNHandler : public platform::MKLDNNHandlerT<T, dnnl::binary> {
}
};

template <typename T>
class BinaryReductionGradMKLDNNHandler
: public platform::MKLDNNHandlerT<T, dnnl::binary> {
public:
BinaryReductionGradMKLDNNHandler(const dnnl::algorithm algo,
const MKLDNNDeviceContext& dev_ctx,
const mkldnn::engine engine,
platform::Place cpu_place, const Tensor* x,
const Tensor* y, float scale_x,
float scale_y, const std::string& uniq_name)
: platform::MKLDNNHandlerT<T, dnnl::binary>(
dev_ctx, engine, cpu_place,
platform::CreateKey(dev_ctx, framework::vectorize(x->dims()),
uniq_name)) {
if (!this->isCached()) {
PADDLE_ENFORCE_EQ(
x->layout(), DataLayout::kMKLDNN,
platform::errors::InvalidArgument("Wrong layout set for X tensor."));
PADDLE_ENFORCE_NE(
x->format(), MKLDNNMemoryFormat::undef,
platform::errors::InvalidArgument("Wrong format set for X tensor."));

PADDLE_ENFORCE_EQ(
y->layout(), DataLayout::kMKLDNN,
platform::errors::InvalidArgument("Wrong layout set for Y tensor."));
PADDLE_ENFORCE_NE(
y->format(), MKLDNNMemoryFormat::undef,
platform::errors::InvalidArgument("Wrong format set for Y tensor."));

auto src1_tz = framework::vectorize(y->dims());
const auto src0_tz = framework::vectorize(x->dims());

// GetExpectedKernelType checks if smaller vector is a subvector with all
// the dims in correct order on the rightmost part of the bigger vector,
// f.e. a correct vector for broadcasting:
Reviewer (Contributor):
    Suggested change: // f.e. a correct vector for broadcasting:  ->  // i.e. a correct vector for broadcasting:
// x = 5, 7, 3, 2, 4, 8
// y = 4, 8
for (size_t i = src1_tz.size(); i < src0_tz.size(); ++i) {
src1_tz.insert(src1_tz.begin(), 1L);
Reviewer (Contributor): Here this won't have much impact, but in general this is not the optimal way of adding elements at the beginning of a vector, since right now you'll be copying all the vector data a few times. You should rather first allocate the appropriate amount of memory and then fill it.
}

const auto src0_md = dnnl::memory::desc(
src0_tz, platform::MKLDNNGetDataType<T>(), x->format());
const auto src1_md = dnnl::memory::desc(
src1_tz, platform::MKLDNNGetDataType<T>(), x->format());
Reviewer (Contributor):
    Suggested change: src1_tz, platform::MKLDNNGetDataType<T>(), x->format());  ->  src1_tz, platform::MKLDNNGetDataType<T>(), y->format());

Author (jakpiase): This is the only change that I haven't implemented. The Y tensor is the reduced one, so it has fewer dimensions than the X tensor. PaddlePaddle does not keep dims by default, so I have to use x->format(). This operation is safe, because I am checking whether the subtensor is on the rightmost part of the bigger tensor.
dnnl::primitive_attr attributes;
attributes.set_scales(DNNL_ARG_SRC_0, 0, {scale_x});
attributes.set_scales(DNNL_ARG_SRC_1, 0, {scale_y});

this->AcquireForwardPrimitiveDescriptor(attributes, algo, src0_md,
src1_md, src0_md);
}
}

std::shared_ptr<mkldnn::memory> AcquireSecondSrcMemory(
const framework::Tensor* input) {
const T* input_data = input->data<T>();
return this->AcquireMemoryFromPrimitive(
this->fwd_pd_->src1_desc(), to_void_cast<T>(input_data), "@src1_mem_p");
}
};

template <typename T>
class ReductionMKLDNNHandler
: public platform::MKLDNNHandlerT<T, dnnl::reduction> {