Op fusion experiment #1817

Merged: chentong319 merged 12 commits into onnx:main from merge-ops on Nov 14, 2022
Conversation

@chentong319 (Collaborator) commented Oct 31, 2022

This is an experimental PR to check the performance impact of op fusion.
The following pattern has been observed to occur repeatedly in a model:

%1 = "onnx.Concat"(...)
%2 = "onnx.Transpose"(%1)
%3 = "onnx.Shape"(%1)

Since the current MLIR loop fusion cannot fuse the loops generated from these ops, I am trying to fuse them at the op level just to see the performance impact.
This PR introduces a new ONNX op, ConcatShapeTranspose, and copies the shape inference and krnl lowering code of the three original ops into the krnl lowering for the new op. I did not add shape inference for the new op because the code is only for an experiment, and the op fusion should occur after shape inference anyway.
The baseline model is

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

The fused model is

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1:2 = "onnx.ConcatShapeTranspose"(%arg0, %arg1) {axis = 1 : si64, perm = [1, 0]} : (tensor<?x?xf32>, tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
    return %1#0, %1#1 : tensor<2xi64>, tensor<?x?xf32>
}

I ran the experiment on my MacBook for two data sizes, (100x200, 100x300) and (1000x2000, 1000x3000), and measured the time with two methods: 1) time in the Python driver, and 2) time from ONNX instrumentation.
The times measured in the Python driver are similar for both versions, while the fused version has a significantly shorter execution time when measured with instrumentation. This needs further investigation.
Here are the details:

  • (100x200, 100x300)

    1. baseline: python 2.77e-4, instrumentation 1.99e-4
    2. fused: python 2.72e-4, instrumentation 1.21e-4
  • (1000x2000, 1000x3000)

    1. baseline: python 4.5e-2, instrumentation 4.3e-2
    2. fused: python 4.2e-2, instrumentation 3.1e-2

@AlexandreEichenberger (Collaborator)
@chentong319 So it seems to be about 30% faster. What were the original sizes in the benchmark it came from, just to get a feel?

@AlexandreEichenberger (Collaborator)
Also, if it is a test that you don't expect to merge in, I would convert it to draft. Tx

@chentong319 (Collaborator, Author)
@jenkins-droid test it please.

@imaihal (Collaborator) commented Nov 4, 2022

@chentong319 Thanks for creating this PR.

I tested lowering the following MLIR with -O3 --EmitONNXIR, and the onnx.Shape disappeared as shown below. I got the same behavior in our practical model. Would it not be enough to fuse only Concat and Transpose (as a ConcatTranspose op)? Is fusing the three ops better?

func.func @main_graph(%arg0: tensor<10x10xf32>, %arg1: tensor<10x10xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

After -O3 --EmitONNXIR this becomes:

  func.func @main_graph(%arg0: tensor<10x10xf32>, %arg1: tensor<10x10xf32>) -> (tensor<2xi64>, tensor<20x10xf32>) {
    %0 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<10x20xf32>
    %1 = "onnx.Transpose"(%0) {perm = [1, 0]} : (tensor<10x20xf32>) -> tensor<20x10xf32>
    %2 = "onnx.Constant"() {value = dense<[10, 20]> : tensor<2xi64>} : () -> tensor<2xi64>
    return %2, %1 : tensor<2xi64>, tensor<20x10xf32>
  }


@imaihal (Collaborator) commented Nov 4, 2022

@chentong319 Sorry, please ignore the previous comment.

The static-dim case above is fine when fusing only two ops, but maybe fusing the three ops is needed to support the dynamic-dim case?

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

With the same lowering (-O3 --EmitONNXIR) this becomes:

  func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>) {
    %0 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %1 = "onnx.Transpose"(%0) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Dim"(%0) {axis = 0 : si64} : (tensor<?x?xf32>) -> tensor<1xi64>
    %3 = "onnx.Dim"(%0) {axis = 1 : si64} : (tensor<?x?xf32>) -> tensor<1xi64>
    %4 = "onnx.Concat"(%2, %3) {axis = 0 : si64} : (tensor<1xi64>, tensor<1xi64>) -> tensor<2xi64>
    return %4, %1 : tensor<2xi64>, tensor<?x?xf32>
  }

@tungld (Collaborator) commented Nov 4, 2022

@imaihal did you see any improvement with your practical model?

@imaihal (Collaborator) commented Nov 4, 2022

> @imaihal did you see any improvement with your practical model?

@tungld Yes. I tested our practical model by manually fusing Concat, Transpose, and Shape into the ConcatShapeTranspose op. (This PR introduces the ConcatShapeTranspose op but does not rewrite the pattern automatically, right?)

It seems the elapsed time in the Concat-Transpose-Shape part becomes about 60% faster. I will check again and will share the result offline.

@chentong319 (Collaborator, Author) commented Nov 10, 2022

Thanks @imaihal for verifying the performance impact. I updated the PR. Could you and @tungld review it?
The automatic transformation part will be added in another PR.
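For illustration, one way such an automatic rewrite could look is sketched below. This is not the actual follow-up PR; the ONNXConcatShapeTransposeOp builder signature and the unprefixed attribute accessors (axisAttr/permAttr) are assumptions, and the real pass would also need to handle the optional start/end attributes of Shape and the case where Concat has other users.

    // Sketch only: fuse Concat -> {Shape, Transpose} into ConcatShapeTranspose.
    // The ONNXConcatShapeTransposeOp builder and accessor names are assumptions.
    struct FuseConcatShapeTranspose : public OpRewritePattern<ONNXConcatOp> {
      using OpRewritePattern<ONNXConcatOp>::OpRewritePattern;

      LogicalResult matchAndRewrite(
          ONNXConcatOp concatOp, PatternRewriter &rewriter) const override {
        // Match a Concat whose only users are one Shape and one Transpose.
        ONNXShapeOp shapeOp;
        ONNXTransposeOp transposeOp;
        for (Operation *user : concatOp->getUsers()) {
          if (auto s = dyn_cast<ONNXShapeOp>(user))
            shapeOp = s;
          else if (auto t = dyn_cast<ONNXTransposeOp>(user))
            transposeOp = t;
          else
            return failure();
        }
        if (!shapeOp || !transposeOp)
          return failure();

        // Result 0 of the fused op is the shape, result 1 the transposed data.
        auto fused = rewriter.create<ONNXConcatShapeTransposeOp>(
            concatOp.getLoc(),
            TypeRange{shapeOp.getType(), transposeOp.getType()},
            concatOp.getOperands(), concatOp.axisAttr(), transposeOp.permAttr());
        rewriter.replaceOp(shapeOp, fused.getResult(0));
        rewriter.replaceOp(transposeOp, fused.getResult(1));
        rewriter.eraseOp(concatOp);
        return success();
      }
    };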

@tungld (Collaborator) left a comment

@chentong319 thanks for proposing a fused op!

My three main comments are:

  • Creating an IndexExpr-based ShapeHelper so that our DimAnalysis can utilize it (see the sketch after this list).
  • Adding a lit test for the lowering of ConcatShapeTranspose.
  • ShapeOp may be decomposed into DimOp and Constant. Please make sure we fuse Concat, Shape, and Transpose before Shape vanishes. This is perhaps a topic for the next PR.
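As an illustration of the first bullet, the output dims could be derived with IndexExpr roughly as sketched below; this is only a sketch with an assumed name and signature, not the actual onnx-mlir ShapeHelper interface:

    // Sketch only (not the onnx-mlir ShapeHelper API): derive the output dims
    // of ConcatShapeTranspose with IndexExpr so that DimAnalysis could reuse
    // them. The function name and signature are illustrative assumptions.
    static void computeConcatShapeTransposeDims(ArrayRef<DimsExpr> inputDims,
        int64_t axis, ArrayRef<int64_t> perm, DimsExpr &concatDims,
        DimsExpr &transposeDims) {
      // Concat result: same dims as the first input, except that the extent
      // along `axis` is the sum of all inputs' extents along that axis.
      concatDims.assign(inputDims[0].begin(), inputDims[0].end());
      for (size_t i = 1; i < inputDims.size(); ++i)
        concatDims[axis] = concatDims[axis] + inputDims[i][axis];
      // Transpose result: the concatenated dims permuted by `perm`.
      transposeDims.clear();
      for (int64_t p : perm)
        transposeDims.emplace_back(concatDims[p]);
      // The Shape result is a 1-D tensor of length rank(concat); its element
      // values are exactly concatDims, which is what DimAnalysis would consume.
    }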

src/Conversion/ONNXToKrnl/Tensor/ConcatShapeTranspose.cpp: several outdated/resolved inline review comments (not shown).
Inline review comment on this hunk in src/Conversion/ONNXToKrnl/Tensor/ConcatShapeTranspose.cpp:

    DimsExpr outputTransposeDims(commonRank);
    auto permAttr = operandAdaptor.perm();
    for (uint64_t i = 0; i < commonRank; i++) {
      auto current = outputConcatDims[ArrayAttrIntVal(permAttr, i)];
Use a concrete type instead of auto here.
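For example (a minimal sketch of the suggested change, assuming outputConcatDims is a DimsExpr so its elements are IndexExpr):

    // Spell out the element type instead of relying on auto.
    IndexExpr current = outputConcatDims[ArrayAttrIntVal(permAttr, i)];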

@tungld (Collaborator) left a comment
LGTM! Thanks for the changes!

@chentong319 merged commit ea4c87c into onnx:main on Nov 14, 2022
@chentong319 deleted the merge-ops branch on November 14, 2022 at 14:53
@jenkins-droid (Collaborator)

Jenkins Linux amd64 Build #8573 [push] Op fusion experiment (#1... started at 08:53
Jenkins Linux ppc64le Build #7640 [push] Op fusion experiment (#1... started at 09:55
Jenkins Linux s390x Build #8589 [push] Op fusion experiment (#1... started at 09:53
Jenkins Linux amd64 Build #8573 [push] Op fusion experiment (#1... passed after 1 hr 17 min
Jenkins Linux s390x Build #8589 [push] Op fusion experiment (#1... passed after 1 hr 25 min
Jenkins Linux ppc64le Build #7640 [push] Op fusion experiment (#1... passed after 1 hr 41 min
