Op fusion experiment #1817

Merged: chentong319 merged 12 commits into onnx:main from merge-ops on Nov 14, 2022
Conversation

@chentong319 (Collaborator) commented Oct 31, 2022

This is an experimental PR to check the performance impact of op fusion.
The following pattern has been observed to occur repeatedly in a model:

%1 = "onnx.Concat"(...)
%2 = "onnx.Transpose"(%1)
%3 = "onnx.Shape"(%1)

Since the current MLIR loop fusion cannot fuse the loops generated from these ops, I am trying to fuse them at the op level just to see the performance impact.
This PR introduces a new ONNX op, ConcatShapeTranspose, and copies the shape inference and krnl lowering code of the three original ops into the krnl lowering for the new op. I did not add shape inference for the new op because the code is only for an experiment, and the op fusion should occur after shape inference anyway.
The baseline model is

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

The fused model is

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1:2 = "onnx.ConcatShapeTranspose"(%arg0, %arg1) {axis = 1 : si64, perm = [1, 0]} : (tensor<?x?xf32>, tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
    return %1#0, %1#1 : tensor<2xi64>, tensor<?x?xf32>
}

I ran the experiment on my MacBook for two data sizes, (100x200, 100x300) and (1000x2000, 1000x3000), and measured the time with two methods: 1) time in the Python driver, and 2) time from ONNX instrumentation.
The times measured in the Python driver are similar for both versions, while the fused version has a significantly shorter execution time when measured with instrumentation. This needs further investigation.
Here are the details:

  • (100x200, 100x300)

    1. baseline: python 2.77e-4, instrumentation 1.99e-4
    2. fused: python 2.72e-4, instrumentation 1.21e-4
  • (1000x2000, 1000x3000)

    1. baseline: python 4.5e-2, instrumentation 4.3e-2
    2. fused: python 4.2e-2, instrumentation 3.1e-2

@AlexandreEichenberger (Collaborator)
@chentong319 So it seems to be about 30% faster. What were the original sizes in the benchmark it came from, just to get a feel?

@AlexandreEichenberger (Collaborator)
Also, if it is a test that you don't expect to merge in, I would convert it to draft. Tx

@chentong319 (Collaborator, Author)
@jenkins-droid test it please.

@imaihal (Collaborator) commented Nov 4, 2022

@chentong319 Thanks for creating this PR.

I tested lowering the following MLIR with -O3 --EmitONNXIR, and the onnx.Shape disappeared as shown below. I got the same behavior in our practical model. Would it not be enough to fuse only Concat and Transpose (as a ConcatTranspose op)? Is fusing the three ops better?

func.func @main_graph(%arg0: tensor<10x10xf32>, %arg1: tensor<10x10xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

After -O3 --EmitONNXIR this becomes:

  func.func @main_graph(%arg0: tensor<10x10xf32>, %arg1: tensor<10x10xf32>) -> (tensor<2xi64>, tensor<20x10xf32>) {
    %0 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<10x20xf32>
    %1 = "onnx.Transpose"(%0) {perm = [1, 0]} : (tensor<10x20xf32>) -> tensor<20x10xf32>
    %2 = "onnx.Constant"() {value = dense<[10, 20]> : tensor<2xi64>} : () -> tensor<2xi64>
    return %2, %1 : tensor<2xi64>, tensor<20x10xf32>
  }


@imaihal (Collaborator) commented Nov 4, 2022

@chentong319 Sorry, please ignore the previous comment.

The static-dim case above is fine when fusing only two ops, but maybe fusing the three ops is needed to support the dynamic-dim case?

func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>)
{
    %1 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %3 = "onnx.Shape"(%1) : (tensor<?x?xf32>) -> tensor<2xi64>
    return %3, %2 : tensor<2xi64>, tensor<?x?xf32>
}

With the same lowering (-O3 --EmitONNXIR) this becomes:

  func.func @main_graph(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> (tensor<2xi64>, tensor<?x?xf32>) {
    %0 = "onnx.Concat"(%arg0, %arg1) {axis = 1 : si64} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %1 = "onnx.Transpose"(%0) {perm = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "onnx.Dim"(%0) {axis = 0 : si64} : (tensor<?x?xf32>) -> tensor<1xi64>
    %3 = "onnx.Dim"(%0) {axis = 1 : si64} : (tensor<?x?xf32>) -> tensor<1xi64>
    %4 = "onnx.Concat"(%2, %3) {axis = 0 : si64} : (tensor<1xi64>, tensor<1xi64>) -> tensor<2xi64>
    return %4, %1 : tensor<2xi64>, tensor<?x?xf32>
  }

@tungld (Collaborator) commented Nov 4, 2022

@imaihal did you see any improvement with your practical model?

@imaihal (Collaborator) commented Nov 4, 2022

> @imaihal did you see any improvement with your practical model?

@tungld Yes. I tested our practical model by manually fusing Concat, Transpose, and Shape into the ConcatShapeTranspose op. (This PR introduces the ConcatShapeTranspose op but does not rewrite the pattern automatically, right?)

It seems the elapsed time in the Concat-Transpose-Shape part becomes about 60% faster. I will check again and will share the result offline.

@chentong319 (Collaborator, Author) commented Nov 10, 2022

Thanks @imaihal for verifying the performance impact. I updated the PR. Could you and @tungld review it?
The automatic transformation part will be added in another PR.
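For illustration, one way such an automatic rewrite could look is sketched below. This is not the actual follow-up PR; the ONNXConcatShapeTransposeOp builder signature and the unprefixed attribute accessors (axisAttr/permAttr) are assumptions, and the real pass would also need to handle the optional start/end attributes of Shape and the case where Concat has other users.

    // Sketch only: fuse Concat -> {Shape, Transpose} into ConcatShapeTranspose.
    // The ONNXConcatShapeTransposeOp builder and accessor names are assumptions.
    struct FuseConcatShapeTranspose : public OpRewritePattern<ONNXConcatOp> {
      using OpRewritePattern<ONNXConcatOp>::OpRewritePattern;

      LogicalResult matchAndRewrite(
          ONNXConcatOp concatOp, PatternRewriter &rewriter) const override {
        // Match a Concat whose only users are one Shape and one Transpose.
        ONNXShapeOp shapeOp;
        ONNXTransposeOp transposeOp;
        for (Operation *user : concatOp->getUsers()) {
          if (auto s = dyn_cast<ONNXShapeOp>(user))
            shapeOp = s;
          else if (auto t = dyn_cast<ONNXTransposeOp>(user))
            transposeOp = t;
          else
            return failure();
        }
        if (!shapeOp || !transposeOp)
          return failure();

        // Result 0 of the fused op is the shape, result 1 the transposed data.
        auto fused = rewriter.create<ONNXConcatShapeTransposeOp>(
            concatOp.getLoc(),
            TypeRange{shapeOp.getType(), transposeOp.getType()},
            concatOp.getOperands(), concatOp.axisAttr(), transposeOp.permAttr());
        rewriter.replaceOp(shapeOp, fused.getResult(0));
        rewriter.replaceOp(transposeOp, fused.getResult(1));
        rewriter.eraseOp(concatOp);
        return success();
      }
    };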

@tungld (Collaborator) left a comment

@chentong319 thanks for proposing a fused op!

My three main comments are:

  • Creating an IndexExpr-based ShapeHelper so that our DimAnalysis can utilize it (see the sketch after this list).
  • Adding a lit test for the lowering of ConcatShapeTranspose.
  • ShapeOp may be decomposed into DimOp and Constant. Please make sure we fuse Concat, Shape, and Transpose before Shape vanishes. This is perhaps a topic for the next PR.
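As an illustration of the first bullet, the output dims could be derived with IndexExpr roughly as sketched below; this is only a sketch with an assumed name and signature, not the actual onnx-mlir ShapeHelper interface:

    // Sketch only (not the onnx-mlir ShapeHelper API): derive the output dims
    // of ConcatShapeTranspose with IndexExpr so that DimAnalysis could reuse
    // them. The function name and signature are illustrative assumptions.
    static void computeConcatShapeTransposeDims(ArrayRef<DimsExpr> inputDims,
        int64_t axis, ArrayRef<int64_t> perm, DimsExpr &concatDims,
        DimsExpr &transposeDims) {
      // Concat result: same dims as the first input, except that the extent
      // along `axis` is the sum of all inputs' extents along that axis.
      concatDims.assign(inputDims[0].begin(), inputDims[0].end());
      for (size_t i = 1; i < inputDims.size(); ++i)
        concatDims[axis] = concatDims[axis] + inputDims[i][axis];
      // Transpose result: the concatenated dims permuted by `perm`.
      transposeDims.clear();
      for (int64_t p : perm)
        transposeDims.emplace_back(concatDims[p]);
      // The Shape result is a 1-D tensor of length rank(concat); its element
      // values are exactly concatDims, which is what DimAnalysis would consume.
    }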

src/Conversion/ONNXToKrnl/Tensor/ConcatShapeTranspose.cpp: several outdated/resolved inline review comments (not shown).
Inline review comment on this hunk in src/Conversion/ONNXToKrnl/Tensor/ConcatShapeTranspose.cpp:

    DimsExpr outputTransposeDims(commonRank);
    auto permAttr = operandAdaptor.perm();
    for (uint64_t i = 0; i < commonRank; i++) {
      auto current = outputConcatDims[ArrayAttrIntVal(permAttr, i)];
Use a concrete type instead of auto here.
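For example (a minimal sketch of the suggested change, assuming outputConcatDims is a DimsExpr so its elements are IndexExpr):

    // Spell out the element type instead of relying on auto.
    IndexExpr current = outputConcatDims[ArrayAttrIntVal(permAttr, i)];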

@tungld (Collaborator) left a comment
LGTM! Thanks for the changes!

@chentong319 merged commit ea4c87c into onnx:main on Nov 14, 2022
@chentong319 deleted the merge-ops branch on November 14, 2022 at 14:53
@jenkins-droid (Collaborator)

Jenkins Linux amd64 Build #8573 [push] Op fusion experiment (#1... started at 08:53
Jenkins Linux ppc64le Build #7640 [push] Op fusion experiment (#1... started at 09:55
Jenkins Linux s390x Build #8589 [push] Op fusion experiment (#1... started at 09:53
Jenkins Linux amd64 Build #8573 [push] Op fusion experiment (#1... passed after 1 hr 17 min
Jenkins Linux s390x Build #8589 [push] Op fusion experiment (#1... passed after 1 hr 25 min
Jenkins Linux ppc64le Build #7640 [push] Op fusion experiment (#1... passed after 1 hr 41 min
