
QuantizeLinear acceleration #2967

Merged · 42 commits · Oct 11, 2024

Conversation

@AlexandreEichenberger commented Oct 4, 2024

Simple change that computes the reciprocal needed for the scale factor outside of the inner loop, roughly cutting the time of QuantizeLinear in half.

The default is off; this can be enabled with the -O3 -enable-fast-math options, which at this time only affect the reciprocal computation for QuantizeLinear and DynamicQuantizeLinear.

Added a lit test with this option on.

At some point, we may want to turn this on by default, but it currently breaks 2 backend tests (because the values sit right at the border between 2 quantized values). I opened an issue in ONNX to explore whether we can fix it at the source. [ https://github.com/onnx/onnx/issues/6433 ]
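The transformation can be sketched in NumPy as follows. This is a hedged illustration of the idea, not the actual onnx-mlir lowering; the function names and the uint8 output range are assumptions for the sketch:

```python
import numpy as np

def quantize_div(x, scale, zero_point):
    # Baseline: one division per element inside the (vectorized) loop.
    return np.clip(np.rint(x / scale) + zero_point, 0, 255).astype(np.uint8)

def quantize_fast(x, scale, zero_point):
    # Fast-math variant: compute the reciprocal once, outside the loop,
    # and replace each per-element division with a multiplication.
    one_over_scale = 1.0 / scale
    return np.clip(np.rint(x * one_over_scale) + zero_point, 0, 255).astype(np.uint8)
```

A hardware divide is typically much slower than a multiply, which is where the roughly 2x win comes from. The catch is that x * (1/scale) is not bit-identical to x / scale, which is why the change is gated behind -enable-fast-math.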

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@tungld commented Oct 7, 2024

It seems there is precision loss in the backend test for DynamicQuantizeLinear. I see this in the Jenkins s390x build:

ref_outputs = [array([153, 255,   0,  26, 221, 179], dtype=uint8), array(0.01960784, dtype=float32), array(153, dtype=uint8)]
outputs = [array([153, 255,   0,  25, 221, 179], dtype=uint8), array(0.01960784, dtype=float32), array(153, dtype=uint8)]
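The off-by-one difference (26 vs 25) is consistent with x * (1/scale) landing just below a round-to-nearest-even boundary where x / scale lands on or above it. A minimal sketch: the scale is taken from the failing output above (~1/51); the input value 0.5 is a hypothetical example, not necessarily the one in the test:

```python
import numpy as np

s = np.float32(0.01960784)               # scale from the output above (~1/51)
x = np.float32(0.5)                      # hypothetical input near a .5 boundary

exact = np.rint(x / s)                   # reference path: divide
fast = np.rint(x * (np.float32(1) / s))  # fast-math path: multiply by reciprocal

# x / s lands essentially on 25.5, so the two paths can disagree by one
# quantization step, which is the kind of off-by-one seen in the Jenkins diff.
print(exact, fast)
```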

@AlexandreEichenberger

Yes, and I wonder if we should attempt to change the test in ONNX to enable this optimization.

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
    !DISABLE_FAST_MATH_FOR_QL && isa<FloatType>(inputElementType);
    if (useOneOverScale) {
      Value one = create.math.constant(inputElementType, 1.0);
      oneOverScale = create.math.div(one, scale);
@tungld commented Oct 8, 2024

I wonder whether we can get slightly better performance if we prepare a vector here instead of a scalar. I guess that in the loop, create.math.mul(x, oneOverScale); will splat oneOverScale, whose type is scalar.

@AlexandreEichenberger commented Oct 8, 2024

Good idea, I can check if the splat is migrated outside of the loop.

@AlexandreEichenberger commented

No splats in the innermost loop; they are hoisted out:

    scf.for %arg3 = %c0_6 to %c65536_7 step %c16 {
      %8 = vector.load %reshape[%arg3] : memref<65536xf32>, vector<16xf32>
      %9 = arith.mulf %8, %6 : vector<16xf32>
      %10 = vector.shape_cast %9 : vector<16xf32> to vector<4x4xf32>
      %11 = vector.extract %10[0] : vector<4xf32> from vector<4x4xf32>
      %12 = "krnl.round_even"(%11) : (vector<4xf32>) -> vector<4xf32>
      %13 = vector.insert %12, %10 [0] : vector<4xf32> into vector<4x4xf32>
      %14 = vector.extract %10[1] : vector<4xf32> from vector<4x4xf32>
      %15 = "krnl.round_even"(%14) : (vector<4xf32>) -> vector<4xf32>
      %16 = vector.insert %15, %13 [1] : vector<4xf32> into vector<4x4xf32>
      %17 = vector.extract %10[2] : vector<4xf32> from vector<4x4xf32>
      %18 = "krnl.round_even"(%17) : (vector<4xf32>) -> vector<4xf32>
      %19 = vector.insert %18, %16 [2] : vector<4xf32> into vector<4x4xf32>
      %20 = vector.extract %10[3] : vector<4xf32> from vector<4x4xf32>
      %21 = "krnl.round_even"(%20) : (vector<4xf32>) -> vector<4xf32>
      %22 = vector.insert %21, %19 [3] : vector<4xf32> into vector<4x4xf32>
      %23 = vector.shape_cast %22 : vector<4x4xf32> to vector<16xf32>
      %24 = arith.addf %23, %7 : vector<16xf32>
      %25 = arith.maxnumf %24, %cst_0 : vector<16xf32>
      %26 = arith.minnumf %25, %cst : vector<16xf32>
      %27 = arith.fptoui %26 : vector<16xf32> to vector<16xi32>
      %28 = arith.trunci %27 : vector<16xi32> to vector<16xi8>
      %29 = builtin.unrealized_conversion_cast %28 : vector<16xi8> to vector<16xui8>
      vector.store %29, %reshape_5[%arg3] : memref<65536xui8>, vector<16xui8>
    }

@tungld commented

Great! Thanks!

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@AlexandreEichenberger

FYI, on z16 the time went from 95us to 14us when using -enable-fast-math in combination with the HW instruction for rounding. Without the HW instruction, -enable-fast-math results in 37us.
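For reference, the reported timings work out to the following speedups over the 95us baseline (simple arithmetic on the numbers above):

```python
baseline_us = 95.0  # reported z16 time before the change
fast_hw_us = 14.0   # -enable-fast-math + HW round instruction
fast_sw_us = 37.0   # -enable-fast-math without HW round

print(f"speedup with HW round: {baseline_us / fast_hw_us:.1f}x")     # 6.8x
print(f"speedup without HW round: {baseline_us / fast_sw_us:.1f}x")  # 2.6x
```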

@AlexandreEichenberger

@tungld it's ready for another review; I made the flag optional for the moment.

@tungld left a review

LGTM!

@tungld tungld merged commit c3dbcf8 into onnx:main Oct 11, 2024
7 checks passed
@jenkins-droid

Jenkins Linux amd64 Build #15832 [push] QuantizeLinear accelerat... started at 04:33

@jenkins-droid

Jenkins Linux ppc64le Build #14862 [push] QuantizeLinear accelerat... started at 05:45

@jenkins-droid

Jenkins Linux s390x Build #15835 [push] QuantizeLinear accelerat... started at 05:33

@jenkins-droid

Jenkins Linux amd64 Build #15832 [push] QuantizeLinear accelerat... passed after 1 hr 13 min

@jenkins-droid

Jenkins Linux s390x Build #15835 [push] QuantizeLinear accelerat... passed after 1 hr 36 min

@jenkins-droid

Jenkins Linux ppc64le Build #14862 [push] QuantizeLinear accelerat... passed after 2 hr 23 min

3 participants