QuantizeLinear acceleration #2967
Conversation
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
It seems there was a precision loss in the backend test for DynamicQuantizeLinear. I see this in Jenkins s390x:
Yes, and I wonder if we should attempt to change the test in ONNX to enable this optimization.
    !DISABLE_FAST_MATH_FOR_QL && isa<FloatType>(inputElementType);
if (useOneOverScale) {
  Value one = create.math.constant(inputElementType, 1.0);
  oneOverScale = create.math.div(one, scale);
I wonder whether we can get slightly better performance if we prepare a vector here instead of a scalar. I guess create.math.mul(x, oneOverScale) in the loop will splat oneOverScale, whose type is scalar.
Good idea, I can check if the splat is migrated outside of the loop.
No splat in the innermost loop; they are migrated out.
scf.for %arg3 = %c0_6 to %c65536_7 step %c16 {
%8 = vector.load %reshape[%arg3] : memref<65536xf32>, vector<16xf32>
%9 = arith.mulf %8, %6 : vector<16xf32>
%10 = vector.shape_cast %9 : vector<16xf32> to vector<4x4xf32>
%11 = vector.extract %10[0] : vector<4xf32> from vector<4x4xf32>
%12 = "krnl.round_even"(%11) : (vector<4xf32>) -> vector<4xf32>
%13 = vector.insert %12, %10 [0] : vector<4xf32> into vector<4x4xf32>
%14 = vector.extract %10[1] : vector<4xf32> from vector<4x4xf32>
%15 = "krnl.round_even"(%14) : (vector<4xf32>) -> vector<4xf32>
%16 = vector.insert %15, %13 [1] : vector<4xf32> into vector<4x4xf32>
%17 = vector.extract %10[2] : vector<4xf32> from vector<4x4xf32>
%18 = "krnl.round_even"(%17) : (vector<4xf32>) -> vector<4xf32>
%19 = vector.insert %18, %16 [2] : vector<4xf32> into vector<4x4xf32>
%20 = vector.extract %10[3] : vector<4xf32> from vector<4x4xf32>
%21 = "krnl.round_even"(%20) : (vector<4xf32>) -> vector<4xf32>
%22 = vector.insert %21, %19 [3] : vector<4xf32> into vector<4x4xf32>
%23 = vector.shape_cast %22 : vector<4x4xf32> to vector<16xf32>
%24 = arith.addf %23, %7 : vector<16xf32>
%25 = arith.maxnumf %24, %cst_0 : vector<16xf32>
%26 = arith.minnumf %25, %cst : vector<16xf32>
%27 = arith.fptoui %26 : vector<16xf32> to vector<16xi32>
%28 = arith.trunci %27 : vector<16xi32> to vector<16xi8>
%29 = builtin.unrealized_conversion_cast %28 : vector<16xi8> to vector<16xui8>
vector.store %29, %reshape_5[%arg3] : memref<65536xui8>, vector<16xui8>
}
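The lowered loop above corresponds, roughly, to the following NumPy sketch of uint8 QuantizeLinear. Note that quantize_linear, its parameters, and the fast_math switch are illustrative names chosen here, not onnx-mlir API; the sketch only shows the reciprocal-hoisting idea under discussion.

```python
import numpy as np

def quantize_linear(x: np.ndarray, scale: float, zero_point: int,
                    fast_math: bool = True) -> np.ndarray:
    """Sketch of y = clamp(round_even(x / scale) + zero_point, 0, 255)."""
    if fast_math:
        one_over_scale = 1.0 / scale  # reciprocal hoisted out of the loop
        scaled = x * one_over_scale   # per-element mul is cheaper than div
    else:
        scaled = x / scale            # exact per-element division, slower
    rounded = np.rint(scaled)         # round half to even, like krnl.round_even
    shifted = rounded + zero_point
    clamped = np.clip(shifted, 0.0, 255.0)
    return clamped.astype(np.uint8)
```

The two paths agree for almost all inputs; they can differ only when x / scale and x * (1/scale) round to different floats right at a rounding boundary.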
Great! Thanks!
FYI, on z16 it went from 95us to 14us when using the new option.
@tungld it's ready for another review; I made the flag optional for the moment.
LGTM!
Jenkins Linux amd64 Build #15832 [push] QuantizeLinear accelerat... started at 04:33
Jenkins Linux ppc64le Build #14862 [push] QuantizeLinear accelerat... started at 05:45
Jenkins Linux s390x Build #15835 [push] QuantizeLinear accelerat... started at 05:33
Jenkins Linux amd64 Build #15832 [push] QuantizeLinear accelerat... passed after 1 hr 13 min
Jenkins Linux s390x Build #15835 [push] QuantizeLinear accelerat... passed after 1 hr 36 min
Jenkins Linux ppc64le Build #14862 [push] QuantizeLinear accelerat... passed after 2 hr 23 min
Simple change that computes the reciprocal needed for the scale factor outside of the inner loop. It roughly cuts the time of QuantizeLinear in half.
Default is off; this can be enabled with the
-O3 -enable-fast-math
option, which at this time only covers the reciprocal for QuantizeLinear and DynamicQuantizeLinear. Added a lit test with this option on.
At some point we may want to turn this on by default, but it breaks 2 backend tests (because the values are just at the border between 2 quantized values). I opened an issue in ONNX to explore if we can fix it at the source. [ https://github.com/onnx/onnx/issues/6433 ]
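Why border values break: when the scaled input lands exactly on a rounding tie, round-half-to-even picks the even neighbor, so a one-ULP difference between x / scale and x * (1/scale) can shift the output by a full quantization step. A minimal NumPy illustration of the tie sensitivity (illustrative values, not taken from the failing tests):

```python
import numpy as np

# 2.5 is exactly halfway between 2 and 3: round-half-to-even picks 2.
tie = np.float64(2.5)
at_tie = np.rint(tie)                       # 2.0 (ties round to even)

# One ULP above the tie no longer ties, and rounds up to 3 instead.
just_above = np.rint(np.nextafter(tie, 3))  # 3.0

print(at_tie, just_above)
```

This is why a scaled value sitting exactly on the border between 2 quantized values is fragile: any formulation that perturbs it by even one ULP flips the result.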