sgemm accuracy drops with AVX2 #1486

Open
penpornk opened this issue Nov 2, 2022 · 21 comments
Assignees
Labels
platform:cpu-x64 (Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64), sighting (Suspicious library behavior. Should be promoted to a bug when confirmed)

Comments

@penpornk

penpornk commented Nov 2, 2022

Summary

For some matrix sizes, sgemm accuracy drops with AVX2. The simplest case is when all elements in the result matrix should be the same. This can affect the accuracy of downstream applications, e.g., constant folding in TF.

Version

oneDNN v2.7.1 (commit a96c94f)

Environment

oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:

  • Threading: OpenMP
  • Max CPU ISA: AVX2
  • CPU make and model (try lscpu; if your lscpu does not list CPU flags,
    try running cat /proc/cpuinfo | grep flags | sort -u)
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
  • OS version (uname -a): Debian 5.18
  • Compiler version (gcc --version): gcc 12.2.0
  • CMake version (cmake --version): 3.24.2
  • CMake output log
-- DNNL_TARGET_ARCH: X64
-- DNNL_LIBRARY_NAME: dnnl
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) 
-- Could NOT find Doxyrest (missing: DOXYREST_EXECUTABLE) 
-- Could NOT find PythonInterp (missing: PYTHON_EXECUTABLE) (Required is at least version "2.7")
-- Could NOT find Sphinx (missing: SPHINX_EXECUTABLE) 
-- Enabled workload: TRAINING
-- Enabled primitives: ALL
-- Enabled primitive CPU ISA: ALL
-- Enabled primitive GPU ISA: ALL
-- Primitive cache is enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /usr/local/google/home/penporn/software/oneDNN/build
  • git hash (git log -1 --format=%H): a96c94f

Steps to reproduce

Call wrong_result() in the following code:

#include <cmath>
#include <cstdio>
#include "oneapi/dnnl/dnnl.h"

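// Prints row i of the row-major matrix `mat` (n columns) as runs of values
// that are equal to within 3e-7, e.g. "506.881195 (16 times), 506.879639 (2 times)".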
void print_row(float* mat, int m, int n, int i) {
  int count = 1;
  float* row_i = mat + (i * n);
  printf("Row %3d: ", i);
  for (int j = 1; j < n; ++j) {
    if (fabs(row_i[j] - row_i[j - 1]) > 3e-7) {
      printf("%f (%d times), ", row_i[j - 1], count);
      count = 1;
    } else {
      ++count;
    }
  }
  printf("%f (%d times)\n", row_i[n - 1], count);
}

void wrong_result() {
  const int m = 20;
  const int n = 18;
  const int k = 512;

  // Initialize matrices.
  float* A = new float[m * k];
  float* B = new float[k * n];
  float* C = new float[m * n];
  for (int i = 0; i < m * k; ++i) A[i] = 1.1f;
  for (int i = 0; i < k * n; ++i) B[i] = 0.9f;
  for (int i = 0; i < m * n; ++i) C[i] = 0.0f;

  // Set up matrix multiplication parameters.
  const float alpha = 1.0f;
  const float beta = 0.0f;
  const char transa = 'N';
  const char transb = 'N';
  const int lda = k;  // Row major.
  const int ldb = n;  // Row major.
  const int ldc = n;  // Row major.

  // Call oneDNN sgemm
  dnnl_sgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

  // Print results
  for (int i = 0; i < m; ++i) print_row(C, m, n, i);

  float diff = fabs(C[0] - C[m * n - 1]);
  printf("Absolute diff: %g\n", diff);
  printf("Relative diff: %g\n", diff / C[0]);

  // Free matrices
  delete[] A;
  delete[] B;
  delete[] C;
}
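
A minimal driver to build and run the reproducer might look like this (the file name, build flags, and library name below are assumptions, not part of the original report; ONEDNN_VERBOSE=1 produces the verbose lines shown under "Example output"):

int main() {
  wrong_result();
  return 0;
}

// Assumed build and run commands:
//   g++ -O2 -fopenmp repro.cpp -ldnnl -o repro
//   ONEDNN_VERBOSE=1 ./repro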

Observed behavior

The resulting C matrix does not have the same value for all elements. A different value appears in fringe rows and columns. (Which rows are fringe rows changes with m (the total number of rows of C) and the number of OpenMP threads.)

Example output

m = 20, #threads = 1:

onednn_verbose,info,oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,gemm_api,,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 dst_f32::blocked:ab:f0,,20x512:512x18,21.814
Row   0: 506.881195 (16 times), 506.879639 (2 times)
Row   1: 506.881195 (16 times), 506.879639 (2 times)
Row   2: 506.881195 (16 times), 506.879639 (2 times)
Row   3: 506.881195 (16 times), 506.879639 (2 times)
Row   4: 506.881195 (16 times), 506.879639 (2 times)
Row   5: 506.881195 (16 times), 506.879639 (2 times)
Row   6: 506.881195 (16 times), 506.879639 (2 times)
Row   7: 506.881195 (16 times), 506.879639 (2 times)
Row   8: 506.881195 (16 times), 506.879639 (2 times)
Row   9: 506.881195 (16 times), 506.879639 (2 times)
Row  10: 506.881195 (16 times), 506.879639 (2 times)
Row  11: 506.881195 (16 times), 506.879639 (2 times)
Row  12: 506.881195 (16 times), 506.879639 (2 times)
Row  13: 506.881195 (16 times), 506.879639 (2 times)
Row  14: 506.881195 (16 times), 506.879639 (2 times)
Row  15: 506.881195 (16 times), 506.879639 (2 times)
Row  16: 506.881195 (16 times), 506.879639 (2 times)
Row  17: 506.881195 (16 times), 506.879639 (2 times)
Row  18: 506.879639 (18 times)
Row  19: 506.879639 (18 times)
Absolute diff: 0.0015564
Relative diff: 3.07054e-06

m=100, #threads = 16:

onednn_verbose,info,oneDNN v2.7.1 (commit a96c94f39773df35760d037d1ebe580dcf79c137)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:16
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,gemm_api,,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 dst_f32::blocked:ab:f0,,100x512:512x18,35.9971
Row   0: 506.881195 (16 times), 506.879639 (2 times)
Row   1: 506.881195 (16 times), 506.879639 (2 times)
Row   2: 506.881195 (16 times), 506.879639 (2 times)
Row   3: 506.881195 (16 times), 506.879639 (2 times)
Row   4: 506.881195 (16 times), 506.879639 (2 times)
Row   5: 506.881195 (16 times), 506.879639 (2 times)
Row   6: 506.879639 (18 times)
Row   7: 506.881195 (16 times), 506.879639 (2 times)
Row   8: 506.881195 (16 times), 506.879639 (2 times)
Row   9: 506.881195 (16 times), 506.879639 (2 times)
Row  10: 506.881195 (16 times), 506.879639 (2 times)
Row  11: 506.881195 (16 times), 506.879639 (2 times)
Row  12: 506.881195 (16 times), 506.879639 (2 times)
Row  13: 506.879639 (18 times)
...
Row  91: 506.881195 (16 times), 506.879639 (2 times)
Row  92: 506.881195 (16 times), 506.879639 (2 times)
Row  93: 506.881195 (16 times), 506.879639 (2 times)
Row  94: 506.881195 (16 times), 506.879639 (2 times)
Row  95: 506.881195 (16 times), 506.879639 (2 times)
Row  96: 506.881195 (16 times), 506.879639 (2 times)
Row  97: 506.879639 (18 times)
Row  98: 506.879639 (18 times)
Row  99: 506.879639 (18 times)
Absolute diff: 0.0015564
Relative diff: 3.07054e-06

Expected behavior

All elements in the result matrix should have the same value.
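
In code terms, a check like the following (a sketch, added at the end of wrong_result() before the matrices are freed) would be expected to print "yes":

// Expected: yes. With the AVX2 tail behavior described above it prints no.
bool all_equal = true;
for (int i = 1; i < m * n; ++i) {
  if (C[i] != C[0]) { all_equal = false; break; }
}
printf("All elements equal: %s\n", all_equal ? "yes" : "no");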

penpornk added the "sighting" label on Nov 2, 2022
@vpirogov
Member

vpirogov commented Nov 3, 2022

Thank you for the report and the reproducer, @penpornk.

+@akharito, @aaraujom

@aaraujom
Contributor

aaraujom commented Nov 3, 2022

@penpornk - Thanks for the reproducer. We can reproduce the same output on our side. We are taking a look.

Tagging @dzarukin to keep him in the loop.

@mgouicem
Contributor

mgouicem commented Nov 7, 2022

Hi @penpornk and thanks for the report.
In oneDNN, we typically have no requirement on the order in which values are accumulated. In GEMM/Matmul/Convolution this can manifest itself as:

  • having different results on different platforms (e.g. avx2 vs avx512),
  • having different results depending on shapes (e.g. with fixed K and different M/N, we typically parallelize differently along K)
  • having different results depending on the position of the output elements (this is the case you are hitting, where tails have a different order of accumulation than full blocks).
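
All of these come down to float32 addition not being associative: summing the same terms in a different grouping can round differently. A tiny standalone illustration (plain C++, not oneDNN code):

#include <cstdio>

int main() {
  float a = 1.0f, b = 1e8f, c = -1e8f;
  // (a + b) rounds back to 1e8f because the spacing between adjacent floats
  // near 1e8 is 8, so the two groupings of the same three terms differ.
  printf("(a + b) + c = %g\n", (a + b) + c);  // prints 0
  printf("a + (b + c) = %g\n", a + (b + c));  // prints 1
  return 0;
}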

Adding requirements on the order of accumulation can, and typically does, have an impact on oneDNN performance. Could you clarify how this particular behavior wrt the order of accumulation affects your application? Can you confirm that the accuracy drop in the end-user application is indeed due to the different order of accumulation between tails and blocks? Or is it a more general issue with the order of accumulation?

Edit: in general, if the end user application accuracy is impacted by the order of accumulation, I would be tempted to say that this application is not very numerically stable. Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.

@rahulpalamuttam

Thanks @mgouicem.

This error prevents a constant compression optimization of a constant folded matrix-multiply whose LHS is a matrix with the rows broadcasted (so the rows are just repeats).

If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same.
From there, we can compress that constant in a further optimization, which, due to this discrepancy, we cannot do, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.

This is problematic specifically because constant folding in grappler relies on the TF MatMul kernel for CPUs which invokes GEMM.

Outside of compiler optimizations around constants that depend on bit-exactness (which could warrant not using gemm for such purposes), I suspect there are researchers relying on broadcasting their arrays before feeding them to a matrix-multiply.

Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.

Sure but so is non-determinism in the output of a matrix-multiply where the LHS has all rows repeated but the output does not.

I'm happy with a mode where we can guarantee determinism, but I'm also happy to hear more from folks working on Grappler, MLIR's TF dialect, other compilers leveraging gemm for constant folding, or researchers relying on the accuracy of operations.

@mgouicem
Contributor

mgouicem commented Nov 7, 2022

This error prevents a constant compression optimization of a constant folded matrix-multiply whose LHS is a matrix with the rows broadcasted (so the rows are just repeats).

If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same.
From there, we can compress that constant in a further optimization, which, due to this discrepancy, we cannot do, since the last row of the constant-folded matrix-multiply is not bit-exact with the other rows.

For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations. This would both save some flops and guarantee that all rows are identical. Is it something doable on your side?
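
In terms of the reproducer above, the swap would look roughly like this (a sketch, assuming every row of A is known to be identical; not a general-purpose replacement):

// Multiply only the first row of A (m = 1), then broadcast the result row,
// so every row of C is bit-identical by construction.
float* C_row = new float[n];
dnnl_sgemm('N', 'N', 1, n, k, 1.0f, A, k, B, n, 0.0f, C_row, n);
for (int i = 0; i < m; ++i)
  for (int j = 0; j < n; ++j) C[i * n + j] = C_row[j];
delete[] C_row;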

This is problematic specifically because constant folding in grappler relies on the TF MatMul kernel for CPUs which invokes GEMM.

Sure but so is non-determinism in the output of a matrix-multiply where the LHS has all rows repeated but the output does not.

I'm happy with a mode where we can guarantee determinism

If swapping broadcast and Matmul is not an option on your side, we can investigate whether preserving accumulation order in tails hurts performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation. Is that a viable option for TF?

Tagging @aaraujom and @akharito for their perspective on this suggestion.

@aaraujom
Contributor

aaraujom commented Nov 7, 2022

From the gemm side it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to be changed, but we would also have to be careful about how we perform parallel reductions.

On small sizes and with multithreading we might end up with the kernel executing one main block plus a tail. Low performance on the tail side would reduce overall performance, I think.

@rahulpalamuttam

For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations.

It's tricky because constant folding operates under the assumption that all inputs to an operation are constants, so it has to follow breadth-first order. We likely can't re-order this for all the ways a compiler, during the compilation process, can generate this IR fragment (for all such IR formats/dialects). It might be possible by adding a canonicalization rewrite pattern, but it's still cumbersome since any ML compiler leveraging gemm for constant folding would have to buy into this at all stages of IR transforms that deal with explicit matmul ops.

I also want to be mindful of the user experience of researchers since they need to be aware of this if they want accurate lhs broadcasted matmuls in practice.

we can investigate whether preserving accumulation order in tails hurts performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation.

Thanks, I completely appreciate the effort! I think an unoptimized implementation of the kernel can be used for constant folding, but I think someone on the TF team, or working on grappler, or the TF MLIR dialect might be a better authority.

From the gemm side it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to be changed, but we would also have to be careful about how we perform parallel reductions.

Right, I figured the order of accumulation is the tricky bit here.

On small sizes and with multithreading we might end up with the kernel executing one main block plus a tail. Low performance on the tail side would reduce overall performance, I think.

Makes sense, that ordering affects the distribution of work and we get throttled on the tail.

@mgouicem
Contributor

mgouicem commented Nov 7, 2022

Just to make double-sure we are trying to solve the right problem (@penpornk @rahulpalamuttam please confirm):
Is the issue with sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B)), in other words, if A rows are all equal, a user would expect all rows of the output to be equal too? Same with sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B)), where a user would expect that if B columns are all equal, all columns of the output should be equal too.

If this is correct, why is this property necessary? (Other than that it makes sense :-).) In particular, why is it required for constant folding, but not for non-constant computation?
Just trying to understand when this property is required, in order to understand what possible solutions we can offer and what the performance tradeoffs involved are.

Makes sense, that ordering affects the distribution of work and we get throttled on the tail.

From the oneDNN perspective, the issue is not with parallelization over K: sgemm implementations always add partial sums in order so that results are run-to-run reproducible (running the same program twice on the same system/environment with the same inputs returns the same result).
The issue here is mostly that for M/N tails, we increase the number of accumulators in registers by splitting the reduction over K in two. Having two sets of accumulator registers (one accumulating over even K indices, the other over odd K indices) allows the sgemm tail kernel to better interleave instructions and make better use of the CPU execution ports.
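
Schematically (a scalar sketch of the idea only, not the actual JIT kernel code), the full-block and tail paths accumulate along K like this:

// Full-block path: one accumulator, strictly in order over k.
float dot_in_order(const float* a, const float* b, int k) {
  float acc = 0.0f;
  for (int p = 0; p < k; ++p) acc += a[p] * b[p];
  return acc;
}

// Tail path: two accumulators over even/odd k, combined at the end.
// Mathematically the same sum, but rounded differently, hence the bitwise
// difference between tail and full-block outputs.
float dot_split(const float* a, const float* b, int k) {
  float acc_even = 0.0f, acc_odd = 0.0f;
  for (int p = 0; p + 1 < k; p += 2) {
    acc_even += a[p] * b[p];
    acc_odd += a[p + 1] * b[p + 1];
  }
  if (k % 2) acc_even += a[k - 1] * b[k - 1];
  return acc_even + acc_odd;
}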

@rahulpalamuttam

rahulpalamuttam commented Nov 7, 2022

Is the issue with sgemm(broadcast_rows(A), B) != broadcast_rows(sgemm(A, B)), in other words, if A rows are all equal, a user would expect all rows of the output to be equal too?

Yes

Same with sgemm(A, broadcast_cols(B)) != broadcast_cols(sgemm(A, B))

I haven't explicitly tested it with cols, but I imagine it happens as well.

In particular, why is it required for constant folding, but not for non-constant computation?

It's required for both, which is why for the latter I want input from ML practitioners and not just compiler folk. Having been one at some point, I know that tiny deltas like these can balloon into bigger non-determinism in model training, which is not fun to deal with. Of course, outside the world of ML and compilers, sgemm is important in a variety of other fields :)

Just trying to understand when this property is required, in order to understand what possible solutions we can offer and what the performance tradeoffs involved are.

At the risk of over-claiming, I think this property is always required/preferred, but I agree there are also instances where performance matters more than absolute, deterministic accuracy.

We absolutely want deterministic accuracy for constant folding, so I think a flag or a different implementation to turn on for TF kernels used during constant folding should be OK in the short term.

FWIW, the original bug was in an ML research use case over a year ago. I just re-discovered it for constant folding.
This is why I want to say it's always required/preferred, but folks with more authority in both areas can weigh in.

@rahulpalamuttam

ping @mgouicem

we can investigate whether preserving accumulation order in tails hurts performance too much

I'm curious whether there are any initial investigation results.

@mgouicem
Contributor

Hi @rahulpalamuttam, sorry, no results yet. We will update this thread as we get some data.

@rahulpalamuttam

Hi @mgouicem,

Happy new year!

I was wondering if you were able to gather any data for this issue.

Cheers,

Rahul P

@aaraujom
Contributor

aaraujom commented Jan 9, 2023

Hi @rahulpalamuttam - I didn't look into this issue with regard to performance. However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't present the issue you mentioned.

@rahulpalamuttam

However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't present the issue you mentioned.

Do you mind sharing any rough outline of timelines here?

I'd like a resolution to this bug, and if it means changing the matmul kernels used, it would be nice to know when TF should switch over.

@penpornk

@aaraujom
Contributor

aaraujom commented Jan 11, 2023

Do you mind sharing any rough outline of timelines here?

From the oneDNN side we have plans to have avx2 using brgemm kernels by the end of Q1, if everything goes well.

I'd like a resolution to this bug

Although the results are not bitwise identical for the scenario in the reproducer, the results computed are actually more accurate for the tail parts.

The exact value of each entry of the result matrix C is $506.88 = \frac{11}{10} \cdot \frac{9}{10} \cdot 512$. The error at the main parts is $506.881195 - 506.88 \approx 0.001195$ and the error at the tail parts is $506.88 - 506.879639 \approx 0.000361$. The tails are more accurate because we can accumulate in 2 sets of registers and then add them together instead of accumulating one by one.

Changing to a kernel that always accumulates in the same order (along the k-dimension) wouldn't necessarily improve accuracy, but it would make all elements of the matrix the same.

@mgouicem
Contributor

Happy new year all! :-)

@aaraujom, I don't think this is an accuracy issue, but mostly a semantic issue.
@rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see this comment). Hence, a deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.

@rahulpalamuttam

rahulpalamuttam commented Jan 12, 2023

@rahulpalamuttam wants to compute a broadcast operation followed by a matmul operation (see #1486 (comment)). Hence, a deterministic dot product within an sgemm computation is expected here. The accumulation order / accuracy does not matter here, as long as we use the same order across rows/columns.

+100

@WilliamTambellini
Contributor

FYI: Facing the same issue with oneDNN 3.0.0.
Question: does it matter if M, K, or N is a multiple of 8?

@aaraujom
Contributor

@WilliamTambellini - The sizes of M and N are relevant. If N % 16 > 8 and M % 6 > 3, that would avoid the issue, as far as I understand.

The problem comes from accumulating along the K dimension in different orders in kernel_16x[123] and kernel_8x* from jit_avx_gemm_f32.cpp. The accumulation order difference comes from here, if I remember correctly. (Note that the kernels are written in column-major order and dnnl_sgemm expects row-major order, so swap M and N when checking the kernels.)

I'm not sure whether I missed another part of the kernels that changes accumulation order, but if the above doesn't work, avoiding tail cases completely (N % 16 == 0 and M % 6 == 0) should avoid any special tail handling that changes the accumulation order in the K dimension.

Anyways, that's not a fix. We are adding avx2 BRGEMM support for matmul, but it is not quite ready yet.

@Shreyas-fuj
Contributor

Hi @aaraujom ,
Is the fix ready for avx2 brgemm yet? Can you please also point out the area in brgemm where this accumulation issue can be handled?

@aaraujom
Contributor

Is the fix ready for avx2 brgemm yet?

Unfortunately not yet for matmul. We added the brgemm avx2 kernels, but we still need some work to enable them for matmul.

Can you please also point out the area in brgemm where this accumulation issue can be handled?

brgemm kernels are different from gemm kernels. brgemm kernels keep the same order of accumulation along the reduce dimension (k-dim), which avoids the issue reported above.

vpirogov added the "platform:cpu-x64" label on Mar 29, 2024