sgemm accuracy drops with AVX2 #1486
Hi @penpornk and thanks for the report.
Adding requirements on the order of accumulation can have, and typically does have, an impact on oneDNN performance. Could you clarify how this particular behavior wrt the order of accumulation affects your application? Can you confirm that the accuracy drop in the end-user application is indeed due to the different order of accumulation between tails and blocks? Or is it a more general issue wrt order of accumulation? Edit: in general, if the end-user application's accuracy is impacted by the order of accumulation, I would be tempted to say that this application is not very numerically stable. Even if we fix the order of accumulation in oneDNN, we still might pick the "wrong" order for a given application to reach maximum accuracy.
Thanks @mgouicem. This error prevents a constant compression optimization of a constant-folded matrix-multiply whose LHS is a matrix with the rows broadcasted (so the rows are just repeats). If the LHS of a matrix-multiply has rows that are all the same, then the output will also have rows that are all the same. This is problematic specifically because constant folding in Grappler relies on the TF MatMul kernel for CPUs, which invokes GEMM. Outside of compiler optimizations around constants which depend on bit-exactness, which could warrant not using gemm for such purposes, I suspect there are researchers relying on broadcasting their arrays before feeding them to a matrix-multiply.
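The property at stake here comes down to floating-point addition not being associative. A minimal pure-Python illustration (not oneDNN code) of how the same sum, accumulated in two different orders, can produce different bits:

```python
# Illustration: floating-point addition is not associative, so the same
# set of addends accumulated in two different orders can differ bitwise.
vals = [1e16, 0.5, -1e16]

# Strict left-to-right accumulation: 1e16 + 0.5 rounds back to 1e16
# (0.5 is below half an ulp at that magnitude), so the 0.5 is lost.
left_to_right = 0.0
for v in vals:
    left_to_right += v

# A different order cancels the large values first and keeps the 0.5.
reordered = (vals[0] + vals[2]) + vals[1]

print(left_to_right)  # 0.0
print(reordered)      # 0.5
```

If two output elements of a matmul sum the same products but in different orders (e.g., one via the blocked kernel and one via the tail path), they can end up with different values even though they are mathematically equal.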
Sure, but so is non-determinism in the output of a matrix-multiply where the LHS has all rows repeated but the output rows differ. I'm happy with a mode where we can guarantee determinism, but I'm also happy to hear more from folks working on Grappler, MLIR's TF dialect, other compilers leveraging gemm for constant folding, or researchers relying on the accuracy of operations.
For this particular case, it would seem better to swap the matrix-multiplication and broadcast operations. This would both save some flops and guarantee that all rows are identical. Is it something doable on your side?
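The suggested rewrite can be sketched in a few lines of plain Python (hypothetical helper names, plain lists rather than any real tensor type): multiply the single source row once, then broadcast the 1 x n result, instead of broadcasting first and multiplying an m x k matrix.

```python
# Sketch of the suggested reordering: matmul first, broadcast after.
def matvec_row(row, B):
    """Compute row (1 x k) times B (k x n) as a single output row."""
    k, n = len(B), len(B[0])
    return [sum(row[i] * B[i][j] for i in range(k)) for j in range(n)]

row = [1.0, 2.0, 3.0]
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

out_row = matvec_row(row, B)            # computed once: 1 x n
C = [list(out_row) for _ in range(4)]   # broadcast AFTER the matmul

# Every row of C is bitwise identical by construction,
# and we did 1/m of the flops of the broadcast-first version.
assert all(r == out_row for r in C)
```

This guarantees identical rows regardless of how the underlying gemm orders its accumulations, because each output value is computed exactly once.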
If swapping broadcast and MatMul is not an option on your side, we can investigate whether preserving accumulation order in tails hurts performance too much. If it does, we would likely have to expose a mode as you mentioned, which would likely dispatch an unoptimized implementation. Is that a viable option for TF? @aaraujom @akharito, what is your perspective on this suggestion?
From the gemm side it would be non-trivial to guarantee the same order of computations, and it would have performance implications. Not only would tail handling have to be changed, but we would have to be careful about how we perform parallel reductions as well. On small sizes and with multithreading we might end up with a kernel executing one main block plus a tail. Low performance on the tail side would reduce overall performance, I think.
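A rough pure-Python model (an assumption for illustration, not the actual oneDNN kernel) of why a vectorized main block plus a scalar tail accumulates in a different order than a purely scalar loop:

```python
# Model of a 4-wide "SIMD" dot product: per-lane partial sums over the
# main block, a lane reduction, then a strictly sequential scalar tail.
def dot_simd4(a, b):
    n = len(a)
    lanes = [0.0, 0.0, 0.0, 0.0]
    i = 0
    while i + 4 <= n:                 # main block: lane-wise accumulation
        for l in range(4):
            lanes[l] += a[i + l] * b[i + l]
        i += 4
    acc = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])  # lane reduction
    while i < n:                      # tail: sequential order
        acc += a[i] * b[i]
        i += 1
    return acc

# Purely scalar reference: strict left-to-right accumulation.
def dot_scalar(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

a = [1e16, 0.5, -1e16, 0.5, 1.0]
b = [1.0] * 5
print(dot_simd4(a, b), dot_scalar(a, b))  # the two orders disagree
```

With k = 5 the SIMD path handles four elements in the blocked order and one in the tail, so an output element computed entirely by the tail path (a fringe element) follows a different rounding sequence than one computed by the blocked path.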
It's tricky because constant folding operates under the assumption that all inputs to an operation are constants, so it has to follow breadth-first order. We likely can't re-order this for all the ways a compiler, during the compilation process, can generate this IR fragment (for all such IR formats/dialects). It might be possible by adding a canonicalization rewrite pattern, but it's still cumbersome since any ML compiler leveraging gemm for constant folding would have to buy into this at all stages of IR transforms that deal with explicit matmul ops. I also want to be mindful of the user experience of researchers, since they need to be aware of this if they want accurate LHS-broadcasted matmuls in practice.
Thanks, I completely appreciate the effort! I think an unoptimized implementation of the kernel can be used for constant folding, but I think someone on the TF team, or working on grappler, or the TF MLIR dialect might be a better authority.
Right, I figured the order of accumulation is the tricky bit here.
Makes sense; that ordering affects the distribution of work, and we get throttled on the tail.
Just to make double-sure we are trying to solve the right problem (@penpornk @rahulpalamuttam please confirm): if this is correct, why is this property necessary? (Other than that it makes sense :-) .) In particular, why is it required for constant folding, but not for non-constant computation?
From the oneDNN perspective, the issue is not with parallelization over K: sgemm implementations always add partial sums in order, so that results are run-to-run reproducible (running the same program twice on the same system/environment with the same inputs returns the same result).
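Run-to-run reproducibility with parallelism hinges on combining partial sums in a fixed order, regardless of which worker finishes first. A small sketch (an illustrative assumption, not oneDNN's actual scheme):

```python
# Threads may finish in any order, but partial sums are combined strictly
# in chunk-index order, so the final result is run-to-run reproducible.
from concurrent.futures import ThreadPoolExecutor

def reproducible_sum(vals, nchunks=4):
    chunks = [vals[i::nchunks] for i in range(nchunks)]  # fixed partition
    with ThreadPoolExecutor(max_workers=nchunks) as ex:
        partials = list(ex.map(sum, chunks))  # map preserves chunk order
    acc = 0.0
    for p in partials:          # combine in a fixed order every run
        acc += p
    return acc

vals = [0.1 * i for i in range(1000)]
r1 = reproducible_sum(vals)
r2 = reproducible_sum(vals)
assert r1 == r2  # bitwise identical on repeated runs, same machine
```

Note that reproducibility here is a weaker property than the one the issue asks for: each run returns the same bits, but two elements computed with different block/tail splits within one run can still differ.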
Yes
I haven't explicitly tested it with cols, but I imagine it happens as well.
It's required for both, which is why for the latter I want input from ML practitioners and not just compiler folk. Having been one at some point, tiny deltas like those can balloon into bigger non-determinisms in model training, which is not so fun to deal with. Ofc, outside of the world of ML and compilers, sgemm is important in a variety of other fields :)
At risk of over-claiming, I think this property is always required/preferred, but I agree there are also instances where performance matters more than absolute, deterministic accuracy. We absolutely want deterministic accuracy for constant folding, so I think a flag or a different implementation to turn on for TF kernels used during constant folding should be OK in the short term. Fwiw, the original bug was in an ML research use case over a year ago. I just re-discovered it again for constant folding.
ping @mgouicem
I'm curious whether there are some initial investigation results.
Hi @rahulpalamuttam, sorry, no results yet. We will update this thread as we get some data.
Hi @mgouicem, Happy new year! I was wondering if you were able to gather any data for this issue. Cheers, Rahul P
Hi @rahulpalamuttam - I haven't looked into the performance side of this issue yet. However, we have internal plans to move the matmul and inner product implementations to brgemm kernels, which don't have the accumulation-order issue you mentioned.
Do you mind sharing a rough outline of the timelines here? I'd like a resolution to this bug, and if it means changing the matmul kernels used, it would be nice to know when TF should switch over.
From the oneDNN side, we have plans to have avx2 using brgemm kernels by the end of Q1, if everything goes well.
Although the results are not bitwise identical for the scenario in the reproducer, the results computed for the tail parts are actually more accurate. The exact entries of the result matrix C are: … Changing to a kernel that always accumulates in the same order (along the k-dimension) wouldn't necessarily improve accuracy, but it would make the elements of the matrix the same.
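The point that a different accumulation order can be *more* accurate, not less, is easy to demonstrate in plain Python (an illustration of the general effect, not of the specific kernels):

```python
import math

# Ten copies of 0.1 summed strictly left-to-right do not give exactly 1.0,
# while a compensated (correctly rounded) summation does.
vals = [0.1] * 10

sequential = 0.0
for v in vals:
    sequential += v          # strict fixed-order accumulation

accurate = math.fsum(vals)   # correctly rounded sum, different "order"

print(sequential)  # 0.9999999999999999
print(accurate)    # 1.0
```

So enforcing one fixed order would make all elements match each other, but the matched value is not necessarily the most accurate one.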
Happy new year all! :-) @aaraujom, I don't think this is an accuracy issue; it's mostly a semantic issue.
+100
FYI: facing the same issue with oneDNN 3.0.0.
@WilliamTambellini - The sizes of … The problem comes from accumulating along … I'm not sure if I missed another part of the kernels that changes accumulation order, but if the above doesn't work, avoiding tail cases completely (…). Anyways, that's not a fix. We are adding avx2 BRGEMM support for matmul, but it is not quite ready yet.
Hi @aaraujom,
Unfortunately not yet for matmul. We added brgemm avx2 kernels, but still need some work to enable them for matmul.
brgemm kernels are different from gemm kernels. brgemm kernels keep the same order of accumulation along the reduce dimension (k-dim), which avoids the issue reported above.
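A simplified model (an assumption for illustration, not the brgemm kernel itself) of what "same order of accumulation along k for every element" buys: with a fixed k-order, rows built from identical inputs stay bitwise identical.

```python
# Every output element accumulates along k in the same fixed order
# 0..k-1, so broadcasted (identical) input rows yield identical
# output rows, even with ill-conditioned values.
def matmul_fixed_order(A, B):
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):        # fixed k-order for every element
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

row = [1e16, 0.5, -1e16, 0.5]         # values sensitive to ordering
A = [row[:] for _ in range(3)]        # LHS with broadcasted rows
B = [[1.0], [1.0], [1.0], [1.0]]
C = matmul_fixed_order(A, B)
assert all(r == C[0] for r in C)      # all rows bitwise identical
```

The individual values may still carry rounding error (here each element is 0.5 rather than the exact 1.0), but identical inputs produce identical outputs, which is the semantic property this issue is about.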
Summary
For some matrix sizes, sgemm accuracy drops with AVX2. The simplest case is when all elements in the result matrix should be the same. This can affect the accuracy of downstream applications, e.g., constant folding in TF.
Version
oneDNN v2.7.1 (commit a96c94f)
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
- CPU flags (lscpu; if your lscpu does not list CPU flags, try running cat /proc/cpuinfo | grep flags | sort -u)
- OS version (uname -a): Debian 5.18
- Compiler version (gcc --version): gcc 12.2.0
- CMake version (cmake --version): 3.24.2
- oneDNN commit (git log -1 --format=%H): a96c94f

Steps to reproduce
Call wrong_results() in the following code:

Observed behavior
The resulting C matrix doesn't have all the same values. A different value appears in fringe rows and columns. (Fringe rows change with m (the total number of rows of C) and the number of OpenMP threads.)

Example output
m = 20, #threads = 1:

m = 100, #threads = 16:

Expected behavior
All elements in the matrix should have the same value.