Ensure that TryGetContainableHWIntrinsicOp is non-mutating #79363
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This removes a possible mutating side-effect from TryGetContainableHWIntrinsicOp by allowing CreateScalarUnsafe to be directly contained. This helps ensure accidental side effects don't happen and improves codegen for cases like the following:

Unsafe.SkipInit(out this);
Vector128<float> x = Vector128.Create(m11, m12, m13, m14);
x.StoreUnsafe(ref M11);
Vector128<float> y = Vector128.Create(m21, m22, m23, m24);
y.StoreUnsafe(ref M21);
Vector128<float> z = Vector128.Create(m31, m32, m33, m34);
z.StoreUnsafe(ref M31);
Vector128<float> w = Vector128.Create(m41, m42, m43, m44);
w.StoreUnsafe(ref M41);

Before we generated:

3809 cmp byte ptr [rcx], cl
C4E37121C210 vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320 vinsertps xmm0, xmm0, xmm3, 32
C5FA104C2428 vmovss xmm1, dword ptr [rsp+28H]
C4E37921C130 vinsertps xmm0, xmm0, xmm1, 48
C5F81101 vmovups xmmword ptr [rcx], xmm0
C5FA10442430 vmovss xmm0, dword ptr [rsp+30H]
C5FA104C2438 vmovss xmm1, dword ptr [rsp+38H]
C4E37921C110 vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2440 vmovss xmm1, dword ptr [rsp+40H]
C4E37921C120 vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2448 vmovss xmm1, dword ptr [rsp+48H]
C4E37921C130 vinsertps xmm0, xmm0, xmm1, 48
C5F8114110 vmovups xmmword ptr [rcx+10H], xmm0
C5FA10442450 vmovss xmm0, dword ptr [rsp+50H]
C5FA104C2458 vmovss xmm1, dword ptr [rsp+58H]
C4E37921C110 vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2460 vmovss xmm1, dword ptr [rsp+60H]
C4E37921C120 vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2468 vmovss xmm1, dword ptr [rsp+68H]
C4E37921C130 vinsertps xmm0, xmm0, xmm1, 48
C5F8114120 vmovups xmmword ptr [rcx+20H], xmm0
C5FA10442470 vmovss xmm0, dword ptr [rsp+70H]
C5FA104C2478 vmovss xmm1, dword ptr [rsp+78H]
C4E37921C110 vinsertps xmm0, xmm0, xmm1, 16
C5FA108C2480000000 vmovss xmm1, dword ptr [rsp+80H]
C4E37921C120 vinsertps xmm0, xmm0, xmm1, 32
C5FA108C2488000000 vmovss xmm1, dword ptr [rsp+88H]
C4E37921C130 vinsertps xmm0, xmm0, xmm1, 48
C5F8114130 vmovups xmmword ptr [rcx+30H], xmm0

Now we generate:

3809 cmp byte ptr [rcx], cl
C4E37121C210 vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320 vinsertps xmm0, xmm0, xmm3, 32
C4E3792144242830 vinsertps xmm0, xmm0, dword ptr [rsp+28H], 48
C5F81101 vmovups xmmword ptr [rcx], xmm0
C5FA10442430 vmovss xmm0, dword ptr [rsp+30H]
C4E3792144243810 vinsertps xmm0, xmm0, dword ptr [rsp+38H], 16
C4E3792144244020 vinsertps xmm0, xmm0, dword ptr [rsp+40H], 32
C4E3792144244830 vinsertps xmm0, xmm0, dword ptr [rsp+48H], 48
C5F8114110 vmovups xmmword ptr [rcx+10H], xmm0
C5FA10442450 vmovss xmm0, dword ptr [rsp+50H]
C4E3792144245810 vinsertps xmm0, xmm0, dword ptr [rsp+58H], 16
C4E3792144246020 vinsertps xmm0, xmm0, dword ptr [rsp+60H], 32
C4E3792144246830 vinsertps xmm0, xmm0, dword ptr [rsp+68H], 48
C5F8114120 vmovups xmmword ptr [rcx+20H], xmm0
C5FA10442470 vmovss xmm0, dword ptr [rsp+70H]
C4E3792144247810 vinsertps xmm0, xmm0, dword ptr [rsp+78H], 16
C4E3792184248000000020 vinsertps xmm0, xmm0, dword ptr [rsp+80H], 32
C4E3792184248800000030 vinsertps xmm0, xmm0, dword ptr [rsp+88H], 48
C5F8114130 vmovups xmmword ptr [rcx+30H], xmm0
Ideally we'd be able to improve this even more and recognize … However, I've been told this is complicated since we apparently do not surface whether or not a given …
CC @dotnet/jit-contrib
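To make the "non-mutating" point concrete, here is a minimal C++ sketch of the query/act split it implies. The types and function names below are invented for illustration; this is not the actual lowering code.

```cpp
// Minimal sketch of a side-effect-free containment query; the names are
// invented and this is not the real JIT implementation.
struct Node
{
    bool contained            = false;
    bool isCreateScalarUnsafe = false;
    bool isMemoryLoad         = false;
};

// Non-mutating: only reports whether the operand could be contained by the
// consuming hardware-intrinsic node. Calling it "just to check" must leave
// the tree untouched.
bool TryGetContainableOp(Node& op, Node** containable)
{
    if (op.isMemoryLoad || op.isCreateScalarUnsafe)
    {
        *containable = &op;
        return true;
    }
    return false;
}

// Only the caller mutates the tree, and only once it decides to act on the
// answer, e.g. folding a scalar load into the consuming vinsertps.
void LowerOperand(Node& op)
{
    Node* candidate = nullptr;
    if (TryGetContainableOp(op, &candidate))
    {
        candidate->contained = true;
    }
}
```

The important property is that calling the query twice, or calling it and then deciding not to contain the operand, leaves the tree exactly as it was.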
Why are there some regressions?
@BruceForstall, it looks like possibly a bug in diff reporting? If I had to guess, this is an issue with a peephole optimization not correctly impacting the computed instruction size. If we look at 173746.dasm, we'll see:

- ;; size=91 bbWeight=1 PerfScore 18.08
+ ;; size=97 bbWeight=1 PerfScore 18.08

Latter instruction groups largely just change the IG block number they jump to and occasionally insert new alignment bytes, but even when it's a diff of only 1 alignment byte, they often show a significantly larger measured size increase.
@BruceForstall, @kunalspathak, actually, after looking at the raw disassembly (which includes the codegen bytes), this is the alignment code pessimizing SIMD codegen. Basically, if we decide to emit alignment blocks at all, then SIMD codegen gets forced to always emit the 3-byte encoding rather than the more optimal 2-byte encoding: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L1560-L1569 This can make otherwise profitable changes appear to no longer be profitable in any method that is SIMD- or floating-point-heavy.
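As a rough sketch of the pessimization described above (invented names, not the code linked to): once alignment padding may be inserted, the emitter cannot let an instruction come out smaller than it estimated, so the VEX prefix size gets pinned to the worst case.

```cpp
// Sketch only (invented names): models the conservative behaviour described
// above. With loop alignment enabled, the emitted encoding must not be
// smaller than the estimate, so the 3-byte VEX prefix is assumed even when
// the 2-byte form would have sufficed.
unsigned EstimatedVexPrefixSize(bool loopAlignEnabled, bool couldUse2ByteVex)
{
    if (loopAlignEnabled)
    {
        return 3; // pinned to the long form
    }
    return couldUse2ByteVex ? 2 : 3;
}
```

Instructions in the 0F3A opcode map, such as vinsertps, always need the 3-byte prefix anyway; the cost falls on instructions like the vmovss loads above, which could otherwise use the 2-byte form.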
That is unfortunately true. We could potentially have a heuristic to not …
I'm going to see how easy it is to "fix" the size estimation to not statically return 3. The biggest "missing" piece should really be the checks on which registers are used.
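A sketch of the kind of operand checks a more precise estimate would need (invented names; the constraints themselves are the architectural rules for the 2-byte C5 VEX prefix):

```cpp
// Sketch only (invented names). The 2-byte VEX prefix is usable only when
// the instruction needs neither VEX.X nor VEX.B (no r8-r15 / xmm8-xmm15 in
// the r/m, base, or index position), does not need VEX.W, and its opcode
// lives in the 0F map (so not 0F38/0F3A instructions like vinsertps).
struct VexEncodingInputs
{
    bool extendedRmBaseOrIndex; // would require VEX.B or VEX.X
    bool needsW;                // W1-only / 64-bit operand form
    bool map0F38or0F3A;         // opcode map other than 0F
};

unsigned PreciseVexPrefixSize(const VexEncodingInputs& in)
{
    bool canUse2Byte = !in.extendedRmBaseOrIndex && !in.needsW && !in.map0F38or0F3A;
    return canUse2Byte ? 2 : 3;
}
```

With checks like these, the estimate only falls back to 3 bytes for instructions that genuinely need the longer prefix.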
I don't think it's worth blocking this PR on that change, however. This change is a net positive for the number of instructions emitted, with the "regressions" being an external factor caused by loop alignment.
No changes, just rerunning CI and getting updated diffs now that #79478 is merged. I expect the SPMI diffs to look even better. Plan on merging after those get back, assuming nothing unexpected crops up.
Indeed, better diffs. But it shows a couple of small regressions on …
… removing NEG nodes
@BruceForstall, resolved all the regressions with the last two commits. Since they were a little more complex, it might be worth a second review/sign-off. The first handles the FMA case by ensuring we recheck containment after removing the NEG nodes. The second was a little more complex and involves …
LGTM