
Ensure that TryGetContainableHWIntrinsicOp is non-mutating #79363

Merged

Conversation

tannergooding
Member

This removes a possible mutating side-effect from TryGetContainableHWIntrinsicOp by allowing CreateScalarUnsafe to be directly contained.

This helps ensure accidental side effects don't happen and improves codegen for cases like the following:

Unsafe.SkipInit(out this);

Vector128<float> x = Vector128.Create(m11, m12, m13, m14);
x.StoreUnsafe(ref M11);

Vector128<float> y = Vector128.Create(m21, m22, m23, m24);
y.StoreUnsafe(ref M21);

Vector128<float> z = Vector128.Create(m31, m32, m33, m34);
z.StoreUnsafe(ref M31);

Vector128<float> w = Vector128.Create(m41, m42, m43, m44);
w.StoreUnsafe(ref M41);

Before we generated:

3809                 cmp      byte  ptr [rcx], cl
C4E37121C210         vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320         vinsertps xmm0, xmm0, xmm3, 32
C5FA104C2428         vmovss   xmm1, dword ptr [rsp+28H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F81101             vmovups  xmmword ptr [rcx], xmm0
C5FA10442430         vmovss   xmm0, dword ptr [rsp+30H]
C5FA104C2438         vmovss   xmm1, dword ptr [rsp+38H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2440         vmovss   xmm1, dword ptr [rsp+40H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2448         vmovss   xmm1, dword ptr [rsp+48H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114110           vmovups  xmmword ptr [rcx+10H], xmm0
C5FA10442450         vmovss   xmm0, dword ptr [rsp+50H]
C5FA104C2458         vmovss   xmm1, dword ptr [rsp+58H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2460         vmovss   xmm1, dword ptr [rsp+60H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2468         vmovss   xmm1, dword ptr [rsp+68H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114120           vmovups  xmmword ptr [rcx+20H], xmm0
C5FA10442470         vmovss   xmm0, dword ptr [rsp+70H]
C5FA104C2478         vmovss   xmm1, dword ptr [rsp+78H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA108C2480000000   vmovss   xmm1, dword ptr [rsp+80H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA108C2488000000   vmovss   xmm1, dword ptr [rsp+88H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114130           vmovups  xmmword ptr [rcx+30H], xmm0

Now we generate:

3809                 cmp      byte  ptr [rcx], cl
C4E37121C210         vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320         vinsertps xmm0, xmm0, xmm3, 32
C4E3792144242830     vinsertps xmm0, xmm0, dword ptr [rsp+28H], 48
C5F81101             vmovups  xmmword ptr [rcx], xmm0
C5FA10442430         vmovss   xmm0, dword ptr [rsp+30H]
C4E3792144243810     vinsertps xmm0, xmm0, dword ptr [rsp+38H], 16
C4E3792144244020     vinsertps xmm0, xmm0, dword ptr [rsp+40H], 32
C4E3792144244830     vinsertps xmm0, xmm0, dword ptr [rsp+48H], 48
C5F8114110           vmovups  xmmword ptr [rcx+10H], xmm0
C5FA10442450         vmovss   xmm0, dword ptr [rsp+50H]
C4E3792144245810     vinsertps xmm0, xmm0, dword ptr [rsp+58H], 16
C4E3792144246020     vinsertps xmm0, xmm0, dword ptr [rsp+60H], 32
C4E3792144246830     vinsertps xmm0, xmm0, dword ptr [rsp+68H], 48
C5F8114120           vmovups  xmmword ptr [rcx+20H], xmm0
C5FA10442470         vmovss   xmm0, dword ptr [rsp+70H]
C4E3792144247810     vinsertps xmm0, xmm0, dword ptr [rsp+78H], 16
C4E3792184248000000020 vinsertps xmm0, xmm0, dword ptr [rsp+80H], 32
C4E3792184248800000030 vinsertps xmm0, xmm0, dword ptr [rsp+88H], 48
C5F8114130           vmovups  xmmword ptr [rcx+30H], xmm0

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 7, 2022
@ghost ghost assigned tannergooding Dec 7, 2022
@ghost

ghost commented Dec 7, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

Ideally we'd be able to improve this even more and recognize n sequential inserts all coming from sequential addresses, such that we could replace them with a single LoadVector128.

However, I've been told this is complicated, since we apparently do not surface whether a given parameter is passed by register or via shadow copy, and so we cannot trivially see that these loads are coming from sequential addresses.
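For illustration, here is a hedged C# sketch of the transformation being described (this PR does not implement it); it assumes m11..m14 happen to live at sequential addresses, e.g. as contiguous stack spills:

// Today: built element-wise (vmovss plus a chain of vinsertps).
Vector128<float> viaInserts = Vector128.Create(m11, m12, m13, m14);

// Desired: a single 16-byte load (one vmovups), valid only when m11..m14 are contiguous in memory.
Vector128<float> viaLoad = Vector128.LoadUnsafe(ref m11);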

@tannergooding tannergooding marked this pull request as ready for review December 8, 2022 19:15
@tannergooding
Member Author

CC. @dotnet/jit-contrib

@BruceForstall
Member

Diffs

Why are there some regressions?

@tannergooding
Member Author

@BruceForstall, looks like possibly a bug in diff reporting? If I had to guess, this is an issue with a peephole optimization not correctly impacting the computed instruction size.

If we look at 173746.dasm from asm.coreclr_tests.run.Linux.x64.checked:

We'll see G_M25977_IG01 has the following diff even though there were zero code changes.

-						;; size=91 bbWeight=1    PerfScore 18.08
+						;; size=97 bbWeight=1    PerfScore 18.08

For IG02, it drops 5 movaps instructions (-4 bytes each, so -20 bytes total) and gains a jg instruction (+6 bytes) since all instructions from the BB fit into one IG now. However, the diff still reports a +18 byte change in size, even though it should be -14.

Later instruction groups largely just change the IG block number they jump to and occasionally insert new alignment bytes, but even when it's a diff of one alignment byte, they often show significantly larger size increases.

@tannergooding
Member Author

@BruceForstall @kunalspathak: after looking at the raw disassembly (which includes the encoded instruction bytes), this is actually the alignment code pessimizing SIMD codegen.

Basically, if we decide to emit alignment blocks at all, then SIMD codegen gets forced to always emit the 3-byte encoding rather than the more optimal 2-byte encoding: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L1560-L1569

This means otherwise-profitable changes can appear unprofitable in any method that is SIMD- or floating-point-heavy.
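For context, a hedged illustration using the listings above: vmovups xmmword ptr [rcx], xmm0 encodes as C5F81101 with the compact 2-byte VEX prefix (C5); once alignment forces the 3-byte prefix (C4), the same instruction takes one extra byte, and that extra byte per eligible SIMD instruction is what shows up as the apparent size regressions.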

@kunalspathak
Member

This means otherwise-profitable changes can appear unprofitable in any method that is SIMD- or floating-point-heavy.

That is unfortunately true. We could potentially have a heuristic to not align blocks beyond a point where we have seen XXX SIMD instructions?

@tannergooding
Member Author

That is unfortunately true. We could potentially have a heuristic to not align blocks beyond a point where we have seen XXX SIMD instructions?

I'm going to see how easy it is to "fix" the size estimation to not statically return 3.

The biggest "missing" piece should really be the checks on which registers are used.

@tannergooding
Member Author

I don't think it's worth blocking this PR on that change, however.

This change is a net positive to the number of instructions emitted with the "regressions" being an external factor based on loop alignment.

@tannergooding
Member Author

No changes, just rerunning CI and getting updated diffs now that #79478 is merged. I expect the SPMI diffs to look even better. Plan on merging after those get back, assuming nothing unexpected crops up.

@tannergooding
Member Author

Indeed, the diffs are better.

But they show a couple of small regressions on vfma***ps that need to be looked into and resolved. These look to have existed in the original change as well, but were harder to spot due to all the alignment diffs.

@tannergooding
Member Author

@BruceForstall, resolved all the regressions with the last two commits. Since they were a little more complex, it might be worth a second review/sign-off:

The first handles the FMA case by ensuring we recheck containment after removing the NEG node. Basically, we had CreateScalarUnsafe(-value), and that blocked containment. In such a case we decide to change the instruction being emitted to something like FusedMultiplySubtract and remove the NEG; we then just need to recheck containment on CreateScalarUnsafe(value).
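As a hedged C# illustration of that first case (not the JIT's own code, and simplified to a scalar FMA):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static float MulAddNegated(Vector128<float> a, Vector128<float> b, float c)
{
    // Written as (a * b) + (-c); lowering can rewrite this to the equivalent of
    // Fma.MultiplySubtractScalar(a, b, Vector128.CreateScalarUnsafe(c)), drop the NEG,
    // and then recheck whether CreateScalarUnsafe(c) can be contained (e.g. folded into
    // a memory operand of the vfmsub instruction).
    return Fma.MultiplyAddScalar(a, b, Vector128.CreateScalarUnsafe(-c)).ToScalar();
}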

The second was a little more complex and involves vpbroadcast* being a special instruction where the scalar integer source can come either from memory or from a SIMD register (whereas typically it comes from a general-purpose register). This means that for byte we effectively have vpbroadcastb xmm, xmm/m8. To resolve this one I opted to always remove CreateScalarUnsafe and then specially handle, in codegen, the case where it is regOptional but not spilled.
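And a hedged C# illustration of the second case (again not the JIT's code; it assumes AVX2 is available):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> BroadcastByte(byte value)
{
    // Lowers to vpbroadcastb xmm, xmm/m8: the scalar source feeding the broadcast can be
    // either a SIMD register or a memory location, so the CreateScalarUnsafe node is always
    // removed and the regOptional-but-not-spilled case is handled specially in codegen.
    return Avx2.BroadcastScalarToVector128(Vector128.CreateScalarUnsafe(value));
}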

@BruceForstall BruceForstall left a comment
Member

LGTM

@tannergooding tannergooding merged commit c6045ad into dotnet:main Dec 16, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 16, 2023