
Add basic support for folding SIMD intrinsics #81547

Merged

Merged 8 commits into dotnet:main from cns_vec-fold on Feb 11, 2023

Conversation

tannergooding
Member

This is a minimal proof of concept that adds SIMD folding support for:

  • Negate
  • Add
  • Subtract
  • GetElement

The first three are done as a general proof of concept for simd = op(simd) and simd = op(simd, simd). They attempt to use templating to reduce code duplication and otherwise make things simple to add/test.
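The shape of those templated helpers can be sketched roughly like this. This is a minimal illustration of the approach, not the actual RyuJIT code; `EvaluateUnary`/`EvaluateBinary` and their signatures are assumptions:

```cpp
#include <cassert>
#include <cstddef>

// Minimal sketch of templated per-lane folding helpers. The names and
// signatures here are illustrative stand-ins, not the actual JIT code.

template <typename T, typename TOp>
void EvaluateUnary(T* result, const T* arg, size_t count, TOp op)
{
    for (size_t i = 0; i < count; i++)
    {
        result[i] = op(arg[i]); // fold each lane independently
    }
}

template <typename T, typename TOp>
void EvaluateBinary(T* result, const T* arg0, const T* arg1, size_t count, TOp op)
{
    for (size_t i = 0; i < count; i++)
    {
        result[i] = op(arg0[i], arg1[i]);
    }
}
```

With the per-lane operation passed in as a lambda, adding a new foldable operation only needs one extra case rather than a full per-type evaluator.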

The latter is an example of a case that can't really use templating due to it being scalar = op(simd, int). However, it is one that has a decent amount of actual light up for code that is using scalar fallbacks.

The intent is not that we ever add SIMD folding for "everything". My goal would be to add SIMD folding for the xplat API surface, that is, what Vector64/128/256/512&lt;T&gt; expose and provide software fallbacks for.

This helps keep SIMD constant folding support generally "scoped" and to things which are known to be generally supported/commonplace.

For cases like Add/Subtract, we should ideally also add the simple cases around x + 0 == x and x - 0 == x. The same would be true for x * 1, x / 1, and other similar cases when such SIMD folding is added. That is, basically covering the same scenarios that the scalar binary ops cover (e.g. GT_ADD, GT_SUB, etc).
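A hypothetical sketch of such an identity check, for integer lanes (the enum and helper names are assumptions, not JIT code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of identity-constant recognition for integer lanes.
// Note: floating-point needs more care here; for example, x + 0.0 is not
// an identity when x is -0.0, so real folding logic cannot apply these
// blindly to float/double lanes.
enum Oper { OP_ADD, OP_SUB, OP_MUL, OP_DIV };

bool IsIdentityConstant(Oper oper, int64_t cns)
{
    switch (oper)
    {
        case OP_ADD: // x + 0 == x
        case OP_SUB: // x - 0 == x
            return cns == 0;
        case OP_MUL: // x * 1 == x
        case OP_DIV: // x / 1 == x
            return cns == 1;
    }
    return false;
}
```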

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 2, 2023
@ghost ghost assigned tannergooding Feb 2, 2023
@ghost

ghost commented Feb 2, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

Additional tests still need to be added. We already have a few due to the HWIntrinsic tests being templated, but more scenarios should be added.

@tannergooding
Member Author

Initial diff is great.

No measurable TP change and positive diffs for tests/benchmarks.

Linux Arm64

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.linux.arm64.checked.mch | 16,511,180 | -1,148 |
| coreclr_tests.run.linux.arm64.checked.mch | 168,695,028 | -3,284 |
| libraries.crossgen2.linux.arm64.checked.mch | 46,483,764 | +0 |
| libraries.pmi.linux.arm64.checked.mch | 63,552,024 | +0 |
| libraries_tests.pmi.linux.arm64.checked.mch | 144,368,452 | -10,160 |

Linux x64

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.linux.x64.checked.mch | 14,564,149 | -964 |
| coreclr_tests.run.linux.x64.checked.mch | 106,635,565 | -3,323 |
| libraries.crossgen2.linux.x64.checked.mch | 16,389,941 | +0 |
| libraries.pmi.linux.x64.checked.mch | 48,774,711 | -52 |
| libraries_tests.pmi.linux.x64.checked.mch | 116,999,297 | -9,484 |

Windows Arm64/x64 are about half this (largely due to ABI differences from what I can tell), but still all positive.

@tannergooding
Member Author

So we start off with this tree

***** BB01, STMT00000(before)
N006 ( 15, 11) [000005] -A-XG---R--                         *  ASG       simd16 (copy)
N005 (  7,  5) [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion>
N004 (  1,  1) [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1
N003 (  7,  5) [000002] -----------                         \--*  HWINTRINSIC simd16 float Subtract
N001 (  3,  2) [000000] -----------                            +--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000>
N002 (  3,  2) [000001] -----------                            \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000>

That gets recognized and constant folded

N001 [000000]   CNS_VEC  <0x00000000, 0x00000000, 0x00000000, 0x3f800000> => $100 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x3f800000]}
N002 [000001]   CNS_VEC  <0x00000000, 0x00000000, 0x00000000, 0x3f800000> => $100 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x3f800000]}
N003 [000002]   HWINTRINSIC => $101 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x00000000]}

We don't see any transforms to the tree here (just tracking Subtract as $101 and the constants as $100), even though we've just determined this can be a constant evaluation

We then do some work in the `Optimize Valnum CSEs` phase to CSE the constants (doing two morphs), ending up with

N010 ( 14, 12)              [000005] -A-XG---R--                         *  ASG       simd16 (copy) $181
N009 (  7,  5)              [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion> $181
N008 (  1,  1)              [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1 $80
N007 (  6,  6)              [000002] -A---------                         \--*  HWINTRINSIC simd16 float Subtract $101
N005 (  4,  4)              [000011] -A---------                            +--*  COMMA     simd16 $100
N003 (  3,  3) CSE #01 (def)[000009] -A------R--                            |  +--*  ASG       simd16 (copy) $VN.Void
N002 (  1,  1)              [000008] D------N---                            |  |  +--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         d:1 $VN.Void
N001 (  3,  2)              [000000] -------N---                            |  |  \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000> $100
N004 (  1,  1)              [000010] -----------                            |  \--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         u:1 $100
N006 (  1,  1)              [000012] -----------                            \--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         u:1 $100

Then in assertion prop we finally do the replacement, but notably don't yet get rid of the now unused CSE

N010 ( 14, 12) [000005] -A-XG---R--                         *  ASG       simd16 (copy) $181
N009 (  7,  5) [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion> $181
N008 (  1,  1) [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1 $80
               [000014] -A---------                         \--*  COMMA     simd16
N003 (  3,  3) [000009] -A------R--                            +--*  ASG       simd16 (copy) $VN.Void
N002 (  1,  1) [000008] D------N---                            |  +--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         d:1 $VN.Void
N001 (  3,  2) [000000] -------N---                            |  \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000> $100
               [000013] -----------                            \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> $101

When processing the COMMA in MorphBlock, we then hit an assert in MorphCommaBlock, where there is a note:

// TODO-Cleanup: this block is not needed for not struct nodes, but
// TryPrimitiveCopy works wrong without this transformation.

@tannergooding
Member Author

It seems like there are a few problems here:

  1. We're doing potentially unnecessary work in ValueNum for inputs to an expression which will itself later be replaced with a constant. This means we don't, for example, account for $100 having two fewer uses. I think we would still track $101 the same as a direct Zero constant, however, so that might still get tracked correctly...

  2. MorphCommaBlock isn't correctly handling primitive types because TryPrimitiveCopy isn't doing the right thing

  3. MorphCommaBlock isn't correctly handling SIMD constants and potentially is pessimizing TYP_SIMD structs in general

If someone from @dotnet/jit-contrib has some time I'd like to better understand the bits here and what we can do to resolve them. I expect 1/2 are external to this PR, but 3 is something that needs resolving for this to go in.

Is the "simple fix" just having MorphCommaBlock skip the handling for an effectiveVal that isn't one of LCL_VAR, LCL_FLD, BLK, OBJ, IND, or FIELD?

@tannergooding
Member Author

tannergooding commented Feb 9, 2023

@EgorBo hit this as well, in a different way, and is handling it in #81857

@tannergooding tannergooding force-pushed the cns_vec-fold branch 5 times, most recently from 78fae54 to 44d7f81 Compare February 10, 2023 23:04
@tannergooding tannergooding marked this pull request as ready for review February 11, 2023 00:44
@tannergooding
Member Author

CC. @dotnet/jit-contrib


#if defined(TARGET_XARCH)
// scalar operations on xarch copy the upper bits from arg0
*result = arg0;
Member


Can you please explain this part with an example? (the difference between xarch and arm)

Member Author


xarch has the behavior where scalar operations "copy the upper bits", that is x + y is equivalent to:

Vector128<T> result = x;
return result.WithElement(0, x.GetElement(0) + y.GetElement(0));

arm on the other hand zeros the upper bits, that is x + y is equivalent to:

Vector128<T> result = Vector128<T>.Zero;
return result.WithElement(0, x.GetElement(0) + y.GetElement(0));
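In plain C++ terms, the two semantics differ only in how the upper lanes are seeded. This is a simplified sketch of the folding logic for a 4 x float vector; `Simd16` and `FoldAddScalar` are illustrative names, not the JIT's actual types:

```cpp
#include <cassert>

// Illustrative sketch: the scalar-op folding difference between xarch
// (upper lanes copied from arg0) and arm (upper lanes zeroed).
struct Simd16
{
    float f[4];
};

Simd16 FoldAddScalar(const Simd16& arg0, const Simd16& arg1, bool isXArch)
{
    Simd16 result;
    if (isXArch)
    {
        result = arg0;     // xarch: upper lanes are copied from arg0
    }
    else
    {
        result = Simd16{}; // arm: upper lanes are zeroed
    }
    result.f[0] = arg0.f[0] + arg1.f[0]; // only lane 0 is actually computed
    return result;
}
```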

Member Author


Added a path that explicitly zeros for Arm64 to help clarify the logic

Member

@EgorBo EgorBo left a comment


LGTM. Just out of curiosity - will e.g. adding folding for vector comparison be only a matter of adding a case to EvaluateBinaryScalar and that's it?

{
switch (baseType)
{
case TYP_FLOAT:
Member


Presumably these could be hidden under a macro like EVAL_UNARY_SIMD(TYP_FLOAT, float) but some people hate macros so it's fine as is.
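For illustration, the suggested macro might expand along these lines. This is a hypothetical sketch; the evaluator name, the enum, and the overall shape are assumptions, not the PR's actual code:

```cpp
#include <cassert>

// Hypothetical expansion of the suggested EVAL_UNARY_SIMD macro: one
// switch case per base type, delegating to a templated scalar evaluator.
template <typename T>
T EvaluateNegScalar(T value)
{
    return static_cast<T>(-value);
}

enum BaseType { TYP_FLOAT, TYP_DOUBLE };

#define EVAL_UNARY_SIMD(typ, type)                              \
    case typ:                                                   \
        *static_cast<type*>(result) =                           \
            EvaluateNegScalar(*static_cast<const type*>(arg));  \
        break;

void EvaluateUnaryElement(BaseType baseType, void* result, const void* arg)
{
    switch (baseType)
    {
        EVAL_UNARY_SIMD(TYP_FLOAT, float)
        EVAL_UNARY_SIMD(TYP_DOUBLE, double)
    }
}
```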

@tannergooding
Member Author

will e.g. adding folding for vector comparison be only a matter of adding a case to EvaluateBinaryScalar and that's it?

For the most part yes. There are a few floating-point edge cases that will end up needing specialization so they handle NaN correctly.
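To make the NaN wrinkle concrete, per-lane comparison folding might look like the sketch below. `FoldCompareEqualLane` is an illustrative name, not JIT code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch: vector comparisons fold each lane to an all-bits-set or all-zero
// mask, and any ordered comparison involving NaN is false.
uint32_t FoldCompareEqualLane(float x, float y)
{
    // NaN == NaN is false, so a NaN lane must fold to an all-zero mask;
    // not-equal is the comparison where NaN lanes instead yield all-ones.
    return (x == y) ? 0xFFFFFFFFu : 0x00000000u;
}
```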

@tannergooding tannergooding merged commit 8458201 into dotnet:main Feb 11, 2023
@tannergooding tannergooding deleted the cns_vec-fold branch February 11, 2023 23:30
@ghost ghost locked as resolved and limited conversation to collaborators Mar 14, 2023