
Add basic support for folding SIMD intrinsics #81547

Merged

Merged 8 commits into dotnet:main from cns_vec-fold on Feb 11, 2023

Conversation

tannergooding
Member

This is a minimal proof of concept that adds SIMD folding support for:

  • Negate
  • Add
  • Subtract
  • GetElement

The first three are done as a general proof of concept for simd = op(simd) and simd = op(simd, simd). They attempt to use templating to reduce code duplication and otherwise make things simple to add/test.
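The shape of those templated helpers can be sketched roughly like this. This is a minimal illustration of the approach, not the actual RyuJIT code; `EvaluateUnary`/`EvaluateBinary` and their signatures are assumptions:

```cpp
#include <cassert>
#include <cstddef>

// Minimal sketch of templated per-lane folding helpers. The names and
// signatures here are illustrative stand-ins, not the actual JIT code.

template <typename T, typename TOp>
void EvaluateUnary(T* result, const T* arg, size_t count, TOp op)
{
    for (size_t i = 0; i < count; i++)
    {
        result[i] = op(arg[i]); // fold each lane independently
    }
}

template <typename T, typename TOp>
void EvaluateBinary(T* result, const T* arg0, const T* arg1, size_t count, TOp op)
{
    for (size_t i = 0; i < count; i++)
    {
        result[i] = op(arg0[i], arg1[i]);
    }
}
```

With the per-lane operation passed in as a lambda, adding a new foldable operation only needs one extra case rather than a full per-type evaluator.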

The latter is an example of a case that can't really use templating due to it being scalar = op(simd, int). However, it is one that has a decent amount of actual light up for code that is using scalar fallbacks.

The intent is not that we ever add SIMD folding for "everything". My goal would be to add SIMD folding for the xplat API surface, that is, what Vector64/128/256/512&lt;T&gt; expose and provide software fallbacks for.

This helps keep SIMD constant folding support generally "scoped" and to things which are known to be generally supported/commonplace.

For cases like Add/Subtract, we should ideally also add the simple cases around x + 0 == x and x - 0 == x. The same would be true for x * 1, x / 1, and other similar cases when such SIMD folding is added. That is, basically covering the same scenarios that the scalar binary ops cover (e.g. GT_ADD, GT_SUB, etc).
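A hypothetical sketch of such an identity check, for integer lanes (the enum and helper names are assumptions, not JIT code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of identity-constant recognition for integer lanes.
// Note: floating-point needs more care here; for example, x + 0.0 is not
// an identity when x is -0.0, so real folding logic cannot apply these
// blindly to float/double lanes.
enum Oper { OP_ADD, OP_SUB, OP_MUL, OP_DIV };

bool IsIdentityConstant(Oper oper, int64_t cns)
{
    switch (oper)
    {
        case OP_ADD: // x + 0 == x
        case OP_SUB: // x - 0 == x
            return cns == 0;
        case OP_MUL: // x * 1 == x
        case OP_DIV: // x / 1 == x
            return cns == 1;
    }
    return false;
}
```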

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 2, 2023
@ghost ghost assigned tannergooding Feb 2, 2023
@ghost

ghost commented Feb 2, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

Additional tests still need to be added. We already have a few due to the HWIntrinsic tests being templated, but more scenarios should be added.

@tannergooding
Member Author

Initial diff is great.

No measurable TP change and positive diffs for tests/benchmarks.

Linux Arm64

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.linux.arm64.checked.mch | 16,511,180 | -1,148 |
| coreclr_tests.run.linux.arm64.checked.mch | 168,695,028 | -3,284 |
| libraries.crossgen2.linux.arm64.checked.mch | 46,483,764 | +0 |
| libraries.pmi.linux.arm64.checked.mch | 63,552,024 | +0 |
| libraries_tests.pmi.linux.arm64.checked.mch | 144,368,452 | -10,160 |

Linux x64

| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.linux.x64.checked.mch | 14,564,149 | -964 |
| coreclr_tests.run.linux.x64.checked.mch | 106,635,565 | -3,323 |
| libraries.crossgen2.linux.x64.checked.mch | 16,389,941 | +0 |
| libraries.pmi.linux.x64.checked.mch | 48,774,711 | -52 |
| libraries_tests.pmi.linux.x64.checked.mch | 116,999,297 | -9,484 |

Windows Arm64/x64 are about half this (largely due to ABI differences from what I can tell), but still all positive.

@tannergooding
Member Author

So we start off with this tree

***** BB01, STMT00000(before)
N006 ( 15, 11) [000005] -A-XG---R--                         *  ASG       simd16 (copy)
N005 (  7,  5) [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion>
N004 (  1,  1) [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1
N003 (  7,  5) [000002] -----------                         \--*  HWINTRINSIC simd16 float Subtract
N001 (  3,  2) [000000] -----------                            +--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000>
N002 (  3,  2) [000001] -----------                            \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000>

That gets recognized and constant folded

N001 [000000]   CNS_VEC  <0x00000000, 0x00000000, 0x00000000, 0x3f800000> => $100 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x3f800000]}
N002 [000001]   CNS_VEC  <0x00000000, 0x00000000, 0x00000000, 0x3f800000> => $100 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x3f800000]}
N003 [000002]   HWINTRINSIC => $101 {Simd16Cns[0x00000000, 0x00000000, 0x00000000, 0x00000000]}

We don't see any transforms to the tree here (just tracking Subtract as $101 and the constants as $100), even though we've just determined this can be a constant evaluation

We then do some work in the `Optimize Valnum CSEs` phase to CSE the constants (doing two morphs), ending up with

N010 ( 14, 12)              [000005] -A-XG---R--                         *  ASG       simd16 (copy) $181
N009 (  7,  5)              [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion> $181
N008 (  1,  1)              [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1 $80
N007 (  6,  6)              [000002] -A---------                         \--*  HWINTRINSIC simd16 float Subtract $101
N005 (  4,  4)              [000011] -A---------                            +--*  COMMA     simd16 $100
N003 (  3,  3) CSE #01 (def)[000009] -A------R--                            |  +--*  ASG       simd16 (copy) $VN.Void
N002 (  1,  1)              [000008] D------N---                            |  |  +--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         d:1 $VN.Void
N001 (  3,  2)              [000000] -------N---                            |  |  \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000> $100
N004 (  1,  1)              [000010] -----------                            |  \--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         u:1 $100
N006 (  1,  1)              [000012] -----------                            \--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         u:1 $100

Then in assertion prop we finally do the replacement, but notably don't yet get rid of the now unused CSE

N010 ( 14, 12) [000005] -A-XG---R--                         *  ASG       simd16 (copy) $181
N009 (  7,  5) [000004] D--XG--N---                         +--*  OBJ       simd16<System.Numerics.Quaternion> $181
N008 (  1,  1) [000003] -----------                         |  \--*  LCL_VAR   byref  V01 RetBuf       u:1 $80
               [000014] -A---------                         \--*  COMMA     simd16
N003 (  3,  3) [000009] -A------R--                            +--*  ASG       simd16 (copy) $VN.Void
N002 (  1,  1) [000008] D------N---                            |  +--*  LCL_VAR   simd16<System.Numerics.Quaternion> V03 cse0         d:1 $VN.Void
N001 (  3,  2) [000000] -------N---                            |  \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x3f800000> $100
               [000013] -----------                            \--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> $101

When processing the COMMA in MorphBlock, we then hit an assert in MorphCommaBlock, where there is a note:

// TODO-Cleanup: this block is not needed for not struct nodes, but
// TryPrimitiveCopy works wrong without this transformation.

@tannergooding
Member Author

It seems like there are a few problems here:

  1. We're doing potentially unnecessary work in ValueNum for inputs to an expression which will itself later be replaced with a constant. This means we don't, for example, account for $100 having two fewer uses. I think we would still track $101 the same as a direct Zero constant, however, so that might still get tracked correctly...

  2. MorphCommaBlock isn't correctly handling primitive types because TryPrimitiveCopy isn't doing the right thing

  3. MorphCommaBlock isn't correctly handling SIMD constants and potentially is pessimizing TYP_SIMD structs in general

If someone from @dotnet/jit-contrib has some time I'd like to better understand the bits here and what we can do to resolve them. I expect 1/2 are external to this PR, but 3 is something that needs resolving for this to go in.

Is the "simple fix" just having MorphCommaBlock skip the handling for an effectiveVal that isn't one of LCL_VAR, LCL_FLD, BLK, OBJ, IND, or FIELD?

@tannergooding
Member Author

tannergooding commented Feb 9, 2023

@EgorBo hit this as well, in a different way, and is handling it in #81857

@tannergooding tannergooding force-pushed the cns_vec-fold branch 5 times, most recently from 78fae54 to 44d7f81 Compare February 10, 2023 23:04
@tannergooding tannergooding marked this pull request as ready for review February 11, 2023 00:44
@tannergooding
Member Author

CC. @dotnet/jit-contrib


#if defined(TARGET_XARCH)
// scalar operations on xarch copy the upper bits from arg0
*result = arg0;
Member


Can you please explain this part with an example? (the difference between xarch and arm)

Member Author


xarch has the behavior where scalar operations "copy the upper bits", that is x + y is equivalent to:

Vector128<T> result = x;
return result.WithElement(0, x.GetElement(0) + y.GetElement(0));

arm on the other hand zeros the upper bits, that is x + y is equivalent to:

Vector128<T> result = Vector128<T>.Zero;
return result.WithElement(0, x.GetElement(0) + y.GetElement(0));
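In plain C++ terms, the two semantics differ only in how the upper lanes are seeded. This is a simplified sketch of the folding logic for a 4 x float vector; `Simd16` and `FoldAddScalar` are illustrative names, not the JIT's actual types:

```cpp
#include <cassert>

// Illustrative sketch: the scalar-op folding difference between xarch
// (upper lanes copied from arg0) and arm (upper lanes zeroed).
struct Simd16
{
    float f[4];
};

Simd16 FoldAddScalar(const Simd16& arg0, const Simd16& arg1, bool isXArch)
{
    Simd16 result;
    if (isXArch)
    {
        result = arg0;     // xarch: upper lanes are copied from arg0
    }
    else
    {
        result = Simd16{}; // arm: upper lanes are zeroed
    }
    result.f[0] = arg0.f[0] + arg1.f[0]; // only lane 0 is actually computed
    return result;
}
```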

Member Author


Added a path that explicitly zeros for Arm64 to help clarify the logic

Member

@EgorBo EgorBo left a comment


LGTM. Just out of curiosity - will e.g. adding folding for vector comparison be only a matter of adding a case to EvaluateBinaryScalar and that's it?

{
switch (baseType)
{
case TYP_FLOAT:
Member


Presumably these could be hidden under a macro like EVAL_UNARY_SIMD(TYP_FLOAT, float) but some people hate macros so it's fine as is.
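For illustration, the suggested macro might expand along these lines. This is a hypothetical sketch; the evaluator name, the enum, and the overall shape are assumptions, not the PR's actual code:

```cpp
#include <cassert>

// Hypothetical expansion of the suggested EVAL_UNARY_SIMD macro: one
// switch case per base type, delegating to a templated scalar evaluator.
template <typename T>
T EvaluateNegScalar(T value)
{
    return static_cast<T>(-value);
}

enum BaseType { TYP_FLOAT, TYP_DOUBLE };

#define EVAL_UNARY_SIMD(typ, type)                              \
    case typ:                                                   \
        *static_cast<type*>(result) =                           \
            EvaluateNegScalar(*static_cast<const type*>(arg));  \
        break;

void EvaluateUnaryElement(BaseType baseType, void* result, const void* arg)
{
    switch (baseType)
    {
        EVAL_UNARY_SIMD(TYP_FLOAT, float)
        EVAL_UNARY_SIMD(TYP_DOUBLE, double)
    }
}
```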

@tannergooding
Member Author

will e.g. adding folding for vector comparison be only a matter of adding a case to EvaluateBinaryScalar and that's it?

For the most part yes. There are a few floating-point edge cases that will end up needing specialization so they handle NaN correctly.
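To make the NaN wrinkle concrete, per-lane comparison folding might look like the sketch below. `FoldCompareEqualLane` is an illustrative name, not JIT code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch: vector comparisons fold each lane to an all-bits-set or all-zero
// mask, and any ordered comparison involving NaN is false.
uint32_t FoldCompareEqualLane(float x, float y)
{
    // NaN == NaN is false, so a NaN lane must fold to an all-zero mask;
    // not-equal is the comparison where NaN lanes instead yield all-ones.
    return (x == y) ? 0xFFFFFFFFu : 0x00000000u;
}
```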

@tannergooding tannergooding merged commit 8458201 into dotnet:main Feb 11, 2023
@tannergooding tannergooding deleted the cns_vec-fold branch February 11, 2023 23:30
@ghost ghost locked as resolved and limited conversation to collaborators Mar 14, 2023