Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

Merged
merged 34 commits into from
Feb 26, 2020

Conversation

briansull
Copy link
Contributor

No description provided.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2020
@tannergooding
Copy link
Member

CC. @CarolEidt, @echesakovMSFT


#ifdef FEATURE_HW_INTRINSICS
case GT_HWINTRINSIC:
return true; // allow Hardware Intrinsics to be CSE-ed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any HWIntrinsics we don't want CSE'd or any that need special handling (such as the non-temporal load/stores)?

Copy link
Contributor

@CarolEidt CarolEidt Feb 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent thought - I would say this should include:

  • any of the "special" loads (LoadFence, any NonTemporal
  • any store (i.e. HW_Category_MemoryStore and StoreFence)
  • any prefetch
  • MemoryFence

Perhaps we should consider adding a flag for this, but a reasonable proxy in the meantime would be to exclude any HW_Category_MemoryStore, HW_Category_MemoryLoad and HW_Category_Special:

  • There would be few, if any, loads that the JIT could identify as CSE's (local stack loads would generally not show up as actual load intrinsics).
  • The only HW_Category_Special that I found that this would unnecessarily exclude is CompareLessThan

@briansull
Copy link
Contributor Author

One issue that I ran into while working on this feature was that the numArgs for the SIMD and HW_INTRINSICS is not always set to a useful value. Value numbering needs to know the exact number of arguments to expect and some instructions have a "-1" for numArgs or specify 2 when then can accept either 1 or 2.

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.
Currently I just bail out when the table specifies a numArgs of -1.

@briansull
Copy link
Contributor Author

briansull commented Feb 7, 2020

For true binary operations, it would be useful to have a "commutative" column that indicates that the operands can be safely swapped.

@tannergooding
Copy link
Member

tannergooding commented Feb 7, 2020

For true binary operations, it would be useful to have a "communitive" column that indicates that the operands can be safely swapped.

You can use static bool IsCommutative(NamedIntrinsic id): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L202. You would get the id via AsHWIntrinsic()->gtHWIntrinsicId

This is likewise hooked up through the standard OperIsCommutative API.

Its also worth noting that there are some intrinsics, like FusedMultiplyAdd, which are commutative but have 3 operands and require special handling to account for this (the operation is a fused a * b + c and so the a and b operands are commutative).

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.

This would make lookup during imporation quite a bit more expensive. We already have to check the names, I don't think we want to also have to start checking other parameters.

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

src/coreclr/src/jit/gentree.cpp Show resolved Hide resolved
src/coreclr/src/jit/gentree.cpp Outdated Show resolved Hide resolved
@briansull
Copy link
Contributor Author

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc.

enum VNFunc
{
    // Implicitly, elements of genTreeOps here.
    VNF_Boundary = GT_COUNT,
#define ValueNumFuncDef(nm, arity, commute, knownNonNull, sharedStatic) VNF_##nm,
#include "valuenumfuncs.h"
    VNF_COUNT
};

@tannergooding
Copy link
Member

tannergooding commented Feb 8, 2020

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc

@CarolEidt, do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries, for example)?

It would make the table larger but I think we could probably cut some of the current lookup cost and still make this manageable (like having a table per column as we've done elsewhere; or tracking the first/last entry for a given ISA).

@CarolEidt
Copy link
Contributor

do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries

It would be unfortunate, but I don't think it would be too prohibitive to split it out by number of args, but I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

@briansull
Copy link
Contributor Author

I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

Actually the type size isn't part of the the value number, so there are a couple of nodes where Value Numbering adds an additional arg that holds a constant represent the size of the operation. We do that for GT_CAST and I also had to add this for the SIMD Init operation., As one test was CSE-ing two different sized Vector Inits: 3-way and 4-way. So it awkward but it is something that we can already handle.

            case SIMDIntrinsicInit:
            {
                // Also encode the resulting type as op2vnp
                ValueNumPair op2vnp;
                op2vnp.SetBoth(vnStore->VNForIntCon(INT32(tree->TypeGet())));

                excSetPair     = op1Xvnp;
                normalPair     = vnStore->VNPairForFunc(tree->TypeGet(), GetVNFuncForNode(tree), op1vnp, op2vnp);
                tree->gtVNPair = vnStore->VNPWithExc(normalPair, excSetPair);
                return;
            }

case SIMDIntrinsicWidenLo:
{
// Also encodes the resulting type
encodeResultType = true;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also encode the result type for all of the above SIMD operations.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks!

@AndyAyersMS
Copy link
Member

I would like to see new tests added here, especially given the rather complex IR surface.

Can you also show the impact on FX codegen and/or tests? I am curious how much coverage we're likely to get organically.

How robust will this be as people add new intrinsics?

@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

Beginning PMI PerfScore Diffs for benchstones and benchmarks game
D:\jit-diffs\runtime\x64\dasmset_98
Total PerfScoreUnits of diff: -3149.45 (-0.46% of base)
    diff is an improvement.
Top file regressions (PerfScoreUnits):
       43.10 : Span\SpanBench\SpanBench.dasm (0.22% of base)
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
4 total files with Perf Score differences (3 improved, 1 regressed), 78 unchanged.
Top method regressions (percentages):
       43.10 (14.43% of base) : Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods)
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
Top method improvements (percentages):
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
      -15.93 (-1.58% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
       -0.75 (-0.96% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
       -2.25 (-0.49% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
       -0.75 (-0.48% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
       -0.60 (-0.18% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:DoBench(int,int):ref
24 total methods with Perf Score differences (22 improved, 2 regressed), 1869 unchanged.

@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll
Completed PMI PerfScore Diffs for System.Private.CoreLib.dll in 148.60s
in D:\jit-diffs\runtime\x64\dasmset_96

Top method regressions (percentages):
       11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int
       11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods)
        1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods)
Top method improvements (percentages):
       -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this
      -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
      -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
       -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this
       -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
       -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods)
      -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods)
       -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods)
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods)
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
       -1.00 (-5.00% of base) : System.Private.CoreLib.dasm - Vector2:LengthSquared():float:this
       -1.50 (-4.93% of base) : System.Private.CoreLib.dasm - VectorMath:Lerp(Vector128`1,Vector128`1,Vector128`1):Vector128`1
       -9.90 (-3.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
       -1.00 (-3.17% of base) : System.Private.CoreLib.dasm - Vector2:Length():float:this
      -17.85 (-2.64% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
      -10.02 (-2.07% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
       -2.17 (-1.56% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long
       -1.64 (-1.42% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiByte_Default(long,long):long
       -1.50 (-0.87% of base) : System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long (2 methods)
       -1.26 (-0.51% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long (2 methods)
       -0.75 (-0.23% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4
       -0.85 (-0.19% of base) : System.Private.CoreLib.dasm - SpanHelpers:LastIndexOf(byref,ushort,int):int
       -0.45 (-0.08% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
42 total methods with Perf Score differences (36 improved, 6 regressed), 22012 unchanged.

@briansull
Copy link
Contributor Author

briansull commented Feb 22, 2020

The regression in Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods)
We make a simd32 into a CSE, which requires an additional two caller saved ymm registers to be spilled in the prolog. And this ymm register is live across a call, which since only half of the register is preserved we have to save and restore half of it at a callsite:

prolog

+      vmovaps  qword ptr [rsp+70H], xmm6
+      vmovaps  qword ptr [rsp+60H], xmm7

Loop:

+     vextractf128 ymm7, ymm6, 1
       call     Span`1:Fill(Vector`1):this
       inc      edi
       cmp      edi, esi
+     vinsertf128 ymm6, ymm6, ymm7, 1

@@ -2461,6 +2461,28 @@ class CSE_Heuristic
extra_yes_cost *= 2; // full cost if we are being Conservative
}
}

#ifdef FEATURE_SIMD
// SIMD types will cause a SIMD register to be spilled/restored
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "may cause".
I think you should have a single comment that talks about the heuristics for estimating register save/restore/spill costs. First, it is conservative to assume that each SIMD CSE will require an additional register to be saved/restored in prolog/epilog, as there may be disjoint CSE live ranges that can utilize the same register.
Second, this code presumes that the CSEs will be live arcoss a call, which may be true more often than not, but it also assumes that each use will require a restore, though it may be that multiple uses will not span a call.
While it may be fine to be conservative in these ways, I think we should accurately describe those conservative assumptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

{
spillSimdRegInProlog++; // we likely need to spill an extra register to save the upper half

// Increase the cse_use_cost since at use sites we have to generate code to spill/restore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, "we have to" should be "we may have to"

@briansull
Copy link
Contributor Author

briansull commented Feb 25, 2020

Here is an example where value number of SIMD types produced a very nice win:

runtime\src\coreclr\tests\src\JIT\Regression\JitBlue\GitHub_11816\GitHub_11816.cs:

    static void DoSomeWorkWithAStruct(ref Vector<float> source, out Vector<float> result)
    {
        StructType u;
        u.A = new Vector<float>(2) * source;
        u.B = new Vector<float>(3) * source;
        u.C = new Vector<float>(4) * source;
        u.D = new Vector<float>(5) * source;
        u.E = new Vector<float>(6) * source;
        u.F = new Vector<float>(7) * source;
        result = u.A + u.B + u.C + u.D + u.E + u.F;
    }

We create the following CSE's to hold source, u.A, u.B, u.C, u.D, U.E and u.F

;  V10 cse0         [V10,T10] (  2,  2   )  simd32  ->  mm0         "CSE - aggressive"
;  V11 cse1         [V11,T11] (  2,  2   )  simd32  ->  mm2         "CSE - aggressive"
;  V12 cse2         [V12,T12] (  2,  2   )  simd32  ->  mm3         "CSE - aggressive"
;  V13 cse3         [V13,T13] (  2,  2   )  simd32  ->  mm4         "CSE - aggressive"
;  V14 cse4         [V14,T14] (  2,  2   )  simd32  ->  mm5         "CSE - aggressive"
;  V15 cse5         [V15,T15] (  2,  2   )  simd32  ->  mm1         "CSE - aggressive"
;  V16 cse6         [V16,T03] (  7,  7   )  simd32  ->  mm1         "CSE - aggressive"

Reducing the code from:

       vmovupd  ymmword ptr[rsp+08H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD04]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+28H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD08]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+48H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD12]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+68H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD16]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+88H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD20]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+A8H], ymm0
       vmovupd  ymm0, ymmword ptr[rsp+08H]
       vmovupd  ymm1, ymmword ptr[rsp+28H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+48H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+68H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+88H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+A8H]
       vaddps   ymm0, ymm1
       vmovupd  ymmword ptr[rdx], ymm0

To:

       vbroadcastss ymm2, ymmword ptr[reloc @RWD04]
       vmulps   ymm2, ymm1
       vbroadcastss ymm3, ymmword ptr[reloc @RWD08]
       vmulps   ymm3, ymm1
       vbroadcastss ymm4, ymmword ptr[reloc @RWD12]
       vmulps   ymm4, ymm1
       vbroadcastss ymm5, ymmword ptr[reloc @RWD16]
       vmulps   ymm5, ymm1
       vbroadcastss ymm6, ymmword ptr[reloc @RWD20]
       vmulps   ymm1, ymm6, ymm1
       vaddps   ymm0, ymm2
       vaddps   ymm0, ymm3
       vaddps   ymm0, ymm4
       vaddps   ymm0, ymm5
       vaddps   ymm0, ymm1
       vmovupd  ymmword ptr[rdx], ymm0

PerfScore:

OLD> ; Total bytes of code 255, prolog size 32, PerfScore 175.50, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref)
NEW> ; Total bytes of code 131, prolog size 12, PerfScore 89.10, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref)

@briansull
Copy link
Contributor Author

We perform a nice CSE for the a[i] and b[i] operations in this test case:

src\coreclr\tests\src\JIT\SIMD>emacs BitwiseOperations.cs

        static int TestBool()
        {
            Random random = new Random(13);
            byte[] arr1 = GenerateByteArray(64, random);
            byte[] arr2 = GenerateByteArray(64, random);
            var a = new System.Numerics.Vector<byte>(arr1);
            var b = new System.Numerics.Vector<byte>(arr2);

            var xorR = a ^ b;
            var andR = a & b;
            var orR = a | b;
            int Count = System.Numerics.Vector<byte>.Count;
            for (int i = 0; i < Count; ++i)
            {
                int d = a[i] ^ b[i];
                if (xorR[i] != d)
                {
                    return 0;
                }
                d = a[i] & b[i];
                if (andR[i] != d)
                {
                    return 0;
                }
                d = a[i] | b[i];
                if (orR[i] != d)
                {
                    return 0;
                }
            }
            return 100;
        }

@briansull
Copy link
Contributor Author

@AndyAyersMS I believe that we have a reasonable amount of test coverage here.

I will add a Config variable to allow us to disable this new functionality and revert to the old behavior.

@briansull
Copy link
Contributor Author

briansull commented Feb 26, 2020

We also get this CSE now in:
coreclr\tests\src\JIT\Performance\CodeQuality\SIMD\ConsoleMandel\Abstractions.cs

Real * Imaginary

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
        public ComplexVecFloat square()
        {
            return new ComplexVecFloat(Real * Real - Imaginary * Imaginary, Real * Imaginary + Real * Imaginary);
        }
CSE candidate #03, vn=$189 in BB01, [cost= 8, size= 7]: 
N014 (  8,  7) CSE #03 (use)[000020] ---XG-------              *  SIMD      simd32 float * <l:$24f, c:$252>
N009 (  3,  2) CSE #01 (use)[000017] *--XG-------              +--*  IND       simd32 <l:$241, c:$250>
N008 (  1,  1)              [000016] ------------              |  \--*  LCL_VAR   byref  V00 this         u:1 Zero Fseq[Real] $80
N013 (  4,  4) CSE #02 (use)[000019] *--XG-------              \--*  IND       simd32 <l:$246, c:$251>
N012 (  2,  2)              [000069] -------N----                 \--*  ADD       byref  $280
N010 (  1,  1)              [000018] ------------                    +--*  LCL_VAR   byref  V00 this         u:1 (last use) $80
N011 (  1,  1)              [000068] ------------                    \--*  CNS_INT   long   32 field offset Fseq[Imaginary] $c1
G_M53767_IG02:
       vmovupd  ymm0, ymmword ptr[rcx]
       vmulps   ymm1, ymm0, ymm0
       vmovupd  ymm2, ymmword ptr[rcx+32]
       vmulps   ymm3, ymm2, ymm2
       vsubps   ymm1, ymm3
 <     vmulps   ymm3, ymm0, ymm2
       vmulps   ymm0, ymm2
 <    vaddps   ymm0, ymm3, ymm0
 >    vaddps   ymm0, ymm0
       vmovupd  ymmword ptr[rdx], ymm1
       vmovupd  ymmword ptr[rdx+32], ymm0
       mov      rax, rdx
<					;; bbWeight=1    PerfScore 32.25
>					;; bbWeight=1    PerfScore 29.25
G_M53767_IG03:
       vzeroupper 
       ret      
						;; bbWeight=1    PerfScore 2.00

<; Total bytes of code 52, prolog size 3, PerfScore 41.45, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this
>; Total bytes of code 48, prolog size 3, PerfScore 37.95, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this


   Default 0, ValueNumbering of SIMD nodes and HW Intrinsic nodes enabled
   If 1, then disable ValueNumbering of SIMD nodes
   If 2, then disable ValueNumbering of HW Intrinsic nodes
   If 3, disable both SIMD and HW Intrinsic nodes
@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies
Completed PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies in 685.82s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_108\base d:\jit-diffs\runtime\x64\dasmset_108\diff
Analyzing PerfScore diffs...
PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Perf Score diffs:
(Lower is better)
Total PerfScoreUnits of diff: -1179.71 (-0.00% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
     -536.77 : System.Private.CoreLib.dasm (-0.02% of base)
     -196.30 : Microsoft.Diagnostics.FastSerialization.dasm (-0.07% of base)
     -113.13 : System.Data.Common.dasm (-0.00% of base)
      -75.97 : System.Linq.Parallel.dasm (-0.01% of base)
      -47.00 : System.Text.Encodings.Web.dasm (-0.34% of base)
      -44.70 : Newtonsoft.Json.dasm (-0.00% of base)
      -34.90 : System.Collections.dasm (-0.02% of base)
      -31.99 : System.Collections.Immutable.dasm (-0.01% of base)
      -20.18 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
      -18.77 : System.Threading.Channels.dasm (-0.03% of base)
      -13.15 : System.Private.Xml.Linq.dasm (-0.02% of base)
Top method regressions (percentages):
       11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int
       11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int
        1.50 ( 0.79% of base) : System.Collections.Immutable.dasm - Node:Add(Vector`1):Node:this
        1.05 ( 0.48% of base) : System.Linq.dasm - Lookup`2:GetGrouping(Vector`1,bool):Grouping`2:this
        2.03 ( 0.44% of base) : System.Collections.Immutable.dasm - Builder:set_Count(int):this (7 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods)
        1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods)
        0.34 ( 0.08% of base) : Microsoft.Diagnostics.FastSerialization.dasm - GrowableArray`1:set_Count(int):this (7 methods)
        1.30 ( 0.07% of base) : System.Collections.Immutable.dasm - ImmutableSortedSet`1:LeafToRootRefill(IEnumerable`1):ImmutableSortedSet`1:this (7 methods)
Top method improvements (percentages):
       -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this
      -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
      -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -23.00 (-17.73% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_UnsafeRelaxedJavaScriptEncoder(Vector128`1):Vector128`1 (2 methods)
       -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this
      -20.10 (-14.71% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_DefaultJavaScriptEncoderBasicLatin(Vector128`1):Vector128`1
       -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
       -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -35.20 (-10.49% of base) : System.Collections.Immutable.dasm - ImmutableInterlocked:GetOrAdd(byref,Vector`1,Func`2):long
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
       -7.50 (-9.77% of base) : System.Collections.dasm - HashSet`1:AddValue(int,int,Vector`1):this
       -5.90 (-9.56% of base) : System.DirectoryServices.AccountManagement.dasm - PrincipalValueCollection`1:Insert(int,Vector`1):this
      -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods)
      -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods)
       -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods)
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
       -3.40 (-7.11% of base) : System.Threading.Channels.dasm - Deque`1:DequeueHead():Vector`1:this
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods)
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods)
       -3.50 (-6.31% of base) : System.Threading.Channels.dasm - Deque`1:EnqueueTail(Vector`1):this
       -2.50 (-6.27% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateAsciiMask(Vector128`1):Vector128`1
       -3.10 (-5.96% of base) : System.Collections.dasm - Stack`1:PushWithResize(Vector`1):this
      -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
       -3.10 (-5.50% of base) : Microsoft.Diagnostics.FastSerialization.dasm - SegmentedList`1:Add(Vector`1):this

@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release
Completed PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release in 113.61s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_109\base d:\jit-diffs\runtime\x64\dasmset_109\diff
Total PerfScoreUnits of diff: -3192.55 (-0.47% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
3 total files with Perf Score differences (3 improved, 0 regressed), 79 unchanged.
Top method regressions (percentages):
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
Top method improvements (percentages):
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this

@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD
Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD in 141.91s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_110\base d:\jit-diffs\runtime\x64\dasmset_110\diff
Analyzing PerfScore diffs...
Total PerfScoreUnits of diff: -1789.36 (-0.16% of base)
    diff is an improvement.
Top file regressions (PerfScoreUnits):
       79.83 : VectorRelOp_ro\VectorRelOp_ro.dasm (0.68% of base)
        9.90 : Matrix4x4_ro\Matrix4x4_ro.dasm (3.23% of base)
        0.10 : VectorIntEquals_ro\VectorIntEquals_ro.dasm (0.00% of base)
Top file improvements (PerfScoreUnits):
     -770.70 : VectorMatrix_ro\VectorMatrix_ro.dasm (-0.26% of base)
     -747.30 : VectorConvert_ro\VectorConvert_ro.dasm (-4.99% of base)
     -110.30 : Vector3Interop_ro\Vector3Interop_ro.dasm (-4.96% of base)
      -67.88 : VectorGet_ro\VectorGet_ro.dasm (-1.17% of base)
      -66.40 : VectorExp_ro\VectorExp_ro.dasm (-1.42% of base)
      -65.60 : BitwiseOperations_ro\BitwiseOperations_ro.dasm (-8.52% of base)
      -24.00 : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm (-1.66% of base)
       -9.30 : VectorMul_ro\VectorMul_ro.dasm (-0.17% of base)
       -6.10 : LdfldGeneric_ro\LdfldGeneric_ro.dasm (-4.35% of base)
       -5.18 : VectorInitN_ro\VectorInitN_ro.dasm (-0.21% of base)
       -3.08 : VectorSqrt_ro\VectorSqrt_ro.dasm (-0.11% of base)
       -2.60 : StoreElement_ro\StoreElement_ro.dasm (-3.76% of base)
       -0.75 : Plane_ro\Plane_ro.dasm (-0.38% of base)
Top method regressions (percentages):
        9.90 ( 3.28% of base) : Matrix4x4_ro\Matrix4x4_ro.dasm - Matrix4x4Test:Matrix4x4CreateScaleCenterTest3():int
       31.91 ( 1.95% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(int,int):int
       28.11 ( 1.68% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(short,short):int
       25.11 ( 1.51% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(long,long):int
        0.10 ( 0.15% of base) : VectorIntEquals_ro\VectorIntEquals_ro.dasm - VectorTest:VectorIntEquals():int
Top method improvements (percentages):
      -65.60 (-23.85% of base) : BitwiseOperations_ro\BitwiseOperations_ro.dasm - Program:TestBool():int
      -65.65 (-18.22% of base) : VectorExp_ro\VectorExp_ro.dasm - VectorExpTest`1:VectorExp(Vector`1,double,double,double):int
      -42.18 (-17.87% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(long,int):int
      -20.45 (-13.18% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Array(ref,int)
      -20.70 (-12.09% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg(int,Vector3,String,Vector3)
     -116.80 (-11.82% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleSingle(Vector`1,Vector`1):int
      -45.65 (-11.26% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix2(Vector3,float,float,float,float,float,float,float,Vector3,float,float,Vector3,float)
      -25.70 (-11.21% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(double,int):int
      -92.40 (-11.13% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt16And8(Vector`1,Vector`1):int
      -90.80 (-10.97% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt32And16(Vector`1,Vector`1):int
      -89.40 (-10.34% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt64And32(Vector`1,Vector`1):int
      -93.20 (-10.30% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt16And8(Vector`1,Vector`1):int
      -91.00 (-10.09% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt32And16(Vector`1,Vector`1):int
      -89.40 (-9.53% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt64And32(Vector`1,Vector`1):int
      -42.00 (-8.39% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertSingleInt(Vector`1):int
      -23.50 (-8.37% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix(Vector3,float,float,float,float,float,float,float,Vector3,float,float)
      -42.30 (-7.26% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
       -6.10 (-4.84% of base) : LdfldGeneric_ro\LdfldGeneric_ro.dasm - Program:Main(ref):int
      -12.00 (-4.25% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorAndFilter(ref,ref):ref
      -12.00 (-4.09% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorFilter(ref,ref):ref

@briansull
Copy link
Contributor Author

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality
Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality in 112.49s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_111\base d:\jit-diffs\runtime\x64\dasmset_111\diff
Analyzing PerfScore diffs...
Total PerfScoreUnits of diff: -3537.58 (-0.50% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -345.03 : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm (-1.54% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
4 total files with Perf Score differences (4 improved, 0 regressed), 80 unchanged.
Top method regressions (percentages):
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
        4.70 ( 1.55% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorMath:Log(Vector256`1):Vector256`1
        2.50 ( 0.46% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Camera:Create(VectorPacket256,VectorPacket256):Camera
Top method improvements (percentages):
      -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_2(VectorPacket256):VectorPacket256:this
      -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_4(VectorPacket256):VectorPacket256:this
       -9.00 (-18.69% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorPacket256:op_Multiply(Vector256`1,VectorPacket256):VectorPacket256
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
      -35.00 (-6.45% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:CreateDefaultScene():Scene
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
      -33.93 (-4.32% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:Shade(Intersections,RayPacket256,Scene,int):VectorPacket256:this
     -239.80 (-3.82% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:GetNaturalColor(Vector256`1,VectorPacket256,VectorPacket256,VectorPacket256,Scene):VectorPacket256:this

@briansull
Copy link
Contributor Author

We detect and make a new CSE in:

coreclr\tests\src\JIT\Performance\CodeQuality\BenchmarksGame\mandelbrot\mandelbrot-7.cs

The expression:
Zi = Zr * Zi + Zr * Zi + vCiby;
has the sub expression (Zr *Zi) occurring twice, so we make it into a CSE:
Saving about 10% of the loop cost.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static unsafe byte GetByte(double* pCrb, double Ciby, int x, int y)
        {
            // This currently does not do anything special for 'Count > 2'

            var res = 0;

            for (var i = 0; i < 8; i += 2)
            {
                var Crbx = Unsafe.Read<Vector<double>>(pCrb + x + i);
                var Zr = Crbx;
                var vCiby = new Vector<double>(Ciby);
                var Zi = vCiby;

                var b = 0;
                var j = 49;

                do
                {
                    var nZr = Zr * Zr - Zi * Zi + Crbx;
                    Zi = Zr * Zi + Zr * Zi + vCiby;
                    Zr = nZr;

We also make the CSE when we inline the GetByte method into the lamba function :

                Parallel.For(0, size, y => {
                    var offset = y * lineLength;

                    for (var x = 0; x < lineLength; x++)
                    {
                        data[offset + x] = GetByte(_Crb, Cib[y], x * 8, y);
                    }
                });

@briansull
Copy link
Contributor Author

I verified that setting
set COMPLUS_JitDisableSimdVN=3
will disable my changes.

@briansull briansull merged commit 4ef3cc6 into dotnet:master Feb 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants