
Optimize Vector64 and Vector128.Create methods #36267

Merged: 10 commits into dotnet:master on May 13, 2020

Conversation

tannergooding (Member)

This is basically the same as #35857, but for ARM64: it updates the Vector64/Vector128.Create methods to be intrinsic.

They are handled entirely in lowering, where they are replaced with the corresponding constant (as was done for GT_SIMD nodes) or lowered to the correct sequence of HWIntrinsics (which allows containment and other checks to "just work"). This should make it rather trivial to support "partial constants" as well (that is, a vector where, say, half of the inputs are constant and half are not).
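The constant-vs-insert-sequence decision described above can be sketched as follows. This is a simplified illustration only, not the JIT's actual GenTree code; the `Arg` struct and function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the choice lowering makes for a
// Vector64/Vector128.Create call. Names and types are invented.
struct Arg {
    bool    isConst;
    int32_t value; // only meaningful when isConst is true
};

enum class LoweringKind {
    AllConstant,     // replace the whole node with a vector constant
    InsertSequence,  // lower to a duplicate/insert chain of HWIntrinsics
};

LoweringKind ChooseCreateLowering(const std::vector<Arg>& args)
{
    std::size_t cnsArgCnt = 0;
    for (const Arg& a : args) {
        if (a.isConst) {
            cnsArgCnt++;
        }
    }
    // When every element is a constant, the node can be replaced with a
    // materialized constant (e.g. a load from read-only data); otherwise
    // the vector is built element by element with insert intrinsics.
    return (cnsArgCnt == args.size()) ? LoweringKind::AllConstant
                                      : LoweringKind::InsertSequence;
}
```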

It might be beneficial to eventually create a proper GenTreeVecCns node and to try to handle this earlier (which would allow constants to be deduplicated, among other features), but that is a more involved change.

I'm working on getting a jit-diff and will post when it is available.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 12, 2020
tannergooding (Member Author)

CC. @kunalspathak, @CarolEidt, @echesakovMSFT

tannergooding (Member Author)

Also CC. @TamarChristinaArm

@@ -197,6 +197,7 @@ HARDWARE_INTRINSIC(AdvSimd_Arm64, CompareTest, 1
HARDWARE_INTRINSIC(AdvSimd_Arm64, CompareTestScalar, 8, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmtst, INS_cmtst, INS_invalid, INS_cmtst}, HW_Category_SIMDScalar, HW_Flag_Commutative)
HARDWARE_INTRINSIC(AdvSimd_Arm64, Divide, -1, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fdiv, INS_fdiv}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
HARDWARE_INTRINSIC(AdvSimd_Arm64, DuplicateSelectedScalarToVector128, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_dup, INS_dup, INS_invalid, INS_dup}, HW_Category_IMM, HW_Flag_SupportsContainment|HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AdvSimd_Arm64, DuplicateToVector64, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov, INS_mov, INS_invalid, INS_fmov}, HW_Category_SimpleSIMD, HW_Flag_SupportsContainment|HW_Flag_SpecialCodeGen)
tannergooding (Member Author):

ARM actually exposes native intrinsics for these under the same name as the other primitive types (vdup_n_s64, etc). I'm going to include this in the misc proposal for APIs that may be missing.

// Arguments:
// node - The hardware intrinsic node.
//
void Lowering::LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node)
tannergooding (Member Author):

This ended up being quite a bit simpler than x86 as ARM defines insert intrinsics as part of the baseline.


if ((simdSize == 8) && (simdType == TYP_DOUBLE))
{
simdType = TYP_SIMD8;
tannergooding (Member Author):

This is because impFixupStructReturnType changes the node from TYP_SIMD8 to TYP_DOUBLE. I'd imagine it would be desirable to preserve it as TYP_SIMD8 and adjust the various code paths to recognize TYP_SIMD8.

Is that covered by an existing issue?

Contributor:

This is related to the work that @sandreenko is doing to avoid retyping return types.

idx = comp->gtNewIconNode(N, TYP_INT);
BlockRange().InsertBefore(opN, idx);

tmp1 = comp->gtNewSimdHWIntrinsicNode(simdType, tmp1, idx, opN, NI_AdvSimd_Insert, baseType, simdSize);
tannergooding (Member Author):

We are currently exposing AdvSimd.Insert for long, ulong, and double. However, I don't see how these can be supported on ARM32...

Could someone please indicate what instruction would be used? It seems like something we need to go fix.

Contributor:

These should become VMOV. I have a bigger explanation in #35037

Contributor:

> We are currently exposing AdvSimd.Insert for long, ulong, and double. However, I don't see how these can be supported on ARM32...

I put implementation details in the <summary></summary> comments for those

/// <summary>
/// float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane)
/// A32: VMOV.F64 Dd, Dm
/// A64: INS Vd.D[lane], Vn.D[0]
/// </summary>
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data) => Insert(vector, index, data);

/// <summary>
/// int64x2_t vsetq_lane_s64 (int64_t a, int64x2_t v, const int lane)
/// A32: VMOV.64 Dd, Rt, Rt2
/// A64: INS Vd.D[lane], Xn
/// </summary>
public static Vector128<long> Insert(Vector128<long> vector, byte index, long data) => Insert(vector, index, data);

/// <summary>
/// uint64x2_t vsetq_lane_u64 (uint64_t a, uint64x2_t v, const int lane)
/// A32: VMOV.64 Dd, Rt, Rt2
/// A64: INS Vd.D[lane], Xn
/// </summary>
public static Vector128<ulong> Insert(Vector128<ulong> vector, byte index, ulong data) => Insert(vector, index, data);

@@ -1100,141 +1100,8 @@ void Lowering::LowerHWIntrinsic(GenTreeHWIntrinsic* node)
ContainCheckHWIntrinsic(node);
}

union VectorConstant {

@@ -317,6 +317,139 @@ class Lowering final : public Phase
void LowerHWIntrinsicCC(GenTreeHWIntrinsic* node, NamedIntrinsic newIntrinsicId, GenCondition condition);
void LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node);
void LowerFusedMultiplyAdd(GenTreeHWIntrinsic* node);

union VectorConstant {

@@ -4272,7 +4272,8 @@ GenTree* Compiler::impIntrinsic(GenTree* newobjThis,

if (mustExpand && (retNode == nullptr))
{
NO_WAY("JIT must expand the intrinsic!");
assert(!"Unhandled must expand intrinsic, throwing PlatformNotSupportedException");
return impUnsupportedHWIntrinsic(CORINFO_HELP_THROW_PLATFORM_NOT_SUPPORTED, method, sig, mustExpand);
tannergooding (Member Author):

Previously, the JIT would fail fast with "The JIT compiler encountered invalid IL code or an internal limitation."

Now, it will assert in debug mode but will throw PlatformNotSupportedException at runtime.

tannergooding (Member Author), May 12, 2020:

This was changed to not use impUnsupportedHWIntrinsic and instead use gtNewMustThrowException directly.

impUnsupportedHWIntrinsic was renamed to impUnsupportedNamedIntrinsic and moved out of FEATURE_HW_INTRINSIC

}
else

if (result == NI_Illegal)
tannergooding (Member Author), May 12, 2020:

This path accounts for mustExpand for unrecognized HWIntrinsics (and thus will never hit the above assert), in particular because we know that throwing PlatformNotSupportedException is the desired behavior for anything that is not recognized and that is recursive under the System.Runtime.Intrinsics namespace.

All of the code paths are either directly recursive (void MyMethod() => MyMethod()) and under one of S.R.I.X86 or S.R.I.Arm, or are indirectly recursive and directly under S.R.I. An indirectly recursive intrinsic is one like Vector64.Create, which is:

if (AdvSimd.IsSupported)
{
    return Vector64.Create(...);
}

return SoftwareFallback(...);

So the throw of PlatformNotSupportedException will be generated under the dead AdvSimd.IsSupported code path under minOpts and will never actually be hit.
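The dead-branch argument can be modeled outside the JIT. The sketch below is a C++ analogy only (the constant, function names, and fallback are all invented): when the support flag is a compile-time false, the guarded path that would contain the generated throw is never taken, and the software fallback runs instead.

```cpp
#include <cassert>
#include <stdexcept>

// Analogy, not runtime code: model AdvSimd.IsSupported as a compile-time
// constant. Assumed false here, as on a platform without the intrinsic.
constexpr bool kAdvSimdIsSupported = false;

[[noreturn]] void ThrowPlatformNotSupported()
{
    throw std::runtime_error("PlatformNotSupportedException");
}

// Stand-in for the managed software fallback path.
int SoftwareFallback(int x) { return x * 2; }

int Create(int x)
{
    if (kAdvSimdIsSupported) {
        // On a supporting platform this would be the hardware-accelerated
        // path; on others the expansion of the recursive call becomes a
        // throw, but this branch is dead and never taken.
        ThrowPlatformNotSupported();
    }
    return SoftwareFallback(x);
}
```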

Contributor:

I'm a bit confused about this, as I thought that we always expanded intrinsics, even under minopts.
That said, I think that this code section could use a comment describing the "indirectly recursive" case. In fact, for developers who are not intimately familiar with this code, a few words in general about how/why we reach this case would be useful.

tannergooding (Member Author):

This isn't necessarily about minopts vs optimized, it's about platforms like ARM where HWIntrinsics aren't supported at all or cases like Vector64 on x86.

In both of those scenarios, it is recognized as intrinsic (due to the [Intrinsic]) and as mustExpand (due to the recursion), but it isn't possible to actually expand the call since it isn't recognized and so it was hitting https://github.com/dotnet/runtime/pull/36267/files#diff-b8d003a58643e5595d2920ca5993b952L4275

This was really only a problem for classes like Vector64/Vector128/Vector256 as they are shared between all architectures (where classes like x86.Sse or Arm.AdvSimd have two copies: one which is recursive and one which throws). It looks to have only been showing for minOpts because we drop the dead code entirely otherwise and so we never try to expand the call.


tannergooding (Member Author):

Hmmm... Getting an "Assertion failed 'NYI: Unimplemented node type CLS_VAR_ADDR'". Is there something else used for method constants on ARM64?

}
assert((argCnt == 1) || (argCnt == (simdSize / genTypeSize(baseType))));

if (argCnt == cnsArgCnt)
tannergooding (Member Author):

If argCnt == cnsArgCnt == 1 and the constant is small enough for mov or fmov (immediate), I presume that is better and we should prefer it here?


Contributor:

@tannergooding Indeed. Though when testing the values you have to keep all aliases of mov in mind.

But also, if the alternative is a literal load, I think you should allow some mov/movk as well, as I believe the address calculation for the literal pool itself currently takes a couple of instructions, so you might as well use those for making the constant.

tannergooding (Member Author):

Thanks!

> But also if the alternative is a literal load I think you should allow some mov/movk as well as I believe currently the address calculation for the literal pool itself takes a couple of instructions, so you might as well use those for making the constant.

I will log an issue for this in particular. It seems like something we should more generally support and which we likely don't support today; but I don't think it's worth blocking this PR on getting completed.
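To make the tradeoff concrete: an AArch64 mov/movk sequence materializes a 64-bit immediate 16 bits at a time, so the inline cost is roughly the number of non-zero halfwords. The helper below is an illustrative sketch, not JIT code; it deliberately ignores the inverted (movn-based) encodings, which can do better for values that are mostly ones.

```cpp
#include <cassert>
#include <cstdint>

// Count the mov/movk instructions needed to build a 64-bit immediate,
// one 16-bit halfword at a time (movz for the first non-zero halfword,
// movk for each subsequent one). Ignores movn-based sequences.
int MovMovkInstructionCount(uint64_t imm)
{
    int count = 0;
    for (int shift = 0; shift < 64; shift += 16) {
        if (((imm >> shift) & 0xFFFF) != 0) {
            count++;
        }
    }
    return (count == 0) ? 1 : count; // zero still needs one mov
}
```

Under the assumption that a literal-pool load costs an address calculation plus the load itself, constants needing only one or two mov/movk instructions would plausibly be cheaper to build inline; the exact cutoff is a codegen-tuning decision.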

tannergooding (Member Author):

> Hmmm... Getting a Assertion failed 'NYI: Unimplemented node type CLS_VAR_ADDR'. Is there something else used for method constants on ARM64?

I added basic support for GT_CLS_VAR_ADDR to the ARM64 JIT to handle this.

CarolEidt (Contributor) left a comment:

Overall LGTM, just a couple of suggestions

impPopStack();
}

return gtNewMustThrowException(CORINFO_HELP_THROW_PLATFORM_NOT_SUPPORTED, JITtype2varType(sig->retType),
Contributor:

Excellent - this is a much better way to handle this. Thanks!


argList = argList->Rest();
}

assert(N == (argCnt - 1));
Contributor:

It seems like it would be more straightforward to include this last case in the loop above - unless I'm missing something it's identical to all the other cases, except in the case where there are 2 args, and checking for that inside the loop seems pretty low cost and quite a bit cleaner.

tannergooding (Member Author):

It's different because it modifies the original node, rather than introducing a new node.

Contributor:

Ah, makes sense.

tannergooding (Member Author):

The diff is:

Found 303 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: 292 (0.001% of base)
    diff is a regression.

Total byte diff includes 1256 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had   17 unique methods,     1256 unique bytes

Top file regressions (bytes):
         528 : System.Private.CoreLib.dasm (0.010% of base)

Top file improvements (bytes):
        -180 : System.Text.Encodings.Web.dasm (-0.471% of base)
         -48 : System.Memory.dasm (-0.016% of base)
          -8 : System.Collections.dasm (-0.001% of base)

4 total files with Code Size differences (3 improved, 1 regressed), 262 unchanged.

Top method regressions (bytes):
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|12_0(ubyte):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|17_0(byte):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|22_0(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|25_0(byte,byte,byte,byte,byte,byte,byte,byte):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|14_0(short):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|19_0(ushort):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|23_0(short,short,short,short):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|27_0(ushort,ushort,ushort,ushort):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|15_0(int):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|18_0(float):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|20_0(int):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|24_0(int,int):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|26_0(float,float):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|28_0(int,int):Vector64`1 (0 base, 1 diff methods)
          36 (25.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector128`1
          36 (75.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(short,short,short,short,short,short,short,short):Vector128`1
          36 (25.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte):Vector128`1
          36 (75.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(ushort,ushort,ushort,ushort,ushort,ushort,ushort,ushort):Vector128`1
          24 (75.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(long,long):Vector128`1 (2 methods)
          24 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|16_0(long):Vector64`1 (0 base, 1 diff methods)

Top method improvements (bytes):
        -180 (-61.644% of base) : System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
        -132 (-62.264% of base) : System.Private.CoreLib.dasm - Vector64:Create(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector64`1
        -132 (-62.264% of base) : System.Private.CoreLib.dasm - Vector64:Create(byte,byte,byte,byte,byte,byte,byte,byte):Vector64`1
        -112 (-73.684% of base) : System.Private.CoreLib.dasm - Vector64:Create(int):Vector64`1 (2 methods)
        -104 (-59.091% of base) : System.Private.CoreLib.dasm - Vector128:Create(int,int,int,int):Vector128`1 (2 methods)
         -76 (-76.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(ubyte):Vector64`1
         -76 (-76.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(byte):Vector64`1
         -60 (-71.429% of base) : System.Private.CoreLib.dasm - Vector64:Create(short):Vector64`1
         -60 (-71.429% of base) : System.Private.CoreLib.dasm - Vector64:Create(ushort):Vector64`1
         -56 (-73.684% of base) : System.Private.CoreLib.dasm - Vector64:Create(float):Vector64`1
         -52 (-52.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(short,short,short,short):Vector64`1
         -52 (-52.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(ushort,ushort,ushort,ushort):Vector64`1
         -32 (-6.667% of base) : System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
         -16 (-3.053% of base) : System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
          -8 (-4.255% of base) : System.Collections.dasm - BitArray:.cctor()
          -8 (-16.667% of base) : System.Private.CoreLib.dasm - Vector64:Create(long):Vector64`1 (2 methods)
          -4 (-0.571% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long
          -4 (-0.725% of base) : System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long

Top method regressions (percentages):
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|12_0(ubyte):Vector64`1 (0 base, 1 diff methods)
          16 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|13_0(double):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|14_0(short):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|15_0(int):Vector64`1 (0 base, 1 diff methods)
          24 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|16_0(long):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|17_0(byte):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|18_0(float):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|19_0(ushort):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|20_0(int):Vector64`1 (0 base, 1 diff methods)
          24 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|21_0(long):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|22_0(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|23_0(short,short,short,short):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|24_0(int,int):Vector64`1 (0 base, 1 diff methods)
         100 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|25_0(byte,byte,byte,byte,byte,byte,byte,byte):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|26_0(float,float):Vector64`1 (0 base, 1 diff methods)
          84 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|27_0(ushort,ushort,ushort,ushort):Vector64`1 (0 base, 1 diff methods)
          76 (     ∞ of base) : System.Private.CoreLib.dasm - Vector64:<Create>g__SoftwareFallback|28_0(int,int):Vector64`1 (0 base, 1 diff methods)
          20 (125.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(float,float,float,float):Vector128`1
          36 (75.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(short,short,short,short,short,short,short,short):Vector128`1
          36 (75.000% of base) : System.Private.CoreLib.dasm - Vector128:Create(ushort,ushort,ushort,ushort,ushort,ushort,ushort,ushort):Vector128`1

Top method improvements (percentages):
         -76 (-76.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(ubyte):Vector64`1
         -76 (-76.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(byte):Vector64`1
         -56 (-73.684% of base) : System.Private.CoreLib.dasm - Vector64:Create(float):Vector64`1
        -112 (-73.684% of base) : System.Private.CoreLib.dasm - Vector64:Create(int):Vector64`1 (2 methods)
         -60 (-71.429% of base) : System.Private.CoreLib.dasm - Vector64:Create(short):Vector64`1
         -60 (-71.429% of base) : System.Private.CoreLib.dasm - Vector64:Create(ushort):Vector64`1
        -132 (-62.264% of base) : System.Private.CoreLib.dasm - Vector64:Create(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector64`1
        -132 (-62.264% of base) : System.Private.CoreLib.dasm - Vector64:Create(byte,byte,byte,byte,byte,byte,byte,byte):Vector64`1
        -180 (-61.644% of base) : System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
        -104 (-59.091% of base) : System.Private.CoreLib.dasm - Vector128:Create(int,int,int,int):Vector128`1 (2 methods)
         -52 (-52.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(short,short,short,short):Vector64`1
         -52 (-52.000% of base) : System.Private.CoreLib.dasm - Vector64:Create(ushort,ushort,ushort,ushort):Vector64`1
          -8 (-16.667% of base) : System.Private.CoreLib.dasm - Vector64:Create(long):Vector64`1 (2 methods)
         -32 (-6.667% of base) : System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
          -8 (-4.255% of base) : System.Collections.dasm - BitArray:.cctor()
         -16 (-3.053% of base) : System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
          -4 (-0.725% of base) : System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long
          -4 (-0.571% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long

42 total methods with Code Size differences (18 improved, 24 regressed), 244865 unchanged.

33 files had text diffs but no metric diffs.
System.Private.Xml.dasm had 138 diffs
System.ComponentModel.TypeConverter.dasm had 66 diffs
System.Reflection.MetadataLoadContext.dasm had 30 diffs
Microsoft.CodeAnalysis.VisualBasic.dasm had 28 diffs
Microsoft.Extensions.DependencyInjection.dasm had 22 diffs
System.ComponentModel.Composition.dasm had 22 diffs
System.Composition.Convention.dasm had 20 diffs
System.Linq.Expressions.dasm had 14 diffs
Newtonsoft.Json.dasm had 10 diffs
System.Data.Odbc.dasm had 8 diffs
System.Linq.Parallel.dasm had 8 diffs
System.Text.Json.dasm had 8 diffs
System.Diagnostics.StackTrace.dasm had 6 diffs
System.DirectoryServices.dasm had 6 diffs
Microsoft.Extensions.Logging.Abstractions.dasm had 4 diffs
System.DirectoryServices.AccountManagement.dasm had 4 diffs
System.Drawing.Common.dasm had 4 diffs
System.Net.Requests.dasm had 4 diffs
System.Security.Cryptography.Pkcs.dasm had 4 diffs
System.Text.RegularExpressions.dasm had 4 diffs

It is slightly "smaller" due to the additional methods that were introduced in S.P.Corelib to separate out the Software Fallback (same reason why Vector128.Create shows as a "regression").
-- We are, however (and were prior to this PR), inserting unnecessary uxtb/sxtb (and the ushort/short equivalents) before Insert in several cases. Ideally we would optimize that away.

Diffs are largely cases like:

- mov     w0, #0xd1ffab1e
- dup     v8.8h, w0
+ ldr     q8, [@RWD00]
  ...
+ RWD00  db	080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh

We should also see a couple other wins once #36267 (comment) is resolved.

CarolEidt (Contributor) left a comment:

LGTM - thanks!

@tannergooding tannergooding merged commit 9924705 into dotnet:master May 13, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020