Optimize Vector64 and Vector128.Create methods #36267
Conversation
CC @kunalspathak, @CarolEidt, @echesakovMSFT
Also CC @TamarChristinaArm
@@ -197,6 +197,7 @@ HARDWARE_INTRINSIC(AdvSimd_Arm64, CompareTest, 1
HARDWARE_INTRINSIC(AdvSimd_Arm64, CompareTestScalar, 8, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmtst, INS_cmtst, INS_invalid, INS_cmtst}, HW_Category_SIMDScalar, HW_Flag_Commutative)
HARDWARE_INTRINSIC(AdvSimd_Arm64, Divide, -1, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fdiv, INS_fdiv}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
HARDWARE_INTRINSIC(AdvSimd_Arm64, DuplicateSelectedScalarToVector128, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_dup, INS_dup, INS_invalid, INS_dup}, HW_Category_IMM, HW_Flag_SupportsContainment|HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AdvSimd_Arm64, DuplicateToVector64, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov, INS_mov, INS_invalid, INS_fmov}, HW_Category_SimpleSIMD, HW_Flag_SupportsContainment|HW_Flag_SpecialCodeGen)
ARM actually exposes native intrinsics for these under the same name as the other primitive types (`vdup_n_s64`, etc.). I'm going to include this in the misc proposal for APIs that may be missing.
// Arguments:
//    node - The hardware intrinsic node.
//
void Lowering::LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node)
This ended up being quite a bit simpler than x86, as ARM defines insert intrinsics as part of the baseline.
if ((simdSize == 8) && (simdType == TYP_DOUBLE))
{
    simdType = TYP_SIMD8;
This is because `impFixupStructReturnType` changes the node from `TYP_SIMD8` to `TYP_DOUBLE`. I'd imagine it would be desirable to preserve it as `TYP_SIMD8` and adjust the various code paths to recognize `TYP_SIMD8`.
Is that covered by an existing issue?
This is related to the work that @sandreenko is doing to avoid retyping return types.
idx = comp->gtNewIconNode(N, TYP_INT);
BlockRange().InsertBefore(opN, idx);

tmp1 = comp->gtNewSimdHWIntrinsicNode(simdType, tmp1, idx, opN, NI_AdvSimd_Insert, baseType, simdSize);
We are currently exposing `AdvSimd.Insert` for `long`, `ulong`, and `double`. However, I don't see how these can be supported on ARM32. Could someone please indicate what instruction would be used, because it seems like something we need to go fix.
These should become `VMOV`. I have a bigger explanation in #35037.
> We are currently exposing AdvSimd.Insert for long, ulong, and double. However, I don't see how these can be supported on ARM32...

I put implementation details in the `<summary></summary>` comments for those:
runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs
Lines 5256 to 5261 in 583fa67
/// <summary>
/// float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane)
///   A32: VMOV.F64 Dd, Dm
///   A64: INS Vd.D[lane], Vn.D[0]
/// </summary>
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data) => Insert(vector, index, data);
runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs
Lines 5277 to 5282 in 583fa67
/// <summary>
/// int64x2_t vsetq_lane_s64 (int64_t a, int64x2_t v, const int lane)
///   A32: VMOV.64 Dd, Rt, Rt2
///   A64: INS Vd.D[lane], Xn
/// </summary>
public static Vector128<long> Insert(Vector128<long> vector, byte index, long data) => Insert(vector, index, data);
runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/AdvSimd.cs
Lines 5312 to 5317 in 583fa67
/// <summary>
/// uint64x2_t vsetq_lane_u64 (uint64_t a, uint64x2_t v, const int lane)
///   A32: VMOV.64 Dd, Rt, Rt2
///   A64: INS Vd.D[lane], Xn
/// </summary>
public static Vector128<ulong> Insert(Vector128<ulong> vector, byte index, ulong data) => Insert(vector, index, data);
@@ -1100,141 +1100,8 @@ void Lowering::LowerHWIntrinsic(GenTreeHWIntrinsic* node)
    ContainCheckHWIntrinsic(node);
}

union VectorConstant {
@@ -317,6 +317,139 @@ class Lowering final : public Phase
    void LowerHWIntrinsicCC(GenTreeHWIntrinsic* node, NamedIntrinsic newIntrinsicId, GenCondition condition);
    void LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node);
    void LowerFusedMultiplyAdd(GenTreeHWIntrinsic* node);

union VectorConstant {
This was just moved from https://github.com/dotnet/runtime/pull/36267/files#diff-55de8576070c2b77ed274e8e03c6676fL1103
src/coreclr/src/jit/importer.cpp
@@ -4272,7 +4272,8 @@ GenTree* Compiler::impIntrinsic(GenTree* newobjThis,

 if (mustExpand && (retNode == nullptr))
 {
-    NO_WAY("JIT must expand the intrinsic!");
+    assert(!"Unhandled must expand intrinsic, throwing PlatformNotSupportedException");
+    return impUnsupportedHWIntrinsic(CORINFO_HELP_THROW_PLATFORM_NOT_SUPPORTED, method, sig, mustExpand);
Previously the JIT would fail fast with "The JIT compiler encountered invalid IL code or an internal limitation." Now, it will assert in debug mode but will throw `PlatformNotSupportedException` at runtime.
This was changed to not use `impUnsupportedHWIntrinsic` and instead use `gtNewMustThrowException` directly. `impUnsupportedHWIntrinsic` was renamed to `impUnsupportedNamedIntrinsic` and moved out of `FEATURE_HW_INTRINSIC`.
}
else

if (result == NI_Illegal)
This path accounts for `mustExpand` for unrecognized HWIntrinsics (and thus will never hit the above `assert`), in particular as we know that `throw PlatformNotSupportedException` is the desired behavior for anything not recognized and which is recursive under the `System.Runtime.Intrinsics` namespace.
All of the code paths are either directly recursive (`void MyMethod() => MyMethod()`) and under one of `S.R.I.X86` or `S.R.I.Arm`, or are indirectly recursive and directly under `S.R.I`. An indirectly recursive intrinsic is one like `Vector64.Create`, which is:

```csharp
if (AdvSimd.IsSupported)
{
    return Vector64.Create(...);
}
return SoftwareFallback(...);
```

So the `throw PNSE` will be generated under the dead `AdvSimd.IsSupported` code path under minOpts and will never actually be hit.
I'm a bit confused about this, as I thought that we always expanded intrinsics, even under minopts.
That said, I think that this code section could use a comment describing the "indirectly recursive" case. In fact, for developers who are not intimately familiar with this code, a few words in general about how/why we reach this case would be useful.
This isn't necessarily about `minopts` vs `optimized`; it's about platforms like ARM where HWIntrinsics aren't supported at all, or cases like `Vector64` on x86.
In both of those scenarios, it is recognized as intrinsic (due to the `[Intrinsic]` attribute) and as `mustExpand` (due to the recursion), but it isn't possible to actually expand the call since it isn't recognized, and so it was hitting https://github.com/dotnet/runtime/pull/36267/files#diff-b8d003a58643e5595d2920ca5993b952L4275
This was really only a problem for classes like `Vector64`/`Vector128`/`Vector256`, as they are shared between all architectures (where classes like `x86.Sse` or `Arm.AdvSimd` have two copies: one which is recursive and one which throws). It looks to have only been showing for minOpts because we drop the dead code entirely otherwise, and so we never try to expand the call.
I added a comment elaborating as to why here: https://github.com/dotnet/runtime/pull/36267/files#diff-b8d003a58643e5595d2920ca5993b952R4501-R4517
}
assert((argCnt == 1) || (argCnt == (simdSize / genTypeSize(baseType))));

if (argCnt == cnsArgCnt)
If `argCnt == cnsArgCnt == 1` and the constant is small enough for `mov` or `fmov` (immediate), I presume that is better and we should prefer it here?
@tannergooding Indeed. Though when testing the values you have to keep all aliases of `mov` in mind.
But also, if the alternative is a literal load, I think you should allow some `mov`/`movk` as well, as I believe currently the address calculation for the literal pool itself takes a couple of instructions, so you might as well use those for making the constant.
Thanks!
> But also if the alternative is a literal load I think you should allow some mov/movk as well as I believe currently the address calculation for the literal pool itself takes a couple of instructions, so you might as well use those for making the constant.
I will log an issue for this in particular. It seems like something we should more generally support and which we likely don't support today; but I don't think it's worth blocking this PR on getting completed.
Overall LGTM, just a couple of suggestions
src/coreclr/src/jit/importer.cpp
impPopStack();
}

return gtNewMustThrowException(CORINFO_HELP_THROW_PLATFORM_NOT_SUPPORTED, JITtype2varType(sig->retType),
Excellent - this is a much better way to handle this. Thanks!
argList = argList->Rest();
}

assert(N == (argCnt - 1));
It seems like it would be more straightforward to include this last case in the loop above; unless I'm missing something, it's identical to all the other cases except when there are 2 args, and checking for that inside the loop seems pretty low cost and quite a bit cleaner.
It's different because it modifies the original node, rather than introducing a new node.
Ah, makes sense.
The diff is slightly "smaller" due to the additional methods that were introduced in S.P.Corelib to separate out the Software Fallback. Diffs are largely cases like:

```asm
- mov w0, #0xd1ffab1e
- dup v8.8h, w0
+ ldr q8, [@RWD00]
...
+ RWD00 db 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh, 080h, 07Fh
```

We should also see a couple other wins once #36267 (comment) is resolved.
LGTM - thanks!
This is basically the same as #35857, but for ARM64, and updates the `Vector64`/`Vector128.Create` methods to be intrinsic. They are handled entirely in lowering, where they will be replaced with the corresponding constant (as was done for GT_SIMD nodes) or lowered to the correct sequence of HWIntrinsics (which allows containment and other checks to "just work"). This should make it rather trivial to support "partial constants" as well (that is, a vector where, say, 50% of the inputs are constant and the other half are not).
It might be beneficial to eventually create a proper GenTreeVecCns node and to also try and handle this earlier (which would allow constants to be deduplicated and other features), but that is a more involved change.
I'm working on getting a jit-diff and will post when it is available.