JIT ARM64-SVE: Add TrueMask and LoadVector #98218
Conversation
Change-Id: I285f8aba668409ca94e11be2489a6d9b50a4ec6e
Change-Id: I3ad4fd9a8d823cb43a9546ba6356006a0907ac57
Note: this serves as a reminder that when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details: WIP patch to add TrueMask and LoadVector support
Wanted to show where I am with getting some API code working. The test in Sve_mine.cs is not intended for merging and is only there until I get template testing working.

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
public unsafe static Vector<byte> LoadVector_ImplicitMask(byte* address)
{
    Vector<byte> mask = Sve.TrueMask(SveMaskPattern.All);
    return Sve.LoadVector(mask, address);
}
```

Generates to:
When run, the test sometimes passes, and sometimes fails.
I assume it's either due to the register allocation code not yet being done, or something missing in GC or related.
Diff results for #98218
Throughput diffs for linux/arm64 ran on windows/x64: MinOpts (-0.01% to -0.00%)
Throughput diffs for osx/arm64 ran on windows/x64: MinOpts (-0.01% to -0.00%)
Throughput diffs for windows/arm64 ran on windows/x64: MinOpts (-0.01% to -0.00%)
Details here
looks good overall
```cpp
// this output can be used as a per-element mask
HW_Flag_ReturnsPerElementMask = 0x10000,

// The intrinsic uses a mask in arg1 to select elements present in the result
```
arg1: Is it always the case?
Can we not just check for TYP_MASK to determine this?
> arg1: Is it always the case?

Yes, that's the SVE convention: result, then mask, then inputs.

> Can we not just check for TYP_MASK to determine this?

Ok, that sounds better. I can look and see how this would be done.
> Can we not just check for TYP_MASK to determine this?

@tannergooding - Looking closer at this, I'm not quite sure what this would entail.
In hwintrinsiclistxarch.h the only reference to a mask is the use of HW_Flag_ReturnsPerElementMask.
I can't see any obvious way for the JIT to know that the first arg of the method is expected to be a predicate mask, other than to use the enum or hardcode it with case statements somewhere.
The JIT can check the type of the actual arg1 child node, but that only tells us what the type actually is, not what the expected type is. I imagine I'll have to write code that says: if the actual type and expected type don't match, then somehow convert arg1 to the expected type.
> I imagine I'll have to write code that says: if the actual type and expected type don't match, then somehow convert arg1 to the expected type.

Yes, basically.
Most intrinsics support masking optionally, so you'll have something similar to this: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19988-L20008. That is, you'll have some bool GenTree::isSveEmbeddedMaskingCompatibleHWIntrinsic() which likely looks up a flag in the hwintrinsiclistarm64.h table to see if that particular intrinsic supports embedded masking/predication.
There are then a handful of intrinsics which require masking. For example, SVE comparison intrinsics may always return a TYP_MASK, in which case you could either add a new entry to the table such as HW_Flag_ReturnsSveMask or explicitly handle it like xarch does here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L3985-L3999
There are then a handful of intrinsics which require mask inputs and which aren't recognized via pattern matching. You would likewise add a flag or manually handle the few of them like this: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L3970-L3983
The insertion of the ConvertVectorToMask and ConvertMaskToVector intrinsics is important since the user may have passed in something that was of the incorrect type. For example, it might've been a mask of bytes where we needed a mask of ints, or it might've been an actual vector where we needed a mask, and vice versa. Likewise it ensures we don't need to check the type on every other intrinsic that does properly take a vector.
We then make this efficient in morph (see https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L10775-L10827) where we ensure that we aren't unnecessarily converting from mask to vector and back to mask, or vice versa. This allows things that take a mask to consume a produced mask directly and gives the optimal codegen expected in most scenarios.
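A minimal sketch of the import-time shape this describes - the flag accessors and conversion helpers below are assumptions for illustration, not necessarily the runtime's exact names:

```cpp
// Hedged sketch: if the table says the intrinsic takes a governing predicate in
// arg1, make sure the imported operand really is a TYP_MASK; likewise, if the
// intrinsic produces a mask but the caller expects a Vector<T>, convert back.
if (HWIntrinsicInfo::IsMaskedOperation(intrinsic) && !op1->TypeIs(TYP_MASK))
{
    // Hypothetical helper that wraps op1 in a ConvertVectorToMask node.
    op1 = gtNewSimdConvertVectorToMaskNode(TYP_MASK, op1, simdBaseJitType, simdSize);
}

if (HWIntrinsicInfo::ReturnsPerElementMask(intrinsic) && (retType != TYP_MASK))
{
    // Hypothetical helper that wraps the result in a ConvertMaskToVector node.
    retNode = gtNewSimdConvertMaskToVectorNode(retNode, retType, simdBaseJitType, simdSize);
}
```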
That was the comment around:
We are notably missing, and need to add, a bit which handles the case where we have LCL_VAR TYP_SIMD = TYP_MASK, because that can currently block the ability to consume a mask directly if it's multi-use. We would ideally have it stored as LCL_VAR TYP_MASK instead (even if the user manually hoisted it as a Vector in C#/IL) and then have the consumers use it as ConvertMaskToVector(LCL_VAR) if they actually needed a vector.
This shouldn't be overly complex to add, however; it's just not been done as of yet.
Right. That feels like it might touch quite a few files. Given the size of this PR, do you think it's worth keeping this PR as is, and then putting the LCL_VAR TYP_MASK in a follow-on, along with the lowering code?
> then putting the LCL_VAR TYP_MASK in a follow-on

Yes, I think this would even be the preferred route, given it's not required and is really its own isolated change.

> along with the lowering code?

Which lowering code is this?
In general I think it's fine for this PR to be the basic plumbing of TYP_MASK support into the Arm64 side of the JIT. As long as TrueMask and LoadVector are minimally working as expected, I think we're golden, and we can extend that to other operations and enable optimizations separately. That is exactly what we did for xarch to help with review and scoping.
> Which lowering code is this?

I added some code to remove the mask->vector->mask and vector->mask->vector conversions. But nothing in this PR uses it because of the lcl var, so I decided not to push it.
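For reference, a minimal sketch of the kind of round-trip folding being described - the OperIs* checks and node shapes are placeholders for illustration, not the unpushed code:

```cpp
// Hedged sketch: ConvertVectorToMask(ConvertMaskToVector(mask)) folds to mask.
// 'node' is a GenTreeHWIntrinsic*; the OperIs* helpers are assumed names.
if (node->OperIsConvertVectorToMask())
{
    GenTree* op = node->Op(1);
    if (op->OperIsHWIntrinsic() && op->AsHWIntrinsic()->OperIsConvertMaskToVector())
    {
        // The two conversions cancel out; consume the underlying mask directly.
        return op->AsHWIntrinsic()->Op(1);
    }
}
```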
Will mark this as ready now.
> Will mark this as ready now.
... but not quite yet, as I need #99049 to merge so I can remove it from this PR.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/Enums.cs
src/coreclr/jit/emitarm64.cpp
```cpp
case INS_sve_ld1h:
case INS_sve_ld1w:
case INS_sve_ld1d:
    return emitIns_R_R_R_I(ins, size, reg1, reg2, reg3, 0, opt);
```
Is there any reason why we can't call emitIns_R_R_R_I() directly from the caller?
There are two ways to do this. Without any special coding in the table, it'll just automatically use the R_R_R() version because that's how many args there are in the intrinsic. I see that elsewhere in NEON etc. this is how it's already done.
Alternatively, we could add an extra flag, something like HW_Flag_extra_zero_arg, or just hardcode it via the HW_Flag_specialcodegen route. That feels like a lot of extra code.
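For comparison, a hedged sketch of what the hardcoded alternative might look like in the HWIntrinsic codegen - the locals (targetReg, op1Reg, op2Reg, emitSize, opt) follow the usual genHWIntrinsic naming and are assumptions here, not code from the PR:

```cpp
// Hedged sketch: emit the load with an explicit zero immediate from codegen,
// instead of letting emitIns_R_R_R forward to the R_R_R_I overload.
case NI_Sve_LoadVector:
    // ld1b/ld1h/ld1w/ld1d {Zt}, Pg/Z, [Xn] - the immediate offset is always 0
    // for the plain LoadVector API, so pass it explicitly.
    GetEmitter()->emitIns_R_R_R_I(ins, emitSize, targetReg, op1Reg, op2Reg, 0, opt);
    break;
```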
New version pushed:
Diff results for #98218
Throughput diffs for linux/arm64 ran on windows/x64: Overall (-0.01% to -0.00%), MinOpts (-0.01% to -0.00%)
Throughput diffs for osx/arm64 ran on windows/x64: Overall (-0.01% to -0.00%), MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64: Overall (-0.01% to -0.00%), MinOpts (-0.01% to +0.00%)
Details here
```cpp
case NI_Sve_CreateTrueMaskUInt16:
case NI_Sve_CreateTrueMaskUInt32:
case NI_Sve_CreateTrueMaskUInt64:
    needBranchTargetReg = !intrin.op1->isContainedIntOrIImmed();
```
This creates an internal register for the "def". Make sure that we create an internal register for the "use" as well. I forgot to do that in one place and fixed it in #98814.
> This creates an internal register for the "def". Make sure that we create an internal register for the "use" as well. I forgot to do that in one place and fixed it in #98814.

I think this is ok as is. The code will use all the generic functionality and work down to the buildInternalRegisterUses() call at the end of the function.
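A small sketch of the pattern in question - this is the generic shape of the LSRA build code being relied on, not an excerpt from the PR:

```cpp
// Hedged sketch: the internal register is defined up front when the node is
// built, and buildInternalRegisterUses() at the end of the build creates the
// matching "use", so the generic fall-through path already covers it.
if (needBranchTargetReg)
{
    buildInternalIntRegisterDefForNode(intrinsicTree); // internal "def"
}
// ... operand uses for the intrinsic's sources are built here ...
buildInternalRegisterUses(); // matching internal "use"
```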
src/coreclr/jit/lsraarm64.cpp
```cpp
switch (intrin.id)
{
    case NI_Sve_LoadVector:
        srcCandidates = RBM_LOWMASK;
```
Is RBM_LOWMASK true for all variants of ld1*, or are there some which could operate on a higher mask register? I am wondering how we can make this easy for the development of other APIs, so that developers don't have to think about which candidates to set for a given intrinsic.
> Is RBM_LOWMASK true for all variants of ld1*, or are there some which could operate on a higher mask register? I am wondering how we can make this easy for the development of other APIs, so that developers don't have to think about which candidates to set for a given intrinsic.

Yes, all ld1* should be the same. We should be able to pull this information automatically.
I've pushed an update which adds HW_Flag_LowMaskedOperation instead of the switch. I'm fairly keen on pushing as much as we can into the table, as it reduces the number of files touched each time an API is added. But I'm a little concerned we'll run out of space for flags - at that point, we can either get creative with flag reuse or turn some flags back into a switch.
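As a sketch of what the table-driven check looks like on the LSRA side (the accessor name mirrors the flag and is an assumption here):

```cpp
// Hedged sketch: consult the hwintrinsiclistarm64.h table flag rather than a
// per-intrinsic switch, restricting the predicate operand to the low mask
// registers (P0-P7) whenever the flag is set.
if (HWIntrinsicInfo::IsLowMaskedOperation(intrin.id))
{
    srcCandidates = RBM_LOWMASK;
}
```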
Change-Id: I9b3f7d73699e73ada7aa4b74a6cdadaa4574e996
Updated version now produces the moves to/from predicates. Test case dump with 4 annotations:
Next I need to add the lowering code which spots mask moves and removes them.
...braries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/Sve.PlatformNotSupported.cs
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Arm/Sve.cs
This reverts commit 6beb760.
LGTM. Thanks!