-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avx512 extract most significant bits #82731
Avx512 extract most significant bits #82731
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsPR addresses #80820. This adds a new type, A future PR will implement mask registers in full, which depends upon As an example, the following code... ulong x = v1.ExtractMostSignificantBits(); // v1 is `Vector512<long>` roughly lowers to vmovups zmm0, zmmword ptr[rcx]
vpmovq2m k1, zmm0
kmovb rax, k1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the 1st pass. I have reviewed just the portion that is not part of #80960
@dotnet/avx512-contrib can we get a rerun? Looks like some stuff was timed out? |
Looks like everything is passing, I have made the requested changes if you would like to give it another pass. |
Changes LGTM. Once #80960 is merged, let us do a final superpmi measurement before merging. |
@anthonycanino Please resolve the conflicts. |
6cf682d
to
40a46eb
Compare
I just pulled the changes onto a fresh branch based on main after its dependent PR was merged. |
Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check? Also, I am surprised to see diffs in arm64. There are few things that should be under |
src/coreclr/jit/typelist.h
Outdated
@@ -64,6 +64,7 @@ DEF_TP(SIMD16 ,"simd16" , TYP_SIMD16, TI_STRUCT,16,16, 16, 4,16, VTF_S|VTF_ | |||
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, TI_STRUCT,32,32, 32, 8,16, VTF_S|VTF_VEC) | |||
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, TI_STRUCT,64,64, 64, 16,16, VTF_S|VTF_VEC) | |||
#endif // TARGET_XARCH | |||
DEF_TP(MASK ,"mask" , TYP_MASK, TI_STRUCT,8, 8, 8, 2,8, VTF_S) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the type be VTF_S|VTF_VEC
? That would let varTypeIsSIMD()
be true; do we want that? E.g., it would mean varTypeUsesFloatReg
returns true
, which we probably don't want since TYP_MASK uses a new register type. @tannergooding
Also, I wonder if it should be VTF_S
. It's not really a struct, is it, like the other SIMD types?
Maybe it should be VTF_ANY
so it doesn't have any default handling. E.g., we probably don't want varTypeIsStruct()
to return true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a struct and will eventually map to the VectorMask<T>
type(s).
Under the hood we have just k0-k7
which are 64-bits and while the actual hardware manuals don't differentiate further than k#
, the .NET and C++ intrinsics do. -- .NET has VectorMask#<T>
while C++ has __mmask8/16/32/64
. So VectorMask256<byte>
is equivalent to __mmask16
since it has 16 entries.
Given that, is just MASK
sufficient or do we need MASK8/16/32/64
? If we have more than one, presumably we'd have this be VTF_S|VTF_MSK
. -- I could see us going either way with there being different tradeoffs for each approach.
@kunalspathak Looks better. @BruceForstall I mostly addressed your comments, with one caveat --- right now, |
@anthonycanino Where is the TP regression coming from? |
Investigating with |
Throughput impact on Windows x64The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better. linux arm64Overall (-0.02% to -0.01%)
MinOpts (-0.08% to -0.01%)
FullOpts (-0.02% to -0.00%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
osx arm64Overall (-0.02% to -0.01%)
MinOpts (-0.08% to -0.01%)
FullOpts (-0.02% to -0.00%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
linux x64Overall (+0.13% to +0.26%)
MinOpts (+0.42% to +0.67%)
FullOpts (+0.12% to +0.16%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
windows arm64Overall (-0.02% to -0.01%)
MinOpts (-0.08% to -0.01%)
FullOpts (-0.02% to -0.00%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
windows x64Overall (+0.11% to +0.27%)
MinOpts (+0.43% to +0.66%)
FullOpts (+0.11% to +0.16%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
Throughput impact on Windows x86The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better. linux armOverall (-0.01% to -0.00%)
MinOpts (-0.03% to -0.00%)
FullOpts (-0.01% to -0.00%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
windows x86Overall (+0.06% to +0.10%)
MinOpts (+0.15% to +0.28%)
FullOpts (+0.05% to +0.08%)
DetailsAll contexts:
MinOpts contexts:
FullOpts contexts:
|
else if (thisType == TYP_MASK) | ||
{ | ||
availableRegs[i] = &availableMaskRegs; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a general case like the other where integers are the most common, but they're at the bottom of the checked cases for the loop.
Presumably changing this to either a switch
or to prioritize setting availableIntRegs
as the first path would be beneficial so we don't have "one extra check" before hitting the path.
Actually, looking at this, this entire thing is just very unnecessarily expensive. We could track the classification in the typelist.h
so we could do:
#define DEF_TP(tn, nm, jitType, verType, sz, sze, asze, st, al, tf, ar) availableRegs[static_cast<int>(TYP_##tn)] = &ar;
#include "typelist.h"
#undef DEF_TP
TYP_COUNT
This would involve no loops, no complex checks as to type kind/etc.
@kunalspathak, thoughts? Probably not that impactful overall, since its just part of the LSRA constructor; but still seems "meaningful" given we have around 25 TYP_*
defines and this is doing 4-5 checks to initialize 17 of them and 3-4 checks to initialize 6 others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- Not saying this needs to be handled in this PR, its something we can do as a follow up if its something we want to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated #83109 to take care of it.
PhasedVar<regMaskTP> availableFloatRegs; | ||
PhasedVar<regMaskTP> availableDoubleRegs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalspathak, it's not clear why we have availableFloatRegs
and availableDoubleRegs
. Could you elaborate a bit?
Given that availableDoubleRegs
is used to account for both double
and TYP_SIMD
and that float
/double
/simd
are all the "same registers", it's not immediately clear why we have the split.
Even on Arm32 where you have 1x double register taking up 2x float registers, it seems like it'd be simpler/easier to track this as a single "register file" and for double to simply take up 2 bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- Not something to handle in this PR, more making observations and asking questions as this code is being touched in general and looking for potential opportunities to reduce the impact of changes like this, particularly given how impactful LSRA changes are to overall TP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I don't know historical reason behind having 2 separate variables for floats and double although we have #define RBM_ALLDOUBLE RBM_ALLFLOAT
in all targets except ARM. I will have to spend some time in understanding the implication if we unify them.
Does this look good now? The throughput numbers seem pretty good. |
Diffs -- looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Would like threadsuspend.cpp
removed from the change.
Would like @tannergooding to also review.
I think I've addressed all issues? |
Will get this reviewed a bit later tonight and should be able to get it merged at that time. |
// Technically, K instructions would be considered under the VEX encoding umbrella, but due to | ||
// the instruction table encoding had to be pulled out with the rest of the `INST5` definitions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment needs to be updated slightly since KInstructions are part of the INST3 group now.
@@ -46,7 +53,7 @@ bool emitter::IsSSEOrAVXInstruction(instruction ins) | |||
bool emitter::IsAvx512OrPriorInstruction(instruction ins) | |||
{ | |||
// TODO-XArch-AVX512: Fix check once AVX512 instructions are added. | |||
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION); | |||
return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION)); | |
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION); |
PR addresses #80820.
This adds a new type,
TYP_MASK
which represents Intel opmaskk
registers on xarch platform. Enough infrastructure was added so thatVector512.ExtractMostSignificantBits
could be hardware accelerated onAVX512
capable platforms, which requires the use of the k-mask registers.A future PR will implement mask registers in full, which depends upon
VectorMask
API proposal. At the moment, only a single mask register will ever be used, and is used as an internal register for theExtractMostSignificantBits
method. Follow up PRs will extend this to theVector512
comparison operators, which require a similar approach.As an example, the following code...
roughly lowers to