Enable AVX512 embedded masking for most other intrinsics #101886
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
    bool canUseEmbeddedBroadcast() const
    {
        return JitConfig.EnableEmbeddedBroadcast();
    }

    bool canUseEmbeddedMasking() const
    {
        return JitConfig.EnableEmbeddedMasking();
    }
The embedded broadcast/masking support for both AVX512 and SVE is pretty complex in parts; as such, having a knob to disable it can be beneficial both to help validate the perf/size wins from the feature and to allow users to work around any issues that might be found.
Force-pushed from 65da3bf to 40ffaa3 (compare)
I skimmed through the changes in hwintrinsiclistxarch.h and instrsxarch.h and they looked OK to me. If there are any specific changes, other than adding HW_Flag_InvalidNodeId or INS_Flags_EmbeddedBroadcastSupported, let me know.
Overall the changes look good. It seems that in multiple places, giving insOpts a default value would save us from passing INS_OPTS_NONE around.

Waiting for superpmi-diff results.
src/tests/JIT/HardwareIntrinsics/X86/Shared/_BinaryOpTestTemplate.template
I opted to have it default for the
I didn't quite understand the changes made in f83162d; can you elaborate? Other than that, things look good.
For SSE-SSE41 there isn't actually an instruction to do a floating-point greater-than comparison, so it is emitted as a less-than comparison with the operands swapped. With AVX512 and the ability to do embedded masking, we want to emit the actual comparison being performed. So the change just ensured that we stopped lying about the operation being done when the operands were swapped.
LGTM
linux x64

Diffs are based on 2,304,731 contexts (997,292 MinOpts, 1,307,439 FullOpts). MISSED contexts: 7 (0.00%)
Overall (-21,300 bytes)
MinOpts (-8,946 bytes)
FullOpts (-12,354 bytes)

windows x64

Diffs are based on 2,615,190 contexts (1,040,939 MinOpts, 1,574,251 FullOpts). MISSED contexts: 4 (0.00%)
Overall (-23,037 bytes)
MinOpts (-9,043 bytes)
FullOpts (-13,994 bytes)
Diffs generally look similar to the following:
In the optimal case, as seen in some of the tests, we can convert something like:

    Vector512.ConditionalSelect(mask, x + Vector512.Create(cns), Vector512<T>.Zero)

into:

    vaddps zmm0 {k1}{z}, zmm0, dword ptr [rax] {1to16}

The few regressions that do show up tend to be from using the EVEX encoding, but that's to be expected as we're using larger instructions that have lower cost. There are some longer-term improvements that could still be done around containment and commutativity (such as inverting the mask or specially handling some types of blending), but those are longer-term goals.
Some other prominent diffs look like:

- vcmppd xmm4, xmm0, xmm0, 0
- vcmppd xmm5, xmm3, xmm3, 0
- vcmppd xmm6, xmm1, xmm2, 0
- vxorps xmm7, xmm7, xmm7
- vpcmpgtq xmm7, xmm7, xmm0
- vpternlogq xmm7, xmm0, xmm3, -54
- vcmppd xmm1, xmm2, xmm1, 1
- vpternlogq xmm1, xmm0, xmm3, -54
- vpternlogq xmm6, xmm7, xmm1, -54
- vpternlogq xmm5, xmm6, xmm3, -54
- vpternlogq xmm4, xmm5, xmm0, -54
- vmovups xmmword ptr [rcx], xmm4
+ vcmppd k1, xmm0, xmm0, 0
+ vcmppd k2, xmm3, xmm3, 0
+ vcmppd k3, xmm1, xmm2, 0
+ vxorps xmm4, xmm4, xmm4
+ vpcmpgtq k4, xmm4, xmm0
+ vblendmpd xmm4 {k4}, xmm3, xmm0
+ vcmppd k4, xmm2, xmm1, 1
+ vblendmpd xmm1 {k4}, xmm3, xmm0
+ vblendmpd xmm1 {k3}, xmm1, xmm4
+ vblendmpd xmm1 {k2}, xmm3, xmm1
+ vblendmpd xmm0 {k1}, xmm0, xmm1
+ vmovups xmmword ptr [rcx], xmm0

and:

- vpcmpgtd ymm1, ymm1, ymm0
- vxorps ymm2, ymm2, ymm2
- vpsubd ymm2, ymm2, ymm0
- vpternlogd ymm1, ymm2, ymm0, -54
- vxorps ymm2, ymm2, ymm2
- vpcmpgtd ymm1, ymm2, ymm1
+ vpcmpgtd k1, ymm1, ymm0
+ vmovaps ymm2, ymm0
+ vpsubd ymm2 {k1}, ymm1, ymm0
+ vpcmpgtd ymm1, ymm1, ymm2

The TP regression peaks at around
CC @fanyang-mono: it seems there's a Mono LLVMAOT failure in the form of:

I'm guessing this has something to do with the Mono V128 acceleration for x64, but it's not clear what in the tests would be causing it. It's only failing for
I've logged #102037 to track the general issue.
* Remove HW_Flag_MultiIns in favor of using HW_Flag_SpecialCodeGen
* Add a new flag HW_Flag_InvalidNodeId
* Change HW_Flag_EmbMaskingIncompatible to be HW_Flag_EmbMaskingCompatible
* Mark various compare intrinsics with HW_Flag_NoEvexSemantics
* Marking various intrinsics as EmbBroadcastCompatible, EmbMaskingCompatible, or Commutative
* Applying formatting patch
* Ensure WithLower/WithUpper are not marked as InvalidNodeId
* Ensure that instOptions are being passed down all relevant hwintrinsic code paths
* Ensure the insOpts are plumbed through for EVEX instructions
* Ensure EVEX instructions are properly annotated with EmbeddedBroadcastSupported
* Ensure that embedded broadcast/masking is displayed in the disassembly
* Applying formatting patch
* Updating the hwintrinsic tests to cover embedded broadcast/masking
* Fix some handling in the JIT related to embedded broadcast/masking
* Fixup some tests where validating embedded masking is non-trivial
* Cleanup some cases found by SPMI
* Ensure that CompareLessThan has its operands swapped back if its being converted to the AVX512 form
* Don't regress a scenario around op_Equality and TYP_MASK
* Adjusting hardware intrinsic tests to test non-zero masks
* Avoid some messiness around operand swapping
* Ensure embedded masks mark TYP_SIMD16 and TYP_SIMD32 instructions as needing EVEX
* Mark Sse2_r/Sse2_ro as AotIncompatible due to runtime/102037
This is a continuation of #97675 and almost finishes out #87097
In particular, it enables the embedded masking support for all intrinsics except for the various load, store, move, and broadcast intrinsics that explicitly deal with memory operations.
As part of this, the PR explicitly marks intrinsics which should never appear as the intrinsicId of a node, to help ensure the relevant intrinsics are being properly handled.