Optimize Vector128 and Vector256.Create methods #35857
Conversation
Diffs are similar to:
- mov ecx, 128
- vmovd xmm0, ecx
- vpbroadcastw xmm0, xmm0
+ vmovupd xmm0, xmmword ptr [reloc @RWD00]
...
+ RWD00 db 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h
CC @CarolEidt, @echesakovMSFT
As a separate issue, I noticed for the various intrinsics that take a …
In the case where the instruction can take a …
I think the cast should be handled as contained along with its operand.
Overall LGTM, but it's a lot of code and could use some more textual comments describing what's being done, and it would be helpful to give meaningful name to the temps (the t*s) in the pseudo-IR to make it easier to follow.
// TODO-XARCH-CQ: It may be beneficial to emit the movq
// instruction, which takes a 64-bit memory address and
// works on 32-bit x86 systems.
break;
Is there a reason you chose not to do this?
It's not something we are handling today anywhere else, and we have an identical TODO in the other places.
I believe this would just require us creating a GT_IND node, since x86 requires it be a long* or ulong*, but I'm not positive if that is the ideal/correct way to handle it.
Just noting, this is also not something we are handling in the managed implementation today.
Right, I gathered that, but was just curious, given the comment.
I'm not super familiar with the differences between the different addressing forms in the JIT and when each is appropriate to use.
Do we have a helper function that will create a T* or ref T from an arbitrary T (in this case a ulong* from a ulong) when the given operand on the stack could be a constant, local, field, byref, or other indirection?
IIUC you really want a GT_ADDR, to get the address of an operand. In the JIT, addresses are not strongly typed; they are always TYP_BYREF, so it would just be something like:
GenTree* addr = gtNewOperNode(GT_ADDR, TYP_BYREF, op);
Though I'd note that I'm not really pressing for you to do this now - I was just curious why you added the comment rather than doing it.
I'll attempt this in a follow up PR, hopefully it is that straightforward 😄
src/coreclr/src/jit/lowerxarch.cpp
// /--* t? simd32
// +--* t? simd16
// +--* t? int
// t0 = * HWINTRINSIC simd32 T InsertVector128
This comment doesn't match what's generated below (e.g. you're creating a constant of 0x01). Also, it would be easier to follow if you referred to 'op1', 'op2' and 'idx' here (and maybe add 'v' for the target of NI_Vector128_Create and 'result' for the final value), instead of using 't2' and 't?'.
I would also do something similar for the comments below.
Looks like I forgot to fill in the generated LIR for this one. I'm updating it and the other comments to be closer to the following:
// We will be constructing the following parts:
// /--* op1 T
// tmp = * HWINTRINSIC simd16 T Create
// /--* tmp simd16
// * STORE_LCL_VAR simd16
// tmp = LCL_VAR simd16
// dup = LCL_VAR simd16
// /--* dup simd16
// dup = * HWINTRINSIC simd16 T ToVector256Unsafe
// idx = CNS_INT int 1
// /--* dup simd32
// +--* tmp simd16
// +--* idx int
// node = * HWINTRINSIC simd32 T InsertVector128
// This is roughly the following managed code:
// var tmp = Vector128.Create(op1);
// var dup = tmp.ToVector256Unsafe();
// return Avx.InsertVector128(dup, tmp, 1);
src/coreclr/src/jit/lowerxarch.cpp
}
else
{
    tmp = op2->AsArgList()->Current();
I'm quite confused by this, and I suspect I'm missing something. You've handled the 4-argument case above, and this handles 2. Where do you handle 8 arguments?
This one actually handles everything that is > 4, and I can add an assert + comment to clarify.
For Vector256 (which this code path is restricted to) we will have 4, 8, 16, or 32 operands, so in all cases we have a GT_LIST.
Both paths construct the upper and lower Vector128 portions and then combine them. The for loop here ensures that op1 points to the first operand and op2 points to the 3rd, 5th, 9th, or 17th operand.
The first path needs to exist since it will be 2 operands per half, and so we need to track them in gtOp1/gtOp2 rather than in a GT_LIST, while the second path can just use the original list, split into two halves, with the 2nd, 4th, 8th, or 16th operand no longer having a gtOp2 (which normally points to the next operand in the list).
Ah yes, I see now that op1 and op2 are still lists, and the two NI_Vector128_Create intrinsics are then recursively lowered. I might ask that you create a new GenTree* variable for those rather than reusing op1. I know that the JIT reuses variables all over the place, but I don't think it's a great practice.
Added a comment explaining the split and what values are expected where, using new variables named lo and hi to track the lower and upper halves of the Vector256 being created.
I believe I addressed all the feedback given, and everything passes for the various ISAs locally, so I've kicked off the …
Resolved an issue that showed up with ARM where the …
ARM64 failure is because …
Edit: To clarify, we don't take recursion into account when some overload of the method is intrinsic. So if you have …
I've updated the calls to not be recursive on ARM64 for right now. I'm working on getting the same support done for ARM64, but it will be a separate PR to help keep the size of this one smaller.
Edit: We also need to expose a couple of additional intrinsics before I can implement the …
LGTM - thanks for the additional comments; they may seem verbose, but they really help with understanding what's going on!
Looks like there is some …
Seems superpmi ignores all …
@CarolEidt, I've updated …
e2dbef8 to 7b3b272
7b3b272 to c056b88
LGTM - thanks
This resolves #11965 and #10033 by updating the Vector128/256.Create methods to be intrinsic on x86.
They are handled entirely in lowering, where they will be replaced with the corresponding constant (as was done for GT_SIMD nodes) or lowered to the correct sequence of HWIntrinsics (which allows containment and other checks to "just work"). This should make it rather trivial to support "partial constants" as well (that is, a vector where, say, 50% of the inputs are constant and the other half are not).
It might be beneficial to eventually create a proper GenTreeVecCns node and to also try and handle this earlier (which would allow constants to be deduplicated and other features), but that is a more involved change.
jit-analyze reports (AVX2):
jit-analyze reports (SSE2):
The regression for SSE2 is because we are now inlining a Vector128.Create(long) call where we were not previously.