[mono][interp] Add intrinsics for common Vector128 operations #81782
Conversation
Tagging subscribers to this area: @BrzVlad, @kotlarmilos
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
Don't have a full grip on this PR yet to the point that I could checkmark it, but it seems promising so far. I have a basic jiterpreter implementation for it such that there are no trace aborts due to SIMD opcodes.
This is a lot of boilerplate that is very similar to what we do for the JIT/AOT. Is there some way we can share more between the execution engines? There's a ton of these SIMD opcodes; it would be better if we could generate the code for each execution engine from a single abstract template. Maybe a Python script or something... /cc @jandupej
RyuJIT handles this via a single header file per architecture (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistxarch.h) and a general header that defines the enums, basic structs, and basic helper APIs (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.h). We have an entry per intrinsic that tracks most of the table-driven information, plus a few flags to cover anything that needs special handling or lookups (which happens through well-defined helper methods, etc.). I imagine a very similar setup would work for Mono. Being C only somewhat restricts how much of this can be shared between RyuJIT and Mono, but that might be fine overall.
I agree that would be beneficial. It will take some tuning to get this tabular approach right and applicable to mini, LLVM and the interpreter, so my plan is to start small and do it with a single category of operations first (e.g. comparisons). When the final form is ironed out, we will convert the existing code to the new form.
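To make the tabular idea a bit more concrete, one common C pattern for a single shared definition is an X-macro style table that each execution engine expands with its own `SIMD_INTRINS` definition. This is only a sketch; the macro name, entries and signature tags below are hypothetical, not the actual `simd-methods.def` or `hwintrinsiclistxarch.h` contents.

```c
/* Illustrative only: a shared X-macro table of intrinsic rows. Each engine
 * (interp, mini, LLVM, ...) defines SIMD_INTRINS before expanding it. */
#define SIMD_INTRINS_TABLE \
	SIMD_INTRINS (V128_NOT,    "op_OnesComplement", P_P)  \
	SIMD_INTRINS (V128_ADD,    "op_Addition",       P_PP) \
	SIMD_INTRINS (V128_EQUALS, "Equals",            P_PP)

/* Interpreter expansion: an enum of intrinsic ids... */
#define SIMD_INTRINS(id, name, sig) id,
enum simd_intrins_id { SIMD_INTRINS_TABLE SIMD_INTRINS_COUNT };
#undef SIMD_INTRINS

/* ...and a lookup table of managed method names built from the same rows. */
#define SIMD_INTRINS(id, name, sig) name,
static const char *simd_intrins_names [] = { SIMD_INTRINS_TABLE };
#undef SIMD_INTRINS
```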
Shelving the jiterpreter bits for now, but they're at kg@a38fa17.
There are no issues that I'm aware of with this PR. It needs a final approval so we can get it in and get more perf numbers.
We add intrinsics for the Vector128 operations that are actively used within our BCL. We declare a set of SIMD method names, the same way we do it with the JIT, in `simd-methods.def`. In `transform-simd.c` we look up method names in the list of supported intrinsics for `Vector128` and `Vector128<T>`. Once we find a supported intrinsic, we generate code for it, typically a `MINT_SIMD_INTRINS_*` opcode. In order to avoid adding too many new opcodes to the interpreter, SIMD intrinsics are grouped by signature, so all SIMD intrinsics that receive a single argument and return a value are called through `MINT_SIMD_INTRINS_P_P`. This instruction receives an index used to look up the intrinsic implementation and calls it indirectly.

Some of the intrinsics are implemented using the standard vector intrinsics supported by gcc and clang. These do not fully expose the SIMD capabilities, so some intrinsics are implemented naively. This should still be faster than a non-vectorized approach from managed code. In the future we can add better implementations on platforms where we have low-level support. This would both be faster and reduce code size.
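A rough sketch of the signature-grouping scheme described above, assuming hypothetical table and function names (this is not the actual `transform-simd.c` or interpreter source):

```c
/* Illustrative sketch only -- the table and function names are hypothetical. */
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b [16]; } v128_t;

/* All intrinsics with the "one vector in, one vector out" shape share this
 * signature, so they can share a single interpreter opcode. */
typedef void (*simd_p_p_fn) (v128_t *res, const v128_t *arg);

static void
interp_v128_not (v128_t *res, const v128_t *arg)
{
	for (int i = 0; i < 16; i++)
		res->b [i] = (uint8_t)~arg->b [i];
}

/* Table indexed by the id that the transform phase embedded in the
 * instruction stream. */
static const simd_p_p_fn simd_p_p_table [] = {
	interp_v128_not,
	/* ... more one-argument intrinsics ... */
};

/* What a MINT_SIMD_INTRINS_P_P-style handler conceptually does: fetch the
 * index operand and call the implementation indirectly. */
static void
exec_simd_intrins_p_p (int index, v128_t *res, const v128_t *arg)
{
	simd_p_p_table [index] (res, arg);
}

int
main (void)
{
	v128_t a, r;
	memset (&a, 0xF0, sizeof (a));
	exec_simd_intrins_p_p (0, &r, &a); /* index 0 -> "not" */
	return r.b [0] == 0x0F ? 0 : 1;
}
```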
These intrinsics are not yet implemented in the jiterpreter, which makes it slightly slower there instead.
v128_create receives every single element of the vector as an argument. This method is typically used with constants. For a Vector128<short>, this means that creating a constant vector requires 8 ldc.i4 instructions and a v128_create. We can instead use a single instruction and embed the vector value directly in the code stream.
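A minimal sketch of that idea, with a hypothetical instruction layout (not the real `MINT_*` encoding): the transform phase copies the already-known constant into the instruction stream, and execution is a single 16-byte copy.

```c
/* Hypothetical sketch of embedding a constant Vector128 in the code stream. */
#include <stdint.h>
#include <string.h>

typedef struct {
	uint16_t opcode;
	uint16_t dreg;    /* offset of the destination local */
	uint8_t imm [16]; /* the constant vector, stored inline */
} simd_ldc_ins_t;

static void
exec_simd_ldc (const simd_ldc_ins_t *ins, uint8_t *locals)
{
	/* One instruction replaces 8 ldc.i4 pushes plus a v128_create call:
	 * the 16 bytes of the constant are simply copied out of the stream. */
	memcpy (locals + ins->dreg, ins->imm, 16);
}
```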
It is actually not used in the BCL, it is not really vectorized on any platform, and the codegen for the interp implementation is massive and inefficient.
- I think we should disable WASI
- Why can't we use `__builtin_shuffle(vec1, mask)`?
@@ -101,6 +102,10 @@ typedef enum {
#define PROFILE_INTERP 0
#if !HOST_BROWSER && __GNUC__
#if !HOST_BROWSER && __GNUC__
#if !HOST_BROWSER && !HOST_WASI && __GNUC__
The real reason for disabling this on browser is not that this PR has problems with wasm, but that it leads to jiterpreter trace aborts, since support is not yet added there. On WASI there is no jiterpreter, so this PR should still have perf gains.
// Shuffle
#define V128_SHUFFLE(eltype, itype) do { \
Curious why you have to do this by hand and not with `__builtin_shuffle`?
I simply didn't want to make use of any additional builtins for now, to keep it as simple as possible and reduce potential compatibility issues. I will start adding various builtins in future PRs, probably even arch-specific ones. I think `__builtin_shufflevector` has better support and is very powerful, so it might be a better candidate.
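For context, a simplified sketch of the approaches being contrasted here (this is not the actual `V128_SHUFFLE` macro from the PR): a hand-rolled per-element shuffle works with a runtime index vector in plain C, GCC's `__builtin_shuffle` takes a runtime mask vector directly, and clang's `__builtin_shufflevector` needs compile-time constant indices.

```c
/* Sketch only -- simplified stand-ins for the approaches discussed above. */
#include <stdint.h>

typedef int32_t v128_i4 __attribute__ ((vector_size (16)));

/* Hand-rolled shuffle with a runtime index vector: plain C, no builtins. */
static v128_i4
shuffle_by_hand (v128_i4 src, v128_i4 indexes)
{
	v128_i4 res = { 0 };
	for (int i = 0; i < 4; i++)
		res [i] = src [indexes [i] & 3];
	return res;
}

#if defined(__GNUC__) && !defined(__clang__)
/* GCC's __builtin_shuffle accepts the runtime mask vector directly. */
static v128_i4
shuffle_builtin (v128_i4 src, v128_i4 indexes)
{
	return __builtin_shuffle (src, indexes);
}
#endif

#ifdef __clang__
/* clang's __builtin_shufflevector needs constant indices, so it only covers
 * shuffles whose pattern is known at compile time (here: reverse the lanes). */
static v128_i4
reverse_lanes (v128_i4 src)
{
	return __builtin_shufflevector (src, src, 3, 2, 1, 0);
}
#endif
```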
Vectorized algorithms that I've tested on the interpreter tend to be about 1.5x - 2x faster than their non-vectorized versions. On wasm, the jiterpreter handles non-vectorized algorithms pretty well (since they are typically a set of simple instructions that all get compiled to a single trace), so this PR doesn't produce a clear perf improvement there. Wasm should get an additional speedup once SIMD support is added to the jiterpreter.