[wasm] Implement I2 and I4 shuffles in the jiterpreter #86469

Merged: 2 commits merged into dotnet:main on May 19, 2023

kg (Member) commented May 18, 2023

Enabling v128 and packedsimd support causes code that relies on i2/i4 shuffles to run. The scalar fallback implementation of those shuffles is extremely slow, which makes algorithms that depend on them much slower, at least according to browser-bench.

This PR adds implementations for those shuffles by taking the short- and int-sized shuffle vectors, narrowing them to bytes, then replicating the narrowed vector across all the lanes and adding bits to produce an equivalent byte shuffle vector. It then uses the wasm swizzle bytes intrinsic to emulate the desired shuffle operation.
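The index conversion described above can be modeled in plain Python as a rough scalar sketch. This is not the code from this PR: the helper names (`expand_indices`, `swizzle_bytes`, `shuffle_wide`) are hypothetical, vectors are modeled as 16-byte `bytes` values, and out-of-range element indices are assumed to select zero, which is how a wasm byte swizzle treats byte indices of 16 or more.

```python
from typing import List, Sequence

VECTOR_BYTES = 16  # a v128 holds 16 bytes


def swizzle_bytes(data: bytes, byte_indices: Sequence[int]) -> bytes:
    """Scalar model of a byte swizzle: indices outside 0..15 select 0."""
    return bytes(data[i] if 0 <= i < VECTOR_BYTES else 0 for i in byte_indices)


def expand_indices(indices: Sequence[int], elem_size: int) -> List[int]:
    """Turn element indices (elem_size = 2 or 4 bytes) into byte indices.

    Element k occupies bytes k*elem_size .. k*elem_size + elem_size - 1.
    Assumption: an out-of-range element index becomes 0xFF so that the
    swizzle writes zero for every byte of that element.
    """
    lanes = VECTOR_BYTES // elem_size
    byte_indices = []
    for k in indices:
        for j in range(elem_size):
            byte_indices.append(k * elem_size + j if 0 <= k < lanes else 0xFF)
    return byte_indices


def shuffle_wide(data: bytes, indices: Sequence[int], elem_size: int) -> bytes:
    """Emulate a 2- or 4-byte element shuffle with a single byte swizzle."""
    return swizzle_bytes(data, expand_indices(indices, elem_size))
```

For example, `expand_indices([7, 6, 5, 4, 3, 2, 1, 0], 2)` yields `[14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1]`, which reverses eight 2-byte elements while keeping the two bytes of each element in order.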

In my testing this speeds up 'Span, Reverse chars' from browser-bench considerably (~0.12ms -> 0.04ms) but does not seem to speed up IndexOf or SequenceEqual for chars. I'm not sure why, but one guess is that the cost of converting the shuffle vectors on every operation is too significant. A future improvement would be to detect constant shuffle vectors and perform the full expansion at JIT time, which might close the gap. You can see an example of how a constant shuffle vector still generates the full emulation logic here:
[image: generated code for a constant shuffle vector]
Since we know the vector at offset 160 is constant, we can safely remove all of that work at some point. I'm not sure whether it makes sense to do this in the interp at transform time; it's probably better to do it in jiterp.
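The constant-vector idea can be illustrated with a small Python model. It is only a sketch under stated assumptions, not the jiterpreter's code: `compile_shuffle` is a hypothetical stand-in for the point where a trace is compiled, vectors are 16-byte `bytes` values, and out-of-range element indices are assumed to select zero. When the indices are known at that point, the byte shuffle vector is built once, and each execution only pays for the swizzle.

```python
from typing import Callable, List, Optional, Sequence

VECTOR_BYTES = 16  # a v128 holds 16 bytes


def swizzle_bytes(data: bytes, byte_indices: Sequence[int]) -> bytes:
    """Scalar model of a byte swizzle: indices outside 0..15 select 0."""
    return bytes(data[i] if 0 <= i < VECTOR_BYTES else 0 for i in byte_indices)


def expand_indices(indices: Sequence[int], elem_size: int) -> List[int]:
    """Turn element indices into byte indices (0xFF for out-of-range ones)."""
    lanes = VECTOR_BYTES // elem_size
    return [
        k * elem_size + j if 0 <= k < lanes else 0xFF
        for k in indices
        for j in range(elem_size)
    ]


def compile_shuffle(
    elem_size: int, constant_indices: Optional[Sequence[int]] = None
) -> Callable[..., bytes]:
    """Return a shuffle routine, specialized when the indices are constant."""
    if constant_indices is not None:
        # Expansion happens once, at "compile time".
        byte_indices = expand_indices(constant_indices, elem_size)

        def run_constant(data: bytes) -> bytes:
            return swizzle_bytes(data, byte_indices)

        return run_constant

    def run_dynamic(data: bytes, indices: Sequence[int]) -> bytes:
        # Expansion is repeated on every call.
        return swizzle_bytes(data, expand_indices(indices, elem_size))

    return run_dynamic
```

Both routines return the same result for the same indices. The specialized one simply moves the index conversion out of the per-operation path, which is the cost guessed at above.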

kg added the arch-wasm (WebAssembly architecture) and area-Codegen-Jiterpreter-mono labels on May 18, 2023
kg requested review from vargaz, radekdoulik and kotlarmilos on May 18, 2023 22:32
kg requested review from lewing and pavelsavara as code owners on May 18, 2023 22:32
ghost assigned kg on May 18, 2023
ghost commented May 18, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

kg (Member, Author) commented May 18, 2023

Incidentally, the code AOT generates for this (and the code clang generates by default using its most similar three-operand shuffle intrinsic) does a bunch of extract and replace lane operations instead. I don't really understand why that's the chosen approach to emulate shuffles; it seems like it would be much more expensive, and the generated code is enormous. Do you have any idea why, @radekdoulik?
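For comparison, the extract/replace lane pattern mentioned above can be modeled in scalar Python. This is a hypothetical illustration, not the code AOT or clang actually emits: `shuffle_by_lanes` is an invented name, vectors are 16-byte `bytes` values, and out-of-range indices are assumed to leave the destination lane zeroed.

```python
from typing import Sequence

VECTOR_BYTES = 16  # a v128 holds 16 bytes


def shuffle_by_lanes(data: bytes, indices: Sequence[int], elem_size: int) -> bytes:
    """Shuffle one element at a time: one extract plus one replace per lane."""
    # Split the vector into lanes of elem_size bytes each.
    lanes = [data[i:i + elem_size] for i in range(0, VECTOR_BYTES, elem_size)]
    # Start from an all-zero result vector.
    result = [bytes(elem_size)] * len(lanes)
    for dst, src in enumerate(indices):
        if 0 <= src < len(lanes):
            value = lanes[src]    # models an extract_lane operation
            result[dst] = value   # models a replace_lane operation
    return b"".join(result)
```

In this model an 8-lane shuffle needs eight extract/replace pairs and a 4-lane shuffle needs four, whereas the swizzle-based approach needs one swizzle once the byte indices are available.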

kg merged commit 40ba49a into dotnet:main on May 19, 2023
ghost locked as resolved and limited conversation to collaborators on Jun 18, 2023