[wasm] Implement I2 and I4 shuffles in the jiterpreter #86469

kg · 2023-05-18T22:32:12Z

Enabling v128 and packedsimd support causes code that relies on i2/i4 shuffles to run. The scalar fallback implementation of these shuffles is extremely slow, which makes algorithms that depend on them much slower, at least according to browser-bench.

This PR adds implementations for those shuffles: it takes the short- or int-sized shuffle vector, narrows it to bytes, replicates the narrowed vector across all the lanes, and adds bits to produce an equivalent byte shuffle vector. It then uses the wasm byte swizzle intrinsic to emulate the desired shuffle operation.
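The index conversion can be modeled in scalar code. This is only a minimal TypeScript sketch of the idea, not the wasm the jiterpreter emits: the helper names (`swizzleBytes`, `expandLaneIndices`, `shuffleLanes`) are hypothetical, lane indices are assumed to be in range, and the multiply-and-add stands in for the narrow, replicate, and add-bits steps.

```typescript
// Scalar model of a wasm-style byte swizzle: each selector byte picks one
// byte of `data`; selectors of 16 or more produce 0.
function swizzleBytes(data: Uint8Array, selectors: Uint8Array): Uint8Array {
    const result = new Uint8Array(16);
    for (let i = 0; i < 16; i++) {
        const s = selectors[i];
        result[i] = s < 16 ? data[s] : 0;
    }
    return result;
}

// Hypothetical helper: turn per-lane indices (8 lanes of 2 bytes for I2,
// 4 lanes of 4 bytes for I4) into 16 byte selectors.
// Assumes every lane index is in range for the given lane size.
function expandLaneIndices(laneIndices: number[], laneSize: 2 | 4): Uint8Array {
    const selectors = new Uint8Array(16);
    const laneCount = 16 / laneSize;
    for (let lane = 0; lane < laneCount; lane++) {
        for (let b = 0; b < laneSize; b++) {
            // Scale the lane index to a byte offset, then add the byte's
            // position within the lane.
            selectors[lane * laneSize + b] = laneIndices[lane] * laneSize + b;
        }
    }
    return selectors;
}

// Emulate an I2 or I4 shuffle using only a byte swizzle.
function shuffleLanes(data: Uint8Array, laneIndices: number[], laneSize: 2 | 4): Uint8Array {
    return swizzleBytes(data, expandLaneIndices(laneIndices, laneSize));
}
```

For example, `shuffleLanes(bytes, [7, 6, 5, 4, 3, 2, 1, 0], 2)` reverses eight 16-bit chars stored in `bytes`, which is the shape of shuffle a char reversal would need.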

In my testing this speeds up 'Span, Reverse chars' from browser-bench considerably (~0.12ms -> 0.04ms) but does not seem to speed up IndexOf or SequenceEqual for chars. I'm not sure why, but one guess is that the cost of converting the shuffle vectors on every operation is too high. A future improvement would be to detect constant shuffle vectors and perform the full expansion at JIT time, which might close the gap. You can see an example of how a constant shuffle vector still generates the full emulation logic here:

Since we know the vector at offset 160 is constant, we can safely remove all of that work at some point. I'm not sure whether it makes sense to try to do this in the interp at transform time; it's probably better to do it in jiterp.
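To illustrate why a constant shuffle vector makes that work removable, here is a hedged TypeScript sketch, again a scalar model rather than jiterpreter code. `compileConstantShuffle` and `ByteShuffle` are hypothetical names, and lane indices are assumed to be in range. The expansion to byte selectors runs once, when the constant is seen, and the returned function performs only the byte swizzle.

```typescript
type ByteShuffle = (data: Uint8Array) => Uint8Array;

// Hypothetical: invoked once, when the shuffle's index vector is known to be constant.
function compileConstantShuffle(laneIndices: readonly number[], laneSize: 2 | 4): ByteShuffle {
    // The lane-index to byte-selector expansion happens here, a single time.
    const selectors = new Uint8Array(16);
    laneIndices.forEach((laneIndex, lane) => {
        for (let b = 0; b < laneSize; b++) {
            selectors[lane * laneSize + b] = laneIndex * laneSize + b;
        }
    });
    // Each later call does only the swizzle (selectors of 16 or more give 0).
    return (data: Uint8Array): Uint8Array => {
        const result = new Uint8Array(16);
        for (let i = 0; i < 16; i++) {
            const s = selectors[i];
            result[i] = s < 16 ? data[s] : 0;
        }
        return result;
    };
}
```

In this model the per-call cost no longer includes any index conversion, which is the saving the constant case is hoped to deliver.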

…CL char operations won't be terribly slow

ghost · 2023-05-18T22:32:22Z

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Author:	kg
Assignees:	-
Labels:	`arch-wasm`, `area-Codegen-Jiterpreter-mono`
Milestone:	-

kg · 2023-05-18T22:34:13Z

Incidentally, the code AOT generates for this (and the code clang generates by default using its most similar three-operand shuffle intrinsic) does a bunch of extract and replace lane operations instead. I don't really understand why that's the chosen approach to emulate shuffles; it seems like it would be much more expensive, and the generated code is enormous. Do you have any idea why, @radekdoulik?

kg added 2 commits May 18, 2023 15:28

Implement a wasm lowering for V128_I2_SHUFFLE in the jiterpreter so B…

bc8ca49

…CL char operations won't be terribly slow

Generalize shuffles and add I4 support

f2f24e5

kg added the arch-wasm (WebAssembly architecture) and area-Codegen-Jiterpreter-mono labels May 18, 2023

kg requested review from vargaz, radekdoulik and kotlarmilos May 18, 2023 22:32

kg requested review from lewing and pavelsavara as code owners May 18, 2023 22:32

ghost assigned kg May 18, 2023

vargaz approved these changes May 18, 2023

kg mentioned this pull request May 18, 2023

[wasm] Optimize constant i2/i4 shuffles in jiterpreter #86470

Merged

kg merged commit 40ba49a into dotnet:main May 19, 2023

ghost locked as resolved and limited conversation to collaborators Jun 18, 2023
