8bit*8bit 4-D dot-product accumulating to 32bit, similar to ARM SDOT and x86 VNNI #9
Comments
I'm after 8-bit signed GEMM too for my project @browsermt, ultimately for quantized neural networks, which appears to be @bjacob's use case as well. As you mentioned in WebAssembly/simd#224, GEMM routines want to use all the registers with a larger tile to minimize memory bandwidth requirements, which implies not only the ability to query the register count but some knowledge of how many extra registers the implementation requires. For example, a pre-VNNI x86 machine typically needs an extra register of 1s for the multiply-add sequence. The instruction sets are slightly different. ARM has signed * signed in SDOT with wider support. x86 has unsigned * signed only, which is annoying and requires arithmetic hacks to make one of the arguments unsigned (by adding 128 and subtracting the extra contribution back out in a bias term). We could target USDOT/SUDOT on ARM, but then you get the unnecessary extra arithmetic and require a more recent processor just to be compatible with Intel. Another option is to emulate signed * signed on x86 via sign-bit manipulation. In practice a GEMM will want to use the longest register length it can get away with. So while WebAssembly should have an 8-bit dot-product instruction, I wonder if browsers should just support more general matrix multiplication. They already have limited support in DOMMatrix and WebGL. Paging @mlopatka @XapaJIaMnu
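A minimal scalar sketch of that add-128 rewrite (my illustration, not code from the thread; the function name is hypothetical):

```c
#include <stdint.h>

/* Emulate signed*signed dot products on hardware that offers only
 * unsigned*signed: bias a by +128 so it becomes unsigned, then subtract the
 * extra 128 * sum(b). In a real GEMM the correction folds into the
 * per-column bias term, so it costs nothing per element. */
int32_t dot_s8s8_via_u8s8(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0, sum_b = 0;
    for (int k = 0; k < n; k++) {
        uint8_t au = (uint8_t)(a[k] + 128); /* biased operand, now in [0, 255] */
        acc   += (int32_t)au * b[k];        /* what unsigned*signed hardware computes */
        sum_b += b[k];                      /* sum of the signed operand */
    }
    return acc - 128 * sum_b;               /* sum((a+128)*b) - 128*sum(b) = sum(a*b) */
}
```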
Perhaps @lars-t-hansen can provide some thoughts on whether supporting such operations is interesting/feasible from Mozilla's perspective given our current implementation of WASM and the roadmap for this year.
The WebNN people are proposing to add GEMM to the browser, including 8-bit: https://webmachinelearning.github.io/webnn/#api-neuralnetworkcontext-gemm
We had taken a look at WebAsm SIMD for NN inference here. The relevance to the present issue is that there are multiple other issues preventing the WebAsm SIMD proposal from approaching native performance levels, so it is difficult to advocate a big investment in the present feature, which is about recent ISA extensions, until those other issues are resolved. The deepest is WebAssembly/simd#225, a general issue about the entire 'intrinsics' programming model, for which native code has a toolbox of work-arounds and mitigations that are not practical in WebAsm SIMD. As resolving WebAssembly/simd#225 seems to be out of immediate scope for WebAsm SIMD, I would like to see alternatives to WebAsm SIMD emerge where tackling this hard issue would be part of the initial design. There, adding new instructions to follow recent ISA extensions, as discussed in the present issue, would be far easier to justify. Adding @Maratyszcza.
Matching AVX-VNNI most likely would not be feasible for this proposal unless it can be efficiently emulated in SSE; there is no intended AVX support here at all. There is a flexible vectors proposal, which is planned to support AVX.
The operation can be emulated with SSSE3 and even SSE2 if necessary (but I think WebAssembly already assumes SSSE3). Usually, 8-bit GEMM is implemented on pre-VNNI Intel as pmaddubsw (unsigned * signed, adding adjacent pairs into 16-bit with saturation), then pmaddwd against a vector of 1s (adding adjacent pairs into 32-bit), then paddd into the accumulator, because pmaddubsw saturates: the sum of two unsigned * signed products doesn't quite fit in int16. Since we're one bit short, another strategy is sign-bit manipulation before calling pmaddubsw. It's also possible to emulate in SSE2 by widening to 16-bit and then doing pmaddwd.
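As a concrete illustration, here is a minimal sketch of that sequence with SSE intrinsics (my sketch, not code from the thread; note that the vector of 1s ties up a register, which is the extra-register point made above):

```c
#include <emmintrin.h>   /* SSE2: _mm_madd_epi16, _mm_add_epi32, _mm_set1_epi16 */
#include <tmmintrin.h>   /* SSSE3: _mm_maddubs_epi16 */

/* One step of a pre-VNNI 8-bit dot-product kernel:
 * a holds unsigned 8-bit lanes, b holds signed 8-bit lanes. */
static inline __m128i dot4_u8s8_acc32(__m128i acc, __m128i a, __m128i b) {
    /* u8 * s8, adding adjacent pairs into signed 16-bit (saturating) */
    __m128i pairs = _mm_maddubs_epi16(a, b);
    /* multiply by 1 and add adjacent pairs into signed 32-bit */
    __m128i quads = _mm_madd_epi16(pairs, _mm_set1_epi16(1));
    /* accumulate */
    return _mm_add_epi32(acc, quads);
}
```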
We should probably look into expected instruction sequences within the ISA limits that the proposal has, though I can't promise this is going to make it into the MVP.
@kpu, citation for this? (SpiderMonkey currently will disable Wasm SIMD for < SSE4.1 and I think V8 does the same, but the spec has so far allowed scalarization of every operation and I'm not aware of any assumption of available technology short of IEEE math.)
@lars-t-hansen Sorry, what I meant is that WebAssembly already has instructions that map to SSSE3 and later on Intel, which is the highest SSE level those instructions require. In particular, I am stressing to @penzn that IBM z/Architecture has signed * signed and unsigned * unsigned 8-bit multiply instructions with addition into a 16-bit accumulator. They don't do a horizontal add, so there are separate instructions to get results for the odd-indexed and even-indexed positions.
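A scalar sketch of those even/odd semantics as described above (illustrative C only; these are not actual z/Architecture intrinsics or mnemonics):

```c
#include <stdint.h>

/* Each instruction multiplies either the even- or the odd-indexed 8-bit
 * lanes and adds the products into 16-bit accumulator lanes; no pair of
 * adjacent products is summed horizontally. */
void mul_add_even(int16_t acc[8], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        acc[i] += (int16_t)a[2*i] * (int16_t)b[2*i];         /* even positions */
}

void mul_add_odd(int16_t acc[8], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        acc[i] += (int16_t)a[2*i + 1] * (int16_t)b[2*i + 1]; /* odd positions */
}
```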
At today's SIMD sync meeting, we discussed considering this for relaxed-simd, and agreed that we can carry on further discussions, hence transferring the issue there.
I filed WebAssembly/flexible-vectors#15 for this a while back, though given what flexible vectors is about, the operation would need to be consistent across platforms to be part of it.
To add some specific numbers (https://bugzilla.mozilla.org/show_bug.cgi?id=1746631): a machine translation application, https://github.com/browsermt/bergamot-translator, compiled to WebAssembly and a heavy user of 8-bit integer matrix multiplication. Speed is measured in words translated per second (wps). Pure WASM SIMD: 95 wps. (The rest of the app is compiled to WebAssembly.)
@kpu Thanks for sharing! What is Pure WAsm? Is it WebAssembly MVP or WebAssembly SIMD?
@Maratyszcza WebAssembly SIMD 128-bit. I've updated my comment.
This issue is a placeholder for future discussion about supporting 4-dimensional-reducing dot-product instructions taking 8-bit inputs and accumulating into 32-bit, i.e. each 32-bit accumulator lane i gains a[4i]*b[4i] + a[4i+1]*b[4i+1] + a[4i+2]*b[4i+2] + a[4i+3]*b[4i+3].
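A scalar sketch of the intended per-lane behavior (the function name and the signed variant shown are my illustration, not normative text from the proposal):

```c
#include <stdint.h>

/* Each 32-bit accumulator lane i gains the dot product of the four
 * corresponding signed 8-bit lanes of a and b (the SDOT-style signed case). */
void i32x4_dot_i8x16_add(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i] += (int32_t)a[4*i + j] * (int32_t)b[4*i + j];
}
```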
This would be similar to recent instructions: ARM SDOT/UDOT (and even more recent USDOT and SUDOT supporting mixed signednesses) and x86 AVX-VNNI instructions.
The motivation for filing this issue now is that I had created some confusion by commenting on that topic on PR WebAssembly/simd#127, which is actually about something different.
@kpu let's take discussion here.