8bit*8bit 4-D dot-product accumulating to 32bit, similar to ARM SDOT and x86 VNNI #9
Comments
I'm after 8-bit signed GEMM too for my project @browsermt, ultimately for quantized neural networks, which appears to be @bjacob's use case as well. As you mentioned in WebAssembly/simd#224, GEMM routines want to use all the registers with a larger tile to minimize memory bandwidth requirements, which implies not only the ability to query the register count but some knowledge of how many extra registers the implementation requires. For example, a pre-VNNI x86 machine typically needs an extra register of 1s for the multiply-add sequence. The instruction sets are slightly different. ARM has signed * signed in SDOT with wider support. x86 has unsigned * signed only, which is annoying and requires arithmetic hacks to make one of the arguments unsigned (by adding 128 and subtracting the extra contribution back out in a bias term). We could target USDOT/SUDOT on ARM, but then you get the unnecessary extra arithmetic and require a more recent processor just to be compatible with Intel. Another option is to emulate signed * signed on x86 via sign-bit manipulation. In practice a GEMM will want to use the longest register length it can get away with. So while WebAssembly should have an 8-bit dot-product instruction, I wonder if browsers should just support more general matrix multiplication. They already have limited support in DOMMatrix and WebGL. Paging @mlopatka @XapaJIaMnu
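A minimal scalar sketch of that add-128 rewrite (my illustration, not code from the thread; the function name is hypothetical):

```c
#include <stdint.h>

/* Emulate signed*signed dot products on hardware that offers only
 * unsigned*signed: bias a by +128 so it becomes unsigned, then subtract the
 * extra 128 * sum(b). In a real GEMM the correction folds into the
 * per-column bias term, so it costs nothing per element. */
int32_t dot_s8s8_via_u8s8(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0, sum_b = 0;
    for (int k = 0; k < n; k++) {
        uint8_t au = (uint8_t)(a[k] + 128); /* biased operand, now in [0, 255] */
        acc   += (int32_t)au * b[k];        /* what unsigned*signed hardware computes */
        sum_b += b[k];                      /* sum of the signed operand */
    }
    return acc - 128 * sum_b;               /* sum((a+128)*b) - 128*sum(b) = sum(a*b) */
}
```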
Perhaps @lars-t-hansen can provide some thoughts on whether supporting such operations is interesting/feasible from Mozilla's perspective given our current implementation of WASM and the roadmap for this year.
The WebNN people are proposing to add GEMM to the browser, including 8-bit: https://webmachinelearning.github.io/webnn/#api-neuralnetworkcontext-gemm
We had taken a look at WebAsm SIMD for NN inference here. The relevance to the present issue is that there are multiple other issues preventing the WebAsm SIMD proposal from approaching native performance levels, so it is difficult to advocate a big investment in the present feature, which is about recent ISA extensions, until those other issues are resolved. The deepest is WebAssembly/simd#225, a general issue about the entire 'intrinsics' programming model, for which native code has a toolbox of work-arounds and mitigations that are not practical in WebAsm SIMD. As resolving WebAssembly/simd#225 seems to be out of immediate scope for WebAsm SIMD, I would like to see alternatives to WebAsm SIMD emerge where tackling this hard issue would be part of the initial design. There, adding new instructions to follow recent ISA extensions, as discussed in the present issue, would be far easier to justify. Adding @Maratyszcza.
Matching AVX-VNNI most likely would not be feasible for this proposal unless it can be efficiently emulated in SSE; there is no intended AVX support here at all. There is a flexible vectors proposal, which is planned to support AVX.
The operation can be emulated with SSSE3 and even SSE2 if necessary (but I think WebAssembly already assumes SSSE3). Usually, 8-bit GEMM is implemented on pre-VNNI Intel as pmaddubsw (unsigned * signed, adding adjacent pairs into 16-bit with saturation), then pmaddwd against a vector of 1s (adding adjacent pairs into 32-bit), then paddd into the accumulator, because pmaddubsw saturates: the sum of two unsigned * signed products doesn't quite fit in int16. Since we're one bit short, another strategy is sign-bit manipulation before calling pmaddubsw. It's also possible to emulate in SSE2 by widening to 16-bit and then doing pmaddwd.
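As a concrete illustration, here is a minimal sketch of that sequence with SSE intrinsics (my sketch, not code from the thread; note that the vector of 1s ties up a register, which is the extra-register point made above):

```c
#include <emmintrin.h>   /* SSE2: _mm_madd_epi16, _mm_add_epi32, _mm_set1_epi16 */
#include <tmmintrin.h>   /* SSSE3: _mm_maddubs_epi16 */

/* One step of a pre-VNNI 8-bit dot-product kernel:
 * a holds unsigned 8-bit lanes, b holds signed 8-bit lanes. */
static inline __m128i dot4_u8s8_acc32(__m128i acc, __m128i a, __m128i b) {
    /* u8 * s8, adding adjacent pairs into signed 16-bit (saturating) */
    __m128i pairs = _mm_maddubs_epi16(a, b);
    /* multiply by 1 and add adjacent pairs into signed 32-bit */
    __m128i quads = _mm_madd_epi16(pairs, _mm_set1_epi16(1));
    /* accumulate */
    return _mm_add_epi32(acc, quads);
}
```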
We should probably look into expected instruction sequences within the ISA limits that the proposal has, though I can't promise this is going to make it into the MVP.
@kpu, citation for this? (SpiderMonkey currently will disable Wasm SIMD for < SSE4.1 and I think V8 does the same, but the spec has so far allowed scalarization of every operation and I'm not aware of any assumption of available technology short of IEEE math.)
@lars-t-hansen Sorry, what I meant is that WebAssembly already has instructions that map to SSSE3 and later on Intel, which is the highest SSE level those instructions require. In particular, I am stressing to @penzn that IBM z/Architecture has signed * signed and unsigned * unsigned 8-bit multiply instructions with addition into a 16-bit accumulator. They don't do a horizontal add, so there are separate instructions to get results for the odd-indexed and even-indexed positions.
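A scalar sketch of those even/odd semantics as described above (illustrative C only; these are not actual z/Architecture intrinsics or mnemonics):

```c
#include <stdint.h>

/* Each instruction multiplies either the even- or the odd-indexed 8-bit
 * lanes and adds the products into 16-bit accumulator lanes; no pair of
 * adjacent products is summed horizontally. */
void mul_add_even(int16_t acc[8], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        acc[i] += (int16_t)a[2*i] * (int16_t)b[2*i];         /* even positions */
}

void mul_add_odd(int16_t acc[8], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        acc[i] += (int16_t)a[2*i + 1] * (int16_t)b[2*i + 1]; /* odd positions */
}
```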
At today's SIMD sync meeting, we discussed considering this for relaxed-simd, and agreed that we can carry on further discussions, hence transferring the issue there.
I filed WebAssembly/flexible-vectors#15 for this a while back, though given what flexible vectors is about, the operation would need to be consistent across platforms to be part of it.
To add some specific numbers (https://bugzilla.mozilla.org/show_bug.cgi?id=1746631): a machine translation application, https://github.com/browsermt/bergamot-translator, compiled to WebAssembly and a heavy user of 8-bit integer matrix multiplication. Speed is measured in words translated per second (wps). Pure WASM SIMD: 95 wps. (The rest of the app is compiled to WebAssembly.)
@kpu Thanks for sharing! What is Pure WAsm? Is it WebAssembly MVP or WebAssembly SIMD?
@Maratyszcza WebAssembly SIMD 128-bit. I've updated my comment.
This issue is a placeholder for future discussion about supporting 4-dimensional-reducing dot-product instructions taking 8-bit inputs and accumulating into 32-bit, i.e. each 32-bit accumulator lane i gains a[4i]*b[4i] + a[4i+1]*b[4i+1] + a[4i+2]*b[4i+2] + a[4i+3]*b[4i+3].
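A scalar sketch of the intended per-lane behavior (the function name and the signed variant shown are my illustration, not normative text from the proposal):

```c
#include <stdint.h>

/* Each 32-bit accumulator lane i gains the dot product of the four
 * corresponding signed 8-bit lanes of a and b (the SDOT-style signed case). */
void i32x4_dot_i8x16_add(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i] += (int32_t)a[4*i + j] * (int32_t)b[4*i + j];
}
```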
This would be similar to recent instructions: ARM SDOT/UDOT (and even more recent USDOT and SUDOT supporting mixed signednesses) and x86 AVX-VNNI instructions.
The motivation for filing this issue now is that I had created some confusion by commenting on that topic on PR WebAssembly/simd#127, which is actually about something different.
@kpu let's take discussion here.