Conversation
Thanks for the detailed write-up. What do you think about paring this down to just the dot2_s operations and not the dot2add_s operations? The dot2_s operations as detailed here are useful to have, but the dot2add_s operations seem out of scope for an MVP of the SIMD proposal. The i16x8.dot2add_s operation maps directly to an instruction only on x86-64 with XOP enabled; on most other architectures, the codegen would be equivalent to generating an add operation after the dot2_s operation.
We have had two expansions of the instruction set that specifically address intermediate overflow: load with extend (#98) and widening operations (part of #89). Load and extend in particular absorbs the cost of extending into the load operation. Is there specific code that regresses without this extension?
For the instruction naming, the type prefix should be
@dtig I updated the PR description with the list of codes using
@penzn I added a list of applications to the PR description. Compared to using load-with-extend and 32x32->32 multiplications, these "dot product" operations have several performance advantages:
@tlively Good point about
If we may need to differentiate between different input types, how about
@tlively Sounds reasonable; there are already instructions with similar names. Updated the commit & PR description.
Summary: This instruction is not merged to the spec proposal, but we need it to be implemented in the toolchain to experiment with it. It is available only on an opt-in basis through a clang builtin. Defined in WebAssembly/simd#127. Depends on D69696. Reviewers: aheejin Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D69697
@Maratyszcza Could you add pseudocode for the semantics of these operations as well? I want to make sure I implement the interpreter correctly in Binaryen.
@tlively I don't quite understand the pseudo-code specification in WAsm SIMD, especially given that these "dot product" instructions are among the few to have some "horizontal" component in them. You may refer to the PMADDWD instruction in the Intel architecture manual, which is the analog of `i32x4.dot_i16x8_s`.
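For reference, PMADDWD-style semantics can be sketched in scalar C as follows. This is an illustrative sketch only, not normative pseudocode; the function name and array-based signature are invented for this example.

```c
#include <stdint.h>

/* Sketch of i32x4.dot_i16x8_s: each output lane i is
 * a[2*i]*b[2*i] + a[2*i+1]*b[2*i+1], with the 16-bit lanes sign-extended
 * to 32 bits before multiplying (mirroring x86 PMADDWD).  The only sum that
 * can exceed int32 range is 2 * (-32768 * -32768) = 2^31, which wraps to
 * INT32_MIN; the addition is done in uint32_t so the wraparound is well
 * defined in C. */
static void i32x4_dot_i16x8_s(int32_t out[4],
                              const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 4; i++) {
        int32_t lo = (int32_t)a[2 * i]     * (int32_t)b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```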
This experimental instruction is specified in WebAssembly/simd#127 and is being implemented to enable further investigation of its performance impact.
The opcodes in this PR collide with the opcodes used for the {i8x16,i16x8}.avgr_u instructions. Since the averaging instructions have been merged, I will reassign the opcodes for the dot product instructions to 0xdb and 0xdc in the LLVM and Binaryen implementations. Edit: I forgot there was only one dot product instruction implemented.
@tlively Renumbered opcodes in the PR similarly to LLVM
I would like to +1 this request, especially the 8-bit × 8-bit accumulating into 32-bit flavor as in VNNI. It would be a necessary prerequisite to even consider targeting WebAssembly for integer-quantized neural network inference code. Without such an instruction, 8-bit quantization of neural networks simply won't provide a meaningful computational advantage over float; people would merely 8-bit-quantize to shrink the download size but then dequantize to float for the client-side computation. On the ARM side, the VNNI-equivalent instruction is SDOT/UDOT, available in currently shipping ARM CPUs / Android devices such as the Pixel 4, and now available in lower-end CPUs as well (Cortex-A55), so it's a present issue, not a future one. It is a 4x speed difference today. (For both points made above, see this data). Example production code (used by TensorFlow Lite using these instructions) (the machine-encodings are here to support older assemblers).
…ccepted status. r=jseward Background: WebAssembly/simd#127 For the widening dot product instruction: - remove the internal 'Experimental' opcode suffix in the C++ code - remove the guard on the instruction in all the C++ decoders - move the test cases from simd/experimental.js to simd/ad-hack.js I have checked that current V8 and wasm-tools use the same opcode mapping. V8 in turn guarantees the correct mapping for LLVM and binaryen. Differential Revision: https://phabricator.services.mozilla.com/D92929
There is no pseudocode for this op, but I think the text description is straightforward enough. Merging.
@Maratyszcza are there any
This patch implements, for aarch64, the following wasm SIMD extensions i32x4.dot_i16x8_s instruction WebAssembly/simd#127 It also updates dependencies as follows, in order that the new instruction can be parsed, decoded, etc: wat to 1.0.27 wast to 26.0.1 wasmparser to 0.65.0 wasmprinter to 0.2.12 The changes are straightforward: * new CLIF instruction `widening_pairwise_dot_product_s` * translation from wasm into `widening_pairwise_dot_product_s` * new AArch64 instructions `smull`, `smull2` (part of the `VecRRR` group) * translation from `widening_pairwise_dot_product_s` to `smull ; smull2 ; addv` There is no testcase in this commit, because that is a separate repo. The implementation has been tested, nevertheless.
Not yet, this is a pretty new instruction, so it's not implemented in the interpreter yet. Ofc, contributions welcome, let me know if you're interested (either to contribute implementation or tests, or both).
Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic. For spec and cpu mapping details, see: WebAssembly/simd#122 (pmax/pmin) WebAssembly/simd#232 (rounding) WebAssembly/simd#127 (dot product) WebAssembly/simd#237 (load zero) The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest. Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination. Differential Revision: https://phabricator.services.mozilla.com/D85982
It multiplies respective lanes from the 2 input operands, then adds adjacent lanes. This was merged into the proposal in #127.
This instruction was added in #127. Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>
Introduction
Integer arithmetic instructions in WebAssembly SIMD produce results of the same element type as their inputs. To avoid overflow, developers have to pre-extend inputs to wider element types and perform twice as many arithmetic operations on the twice-wider elements. The need for this work-around is particularly concerning for multiplications. Compare, for example, the performance of a `PMULLW` + `PMULHW` combination (which together compute 32-bit products of 8 16-bit inputs) with that of two `PMULLD` instructions (which together compute 32-bit products of 8 32-bit inputs) on various x86 microarchitectures:

`PMULLW xmm, xmm` + `PMULHW xmm, xmm` vs. 2 × `PMULLD xmm, xmm` (see [1] for per-microarchitecture latency and throughput figures)
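As a concrete illustration of the work-around above (a scalar sketch with invented names, not code from this proposal): the eight 16-bit lanes are sign-extended and multiplied at full 32-bit width, and because the eight 32-bit results no longer fit in a single 128-bit vector, the SIMD version needs two extension steps and two 32x32-bit multiplies where a single 16x16-bit instruction would otherwise do.

```c
#include <stdint.h>

/* Pre-widening work-around, scalar sketch: all products are computed at
 * 32-bit width so they cannot overflow.  In 128-bit SIMD the results split
 * into a low and a high i32x4 half, doubling the number of multiplies. */
static void mul_widened_i16x8(int32_t prod_lo[4], int32_t prod_hi[4],
                              const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 4; i++) {
        prod_lo[i] = (int32_t)a[i]     * (int32_t)b[i];      /* low half  */
        prod_hi[i] = (int32_t)a[i + 4] * (int32_t)b[i + 4];  /* high half */
    }
}
```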
Operations that produce twice-wider results would need to return two SIMD vectors, and thus depend on the future multi-value proposal. To stay within baseline WebAssembly features, we have to aggregate the two wide results into one, so the instruction can produce a single output vector. Luckily, there is an aggregating operation directly supported in the x86, MIPS, and POWER instruction sets that can also be lowered efficiently on ARM and ARM64: addition of adjacent multiplication results. The resulting combination of a full multiplication and addition of adjacent products can be interpreted as a dot product of 2-wide subvectors within a SIMD vector, producing results twice as wide as the inputs (albeit with half as many result elements as input elements).
This PR introduces 2-element dot product instructions with signed 16-bit integer input elements and signed 32-bit integer output elements. We don't consider other data types because they can't be expressed efficiently on x86 (e.g. the only multiplication on byte inputs on x86 multiplies signed bytes by unsigned bytes with signed saturation, too exotic to build a portable instruction on top of). The new `i32x4.dot_i16x8_s` instruction returns the dot product right away, and `i32x4.dot_i16x8_add_s` additionally accumulates it with a third input vector of 32-bit elements. The second instruction was added because accumulation of dot product results is common, and many instruction sets provide a specialized instruction for this case.
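A matching scalar sketch of the accumulating form (again purely illustrative; the helper name is invented): it is the same pairwise dot product, added lane-wise into the third operand.

```c
#include <stdint.h>

/* Sketch of i32x4.dot_i16x8_add_s: the pairwise dot product of a and b,
 * accumulated into the 32-bit lanes of c with wrapping addition. */
static void i32x4_dot_i16x8_add_s(int32_t out[4], const int16_t a[8],
                                  const int16_t b[8], const int32_t c[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t lo = (int32_t)a[2 * i]     * (int32_t)b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi + (uint32_t)c[i]);
    }
}
```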
[October 31 update] Applications
Below are examples of optimized libraries using close equivalents of the proposed `i32x4.dot_i16x8_s` and `i32x4.dot_i16x8_add_s` instructions:
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512VNNI and AVX512VL instruction sets
- `c = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `VPDPWSSD xmm_c, xmm_a, xmm_b`
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `VMOVDQA xmm_y, xmm_c` + `VPDPWSSD xmm_y, xmm_a, xmm_b`
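In C, the same AVX512VNNI mapping can be expressed with the `_mm_dpwssd_epi32` intrinsic (requires AVX512VNNI and AVX512VL); the wrapper below is an invented name for illustration only.

```c
#include <immintrin.h>

/* Illustrative wrapper: c = i32x4.dot_i16x8_add_s(a, b, c) via VPDPWSSD. */
static __m128i dot_i16x8_add_s_vnni(__m128i a, __m128i b, __m128i c) {
    return _mm_dpwssd_epi32(c, a, b);  /* VPDPWSSD xmm_c, xmm_a, xmm_b */
}
```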
x86/x86-64 processors with XOP instruction set
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `VPMADCSWD xmm_y, xmm_a, xmm_b, xmm_c`
x86/x86-64 processors with AVX instruction set
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to `VPMADDWD xmm_y, xmm_a, xmm_b`
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `VPMADDWD xmm_tmp, xmm_a, xmm_b` + `VPADDD xmm_y, xmm_tmp, xmm_c`
x86/x86-64 processors with SSE2 instruction set
- `a = i32x4.dot_i16x8_s(a, b)` is lowered to `PMADDWD xmm_a, xmm_b`
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to `MOVDQA xmm_y, xmm_a` + `PMADDWD xmm_y, xmm_b`
- `c = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `MOVDQA xmm_tmp, xmm_a` + `PMADDWD xmm_tmp, xmm_b` + `PADDD xmm_c, xmm_tmp`
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `MOVDQA xmm_y, xmm_a` + `PMADDWD xmm_y, xmm_b` + `PADDD xmm_y, xmm_c`
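The SSE2 mapping corresponds to the well-known `_mm_madd_epi16` (PMADDWD) and `_mm_add_epi32` (PADDD) intrinsics; the wrapper functions below are illustrative only.

```c
#include <emmintrin.h>  /* SSE2 */

/* y = i32x4.dot_i16x8_s(a, b): PMADDWD. */
static __m128i dot_i16x8_s_sse2(__m128i a, __m128i b) {
    return _mm_madd_epi16(a, b);
}

/* y = i32x4.dot_i16x8_add_s(a, b, c): PMADDWD + PADDD. */
static __m128i dot_i16x8_add_s_sse2(__m128i a, __m128i b, __m128i c) {
    return _mm_add_epi32(_mm_madd_epi16(a, b), c);
}
```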
ARM64 processors
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to:
  - `SMULL Vtmp.4S, Va.4H, Vb.4H`
  - `SMULL2 Vtmp2.4S, Va.8H, Vb.8H`
  - `ADDP Vy.4S, Vtmp.4S, Vtmp2.4S`
- `y = i32x4.dot_i16x8_add_s(a, b, y)` is lowered to:
  - `SMULL Vtmp.4S, Va.4H, Vb.4H`
  - `SMULL2 Vtmp2.4S, Va.8H, Vb.8H`
  - `ADDP Vtmp.4S, Vtmp.4S, Vtmp2.4S`
  - `ADD Vy.4S, Vy.4S, Vtmp.4S`
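The ARM64 sequence corresponds to standard AArch64 NEON intrinsics (`vmull_s16`/`vmull_high_s16` for SMULL/SMULL2 and `vpaddq_s32` for ADDP); the wrapper below is illustrative only.

```c
#include <arm_neon.h>  /* AArch64 */

/* y = i32x4.dot_i16x8_s(a, b): SMULL + SMULL2 + ADDP. */
static int32x4_t dot_i16x8_s_neon64(int16x8_t a, int16x8_t b) {
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));  /* SMULL  */
    int32x4_t hi = vmull_high_s16(a, b);                         /* SMULL2 */
    return vpaddq_s32(lo, hi);                                   /* ADDP   */
}
```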
ARMv7 processors with NEON instruction set
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to:
  - `VMULL.S16 Qtmp, Da_lo, Db_lo`
  - `VMULL.S16 Qtmp2, Da_hi, Db_hi`
  - `VPADD.I32 Dy_lo, Dtmp_lo, Dtmp_hi`
  - `VPADD.I32 Dy_hi, Dtmp2_lo, Dtmp2_hi`
- `y = i32x4.dot_i16x8_add_s(a, b, y)` is lowered to:
  - `VMULL.S16 Qtmp, Da_lo, Db_lo`
  - `VMULL.S16 Qtmp2, Da_hi, Db_hi`
  - `VPADD.I32 Dtmp_lo, Dtmp_lo, Dtmp_hi`
  - `VPADD.I32 Dtmp_hi, Dtmp2_lo, Dtmp2_hi`
  - `VADD.I32 Qy, Qy, Qtmp`
POWER processors with VMX (Altivec) instruction set
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to `VXOR VRy, VRy, VRy` + `VMSUMSHM VRy, VRa, VRb, VRy`
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `VMSUMSHM VRy, VRa, VRb, VRc`
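On POWER, the accumulating form is a single `vec_msum` (VMSUMSHM) call in C; as before, the wrapper name is invented for illustration.

```c
#include <altivec.h>

/* y = i32x4.dot_i16x8_add_s(a, b, c): VMSUMSHM. */
static vector signed int dot_i16x8_add_s_vmx(vector signed short a,
                                             vector signed short b,
                                             vector signed int c) {
    return vec_msum(a, b, c);
}
```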
MIPS processors with MSA instruction set
- `y = i32x4.dot_i16x8_s(a, b)` is lowered to `DOTP_S.W Wy, Wa, Wb`
- `c = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `DPADD_S.W Wc, Wa, Wb`
- `y = i32x4.dot_i16x8_add_s(a, b, c)` is lowered to `MOVE.V Wy, Wc` + `DPADD_S.W Wy, Wa, Wb`
References
[1] Fog, A. "Instruction tables" (2019). URL: www.agner.org/optimize/instruction_tables.pdf