This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Extended multiplication instructions #376

Merged: 1 commit, Dec 14, 2020

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Oct 7, 2020

Introduction

The result of an integer multiplication is generally twice as wide as its inputs, so the lane-wise multiplication instructions currently in the WebAssembly SIMD specification commonly overflow and wrap. DSP algorithms often need the full multiplication result, and there have been requests (e.g. #175 and #226) to provide such functionality in WebAssembly SIMD. However, the current WebAssembly SIMD specification lacks such instructions, and extended multiplication must be emulated via a combination of widen instructions and the mul instruction of the twice-as-wide result type. This PR adds extended multiplication instructions, which compute the full result of multiplication and enable more efficient lowering to native instruction sets than the emulation sequence of widen and mul instructions (a scalar sketch of the semantics follows the list below):

  • i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b)) -> i16x8.extmul_low_i8x16_s(a, b).
  • i16x8.mul(i16x8.widen_high_i8x16_s(a), i16x8.widen_high_i8x16_s(b)) -> i16x8.extmul_high_i8x16_s(a, b).
  • i16x8.mul(i16x8.widen_low_i8x16_u(a), i16x8.widen_low_i8x16_u(b)) -> i16x8.extmul_low_i8x16_u(a, b).
  • i16x8.mul(i16x8.widen_high_i8x16_u(a), i16x8.widen_high_i8x16_u(b)) -> i16x8.extmul_high_i8x16_u(a, b).
  • i32x4.mul(i32x4.widen_low_i16x8_s(a), i32x4.widen_low_i16x8_s(b)) -> i32x4.extmul_low_i16x8_s(a, b).
  • i32x4.mul(i32x4.widen_high_i16x8_s(a), i32x4.widen_high_i16x8_s(b)) -> i32x4.extmul_high_i16x8_s(a, b).
  • i32x4.mul(i32x4.widen_low_i16x8_u(a), i32x4.widen_low_i16x8_u(b)) -> i32x4.extmul_low_i16x8_u(a, b).
  • i32x4.mul(i32x4.widen_high_i16x8_u(a), i32x4.widen_high_i16x8_u(b)) -> i32x4.extmul_high_i16x8_u(a, b).
  • i64x2.mul(i64x2.widen_low_i32x4_s(a), i64x2.widen_low_i32x4_s(b)) -> i64x2.extmul_low_i32x4_s(a, b).
  • i64x2.mul(i64x2.widen_high_i32x4_s(a), i64x2.widen_high_i32x4_s(b)) -> i64x2.extmul_high_i32x4_s(a, b).
  • i64x2.mul(i64x2.widen_low_i32x4_u(a), i64x2.widen_low_i32x4_u(b)) -> i64x2.extmul_low_i32x4_u(a, b).
  • i64x2.mul(i64x2.widen_high_i32x4_u(a), i64x2.widen_high_i32x4_u(b)) -> i64x2.extmul_high_i32x4_u(a, b).
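
As a reference for the equivalences above, here is a minimal scalar sketch in C of one of the new instructions, i32x4.extmul_low_i16x8_s (illustrative names, not the normative spec text): widen each input lane to the result width first, then multiply, so the product never wraps.

```c
#include <stdint.h>

/* Scalar sketch of i32x4.extmul_low_i16x8_s: multiply the 4 low lanes of two
   i16x8 vectors, producing full 32-bit products. */
static void i32x4_extmul_low_i16x8_s_ref(const int16_t a[8], const int16_t b[8],
                                         int32_t out[4]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = (int32_t)a[i] * (int32_t)b[i];  /* widen, then multiply: no wrap */
  }
}
```

The high variants read lanes 4..7 instead, and the unsigned variants use uint16_t/uint32_t.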

Native instruction sets typically include means to compute the full result of a multiplication, although the exact details vary by architecture and data type. ARM NEON provides instructions that perform extended multiplication on the low or high halves of the input SIMD vectors and produce a full SIMD vector of results; they map 1:1 to the proposed WebAssembly SIMD instructions. x86 provides different instructions depending on the data type: the 32x32->64 multiplication instructions consume two even-numbered lanes as input and produce a single 128-bit vector with two full 64-bit results, while 16x16->32 multiplication is provided via separate instructions that compute the low and high 16-bit parts of the 32-bit results, which can be interleaved to get vectors of full 32-bit results.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHW xmm_y, xmm_a, xmm_b + VPUNPCKLWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHW xmm_y, xmm_a, xmm_b + VPUNPCKHWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHUW xmm_y, xmm_a, xmm_b + VPUNPCKLWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHUW xmm_y, xmm_a, xmm_b + VPUNPCKHWD xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_low_i32x4_s(a, b)

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKLDQ xmm_y, xmm_b, xmm_b + VPMULDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_high_i32x4_s(a, b)

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKHDQ xmm_y, xmm_b, xmm_b + VPMULDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_low_i32x4_u(a, b)

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKLDQ xmm_y, xmm_b, xmm_b + VPMULUDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_high_i32x4_u(a, b)

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKHDQ xmm_y, xmm_b, xmm_b + VPMULUDQ xmm_y, xmm_tmp, xmm_y
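
For reference, a minimal intrinsics sketch of the i32x4.extmul_low_i16x8_s lowering above (VPMULLW + VPMULHW + VPUNPCKLWD); the function name is illustrative, and with AVX enabled these intrinsics emit the VEX-encoded forms listed here.

```c
#include <immintrin.h>

/* Sketch of the lowering above for y = i32x4.extmul_low_i16x8_s(a, b). */
static __m128i i32x4_extmul_low_i16x8_s_x86(__m128i a, __m128i b) {
  __m128i lo = _mm_mullo_epi16(a, b);  /* PMULLW: low 16 bits of each product */
  __m128i hi = _mm_mulhi_epi16(a, b);  /* PMULHW: high 16 bits (signed) */
  return _mm_unpacklo_epi16(lo, hi);   /* PUNPCKLWD: interleave into i32x4 */
}
```

The high variant only differs in the final unpack (_mm_unpackhi_epi16), and the unsigned variants substitute _mm_mulhi_epu16.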

x86/x86-64 processors with SSE4.1 instruction set

  • i64x2.extmul_low_i32x4_s(a, b)

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0x50 + PSHUFD xmm_y, xmm_b, 0x50 + PMULDQ xmm_y, xmm_tmp
  • i64x2.extmul_high_i32x4_s(a, b)

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0xFA + PSHUFD xmm_y, xmm_b, 0xFA + PMULDQ xmm_y, xmm_tmp
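
A minimal intrinsics sketch of the SSE4.1 lowering above (PSHUFD with immediate 0x50 + PMULDQ); the function name is illustrative.

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Sketch of the lowering above for y = i64x2.extmul_low_i32x4_s(a, b). */
static __m128i i64x2_extmul_low_i32x4_s_sse41(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  return _mm_mul_epi32(a_dup, b_dup);          /* PMULDQ: signed 32x32->64 on even lanes */
}
```

The high variant uses the 0xFA shuffle immediate instead.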

x86/x86-64 processors with SSE2 instruction set

  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHW xmm_tmp, xmm_b + PUNPCKLWD xmm_y, xmm_tmp
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHW xmm_tmp, xmm_b + PUNPCKHWD xmm_y, xmm_tmp
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHUW xmm_tmp, xmm_b + PUNPCKLWD xmm_y, xmm_tmp
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHUW xmm_tmp, xmm_b + PUNPCKHWD xmm_y, xmm_tmp
  • i64x2.extmul_low_i32x4_u(a, b)

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0x50 + PSHUFD xmm_y, xmm_b, 0x50 + PMULUDQ xmm_y, xmm_tmp
  • i64x2.extmul_high_i32x4_u(a, b)

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0xFA + PSHUFD xmm_y, xmm_b, 0xFA + PMULUDQ xmm_y, xmm_tmp
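
The unsigned i64x2 case needs only SSE2; a minimal sketch of the lowering above (PSHUFD 0x50 + PMULUDQ), with an illustrative function name:

```c
#include <emmintrin.h>  /* SSE2 */

/* Sketch of the lowering above for y = i64x2.extmul_low_i32x4_u(a, b). */
static __m128i i64x2_extmul_low_i32x4_u_sse2(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  return _mm_mul_epu32(a_dup, b_dup);          /* PMULUDQ: unsigned 32x32->64 on even lanes */
}
```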

ARM64 processors

  • i16x8.extmul_low_i8x16_s

    • y = i16x8.extmul_low_i8x16_s(a, b) is lowered to SMULL Vy.8H, Va.8B, Vb.8B
  • i16x8.extmul_high_i8x16_s

    • y = i16x8.extmul_high_i8x16_s(a, b) is lowered to SMULL2 Vy.8H, Va.16B, Vb.16B
  • i16x8.extmul_low_i8x16_u

    • y = i16x8.extmul_low_i8x16_u(a, b) is lowered to UMULL Vy.8H, Va.8B, Vb.8B
  • i16x8.extmul_high_i8x16_u

    • y = i16x8.extmul_high_i8x16_u(a, b) is lowered to UMULL2 Vy.8H, Va.16B, Vb.16B
  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to SMULL Vy.4S, Va.4H, Vb.4H
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to SMULL2 Vy.4S, Va.8H, Vb.8H
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to UMULL Vy.4S, Va.4H, Vb.4H
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to UMULL2 Vy.4S, Va.8H, Vb.8H
  • i64x2.extmul_low_i32x4_s

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to SMULL Vy.2D, Va.2S, Vb.2S
  • i64x2.extmul_high_i32x4_s

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to SMULL2 Vy.2D, Va.4S, Vb.4S
  • i64x2.extmul_low_i32x4_u

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to UMULL Vy.2D, Va.2S, Vb.2S
  • i64x2.extmul_high_i32x4_u

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to UMULL2 Vy.2D, Va.4S, Vb.4S
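
A minimal NEON-intrinsics sketch of the ARM64 lowerings above (function names illustrative); SMULL and SMULL2 correspond to vmull_s8 and vmull_high_s8:

```c
#include <arm_neon.h>

/* Sketches of two of the ARM64 lowerings above. */
static int16x8_t i16x8_extmul_low_i8x16_s_a64(int8x16_t a, int8x16_t b) {
  return vmull_s8(vget_low_s8(a), vget_low_s8(b));  /* SMULL Vy.8H, Va.8B, Vb.8B */
}

static int16x8_t i16x8_extmul_high_i8x16_s_a64(int8x16_t a, int8x16_t b) {
  return vmull_high_s8(a, b);                       /* SMULL2 Vy.8H, Va.16B, Vb.16B */
}
```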

ARMv7 processors with NEON instruction set

  • i16x8.extmul_low_i8x16_s

    • y = i16x8.extmul_low_i8x16_s(a, b) is lowered to VMULL.S8 Qy, Da_lo, Db_lo
  • i16x8.extmul_high_i8x16_s

    • y = i16x8.extmul_high_i8x16_s(a, b) is lowered to VMULL.S8 Qy, Da_hi, Db_hi
  • i16x8.extmul_low_i8x16_u

    • y = i16x8.extmul_low_i8x16_u(a, b) is lowered to VMULL.U8 Qy, Da_lo, Db_lo
  • i16x8.extmul_high_i8x16_u

    • y = i16x8.extmul_high_i8x16_u(a, b) is lowered to VMULL.U8 Qy, Da_hi, Db_hi
  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to VMULL.S16 Qy, Da_lo, Db_lo
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to VMULL.S16 Qy, Da_hi, Db_hi
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to VMULL.U16 Qy, Da_lo, Db_lo
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to VMULL.U16 Qy, Da_hi, Db_hi
  • i64x2.extmul_low_i32x4_s

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to VMULL.S32 Qy, Da_lo, Db_lo
  • i64x2.extmul_high_i32x4_s

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to VMULL.S32 Qy, Da_hi, Db_hi
  • i64x2.extmul_low_i32x4_u

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to VMULL.U32 Qy, Da_lo, Db_lo
  • i64x2.extmul_high_i32x4_u

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to VMULL.U32 Qy, Da_hi, Db_hi
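
On ARMv7, the widening multiply intrinsics have no *_high form, so the high halves are selected explicitly; a minimal sketch with an illustrative name:

```c
#include <arm_neon.h>

/* Sketch of the ARMv7 NEON lowering above for i16x8.extmul_high_i8x16_s. */
static int16x8_t i16x8_extmul_high_i8x16_s_v7(int8x16_t a, int8x16_t b) {
  return vmull_s8(vget_high_s8(a), vget_high_s8(b));  /* VMULL.S8 Qy, Da_hi, Db_hi */
}
```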

@omnisip

omnisip commented Oct 8, 2020

@Maratyszcza

I really like this proposal because it addresses a common real-world situation. Is there any way, semi-consistent across both architectures, to end up with two i32x4s? The reason I ask is that the Intel ops produce the full multiplication for all 8 lanes and the only difference is the unpack, which means if we want the second vector we have to repeat the multiplication.

With respect to the ARM instructions, it looks like it's a single call for each, which makes you wonder if it's doing the same style of calculation under the hood. If ARM provided a method to do this similar to Intel, would it make more sense to implement the proposal that way? This would get us the benefit of the 8-wide multiplication happening only once.

@Maratyszcza
Contributor Author

@omnisip We could try to add an instruction that produces two output SIMD vectors - @tlively mentioned in the last CG meeting that this is now possible. However, 16x16->32 multiplication on x86 is the only case that would benefit here, so I decided to leave these two-output instructions for later.

@omnisip

omnisip commented Oct 8, 2020

> @omnisip We could try to add an instruction that produces two output SIMD vectors - @tlively mentioned in the last CG meeting that this is now possible. However, 16x16->32 multiplication on x86 is the only case that would benefit here, so I decided to leave these two-output instructions for later.

Well then these next comments will go together:

To add the signed variant for SSE2:

sign1 = sar(number1, 31); // produces a full 32-bit mask if the lane is negative
sign2 = sar(number2, 31);
for each vector:
number = number xor sign; number = number - sign; // this produces the absolute value in 32 bits in each lane
// example: ((-700 ^ (-700 >> 31)) - (-700 >> 31)) === 700
pmuludq xmm, xmm // absolute-value multiplication
// now the fun part.
properSign = sign1 ^ sign2; // proper sign in 32 bits for the output of the multiplication
pshufd properSign // expand it to 64 bits.
// Finally...
signed64 = signed64 xor properSign; signed64 = signed64 - properSign;
...
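
A minimal SSE2 intrinsics sketch of the sign fix-up idea above, applied to i64x2.extmul_low_i32x4_s (names illustrative; this is one possible spelling, not the lowering chosen by any particular engine): take absolute values, multiply unsigned with PMULUDQ, then negate the 64-bit products whose input signs differ.

```c
#include <emmintrin.h>  /* SSE2 */

/* Signed 32x32->64 extended multiply of the two low lanes using only SSE2. */
static __m128i i64x2_extmul_low_i32x4_s_sse2(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  __m128i sa = _mm_srai_epi32(a_dup, 31);      /* per-lane sign masks */
  __m128i sb = _mm_srai_epi32(b_dup, 31);
  __m128i abs_a = _mm_sub_epi32(_mm_xor_si128(a_dup, sa), sa);  /* (x ^ s) - s */
  __m128i abs_b = _mm_sub_epi32(_mm_xor_si128(b_dup, sb), sb);
  __m128i prod = _mm_mul_epu32(abs_a, abs_b);  /* PMULUDQ on even lanes */
  /* Combined sign; both 32-bit halves of each 64-bit lane are equal, so this
     is already a valid 64-bit mask. */
  __m128i sign = _mm_xor_si128(sa, sb);
  return _mm_sub_epi64(_mm_xor_si128(prod, sign), sign);  /* negate where signs differ */
}
```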

@omnisip

omnisip commented Oct 8, 2020

Side note:

8 to 16-bit multiplication can be implemented entirely within pmullw or pmuludq too if you intend on adding support for it.

@omnisip

omnisip commented Oct 9, 2020

Full assembly code and running examples showing that the signed arithmetic works for SSE2.

https://godbolt.org/z/fxqE7r

@ngzhian
Member

ngzhian commented Oct 19, 2020

Prototyped on arm64 in https://crrev.com/c/2469156

tlively added a commit to tlively/binaryen that referenced this pull request Oct 27, 2020
Including saturating, rounding Q15 multiplication as proposed in
WebAssembly/simd#365 and extending multiplications as
proposed in WebAssembly/simd#376. Since these are just
prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as
implementing them in the interpreter.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Oct 28, 2020
tlively added a commit to llvm/llvm-project that referenced this pull request Oct 28, 2020
As proposed in WebAssembly/simd#376. This commit
implements new builtin functions and intrinsics for these instructions, but does
not yet add them to wasm_simd128.h because they have not yet been merged to the
proposal. These are the first instructions with opcodes greater than 0xff, so
this commit updates the MC layer and disassembler to handle that correctly.

Differential Revision: https://reviews.llvm.org/D90253
@tlively
Member

tlively commented Oct 28, 2020

These have now landed in LLVM and Binaryen and should be ready to use in tip-of-tree Emscripten in a few hours. The builtin functions to use are __builtin_wasm_extmul_{low,high}_{arg interpretation}_{s,u}_{result interpretation}.

@ngzhian
Member

ngzhian commented Dec 1, 2020

@Maratyszcza any suggested lowering for i16x8.extmul_{high,low}_i8x16_{s,u} for x86 and x64?

@Maratyszcza
Contributor Author

@ngzhian There isn't anything more efficient than the naive i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b)).
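
For reference, one possible spelling of that naive sequence with SSE4.1 intrinsics (function name illustrative): sign-extend the low 8 bytes of each input, then multiply; products of two 8-bit values always fit in 16 bits.

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Sketch of the naive lowering for y = i16x8.extmul_low_i8x16_s(a, b). */
static __m128i i16x8_extmul_low_i8x16_s_sse41(__m128i a, __m128i b) {
  __m128i a16 = _mm_cvtepi8_epi16(a);  /* PMOVSXBW: widen the low 8 lanes */
  __m128i b16 = _mm_cvtepi8_epi16(b);
  return _mm_mullo_epi16(a16, b16);    /* PMULLW: exact 16-bit products */
}
```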

@Maratyszcza
Contributor Author

I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library. Fixed-point neural network operators typically accumulate intermediate results in high precision (32-bit) and at the end need to convert them into a low-precision representation (typically 8-bit), a transformation called requantization. The performance impact on the requantization primitive is summarized in the table below:

| Processor (Device) | Performance with WAsm SIMD + Extended Multiplication | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 3.50 GB/s | 2.37 GB/s | 48% |
| Snapdragon 670 (Pixel 3a) | 2.21 GB/s | 1.36 GB/s | 63% |
| Exynos 8895 (Galaxy S8) | 2.05 GB/s | 1.34 GB/s | 53% |

Requantization is only one of several components of fixed-point inference that could benefit from the extended multiplication instructions, but even it alone has a noticeable end-to-end impact, as demonstrated below for the MobileNet v2 model:

| Processor (Device) | Latency with WAsm SIMD + Extended Multiplication | Latency with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 73 ms | 78 ms | 7% |
| Snapdragon 670 (Pixel 3a) | 137 ms | 146 ms | 7% |
| Exynos 8895 (Galaxy S8) | 156 ms | 165 ms | 6% |

The code modifications can be seen in google/XNNPACK#1202.

@omnisip

omnisip commented Dec 4, 2020

@Maratyszcza Is this because 3 instructions are replaced by 1?

@Maratyszcza
Contributor Author

@omnisip Not quite. On one side, the baseline doesn't use the 32->64-bit extension instructions, as these are still experimental and have to be emulated via 2 WAsm SIMD instructions. On the other side, the baseline version pre-multiplies the multiplier by 2 as an optimization, but the version with extended multiplication instructions instead explicitly doubles the result, because if we pre-multiply the multiplier by 2 it no longer fits into 32 bits.

@omnisip

omnisip commented Dec 5, 2020

That makes sense and reviewing the code really shows how much of a performance bump there should be.

Is this a practical use case for SQDMLAL on a native implementation?

@jlb6740

jlb6740 commented Dec 7, 2020

@Maratyszcza The desire for the full multiplication result is there, and the speed-up seen on ARM looks very good. On x86/x64 it seems as if there is less flexibility for an efficient lowering. Any idea what the comparable XNNPACK speed-up is for x86/x64?

@omnisip

omnisip commented Dec 7, 2020

@jlb6740 The bulk of the speedup for x64 comes from the fact that x64 has no native 64x64 multiplication instruction -- so in a roundabout way, one would have to convert to 64-bit behind the scenes and then perform the same underlying operation as if the values were 32-bit integers, yielding 8 instructions for each 64x64 multiply plus 2 instructions for each conversion to 64 bits, or 20 instructions to do the full set. Compare that with the 6 instructions it takes to get the job done here. Similarly, the performance increase could be pushed even further if this supported multiple return values. PMULDQ and PMULUDQ use even-numbered lanes for multiplication, but otherwise do the integer expansion natively. If these returned the full set of 32->64 products, the speedup on x86/x64 would come not just from fewer instructions, but also from at least 1 cycle less of latency.

e.g.

;; i64x2.extmul_low_i32x4_s(a, b)
VPSHUFD xmm_tmp, xmm_a, 0x50 ; 1tp, 1 lat; finishes in cycle 1
VPSHUFD xmm_y, xmm_b, 0x50 ; 1tp, 1 lat; finishes in cycle 2
VPMULDQ xmm_y, xmm_y, xmm_tmp ; 0.5tp, 5 lat; finishes in cycle 7
;; i64x2.extmul_high_i32x4_s(a, b)
VPSHUFD xmm_tmp, xmm_a, 0xFA ; 1tp, 1 lat; finishes in cycle 3
VPSHUFD xmm_y, xmm_b, 0xFA ; 1tp, 1 lat; finishes in cycle 4
VPMULDQ xmm_y, xmm_y, xmm_tmp ; 0.5tp, 5 lat; finishes in cycle 9

becomes

VPMULDQ xmm_ab_even, xmm_a, xmm_b ; 0.5tp, lat 5; finishes in cycle 5
VPSRLQ xmm_a_odd, xmm_a, 32 ; 0.5tp, lat 1; finishes in cycle 1
VPSRLQ xmm_b_odd, xmm_b, 32 ; 0.5tp, lat 1; can be swapped for a shuffle to get all three instructions done in the first cycle, but may be inconsequential since VPMULDQ is also 0.5tp
VPMULDQ xmm_ab_odd, xmm_a_odd, xmm_b_odd ; 0.5tp, lat 5; finishes in cycle 7 (non-shuffle), cycle 6 (shuffle)
VPUNPCKLQDQ xmm_y_low, xmm_ab_even, xmm_ab_odd ; 1tp, lat 1; finishes in cycle 7 or 8
VPUNPCKHQDQ xmm_y_high, xmm_ab_even, xmm_ab_odd ; 1tp, lat 1; finishes in cycle 8 or 9

@penzn
Contributor

penzn commented Dec 7, 2020

@omnisip, I don't think x86/64 results have been posted; the table above (at least currently) shows only Arm platforms. I think that's what @jlb6740 was wondering about.

@Maratyszcza
Contributor Author

@jlb6740 V8 doesn't implement these instructions for x86 yet. Once these instructions appear in x86-64 V8 I will benchmark.

@omnisip

omnisip commented Dec 7, 2020

@penzn You're right. The comment was about the relative utility of this instruction in general versus this specific application. If I had Marat's test bench to run on x64, I would expect a performance improvement on x64 that dwarfs the one on ARM, perhaps by a long shot. If you look at the implementations (old and new) -- google/XNNPACK@f63a54a#diff-e1f02777660513b5d83f8648e2a4cdcd55e843e50e5625cbc163224a50446d43 -- you can see how many shuffle operations it takes to get from 32 to 64 bits. Fundamentally that's the same on x64 with or without specialized instructions. Then it calls i64x2.mul, which in a roundabout way undoes the exact same shuffle and shift operations that preceded it. I'm going to verify the V8 implementation, but this is the most efficient way to do signed 64-bit multiplication on x64.

@omnisip

omnisip commented Dec 7, 2020

Here is a general analysis using LLVM-MCA comparing i64 multiplication (pre-converted, meaning no shuffle or movsx* steps) vs. i32 to i64 (including the conversions inlined). The top right shows the former at 683 cycles for 100 iterations, and the bottom right shows the latter at 415 cycles for 100 iterations. That suggests a 40% improvement by itself, not including any shifts or shuffles that would otherwise be needed for the conversion in the former.

@jlb6740

jlb6740 commented Dec 9, 2020

@omnisip @Maratyszcza .. Thanks guys. Yes, as @penzn commented, I was just wondering if there was a similar table for x64, and I guess the answer is no (not yet anyway), but it's coming. If the performance benefits for x64 turn out to be dwarfed as expected, or even non-existent in this particular use case, what would that mean for this proposal? This proposal seems to take good advantage of Arm semantics, but unfortunately there aren't equivalent semantics for x64, so does this proposal really only target one platform? Could a similar boost be achieved on ARM without these new instructions, or is it not trivial to recognize this widening pattern when lowering to ARM? I ask because, if I understand correctly, there are many other combinations to target, such as "i16x8.mul = i16x8 x i8x16", that also don't have Wasm instructions, and then I suppose there is addition as well, right?

@omnisip

omnisip commented Dec 9, 2020

@jlb6740 By dwarfed, I meant: if you thought the ARM benefits were good, the x64 benefits are likely to be AMAZING. pmuldq appears to be one of the most efficient instructions in the x64 instruction set (SSE4.1), and in a roundabout way I'm surprised this support wasn't implemented sooner, because it's so efficient. Last night I put together a simulation that covers a lot of what @Maratyszcza does in XNNPACK in different forms to model it. You can see them here.

The modeling was designed to match @Maratyszcza's nested loop, with a multiplier and rounding element loaded from memory as part of a greater loop, followed by the multiply, add, and doubling operations before persisting to memory. To add to it, I made sure to check what the cost analysis would be on 'core2' (x86-64 appears to be a synonym for this cost model in LLVM) and made every load and store unaligned to match V8.

After that, I split it up into four result panes, which you'll see on the right-hand side. The first two (top left and bottom left) present alternative algorithms for SSE2 that do the calculations with pmuludq 1 time versus 3 times. On the top right, you'll see the optimal solution using pmuldq, which is what this feature proposes. The fourth pane (bottom right) compares the performance against a purely scalarized 64-bit implementation, which unpacks the vector into registers for multiplication before reloading.

The difference in performance is HUMONGOUS. 23-24K instructions become 9600 and cycles go from 7208 to 2668. That's roughly a 2.7x speedup.

(updated to fix typo: pmulld -> pmuldq)

@omnisip

omnisip commented Dec 9, 2020

> I ask because, if I understand correctly, there are many other combinations to target, such as "i16x8.mul = i16x8 x i8x16", that also don't have Wasm instructions, and then I suppose there is addition as well, right?

Addressing this separately: mul(i8x16, i8x16) -> i16x8 is pretty efficient by itself on x64, even with the naive conversions in front -- each multiplication can be done in 3 instructions in the best case (converting each operand coming from memory with movsxbw/movzxbw, followed by pmullw). For the mul(i16x8, i16x8) -> i32x4 case, there's actually room for a better implementation on x64 than on ARM. This occurs if we add an implementation that does multi-return. Since pmullw and pmulhw are required to perform a single 16->32-bit multiplication on x64, we end up discarding half of the results by only using one shuffle (to combine the low 16 and high 16 bits). By adding a second shuffle, you can have two vectors returned, yielding optimal performance (see the sketch below).
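
A minimal SSE2 sketch of the multi-return idea described above (names illustrative): one PMULLW/PMULHW pair plus two unpacks yields both the low and the high i32x4 products.

```c
#include <emmintrin.h>  /* SSE2 */

/* Both halves of a signed i16x8 extended multiply from one multiply pair. */
static void i32x4_extmul_both_i16x8_s(__m128i a, __m128i b,
                                      __m128i* lo, __m128i* hi) {
  __m128i p_lo = _mm_mullo_epi16(a, b);  /* low 16 bits of each product */
  __m128i p_hi = _mm_mulhi_epi16(a, b);  /* high 16 bits (signed) */
  *lo = _mm_unpacklo_epi16(p_lo, p_hi);  /* products of lanes 0..3 */
  *hi = _mm_unpackhi_epi16(p_lo, p_hi);  /* products of lanes 4..7 */
}
```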

@penzn
Contributor

penzn commented Dec 10, 2020

I think we need performance data to confirm this. I am somewhat skeptical about a lowering using two pshufd instructions, but will be happy to be wrong.

> I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library.

@Maratyszcza, is this a benchmark in XNNPACK, or did you use a framework accelerated by the library?

@Maratyszcza
Contributor Author

@penzn I used the built-in end2end_bench in XNNPACK.

@ngzhian
Member

ngzhian commented Dec 10, 2020

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 10, 2020

Here are the results on x86-64 systems. Impact on requantization only:

| Processor | Performance with WAsm SIMD + Extended Multiplication | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 4.35 GB/s | 3.59 GB/s | 21% |
| Intel Celeron N3060 | 594 MB/s | 493 MB/s | 20% |
| AMD PRO A10-8700B | 3.04 GB/s | 2.44 GB/s | 25% |

Impact on end-to-end MobileNet v2 latency:

| Processor | Latency with WAsm SIMD + Extended Multiplication | Latency with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 42 ms | 43 ms | 2% |
| Intel Celeron N3060 | 260 ms | 272 ms | 5% |
| AMD PRO A10-8700B | 64 ms | 67 ms | 5% |

@omnisip

omnisip commented Dec 11, 2020

Hey @Maratyszcza, I'm going to post my questions on build setup here, just in case someone else wants to see how to do this too:

Here's what I've got so far:

  • Latest v8 trunk built for ARM64 and x64.
  • Bazel is installed
  • Latest Emscripten SDK (activate/install latest) [Do I need to do something from git on this one to get the latest instructions?]
  • XNNPACK is fresh from Git (trunk/master; will switch to your tag/branch for the extended multiplication tests).

Now I would just love to know how to build and run this. Do you know what the commands are for Bazel? And do I need to run something special to get it to go through d8?

Thanks so much!
Dan

(Update/side note: I definitely know how to wire Emscripten to a git version of LLVM if I need to as well.)

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 11, 2020

AFAIK Bazel doesn't support building for Emscripten out of the box; you'd need a custom toolchain for that. You can get one from TensorFlow.js, as well as copy the .blazerc file. Then it should be as simple as bazel build -c opt --config wasm --copt=-msimd128 --linkopt=-msimd128 //:end2end_bench.

@tlively
Member

tlively commented Dec 11, 2020

No need for separate Emscripten or LLVM branches. Everything is checked in for both.

@omnisip

omnisip commented Dec 11, 2020

@tlively @abrown Once I get xnnpack to build, I should have all of the changes done to v8. Here's the first rendition of what I want to try: https://gist.github.com/omnisip/67850c665ac33ced75272b3780f2a937

Basically, it eliminates half of the shuffles by shuffling the inputs together, then shifting the result of that to get the second multiplier/multiplicand. There should be move elimination on the SSE4.1 set for the two extra instructions I added, but I will test that too. On the VEX-enabled versions, it's still the same number of instructions.

@tlively
Member

tlively commented Dec 14, 2020

We achieved consensus to merge these instructions into the proposal at the most recent sync meeting (#400).

Comment on lines +228 to +239
| `i16x8.extmul_low_i8x16_s` | `0x110`| - |
| `i16x8.extmul_high_i8x16_s` | `0x111`| - |
| `i16x8.extmul_low_i8x16_u` | `0x112`| - |
| `i16x8.extmul_high_i8x16_u` | `0x113`| - |
| `i32x4.extmul_low_i16x8_s` | `0x114`| - |
| `i32x4.extmul_high_i16x8_s` | `0x115`| - |
| `i32x4.extmul_low_i16x8_u` | `0x116`| - |
| `i32x4.extmul_high_i16x8_u` | `0x117`| - |
| `i64x2.extmul_low_i32x4_s` | `0x118`| - |
| `i64x2.extmul_high_i32x4_s` | `0x119`| - |
| `i64x2.extmul_low_i32x4_u` | `0x11a`| - |
| `i64x2.extmul_high_i32x4_u` | `0x11b`| - |
Member

These opcodes don't match our current implementation. @tlively I believe LLVM and v8 are in sync, right?


It worked for me when I pulled latest emscripten and v8

Member

Yeah, emscripten and v8 are in sync; this document isn't.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
ngzhian added a commit that referenced this pull request Feb 3, 2021
These were accepted into the proposal in #376.

There are 12 instructions in total:

- i16x8.extmul_{low,high}_i8x16_{s,u}
- i32x4.extmul_{low,high}_i16x8_{s,u}
- i64x2.extmul_{low,high}_i32x4_{s,u}

The implementation is straightforward, widen (using existing
operations), then a multiply with the wider shape.

The binary opcodes are not decided yet, they currently follow the ones
used in V8, when those are finalized, we can change it to match.

Added a test generation script that reuses some logic in the generator
for arithmetic instructions. Since these instructions have different
src and dst shapes, I tweaked the base class to allow for having
different shapes.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021