This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Extended multiplication instructions #376

Merged: 1 commit, Dec 14, 2020

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Oct 7, 2020

Introduction

The result of an integer multiplication is generally twice as wide as its inputs, so the lane-wise multiplication instructions currently in the WebAssembly SIMD specification commonly overflow and wrap. DSP algorithms often need the full multiplication result, and there have been requests (e.g. #175 and #226) to provide such functionality in WebAssembly SIMD. However, the current WebAssembly SIMD specification lacks such instructions, and extended multiplication must be emulated via a combination of widen instructions and the mul instruction of the twice-as-wide result type. This PR adds extended multiplication instructions, which compute the full result of multiplication and enable more efficient lowering to native instruction sets than the emulation sequence of widen and mul instructions (a scalar sketch of the semantics follows the list below):

  • i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b)) -> i16x8.extmul_low_i8x16_s(a, b).
  • i16x8.mul(i16x8.widen_high_i8x16_s(a), i16x8.widen_high_i8x16_s(b)) -> i16x8.extmul_high_i8x16_s(a, b).
  • i16x8.mul(i16x8.widen_low_i8x16_u(a), i16x8.widen_low_i8x16_u(b)) -> i16x8.extmul_low_i8x16_u(a, b).
  • i16x8.mul(i16x8.widen_high_i8x16_u(a), i16x8.widen_high_i8x16_u(b)) -> i16x8.extmul_high_i8x16_u(a, b).
  • i32x4.mul(i32x4.widen_low_i16x8_s(a), i32x4.widen_low_i16x8_s(b)) -> i32x4.extmul_low_i16x8_s(a, b).
  • i32x4.mul(i32x4.widen_high_i16x8_s(a), i32x4.widen_high_i16x8_s(b)) -> i32x4.extmul_high_i16x8_s(a, b).
  • i32x4.mul(i32x4.widen_low_i16x8_u(a), i32x4.widen_low_i16x8_u(b)) -> i32x4.extmul_low_i16x8_u(a, b).
  • i32x4.mul(i32x4.widen_high_i16x8_u(a), i32x4.widen_high_i16x8_u(b)) -> i32x4.extmul_high_i16x8_u(a, b).
  • i64x2.mul(i64x2.widen_low_i32x4_s(a), i64x2.widen_low_i32x4_s(b)) -> i64x2.extmul_low_i32x4_s(a, b).
  • i64x2.mul(i64x2.widen_high_i32x4_s(a), i64x2.widen_high_i32x4_s(b)) -> i64x2.extmul_high_i32x4_s(a, b).
  • i64x2.mul(i64x2.widen_low_i32x4_u(a), i64x2.widen_low_i32x4_u(b)) -> i64x2.extmul_low_i32x4_u(a, b).
  • i64x2.mul(i64x2.widen_high_i32x4_u(a), i64x2.widen_high_i32x4_u(b)) -> i64x2.extmul_high_i32x4_u(a, b).
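
As a reference for the equivalences above, here is a minimal scalar sketch in C of one of the new instructions, i32x4.extmul_low_i16x8_s (illustrative names, not the normative spec text): widen each input lane to the result width first, then multiply, so the product never wraps.

```c
#include <stdint.h>

/* Scalar sketch of i32x4.extmul_low_i16x8_s: multiply the 4 low lanes of two
   i16x8 vectors, producing full 32-bit products. */
static void i32x4_extmul_low_i16x8_s_ref(const int16_t a[8], const int16_t b[8],
                                         int32_t out[4]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = (int32_t)a[i] * (int32_t)b[i];  /* widen, then multiply: no wrap */
  }
}
```

The high variants read lanes 4..7 instead, and the unsigned variants use uint16_t/uint32_t.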

Native instruction sets typically include means to compute the full result of a multiplication, although the exact details vary by architecture and data type. ARM NEON provides instructions that perform extended multiplication on the low or high halves of the input SIMD vectors and produce a full SIMD vector of results; they map 1:1 to the proposed WebAssembly SIMD instructions. x86 provides different instructions depending on the data type: the 32x32->64 multiplication instructions consume two even-numbered lanes as input and produce a single 128-bit vector with two full 64-bit results, while 16x16->32 multiplication is provided via separate instructions that compute the low and high 16-bit parts of the 32-bit results, which can be interleaved to get vectors of full 32-bit results.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHW xmm_y, xmm_a, xmm_b + VPUNPCKLWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHW xmm_y, xmm_a, xmm_b + VPUNPCKHWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHUW xmm_y, xmm_a, xmm_b + VPUNPCKLWD xmm_y, xmm_tmp, xmm_y
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to VPMULLW xmm_tmp, xmm_a, xmm_b + VPMULHUW xmm_y, xmm_a, xmm_b + VPUNPCKHWD xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_low_i32x4_s(a, b)

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKLDQ xmm_y, xmm_b, xmm_b + VPMULDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_high_i32x4_s(a, b)

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKHDQ xmm_y, xmm_b, xmm_b + VPMULDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_low_i32x4_u(a, b)

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKLDQ xmm_y, xmm_b, xmm_b + VPMULUDQ xmm_y, xmm_tmp, xmm_y
  • i64x2.extmul_high_i32x4_u(a, b)

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a + VPUNPCKHDQ xmm_y, xmm_b, xmm_b + VPMULUDQ xmm_y, xmm_tmp, xmm_y
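
For reference, a minimal intrinsics sketch of the i32x4.extmul_low_i16x8_s lowering above (VPMULLW + VPMULHW + VPUNPCKLWD); the function name is illustrative, and with AVX enabled these intrinsics emit the VEX-encoded forms listed here.

```c
#include <immintrin.h>

/* Sketch of the lowering above for y = i32x4.extmul_low_i16x8_s(a, b). */
static __m128i i32x4_extmul_low_i16x8_s_x86(__m128i a, __m128i b) {
  __m128i lo = _mm_mullo_epi16(a, b);  /* PMULLW: low 16 bits of each product */
  __m128i hi = _mm_mulhi_epi16(a, b);  /* PMULHW: high 16 bits (signed) */
  return _mm_unpacklo_epi16(lo, hi);   /* PUNPCKLWD: interleave into i32x4 */
}
```

The high variant only differs in the final unpack (_mm_unpackhi_epi16), and the unsigned variants substitute _mm_mulhi_epu16.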

x86/x86-64 processors with SSE4.1 instruction set

  • i64x2.extmul_low_i32x4_s(a, b)

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0x50 + PSHUFD xmm_y, xmm_b, 0x50 + PMULDQ xmm_y, xmm_tmp
  • i64x2.extmul_high_i32x4_s(a, b)

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0xFA + PSHUFD xmm_y, xmm_b, 0xFA + PMULDQ xmm_y, xmm_tmp
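
A minimal intrinsics sketch of the SSE4.1 lowering above (PSHUFD with immediate 0x50 + PMULDQ); the function name is illustrative.

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Sketch of the lowering above for y = i64x2.extmul_low_i32x4_s(a, b). */
static __m128i i64x2_extmul_low_i32x4_s_sse41(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  return _mm_mul_epi32(a_dup, b_dup);          /* PMULDQ: signed 32x32->64 on even lanes */
}
```

The high variant uses the 0xFA shuffle immediate instead.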

x86/x86-64 processors with SSE2 instruction set

  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHW xmm_tmp, xmm_b + PUNPCKLWD xmm_y, xmm_tmp
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHW xmm_tmp, xmm_b + PUNPCKHWD xmm_y, xmm_tmp
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHUW xmm_tmp, xmm_b + PUNPCKLWD xmm_y, xmm_tmp
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + MOVDQA xmm_tmp, xmm_a + PMULLW xmm_y, xmm_b + PMULHUW xmm_tmp, xmm_b + PUNPCKHWD xmm_y, xmm_tmp
  • i64x2.extmul_low_i32x4_u(a, b)

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0x50 + PSHUFD xmm_y, xmm_b, 0x50 + PMULUDQ xmm_y, xmm_tmp
  • i64x2.extmul_high_i32x4_u(a, b)

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to PSHUFD xmm_tmp, xmm_a, 0xFA + PSHUFD xmm_y, xmm_b, 0xFA + PMULUDQ xmm_y, xmm_tmp
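
The unsigned i64x2 case needs only SSE2; a minimal sketch of the lowering above (PSHUFD 0x50 + PMULUDQ), with an illustrative function name:

```c
#include <emmintrin.h>  /* SSE2 */

/* Sketch of the lowering above for y = i64x2.extmul_low_i32x4_u(a, b). */
static __m128i i64x2_extmul_low_i32x4_u_sse2(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  return _mm_mul_epu32(a_dup, b_dup);          /* PMULUDQ: unsigned 32x32->64 on even lanes */
}
```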

ARM64 processors

  • i16x8.extmul_low_i8x16_s

    • y = i16x8.extmul_low_i8x16_s(a, b) is lowered to SMULL Vy.8H, Va.8B, Vb.8B
  • i16x8.extmul_high_i8x16_s

    • y = i16x8.extmul_high_i8x16_s(a, b) is lowered to SMULL2 Vy.8H, Va.16B, Vb.16B
  • i16x8.extmul_low_i8x16_u

    • y = i16x8.extmul_low_i8x16_u(a, b) is lowered to UMULL Vy.8H, Va.8B, Vb.8B
  • i16x8.extmul_high_i8x16_u

    • y = i16x8.extmul_high_i8x16_u(a, b) is lowered to UMULL2 Vy.8H, Va.16B, Vb.16B
  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to SMULL Vy.4S, Va.4H, Vb.4H
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to SMULL2 Vy.4S, Va.8H, Vb.8H
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to UMULL Vy.4S, Va.4H, Vb.4H
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to UMULL2 Vy.4S, Va.8H, Vb.8H
  • i64x2.extmul_low_i32x4_s

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to SMULL Vy.2D, Va.2S, Vb.2S
  • i64x2.extmul_high_i32x4_s

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to SMULL2 Vy.2D, Va.4S, Vb.4S
  • i64x2.extmul_low_i32x4_u

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to UMULL Vy.2D, Va.2S, Vb.2S
  • i64x2.extmul_high_i32x4_u

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to UMULL2 Vy.2D, Va.4S, Vb.4S
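
A minimal NEON-intrinsics sketch of the ARM64 lowerings above (function names illustrative); SMULL and SMULL2 correspond to vmull_s8 and vmull_high_s8:

```c
#include <arm_neon.h>

/* Sketches of two of the ARM64 lowerings above. */
static int16x8_t i16x8_extmul_low_i8x16_s_a64(int8x16_t a, int8x16_t b) {
  return vmull_s8(vget_low_s8(a), vget_low_s8(b));  /* SMULL Vy.8H, Va.8B, Vb.8B */
}

static int16x8_t i16x8_extmul_high_i8x16_s_a64(int8x16_t a, int8x16_t b) {
  return vmull_high_s8(a, b);                       /* SMULL2 Vy.8H, Va.16B, Vb.16B */
}
```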

ARMv7 processors with NEON instruction set

  • i16x8.extmul_low_i8x16_s

    • y = i16x8.extmul_low_i8x16_s(a, b) is lowered to VMULL.S8 Qy, Da_lo, Db_lo
  • i16x8.extmul_high_i8x16_s

    • y = i16x8.extmul_high_i8x16_s(a, b) is lowered to VMULL.S8 Qy, Da_hi, Db_hi
  • i16x8.extmul_low_i8x16_u

    • y = i16x8.extmul_low_i8x16_u(a, b) is lowered to VMULL.U8 Qy, Da_lo, Db_lo
  • i16x8.extmul_high_i8x16_u

    • y = i16x8.extmul_high_i8x16_u(a, b) is lowered to VMULL.U8 Qy, Da_hi, Db_hi
  • i32x4.extmul_low_i16x8_s

    • y = i32x4.extmul_low_i16x8_s(a, b) is lowered to VMULL.S16 Qy, Da_lo, Db_lo
  • i32x4.extmul_high_i16x8_s

    • y = i32x4.extmul_high_i16x8_s(a, b) is lowered to VMULL.S16 Qy, Da_hi, Db_hi
  • i32x4.extmul_low_i16x8_u

    • y = i32x4.extmul_low_i16x8_u(a, b) is lowered to VMULL.U16 Qy, Da_lo, Db_lo
  • i32x4.extmul_high_i16x8_u

    • y = i32x4.extmul_high_i16x8_u(a, b) is lowered to VMULL.U16 Qy, Da_hi, Db_hi
  • i64x2.extmul_low_i32x4_s

    • y = i64x2.extmul_low_i32x4_s(a, b) is lowered to VMULL.S32 Qy, Da_lo, Db_lo
  • i64x2.extmul_high_i32x4_s

    • y = i64x2.extmul_high_i32x4_s(a, b) is lowered to VMULL.S32 Qy, Da_hi, Db_hi
  • i64x2.extmul_low_i32x4_u

    • y = i64x2.extmul_low_i32x4_u(a, b) is lowered to VMULL.U32 Qy, Da_lo, Db_lo
  • i64x2.extmul_high_i32x4_u

    • y = i64x2.extmul_high_i32x4_u(a, b) is lowered to VMULL.U32 Qy, Da_hi, Db_hi
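
On ARMv7, the widening multiply intrinsics have no *_high form, so the high halves are selected explicitly; a minimal sketch with an illustrative name:

```c
#include <arm_neon.h>

/* Sketch of the ARMv7 NEON lowering above for i16x8.extmul_high_i8x16_s. */
static int16x8_t i16x8_extmul_high_i8x16_s_v7(int8x16_t a, int8x16_t b) {
  return vmull_s8(vget_high_s8(a), vget_high_s8(b));  /* VMULL.S8 Qy, Da_hi, Db_hi */
}
```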

@omnisip

omnisip commented Oct 8, 2020

@Maratyszcza

I really like this proposal because it addresses a common real-world situation. Is there any way, semi-consistent across both architectures, to end up with two i32x4s? The reason I ask is that the Intel ops produce the full multiplication for all 8 lanes and the only difference is the unpack, which means if we want the second vector we have to repeat the multiplication.

With respect to the ARM instructions, it looks like it's a single call for each, which makes you wonder if it's doing the same style of calculation under the hood. If ARM provided a method to do this similar to Intel, would it make more sense to implement the proposal that way? This would get us the benefit of the 8-wide multiplication happening only once.

@Maratyszcza
Contributor Author

@omnisip We could try to add an instruction that produces two output SIMD vectors - @tlively mentioned in the last CG meeting that this is now possible. However, 16x16->32 multiplication on x86 is the only case that would benefit here, so I decided to leave these two-output instructions for later.

@omnisip

omnisip commented Oct 8, 2020

> @omnisip We could try to add an instruction that produces two output SIMD vectors - @tlively mentioned in the last CG meeting that this is now possible. However, 16x16->32 multiplication on x86 is the only case that would benefit here, so I decided to leave these two-output instructions for later.

Well then these next comments will go together:

To add the signed variant for SSE2:

sign1 = sar(number1, 31); // produces a full 32-bit mask if the lane is negative
sign2 = sar(number2, 31);
for each vector:
number = number xor sign; number = number - sign; // this produces the absolute value in 32 bits in each lane
// example: ((-700 ^ (-700 >> 31)) - (-700 >> 31)) === 700
pmuludq xmm, xmm // absolute-value multiplication
// now the fun part.
properSign = sign1 ^ sign2; // proper sign in 32 bits for the output of the multiplication
pshufd properSign // expand it to 64 bits.
// Finally...
signed64 = signed64 xor properSign; signed64 = signed64 - properSign;
...
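
A minimal SSE2 intrinsics sketch of the sign fix-up idea above, applied to i64x2.extmul_low_i32x4_s (names illustrative; this is one possible spelling, not the lowering chosen by any particular engine): take absolute values, multiply unsigned with PMULUDQ, then negate the 64-bit products whose input signs differ.

```c
#include <emmintrin.h>  /* SSE2 */

/* Signed 32x32->64 extended multiply of the two low lanes using only SSE2. */
static __m128i i64x2_extmul_low_i32x4_s_sse2(__m128i a, __m128i b) {
  __m128i a_dup = _mm_shuffle_epi32(a, 0x50);  /* a0 a0 a1 a1 */
  __m128i b_dup = _mm_shuffle_epi32(b, 0x50);  /* b0 b0 b1 b1 */
  __m128i sa = _mm_srai_epi32(a_dup, 31);      /* per-lane sign masks */
  __m128i sb = _mm_srai_epi32(b_dup, 31);
  __m128i abs_a = _mm_sub_epi32(_mm_xor_si128(a_dup, sa), sa);  /* (x ^ s) - s */
  __m128i abs_b = _mm_sub_epi32(_mm_xor_si128(b_dup, sb), sb);
  __m128i prod = _mm_mul_epu32(abs_a, abs_b);  /* PMULUDQ on even lanes */
  /* Combined sign; both 32-bit halves of each 64-bit lane are equal, so this
     is already a valid 64-bit mask. */
  __m128i sign = _mm_xor_si128(sa, sb);
  return _mm_sub_epi64(_mm_xor_si128(prod, sign), sign);  /* negate where signs differ */
}
```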

@omnisip

omnisip commented Oct 8, 2020

Side note:

8 to 16-bit multiplication can be implemented entirely within pmullw or pmuludq too if you intend on adding support for it.

@omnisip

omnisip commented Oct 9, 2020

Full assembly code and running examples showing that the signed arithmetic works for SSE2.

https://godbolt.org/z/fxqE7r

@ngzhian
Member

ngzhian commented Oct 19, 2020

Prototyped on arm64 in https://crrev.com/c/2469156

tlively added a commit to tlively/binaryen that referenced this pull request Oct 27, 2020
Including saturating, rounding Q15 multiplication as proposed in
WebAssembly/simd#365 and extending multiplications as
proposed in WebAssembly/simd#376. Since these are just
prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as
implementing them in the interpreter.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Oct 28, 2020
tlively added a commit to llvm/llvm-project that referenced this pull request Oct 28, 2020
As proposed in WebAssembly/simd#376. This commit
implements new builtin functions and intrinsics for these instructions, but does
not yet add them to wasm_simd128.h because they have not yet been merged to the
proposal. These are the first instructions with opcodes greater than 0xff, so
this commit updates the MC layer and disassembler to handle that correctly.

Differential Revision: https://reviews.llvm.org/D90253
@tlively
Member

tlively commented Oct 28, 2020

These have now landed in LLVM and Binaryen and should be ready to use in tip-of-tree Emscripten in a few hours. The builtin functions to use are __builtin_wasm_extmul_{low,high}_{arg interpretation}_{s,u}_{result interpretation}.

@ngzhian
Member

ngzhian commented Dec 1, 2020

@Maratyszcza any suggested lowering for i16x8.extmul_{high,low}_i8x16_{s,u} for x86 and x64?

@Maratyszcza
Contributor Author

@ngzhian There isn't anything more efficient than the naive i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b)).
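
For reference, one possible spelling of that naive sequence with SSE4.1 intrinsics (function name illustrative): sign-extend the low 8 bytes of each input, then multiply; products of two 8-bit values always fit in 16 bits.

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Sketch of the naive lowering for y = i16x8.extmul_low_i8x16_s(a, b). */
static __m128i i16x8_extmul_low_i8x16_s_sse41(__m128i a, __m128i b) {
  __m128i a16 = _mm_cvtepi8_epi16(a);  /* PMOVSXBW: widen the low 8 lanes */
  __m128i b16 = _mm_cvtepi8_epi16(b);
  return _mm_mullo_epi16(a16, b16);    /* PMULLW: exact 16-bit products */
}
```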

@Maratyszcza
Contributor Author

I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library. Fixed-point neural network operators typically accumulate intermediate results in high precision (32-bit) and at the end need to convert them into a low-precision representation (typically 8-bit), a transformation called requantization. The performance impact on the requantization primitive is summarized in the table below:

| Processor (Device) | Performance with WAsm SIMD + Extended Multiplication | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 3.50 GB/s | 2.37 GB/s | 48% |
| Snapdragon 670 (Pixel 3a) | 2.21 GB/s | 1.36 GB/s | 63% |
| Exynos 8895 (Galaxy S8) | 2.05 GB/s | 1.34 GB/s | 53% |

Requantization is only one of several components of fixed-point inference that could benefit from the extended multiplication instructions, but even it alone has a noticeable end-to-end impact, as demonstrated below for the MobileNet v2 model:

| Processor (Device) | Latency with WAsm SIMD + Extended Multiplication | Latency with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 73 ms | 78 ms | 7% |
| Snapdragon 670 (Pixel 3a) | 137 ms | 146 ms | 7% |
| Exynos 8895 (Galaxy S8) | 156 ms | 165 ms | 6% |

The code modifications can be seen in google/XNNPACK#1202.

@omnisip

omnisip commented Dec 4, 2020

@Maratyszcza Is this because 3 instructions are replaced by 1?

@Maratyszcza
Contributor Author

@omnisip Not quite. On one side, the baseline doesn't use the 32->64-bit extension instructions, as these are still experimental and have to be emulated via 2 WAsm SIMD instructions. On the other side, the baseline version pre-multiplies the multiplier by 2 as an optimization, but the version with extended multiplication instructions instead explicitly doubles the result, because if we pre-multiply the multiplier by 2 it no longer fits into 32 bits.

@omnisip

omnisip commented Dec 5, 2020

That makes sense and reviewing the code really shows how much of a performance bump there should be.

Is this a practical use case for SQDMLAL on a native implementation?

@jlb6740

jlb6740 commented Dec 7, 2020

@Maratyszcza The desire for the full multiplication result is there, and the speed-up seen on ARM looks very good. On x86/x64 it seems as if there is less flexibility for an efficient lowering. Any idea what the comparable XNNPACK speed-up is for x86/x64?

@omnisip

omnisip commented Dec 7, 2020

@jlb6740 The bulk of the speedup for x64 comes from the fact that x64 has no native 64x64 multiplication instruction -- so in a roundabout way, one would have to convert to 64-bit behind the scenes and then perform the same underlying operation as if the values were 32-bit integers, yielding 8 instructions for each 64x64 multiply plus 2 instructions for each conversion to 64 bits, or 20 instructions to do the full set. Compare that with the 6 instructions it takes to get the job done here. Similarly, the performance increase could be pushed even further if this supported multiple return values. PMULDQ and PMULUDQ use even-numbered lanes for multiplication, but otherwise do the integer expansion natively. If these returned the full set of 32->64 products, the speedup on x86/x64 would come not just from fewer instructions, but also from at least 1 cycle less of latency.

e.g.

;; i64x2.extmul_low_i32x4_s(a, b)
VPSHUFD xmm_tmp, xmm_a, 0x50 ; 1tp, 1 lat; finishes in cycle 1
VPSHUFD xmm_y, xmm_b, 0x50 ; 1tp, 1 lat; finishes in cycle 2
VPMULDQ xmm_y, xmm_y, xmm_tmp ; 0.5tp, 5 lat; finishes in cycle 7
;; i64x2.extmul_high_i32x4_s(a, b)
VPSHUFD xmm_tmp, xmm_a, 0xFA ; 1tp, 1 lat; finishes in cycle 3
VPSHUFD xmm_y, xmm_b, 0xFA ; 1tp, 1 lat; finishes in cycle 4
VPMULDQ xmm_y, xmm_y, xmm_tmp ; 0.5tp, 5 lat; finishes in cycle 9

becomes

VPMULDQ xmm_ab_even, xmm_a, xmm_b ; 0.5tp, lat 5; finishes in cycle 5
VPSRLQ xmm_a_odd, xmm_a, 32 ; 0.5tp, lat 1; finishes in cycle 1
VPSRLQ xmm_b_odd, xmm_b, 32 ; 0.5tp, lat 1; can be swapped for a shuffle to get all three instructions done in the first cycle, but may be inconsequential since VPMULDQ is also 0.5tp
VPMULDQ xmm_ab_odd, xmm_a_odd, xmm_b_odd ; 0.5tp, lat 5; finishes in cycle 7 (non-shuffle), cycle 6 (shuffle)
VPUNPCKLQDQ xmm_y_low, xmm_ab_even, xmm_ab_odd ; 1tp, lat 1; finishes in cycle 7 or 8
VPUNPCKHQDQ xmm_y_high, xmm_ab_even, xmm_ab_odd ; 1tp, lat 1; finishes in cycle 8 or 9

@penzn
Contributor

penzn commented Dec 7, 2020

@omnisip, I don't think x86/64 results have been posted; the table above (at least currently) shows only Arm platforms. I think that's what @jlb6740 was wondering about.

@Maratyszcza
Contributor Author

@jlb6740 V8 doesn't implement these instructions for x86 yet. Once these instructions appear in x86-64 V8 I will benchmark.

@omnisip

omnisip commented Dec 7, 2020

@penzn You're right. The comment was about the relative utility of this instruction in general versus this specific application. If I had Marat's test bench to run on x64, I would expect a performance improvement on x64 that dwarfs the one on ARM, perhaps by a long shot. If you look at the implementations (old and new) -- google/XNNPACK@f63a54a#diff-e1f02777660513b5d83f8648e2a4cdcd55e843e50e5625cbc163224a50446d43 -- you can see how many shuffle operations it takes to get from 32 to 64 bits. Fundamentally that's the same on x64 with or without specialized instructions. Then it calls i64x2.mul, which in a roundabout way undoes the exact same shuffle and shift operations that preceded it. I'm going to verify the V8 implementation, but this is the most efficient way to do signed 64-bit multiplication on x64.

@omnisip

omnisip commented Dec 7, 2020

Here is a general analysis using LLVM-MCA comparing i64 multiplication (pre-converted, meaning no shuffle or movsx* steps) vs. i32 to i64 (including the conversions inlined). The top right shows the former at 683 cycles for 100 iterations, and the bottom right shows the latter at 415 cycles for 100 iterations. That suggests a 40% improvement by itself, not including any shifts or shuffles that would otherwise be needed for the conversion in the former.

@jlb6740

jlb6740 commented Dec 9, 2020

@omnisip @Maratyszcza .. Thanks guys. Yes, as @penzn commented, I was just wondering if there was a similar table for x64, and I guess the answer is no (not yet anyway), but it's coming. If the performance benefits for x64 turn out to be dwarfed as expected, or even non-existent in this particular use case, what would that mean for this proposal? This proposal seems to take good advantage of Arm semantics, but unfortunately there aren't equivalent semantics for x64, so does this proposal really only target one platform? Could a similar boost be achieved on ARM without these new instructions, or is it not trivial to recognize this widening pattern when lowering to ARM? I ask because, if I understand correctly, there are many other combinations to target, such as "i16x8.mul = i16x8 x i8x16", that also don't have Wasm instructions, and then I suppose there is addition as well, right?

@omnisip

omnisip commented Dec 9, 2020

@jlb6740 By dwarfed, I meant: if you thought the ARM benefits were good, the x64 benefits are likely to be AMAZING. pmuldq appears to be one of the most efficient instructions in the x64 instruction set (SSE4.1), and in a roundabout way I'm surprised this support wasn't implemented sooner, because it's so efficient. Last night I put together a simulation that covers a lot of what @Maratyszcza does in XNNPACK in different forms to model it. You can see them here.

The modeling was designed to match @Maratyszcza's nested loop, with a multiplier and rounding element loaded from memory as part of a greater loop, followed by the multiply, add, and doubling operations before persisting to memory. To add to it, I made sure to check what the cost analysis would be on 'core2' (x86-64 appears to be a synonym for this cost model in LLVM) and made every load and store unaligned to match V8.

After that, I split it up into four result panes, which you'll see on the right-hand side. The first two (top left and bottom left) present alternative algorithms for SSE2 that do the calculations with pmuludq 1 time versus 3 times. On the top right, you'll see the optimal solution using pmuldq, which is what this feature proposes. The fourth pane (bottom right) compares the performance against a purely scalarized 64-bit implementation, which unpacks the vector into registers for multiplication before reloading.

The difference in performance is HUMONGOUS. 23-24K instructions become 9600 and cycles go from 7208 to 2668. That's roughly a 2.7x speedup.

(updated to fix typo: pmulld -> pmuldq)

@omnisip

omnisip commented Dec 9, 2020

> I ask because, if I understand correctly, there are many other combinations to target, such as "i16x8.mul = i16x8 x i8x16", that also don't have Wasm instructions, and then I suppose there is addition as well, right?

Addressing this separately: mul(i8x16, i8x16) -> i16x8 is pretty efficient by itself on x64, even with the naive conversions in front -- each multiplication can be done in 3 instructions in the best case (converting each operand coming from memory with movsxbw/movzxbw, followed by pmullw). For the mul(i16x8, i16x8) -> i32x4 case, there's actually room for a better implementation on x64 than on ARM. This occurs if we add an implementation that does multi-return. Since pmullw and pmulhw are required to perform a single 16->32-bit multiplication on x64, we end up discarding half of the results by only using one shuffle (to combine the low 16 and high 16 bits). By adding a second shuffle, you can have two vectors returned, yielding optimal performance (see the sketch below).
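
A minimal SSE2 sketch of the multi-return idea described above (names illustrative): one PMULLW/PMULHW pair plus two unpacks yields both the low and the high i32x4 products.

```c
#include <emmintrin.h>  /* SSE2 */

/* Both halves of a signed i16x8 extended multiply from one multiply pair. */
static void i32x4_extmul_both_i16x8_s(__m128i a, __m128i b,
                                      __m128i* lo, __m128i* hi) {
  __m128i p_lo = _mm_mullo_epi16(a, b);  /* low 16 bits of each product */
  __m128i p_hi = _mm_mulhi_epi16(a, b);  /* high 16 bits (signed) */
  *lo = _mm_unpacklo_epi16(p_lo, p_hi);  /* products of lanes 0..3 */
  *hi = _mm_unpackhi_epi16(p_lo, p_hi);  /* products of lanes 4..7 */
}
```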

@penzn
Contributor

penzn commented Dec 10, 2020

I think we need performance data to confirm this. I am somewhat skeptical about a lowering using two pshufd instructions, but will be happy to be wrong.

> I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library.

@Maratyszcza, is this a benchmark in XNNPACK, or did you use a framework accelerated by the library?

@Maratyszcza
Contributor Author

@penzn I used the built-in end2end_bench in XNNPACK.

@ngzhian
Member

ngzhian commented Dec 10, 2020

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 10, 2020

Here are the results on x86-64 systems. Impact on requantization only:

| Processor | Performance with WAsm SIMD + Extended Multiplication | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 4.35 GB/s | 3.59 GB/s | 21% |
| Intel Celeron N3060 | 594 MB/s | 493 MB/s | 20% |
| AMD PRO A10-8700B | 3.04 GB/s | 2.44 GB/s | 25% |

Impact on end-to-end MobileNet v2 latency:

| Processor | Latency with WAsm SIMD + Extended Multiplication | Latency with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 42 ms | 43 ms | 2% |
| Intel Celeron N3060 | 260 ms | 272 ms | 5% |
| AMD PRO A10-8700B | 64 ms | 67 ms | 5% |

@omnisip

omnisip commented Dec 11, 2020

Hey @Maratyszcza, I'm going to post my questions on build setup here, just in case someone else wants to see how to do this too:

Here's what I've got so far:

  • Latest v8 trunk built for ARM64 and x64.
  • Bazel is installed
  • Latest Emscripten SDK (activate/install latest) [Do I need to do something from git on this one to get the latest instructions?]
  • XNNPACK is fresh from Git (trunk/master; will switch to your tag/branch for the extended multiplication tests).

Now I would just love to know how to build and run this. Do you know what the commands are for Bazel? And do I need to run something special to get it to go through d8?

Thanks so much!
Dan

(Update/side note: I definitely know how to wire Emscripten to a git version of LLVM if I need to as well.)

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 11, 2020

AFAIK Bazel doesn't support building for Emscripten out of the box; you'd need a custom toolchain for that. You can get one from TensorFlow.js, as well as copy the .blazerc file. Then it should be as simple as bazel build -c opt --config wasm --copt=-msimd128 --linkopt=-msimd128 //:end2end_bench.

@tlively
Member

tlively commented Dec 11, 2020

No need for separate Emscripten or LLVM branches. Everything is checked in for both.

@omnisip

omnisip commented Dec 11, 2020

@tlively @abrown Once I get xnnpack to build, I should have all of the changes done to v8. Here's the first rendition of what I want to try: https://gist.github.com/omnisip/67850c665ac33ced75272b3780f2a937

Basically, it eliminates half of the shuffles by shuffling the inputs together, then shifting the result of that to get the second multiplier/multiplicand. There should be move elimination on the SSE4.1 set for the two extra instructions I added, but I will test that too. On the VEX-enabled versions, it's still the same number of instructions.

@tlively
Member

tlively commented Dec 14, 2020

We achieved consensus to merge these instructions into the proposal at the most recent sync meeting (#400).

Comment on lines +228 to +239
| `i16x8.extmul_low_i8x16_s` | `0x110`| - |
| `i16x8.extmul_high_i8x16_s` | `0x111`| - |
| `i16x8.extmul_low_i8x16_u` | `0x112`| - |
| `i16x8.extmul_high_i8x16_u` | `0x113`| - |
| `i32x4.extmul_low_i16x8_s` | `0x114`| - |
| `i32x4.extmul_high_i16x8_s` | `0x115`| - |
| `i32x4.extmul_low_i16x8_u` | `0x116`| - |
| `i32x4.extmul_high_i16x8_u` | `0x117`| - |
| `i64x2.extmul_low_i32x4_s` | `0x118`| - |
| `i64x2.extmul_high_i32x4_s` | `0x119`| - |
| `i64x2.extmul_low_i32x4_u` | `0x11a`| - |
| `i64x2.extmul_high_i32x4_u` | `0x11b`| - |
Member

These opcodes don't match our current implementation. @tlively I believe LLVM and v8 are in sync, right?


It worked for me when I pulled latest emscripten and v8

Member

Yeah, emscripten and v8 are in sync; this document isn't.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 2, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
ngzhian added a commit that referenced this pull request Feb 3, 2021
These were accepted into the proposal in #376.

There are 12 instructions in total:

- i16x8.extmul_{low,high}_i8x16_{s,u}
- i32x4.extmul_{low,high}_i16x8_{s,u}
- i64x2.extmul_{low,high}_i32x4_{s,u}

The implementation is straightforward, widen (using existing
operations), then a multiply with the wider shape.

The binary opcodes are not decided yet, they currently follow the ones
used in V8, when those are finalized, we can change it to match.

Added a test generation script that reuses some logic in the generator
for arithmetic instructions. Since these instructions have different
src and dst shapes, I tweaked the base class to allow for having
different shapes.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021