
Relaxed Integer Dot Product instructions #52

Open · Maratyszcza opened this issue Feb 18, 2022 · 11 comments

Labels: in-overview (Instruction has been added to Overview.md), instruction-proposal

Maratyszcza (Collaborator) commented Feb 18, 2022

What are the instructions being proposed?

I propose relaxed 8-bit versions of the Dot Product instructions introduced in WebAssembly/simd#127. These instructions expose multiplication of 8-bit (unsigned or signed) elements by 7-bit (treated as unsigned) elements with accumulation of adjacent products. These instructions are designed to expose the performance benefits of the following native instructions in a portable way:

  • x86/x86-64 SSSE3 PMADDUBSW instruction
  • x86/x86-64 VNNI (either AVX2-VNNI or AVX512-VNNI) VPDPBUSD instruction.
  • AArch32 NEON Dot Product instructions (VSDOT.S8 and VUDOT.U8)
  • AArch64 Dot Product instructions (SDOT and UDOT)

The discussion on issue #9 explains at length the performance benefits of the above-mentioned native instructions.

I suggest i16x8.dot_i8x16_i7x16_s, i16x8.dot_i8x16_i7x16_u, i32x4.dot_i8x16_i7x16_add_s, and i32x4.dot_i8x16_i7x16_add_u as the tentative names for the proposed instructions.

What are the semantics of these instructions?

Both x86 and ARM provide variants of Dot Product instructions on SIMD vectors of 8-bit elements, but differ in the semantics of the input elements:

  • On x86 the instructions treat one of the SIMD vectors as having signed 8-bit elements and the other as unsigned 8-bit elements.
  • On ARM the input SIMD vectors are treated as either both having signed 8-bit elements, or both having unsigned 8-bit elements.

The proposed instructions resolve this incompatibility by guaranteeing the result only when elements of the second input SIMD vector have at most 7 non-zero bits, as in this case there is no difference between signed and unsigned representation.

i16x8.dot_i8x16_i7x16_s is a 2-element dot product instruction consuming signed 8-bit integer elements in the first input SIMD vector and 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and producing signed 16-bit integer output elements. The 2-element dot product never overflows, as the worst-case outputs fit into a signed 16-bit integer:

  • -128 * 127 + -128 * 127 = -32512 > INT16_MIN = -32768
  • 127 * 127 + 127 * 127 = 32258 < INT16_MAX = 32767
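
For clarity, a non-normative scalar sketch of the lane-wise semantics of i16x8.dot_i8x16_i7x16_s (plain C; the function name is illustrative):

```c
#include <stdint.h>

/* Scalar reference: a holds the signed 8-bit lanes, b holds the 7-bit lanes
   (top bit assumed clear), y receives the eight signed 16-bit dot products. */
void dot_i8x16_i7x16_s_ref(const int8_t a[16], const uint8_t b[16], int16_t y[8]) {
    for (int i = 0; i < 8; i++) {
        y[i] = (int16_t) ((int32_t) a[2 * i]     * b[2 * i] +
                          (int32_t) a[2 * i + 1] * b[2 * i + 1]);
    }
}
```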

i16x8.dot_i8x16_i7x16_u is a 2-element dot product instruction consuming unsigned 8-bit integer elements in the first input SIMD vector and 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and producing unsigned 16-bit integer output elements. The 2-element dot product never overflows, as the worst-case output fits into an unsigned 16-bit integer:

  • 255 * 127 + 255 * 127 = 64770 < UINT16_MAX = 65535

i32x4.dot_i8x16_i7x16_add_s is a 4-element dot product instruction with accumulation, consuming signed 8-bit integer elements in the first input SIMD vector, 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and 32-bit integer elements (signedness-agnostic) in the third input SIMD vector, and producing (signedness-agnostic) 32-bit integer output elements. The 4-element dot product producing a 32-bit result never overflows, and the addition of the third input SIMD vector is performed in modular arithmetic.

i32x4.dot_i8x16_i7x16_add_u is a 4-element dot product instruction with accumulation, consuming unsigned 8-bit integer elements in the first input SIMD vector, 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and 32-bit integer elements (signedness-agnostic) in the third input SIMD vector, and producing (signedness-agnostic) 32-bit integer output elements. The 4-element dot product producing a 32-bit result never overflows, and the addition of the third input SIMD vector is performed in modular arithmetic.
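
Likewise, a non-normative scalar sketch of i32x4.dot_i8x16_i7x16_add_s (the _u variant is identical except that a is unsigned), with the wrapping addition of the accumulator made explicit:

```c
#include <stdint.h>

/* Scalar reference: four adjacent 8-bit-by-7-bit products are summed into a
   32-bit lane, then the accumulator c is added with wrapping (modular)
   arithmetic. */
void dot_i8x16_i7x16_add_s_ref(const int8_t a[16], const uint8_t b[16],
                               const int32_t c[4], int32_t y[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t dot = 0;
        for (int j = 0; j < 4; j++) {
            dot += (int32_t) a[4 * i + j] * b[4 * i + j];
        }
        y[i] = (int32_t) ((uint32_t) dot + (uint32_t) c[i]);  /* wrapping add */
    }
}
```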

How will these instructions be implemented?

x86/x86-64 processors with AVX2-VNNI or AVX512-VNNI instruction set

  • i32x4.dot_i8x16_i7x16_add_s

    • c = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to VPDPBUSD xmm_c, xmm_b, xmm_a
    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to VMOVDQA xmm_y, xmm_c + VPDPBUSD xmm_y, xmm_b, xmm_a
  • i32x4.dot_i8x16_i7x16_add_u

    • c = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to VPDPBUSD xmm_c, xmm_a, xmm_b
    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to VMOVDQA xmm_y, xmm_c + VPDPBUSD xmm_y, xmm_a, xmm_b
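
For reference, a minimal native-intrinsics sketch of this VNNI lowering (assuming AVX512-VNNI with AVX512VL; the function name is illustrative):

```c
#include <immintrin.h>

/* VPDPBUSD treats its first source as unsigned bytes and its second source as
   signed bytes, so the 7-bit input b goes in the unsigned position for the
   signed variant. Requires AVX512-VNNI + AVX512VL. */
__m128i dot_i8x16_i7x16_add_s_vnni(__m128i a, __m128i b, __m128i c) {
    return _mm_dpbusd_epi32(c, b, a);
}
```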

x86/x86-64 processors with XOP instruction set

  • i32x4.dot_i8x16_i7x16_add_s

    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • VPMADDUBSW xmm_tmp, xmm_b, xmm_a
      • VPHADDWD xmm_tmp, xmm_tmp
      • VPADDD xmm_y, xmm_c, xmm_tmp
  • i32x4.dot_i8x16_i7x16_add_u

    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • VPMADDUBSW xmm_tmp, xmm_a, xmm_b
      • VPHADDWD xmm_tmp, xmm_tmp
      • VPADDD xmm_y, xmm_c, xmm_tmp

x86/x86-64 processors with AVX instruction set

  • i16x8.dot_i8x16_i7x16_s

    • y = i16x8.dot_i8x16_i7x16_s(a, b) is lowered to VPMADDUBSW xmm_y, xmm_b, xmm_a
  • i16x8.dot_i8x16_i7x16_u

    • y = i16x8.dot_i8x16_i7x16_u(a, b) is lowered to VPMADDUBSW xmm_y, xmm_a, xmm_b
  • i32x4.dot_i8x16_i7x16_add_s

    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • VPMADDUBSW xmm_tmp, xmm_b, xmm_a
      • VPMADDWD xmm_tmp, xmm_tmp, [wasm_i16x8_splat(1)]
      • VPADDD xmm_y, xmm_c, xmm_tmp
  • i32x4.dot_i8x16_i7x16_add_u

    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • VPMADDUBSW xmm_tmp, xmm_a, xmm_b
      • VPMADDWD xmm_tmp, xmm_tmp, [wasm_i16x8_splat(1)]
      • VPADDD xmm_y, xmm_c, xmm_tmp

x86/x86-64 processors with SSSE3 instruction set

  • i16x8.dot_i8x16_i7x16_s

    • y = i16x8.dot_i8x16_i7x16_s(a, b) (y is NOT a) is lowered to MOVDQA xmm_y, xmm_b + PMADDUBSW xmm_y, xmm_a
  • i16x8.dot_i8x16_i7x16_u

    • y = i16x8.dot_i8x16_i7x16_u(a, b) (y is NOT b) is lowered to MOVDQA xmm_y, xmm_a + PMADDUBSW xmm_y, xmm_b
  • i32x4.dot_i8x16_i7x16_add_s

    • c = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • PMADDUBSW xmm_tmp, xmm_a
      • PMADDWD xmm_tmp, [wasm_i16x8_splat(1)]
      • PADDD xmm_c, xmm_tmp
  • i32x4.dot_i8x16_i7x16_add_u

    • c = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • MOVDQA xmm_tmp, xmm_a
      • PMADDUBSW xmm_tmp, xmm_b
      • PMADDWD xmm_tmp, [wasm_i16x8_splat(1)]
      • PADDD xmm_c, xmm_tmp
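
A minimal intrinsics sketch of the PMADDUBSW-based lowering of i32x4.dot_i8x16_i7x16_add_s described above (SSSE3/SSE2 intrinsics; the function name is illustrative):

```c
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 */

/* PMADDUBSW multiplies unsigned bytes (the 7-bit input b) by signed bytes (a)
   and adds adjacent pairs into 16-bit lanes; PMADDWD against splat(1) then
   folds pairs of 16-bit lanes into 32-bit lanes, which are added to c. */
__m128i dot_i8x16_i7x16_add_s_ssse3(__m128i a, __m128i b, __m128i c) {
    __m128i prod16 = _mm_maddubs_epi16(b, a);                   /* b unsigned, a signed */
    __m128i prod32 = _mm_madd_epi16(prod16, _mm_set1_epi16(1)); /* pairwise 16 -> 32    */
    return _mm_add_epi32(c, prod32);
}
```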

ARM64 processors with Dot Product extension

  • i32x4.dot_i8x16_i7x16_add_s

    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to MOV Vy.16B, Vc.16B + SDOT Vy.4S, Va.16B, Vb.16B
    • c = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to SDOT Vc.4S, Va.16B, Vb.16B
  • i32x4.dot_i8x16_i7x16_add_u

    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to MOV Vy.16B, Vc.16B + UDOT Vy.4S, Va.16B, Vb.16B
    • c = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to UDOT Vc.4S, Va.16B, Vb.16B
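
A minimal NEON-intrinsics sketch of the Dot Product extension lowering (compile for ARMv8.2-A with the dotprod feature; reinterpreting b as signed is valid here because its lanes are assumed to fit in 7 bits):

```c
#include <arm_neon.h>

/* SDOT accumulates four signed 8-bit products per 32-bit lane into c. */
int32x4_t dot_i8x16_i7x16_add_s_dotprod(int8x16_t a, int8x16_t b, int32x4_t c) {
    return vdotq_s32(c, a, b);
}
```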

ARM64 processors

  • i16x8.dot_i8x16_i7x16_s

    • y = i16x8.dot_i8x16_i7x16_s(a, b) is lowered to:
      • SMULL Vy.8H, Va.8B, Vb.8B
      • SMULL2 Vtmp.8H, Va.16B, Vb.16B
      • ADDP Vy.8H, Vy.8H, Vtmp.8H
  • i16x8.dot_i8x16_i7x16_u

    • y = i16x8.dot_i8x16_i7x16_u(a, b) is lowered to:
      • UMULL Vy.8H, Va.8B, Vb.8B
      • UMULL2 Vtmp.8H, Va.16B, Vb.16B
      • ADDP Vy.8H, Vy.8H, Vtmp.8H
  • i32x4.dot_i8x16_i7x16_add_s

    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • SMULL Vtmp.8H, Va.8B, Vb.8B
      • SMULL2 Vtmp2.8H, Va.16B, Vb.16B
      • ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H
      • SADDLP Vtmp.4S, Vtmp.8H
      • ADD Vy.4S, Vtmp.4S, Vc.4S
    • c = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • SMULL Vtmp.8H, Va.8B, Vb.8B
      • SMULL2 Vtmp2.8H, Va.16B, Vb.16B
      • ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H
      • SADALP Vc.4S, Vtmp.8H
  • i32x4.dot_i8x16_i7x16_add_u

    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • UMULL Vtmp.8H, Va.8B, Vb.8B
      • UMULL2 Vtmp2.8H, Va.16B, Vb.16B
      • ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H
      • UADDLP Vtmp.4S, Vtmp.8H
      • ADD Vy.4S, Vtmp.4S, Vc.4S
    • c = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • UMULL Vtmp.8H, Va.8B, Vb.8B
      • UMULL2 Vtmp2.8H, Va.16B, Vb.16B
      • ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H
      • UADALP Vc.4S, Vtmp.8H
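
A minimal NEON-intrinsics sketch of the baseline ARM64 lowering of i32x4.dot_i8x16_i7x16_add_s shown above (the function name is illustrative):

```c
#include <arm_neon.h>

/* Widening multiplies produce 16-bit products (SMULL/SMULL2), ADDP folds
   adjacent pairs, and SADALP accumulates adjacent 16-bit pairs into the
   32-bit accumulator. No pair of products overflows 16 bits when b fits
   in 7 bits. */
int32x4_t dot_i8x16_i7x16_add_s_neon(int8x16_t a, int8x16_t b, int32x4_t c) {
    int16x8_t prod_lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));  /* SMULL  */
    int16x8_t prod_hi = vmull_high_s8(a, b);                       /* SMULL2 */
    int16x8_t pairs   = vpaddq_s16(prod_lo, prod_hi);              /* ADDP   */
    return vpadalq_s16(c, pairs);                                  /* SADALP */
}
```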

Reference lowering through the WAsm SIMD128 instruction set

  • i16x8.dot_i8x16_i7x16_s

    • y = i16x8.dot_i8x16_i7x16_s(a, b) is lowered to:
      • const v128_t prod_low = wasm_i16x8_extmul_low_i8x16(a, b)
      • const v128_t prod_high = wasm_i16x8_extmul_high_i8x16(a, b)
      • const v128_t prod_even = wasm_v16x8_shuffle(prod_low, prod_high, 0, 2, 4, 6, 8, 10, 12, 14)
      • const v128_t prod_odd = wasm_v16x8_shuffle(prod_low, prod_high, 1, 3, 5, 7, 9, 11, 13, 15)
      • y = wasm_i16x8_add(prod_even, prod_odd)
  • i16x8.dot_i8x16_i7x16_u

    • y = i16x8.dot_i8x16_i7x16_u(a, b) is lowered to:
      • const v128_t prod_low = wasm_u16x8_extmul_low_u8x16(a, b)
      • const v128_t prod_high = wasm_u16x8_extmul_high_u8x16(a, b)
      • const v128_t prod_even = wasm_v16x8_shuffle(prod_low, prod_high, 0, 2, 4, 6, 8, 10, 12, 14)
      • const v128_t prod_odd = wasm_v16x8_shuffle(prod_low, prod_high, 1, 3, 5, 7, 9, 11, 13, 15)
      • y = wasm_i16x8_add(prod_even, prod_odd)
  • i32x4.dot_i8x16_i7x16_add_s

    • y = i32x4.dot_i8x16_i7x16_add_s(a, b, c) is lowered to:
      • const v128_t prod_low = wasm_i16x8_extmul_low_i8x16(a, b)
      • const v128_t prod_high = wasm_i16x8_extmul_high_i8x16(a, b)
      • const v128_t psum_low = wasm_i32x4_extadd_pairwise_i16x8(prod_low)
      • const v128_t psum_high = wasm_i32x4_extadd_pairwise_i16x8(prod_high)
      • const v128_t psum_even = wasm_v32x4_shuffle(psum_low, psum_high, 0, 2, 4, 6)
      • const v128_t psum_odd = wasm_v32x4_shuffle(psum_low, psum_high, 1, 3, 5, 7)
      • const v128_t psum = wasm_i32x4_add(psum_even, psum_odd)
      • y = wasm_i32x4_add(psum, c)
  • i32x4.dot_i8x16_i7x16_add_u

    • y = i32x4.dot_i8x16_i7x16_add_u(a, b, c) is lowered to:
      • const v128_t prod_low = wasm_u16x8_extmul_low_u8x16(a, b)
      • const v128_t prod_high = wasm_u16x8_extmul_high_u8x16(a, b)
      • const v128_t psum_low = wasm_u32x4_extadd_pairwise_u16x8(prod_low)
      • const v128_t psum_high = wasm_u32x4_extadd_pairwise_u16x8(prod_high)
      • const v128_t psum_even = wasm_v32x4_shuffle(psum_low, psum_high, 0, 2, 4, 6)
      • const v128_t psum_odd = wasm_v32x4_shuffle(psum_low, psum_high, 1, 3, 5, 7)
      • const v128_t psum = wasm_i32x4_add(psum_even, psum_odd)
      • y = wasm_i32x4_add(psum, c)
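
For convenience, the i16x8.dot_i8x16_i7x16_s reference lowering as a compilable function using wasm_simd128.h intrinsics (wasm_i16x8_shuffle is the current spelling of the 16-bit-lane shuffle written as wasm_v16x8_shuffle above):

```c
#include <wasm_simd128.h>

/* Reference lowering of i16x8.dot_i8x16_i7x16_s via extending multiplies,
   de-interleaving shuffles, and a 16-bit add. */
v128_t dot_i8x16_i7x16_s_simd128(v128_t a, v128_t b) {
    const v128_t prod_low  = wasm_i16x8_extmul_low_i8x16(a, b);
    const v128_t prod_high = wasm_i16x8_extmul_high_i8x16(a, b);
    const v128_t prod_even = wasm_i16x8_shuffle(prod_low, prod_high,
                                                0, 2, 4, 6, 8, 10, 12, 14);
    const v128_t prod_odd  = wasm_i16x8_shuffle(prod_low, prod_high,
                                                1, 3, 5, 7, 9, 11, 13, 15);
    return wasm_i16x8_add(prod_even, prod_odd);
}
```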

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

As the native equivalents of the proposed Dot Product instructions on x86 perform signed-by-unsigned multiplication, while the native equivalents on ARM perform either signed-by-signed or unsigned-by-unsigned multiplication, it is possible to distinguish these architectures from the results on out-of-bounds inputs (i.e. when the high bit of the elements in the second input SIMD vector is set). x86/x86-64 can already be distinguished from ARM/ARM64 based on NaN behavior, so this aspect doesn't expose any new fingerprinting surfaces.

However, it is also possible to distinguish processors with AVX2-VNNI or AVX512-VNNI instruction sets on x86 from processors without these instruction sets by detecting saturation of intermediate results (in PMADDUBSW instruction), and distinguish ARM processors with Dot Product extension from ARM processors without this extension by detecting wrapping of intermediate results (in ADDP instructions). WebAssembly engines have three options to manage exposure of this fingerprinting surface:

  1. Wait it out, as new processors tend to support the AVX2-VNNI / AVX512-VNNI extension on the x86 and the NEON Dot Product extension on ARM. 2022 processor cores from Intel (Golden Cove, Gracemont), AMD (Zen 4), ARM (Cortex-X2, Cortex-A710, Cortex-A510), and Apple (A15) all support these instruction set extensions.

  2. Mimic the behavior of the AVX2-VNNI / AVX512-VNNI VPDPBUSD instruction on the x86 processors without this instruction set extension and the behavior of the NEON Dot Product instructions on the ARM processors without this instruction set extension. This option comes at a performance cost on the older processors.

  3. Avoid the AVX2-VNNI / AVX512-VNNI VPDPBUSD instruction on x86 and the NEON Dot Product instructions on ARM, and use the same instruction sequences as the older processors. This option comes at a performance cost on the newer processors.
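
To illustrate the fingerprinting surface concretely, a probe along the following lines could tell the x86 lowering (second operand treated as unsigned by PMADDUBSW/VPDPBUSD) apart from the ARM lowering (second operand treated as signed by SMULL/SDOT) by feeding an out-of-bounds second operand. The intrinsic wasm_i16x8_relaxed_dot_i8x16_i7x16 is a hypothetical name for the proposed instruction, used here only for illustration:

```c
#include <stdbool.h>
#include <wasm_simd128.h>

/* With a = -1 and b = -128 (0x80, outside the documented 7-bit range), each
   2-element dot is 2 * (128 * -1) = -256 on x86 but 2 * (-128 * -1) = 256 on
   ARM. wasm_i16x8_relaxed_dot_i8x16_i7x16 is an assumed intrinsic name. */
bool looks_like_x86(void) {
    v128_t a = wasm_i8x16_splat(-1);
    v128_t b = wasm_i8x16_splat(-128);  /* lanes hold 0x80 */
    v128_t dot = wasm_i16x8_relaxed_dot_i8x16_i7x16(a, b);
    return wasm_i16x8_extract_lane(dot, 0) < 0;
}
```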

What use cases are there?

@lars-t-hansen

I'm in favor of trying to make this instruction work, but in reference to #9 I think we need to check whether that performance number is realizable when the more relaxed semantics are applied, or whether intermediate results need to be fixed up to account for the relaxed semantics. The experiment @kpu references was x86-only and the semantics were explicitly those of PMADDUBSW. @yurydelendik looked into whether the experiment could use the dot products on arm64 and ran into some trouble, IIRC.

@yurydelendik
Contributor

yurydelendik commented Feb 18, 2022

FWIW the code that benefited from introducing PMADDUBSW can be found at https://github.com/kpu/intgemm/blob/d3657687c9b84d2ea2ea39b2fac7b89597bde848/intgemm/multiply.h#L305-L360

@ngzhian added the "outstanding instruction" label (proposed instructions not yet added to overview) on Feb 18, 2022
@ngzhian
Member

ngzhian commented Mar 22, 2022

This instruction is the first one that is not just a relaxed version of an instruction from the SIMD proposal. I wonder if it should be i16x8.relaxed_dot_i8x16_s instead. It's a dot product of i8x16 elements, and the "relaxed" part means that the results are only guaranteed when the top bit of each element of the second input is not set.
Anyway, I have added these to the overview, with the proposed names used in this issue. We can change them at a later time.

@ngzhian added the "in-overview" label (Instruction has been added to Overview.md) and removed the "outstanding instruction" label (proposed instructions not yet added to overview) on Mar 22, 2022
tlively added a commit to WebAssembly/binaryen that referenced this issue Apr 7, 2022
tlively added a commit to WebAssembly/binaryen that referenced this issue Apr 11, 2022
@yurydelendik
Contributor

During implementation I encountered this case: i16x8.dot_i8x16_i7x16_u (v128.const i8x16 129, 192, ...) (v128.const i8x16 65, 127, ...), which generates (v128.const i16x8 32767, ...) on Intel. I wonder if the OP's analysis is incorrect.

PMADDUBSW produces intermediate signed 16-bit integers, and the saturated result of adding them is packed into the destination. In the example above, 129 * 65 + 192 * 127 == 32769, which was saturated to 32767.
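
For reference, the saturation can be reproduced with the native intrinsic (a minimal sketch; _mm_maddubs_epi16 is the PMADDUBSW intrinsic, and the _u lowering puts a in the unsigned position and b in the signed position):

```c
#include <stdio.h>
#include <tmmintrin.h>

/* 129 * 65 + 192 * 127 = 32769 exceeds INT16_MAX, so PMADDUBSW's 16-bit
   intermediate saturates to 32767. */
int main(void) {
    __m128i a = _mm_setr_epi8((char) 129, (char) 192, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0);  /* unsigned bytes 129, 192 */
    __m128i b = _mm_setr_epi8(65, 127, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0);  /* signed bytes 65, 127 */
    __m128i dot = _mm_maddubs_epi16(a, b);
    printf("%d\n", _mm_extract_epi16(dot, 0));  /* prints 32767 */
    return 0;
}
```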

@Maratyszcza
Collaborator Author

@yurydelendik You are right, I missed the saturating behavior of PMADDUBSW. I suggest we remove the i16x8.dot_i8x16_i7x16_u and i32x4.dot_i8x16_i7x16_add_u instructions from the proposal, as I don't see a way to implement them equally efficiently across both x86 and ARM64.

@kpu

kpu commented Apr 14, 2022

The end goal is matrix multiply. Native matrix multiply libraries have highly optimized implementations that are separate for x86 and ARM. If I'm honest about how I would use these instructions, I would:

  1. Compile both x86 and ARM versions to WASM using the proposed instructions.
  2. At runtime, reverse engineer which CPU is actually running via saturation behavior.
  3. Run the relevant native implementation by exploiting the unofficial arch-dependent behavior of using these instructions outside the documented bounds.

Models quantized for the dominant case of native code will use all 8 bits and be prepared for Intel's saturation (some scale both arguments down by sqrt(2) to avoid saturation); restricting to 7 bits would require new models to be distributed and validated, probably with quantization loss.
There's also the issue of register allocation, since multiply routines use all the registers they can, but spilling to the stack slows things down. Here, sensing the underlying architecture and then calling an arch-specific multiply will correctly optimize register allocation, provided the JIT is written well enough.

So really I'm asking for instructions whose behavior can be exploited to implement 8-bit multiply.

@kpu

kpu commented Apr 14, 2022

What about just having an ARM USDOT / x86 VNNI wrapper that always does unsigned * signed?

On pre-VNNI x86 it lowers to pmaddubsw (which produces saturated 16-bit results) followed by pmaddwd against a register of 1s to accomplish a horizontal add into a 32-bit signed result. This is the main idiom used for 8-bit multiply on pre-VNNI x86, and yes, it saturates.

The disadvantage is that older ARM with only SDOT (no USDOT) is slower than necessary.

@Maratyszcza
Collaborator Author

Maratyszcza commented Apr 14, 2022

USDOT is pretty much non-existent on ARM (there are two SoCs on the market that support it), while SDOT is widespread.

The specification of i32x4.dot_i8x16_i7x16_add_s enables both x86 VNNI and ARM SDOT to be used efficiently.

@Maratyszcza
Collaborator Author

Updated the proposal to reflect the removal of i16x8.dot_i8x16_i7x16_u and i32x4.dot_i8x16_i7x16_add_u instructions.

pull bot pushed a commit to jamlee-t/v8 that referenced this issue Jun 6, 2022
Port commit a52b44f

Original Commit Message:

    Prototype the instruction on the interpreter, and Arm64. Details of
    instruction lowerings on all relevant architectures can be found at:
    WebAssembly/relaxed-simd#52

@ngzhian
Member

ngzhian commented Jan 25, 2023

For the Wasm SIMD 128 lowering, I assume the wasm_i16x8_extmul_low_i8x16 refers to the signed ext mul, right?

@Maratyszcza
Collaborator Author

Yes, wasm_i16x8_extmul_low_i8x16 is the signed extending multiplication.
