Conversation
I really like this proposal because it addresses a common real-world situation. Is there any way, semi-consistent across both architectures, to end up with two i32x4s? The reason I ask is that the Intel ops produce the full multiplication for all 8 lanes and the only difference between low and high is the unpack, which means that if we want the second vector we have to repeat the multiplication. With respect to the ARM instructions, it looks like it's a single call for each half, which makes you wonder if it's doing the same style of calculation under the hood. If ARM provided a method to do this similar to Intel's, would it make more sense to implement the proposal in that way? This would get us the benefit of the 8-wide multiplication happening only once.
Well then these next comments will go together. To add the signed variant for SSE2: `sign1 = sar(number1, 31); // this will produce a full 32-bit mask (if the value is negative)`
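Not the poster's actual assembly, but a minimal C/SSE2-intrinsics sketch of the sign-mask correction idea described above (the function name is made up for illustration). It derives the signed 32x32->64 products of the even lanes from the unsigned `PMULUDQ` result by subtracting `(b & sign_a) + (a & sign_b)` from the high halves:

```c
#include <emmintrin.h>  // SSE2

// Signed 32x32->64 multiply of the even lanes (0 and 2), emulated on SSE2 with
// the unsigned PMULUDQ plus sign-mask corrections.
static inline __m128i mul_even_i32_to_i64_signed_sse2(__m128i a, __m128i b) {
  __m128i sign_a = _mm_srai_epi32(a, 31);   // all-ones mask in lanes where a < 0
  __m128i sign_b = _mm_srai_epi32(b, 31);   // all-ones mask in lanes where b < 0
  __m128i prod_u = _mm_mul_epu32(a, b);     // unsigned 64-bit products of lanes 0 and 2
  // signed = unsigned - (((b & sign_a) + (a & sign_b)) << 32), modulo 2^64
  __m128i corr = _mm_add_epi32(_mm_and_si128(b, sign_a), _mm_and_si128(a, sign_b));
  corr = _mm_slli_epi64(corr, 32);          // move the lane-0/2 corrections into the high halves
  return _mm_sub_epi64(prod_u, corr);
}
```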
Side note: 8-to-16-bit multiplication can be implemented entirely with pmullw or pmuludq too, if you intend to add support for it.
Full assembly code and running examples to show that the signed arithmetic works for SSE2.
Prototyped on arm64 in https://crrev.com/c/2469156 |
Including saturating, rounding Q15 multiplication as proposed in WebAssembly/simd#365 and extending multiplications as proposed in WebAssembly/simd#376. Since these are just prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as implementing them in the interpreter.
As proposed in WebAssembly/simd#376. This commit implements new builtin functions and intrinsics for these instructions, but does not yet add them to wasm_simd128.h because they have not yet been merged to the proposal. These are the first instructions with opcodes greater than 0xff, so this commit updates the MC layer and disassembler to handle that correctly. Differential Revision: https://reviews.llvm.org/D90253
These have now landed in LLVM and Binaryen and should be ready to use in tip-of-tree Emscripten in a few hours. The builtin functions to use are
@Maratyszcza any suggested lowering for `i16x8.extmul_{high,low}_i8x16_{s,u}` for x86 and x64?
@ngzhian There isn't anything more efficient than the naive lowering.
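For reference, one possible spelling of that naive lowering with SSE4.1 intrinsics (a sketch, not the exact code generation; the helper name is made up). The low 16 bits of a 16x16 product already equal the full 8x8->16 product, so sign-extending and doing a regular 16-bit multiply is enough:

```c
#include <smmintrin.h>  // SSE4.1

// Naive i16x8.extmul_low_i8x16_s: widen the low 8 lanes, then multiply.
static inline __m128i extmul_low_i8x16_s_naive(__m128i a, __m128i b) {
  __m128i a16 = _mm_cvtepi8_epi16(a);   // PMOVSXBW: sign-extend low 8 lanes of a
  __m128i b16 = _mm_cvtepi8_epi16(b);   // PMOVSXBW: sign-extend low 8 lanes of b
  return _mm_mullo_epi16(a16, b16);     // PMULLW: one full 8x8->16 product per lane
}
```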
I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library. Fixed-point neural network operators typically accumulate intermediate results in high precision (32-bit) and at the end need to convert the intermediate result into a low-precision representation (typically 8-bit), a transformation called requantization. The performance impact on the requantization primitive is summarized in the table below:
Requantization is only one of several components of fixed-point inference that could benefit from the extended multiplication instructions, but even it alone has a noticeable end-to-end impact, as demonstrated below for the MobileNet v2 model:
The code modifications can be seen in google/XNNPACK#1202.
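For readers unfamiliar with the term, a simplified scalar sketch of the general requantization idea (illustrative only, not XNNPACK's exact code path; the function name, Q31 scaling scheme, and rounding details are assumptions):

```c
#include <stdint.h>

// Scale a 32-bit accumulator by a Q31 fixed-point multiplier, shift back down,
// add the output zero point, and clamp to the 8-bit range. The 32x32->64
// product in the first line is the step the extended multiplication
// instructions accelerate in the vectorized version.
static inline int8_t requantize_q31(int32_t acc, int32_t multiplier, int shift,
                                    int32_t output_zero_point) {
  int64_t prod = (int64_t)acc * (int64_t)multiplier;  // needs the full 64-bit result
  int64_t rounding = (int64_t)1 << (30 + shift);      // round-to-nearest bias (shift >= 0 assumed)
  int32_t scaled = (int32_t)((prod + rounding) >> (31 + shift));
  int32_t out = scaled + output_zero_point;
  if (out < INT8_MIN) out = INT8_MIN;
  if (out > INT8_MAX) out = INT8_MAX;
  return (int8_t)out;
}
```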
@Maratyszcza Is this because 3 instructions are replaced by 1?
@omnisip Not quite. On one side, the baseline doesn't use 32->64-bit extension instructions, as these are still experimental and have to be emulated via 2 Wasm SIMD instructions. On the other side, the baseline version pre-multiplies the multiplier by 2 as an optimization, but the version with extended multiplication instructions instead explicitly doubles the result, because if we pre-multiply the multiplier by 2 it no longer fits into 32 bits.
That makes sense, and reviewing the code really shows how much of a performance bump there should be. Is this a practical use case for SQDMLAL in a native implementation?
@Maratyszcza The desire for the full multiplication result is there, and the speed-up seen on ARM looks very good. On x86/x64 it seems as if there is less flexibility for a more efficient lowering. Any idea what the comparable XNNPACK speed-up is for x86/x64?
@jlb6740 The bulk of the speedup for x64 comes from the fact that x64 has no native 64x64 vector multiplication instruction -- so in a roundabout way, one would have to convert to 64-bit behind the scenes and then perform the same underlying operations as if they were 32-bit integers, yielding 8 instructions for each 64x64 multiply plus 2 instructions for each conversion to 64-bit, or 20 instructions to do the full set. Compare that with the 6 instructions it can get the job done with here. The performance could be increased even further if this supported multiple return values, e.g.:

    ;; i64x2.extmul_low_i32x4_s(a, b)
    VPSHUFD xmm_tmp, xmm_a, 0x50   ; 1tp, 1 lat; finishes in cycle 1
    VPSHUFD xmm_y, xmm_b, 0x50     ; 1tp, 1 lat; finishes in cycle 2
    VPMULDQ xmm_y, xmm_tmp         ; 0.5tp, 5 lat; finishes in cycle 7
    ;; i64x2.extmul_high_i32x4_s(a, b)
    VPSHUFD xmm_tmp, xmm_a, 0xFA   ; 1tp, 1 lat; finishes in cycle 3
    VPSHUFD xmm_y, xmm_b, 0xFA     ; 1tp, 1 lat; finishes in cycle 4
    VPMULDQ xmm_y, xmm_tmp         ; 0.5tp, 5 lat; finishes in cycle 9

becomes

    VPMULDQ xmm_ab_even, xmm_a, xmm_b               ; 0.5tp, 5 lat; finishes in cycle 5
    VPSRLQ  xmm_a_odd, xmm_a, 32                    ; 0.5tp, 1 lat; finishes in cycle 1
    VPSRLQ  xmm_b_odd, xmm_b, 32                    ; 0.5tp, 1 lat; can be swapped for a shuffle to get all three done in the first cycle, but may be inconsequential since VPMULDQ is also 0.5tp
    VPMULDQ xmm_ab_odd, xmm_a_odd, xmm_b_odd        ; 0.5tp, 5 lat; finishes in cycle 7 (non-shuffle), cycle 6 (shuffle)
    VPUNPCKLQDQ xmm_y_low, xmm_ab_even, xmm_ab_odd  ; 1tp, 1 lat; finishes in cycle 7 or 8
    VPUNPCKHQDQ xmm_y_high, xmm_ab_even, xmm_ab_odd ; 1tp, 1 lat; finishes in cycle 8 or 9
@jlb6740 V8 doesn't implement these instructions for x86 yet. Once these instructions appear in x86-64 V8, I will benchmark.
@penzn You're right. The comment was about the general utility of this instruction versus this specific application. If I had Marat's test bench to run on x64, I would expect a performance improvement on x64 that dwarfs the one on ARM, perhaps by a long shot. If you look at the implementations (old and new) -- google/XNNPACK@f63a54a#diff-e1f02777660513b5d83f8648e2a4cdcd55e843e50e5625cbc163224a50446d43 -- you can see how many shuffle operations it takes to get from 32 to 64 bits. Fundamentally that's the same on x64 with or without specialized instructions. Then it calls i64x2.mul, which in a roundabout way undoes the exact same shuffle and shift operations that preceded it. I'm going to verify the V8 implementation, but this is the most efficient way to do signed 64-bit multiplication on x64.
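To make that cost concrete, here is a sketch (an assumed spelling, not V8's actual code) of how a generic `i64x2.mul` has to be emulated on x64 without a native 64x64 vector multiply: three `PMULUDQ`s plus shifts and adds, roughly the 8-instruction sequence quoted earlier in the thread. This is the work the extmul instructions avoid when the 64-bit inputs are really just widened 32-bit values:

```c
#include <emmintrin.h>  // SSE2

// Emulated i64x2.mul: a*b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)
static inline __m128i i64x2_mul_sse2(__m128i a, __m128i b) {
  __m128i a_hi = _mm_srli_epi64(a, 32);                  // high 32 bits of each lane
  __m128i b_hi = _mm_srli_epi64(b, 32);
  __m128i lo   = _mm_mul_epu32(a, b);                    // lo(a) * lo(b), full 64-bit
  __m128i mid  = _mm_add_epi64(_mm_mul_epu32(a, b_hi),
                               _mm_mul_epu32(a_hi, b));  // cross terms
  return _mm_add_epi64(lo, _mm_slli_epi64(mid, 32));     // combine low and shifted cross terms
}
```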
Here is a general analysis using LLVM-MCA comparing i64 multiplication (pre-converted, meaning no shuffle or movsx* steps) vs. i32-to-i64 (including the conversions inlined). The top right shows the former at 683 cycles for 100 iterations, and the bottom right shows the latter at 415 cycles for 100 iterations. That suggests a 40% improvement by itself, not including any shifts or shuffles that would otherwise be needed for the conversion in the former.
@omnisip @Maratyszcza Thanks, guys. Yes, as @penzn commented, I was just wondering if there was a similar table for x64, and I guess the answer is no (not yet anyway) but it's coming. If the performance benefits for x64 turn out to be dwarfed as expected, or even non-existent, in this particular use case, what would that mean for this proposal? This proposal seems to take good advantage of ARM semantics, but unfortunately there aren't equivalent semantics for x64, so does this proposal really only target one platform? Could a similar boost be achieved on ARM without these new instructions, or is it not trivial to recognize this widening pattern when lowering to ARM? I ask because, if I understand correctly, there are many other combinations to target, such as `i16x8.mul = i16x8 x i8x16`, that also don't have Wasm instructions, and then I suppose there is addition as well, right?
@jlb6740 By dwarfed, I meant that if you thought the ARM benefits were good, the x64 benefits are likely to be AMAZING. The modeling was designed to match @Maratyszcza's nested loop with a loaded multiplier and rounding element from memory as part of a greater loop, followed by the multiply, add, and doubling operations before persisting to memory. To add to it, I made sure to run the cost analysis on 'core2' (x86-64 appears to be a synonym for this cost model in LLVM) and made every load and store unaligned to match V8. After that, I split it up into four result panes, which you'll see on the right-hand side. The first two (top left and bottom left) present alternative algorithms for SSE2 to do the calculations with pmuludq 1 time versus 3 times. On the top right, you'll see the optimal solution using pmuldq, which is proposed by this feature. The fourth pane (bottom right) compares the performance against a purely scalarized 64-bit implementation which unpacks the vector into registers for multiplication before reloading. The difference in performance is HUMONGOUS: 23-24K instructions become 9600, and cycles go from 7208 to 2668. That's roughly a 2.7x speedup. (updated to fix typo: pmulld -> pmuldq)
Addressing this separately, mul(i8x16, i8x16) -> i16x8 is pretty efficient by itself on x64 (even with the naive conversions in front) -- each multiplication can be done in 3 instructions in its best case (converting each operand argument that's coming from memory to 2
I think we need performance data to confirm this. I am somewhat skeptical about a lowering using two
@Maratyszcza, is it a benchmark in XNNPACK, or did you use a framework accelerated by the library?
@penzn I used built-in
Here are the results on x86-64 systems. Impact on requantization only:
Impact on end-to-end MobileNet v2 latency:
Hey @Maratyszcza, I'm going to post my build-setup questions here just in case someone else wants to see how to do this too. Here's what I've got so far:
Now I would just love to know how to build and run this. Do you know what the commands are for Bazel? Then do I run something special to get it to go through d8? Thanks so much! (Update/side note: I definitely know how to wire Emscripten to a git version of LLVM if I need to as well.)
AFAIK Bazel doesn't support building for Emscripten out of the box; you'd need a custom toolchain for that. You can get one from TensorFlow.js, as well as copy the
No need for separate Emscripten or LLVM branches. Everything is checked in for both.
@tlively @abrown Once I get XNNPACK to build, I should have all of the changes done to V8. Here's the first rendition of what I want to try: https://gist.github.com/omnisip/67850c665ac33ced75272b3780f2a937 Basically, it eliminates half of the shuffles by shuffling the inputs together, then shifting the result of that to get the second multiplier/multiplicand. There should be move elimination on the SSE4.1 set for the two extra instructions I added, but I will test that too. On the VEX-enabled versions, it's still the same number of instructions.
We achieved consensus to merge this instruction to the proposal at the most recent sync meeting (#400).
| `i16x8.extmul_low_i8x16_s` | `0x110` | - |
| `i16x8.extmul_high_i8x16_s` | `0x111` | - |
| `i16x8.extmul_low_i8x16_u` | `0x112` | - |
| `i16x8.extmul_high_i8x16_u` | `0x113` | - |
| `i32x4.extmul_low_i16x8_s` | `0x114` | - |
| `i32x4.extmul_high_i16x8_s` | `0x115` | - |
| `i32x4.extmul_low_i16x8_u` | `0x116` | - |
| `i32x4.extmul_high_i16x8_u` | `0x117` | - |
| `i64x2.extmul_low_i32x4_s` | `0x118` | - |
| `i64x2.extmul_high_i32x4_s` | `0x119` | - |
| `i64x2.extmul_low_i32x4_u` | `0x11a` | - |
| `i64x2.extmul_high_i32x4_u` | `0x11b` | - |
These opcodes don't match our current implementation. @tlively, I believe LLVM and V8 are in sync, right?
It worked for me when I pulled the latest Emscripten and V8.
Yeah, Emscripten and V8 are in sync; this document isn't.
These were accepted into the proposal in WebAssembly#376. There are 12 instructions in total: - i16x8.extmul_{low,high}_i8x16_{s,u} - i32x4.extmul_{low,high}_i16x8_{s,u} - i64x2.extmul_{low,high}_i32x4_{s,u} The implementation is straightforward, widen (using existing operations), then a multiply with the wider shape. Added a test generation script that reuses some logic in the generator for arithmetic instructions. Since these instructions have different src and dst shapes, I tweaked the base class to allow for having different shapes.
These were accepted into the proposal in WebAssembly#376. There are 12 instructions in total: - i16x8.extmul_{low,high}_i8x16_{s,u} - i32x4.extmul_{low,high}_i16x8_{s,u} - i64x2.extmul_{low,high}_i32x4_{s,u} The implementation is straightforward, widen (using existing operations), then a multiply with the wider shape. The binary opcodes are not decided yet, they currently follow the ones used in V8, when those are finalized, we can change it to match. Added a test generation script that reuses some logic in the generator for arithmetic instructions. Since these instructions have different src and dst shapes, I tweaked the base class to allow for having different shapes.
Introduction
The result of an integer multiplication is generally twice as wide as its inputs, and the lane-wise multiplication instructions currently in the WebAssembly SIMD specification would commonly overflow and wrap. DSP algorithms often need the full multiplication result, and there have been requests to provide such functionality (e.g. #175 and #226) in WebAssembly SIMD. However, the current WebAssembly SIMD specification lacks such instructions, and the extended multiplication must be simulated via a combination of `widen` instructions and a `mul` instruction of the twice-wider result type. This PR adds extended multiplication instructions, which compute the full result of multiplication and enable more efficient lowering to native instruction sets than the emulation sequence with `widen` and `mul` instructions:

- `i16x8.mul(i16x8.widen_low_i8x16_s(a), i16x8.widen_low_i8x16_s(b))` -> `i16x8.extmul_low_i8x16_s(a, b)`
- `i16x8.mul(i16x8.widen_high_i8x16_s(a), i16x8.widen_high_i8x16_s(b))` -> `i16x8.extmul_high_i8x16_s(a, b)`
- `i16x8.mul(i16x8.widen_low_i8x16_u(a), i16x8.widen_low_i8x16_u(b))` -> `i16x8.extmul_low_i8x16_u(a, b)`
- `i16x8.mul(i16x8.widen_high_i8x16_u(a), i16x8.widen_high_i8x16_u(b))` -> `i16x8.extmul_high_i8x16_u(a, b)`
- `i32x4.mul(i32x4.widen_low_i16x8_s(a), i32x4.widen_low_i16x8_s(b))` -> `i32x4.extmul_low_i16x8_s(a, b)`
- `i32x4.mul(i32x4.widen_high_i16x8_s(a), i32x4.widen_high_i16x8_s(b))` -> `i32x4.extmul_high_i16x8_s(a, b)`
- `i32x4.mul(i32x4.widen_low_i16x8_u(a), i32x4.widen_low_i16x8_u(b))` -> `i32x4.extmul_low_i16x8_u(a, b)`
- `i32x4.mul(i32x4.widen_high_i16x8_u(a), i32x4.widen_high_i16x8_u(b))` -> `i32x4.extmul_high_i16x8_u(a, b)`
- `i64x2.mul(i64x2.widen_low_i32x4_s(a), i64x2.widen_low_i32x4_s(b))` -> `i64x2.extmul_low_i32x4_s(a, b)`
- `i64x2.mul(i64x2.widen_high_i32x4_s(a), i64x2.widen_high_i32x4_s(b))` -> `i64x2.extmul_high_i32x4_s(a, b)`
- `i64x2.mul(i64x2.widen_low_i32x4_u(a), i64x2.widen_low_i32x4_u(b))` -> `i64x2.extmul_low_i32x4_u(a, b)`
- `i64x2.mul(i64x2.widen_high_i32x4_u(a), i64x2.widen_high_i32x4_u(b))` -> `i64x2.extmul_high_i32x4_u(a, b)`

Native instruction sets typically include means to compute the full result of a multiplication, although the exact details vary by architecture and data type. ARM NEON provides instructions that compute extended multiplication on the low or high halves of the input SIMD vectors and produce a full SIMD vector of results (they map 1:1 to the proposed WebAssembly SIMD instructions). x86 provides different instructions depending on data type:

- `32x32->64` multiplication instructions consume two even-numbered lanes as input and produce a single 128-bit vector with two full 64-bit results.
- `16x16->32` multiplication is provided via separate instructions that compute the low and high 16-bit parts of the 32-bit result, which can be interleaved to get vectors of full 32-bit results.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
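As a point of reference before the per-architecture listings, an illustrative scalar spelling of the lane-wise semantics for one of the instructions (not normative text; the function name is made up):

```c
#include <stdint.h>

// i32x4.extmul_low_i16x8_s: sign-extend the low 4 lanes of each i16x8 input to
// 32 bits and multiply, keeping the full product instead of wrapping.
void i32x4_extmul_low_i16x8_s(int32_t out[4], const int16_t a[8], const int16_t b[8]) {
  for (int i = 0; i < 4; i++) {
    out[i] = (int32_t)a[i] * (int32_t)b[i];
  }
}
```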
x86/x86-64 processors with AVX instruction set
- `i32x4.extmul_low_i16x8_s`
  - `y = i32x4.extmul_low_i16x8_s(a, b)` is lowered to `VPMULLW xmm_tmp, xmm_a, xmm_b` + `VPMULHW xmm_y, xmm_a, xmm_b` + `VPUNPCKLWD xmm_y, xmm_tmp, xmm_y`
- `i32x4.extmul_high_i16x8_s`
  - `y = i32x4.extmul_high_i16x8_s(a, b)` is lowered to `VPMULLW xmm_tmp, xmm_a, xmm_b` + `VPMULHW xmm_y, xmm_a, xmm_b` + `VPUNPCKHWD xmm_y, xmm_tmp, xmm_y`
- `i32x4.extmul_low_i16x8_u`
  - `y = i32x4.extmul_low_i16x8_u(a, b)` is lowered to `VPMULLW xmm_tmp, xmm_a, xmm_b` + `VPMULHUW xmm_y, xmm_a, xmm_b` + `VPUNPCKLWD xmm_y, xmm_tmp, xmm_y`
- `i32x4.extmul_high_i16x8_u`
  - `y = i32x4.extmul_high_i16x8_u(a, b)` is lowered to `VPMULLW xmm_tmp, xmm_a, xmm_b` + `VPMULHUW xmm_y, xmm_a, xmm_b` + `VPUNPCKHWD xmm_y, xmm_tmp, xmm_y`
- `i64x2.extmul_low_i32x4_s`
  - `y = i64x2.extmul_low_i32x4_s(a, b)` is lowered to `VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a` + `VPUNPCKLDQ xmm_y, xmm_b, xmm_b` + `VPMULDQ xmm_y, xmm_tmp, xmm_y`
- `i64x2.extmul_high_i32x4_s`
  - `y = i64x2.extmul_high_i32x4_s(a, b)` is lowered to `VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a` + `VPUNPCKHDQ xmm_y, xmm_b, xmm_b` + `VPMULDQ xmm_y, xmm_tmp, xmm_y`
- `i64x2.extmul_low_i32x4_u`
  - `y = i64x2.extmul_low_i32x4_u(a, b)` is lowered to `VPUNPCKLDQ xmm_tmp, xmm_a, xmm_a` + `VPUNPCKLDQ xmm_y, xmm_b, xmm_b` + `VPMULUDQ xmm_y, xmm_tmp, xmm_y`
- `i64x2.extmul_high_i32x4_u`
  - `y = i64x2.extmul_high_i32x4_u(a, b)` is lowered to `VPUNPCKHDQ xmm_tmp, xmm_a, xmm_a` + `VPUNPCKHDQ xmm_y, xmm_b, xmm_b` + `VPMULUDQ xmm_y, xmm_tmp, xmm_y`
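The same `PMULLW`/`PMULHW` + interleave pattern, written as an intrinsics sketch (the compiler picks the VEX or legacy encodings; the function name is made up): the low and high 16-bit halves of each 32-bit product are computed separately and then zipped back together.

```c
#include <emmintrin.h>  // SSE2 intrinsics; the VEX forms above use the same pattern

// i32x4.extmul_low_i16x8_s via PMULLW/PMULHW + PUNPCKLWD.
static inline __m128i extmul_low_i16x8_s_x86(__m128i a, __m128i b) {
  __m128i lo = _mm_mullo_epi16(a, b);   // low 16 bits of each 16x16 product
  __m128i hi = _mm_mulhi_epi16(a, b);   // high 16 bits (signed)
  return _mm_unpacklo_epi16(lo, hi);    // interleave -> full 32-bit products of lanes 0..3
}
```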
x86/x86-64 processors with SSE4.1 instruction set
- `i64x2.extmul_low_i32x4_s`
  - `y = i64x2.extmul_low_i32x4_s(a, b)` is lowered to `PSHUFD xmm_tmp, xmm_a, 0x50` + `PSHUFD xmm_y, xmm_b, 0x50` + `PMULDQ xmm_y, xmm_tmp`
- `i64x2.extmul_high_i32x4_s`
  - `y = i64x2.extmul_high_i32x4_s(a, b)` is lowered to `PSHUFD xmm_tmp, xmm_a, 0xFA` + `PSHUFD xmm_y, xmm_b, 0xFA` + `PMULDQ xmm_y, xmm_tmp`
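A sketch of why the `0x50` shuffle works (the helper name is made up): `PSHUFD` with immediate `0x50` replicates input lanes {0, 1} into positions {0, 0, 1, 1}, which places the two low lanes into the even slots that `PMULDQ` reads, yielding exactly the two signed 64-bit products.

```c
#include <smmintrin.h>  // SSE4.1

// i64x2.extmul_low_i32x4_s via PSHUFD + PMULDQ.
static inline __m128i extmul_low_i32x4_s_sse41(__m128i a, __m128i b) {
  __m128i a_even = _mm_shuffle_epi32(a, 0x50);  // {a0, a0, a1, a1}
  __m128i b_even = _mm_shuffle_epi32(b, 0x50);  // {b0, b0, b1, b1}
  return _mm_mul_epi32(a_even, b_even);         // signed 32x32->64 on lanes 0 and 2
}
```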
x86/x86-64 processors with SSE2 instruction set
- `i32x4.extmul_low_i16x8_s`
  - `y = i32x4.extmul_low_i16x8_s(a, b)` (`y` is NOT `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `MOVDQA xmm_tmp, xmm_a` + `PMULLW xmm_y, xmm_b` + `PMULHW xmm_tmp, xmm_b` + `PUNPCKLWD xmm_y, xmm_tmp`
- `i32x4.extmul_high_i16x8_s`
  - `y = i32x4.extmul_high_i16x8_s(a, b)` (`y` is NOT `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `MOVDQA xmm_tmp, xmm_a` + `PMULLW xmm_y, xmm_b` + `PMULHW xmm_tmp, xmm_b` + `PUNPCKHWD xmm_y, xmm_tmp`
- `i32x4.extmul_low_i16x8_u`
  - `y = i32x4.extmul_low_i16x8_u(a, b)` (`y` is NOT `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `MOVDQA xmm_tmp, xmm_a` + `PMULLW xmm_y, xmm_b` + `PMULHUW xmm_tmp, xmm_b` + `PUNPCKLWD xmm_y, xmm_tmp`
- `i32x4.extmul_high_i16x8_u`
  - `y = i32x4.extmul_high_i16x8_u(a, b)` (`y` is NOT `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `MOVDQA xmm_tmp, xmm_a` + `PMULLW xmm_y, xmm_b` + `PMULHUW xmm_tmp, xmm_b` + `PUNPCKHWD xmm_y, xmm_tmp`
- `i64x2.extmul_low_i32x4_u`
  - `y = i64x2.extmul_low_i32x4_u(a, b)` is lowered to `PSHUFD xmm_tmp, xmm_a, 0x50` + `PSHUFD xmm_y, xmm_b, 0x50` + `PMULUDQ xmm_y, xmm_tmp`
- `i64x2.extmul_high_i32x4_u`
  - `y = i64x2.extmul_high_i32x4_u(a, b)` is lowered to `PSHUFD xmm_tmp, xmm_a, 0xFA` + `PSHUFD xmm_y, xmm_b, 0xFA` + `PMULUDQ xmm_y, xmm_tmp`
ARM64 processors
- `i16x8.extmul_low_i8x16_s`
  - `y = i16x8.extmul_low_i8x16_s(a, b)` is lowered to `SMULL Vy.8H, Va.8B, Vb.8B`
- `i16x8.extmul_high_i8x16_s`
  - `y = i16x8.extmul_high_i8x16_s(a, b)` is lowered to `SMULL2 Vy.8H, Va.16B, Vb.16B`
- `i16x8.extmul_low_i8x16_u`
  - `y = i16x8.extmul_low_i8x16_u(a, b)` is lowered to `UMULL Vy.8H, Va.8B, Vb.8B`
- `i16x8.extmul_high_i8x16_u`
  - `y = i16x8.extmul_high_i8x16_u(a, b)` is lowered to `UMULL2 Vy.8H, Va.16B, Vb.16B`
- `i32x4.extmul_low_i16x8_s`
  - `y = i32x4.extmul_low_i16x8_s(a, b)` is lowered to `SMULL Vy.4S, Va.4H, Vb.4H`
- `i32x4.extmul_high_i16x8_s`
  - `y = i32x4.extmul_high_i16x8_s(a, b)` is lowered to `SMULL2 Vy.4S, Va.8H, Vb.8H`
- `i32x4.extmul_low_i16x8_u`
  - `y = i32x4.extmul_low_i16x8_u(a, b)` is lowered to `UMULL Vy.4S, Va.4H, Vb.4H`
- `i32x4.extmul_high_i16x8_u`
  - `y = i32x4.extmul_high_i16x8_u(a, b)` is lowered to `UMULL2 Vy.4S, Va.8H, Vb.8H`
- `i64x2.extmul_low_i32x4_s`
  - `y = i64x2.extmul_low_i32x4_s(a, b)` is lowered to `SMULL Vy.2D, Va.2S, Vb.2S`
- `i64x2.extmul_high_i32x4_s`
  - `y = i64x2.extmul_high_i32x4_s(a, b)` is lowered to `SMULL2 Vy.2D, Va.4S, Vb.4S`
- `i64x2.extmul_low_i32x4_u`
  - `y = i64x2.extmul_low_i32x4_u(a, b)` is lowered to `UMULL Vy.2D, Va.2S, Vb.2S`
- `i64x2.extmul_high_i32x4_u`
  - `y = i64x2.extmul_high_i32x4_u(a, b)` is lowered to `UMULL2 Vy.2D, Va.4S, Vb.4S`
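In C, the corresponding NEON intrinsics (a sketch; the helper names are made up) map to the single `SMULL`/`SMULL2` instructions listed above:

```c
#include <arm_neon.h>

// i32x4.extmul_low_i16x8_s -> SMULL Vd.4S, Vn.4H, Vm.4H
static inline int32x4_t extmul_low_i16x8_s_neon(int16x8_t a, int16x8_t b) {
  return vmull_s16(vget_low_s16(a), vget_low_s16(b));
}

// i32x4.extmul_high_i16x8_s -> SMULL2 Vd.4S, Vn.8H, Vm.8H (AArch64)
static inline int32x4_t extmul_high_i16x8_s_neon(int16x8_t a, int16x8_t b) {
  return vmull_high_s16(a, b);
}
```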
ARMv7 processors with NEON instruction set
- `i16x8.extmul_low_i8x16_s`
  - `y = i16x8.extmul_low_i8x16_s(a, b)` is lowered to `VMULL.S8 Qy, Da_lo, Db_lo`
- `i16x8.extmul_high_i8x16_s`
  - `y = i16x8.extmul_high_i8x16_s(a, b)` is lowered to `VMULL.S8 Qy, Da_hi, Db_hi`
- `i16x8.extmul_low_i8x16_u`
  - `y = i16x8.extmul_low_i8x16_u(a, b)` is lowered to `VMULL.U8 Qy, Da_lo, Db_lo`
- `i16x8.extmul_high_i8x16_u`
  - `y = i16x8.extmul_high_i8x16_u(a, b)` is lowered to `VMULL.U8 Qy, Da_hi, Db_hi`
- `i32x4.extmul_low_i16x8_s`
  - `y = i32x4.extmul_low_i16x8_s(a, b)` is lowered to `VMULL.S16 Qy, Da_lo, Db_lo`
- `i32x4.extmul_high_i16x8_s`
  - `y = i32x4.extmul_high_i16x8_s(a, b)` is lowered to `VMULL.S16 Qy, Da_hi, Db_hi`
- `i32x4.extmul_low_i16x8_u`
  - `y = i32x4.extmul_low_i16x8_u(a, b)` is lowered to `VMULL.U16 Qy, Da_lo, Db_lo`
- `i32x4.extmul_high_i16x8_u`
  - `y = i32x4.extmul_high_i16x8_u(a, b)` is lowered to `VMULL.U16 Qy, Da_hi, Db_hi`
- `i64x2.extmul_low_i32x4_s`
  - `y = i64x2.extmul_low_i32x4_s(a, b)` is lowered to `VMULL.S32 Qy, Da_lo, Db_lo`
- `i64x2.extmul_high_i32x4_s`
  - `y = i64x2.extmul_high_i32x4_s(a, b)` is lowered to `VMULL.S32 Qy, Da_hi, Db_hi`
- `i64x2.extmul_low_i32x4_u`
  - `y = i64x2.extmul_low_i32x4_u(a, b)` is lowered to `VMULL.U32 Qy, Da_lo, Db_lo`
- `i64x2.extmul_high_i32x4_u`
  - `y = i64x2.extmul_high_i32x4_u(a, b)` is lowered to `VMULL.U32 Qy, Da_hi, Db_hi`