What are the instructions being proposed?
I propose a relaxed version of the Saturating Rounding Q-format Multiplication instruction i16x8.q15mulr_sat_s introduced in WebAssembly/simd#365. I suggest i16x8.q15mulr_s as the tentative name for the relaxed instruction.
What are the semantics of these instructions?
i16x8.q15mulr_sat_s implements multiplication of fixed-point numbers in Q15 format (see WebAssembly/simd#365 for details). The multiplication overflows if and only if both inputs are INT16_MIN, and the x86 SSSE3 and ARM NEON instructions differ in how they handle this case: the x86 version wraps around while the ARM version saturates. The WebAssembly SIMD instruction i16x8.q15mulr_sat_s standardized on the ARM overflow semantics, which requires additional overflow checks on x86. However, the case of both inputs being INT16_MIN is rare, and the higher-level structure of an algorithm can often guarantee that it never happens, so a relaxed version that allows either overflow behavior would help performance on x86.
The proposed i16x8.q15mulr_s Relaxed SIMD instruction computes the lane-wise rounded multiplication of Q15 numbers, and allows for either saturation or wrap-around behavior in the overflow case (where both inputs are INT16_MIN).
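To make the two permitted overflow behaviors concrete, here is a minimal scalar sketch of a single lane. The function names (q15mulr_wide, q15mulr_saturating, q15mulr_wrapping) are hypothetical and not part of any proposal text, and the code assumes that right-shifting a negative int32_t is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Rounded Q15 product before narrowing to 16 bits: (a * b + 2^14) >> 15.
   Assumption: >> on a negative int32_t is an arithmetic shift (true for
   mainstream compilers such as GCC and Clang, but not guaranteed by C). */
static int32_t q15mulr_wide(int16_t a, int16_t b) {
    return ((int32_t)a * (int32_t)b + 0x4000) >> 15;
}

/* Saturating variant (ARM-style overflow handling).
   Only the upper bound can be exceeded, and only for INT16_MIN * INT16_MIN. */
static int16_t q15mulr_saturating(int16_t a, int16_t b) {
    int32_t r = q15mulr_wide(a, b);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Wrapping variant (x86-style overflow handling): keep the low 16 bits
   and reinterpret them as a signed value. */
static int16_t q15mulr_wrapping(int16_t a, int16_t b) {
    uint16_t low = (uint16_t)q15mulr_wide(a, b);
    return (int16_t)(low >= 0x8000u ? (int32_t)low - 65536 : (int32_t)low);
}
```

Under this model a relaxed i16x8.q15mulr_s lane may return the result of either variant; the two differ only for the input pair (INT16_MIN, INT16_MIN).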
How will these instructions be implemented?
x86/x86-64 processors with AVX instruction set
y = i16x8.q15mulr_s(a, b) is lowered to VPMULHRSW xmm_y, xmm_a, xmm_b
x86/x86-64 processors with SSSE3 instruction set
y = i16x8.q15mulr_s(a, b) is lowered to MOVDQA xmm_y, xmm_a + PMULHRSW xmm_y, xmm_b
x86/x86-64 processors with SSE2 instruction set
y = i16x8.q15mulr_s(a, b) (where y is NOT a and y is NOT b) is lowered to
MOVDQA xmm_y, xmm_a
MOVDQA xmm_tmp, xmm_a
PMULLW xmm_y, xmm_b
PMULHW xmm_tmp, xmm_b
PSRLW xmm_y, 14
PADDW xmm_tmp, xmm_tmp
PAVGW xmm_y, wasm_i16x8_splat(0)
PADDW xmm_y, xmm_tmp
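The SSE2 sequence above can be followed one lane at a time with a scalar model. This is only a sketch for reasoning about the arithmetic, not a lowering: the names sse2_q15mulr_lane and bits_to_i16 are made up for illustration, and each statement mirrors one instruction of the sequence on a single 16-bit lane.

```c
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined narrowing conversions. */
static int16_t bits_to_i16(uint16_t bits) {
    return (int16_t)(bits >= 0x8000u ? (int32_t)bits - 65536 : (int32_t)bits);
}

/* Scalar model of one lane of the SSE2 lowering (hypothetical helper). */
static int16_t sse2_q15mulr_lane(int16_t a, int16_t b) {
    /* 32-bit two's complement bit pattern of the signed product. */
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);

    uint16_t y   = (uint16_t)(p & 0xFFFFu);  /* PMULLW: low 16 bits of a*b    */
    uint16_t tmp = (uint16_t)(p >> 16);      /* PMULHW: high 16 bits of a*b   */
    y   = (uint16_t)(y >> 14);               /* PSRLW 14: product bits 15..14 */
    tmp = (uint16_t)(tmp + tmp);             /* PADDW: high half times 2      */
    y   = (uint16_t)((y + 0u + 1u) >> 1);    /* PAVGW with zero: rounding bit */
    return bits_to_i16((uint16_t)(y + tmp)); /* PADDW: wraps modulo 2^16      */
}
```

In this model the final PADDW wraps, so sse2_q15mulr_lane(INT16_MIN, INT16_MIN) yields INT16_MIN, which matches the wrap-around behavior described for PMULHRSW.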
ARM64 processors
y = i16x8.q15mulr_s(a, b) is lowered to SQRDMULH Vy.8H, Va.8H, Vb.8H
ARMv7 processors with NEON instruction set
y = i16x8.q15mulr_s(a, b) is lowered to VQRDMULH.S16 Qy, Qa, Qb
Reference lowering through the WAsm SIMD128 instruction set
y = i16x8.q15mulr_s(a, b) is lowered as y = i16x8.q15mulr_sat_s(a, b)
How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
When both inputs are INT16_MIN, x86/x86-64 will produce an INT16_MIN result while ARM/ARM64 will produce an INT16_MAX result. x86/x86-64 can already be distinguished from ARM/ARM64 based on NaN behavior, so this instruction doesn't add any new fingerprinting surfaces.
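For illustration, the observable difference comes down to a single probe input. The sketch below classifies a lane multiplier by its result for (INT16_MIN, INT16_MIN). All names are hypothetical, and arm_like and x86_like are scalar stand-ins for the two hardware behaviors rather than real lowerings. The stand-ins assume an arithmetic right shift on negative int32_t values.

```c
#include <stdint.h>

typedef int16_t (*q15mulr_fn)(int16_t, int16_t);

typedef enum {
    OVERFLOW_SATURATES,  /* INT16_MAX, as described for ARM/ARM64  */
    OVERFLOW_WRAPS,      /* INT16_MIN, as described for x86/x86-64 */
    OVERFLOW_UNEXPECTED  /* neither permitted result               */
} overflow_kind;

/* Classify an implementation by its result on the only overflowing input. */
static overflow_kind probe_overflow(q15mulr_fn mul) {
    int16_t r = mul(INT16_MIN, INT16_MIN);
    if (r == INT16_MAX) return OVERFLOW_SATURATES;
    if (r == INT16_MIN) return OVERFLOW_WRAPS;
    return OVERFLOW_UNEXPECTED;
}

/* Stand-in with saturating overflow (assumes arithmetic right shift). */
static int16_t arm_like(int16_t a, int16_t b) {
    if (a == INT16_MIN && b == INT16_MIN) return INT16_MAX;
    return (int16_t)(((int32_t)a * (int32_t)b + 0x4000) >> 15);
}

/* Stand-in with wrapping overflow (assumes arithmetic right shift). */
static int16_t x86_like(int16_t a, int16_t b) {
    if (a == INT16_MIN && b == INT16_MIN) return INT16_MIN;
    return (int16_t)(((int32_t)a * (int32_t)b + 0x4000) >> 15);
}
```

Here probe_overflow(arm_like) reports OVERFLOW_SATURATES and probe_overflow(x86_like) reports OVERFLOW_WRAPS; for every other input pair the two stand-ins return the same value.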