Floating-point rounding instructions #232

Maratyszcza · 2020-05-19T03:41:31Z

Introduction

Floating-point round-to-integer is a widely used operation, available in many software and hardware specifications:

As f32.nearest/f32.trunc/f32.ceil/f32.floor/f64.nearest/f64.trunc/f64.ceil/f64.floor scalar instructions in WebAssembly
As rint/nearbyint/trunc/ceil/floor functions in C and C++
As ROUNDPS and ROUNDPD instructions in SSE4.1
As VRINTN/VRINTZ/VRINTP/VRINTM instructions in ARMv8 AArch32
As FRINTN/FRINTZ/FRINTP/FRINTM instructions in AArch64

This PR introduces the rounding instructions in WebAssembly SIMD.

New instructions

Round to nearest integer, ties to even: f32x4.nearest/f64x2.nearest
Round to integer towards zero (truncate to integer): f32x4.trunc/f64x2.trunc
Round to integer above (ceiling): f32x4.ceil/f64x2.ceil
Round to integer below (floor): f32x4.floor/f64x2.floor

The instructions match the scalar WebAssembly analogs both in names and in semantics.

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

f32x4.nearest
- y = f32x4.nearest(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x08
f32x4.trunc
- y = f32x4.trunc(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0B
f32x4.ceil
- y = f32x4.ceil(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0A
f32x4.floor
- y = f32x4.floor(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x09
f64x2.nearest
- y = f64x2.nearest(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x08
f64x2.trunc
- y = f64x2.trunc(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0B
f64x2.ceil
- y = f64x2.ceil(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0A
f64x2.floor
- y = f64x2.floor(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x09

x86/x86-64 processors with SSE4.1 instruction set

f32x4.nearest
- y = f32x4.nearest(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x08
f32x4.trunc
- y = f32x4.trunc(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0B
f32x4.ceil
- y = f32x4.ceil(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0A
f32x4.floor
- y = f32x4.floor(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x09
f64x2.nearest
- y = f64x2.nearest(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x08
f64x2.trunc
- y = f64x2.trunc(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0B
f64x2.ceil
- y = f64x2.ceil(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0A
f64x2.floor
- y = f64x2.floor(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x09
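
The immediates 0x08 to 0x0B used in the AVX and SSE4.1 mappings above follow the ROUNDPS/ROUNDPD immediate encoding: bits 1:0 select the rounding mode, bit 2 defers to MXCSR, and bit 3 suppresses the precision (inexact) exception. The C sketch below only decodes that byte; the constant and function names are local to this example and are not taken from any header.

```c
#include <stdint.h>

/* Fields of the ROUNDPS/ROUNDPD immediate byte (names local to this sketch). */
enum {
    ROUND_MODE_MASK    = 0x03, /* bits 1:0: rounding mode */
    ROUND_NEAREST_EVEN = 0x00, /* nearest, ties to even   */
    ROUND_DOWN         = 0x01, /* toward -infinity: floor */
    ROUND_UP           = 0x02, /* toward +infinity: ceil  */
    ROUND_TOWARD_ZERO  = 0x03, /* truncate                */
    ROUND_USE_MXCSR    = 0x04, /* bit 2: use MXCSR.RC instead of bits 1:0 */
    ROUND_NO_EXC       = 0x08  /* bit 3: suppress the precision exception */
};

/* Returns the rounding mode encoded in bits 1:0 of the immediate. */
static int round_imm_mode(uint8_t imm) {
    return imm & ROUND_MODE_MASK;
}

/* Returns nonzero if the immediate suppresses the precision exception. */
static int round_imm_no_exc(uint8_t imm) {
    return (imm & ROUND_NO_EXC) != 0;
}
```

Read this way, 0x08 is nearest-even, 0x09 is floor, 0x0A is ceil, and 0x0B is truncation, each with the precision exception suppressed.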

x86/x86-64 processors with SSE2 instruction set

f32x4.nearest
- y = f32x4.nearest(x) (y is NOT x) is lowered to:
  - MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
  - CVTPS2DQ xmm_y, xmm_x
  - CVTDQ2PS xmm_tmp1, xmm_y
  - PCMPEQD xmm_y, xmm_tmp0
  - POR xmm_y, xmm_tmp0
  - ADDPS xmm_tmp0, xmm_x
  - ANDPS xmm_tmp0, xmm_y
  - ANDNPS xmm_y, xmm_tmp1
  - ORPS xmm_y, xmm_tmp0
f32x4.trunc
- y = f32x4.trunc(x) (y is NOT x) is lowered to:
  - MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
  - CVTTPS2DQ xmm_y, xmm_x
  - CVTDQ2PS xmm_tmp1, xmm_y
  - PCMPEQD xmm_y, xmm_tmp0
  - POR xmm_y, xmm_tmp0
  - ADDPS xmm_tmp0, xmm_x
  - ANDPS xmm_tmp0, xmm_y
  - ANDNPS xmm_y, xmm_tmp1
  - ORPS xmm_y, xmm_tmp0
f32x4.ceil
- x = f32x4.ceil(x) is lowered to:
  - CVTTPS2DQ xmm_tmp0, xmm_x
  - MOVDQA xmm_tmp1, wasm_splat_u32(0x80000000)
  - CVTDQ2PS xmm_tmp2, xmm_tmp0
  - PCMPEQD xmm_tmp0, xmm_tmp1
  - POR xmm_tmp0, xmm_tmp1
  - MOVDQA xmm_tmp3, xmm_tmp0
  - ANDPS xmm_tmp3, xmm_x
  - ANDNPS xmm_tmp0, xmm_tmp2
  - ORPS xmm_tmp0, xmm_tmp3
  - CMPLEPS xmm_x, xmm_tmp0
  - ORPS xmm_x, xmm_tmp1
  - MOVAPS xmm_tmp2, xmm_x
  - ANDPS xmm_tmp2, xmm_tmp0
  - ADDPS xmm_tmp0, wasm_splat_f32(1.0f)
  - ANDNPS xmm_x, xmm_tmp0
  - ORPS xmm_x, xmm_tmp2
f32x4.floor
- y = f32x4.floor(x) (y is NOT x) is lowered to:
  - MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
  - CVTTPS2DQ xmm_y, xmm_x
  - CVTDQ2PS xmm_tmp1, xmm_y
  - PCMPEQD xmm_y, xmm_tmp0
  - POR xmm_y, xmm_tmp0
  - MOVAPS xmm_tmp0, xmm_y
  - ANDPS xmm_tmp0, xmm_x
  - ANDNPS xmm_y, xmm_tmp1
  - MOVAPS xmm_tmp1, xmm_x
  - ORPS xmm_y, xmm_tmp0
  - CMPLTPS xmm_tmp1, xmm_y
  - ANDPS xmm_tmp1, wasm_splat_f32(1.0f)
  - SUBPS xmm_y, xmm_tmp1
f64x2.nearest
- y = f64x2.nearest(x) (y is NOT x) is lowered to:
  - MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
  - MOVAPS xmm_y, xmm_x
  - MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)
  - MOVAPS xmm_tmp2, xmm_tmp0
  - ANDPS xmm_y, xmm_tmp1
  - CMPLEPD xmm_tmp2, xmm_y
  - ADDPD xmm_y, xmm_tmp0
  - SUBPD xmm_y, xmm_tmp0
  - ANDNPS xmm_tmp2, xmm_tmp1
  - MOVAPS xmm_tmp1, xmm_tmp2
  - ANDNPS xmm_tmp1, xmm_x
  - ANDPS xmm_y, xmm_tmp2
  - ORPS xmm_y, xmm_tmp1
f64x2.trunc
- y = f64x2.trunc(x) (y is NOT x) is lowered to:
  - MOVAPS xmm_y, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
  - MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)
  - MOVAPS xmm_tmp1, xmm_x
  - ANDPS xmm_tmp1, xmm_y
  - MOVAPS xmm_tmp2, xmm_tmp0
  - CMPNLEPD xmm_tmp2, xmm_tmp1
  - ANDPS xmm_y, xmm_tmp2
  - MOVAPS xmm_tmp2, xmm_tmp1
  - ADDPD xmm_tmp2, xmm_tmp0
  - SUBPD xmm_tmp2, xmm_tmp0
  - CMPLTPD xmm_tmp1, xmm_tmp2
  - ANDPS xmm_tmp1, wasm_splat_f64(1.0)
  - SUBPD xmm_tmp2, xmm_tmp1
  - ANDPS xmm_tmp2, xmm_y
  - ANDNPS xmm_y, xmm_x
  - ORPS xmm_y, xmm_tmp2
f64x2.ceil
- y = f64x2.ceil(x) (y is NOT x) is lowered to:
  - MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
  - MOVAPS xmm_y, xmm_x
  - MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)
  - ANDPS xmm_y, xmm_tmp0
  - MOVAPS xmm_tmp2, xmm_tmp1
  - CMPNLEPD xmm_tmp2, xmm_y
  - ADDPD xmm_y, xmm_tmp1
  - ANDPS xmm_tmp2, xmm_tmp0
  - SUBPD xmm_y, xmm_tmp1
  - ANDPS xmm_y, xmm_tmp2
  - ANDNPS xmm_tmp2, xmm_x
  - ORPS xmm_tmp2, xmm_y
  - MOVAPS xmm_y, xmm_tmp2
  - MOVAPS xmm_tmp1, xmm_tmp2
  - CMPLTPD xmm_y, xmm_x
  - ADDPD xmm_tmp1, wasm_splat_f64(1.0)
  - ANDPS xmm_y, xmm_tmp0
  - ANDPS xmm_tmp1, xmm_y
  - ANDNPS xmm_y, xmm_tmp2
  - ORPS xmm_y, xmm_tmp1
f64x2.floor
- y = f64x2.floor(x) (y is NOT x) is lowered to:
  - MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
  - MOVAPS xmm_tmp1, xmm_x
  - MOVAPS xmm_tmp2, wasm_splat_f64(0x1.0p+52)
  - ANDPS xmm_tmp1, xmm_tmp0
  - MOVAPS xmm_y, xmm_tmp2
  - CMPNLEPD xmm_y, xmm_tmp1
  - ANDPS xmm_y, xmm_tmp0
  - ADDPD xmm_tmp1, xmm_tmp2
  - SUBPD xmm_tmp1, xmm_tmp2
  - ANDPS xmm_tmp1, xmm_y
  - ANDNPS xmm_y, xmm_x
  - MOVAPS xmm_tmp0, xmm_x
  - ORPS xmm_y, xmm_tmp1
  - CMPLTPD xmm_tmp0, xmm_y
  - ANDPS xmm_tmp0, wasm_splat_f64(1.0)
  - SUBPD xmm_y, xmm_tmp0
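
The f32x4 SSE2 sequences above appear to share one idea: convert to a 32-bit integer and back, then fall back to the original input in lanes where the conversion returned the 0x80000000 "integer indefinite" value, while always taking the sign bit from the input so that results like -0.0 survive. Below is a single-lane C sketch of the f32x4.trunc case under that reading. The helper names are hypothetical, and `cvtt_lane` only models CVTTPS2DQ, because out-of-range float-to-int casts are undefined in C.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

static uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Models one lane of CVTTPS2DQ: NaN and out-of-range inputs give 0x80000000. */
static int32_t cvtt_lane(float x) {
    if (!(x > -0x1p31f && x < 0x1p31f)) return INT32_MIN;
    return (int32_t)x; /* in range, so the cast truncates toward zero */
}

/* One lane of the trunc emulation (hypothetical helper). */
static float trunc_lane(float x) {
    const uint32_t sign = UINT32_C(0x80000000);
    int32_t i = cvtt_lane(x);                              /* CVTTPS2DQ */
    float r = (float)i;                                    /* CVTDQ2PS  */
    uint32_t mask = ((uint32_t)i == sign) ? UINT32_MAX : 0; /* PCMPEQD  */
    mask |= sign;                                          /* POR: sign always comes from x */
    uint32_t xq = f32_bits(x + bits_f32(sign));            /* ADDPS with -0.0 keeps x's value */
    /* ANDPS / ANDNPS / ORPS: bitwise select between x and the converted value */
    return bits_f32((xq & mask) | (f32_bits(r) & ~mask));
}
```

Inputs whose magnitude is 2^31 or more are already integers, so passing them through unchanged is correct. The ceil and floor sequences add a compare and a conditional adjustment by 1.0 on top of this.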

ARM64 processors

f32x4.nearest
- y = f32x4.nearest(x) is lowered to FRINTN Vy.4S, Vx.4S
f32x4.trunc
- y = f32x4.trunc(x) is lowered to FRINTZ Vy.4S, Vx.4S
f32x4.ceil
- y = f32x4.ceil(x) is lowered to FRINTP Vy.4S, Vx.4S
f32x4.floor
- y = f32x4.floor(x) is lowered to FRINTM Vy.4S, Vx.4S
f64x2.nearest
- y = f64x2.nearest(x) is lowered to FRINTN Vy.2D, Vx.2D
f64x2.trunc
- y = f64x2.trunc(x) is lowered to FRINTZ Vy.2D, Vx.2D
f64x2.ceil
- y = f64x2.ceil(x) is lowered to FRINTP Vy.2D, Vx.2D
f64x2.floor
- y = f64x2.floor(x) is lowered to FRINTM Vy.2D, Vx.2D

ARM processors with ARMv8 (32-bit) instruction set

f32x4.nearest
- y = f32x4.nearest(x) is lowered to VRINTN.F32 Qy, Qx
f32x4.trunc
- y = f32x4.trunc(x) is lowered to VRINTZ.F32 Qy, Qx
f32x4.ceil
- y = f32x4.ceil(x) is lowered to VRINTP.F32 Qy, Qx
f32x4.floor
- y = f32x4.floor(x) is lowered to VRINTM.F32 Qy, Qx
f64x2.nearest
- y = f64x2.nearest(x) is lowered to VRINTN.F64 Dy_lo, Dx_lo + VRINTN.F64 Dy_hi, Dx_hi
f64x2.trunc
- y = f64x2.trunc(x) is lowered to VRINTZ.F64 Dy_lo, Dx_lo + VRINTZ.F64 Dy_hi, Dx_hi
f64x2.ceil
- y = f64x2.ceil(x) is lowered to VRINTP.F64 Dy_lo, Dx_lo + VRINTP.F64 Dy_hi, Dx_hi
f64x2.floor
- y = f64x2.floor(x) is lowered to VRINTM.F64 Dy_lo, Dx_lo + VRINTM.F64 Dy_hi, Dx_hi

ARM processors with ARMv7 (32-bit) instruction set

f32x4.nearest
- y = f32x4.nearest(x) (y is NOT x) is lowered to:
  - VMOV.I32 Qtmp0, 0x4B000000
  - VABS.F32 Qtmp1, Qx
  - VACGT.F32 Qy, Qx, Qtmp0
  - VADD.F32 Qtmp1, Qtmp1, Qtmp0
  - VORR.I32 Qy, 0x80000000
  - VSUB.F32 Qtmp1, Qtmp1, Qtmp0
  - VBSL Qy, Qx, Qtmp1
f32x4.trunc
- y = f32x4.trunc(x) (y is NOT x) is lowered to:
  - VCVT.S32.F32 Qtmp0, Qx
  - VMOV.I32 Qtmp1, 0x4B000000
  - VACGT.F32 Qy, Qtmp1, Qx
  - VCVT.F32.S32 Qtmp0, Qtmp0
  - VBIC.I32 Qy, 0x80000000
  - VBSL Qy, Qtmp0, Qx
f32x4.ceil
- y = f32x4.ceil(x) (y is NOT x) is lowered to:
  - VCVT.S32.F32 Qtmp0, Qx
  - VMOV.I32 Qtmp1, 0x4B000000
  - VACGT.F32 Qtmp1, Qtmp1, Qx
  - VCVT.F32.S32 Qtmp0, Qtmp0
  - VBIC.I32 Qtmp1, 0x80000000
  - VBSL Qtmp1, Qtmp0, Qx
  - VMOV.F32 Qtmp0, 0x3F800000
  - VCGE.F32 Qy, Qtmp1, Qx
  - VADD.F32 Qtmp0, Qtmp1, Qtmp0
  - VORR.I32 Qy, 0x80000000
  - VBSL Qy, Qtmp1, Qtmp0
f32x4.floor
- y = f32x4.floor(x) (y is NOT x) is lowered to:
  - VCVT.S32.F32 Qtmp0, Qx
  - VMOV.I32 Qtmp1, 0x4B000000
  - VACGT.F32 Qy, Qtmp1, Qx
  - VCVT.F32.S32 Qtmp0, Qtmp0
  - VBIC.I32 Qy, 0x80000000
  - VBSL Qy, Qtmp0, Qx
  - VMOV.F32 Qtmp1, 0x3F800000
  - VCGT.F32 Qtmp0, Qy, Qx
  - VAND Qtmp0, Qtmp0, Qtmp1
  - VSUB.F32 Qy, Qy, Qtmp0
f64x2.nearest
- y = f64x2.nearest(x) (y is NOT x) is lowered to:
  - VABS.F64 Dy_lo, Dx_lo
  - VABS.F64 Dy_hi, Dx_hi
  - VLDR Dtmp0, 0x1.0p+52
  - VSUB.F64 Dtmp1_lo, Dtmp0, Dy_lo
  - VSUB.F64 Dtmp1_hi, Dtmp0, Dy_hi
  - VADD.F64 Dtmp2_lo, Dy_lo, Dtmp0
  - VADD.F64 Dtmp2_hi, Dy_hi, Dtmp0
  - VEOR Qy, Qx, Qy
  - VSHR.S64 Qtmp1, Qtmp1, 63
  - VSUB.F64 Dtmp2_lo, Dtmp2_lo, Dtmp0
  - VSUB.F64 Dtmp2_hi, Dtmp2_hi, Dtmp0
  - VORR Qy, Qy, Qtmp1
  - VBSL Qy, Qx, Qtmp2
f64x2.trunc
- y = f64x2.trunc(x) (y is NOT x) is lowered to:
  - VLDR Dtmp0, 0x1.0p+52
  - VABS.F64 Qy_lo, Dx_lo
  - VABS.F64 Qy_hi, Dx_hi
  - VADD.F64 Dtmp1_lo, Qy_lo, Dtmp0
  - VADD.F64 Dtmp1_hi, Qy_hi, Dtmp0
  - VSUB.F64 Dtmp2_lo, Dtmp0, Qy_lo
  - VSUB.F64 Dtmp2_hi, Dtmp0, Qy_hi
  - VEOR Qtmp3, Qy, Qx
  - VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
  - VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
  - VLDR Dtmp0, 1.0
  - VSHR.S64 Qtmp2, Qtmp2, 63
  - VORR Qtmp3, Qtmp3, Qtmp2
  - VSUB.I64 Qy, Qy, Qtmp1
  - VSHR.S64 Qy, Qy, 63
  - VAND Qy_lo, Qy_lo, Dtmp0
  - VAND Qy_hi, Qy_hi, Dtmp0
  - VSUB.F64 Qy_lo, Dtmp1_lo, Qy
  - VSUB.F64 Qy_hi, Dtmp1_hi, Qx
  - VBIT Qy, Qx, Qtmp3
f64x2.ceil
- y = f64x2.ceil(x) (y is NOT x) is lowered to:
  - VLDR Dtmp0, 0x1.0p+52
  - VABS.F64 Dtmp1_lo, Dx_lo
  - VABS.F64 Dtmp1_hi, Dx_hi
  - VSUB.F64 Dtmp2_lo, Dtmp0, Dtmp1_lo
  - VSUB.F64 Dtmp2_hi, Dtmp0, Dtmp1_hi
  - VADD.F64 Dtmp3_lo, Dtmp1_lo, Dtmp0
  - VADD.F64 Dtmp3_hi, Dtmp1_hi, Dtmp0
  - VEOR Qtmp1, Qtmp1, Qx
  - VSHR.S64 Qtmp2, Qtmp2, 63
  - VSUB.F64 Dtmp3_lo, Dtmp3_lo, Dtmp0
  - VSUB.F64 Dtmp3_hi, Dtmp3_hi, Dtmp0
  - VLDR Dtmp0, 1.0
  - VORR Qtmp2, Qtmp2, Qtmp1
  - VBSL Qtmp2, Qx, Qtmp3
  - VSUB.F64 Dy_lo, Dtmp2_lo, Dx_lo
  - VSUB.F64 Dy_hi, Dtmp2_hi, Dx_hi
  - VADD.F64 Dtmp3_lo, Dtmp2_lo, Dtmp0
  - VADD.F64 Dtmp3_hi, Dtmp2_hi, Dtmp0
  - VSHR.S64 Qy, Qy, 63
  - VBIC Qy, Qy, Qtmp1
  - VBSL Qy, Qtmp3, Qtmp2
f64x2.floor
- y = f64x2.floor(x) (y is NOT x) is lowered to:
  - VLDR Dtmp0, 0x1.0p+52
  - VABS.F64 Dy_lo, Dx_lo
  - VABS.F64 Dy_hi, Dx_hi
  - VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0
  - VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0
  - VSUB.I64 Dtmp2_lo, Dtmp0, Dy_lo
  - VSUB.I64 Dtmp2_hi, Dtmp0, Dy_hi
  - VEOR Qy, Qy, Qx
  - VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
  - VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
  - VLDR Dtmp0, 1.0
  - VSHR.S64 Qtmp2, Qtmp2, 63
  - VORR Qy, Qy, Qtmp2
  - VBSL Qy, Qx, Qtmp1
  - VSUB.F64 Dx_lo, Dx_lo, Dy_lo
  - VSUB.F64 Dx_hi, Dx_hi, Dy_hi
  - VSHR.S64 Qtmp2, Qx, 63
  - VAND Dtmp2_lo, Dtmp2_lo, Dtmp0
  - VAND Dtmp2_hi, Dtmp2_hi, Dtmp0
  - VSUB.F64 Dy_lo, Dy_lo, Dtmp2_lo
  - VSUB.F64 Dy_hi, Dy_hi, Dtmp2_hi
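
The f64x2 sequences above (both SSE2 and ARMv7) lean on the constant 0x1.0p+52: adding and then subtracting 2^52 from |x| rounds it to an integer, because doubles at that magnitude have no fractional bits. The following single-lane C sketch shows a floor built on that idea. It is a simplified model with a hypothetical function name, not a transcription of either instruction sequence, and it assumes round-to-nearest mode and strict IEEE arithmetic.

```c
#include <math.h>

/* One lane of a floor emulation based on the 2^52 trick (hypothetical helper). */
static double floor_lane(double x) {
    const double magic = 0x1p52;
    double ax = fabs(x);
    /* volatile keeps the compiler from simplifying (ax + magic) - magic */
    volatile double sum = ax + magic;
    double r = sum - magic;          /* |x| rounded to nearest, valid if |x| < 2^52 */
    /* Restore the sign; pass large values, infinities and NaN through unchanged. */
    r = (ax < magic) ? copysign(r, x) : x;
    /* Nearest may have rounded up; step down by one in that case. */
    if (r > x) r -= 1.0;
    return r;
}
```

A ceil variant would instead add 1.0 when the rounded value is below x, and trunc can be built the same way on |x| before the sign is restored.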

tlively · 2020-05-21T03:47:28Z

Yikes, the new numbering only has room for one rounding instruction. We'll have to figure out what to do about that in the long term. Meanwhile, @dtig and @ngzhian do you have a preference about which opcodes to use to prototype this?

ngzhian · 2020-05-21T17:37:20Z

No preferences for prototyping, we can probably squeeze them into
0xdc-0xdf
0xe2, 0xee, 0xf8, 0xf9
for now.

dtig · 2020-05-21T18:00:10Z

No strong preferences either. It's somewhat awkward, but we could also use something in the range 0xc2-0xca if contiguous opcodes make this simpler, because I don't see the 64x2 AnyTrue/AllTrue and the widen/narrowing instructions being relevant for 64x2 operations going forward.

If we do have to spill over, it's not terrible but we can make that call when we decide to move past prototyping.

richgel999 · 2020-05-28T03:11:40Z

These instructions aren't optional IMO. They're fundamental operations. Having to emulate them will be quite painful for many SIMD/SPMD kernels and vectorized math functions.

I have a Perlin noise kernel that computes 24 floors per output pixel:
https://t.co/u9w35T6oTq?amp=1

In another example, I have a vectorized approximate math library. It can compute vectorized tan, sin, cos, log, exp, etc. It uses floor and round for range reduction:
https://t.co/3JlYyZ2oMI?amp=1

Without efficient round/floor/trunc, WebAssembly SIMD will be in the same position SSE2 is relative to SSE4.1. When we execute kernels on SSE2, we commonly get a 15-20% reduction in performance due to having to emulate round/floor/trunc on some kernels, or if they call sin/cos/tan/etc. These are very important operations.

I am currently porting CppSPMD_Fast to WebAssembly, and the lack of efficient round/floor/trunc is going to hurt some kernels by quite a bit. I should have it up and running in 2-3 days.

zeux · 2020-05-29T01:51:01Z

Worth noting is that the common way to emulate round/floor/trunc includes conversions back & forth to integers (obviously this is application-dependent as it assumes a specific range and is typically non-IEEE compliant for some operations); however, due to #173 this workaround is going to be slow.

If the inputs are known to be within a 23-bit integer range or thereabouts, floating-point addition can be abused to round, and it's probably possible to implement floor etc. in a similar fashion, but that route doesn't seem like one we would want to recommend.

Marc-B-Reynolds · 2020-05-29T07:31:52Z

If the inputs are known to be within a 23-bit integer range or thereabouts, floating-point addition can be abused to round, and it's probably possible to implement floor etc. in a similar fashion, but that route doesn't seem like one we would want to recommend.

Worth noting that this stops working if FP rules are relaxed: (x+K)-K gets transformed to x.
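
For readers unfamiliar with the addition trick discussed in this exchange, here is a minimal C sketch for one f32 lane with K = 2^23. The function name is hypothetical, and the code assumes round-to-nearest mode. The `volatile` is only there to stop a compiler with relaxed floating-point rules from folding (x+K)-K back into x, which is exactly the failure mode mentioned above.

```c
#include <math.h>

/* Round one float to the nearest integer, ties to even, via (|x| + K) - K. */
static float nearest_lane(float x) {
    const float k = 0x1p23f;        /* 8388608.0f: floats this large have no fraction bits */
    float ax = fabsf(x);
    volatile float sum = ax + k;    /* the rounding happens in this addition */
    float r = sum - k;
    if (!(ax < k)) r = ax;          /* already an integer, or infinity/NaN: keep as-is */
    return copysignf(r, x);         /* restore the sign, so -0.4f becomes -0.0f */
}
```

Without the magnitude check, the trick is only valid for inputs below 2^23 in magnitude, which matches the 23-bit range caveat.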

ngzhian · 2020-05-29T23:16:20Z

@Maratyszcza any suggestions for ARM v7 instruction sequence? It will probably look a lot like the x86 SSE2 one?

SIMD equivalents of the nearest/trunc/ceil/floor instructions

Maratyszcza · 2020-05-31T08:01:38Z

Updated opcodes post-renumbering, put into 0xd8-0xdf range

Maratyszcza · 2020-06-01T03:37:49Z

Mapping to SSE2 is finished. @ngzhian ARMv7 NEON is quite different, because of its unique features:

Compare absolute values instruction
Single-instruction bitwise selection (VBSL/VBIT/VBIF)
Bitwise OR and bitwise AND instructions with immediate values

Maratyszcza · 2020-06-01T08:28:40Z

Added ARMv7 NEON mapping for f32 instructions

ngzhian · 2020-06-01T16:58:48Z

There's some magic going on there. Thanks Marat!

Maratyszcza · 2020-06-02T07:55:04Z

All instruction mappings are finished, and the PR is ready for review.

tlively

It would be good to change the order of instructions to be consistent with their corresponding MVP instructions.

dtig · 2020-06-09T00:08:41Z

Thanks @Maratyszcza for filing the issues. Moving this to prototyping: on all platforms that we currently use as a baseline, these have a direct mapping to instructions. On ARMv7 there is a precedent for them being slow, as this is also the case for the scalar versions of these operations, and some implementations call out to the runtime to implement them. Moving to pending prototype data as we are prototyping them in V8, and adding a retroactive label update.

Summary: As specified in WebAssembly/simd#232. These instructions are implemented as LLVM intrinsics for now rather than normal ISel patterns to make these instructions opt-in. Once the instructions are merged to the spec proposal, the intrinsics will be replaced with proper ISel patterns. Reviewers: aheejin Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D81222

tlively · 2020-06-11T03:15:22Z

These will be available in the next version of Emscripten via __builtin_wasm_{ceil,floor,trunc,nearest}_{f32x4,f64x2}.

ngzhian · 2020-06-16T00:25:51Z

Prototype in V8 is done for x64, ia32, ARM64. Still working on ARM.
Update: 2020-06-30, prototype on ARM is done as of https://crrev.com/8e54afbe2499cefbccda7ab8a9786451b57db961

Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic. For spec and cpu mapping details, see: WebAssembly/simd#122 (pmax/pmin) WebAssembly/simd#232 (rounding) WebAssembly/simd#127 (dot product) WebAssembly/simd#237 (load zero) The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest. Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination. Differential Revision: https://phabricator.services.mozilla.com/D85982 UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3

Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic. For spec and cpu mapping details, see: WebAssembly/simd#122 (pmax/pmin) WebAssembly/simd#232 (rounding) WebAssembly/simd#127 (dot product) WebAssembly/simd#237 (load zero) The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest. Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination. Differential Revision: https://phabricator.services.mozilla.com/D85982 UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351

ngzhian · 2020-09-11T17:47:51Z

This has been accepted into the proposal [0] during the sync on 2020-09-04. This LGTM, as it is.

Note, I would like https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md to be updated too, but it requires more tweaks (since there is a bit of a collision in opcodes for these instructions and the "reserved ones" under i64x2, and also ordering of instructions for presentation). But that's not a big problem, and can be worked on in the future.

[0] https://docs.google.com/document/d/138cF6aOUa9RZC2tOR7AhlIQWdmX5EtpzXRTVDAN3bfo/edit# see "4. Floating point rounding"

Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag. [0] WebAssembly/simd#232 Bug: v8:10906 Change-Id: I44ec21dd09f3bf7cf3cae5d35f70f9d2c178c4e4 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2406547 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#69923}

Port 068cf20 Original Commit Message: Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag. [0] WebAssembly/simd#232 R=zhin@chromium.org, joransiu@ca.ibm.com, jyan@ca.ibm.com, michael_dawson@ca.ibm.com BUG= LOG=N Change-Id: I02086255f635f1d47586fc74dd754426f6beccb0 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2411675 Reviewed-by: Milad Farazmand <mfarazma@redhat.com> Reviewed-by: Junliang Yan <junyan@redhat.com> Commit-Queue: Milad Farazmand <mfarazma@redhat.com> Cr-Commit-Position: refs/heads/master@{#69925}

…status. r=jseward Background: WebAssembly/simd#232 For all the rounding SIMD instructions: - remove the internal 'Experimental' opcode suffix in the C++ code - remove the guard on experimental Wasm instructions in all the C++ decoders - move the test cases from simd/experimental.js to simd/ad-hack.js I have checked that current V8 and wasm-tools use the same opcode mappings. V8 in turn guarantees the correct mapping for LLVM and binaryen. Drive-by bug fix: the test predicate for f64 square root was wrong, it would round its argument to float. This did not matter for the test inputs we had but started to matter when I added more difficult inputs for testing rounding. Differential Revision: https://phabricator.services.mozilla.com/D92926

…structions This patch implements, for aarch64, the following wasm SIMD extensions Floating-point rounding instructions WebAssembly/simd#232 Pseudo-Minimum and Pseudo-Maximum instructions WebAssembly/simd#122 The changes are straightforward: * `build.rs`: the relevant tests have been enabled * `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions `fmin_pseudo` and `fmax_pseudo`. The wasm rounding instructions do not need any new CLIF instructions. * `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is pretty much the same as any other unary or binary vector instruction (for the rounding and the pmin/max respectively) * `cranelift/codegen/src/isa/aarch64/lower_inst.rs`: - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction sequence, `fcmpgt` followed by `bsl` - the CLIF rounding instructions are converted to a suitable vector `frint{n,z,p,m}` instruction. * `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub enum VecMisc2` to handle the rounding operations. And corresponding `emit` cases.

ngzhian · 2021-02-06T01:39:46Z

@tlively this wasn't added to NewOpcodes.md, just fyi in case you are looking at that doc for opcode organization.

tlively · 2021-02-06T03:37:54Z

Oh, thanks for pointing that out. I had indeed missed them.

Maratyszcza force-pushed the round branch from fc3aeeb to edf4a31 Compare May 19, 2020 03:42

dtig mentioned this pull request May 20, 2020

f32x4 = roundXX(f32x4)? #177

Closed

dtig linked an issue May 20, 2020 that may be closed by this pull request

f32x4 = roundXX(f32x4)? #177

Closed

Floating-point rounding instructions

de9189d

SIMD equivalents of the nearest/trunc/ceil/floor instructions

Maratyszcza force-pushed the round branch from edf4a31 to de9189d Compare May 31, 2020 08:00

Maratyszcza changed the title ~~[WIP] Floating-point rounding instructions~~ Floating-point rounding instructions Jun 2, 2020

tlively reviewed Jun 4, 2020

Maratyszcza and others added 3 commits June 4, 2020 12:06

Update proposals/simd/BinarySIMD.md

eb1de5f

Co-authored-by: Thomas Lively <7121787+tlively@users.noreply.github.com>

Update proposals/simd/ImplementationStatus.md

80d19f8

Co-authored-by: Thomas Lively <7121787+tlively@users.noreply.github.com>

Update proposals/simd/ImplementationStatus.md

dd43218

Co-authored-by: Thomas Lively <7121787+tlively@users.noreply.github.com>

tlively added a commit to tlively/binaryen that referenced this pull request Jun 4, 2020

Add prototype SIMD rounding instructions

8816cc5

As specified in WebAssembly/simd#232.

tlively mentioned this pull request Jun 4, 2020

Add prototype SIMD rounding instructions WebAssembly/binaryen#2895

Merged

tlively added a commit to WebAssembly/binaryen that referenced this pull request Jun 5, 2020

Add prototype SIMD rounding instructions (#2895)

037d7a5

As specified in WebAssembly/simd#232.

dtig added the pending prototype data label Jun 9, 2020

ngzhian mentioned this pull request Aug 25, 2020

Agenda for Sync meeting 09/04/20 (?) #323

Closed

ngzhian removed the pending prototype data label Sep 11, 2020

ngzhian reviewed Sep 11, 2020

Update proposals/simd/SIMD.md

70aba5e

Co-authored-by: Thomas Lively <7121787+tlively@users.noreply.github.com>

Maratyszcza force-pushed the round branch from 4604d0d to 70aba5e Compare September 11, 2020 18:31

ngzhian merged commit 8e87db7 into WebAssembly:master Sep 11, 2020

julian-seward1 mentioned this pull request Oct 23, 2020

CL/aarch64: implement the wasm SIMD pseudo-max/min and FP-rounding in… bytecodealliance/wasmtime#2312

Merged

tlively added a commit that referenced this pull request Feb 6, 2021

Add ops from #232 to NewOpcodes.md

7633dd0

ngzhian mentioned this pull request Feb 18, 2021

Include floating-point rounding instructions #7

Closed
