Conversation
No preferences for prototyping; we can probably squeeze them into
No strong preferences either. It's somewhat awkward, but we could also do something in the range of 0xc2-0xca if contiguous opcodes make this simpler, because I don't see the 64x2 AnyTrue/AllTrue and the widening/narrowing instructions being relevant for 64x2 operations going forward. If we do have to spill over, it's not terrible, but we can make that call when we decide to move past prototyping.
These instructions aren't optional IMO; they're fundamental operations. Having to emulate them will be quite painful for many SIMD/SPMD kernels and vectorized math functions. I have a Perlin noise kernel that computes 24 floors per output pixel. In another example, I have a vectorized approximate math library that computes vectorized tan, sin, cos, log, exp, etc., and uses floor and round for range reduction. Without efficient round/floor/trunc, WebAssembly SIMD will be in the same position SSE2 is relative to SSE4.1: when we execute kernels on SSE2, we commonly see a 15-20% reduction in performance from having to emulate round/floor/trunc in some kernels, or when they call sin/cos/tan/etc. These are very important operations. I am currently porting CppSPMD_Fast to WebAssembly, and the lack of efficient round/floor/trunc is going to hurt some kernels by quite a bit. I should have it up and running in 2-3 days.
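To make the range-reduction point concrete, here is an illustrative sketch (not taken from the library mentioned above) of how an approximate, vectorizable sine typically uses round-to-nearest; in SIMD code the `nearbyintf` call becomes a single `f32x4.nearest`:

```c
#include <math.h>

/* Illustrative only: classic range reduction for an approximate sine.
 * The argument is folded into roughly [-pi, pi] by rounding x/(2*pi) to the
 * nearest integer; a vectorized version does the same with f32x4.nearest,
 * so a slow emulation is paid on every call. */
static float sin_approx(float x) {
    const float two_pi = 6.283185307f;
    float k = nearbyintf(x * (1.0f / two_pi)); /* -> f32x4.nearest in SIMD code */
    float r = x - k * two_pi;                  /* reduced argument */
    /* ...evaluate a polynomial approximation of sin on the reduced range;
     * a crude two-term Taylor stand-in keeps this sketch short. */
    return r - (r * r * r) * (1.0f / 6.0f);
}
```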
Worth noting is that the common way to emulate round/floor/trunc involves conversions back and forth to integers (obviously this is application-dependent, as it assumes a specific range and is typically non-IEEE-compliant for some operations); however, due to #173 this workaround is going to be slow. If the inputs are known to be within a 23-bit integer range or thereabouts, floating-point addition can be abused to round, and it is probably possible to implement floor etc. in a similar fashion, but that route doesn't seem like one we would want to recommend.
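A minimal sketch of the floating-point-addition workaround described above, under the stated assumption that inputs stay comfortably inside the 23-bit range (this is not the proposal's recommended lowering):

```c
/* Adding and then subtracting 2^23 forces the FPU to drop the fraction bits
 * of a float under the current rounding mode (round-to-nearest-even by
 * default). Only valid when |x| is well below 2^23, loses the sign of -0.0,
 * and a compiler allowed to relax FP rules may fold the add/sub pair away. */
static float round_via_magic_add(float x) {
    const float magic = 0x1.0p23f; /* 8388608.0f */
    return (x + magic) - magic;
}
```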
Worth noting that this stops working if FP rules are relaxed.
@Maratyszcza, any suggestions for an ARMv7 instruction sequence? It will probably look a lot like the x86 SSE2 one?
SIMD equivalents of the nearest/trunc/ceil/floor instructions
Updated opcodes post-renumbering; put them into the 0xd8-0xdf range.
Mapping to SSE2 is finished. @ngzhian ARMv7 NEON is quite different because of its unique features.
Added ARMv7 NEON mapping.
There's some magic going on there. Thanks Marat!
All instruction mappings are finished, and the PR is ready for review.
It would be good to change the order of instructions to be consistent with their corresponding MVP instructions.
Co-authored-by: Thomas Lively <7121787+tlively@users.noreply.github.com>
As specified in WebAssembly/simd#232.
Thanks @Maratyszcza for filing the issues. Moving this to prototyping: on all platforms we currently use as a baseline, these have a direct mapping to instructions, and on ARMv7 there is precedent for them being slow, since that is already the case for the scalar versions of these operations (some implementations call out to the runtime to implement them). Moving to pending prototype data as we are prototyping them in V8; adding a retroactive label update.
Summary: As specified in WebAssembly/simd#232. These instructions are implemented as LLVM intrinsics for now rather than normal ISel patterns to make these instructions opt-in. Once the instructions are merged to the spec proposal, the intrinsics will be replaced with proper ISel patterns.

Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D81222
These will be available in the next version of Emscripten.
The prototype in V8 is done for x64, ia32, and ARM64. Still working on ARM.
This has been accepted into the proposal [0] during the sync on 2020-09-04. This LGTM as it is. Note that I would like https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md to be updated too, but that requires more tweaks (there is a bit of a collision in opcodes between these instructions and the "reserved" ones under i64x2, and also in the ordering of instructions for presentation). That's not a big problem, though, and can be worked on in the future.

[0] https://docs.google.com/document/d/138cF6aOUa9RZC2tOR7AhlIQWdmX5EtpzXRTVDAN3bfo/edit# — see "4. Floating point rounding"
Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag.

[0] WebAssembly/simd#232

Bug: v8:10906
Change-Id: I44ec21dd09f3bf7cf3cae5d35f70f9d2c178c4e4
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2406547
Commit-Queue: Zhi An Ng <zhin@chromium.org>
Reviewed-by: Bill Budge <bbudge@chromium.org>
Cr-Commit-Position: refs/heads/master@{#69923}
Port 068cf20

Original Commit Message:
Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag.
[0] WebAssembly/simd#232

R=zhin@chromium.org, joransiu@ca.ibm.com, jyan@ca.ibm.com, michael_dawson@ca.ibm.com
BUG=
LOG=N
Change-Id: I02086255f635f1d47586fc74dd754426f6beccb0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2411675
Reviewed-by: Milad Farazmand <mfarazma@redhat.com>
Reviewed-by: Junliang Yan <junyan@redhat.com>
Commit-Queue: Milad Farazmand <mfarazma@redhat.com>
Cr-Commit-Position: refs/heads/master@{#69925}
…status. r=jseward

Background: WebAssembly/simd#232

For all the rounding SIMD instructions:
- remove the internal 'Experimental' opcode suffix in the C++ code
- remove the guard on experimental Wasm instructions in all the C++ decoders
- move the test cases from simd/experimental.js to simd/ad-hack.js

I have checked that current V8 and wasm-tools use the same opcode mappings. V8 in turn guarantees the correct mapping for LLVM and binaryen.

Drive-by bug fix: the test predicate for f64 square root was wrong, it would round its argument to float. This did not matter for the test inputs we had but started to matter when I added more difficult inputs for testing rounding.

Differential Revision: https://phabricator.services.mozilla.com/D92926
…structions

This patch implements, for aarch64, the following wasm SIMD extensions:
- Floating-point rounding instructions, WebAssembly/simd#232
- Pseudo-Minimum and Pseudo-Maximum instructions, WebAssembly/simd#122

The changes are straightforward:
* `build.rs`: the relevant tests have been enabled
* `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions `fmin_pseudo` and `fmax_pseudo`. The wasm rounding instructions do not need any new CLIF instructions.
* `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is pretty much the same as any other unary or binary vector instruction (for the rounding and the pmin/max respectively)
* `cranelift/codegen/src/isa/aarch64/lower_inst.rs`:
  - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction sequence, `fcmpgt` followed by `bsl`
  - the CLIF rounding instructions are converted to a suitable vector `frint{n,z,p,m}` instruction.
* `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub enum VecMisc2` to handle the rounding operations. And corresponding `emit` cases.
Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic.

For spec and cpu mapping details, see:
WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
@tlively this wasn't added to NewOpcodes.md, just FYI in case you are looking at that doc for opcode organization.
Oh, thanks for pointing that out. I had indeed missed them.
Introduction
Floating-point round-to-integer is a widely used operation, available in many software and hardware specifications:
- f32.nearest/f32.trunc/f32.ceil/f32.floor/f64.nearest/f64.trunc/f64.ceil/f64.floor scalar instructions in WebAssembly
- rint/nearbyint/trunc/ceil/floor functions in C and C++
- ROUNDPS and ROUNDPD instructions in SSE4.1
- VRINTN/VRINTZ/VRINTP/VRINTM instructions in ARMv8 AArch32
- FRINTN/FRINTZ/FRINTP/FRINTM instructions in AArch64

This PR introduces the rounding instructions in WebAssembly SIMD.
New instructions
- f32x4.nearest/f64x2.nearest
- f32x4.trunc/f64x2.trunc
- f32x4.ceil/f64x2.ceil
- f32x4.floor/f64x2.floor
The instructions match the scalar WebAssembly analogs both in names and in semantics.
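For illustration only, here is how these operations might be used from C with clang/Emscripten SIMD support; the wasm_f32x4_floor, wasm_v128_load, and wasm_v128_store intrinsic names are an assumption relative to this proposal text (they are the names the wasm_simd128.h header later adopted), and each call is intended to compile to the single corresponding instruction:

```c
#include <stddef.h>
#include <wasm_simd128.h>

/* Floor every element of an array, four floats at a time.
 * Compile with: clang --target=wasm32 -msimd128 (or via emcc). */
void floor_f32(float *dst, const float *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t v = wasm_v128_load(src + i);
        wasm_v128_store(dst + i, wasm_f32x4_floor(v)); /* one f32x4.floor */
    }
    for (; i < n; i++)
        dst[i] = __builtin_floorf(src[i]);             /* scalar f32.floor tail */
}
```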
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
- y = f32x4.nearest(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x08
- y = f32x4.trunc(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0B
- y = f32x4.ceil(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0A
- y = f32x4.floor(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x09
- y = f64x2.nearest(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x08
- y = f64x2.trunc(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0B
- y = f64x2.ceil(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0A
- y = f64x2.floor(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x09
x86/x86-64 processors with SSE4.1 instruction set
- y = f32x4.nearest(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x08
- y = f32x4.trunc(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0B
- y = f32x4.ceil(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0A
- y = f32x4.floor(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x09
- y = f64x2.nearest(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x08
- y = f64x2.trunc(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0B
- y = f64x2.ceil(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0A
- y = f64x2.floor(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x09
x86/x86-64 processors with SSE2 instruction set
- y = f32x4.nearest(x) (y is NOT x) is lowered to:
MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
CVTPS2DQ xmm_y, xmm_x
CVTDQ2PS xmm_tmp1, xmm_y
PCMPEQD xmm_y, xmm_tmp0
POR xmm_y, xmm_tmp0
ADDPS xmm_tmp0, xmm_x
ANDPS xmm_tmp0, xmm_y
ANDNPS xmm_y, xmm_tmp1
ORPS xmm_y, xmm_tmp0
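For readers less comfortable with raw SSE2, the following C intrinsics sketch shows the idea behind the sequence above; it is an illustration under the default MXCSR rounding mode (round-to-nearest-even), not an exact register-for-register translation:

```c
#include <emmintrin.h> /* SSE2 */

/* CVTPS2DQ rounds to nearest-even and returns 0x80000000 for lanes that are
 * NaN or do not fit in int32, so such lanes are detected and the original
 * input is passed through unchanged (it is already integral, or NaN). The
 * input's sign bit is OR-ed back in so that e.g. -0.25 rounds to -0.0. */
static __m128 f32x4_nearest_sse2_sketch(__m128 x) {
    const __m128i sentinel = _mm_set1_epi32((int)0x80000000u);
    __m128i i    = _mm_cvtps_epi32(x);                             /* round to nearest-even */
    __m128  r    = _mm_cvtepi32_ps(i);                             /* candidate result */
    __m128  big  = _mm_castsi128_ps(_mm_cmpeq_epi32(i, sentinel)); /* overflow / NaN lanes */
    __m128  sign = _mm_and_ps(x, _mm_castsi128_ps(sentinel));      /* sign bit of x */
    __m128  out  = _mm_or_ps(_mm_and_ps(big, x),                   /* keep x where cvt saturated */
                             _mm_andnot_ps(big, r));               /* else the rounded value */
    return _mm_or_ps(out, sign);                                   /* restore the sign of zero */
}
```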
- y = f32x4.trunc(x) (y is NOT x) is lowered to:
MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
CVTTPS2DQ xmm_y, xmm_x
CVTDQ2PS xmm_tmp1, xmm_y
PCMPEQD xmm_y, xmm_tmp0
POR xmm_y, xmm_tmp0
ADDPS xmm_tmp0, xmm_x
ANDPS xmm_tmp0, xmm_y
ANDNPS xmm_y, xmm_tmp1
ORPS xmm_y, xmm_tmp0
- x = f32x4.ceil(x) is lowered to:
CVTTPS2DQ xmm_tmp0, xmm_x
MOVDQA xmm_tmp1, wasm_splat_u32(0x80000000)
CVTDQ2PS xmm_tmp2, xmm_tmp0
PCMPEQD xmm_tmp0, xmm_tmp1
POR xmm_tmp0, xmm_tmp1
MOVDQA xmm_tmp3, xmm_tmp0
ANDPS xmm_tmp3, xmm_x
ANDNPS xmm_tmp0, xmm_tmp2
ORPS xmm_tmp0, xmm_tmp3
CMPLEPS xmm_x, xmm_tmp0
ORPS xmm_x, xmm_tmp1
MOVAPS xmm_tmp2, xmm_x
ANDPS xmm_tmp2, xmm_tmp0
ADDPS xmm_tmp0, wasm_splat_f32(1.0f)
ANDNPS xmm_x, xmm_tmp0
ORPS xmm_x, xmm_tmp2
- y = f32x4.floor(x) (y is NOT x) is lowered to:
MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
CVTTPS2DQ xmm_y, xmm_x
CVTDQ2PS xmm_tmp1, xmm_y
PCMPEQD xmm_y, xmm_tmp0
POR xmm_y, xmm_tmp0
MOVAPS xmm_tmp0, xmm_y
ANDPS xmm_tmp0, xmm_x
ANDNPS xmm_y, xmm_tmp1
MOVAPS xmm_tmp1, xmm_x
ORPS xmm_y, xmm_tmp0
CMPLTPS xmm_tmp1, xmm_y
ANDPS xmm_tmp1, wasm_splat_f32(1.0f)
SUBPS xmm_y, xmm_tmp1
- y = f64x2.nearest(x) (y is NOT x) is lowered to:
MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)
MOVAPS xmm_tmp2, xmm_tmp0
ANDPS xmm_y, xmm_tmp1
CMPLEPD xmm_tmp2, xmm_y
ADDPD xmm_y, xmm_tmp0
SUBPD xmm_y, xmm_tmp0
ANDNPS xmm_tmp2, xmm_tmp1
MOVAPS xmm_tmp1, xmm_tmp2
ANDNPS xmm_tmp1, xmm_x
ANDPS xmm_y, xmm_tmp2
ORPS xmm_y, xmm_tmp1
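SSE2 has no packed f64-to-i64 conversion, so the f64x2 sequences rely on the 2^52 magic-constant trick instead. The sketch below is an illustrative C intrinsics rendering of that idea (again assuming the default round-to-nearest-even mode), adding the |x| >= 2^52 pass-through and sign handling that the bare add/subtract trick from the discussion above lacks; it is not the exact instruction sequence listed:

```c
#include <emmintrin.h> /* SSE2 */

/* Adding and subtracting 2^52 rounds |x| to an integer under the current
 * (round-to-nearest-even) mode. Lanes with |x| >= 2^52, or NaN, are already
 * integral (or must stay NaN) and keep the original value; the sign of x is
 * re-applied so that -0.25 rounds to -0.0. */
static __m128d f64x2_nearest_sse2_sketch(__m128d x) {
    const __m128d abs_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));
    const __m128d magic    = _mm_set1_pd(0x1.0p+52);
    __m128d ax  = _mm_and_pd(x, abs_mask);                       /* |x| */
    __m128d big = _mm_cmpnlt_pd(ax, magic);                      /* |x| >= 2^52 or NaN */
    __m128d r   = _mm_sub_pd(_mm_add_pd(ax, magic), magic);      /* |x| rounded */
    r = _mm_or_pd(r, _mm_andnot_pd(abs_mask, x));                /* copy the sign of x back */
    return _mm_or_pd(_mm_and_pd(big, x), _mm_andnot_pd(big, r)); /* select x for big/NaN lanes */
}
```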
- y = f64x2.trunc(x) (y is NOT x) is lowered to:
MOVAPS xmm_y, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)
MOVAPS xmm_tmp1, xmm_x
ANDPS xmm_tmp1, xmm_y
MOVAPS xmm_tmp2, xmm_tmp0
CMPNLEPD xmm_tmp2, xmm_tmp1
ANDPS xmm_y, xmm_tmp2
MOVAPS xmm_tmp2, xmm_tmp1
ADDPD xmm_tmp2, xmm_tmp0
SUBPD xmm_tmp2, xmm_tmp0
CMPLTPD xmm_tmp1, xmm_tmp2
ANDPS xmm_tmp1, wasm_splat_f64(1.0)
SUBPD xmm_tmp2, xmm_tmp1
ANDPS xmm_tmp2, xmm_y
ANDNPS xmm_y, xmm_x
ORPS xmm_y, xmm_tmp2
- y = f64x2.ceil(x) (y is NOT x) is lowered to:
MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)
ANDPS xmm_y, xmm_tmp0
MOVAPS xmm_tmp2, xmm_tmp1
CMPNLEPD xmm_tmp2, xmm_y
ADDPD xmm_y, xmm_tmp1
ANDPS xmm_tmp2, xmm_tmp0
SUBPD xmm_y, xmm_tmp1
ANDPS xmm_y, xmm_tmp2
ANDNPS xmm_tmp2, xmm_x
ORPS xmm_tmp2, xmm_y
MOVAPS xmm_y, xmm_tmp2
MOVAPS xmm_tmp1, xmm_tmp2
CMPLTPD xmm_y, xmm_x
ADDPD xmm_tmp1, wasm_splat_f64(1.0)
ANDPS xmm_y, xmm_tmp0
ANDPS xmm_tmp1, xmm_y
ANDNPS xmm_y, xmm_tmp2
ORPS xmm_y, xmm_tmp1
- y = f64x2.floor(x) (y is NOT x) is lowered to:
MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
MOVAPS xmm_tmp1, xmm_x
MOVAPS xmm_tmp2, wasm_splat_f64(0x1.0p+52)
ANDPS xmm_tmp1, xmm_tmp0
MOVAPS xmm_y, xmm_tmp2
CMPNLEPD xmm_y, xmm_tmp1
ANDPS xmm_y, xmm_tmp0
ADDPD xmm_tmp1, xmm_tmp2
SUBPD xmm_tmp1, xmm_tmp2
ANDPS xmm_tmp1, xmm_y
ANDNPS xmm_y, xmm_x
MOVAPS xmm_tmp0, xmm_x
ORPS xmm_y, xmm_tmp1
CMPLTPD xmm_tmp0, xmm_y
ANDPS xmm_tmp0, wasm_splat_f64(1.0)
SUBPD xmm_y, xmm_tmp0
ARM64 processors
- y = f32x4.nearest(x) is lowered to FRINTN Vy.4S, Vx.4S
- y = f32x4.trunc(x) is lowered to FRINTZ Vy.4S, Vx.4S
- y = f32x4.ceil(x) is lowered to FRINTP Vy.4S, Vx.4S
- y = f32x4.floor(x) is lowered to FRINTM Vy.4S, Vx.4S
- y = f64x2.nearest(x) is lowered to FRINTN Vy.2D, Vx.2D
- y = f64x2.trunc(x) is lowered to FRINTZ Vy.2D, Vx.2D
- y = f64x2.ceil(x) is lowered to FRINTP Vy.2D, Vx.2D
- y = f64x2.floor(x) is lowered to FRINTM Vy.2D, Vx.2D
ARM processors with ARMv8 (32-bit) instruction set
- y = f32x4.nearest(x) is lowered to VRINTN.F32 Qy, Qx
- y = f32x4.trunc(x) is lowered to VRINTZ.F32 Qy, Qx
- y = f32x4.ceil(x) is lowered to VRINTP.F32 Qy, Qx
- y = f32x4.floor(x) is lowered to VRINTM.F32 Qy, Qx
- y = f64x2.nearest(x) is lowered to VRINTN.F64 Dy_lo, Dx_lo + VRINTN.F64 Dy_hi, Dx_hi
- y = f64x2.trunc(x) is lowered to VRINTZ.F64 Dy_lo, Dx_lo + VRINTZ.F64 Dy_hi, Dx_hi
- y = f64x2.ceil(x) is lowered to VRINTP.F64 Dy_lo, Dx_lo + VRINTP.F64 Dy_hi, Dx_hi
- y = f64x2.floor(x) is lowered to VRINTM.F64 Dy_lo, Dx_lo + VRINTM.F64 Dy_hi, Dx_hi
ARM processors with ARMv7 (32-bit) instruction set
- y = f32x4.nearest(x) (y is NOT x) is lowered to:
VMOV.I32 Qtmp0, 0x4B000000
VABS.F32 Qtmp1, Qx
VACGT.F32 Qy, Qx, Qtmp0
VADD.F32 Qtmp1, Qtmp1, Qtmp0
VORR.I32 Qy, 0x80000000
VSUB.F32 Qtmp1, Qtmp1, Qtmp0
VBSL Qy, Qx, Qtmp1
- y = f32x4.trunc(x) (y is NOT x) is lowered to:
VCVT.S32.F32 Qtmp0, Qx
VMOV.I32 Qtmp1, 0x4B000000
VACGT.F32 Qy, Qtmp1, Qx
VCVT.F32.S32 Qtmp0, Qtmp0
VBIC.I32 Qy, 0x80000000
VBSL Qy, Qtmp0, Qx
- y = f32x4.ceil(x) (y is NOT x) is lowered to:
VCVT.S32.F32 Qtmp0, Qx
VMOV.I32 Qtmp1, 0x4B000000
VACGT.F32 Qtmp1, Qtmp1, Qx
VCVT.F32.S32 Qtmp0, Qtmp0
VBIC.I32 Qtmp1, 0x80000000
VBSL Qtmp1, Qtmp0, Qx
VMOV.F32 Qtmp0, 0x3F800000
VCGE.F32 Qy, Qtmp1, Qx
VADD.F32 Qtmp0, Qtmp1, Qtmp0
VORR.I32 Qy, 0x80000000
VBSL Qy, Qtmp1, Qtmp0
- y = f32x4.floor(x) (y is NOT x) is lowered to:
VCVT.S32.F32 Qtmp0, Qx
VMOV.I32 Qtmp1, 0x4B000000
VACGT.F32 Qy, Qtmp1, Qx
VCVT.F32.S32 Qtmp0, Qtmp0
VBIC.I32 Qy, 0x80000000
VBSL Qy, Qtmp0, Qx
VMOV.F32 Qtmp1, 0x3F800000
VCGT.F32 Qtmp0, Qy, Qx
VAND Qtmp0, Qtmp0, Qtmp1
VSUB.F32 Qy, Qy, Qtmp0
- y = f64x2.nearest(x) (y is NOT x) is lowered to:
VABS.F64 Dy_lo, Dx_lo
VABS.F64 Dy_hi, Dx_hi
VLDR Dtmp0, 0x1.0p+52
VSUB.F64 Dtmp1_lo, Dtmp0, Dy_lo
VSUB.F64 Dtmp1_hi, Dtmp0, Dy_hi
VADD.F64 Dtmp2_lo, Dy_lo, Dtmp0
VADD.F64 Dtmp2_hi, Dy_hi, Dtmp0
VEOR Qy, Qx, Qy
VSHR.S64 Qtmp1, Qtmp1, 63
VSUB.F64 Dtmp2_lo, Dtmp2_lo, Dtmp0
VSUB.F64 Dtmp2_hi, Dtmp2_hi, Dtmp0
VORR Qy, Qy, Qtmp1
VBSL Qy, Qx, Qtmp2
- y = f64x2.trunc(x) (y is NOT x) is lowered to:
VLDR Dtmp0, 0x1.0p+52
VABS.F64 Qy_lo, Dx_lo
VABS.F64 Qy_hi, Dx_hi
VADD.F64 Dtmp1_lo, Qy_lo, Dtmp0
VADD.F64 Dtmp1_hi, Qy_hi, Dtmp0
VSUB.F64 Dtmp2_lo, Dtmp0, Qy_lo
VSUB.F64 Dtmp2_hi, Dtmp0, Qy_hi
VEOR Qtmp3, Qy, Qx
VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
VLDR Dtmp0, 1.0
VSHR.S64 Qtmp2, Qtmp2, 63
VORR Qtmp3, Qtmp3, Qtmp2
VSUB.I64 Qy, Qy, Qtmp1
VSHR.S64 Qy, Qy, 63
VAND Qy_lo, Qy_lo, Dtmp0
VAND Qy_hi, Qy_hi, Dtmp0
VSUB.F64 Qy_lo, Dtmp1_lo, Qy
VSUB.F64 Qy_hi, Dtmp1_hi, Qx
VBIT Qy, Qx, Qtmp3
- y = f64x2.ceil(x) (y is NOT x) is lowered to:
VLDR Dtmp0, 0x1.0p+52
VABS.F64 Dtmp1_lo, Dx_lo
VABS.F64 Dtmp1_hi, Dx_hi
VSUB.F64 Dtmp2_lo, Dtmp0, Dtmp1_lo
VSUB.F64 Dtmp2_hi, Dtmp0, Dtmp1_hi
VADD.F64 Dtmp3_lo, Dtmp1_lo, Dtmp0
VADD.F64 Dtmp3_hi, Dtmp1_hi, Dtmp0
VEOR Qtmp1, Qtmp1, Qx
VSHR.S64 Qtmp2, Qtmp2, 63
VSUB.F64 Dtmp3_lo, Dtmp3_lo, Dtmp0
VSUB.F64 Dtmp3_hi, Dtmp3_hi, Dtmp0
VLDR Dtmp0, 1.0
VORR Qtmp2, Qtmp2, Qtmp1
VBSL Qtmp2, Qx, Qtmp3
VSUB.F64 Dy_lo, Dtmp2_lo, Dx_lo
VSUB.F64 Dy_hi, Dtmp2_hi, Dx_hi
VADD.F64 Dtmp3_lo, Dtmp2_lo, Dtmp0
VADD.F64 Dtmp3_hi, Dtmp2_hi, Dtmp0
VSHR.S64 Qy, Qy, 63
VBIC Qy, Qy, Qtmp1
VBSL Qy, Qtmp3, Qtmp2
- y = f64x2.floor(x) (y is NOT x) is lowered to:
VLDR Dtmp0, 0x1.0p+52
VABS.F64 Dy_lo, Dx_lo
VABS.F64 Dy_hi, Dx_hi
VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0
VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0
VSUB.I64 Dtmp2_lo, Dtmp0, Dy_lo
VSUB.I64 Dtmp2_hi, Dtmp0, Dy_hi
VEOR Qy, Qy, Qx
VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
VLDR Dtmp0, 1.0
VSHR.S64 Qtmp2, Qtmp2, 63
VORR Qy, Qy, Qtmp2
VBSL Qy, Qx, Qtmp1
VSUB.F64 Dx_lo, Dx_lo, Dy_lo
VSUB.F64 Dx_hi, Dx_hi, Dy_hi
VSHR.S64 Qtmp2, Qx, 63
VAND Dtmp2_lo, Dtmp2_lo, Dtmp0
VAND Dtmp2_hi, Dtmp2_hi, Dtmp0
VSUB.F64 Dy_lo, Dy_lo, Dtmp2_lo
VSUB.F64 Dy_hi, Dy_hi, Dtmp2_hi