v128.load32_zero and v128.load64_zero instructions #237
Conversation
Thanks for the suggestion, @Maratyszcza! Now that we are in phase 3, we have stricter guidelines on adding new instructions. It sounds like these instructions are well supported on multiple architectures, but we need to agree that they are used in multiple important use cases and that they would be expensive to emulate. Can you point to real-world uses of this pattern that we could adapt as benchmarks to determine how much of a benefit these instructions would be?
@tlively Added examples of applications using these instructions
XNNPACK has SIMD table-based exp and sigmoid implementations that could be used for evaluation |
@tlively I agree these would be helpful. Another expensive-to-emulate use case is when you have an existing data structure of 1-2 floats and we can't be sure 4 floats are accessible. Or the much more common case of remainder handling: using the same code as the main loop, but with 32-bit loads/stores going one element at a time. JPEG XL has several examples of this.
Maybe the more general load-into-lane form would be useful? Basically, it can load a single element into any lane, not just the first one, and leaves the other lanes untouched. The problem I see is that it would need some kind of pattern matching to make this the most efficient on x86, where we only have "load into the first lane and set the rest to zero". If pattern recognition fails (or is disabled), the generated code for a single-lane load would be less efficient. The store counterpart might also be interesting.
For consistency with the load_splat instructions, these instructions should probably have matching names.
Specified in WebAssembly/simd#237. Since these are just prototypes necessary for benchmarking, this PR does not add support for these instructions to the fuzzer or the C or JS APIs. This PR also renumbers the QFMA instructions that previously used the opcodes for these new instructions. The renumbering matches the renumbering in V8 and LLVM.
IMO it is best to save the shorter names. We could rename these instructions for consistency.
Specified in WebAssembly/simd#237, these instructions load the first vector lane from memory and zero the other lanes. Since these instructions are not officially part of the SIMD proposal, they are only available on an opt-in basis via LLVM intrinsics and clang builtin functions. If these instructions are merged to the proposal, this implementation will change so that the instructions will be generated from normal IR. At that point the intrinsics and builtin functions would be removed. This PR also changes the opcodes for the experimental f32x4.qfm{a,s} instructions because their opcodes conflicted with those of the v128.load{32,64}_zero instructions. The new opcodes were chosen to match those used in V8. Differential Revision: https://reviews.llvm.org/D84820
On a different note, prototypes of these instructions have been merged to both LLVM and Binaryen and will be available in the next version of Emscripten via builtin functions.
I think the memory instructions should all start with v128. The shape prefix suggests how the operands are treated, which doesn't apply for loads, since the operands are all memargs. This might be a point of confusion. Making everything start with v128 would avoid it.
For load_splat we might even consider changing it to v128.load32_splat. So maybe load zeroes can be v128.load32_zero and v128.load64_zero. I think this has a (IMO nice) side effect of making the spec text a bit clearer, because you can now say that all instructions that start with a shape prefix describe how they treat their operands (and you don't have to say "except for memory instructions").
@ngzhian That seems reasonable and consistent to me.
Yeah, that looks good to me. It becomes really clear from the name that v128 is the return type and how many bytes will be loaded. The remaining portion tells us how to get from bytes to v128, of which we can have many different variants.
Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic. For spec and cpu mapping details, see: WebAssembly/simd#122 (pmax/pmin) WebAssembly/simd#232 (rounding) WebAssembly/simd#127 (dot product) WebAssembly/simd#237 (load zero) The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest. Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination. Differential Revision: https://phabricator.services.mozilla.com/D85982
@Maratyszcza we briefly discussed this in the sync meeting today, and there is general support for these instructions, but we still need benchmarking data to make the case for including them. Would you be able to get performance numbers for these?
In this case, it would be the same operation with a register instead of a
memory argument. There's no undefined behavior.
…On Mon, Oct 5, 2020, 18:13 Thomas Lively wrote:

@omnisip that sounds like v128.const, but leaving some lanes undefined. Unfortunately we can't have instructions that leave parts of values undefined in WebAssembly, because we can't have undefined behavior and we don't want to make the type system so complex that it can track undefined lanes statically.
How is that different from v128.const then?
The value is calculated at runtime but isn't a pointer in memory.
…On Mon, Oct 5, 2020, 20:03 Thomas Lively wrote:

How is that different from v128.const then?
Then it sounds like a candidate for a separate instruction proposal. In general, the answer to "why doesn't the instruction set have X" is some combination of "X is not portable enough," "X is not useful enough," or "no one has suggested X yet." You're totally welcome to suggest new instructions when you identify deficiencies in the instruction set. Useful information to include is how the instruction would be lowered on Intel and ARM ISAs, applications that would benefit from the instruction, and any estimates you have for the performance improvement that instruction can bring. See many of @Maratyszcza's PRs for good examples of new instruction proposals.
In this case, it would belong with this PR because it would still be the same underlying instructions with a different argument. @Maratyszcza do you want me to make a patch or do you want to add this yourself?
WebAssembly doesn't overload instructions with different kinds of arguments like that, so it will have to be a new instruction proposal.
There are three implementations of this functionality that are useful:
1) from memory like this currently is
2) from 32 or 64 bit register
3) from another v128 vector
I'll put the other two in my proposal so they can be evaluated together
when nomenclature is discussed.
…On Mon, Oct 5, 2020, 20:56 Thomas Lively wrote:

WebAssembly doesn't overload instructions with different kinds of arguments like that, so it will have to be a new instruction proposal.
Memory operations either go from memory to a stack slot (load) or from a stack slot to memory (store). There are no other possible semantics for memory operations in WebAssembly. (edit: had virtual registers instead of stack slots, my bad)
In the sync today we agreed to add these 2 instructions.
@ngzhian Done
Thanks, LGTM
…ons. This patch implements, for aarch64, the following wasm SIMD extensions. v128.load32_zero and v128.load64_zero instructions WebAssembly/simd#237 The changes are straightforward: * no new CLIF instructions. They are translated into an existing CLIF scalar load followed by a CLIF `scalar_to_vector`. * the comment/specification for CLIF `scalar_to_vector` has been changed to match the actual intended semantics, per consulation with Andrew Brown. * translation from `scalar_to_vector` to the obvious aarch64 insns. * special-case zero in `lower_constant_f128` in order to avoid a potentially slow call to `Inst::load_fp_constant128`. * Once "Allow loads to merge into other operations during instruction selection in MachInst backends" (bytecodealliance#2340) lands, we can use that functionality to pattern match the two-CLIF pair and emit a single AArch64 instruction. There is no testcase in this commit, because that is a separate repo. The implementation has been tested, nevertheless.
…ons. This patch implements, for aarch64, the following wasm SIMD extensions. v128.load32_zero and v128.load64_zero instructions WebAssembly/simd#237 The changes are straightforward: * no new CLIF instructions. They are translated into an existing CLIF scalar load followed by a CLIF `scalar_to_vector`. * the comment/specification for CLIF `scalar_to_vector` has been changed to match the actual intended semantics, per consulation with Andrew Brown. * translation from `scalar_to_vector` to aarch64 `fmov` instruction. This has been generalised slightly so as to allow both 32- and 64-bit transfers. * special-case zero in `lower_constant_f128` in order to avoid a potentially slow call to `Inst::load_fp_constant128`. * Once "Allow loads to merge into other operations during instruction selection in MachInst backends" (bytecodealliance#2340) lands, we can use that functionality to pattern match the two-CLIF pair and emit a single AArch64 instruction. * A simple filetest has been added. There is no comprehensive testcase in this commit, because that is a separate repo. The implementation has been tested, nevertheless.
Introduction
This PR introduces two new variants of load instructions which load a single 32-bit or 64-bit element into the lowest part of a 128-bit SIMD vector and zero-extend it to the full 128 bits. These instructions map natively to SSE2 and ARM64 NEON instructions, and have two broad use-cases:
Non-contiguous loads, where a SIMD vector is assembled from scattered elements. If emulated with scalar loads and v128.replace_lane instructions, the resulting code would be inefficient: it uses too many general-purpose registers, produces an overly long dependency chain (every v128.replace_lane depends on the previous one), and hits the long-latency/low-throughput instructions that copy from general-purpose registers to SIMD registers. Non-contiguous loads using the proposed v128.load32_zero and v128.load64_zero instructions avoid all these bottlenecks.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code-generation patterns.
x86/x86-64 processors with AVX instruction set
- v = v128.load32_zero(mem) is lowered to VMOVSS xmm_v, [mem]
- v = v128.load64_zero(mem) is lowered to VMOVSD xmm_v, [mem]
x86/x86-64 processors with SSE2 instruction set
- v = v128.load32_zero(mem) is lowered to MOVSS xmm_v, [mem]
- v = v128.load64_zero(mem) is lowered to MOVSD xmm_v, [mem]
ARM64 processors
- v = v128.load32_zero(mem) is lowered to LDR Sv, [mem]
- v = v128.load64_zero(mem) is lowered to LDR Dv, [mem]
ARMv7 processors with NEON instruction set
- v = v128.load32_zero(mem) is lowered to VMOV.I32 Qv, 0 + VLD1.32 {Dv_lo[0]}, [mem]
- v = v128.load64_zero(mem) is lowered to VMOV.I32 Dv_hi, 0 + VLD1.32 {Dv_lo}, [mem]