Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

v128.load32_zero and v128.load64_zero instructions #237

Merged
merged 1 commit into from
Oct 19, 2020

Conversation

Maratyszcza
Copy link
Contributor

@Maratyszcza Maratyszcza commented Jun 2, 2020

Introduction

This PR introduce two new variants of load instructions which load a single 32-bit or 64-bit element into the lowest part of 128-bit SIMD vector and zero-extend it to full 128 bits. These instructions natively map to SSE2 and ARM64 NEON instruction, and have two broad use-cases:

  1. Non-contiguous loads, when we need to combine elements from disjoint locations in a single SIMD vector. Non-contiguous loads are commonly emulated by doing loads a single elements and then combining the values through shuffles. While is it possible to do through a combination of scalar loads and v128.replace_lane instructions, the resulting code would use be inefficient in using too many general-purpose registers, producing an overly long dependency chain (every v128.replace_lane depends on the previous one), and hitting the long-latency/low-throughput instructions to copy from general-purpose registers to SIMD registers. Non-contiguous loads using the proposed v128.load32_zero and v128.load64_zero instructions avoid all these bottlenecks.
  2. Processing fewer than 128 bits of data. Sometimes the algorithm or data structures just don't expose enough data to utilize all 128 bits of a SIMD vector, but would nevertheless benefit from processing fewer elements in parallel (e.g. adding 8 bytes in one SIMD instruction rather than eight scalar instructions).

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience, compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to VMOVSS xmm_v, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to VMOVSD xmm_v, [mem]

x86/x86-64 processors with SSE2 instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to MOVSS xmm_v, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to MOVSD xmm_v, [mem]

ARM64 processors

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to LDR Sv, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to LDR Dv, [mem]

ARMv7 processors with NEON instruction set

  • v128.load32_zero
    • v = v128.load32_zero(mem) is lowered to VMOV.I32 Qv, 0 + VLD1.32 {Dv_lo[0]}, [mem]
  • v128.load64_zero
    • v = v128.load64_zero(mem) is lowered to VMOV.I32 Dv_hi, 0 + VLD1.32 {Dv_lo}, [mem]

@tlively
Copy link
Member

tlively commented Jun 2, 2020

Thanks for the suggestion, @Maratyszcza! Now that we are in phase 3, we have stricter guidelines on adding new instructions. It sounds like these instructions are well supported on multiple architectures, but we need to agree that they are used in multiple important use cases and that they would be expensive to emulate. Can you point to real-world uses of this pattern that we could adapt as benchmarks to determine how much of a benefit these instructions would be?

@Maratyszcza
Copy link
Contributor Author

@tlively Added examples of applications using these instructions

@Maratyszcza
Copy link
Contributor Author

XNNPACK has SIMD table-based exp and sigmoid implementations that could be used for evaluation

@jan-wassenberg
Copy link

@tlively I agree these would be helpful. Another expensive to emulate use case is when you have an existing data structure of 1-2 floats and we can't be sure 4 floats are accessible. Or the much more common case of remainder handling - using the same code as the main loop, but with 32-bit loads/stores going one element at a time. JPEG XL has several examples of this.

@lemaitre
Copy link

lemaitre commented Jun 3, 2020

Maybe the more general vld1q_lane from Neon might be desirable (https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf#G15.1154120)

Basically, it can loads a single element into any lane, not just the first one, and leaves the other lanes untouched.

The problem I see is that it would need some kind pattern matching to make this the most efficient on x86 where we only have "load into the first lane and put the rest at zero".
But we can envision that a sequence of load_lane could be converted into shuffles in SSE and even a (masked) gather in AVX2.

If pattern recognition fails (or is disabled), the generated code for a single load_lane would still be faster than scalar load + insert_lane as it would be converted into loadl + shuffle and stay in the SIMD register space.
And the shuffle can be easily eliminated if the WASM runtime detect that the lane index is 0 and the input vector already contains zeros.


The store counterpart might also be interesting.
However, the store variant will not be able to handle 8- and 16-bit types efficiently on x86.
We can stay with 32- and 64-bit types, as proposed here, though.

@tlively
Copy link
Member

tlively commented Jul 31, 2020

For consistency with the load_splat instructions, these instructions should probably have v32x4 and v64x2 prefixes. More descriptive names might be v32x4.load_lane and v64x2.load_lane.

tlively added a commit to tlively/binaryen that referenced this pull request Jul 31, 2020
Specified in WebAssembly/simd#237. Since these
are just prototypes necessary for benchmarking, this PR does not add
support for these instructions to the fuzzer or the C or JS APIs. This
PR also renumbers the QFMA instructions that previously used the
opcodes for these new instructions. The renumbering matches the
renumbering in V8 and LLVM.
@Maratyszcza
Copy link
Contributor Author

IMO it is best to save load_lane for (future) variants which load a single lane while leaving other unchanged (i.e. analogs of vld1q_lane_XX in ARM NEON and _mm_insert_epiXX on x86).

We could rename these instructions to v128.load32_u and v128.load64_u for consistency with v128.load32x2_u and other zero-extending instructions.

tlively added a commit to WebAssembly/binaryen that referenced this pull request Aug 3, 2020
Specified in WebAssembly/simd#237. Since these
are just prototypes necessary for benchmarking, this PR does not add
support for these instructions to the fuzzer or the C or JS APIs. This
PR also renumbers the QFMA instructions that previously used the
opcodes for these new instructions. The renumbering matches the
renumbering in V8 and LLVM.
tlively added a commit to llvm/llvm-project that referenced this pull request Aug 3, 2020
Specified in WebAssembly/simd#237, these
instructions load the first vector lane from memory and zero the other
lanes. Since these instructions are not officially part of the SIMD
proposal, they are only available on an opt-in basis via LLVM
intrinsics and clang builtin functions. If these instructions are
merged to the proposal, this implementation will change so that the
instructions will be generated from normal IR. At that point the
intrinsics and builtin functions would be removed.

This PR also changes the opcodes for the experimental f32x4.qfm{a,s}
instructions because their opcodes conflicted with those of the
v128.load{32,64}_zero instructions. The new opcodes were chosen to
match those used in V8.

Differential Revision: https://reviews.llvm.org/D84820
@tlively
Copy link
Member

tlively commented Aug 4, 2020

Saving load_lane for potential future instructions makes sense to me. How about v32x4.load32 and v64x2.load64? The _u suffix doesn't seem necessary because there is no sign interpretation happening. I still think it makes sense to use the prefixes for hinting at the lane interpretation, but I could probably be convinced otherwise as well.

On a different note, prototypes of these instructions have been merged to both LLVM and Binaryen and will be available in the next version of Emscripten via the builtin functions __builtin_wasm_load32_zero and __builtin_wasm_load64_zero.

@ngzhian
Copy link
Member

ngzhian commented Aug 4, 2020

I think the memory instructions should all start with v128. (Ref: mvp instructions are all of the form <type>.load[<n>_<sx>].)

The shape prefix suggests how the operands are treated, which doesn't apply for loads, since the operands are all memargs. This might be a point of confusion. Making everything start with v128.load_ will help categorize all these variants of load as: "load from memory to get a v128", i.e. these are all the ways you can load something from memory to get a v128. Then the format becomes:

v128.load_<splat/extend/zero/others>_<numberofbytesloaded>_<sign extension>"

For load_splat we might even consider change it to: v128.load_splat8, similar to how we have i32.load8_s.

So maybe load zeroes can be: v128.load_zero32.

I think this has a (imo nice) side effect of making the spec text a bit clearer, because you can now say, all instructions that start with the shape prefix describe how they treat their operands (and you don't have to say "except for memory instructions").

@tlively
Copy link
Member

tlively commented Aug 4, 2020

@ngzhian That seems reasonable and consistent to me. We would want to use the v128 prefix for load-extend operations as well. I think it would look more consistent with MVP if we put <numberofbytesloaded> after load, like in v128.load8_splat or v128.load32_zero. WDYT?

@ngzhian
Copy link
Member

ngzhian commented Aug 4, 2020

Yea that looks good to me. It becomes really clear from the name that v128 is the return type, and how many bytes will be loaded. The remaining portion will be to tell us how to get from bytes to v128, of which we can have many different wants.

hanswinderix pushed a commit to hanswinderix/llvm-project that referenced this pull request Aug 5, 2020
Specified in WebAssembly/simd#237, these
instructions load the first vector lane from memory and zero the other
lanes. Since these instructions are not officially part of the SIMD
proposal, they are only available on an opt-in basis via LLVM
intrinsics and clang builtin functions. If these instructions are
merged to the proposal, this implementation will change so that the
instructions will be generated from normal IR. At that point the
intrinsics and builtin functions would be removed.

This PR also changes the opcodes for the experimental f32x4.qfm{a,s}
instructions because their opcodes conflicted with those of the
v128.load{32,64}_zero instructions. The new opcodes were chosen to
match those used in V8.

Differential Revision: https://reviews.llvm.org/D84820
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Aug 12, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Aug 12, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
@tlively
Copy link
Member

tlively commented Sep 4, 2020

@Maratyszcza we briefly discussed this in the sync meeting today, and there is general support for these instructions, but we still need benchmarking data to make the case for including them. Would you be able to get performance numbers for these?

@omnisip
Copy link

omnisip commented Oct 6, 2020 via email

@tlively
Copy link
Member

tlively commented Oct 6, 2020

How is that different from v128.const then?

@omnisip
Copy link

omnisip commented Oct 6, 2020 via email

@tlively
Copy link
Member

tlively commented Oct 6, 2020

Then it sounds like a i64x2.replace_lane 0 I guess that requires a full vector to be materialized first, though.

In general, the answer to "why doesn't the instruction set have X" is some combination of "X is not portable enough," "X is not useful enough," or "no one has suggested X yet". You're totally welcome to suggest new instructions when you identify deficiencies in the instruction set. Useful information to include is how the instruction would be lowered on Intel and ARM ISAs, applications that would benefit from the instruction, and any estimates you have for the performance improvement that instruction can bring. See many of @Maratyszcza's PRs for good examples of new instruction proposals.

@omnisip
Copy link

omnisip commented Oct 6, 2020

In this case, it would belong with this PR because it would still be the same underlying instructions with a different argument. @Maratyszcza do you want me to make a patch or do you want to add this yourself?

@tlively
Copy link
Member

tlively commented Oct 6, 2020

WebAssembly doesn't overload instructions with different kinds of arguments like that, so it will have to be a new instruction proposal.

@omnisip
Copy link

omnisip commented Oct 6, 2020 via email

@penzn
Copy link
Contributor

penzn commented Oct 6, 2020

Why doesn't this instruction set have a non-memory load? E.g. to load into the lower part without pulling from memory?

Memory operations either go from memory to a stack slot (load) or from stack slot to memory (store). There are no other possible semantics for memory operations in WebAssembly.

(edit: had virtual registers instead of stack slots, my bad)

@ngzhian
Copy link
Member

ngzhian commented Oct 16, 2020

In the sync today we agreed to add these 2 instructions.
@Maratyszcza can you please update https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md as well? Put it in the table of memory instructions.

@Maratyszcza
Copy link
Contributor Author

@ngzhian Done

@ngzhian
Copy link
Member

ngzhian commented Oct 19, 2020

Thanks, LGTM

@ngzhian ngzhian merged commit b9b54b0 into WebAssembly:master Oct 19, 2020
bmeurer added a commit to bmeurer/wasmparser that referenced this pull request Oct 24, 2020
bmeurer added a commit to wasdk/wasmparser that referenced this pull request Oct 24, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consulation with Andrew Brown.

* translation from `scalar_to_vector` to the obvious aarch64 insns.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (bytecodealliance#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

There is no testcase in this commit, because that is a separate repo.  The
implementation has been tested, nevertheless.
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 3, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consulation with Andrew Brown.

* translation from `scalar_to_vector` to the obvious aarch64 insns.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (bytecodealliance#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

There is no testcase in this commit, because that is a separate repo.  The
implementation has been tested, nevertheless.
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Nov 4, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consulation with Andrew Brown.

* translation from `scalar_to_vector` to aarch64 `fmov` instruction.  This
  has been generalised slightly so as to allow both 32- and 64-bit transfers.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (bytecodealliance#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

* A simple filetest has been added.

There is no comprehensive testcase in this commit, because that is a separate
repo.  The implementation has been tested, nevertheless.
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Nov 4, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consulation with Andrew Brown.

* translation from `scalar_to_vector` to aarch64 `fmov` instruction.  This
  has been generalised slightly so as to allow both 32- and 64-bit transfers.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

* A simple filetest has been added.

There is no comprehensive testcase in this commit, because that is a separate
repo.  The implementation has been tested, nevertheless.
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
…ons.

This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  WebAssembly/simd#237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consulation with Andrew Brown.

* translation from `scalar_to_vector` to aarch64 `fmov` instruction.  This
  has been generalised slightly so as to allow both 32- and 64-bit transfers.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (#2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

* A simple filetest has been added.

There is no comprehensive testcase in this commit, because that is a separate
repo.  The implementation has been tested, nevertheless.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 22, 2021
Specified in WebAssembly/simd#237, these
instructions load the first vector lane from memory and zero the other
lanes. Since these instructions are not officially part of the SIMD
proposal, they are only available on an opt-in basis via LLVM
intrinsics and clang builtin functions. If these instructions are
merged to the proposal, this implementation will change so that the
instructions will be generated from normal IR. At that point the
intrinsics and builtin functions would be removed.

This PR also changes the opcodes for the experimental f32x4.qfm{a,s}
instructions because their opcodes conflicted with those of the
v128.load{32,64}_zero instructions. The new opcodes were chosen to
match those used in V8.

Differential Revision: https://reviews.llvm.org/D84820
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants