Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Introduce Load and Extend #98

Merged
merged 3 commits into from
Sep 13, 2019
Merged

Conversation

penzn
Copy link
Contributor

@penzn penzn commented Aug 28, 2019

Rebasing #77 on current master and incorporating the latest review feedback.

This change proposes six new load instructions, that would combine memory read of "half-size" vector with extending each lane to the next standard lane size. Motivating workloads are machine learning, image compression, video rendering, and data processing. There is widespread hardware support. Also see #23, #28, #77

Hardware support from #23:

  • PMOVZXWD xmm, [mem] on x86 with SSE4.1
  • MOVQ xmm, [mem] + PXOR xmm0, xmm0 + PUNPCKLWD xmm, xmm0 on SSE2
  • VLD1.16 {dX}, [rAddr] + VMOVL.U16 qX, dX on ARMv7+NEON
  • LD1 {Vx.4H}, xAddr + UXTL Vx.4S, Vx.4H on ARM64

Incorporate review suggestions from WebAssembly#77
@penzn
Copy link
Contributor Author

penzn commented Sep 3, 2019

pinging @dtig, @tlively, @Maratyszcza, and @arunetm

@tlively
Copy link
Member

tlively commented Sep 3, 2019

Looks fine to me from a tooling perspective. @ngzhian would the renumbering of the narrowing and widening operations here be problematic?

@Maratyszcza
Copy link
Contributor

LGTM

@dtig
Copy link
Member

dtig commented Sep 10, 2019

Thanks for distilling multiple PRs/Issues into this one, I'm in favor of merging this because collaborators from different application domains have indicated that this would be a useful operation to have, this has been a requested addition since 2017 and seems to be reasonably well supported on all architectures. A couple of outstanding things to make sure we handle previous concerns.

  • There were some concerns about this set of operations when presented previously, but IIRC load+extend operations as enumerated here would be okay to include. cc @sunfishcode
  • This was originally proposed in conjunction with removal of i8x16.mul because of the large number of overflow cases, but this PR doesn't seem to address that, is the motivation still to remove the i8x16.mul operation?

@arunetm
Copy link
Collaborator

arunetm commented Sep 10, 2019

There was general consensus on removing i8x16.mul. @penzn can we update this PR to address it as well.

@penzn
Copy link
Contributor Author

penzn commented Sep 10, 2019

Do you mean the discussion in #28? Will add a commit to remove those if there is no objections.

Consensus for the removal is documented in WebAssembly#28 and
WebAssembly#98.
@tlively
Copy link
Member

tlively commented Sep 12, 2019

Can we add formal pseudocode to SIMD.md for these instructions?

@penzn
Copy link
Contributor Author

penzn commented Sep 12, 2019

Good point, maybe we should.

@penzn
Copy link
Contributor Author

penzn commented Sep 12, 2019

Sorry for double-posting. We don't have semantics on memory ops, I am not sure how to describe that yet, the "extend" part of the operation should be very similar to "widen" operation, which does not have semantics yet either.

@tlively
Copy link
Member

tlively commented Sep 12, 2019

Ok, I'm fine with merging this without pseudocode. I haven't thought of any ambiguities in the semantics here.

@tlively tlively merged commit 36a8199 into WebAssembly:master Sep 13, 2019
abrown added a commit to abrown/cranelift that referenced this pull request Sep 18, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 18, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 20, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 20, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 25, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 27, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 30, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to abrown/cranelift that referenced this pull request Sep 30, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
abrown added a commit to bytecodealliance/cranelift that referenced this pull request Sep 30, 2019
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
Honry pushed a commit to Honry/simd that referenced this pull request Oct 19, 2019
@ngzhian
Copy link
Member

ngzhian commented Nov 7, 2019

Sorry I'm late to point this out:

VLD1.16 {dX}, [rAddr] doesn't allow for load via base + index (both in registers).
According to the manual, ARM DDI 0487D.b C7-1629, it only allows for a post index addressing mode, or an immediate offset.
So it will need a temporary add of base + offset, before passing the final memory address to VLD1.
Am I right in this interpretation of VLD1 instruction?

@Maratyszcza
Copy link
Contributor

You are right that VLD1 instructions don't support addressing with offset. There is also an option of using VLDR instruction, which support immediate (but not register) offset from a register base.

@ngzhian
Copy link
Member

ngzhian commented Nov 8, 2019

Thanks, I think I'll implement it using add to temporary first, and then add the optimization to use VLDR if the offset can be an immediate.

@penzn
Copy link
Contributor Author

penzn commented Nov 12, 2019

@ngzhian are you working on implementing that in V8? We wanted to get some timings for this, can lend a hand with implementation.

@ngzhian
Copy link
Member

ngzhian commented Nov 12, 2019

Yup I am, I have done it all for x64, so if you are interested only for x64 you can build locally and run. I have not started on arm/arm64/ia32 yet, so if you want to pick those up, lmk!

@penzn penzn deleted the extended-load-upstream branch November 13, 2019 00:17
@penzn
Copy link
Contributor Author

penzn commented Nov 14, 2019

I believe @rrwinterton was interested in testing this. Arm - probably not, let us think about ia32, could be somebody else would be willing to take it as well.

@ngzhian
Copy link
Member

ngzhian commented Nov 14, 2019

Sounds good, I'll be working on arm/arm64 soon, and will leave ia32 to yall for now, please keep me updated (here or via email) so we don't overlap :) Thanks!

arkpar pushed a commit to paritytech/wasmtime that referenced this pull request Mar 4, 2020
Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.
@penzn penzn mentioned this pull request Oct 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants