Introduce Load and Extend #98

penzn · 2019-08-28T20:15:59Z

Rebasing #77 on current master and incorporating the latest review feedback.

This change proposes six new load instructions that combine a memory read of a "half-size" vector with extending each lane to the next standard lane size. Motivating workloads are machine learning, image compression, video rendering, and data processing. There is widespread hardware support. Also see #23, #28, and #77.

Hardware support from #23:

PMOVZXWD xmm, [mem] on x86 with SSE4.1
MOVQ xmm, [mem] + PXOR xmm0, xmm0 + PUNPCKLWD xmm, xmm0 on SSE2
VLD1.16 {dX}, [rAddr] + VMOVL.U16 qX, dX on ARMv7+NEON
LD1 {Vx.4H}, [xAddr] + UXTL Vx.4S, Vx.4H on ARM64
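
To make the SSE2 fallback above concrete, here is a minimal scalar sketch in Python of why interleaving the loaded 16-bit lanes with zeros gives a zero-extending load. The helper names (`movq_load`, `punpcklwd`, `load16x4_zero_extend_sse2`) are illustrative and do not come from the proposal or any engine. The sketch assumes little-endian memory and models only the unsigned 16-to-32-bit case.

```python
import struct

def movq_load(memory, addr):
    """Model MOVQ xmm, [mem]: read 64 bits into the low half, zero the high half.

    The register is modeled as eight 16-bit lanes.
    """
    low = list(struct.unpack_from("<4H", memory, addr))
    return low + [0, 0, 0, 0]

def punpcklwd(a, b):
    """Model PUNPCKLWD: interleave the low four 16-bit lanes of a and b."""
    out = []
    for x, y in zip(a[:4], b[:4]):
        out += [x, y]
    return out

def as_u32_lanes(words):
    """Reinterpret eight little-endian 16-bit lanes as four 32-bit lanes."""
    return list(struct.unpack("<4I", struct.pack("<8H", *words)))

def load16x4_zero_extend_sse2(memory, addr):
    """Hypothetical model of the three-instruction SSE2 sequence."""
    zero = [0] * 8                        # PXOR xmm0, xmm0
    loaded = movq_load(memory, addr)      # MOVQ xmm, [mem]
    return as_u32_lanes(punpcklwd(loaded, zero))  # PUNPCKLWD xmm, xmm0
```

Each zero word lands in the upper half of a 32-bit lane, so every lane keeps its unsigned 16-bit value. A sign-extending variant needs a different sequence and is not covered by this sketch.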

penzn · 2019-09-03T20:50:44Z

pinging @dtig, @tlively, @Maratyszcza, and @arunetm

tlively · 2019-09-03T21:10:19Z

Looks fine to me from a tooling perspective. @ngzhian would the renumbering of the narrowing and widening operations here be problematic?

Maratyszcza · 2019-09-09T22:36:12Z

LGTM

dtig · 2019-09-10T19:33:46Z

Thanks for distilling multiple PRs/issues into this one. I'm in favor of merging this: collaborators from different application domains have indicated that this would be a useful operation to have, it has been a requested addition since 2017, and it seems to be reasonably well supported on all architectures. A couple of outstanding things to make sure we handle previous concerns:

There were some concerns about this set of operations when presented previously, but IIRC load+extend operations as enumerated here would be okay to include. cc @sunfishcode
This was originally proposed in conjunction with removal of i8x16.mul because of the large number of overflow cases, but this PR doesn't seem to address that. Is the motivation still to remove the i8x16.mul operation?

arunetm · 2019-09-10T19:49:55Z

There was general consensus on removing i8x16.mul. @penzn can we update this PR to address it as well?

penzn · 2019-09-10T20:48:57Z

Do you mean the discussion in #28? Will add a commit to remove those if there are no objections.

tlively · 2019-09-12T17:30:38Z

Can we add formal pseudocode to SIMD.md for these instructions?

penzn · 2019-09-12T17:33:43Z

Good point, maybe we should.

penzn · 2019-09-12T17:43:55Z

Sorry for double-posting. We don't have semantics for memory ops, and I am not sure how to describe that yet. The "extend" part of the operation should be very similar to the "widen" operation, which does not have semantics yet either.

tlively · 2019-09-12T21:14:42Z

Ok, I'm fine with merging this without pseudocode. I haven't thought of any ambiguities in the semantics here.
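
For readers who want something concrete in place of the missing pseudocode, here is an informal scalar sketch in Python of what a load-and-extend operation computes. It is not spec text. The function name `load_extend`, its parameters, and the use of `IndexError` to stand in for an out-of-bounds trap are all assumptions made for illustration. The six variants are taken to be the signed and unsigned forms of 8-to-16, 16-to-32, and 32-to-64-bit extension, each reading 64 bits of little-endian memory.

```python
import struct

# Source lane width in bits -> (unsigned struct code, signed struct code, lane count).
# Every variant reads 64 bits, so the lane count is 64 // width.
_FORMATS = {8: ("B", "b", 8), 16: ("H", "h", 4), 32: ("I", "i", 2)}

def load_extend(memory, addr, src_bits, signed):
    """Read 8 bytes at addr and extend each lane to twice its width.

    Returns the result lanes as unsigned bit patterns of width 2 * src_bits.
    """
    if src_bits not in _FORMATS:
        raise ValueError("source lanes must be 8, 16, or 32 bits wide")
    if addr < 0 or addr + 8 > len(memory):
        # Stand-in for an out-of-bounds memory access.
        raise IndexError("out-of-bounds memory access")
    unsigned_code, signed_code, count = _FORMATS[src_bits]
    code = signed_code if signed else unsigned_code
    lanes = struct.unpack_from(f"<{count}{code}", memory, addr)
    # Masking a negative Python int yields its two's-complement bit pattern,
    # which is exactly the sign-extended lane value.
    mask = (1 << (2 * src_bits)) - 1
    return [lane & mask for lane in lanes]
```

For example, the byte `0xFF` becomes `0x00FF` when zero-extended to 16 bits and `0xFFFF` when sign-extended.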

Only i16x8 and i32x4 are encoded in this commit mainly because i8x16 and i64x2 do not have simple encodings in x86. i64x2 is not required by the SIMD spec and there is discussion (WebAssembly/simd#98 (comment)) about removing i8x16.

ngzhian · 2019-11-07T20:46:29Z

Sorry I'm late to point this out:

VLD1.16 {dX}, [rAddr] doesn't allow for load via base + index (both in registers).
According to the manual, ARM DDI 0487D.b C7-1629, it only allows for a post index addressing mode, or an immediate offset.
So it will need a temporary add of base + offset, before passing the final memory address to VLD1.
Am I right in this interpretation of the VLD1 instruction?

Maratyszcza · 2019-11-07T21:05:01Z

You are right that VLD1 instructions don't support addressing with an offset. There is also the option of using the VLDR instruction, which supports an immediate (but not register) offset from a register base.

ngzhian · 2019-11-08T00:12:20Z

Thanks, I think I'll implement it using an add to a temporary first, and then add the optimization to use VLDR if the offset can be an immediate.
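
The plan discussed above can be sketched as a small selection function in Python. It is only an illustration of the decision, not V8's code. The function name, the scratch register `ip`, the destination `d0`, and the emitted strings are hypothetical. The accepted VLDR immediate range (multiples of 4 from -1020 to 1020) is an assumption about the A32 encoding and should be checked against the ARM manual.

```python
def select_vector_load(base_reg, offset, scratch_reg="ip"):
    """Pick an instruction sequence for a 64-bit vector load on ARMv7.

    offset is either an int (immediate) or a str (register name).
    Returns a list of assembly-like strings, for illustration only.
    """
    if isinstance(offset, int):
        if offset == 0:
            # No offset: VLD1 can use the base register directly.
            return [f"vld1.16 {{d0}}, [{base_reg}]"]
        # Assumed VLDR immediate range; verify against the architecture manual.
        if offset % 4 == 0 and -1020 <= offset <= 1020:
            return [f"vldr d0, [{base_reg}, #{offset}]"]
        # Fallback: compute the address in a temporary first.
        return [
            f"add {scratch_reg}, {base_reg}, #{offset}",
            f"vld1.16 {{d0}}, [{scratch_reg}]",
        ]
    # Register offset: VLD1 has no base + index form, so add to a temporary.
    return [
        f"add {scratch_reg}, {base_reg}, {offset}",
        f"vld1.16 {{d0}}, [{scratch_reg}]",
    ]
```

A real backend would also need to check VLDR's alignment requirements and whether the immediate in the fallback `add` is encodable. Both are ignored here.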

penzn · 2019-11-12T23:45:52Z

@ngzhian are you working on implementing that in V8? We wanted to get some timings for this and can lend a hand with the implementation.

ngzhian · 2019-11-12T23:53:42Z

Yup I am, I have done it all for x64, so if you are interested only for x64 you can build locally and run. I have not started on arm/arm64/ia32 yet, so if you want to pick those up, lmk!

penzn · 2019-11-14T19:12:38Z

I believe @rrwinterton was interested in testing this. Arm: probably not. Let us think about ia32; it could be that somebody else would be willing to take it as well.

ngzhian · 2019-11-14T22:26:54Z

Sounds good, I'll be working on arm/arm64 soon, and will leave ia32 to yall for now, please keep me updated (here or via email) so we don't overlap :) Thanks!

Introduce Load and Extend

408027c

Incorporate review suggestions from WebAssembly#77

tlively reviewed Sep 3, 2019

proposals/simd/ImplementationStatus.md (resolved)

ngzhian reviewed Sep 3, 2019

proposals/simd/BinarySIMD.md (outdated, resolved)

Move extended loads to the bottom of opcode list

0e99279

Remove i8x16.mul

075c14a

Consensus for the removal is documented in WebAssembly#28 and WebAssembly#98.

dtig approved these changes Sep 12, 2019

tlively merged commit 36a8199 into WebAssembly:master Sep 13, 2019

This was referenced Sep 13, 2019

Load with extension operation #23

Closed

Adding new SIMD instructions to load sign and zero extend 8, 16 and 32 byte integers #28

Closed

Honry pushed a commit to Honry/simd that referenced this pull request Oct 19, 2019

Introduce Load and Extend (WebAssembly#98)

072bde0

And remove i8x16.mul, as documented in WebAssembly#28 and WebAssembly#98.

penzn mentioned this pull request Oct 31, 2019

i32x4.dot_i16x8_s instruction #127

Merged

penzn deleted the extended-load-upstream branch November 13, 2019 00:17

rrwinterton mentioned this pull request Feb 20, 2020

Proposal to add mul 32x32=64 #175

Closed

penzn mentioned this pull request Oct 7, 2021

No i8x16.mul? #524

Closed
