This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

simdsignunsignedextendedload.md #77

Closed
wants to merge 8 commits

Conversation

rrwinterton

proposes 8, 16, 32 signed and unsigned extended load instructions.

Member

@dtig dtig left a comment


Thanks for creating the PR! Please add the new proposed instructions to SIMD.md in their own section, and include them in BinarySIMD.md with proposed opcode numbers as well. The Spec should be architecture agnostic, so while the details about Intel/ARM instructions are useful, I think they are documented well in the existing documentation in the issue, and don't need to be in the Spec.


Currently as proposed there is an instructions defined in the WASM SIMD ISA as follows.

**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and
Contributor

@gnzlbg gnzlbg May 3, 2019


The instruction multiplies 32 8-bit integers: 16 integers from the first register with 16 integers from the second register, resulting in 16 8-bit integers.


**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and

multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this
Contributor


typo: it -> is


Also what do you mean with the distribution of the integers being flat? And even if that is the case, does it always result in overflow? e.g., if all integers are 0, then the distribution would be as flat as it gets but no overflow would happen.


multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this

would result in a large percent of the instructions with overflow. This is a problem for many applications.
Contributor

@gnzlbg gnzlbg May 3, 2019


i8x16.mul implements wrapping integer multiplication, so I don't see why overflow is a problem there (the behavior of these on overflow is well-defined; it's neither an error nor UB), and wrapping multiplication is a very common and desirable operation to provide (the text reads a bit like "wrapping multiply is bad").
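To make the "well-defined wrapping" point concrete, here is a minimal plain-Python sketch of a lane-wise wrapping 8-bit multiply (the function name is hypothetical, chosen for illustration; it models unsigned lanes only):

```python
def mul_wrap_u8(a, b):
    """Lane-wise multiply modulo 256: overflow wraps around,
    it is neither an error nor undefined behavior."""
    return [(x * y) % 256 for x, y in zip(a, b)]

# 200 * 2 = 400 overflows an 8-bit lane and wraps to 144
print(mul_wrap_u8([200, 3], [2, 5]))  # -> [144, 15]
```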


### Proposed new instructions

Six new load instructions are being proposed to make integer multiplies easier. i8x16zxload, i8x16sxload, i16x8zxload, i16x8sxload, i32x4zxload, i32x4sxload. This would make i8, i16, i32 multiplies useful and more practical for applications such as machine learning, image compression and video and rendering data processing.The new instructions would take consecutive integers of the corresponding size and zero sign extend and sign extend the consecutive bytes, words or dword to the promoted size of signed or unsigned data. An example of zero sign extend is shown below: Intel and ARM both have this capability by doing the following:
Contributor


Nowhere does the text say what the proposed instructions actually do, what are their argument types, results, operational semantics, etc.


and unsigned overflow on the data it operates on.

The following is a partial sample example of how sign extended loads are be used in a matrix multiply of 8 bit integers:
Contributor


Please show WASM instead.

* i32x2.zxload
* i8x8.sxload
* i16x4.sxload
* i32x2.sxload
Contributor


All these types are 64-bit wide vector types, but the spec does not define v64 anywhere.

Contributor


I find the document and the issue quite unclear, so I might be completely misunderstanding what problem this is trying to address, but if what you want is to be able load, e.g., 4 x 16-bit integers into a v128 containing 4 x 32-bit integers by sign or zero extending them, then I would prefer something that's more similar to the scalar intrinsics, e.g., like i32x4.load_i16x4_s(memarg) -> v128.

Member

@dtig dtig Jul 31, 2019


I think the intent here is to introduce something equivalent to the intel pmovzx(b/d/q) instruction, so the operation would be as follows -

Packed_Zero_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] <- ZeroExtend(SRC[7:0]);
DEST[31:16] <- ZeroExtend(SRC[15:8]);
DEST[47:32] <- ZeroExtend(SRC[23:16]);
DEST[63:48] <- ZeroExtend(SRC[31:24]);
DEST[79:64] <- ZeroExtend(SRC[39:32]);
DEST[95:80] <- ZeroExtend(SRC[47:40]);
DEST[111:96] <- ZeroExtend(SRC[55:48]);
DEST[127:112] <- ZeroExtend(SRC[63:56]);

So it's not strictly a V128->V128 operation. I do agree that this would benefit from clearer names instead of the zx/sx convention, which is not very readable.
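The pmovzx-style pseudocode above can be simulated in a few lines of plain Python (the helper names here are illustrative only, not proposed instruction names):

```python
def load8x8_zero_extend(mem, offset):
    """Read 8 consecutive bytes and widen each to an unsigned
    16-bit lane (pmovzxbw-style: the source is 64 bits wide,
    the result is a full 8x16-bit vector)."""
    return [b & 0xFF for b in mem[offset:offset + 8]]

def load8x8_sign_extend(mem, offset):
    """Sign-extending variant: each byte is read as signed (-128..127)."""
    return [b - 256 if b >= 128 else b for b in mem[offset:offset + 8]]

data = bytes([0xFF, 1, 2, 3, 4, 5, 6, 7])
print(load8x8_zero_extend(data, 0))  # -> [255, 1, 2, 3, 4, 5, 6, 7]
print(load8x8_sign_extend(data, 0))  # -> [-1, 1, 2, 3, 4, 5, 6, 7]
```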

"pmaddwd %%xmm0, %%xmm3 \n\t"
"paddd %%xmm2, %%xmm4 \n\t"
"paddd %%xmm3, %%xmm5 \n\t"

Contributor


typo: backticks missing

* i32x2.sxload

As a result of these new instructions a multiply can now be done without worrying about signed

Contributor


typo: line break

**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and

multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this

Contributor


typo: line break

Currently as proposed there is an instructions defined in the WASM SIMD ISA as follows.

**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and

Contributor


typo: line break

@gnzlbg
Contributor

gnzlbg commented May 3, 2019

Please link the issue (#28) and the rendered document in the top-level comment.

Add extended load definitions to SIMD.md
@penzn
Contributor

penzn commented Jul 27, 2019

I believe we have populated the opcodes and instruction descriptions in this PR. One thing to note is that it implements #28 and not #21 (the latter proposes vector-to-vector operations).




* movzxbw
Member


Did you mean to include the packed versions of these instructions (ex: pmovzxbw/pmovsxbw)? Please link the exact instruction you mean if that's not the one.

@@ -0,0 +1,64 @@
### **Proposal WebAssembly SIMD Modification**
Member


Can this document be removed? While this is useful information, I don't see a need for this to be included here.

Contributor


Do you want it to be included in the body of the PR or just simply remove it? @rrwinterton, what do you think?

@@ -666,6 +666,15 @@ natural alignment.

Load a `v128` vector from the given heap address.

Extended loads:
Member


Can this be made clear here that this is not operating on 64-bit widths widening to a 128-bit vector? Having a concise description here will be useful.

Contributor


Do you mean to clarify that we are extending every lane, rather than widening a 64-bit value?

@penzn
Contributor

penzn commented Aug 20, 2019

The description of the change (so that we can delete the file):

Proposed new instructions

Six new load instructions are being proposed to make integer multiplies easier: i8x16.zxload, i8x16.sxload, i16x8.zxload, i16x8.sxload, i32x4.zxload, i32x4.sxload. This would make i8, i16, and i32 multiplies useful and more practical for applications such as machine learning, image compression, and video and rendering data processing. The new instructions would take consecutive integers of the corresponding size and zero-extend or sign-extend the consecutive bytes, words, or dwords to the promoted size of unsigned or signed data. An example of zero extension is shown below; Intel and ARM both have this capability via the following:

Intel Instructions:

  • movzxbw
  • movzxwd
  • movzxdq
  • movsxbw
  • movsxwd
  • movsxdq

ARM Instructions:

  • LDR X0, [X1] Load from the address in X1
  • LDR X0, [X1, #8] Load from address X1 + 8
  • LDR X0, [X1, X2] Load from address X1 + X2
  • LDR X0, [X1, X2, LSL, #3] Load from address X1 + (X2 << 3)
  • LDR X0, [X1, W2, SXTW] Load from address X1 + sign extend(W2)
  • LDR X0, [X1, W2, SXTW, #3] Load from address X1 + (sign extend(W2) << 3)

So the new instructions for WASM would be defined as follows:

  • i8x8.zxload
  • i16x4.zxload
  • i32x2.zxload
  • i8x8.sxload
  • i16x4.sxload
  • i32x2.sxload

As a result of these new instructions a multiply can now be done without worrying about signed and unsigned overflow on the data it operates on.

The following is a partial sample of how sign-extended loads would be used in a matrix multiply of 8-bit integers:

       "pmovzxbw 0x00(%[mem]), %%xmm0\n\t"
       "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
       "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
       "pmaddwd %%xmm0, %%xmm2         \n\t"
       "pmaddwd %%xmm0, %%xmm3         \n\t"
       "paddd %%xmm2, %%xmm4           \n\t"
       "paddd %%xmm3, %%xmm5           \n\t"
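The overflow argument behind the assembly above can be sketched in plain Python: widen the 8-bit lanes first (as the proposed extending loads would), and every product then fits comfortably in a 16-bit lane. The helper name is hypothetical, not a proposed WASM instruction:

```python
def widening_mul_i8(a, b):
    """Sign-extend each 8-bit lane to 16 bits, then multiply.
    The largest possible product magnitude is (-128)*(-128) = 16384,
    which fits in a signed 16-bit lane, so no overflow can occur."""
    sx = lambda v: v - 256 if v >= 128 else v
    return [sx(x) * sx(y) for x, y in zip(a, b)]

# 0xC8 = 200 unsigned = -56 signed; (-56)*(-56) = 3136, no wrap
print(widening_mul_i8(bytes([200, 100]), bytes([200, 2])))  # -> [3136, 200]
```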

penzn added a commit to penzn/simd that referenced this pull request Aug 20, 2019
Clarify what are the inputs and outputs to extended loads, remove the
file with the description (move it to issue comments).

For WebAssembly#77
@AndrewScheidecker
Contributor

These names would be more consistent with the scalar extending load instructions (i.e. i32.load8_s):

| original | alternative 1 | alternative 2 |
| --- | --- | --- |
| i8x8.zxload | i16x8.load8x8_u | i16x8.load8_u |
| i16x4.zxload | i32x4.load16x4_u | i32x4.load16_u |
| i32x2.zxload | i64x2.load32x2_u | i64x2.load32_u |
| i8x8.sxload | i16x8.load8x8_s | i16x8.load8_s |
| i16x4.sxload | i32x4.load16x4_s | i32x4.load16_s |
| i32x2.sxload | i64x2.load32x2_s | i64x2.load32_s |

@tlively
Member

tlively commented Aug 21, 2019

Thanks for those suggestions, @AndrewScheidecker! I personally prefer alternative 1 because they are more explicit about how much memory is actually being loaded.

@penzn
Contributor

penzn commented Aug 21, 2019

I like that scheme as well, incorporated in rrwinterton#4

@penzn penzn mentioned this pull request Aug 22, 2019
penzn pushed a commit to penzn/simd that referenced this pull request Aug 28, 2019
Incorporate review suggestions from WebAssembly#77
penzn pushed a commit to penzn/simd that referenced this pull request Aug 28, 2019
Incorporate review suggestions from WebAssembly#77
@penzn penzn mentioned this pull request Aug 28, 2019
@dtig dtig closed this Sep 13, 2019