-
Notifications
You must be signed in to change notification settings - Fork 43
Conversation
proposes 8, 16, 32 signed and unsigned extended load instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating the PR! Please add the new proposed instructions to SIMD.md in their own section, and include them in BinarySIMD.md with proposed opcode numbers as well. The Spec should be architecture agnostic, so while the details about Intel/ARM instructions are useful, I think they are documented well in the existing documentation in the issue, and don't need to be in the Spec.
|
||
Currently as proposed there is an instructions defined in the WASM SIMD ISA as follows. | ||
|
||
**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instruction multiples 32 8-bit integers, 16 integers from the first register with 16 integers from the second register, resulting in 16 8-bit integers.
|
||
**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and | ||
|
||
multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: it -> is
Also what do you mean with the distribution of the integers being flat? And even if that is the case, does it always result in overflow? e.g., if all integers are 0
, then the distribution would be as flat as it gets but no overflow would happen.
|
||
multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this | ||
|
||
would result in a large percent of the instructions with overflow. This is a problem for many applications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i8x16.mul implements wrapping integer multiplication, therefore, I don't see why overflow there is a problem (as in the behavior of these on overflow is well-defined, its neither an error nor UB), and wrapping multiplies is a very common and desirable operation to provide (the text reads a bit like "wrapping multiply is bad").
|
||
### Proposed new instructions | ||
|
||
Six new load instructions are being proposed to make integer multiplies easier. i8x16zxload, i8x16sxload, i16x8zxload, i16x8sxload, i32x4zxload, i32x4sxload. This would make i8, i16, i32 multiplies useful and more practical for applications such as machine learning, image compression and video and rendering data processing.The new instructions would take consecutive integers of the corresponding size and zero sign extend and sign extend the consecutive bytes, words or dword to the promoted size of signed or unsigned data. An example of zero sign extend is shown below: Intel and ARM both have this capability by doing the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nowhere does the text say what the proposed instructions actually do, what are their argument types, results, operational semantics, etc.
|
||
and unsigned overflow on the data it operates on. | ||
|
||
The following is a partial sample example of how sign extended loads are be used in a matrix multiply of 8 bit integers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please show WASM instead.
* i32x2.zxload | ||
* i8x8.sxload | ||
* i16x4.sxload | ||
* i32x2.sxload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these types are 64-bit wide vector types, but the spec does not define v64
anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the document and the issue quite unclear, so I might be completely misunderstanding what problem this is trying to address, but if what you want is to be able load, e.g., 4 x 16-bit integers into a v128
containing 4 x 32-bit integers by sign or zero extending them, then I would prefer something that's more similar to the scalar intrinsics, e.g., like i32x4.load_i16x4_s(memarg) -> v128
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent here is to introduce something equivalent to the intel pmovzx(b/d/q) instruction, so the operation would be as follows -
Packed_Zero_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] <- ZeroExtend(SRC[7:0]);
DEST[31:16] <- ZeroExtend(SRC[15:8]);
DEST[47:32] <- ZeroExtend(SRC[23:16]);
DEST[63:48] <- ZeroExtend(SRC[31:24]);
DEST[79:64] <- ZeroExtend(SRC[39:32]);
DEST[95:80] <- ZeroExtend(SRC[47:40]);
DEST[111:96] <- ZeroExtend(SRC[55:48]);
DEST[127:112] <- ZeroExtend(SRC[63:56]);
So it's not strictly a V128->V128 operation. I do agree that this would benefit from clearer names instead of zx/sx convention that's not very readable.
"pmaddwd %%xmm0, %%xmm3 \n\t" | ||
"paddd %%xmm2, %%xmm4 \n\t" | ||
"paddd %%xmm3, %%xmm5 \n\t" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: backticks missing
* i32x2.sxload | ||
|
||
As a result of these new instructions a multiply can now be done without worrying about signed | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: line break
**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and | ||
|
||
multiplies them together resulting in an 8 bit value. If the distribution of the integers it flat this | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: line break
Currently as proposed there is an instructions defined in the WASM SIMD ISA as follows. | ||
|
||
**i8x16.mul** which is a register to register operation that takes 16 8 bit integers and | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: line break
Please link the issue (#28) and the rendered document in the top-level comment. |
Add extended load definitions to SIMD.md
Extended loads: 32->64 bit ops
|
||
|
||
|
||
* movzxbw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to include the packed versions of these instructions (ex: pmovzxbw/pmovsxbw)? Please link the exact instruction you mean if that's not the one.
@@ -0,0 +1,64 @@ | |||
### **Proposal WebAssembly SIMD Modification** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this document be removed? While this is useful information, I don't see a need for this to be included here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want it to be included in the body of the PR or just simply remove it? @rrwinterton, what do you think?
@@ -666,6 +666,15 @@ natural alignment. | |||
|
|||
Load a `v128` vector from the given heap address. | |||
|
|||
Extended loads: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be made clear here that this is not operating on 64bit widths widening to a 128-bit vector? Having a concise description here will be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to clarify that we are extending every lane, rather than widening a 64-bit value?
The description of the change (so that we can delete the file): Proposed new instructionsSix new load instructions are being proposed to make integer multiplies easier. i8x16.zxload, i8x16.sxload, i16x8.zxload, i16x8.sxload, i32x4.zxload, i32x4.sxload. This would make i8, i16, i32 multiplies useful and more practical for applications such as machine learning, image compression and video and rendering data processing. The new instructions would take consecutive integers of the corresponding size and zero sign extend and sign extend the consecutive bytes, words or dword to the promoted size of signed or unsigned data. An example of zero sign extend is shown below: Intel and ARM both have this capability by doing the following: Intel Instructions:
ARM Instructions:
So the new instructions for WASM would be defined as follows:
As a result of these new instructions a multiply can now be done without worrying about signed and unsigned overflow on the data it operates on. The following is a partial sample example of how sign extended loads are be used in a matrix multiply of 8 bit integers:
|
Clarify what are the inputs and outputs to extended loads, remove the file with the description (move it to issue comments). For WebAssembly#77
These names would be more consistent with the scalar extending load instructions (i.e.
|
Thanks for those suggestions, @AndrewScheidecker! I personally prefer alternative 1 because they are more explicit about how much memory is actually being loaded. |
I like that scheme as well, incorporated in rrwinterton#4 |
Incorporate review suggestions from WebAssembly#77
Incorporate review suggestions from WebAssembly#77
proposes 8, 16, 32 signed and unsigned extended load instructions.