Proposal to add mul 32x32=64 #175
I am supportive of adding this. (V8 uses this in a bunch of places, e.g. in x64 we use pmuludq for i64x2mul, so this proposal will expose the existing instruction.)
For completeness, here's a proposed ARM implementation:
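The original snippet is not preserved here; a plausible AArch64 lowering, sketched from the vuzp1q_u32 + vmull_u32 approach discussed later in this thread (register assignments hypothetical), is:

```
uzp1  v2.4s, v0.4s, v0.4s   // gather even lanes of a
uzp1  v3.4s, v1.4s, v1.4s   // gather even lanes of b
umull v0.2d, v2.2s, v3.2s   // widening 32x32=64 multiply of the low halves
```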
Thanks @jan-wassenberg for adding mappings to x64 and AArch64 instructions. This is possibly a useful operation, but the ARM sequence in this case is somewhat unfortunate: on AArch64, this would map to 5 instructions. For ARMv7, the code sequence is somewhat fuzzy.
So @jan-wassenberg @dtig, what you would like to have is a register instruction, as opposed to the load-and-extend instruction proposed in issue #98 (Introduce Load and Extend)? It depends on what you are doing, I guess: if you have to load anyway, you could keep it in a register. It also depends on how many you want and whether you start to spill. I am probably missing something.
@dtig You're welcome! I believe the DUPs can be elided? https://gcc.godbolt.org/z/zYJGPJ @rrwinterton The proposal is aimed at multiplication; it can be independent of load-extend. We're trying to expose pmuludq, because IIRC only AVX3 has full 64x64-bit multiplication.
@jan-wassenberg Not that I can see for pre-ARMv8; am I missing something? I would like to hear more opinions about this, and perhaps some links to code in the wild if possible. That said, given the general positive interest, I would not be opposed to prototyping and benchmarking this, considering most of the concerns about performance are mine.
@dtig Ah yes, looks like those are A64 instructions; I am not very familiar with v7.
Thanks @jan-wassenberg. Originally we planned to optimize the subset of operations just for Neon support, but as more engines and applications have come up, it does look like they care about performance on some subset of older ARM devices, especially on mobile. Most of the operations until now have had codegen that's fairly reasonable, so this hasn't been an issue, but with some of the newer proposed operations it's less clear how much we should index on this particular aspect. To be consistent across architectures, though, it would make sense to bias this towards being future-facing. I'll take an action item to get a more accurate sense for this and open an issue, or respond here.
As a tangent, I would like to mention that for fixed-point calculations, it would be useful to also have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH. I mention this here because some may think that the availability of a 32x32=64 integer multiplication would remove the need for that, but that would be sub-optimal: staying within 32 bits means doing 4 scalar operations per 128-bit vector operation, and most applications want the rounding flavor (SQRDMULH, not SQDMULH), which would require a few more instructions to emulate if the instruction is missing. In practice this would result in applications making compromises between accuracy and performance. (This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library; see e.g. the usage of these instructions here: https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )
Branched into Issue #221.
Closing since we have 32x32=64, as Marat pointed out. |
I haven't seen _mm_mul_epu32 = PMULUDQ mentioned here. This is important for crypto/hashing, and a quick search on Github turned up uses in poly1305, QNNPACK, skia, HighwayHash.
The proposed semantics are: u64x2 r = mul_even(u32x4 a, u32x4 b), where r[i] = u64(a[2*i]) * u64(b[2*i]).
MULE on PPC has the same semantics as PMULUDQ on x86. ARM can vmull_u32 after two vuzp1q_u32.
Does this seem like a useful operation to add?