Add Quasi-Fused Multiply-Add/Subtract instructions #79
Conversation
Is this equivalent to LLVM's |
Thanks for the thorough analysis! I agree that FMA is desirable, and it's always been about "how" and "when" rather than "if". In light of the decision to remove the subnormal nondeterminism, effectively declaring 32-bit ARM NEON implementations non-conforming, another option here would be to just add a regular FMA instruction, without the nondeterminism. This has portability downsides, but on the other hand it would address some use cases not covered by
A variant would be to standardize simd128 without FMA, and then add FMA as a separate feature, which implementations could decide to implement, and toolchains could decide to depend on, separately from the base simd128. In the long term, perhaps all implementations would support FMA, while in the short term this might allow for a transition period.
LLVM's
|
I thought that LLVM
Note that under this proposal a module that requires FMA would fail to validate on an implementation that does not support FMA, so such a transition period might result in users having to ship multiple modules (with and without FMA). Something like #80 would help here. |
@gnzlbg |
There are plenty of new x86 CPUs which don't support FMA operations. First, the low-power series from Intel (Goldmont/Goldmont Plus microarchitectures) and AMD (Jaguar microarchitecture) lack FMA instructions. Secondly, Intel has a long history of stripping any post-SSE4 ISA extension from the low-cost versions (i.e. Celeron & Pentium) of its mainstream architectures. E.g. the Pentium Gold G5600 is based on the latest Coffee Lake microarchitecture, but doesn't support the AVX and FMA instruction sets.
I expect that the decision whether to use FMA would be based only on hardware characteristics, i.e. whether the processor supports FMA and whether it is fast on that processor. Unlike LLVM's |
It seems that I misused the concept of "WebAssembly module". By "WebAssembly module" I mean everything in a ".wasm" or ".wast" file, but "WebAssembly module" seems to have a different meaning in the Wasm spec. What would be the right wording here? |
These are sometimes used interchangeably, but a WebAssembly module is the result of compiling wasm bytecode. Depending on the context you may be looking for different words - for everything in a .wasm file, it could just be the Wasm binary, but in the cases where there are references to failing to validate/compile, "module" would still be the right wording IMO.

Thanks for the thorough analysis! Having thought about this a little more, I have a few concerns about the inclusion of this in the SIMD MVP. Firstly, the performance inconsistencies between FMA and non-FMA enabled hardware for this set of operations would be hard to miss. We have usually tried to stick to the instructions that are available across the board, setting SSE4.1 and Neon as thresholds to guarantee performance across a large cross-section of devices. Introduction of FMA makes this predictable performance across devices somewhat nebulous. Secondly, there is a compromise to be made here with the inclusion of QFMA, in that it's not strictly FMA, and including just FMA without non-determinism violates portability. It is true that this is already the case due to the removal of subnormal determinism, but I'm more comfortable with that because it is consistent with the existing Wasm spec, and the number of devices that it would affect is not large, and shrinking, whereas the FMA non-determinism is a larger surface area. I'm leaning towards having FMA without the non-determinism, but punting this to Post-MVP. That said, we have an open V8 bug to prototype/experiment with variants of this and their polyfills on non-FMA enabled hardware. I'm labeling this with "pending prototype data" to follow up with after we have a prototype. |
Thanks to the great work of @tlively, who added QFMA in LLVM, Clang, and Binaryen, and @ngzhian, who implemented QFMA in V8, it is now possible to try this instruction on real workloads. I evaluated a prototype QFMA-enabled implementation of neural network inference, and updated the top-level post with the new results. |
FWIW from my point of view, this is a good addition. In native code [I am used to], the non-determinism between different compilers / architectures is a given - different optimizations / lowering / precision of different operations are just a fact of life. Within these constraints, automatic FMA substitution (-ffast-math -mfma or equivalent) usually gives appreciable performance gains in the 10-20% range on real floating-point-heavy code. In the native world it's not always easy to exercise this, because you don't know if FMA is supported on the target platform, and compiling multiple versions usually doesn't work out of the box. But in WebAssembly, due to JIT compilation, something like qfma can work well. So in practice:
(I understand that this runs contrary to the desire to get identical results between different platforms... but the reality for SIMD is that for floating-point math, often performance trumps determinism in my experience. Of course in theory we could include instructions like qfma and estimated rsqrt in a separate proposal post-MVP, but this gets hard to deal with in practice when the actual support for different instruction sets between implementations varies) |
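For readers unfamiliar with the flags mentioned above, here is a minimal sketch (a hypothetical kernel, not code from this thread) of the kind of loop that automatic FMA substitution applies to when compiling with, e.g., -mfma together with -ffast-math or -ffp-contract=fast:

```c
#include <stddef.h>

/* Hypothetical multiply-accumulate kernel: out[i] += a[i] * b[i].
 * With FMA enabled and contraction allowed, compilers such as Clang/GCC
 * may emit vfmadd* instructions here; otherwise they emit separate
 * multiply and add instructions. */
void multiply_accumulate(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] += a[i] * b[i];
    }
}
```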
My experience runs contrary to yours, @zeux. In my tests, FMA instructions are always slower under x64. I did notice fast-math automatically generating them, and it is one of the many reasons why I hate fast-math. FMA instructions are designed for throughput, not latency. With modern processors easily being able to dual-issue mul/add pairs, FMA offers little benefit aside from reducing the code size and potentially using fewer registers (although in practice I haven't observed this either). Here are some benchmarks I ran with it in my own math lib here. I had high hopes for it, and judging from the replies I got on Twitter, FMA is underwhelming. It seems to me that it would perform unusually well in a setup like neural network evaluation because it is most likely throughput bound.

Fast-math also doesn't give a gain anywhere near what you claim in the workloads I have seen, not for a long time. This might have been true at some point in the past, but not with VS2017 and later generations. I disabled fast-math in my animation compression code because determinism is more important than whatever gain might arise, and I saw no impact on performance after turning it off. 3ds Max 2019 also disabled fast-math when they switched to VS2017. While some workloads saw a minor impact, overall we only saw benefits from turning it off. Determinism matters more than performance unless you really know what you are doing, and I would argue that if you do, then fast-math offers little benefit because you can do what the compiler does by hand.

Fast-math means that from compiler version to version, behavior can change, sometimes drastically, as we've seen with VS2017. It introduced sin/cos pair calculation through SSE float32 arithmetic, while prior compilers calculated both separately with float64 arithmetic (the VS stdlib does this for transcendental functions). This leads to a dramatic reduction in accuracy and can lead to visual artifacts and other undesired behavior. Visual Studio defaults to precise math, and that is a sane default IMO. I hope V8 & co. use a similar default. |
This obviously depends on the workload. In (some) floating point heavy code that I see, FMA results in performance gains.
On modern Intel processors the latency is the same as that of multiplication, no? So you're making the latency strictly shorter by (potentially) reducing the critical path, or reducing the port pressure, allowing for more co-issue opportunities. On Skylake, both multiplication and FMA have 0.5-cycle reciprocal throughput and 4 cycles of latency. So on computations that have a lot of opportunity for fusing, you're strictly winning - you're not going to lose. Of course this can vary with architecture.
I'm specifically referring to floating-point-heavy, matrix-like code that's dominated by multiplies and adds, compiled with clang. Please refer to the numbers posted by @Maratyszcza for even larger gains; the gains I'm used to are more moderate. I'm not sure to what extent it's valid to compare this on Visual Studio.
The default is always precise math, so this seems orthogonal? My point is precisely that if we do have qfma support, the developer of the code is in control - they can enable the use of the fused instruction for extra performance gains, or keep it off. If we don't have qfma support, we don't have this optimization opportunity. |
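To make the latency argument concrete, a minimal scalar sketch (the Skylake numbers are purely illustrative, taken from the discussion above; exact figures are in Agner Fog's tables, reference [1] below):

```c
#include <math.h>

/* a + b*c with separate instructions: the add depends on the multiply,
 * so the critical path is roughly mul latency + add latency
 * (about 4 + 4 cycles on a Skylake-class core). */
static inline float madd_separate(float a, float b, float c) {
    float t = b * c;   /* multiply */
    return a + t;      /* add, waits for t */
}

/* The same computation fused: a single FMA with ~4 cycles of latency,
 * so on such cores fusing can only shorten the dependency chain. */
static inline float madd_fused(float a, float b, float c) {
    return fmaf(b, c, a);
}
```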
Here's a motivating example from my experiments (caveats apply, ymmv, etc. etc.): this is an implementation of a slerp optimization from https://zeux.io/2015/07/23/approximating-slerp/ + https://zeux.io/2016/05/05/optimizing-slerp/ for fitted nlerp (the middle ground between nlerp and the much more precise version, onlerp). Baseline: https://gcc.godbolt.org/z/w-j8QW, 64.97 cycles per loop iteration. I would expect that the results produced with llvm-mca closely match the actual timing results on a stream transform using this function, e.g. "given two streams of quaternions, interpolate between them". |
Beware that your baseline is in SSE while fast-math+fma is in AVX2. This does not change the overall conclusion, though. As a side note, since version 5, GCC automatically fuses MULs and ADDs into FMAs by default at -O2, without -ffast-math. |
@nfrechette Using FMAs cannot be slower on recent hardware, because the FMA instruction has the exact same latency and throughput as the multiplication (latency 4c, throughput 2/c on Skylake).
Also, you mention MSVC, but beware that MSVC lags behind all its competitors when it comes to speed. |
Thanks! Good catch, I forgot about this. I've updated the post to include fast-math AVX2 (55.82 cycles). As you say, it doesn't change the overall conclusion much; FMA vs AVX2 here is 55.82 / 40.79 = 1.37x speedup. |
Thank you for this proposal! We just released an alpha version of the TensorFlow.js WASM backend, which was one of our most requested features. Benchmarks show 30-50% speedup with QFMA on various ML models, on top of existing SIMD. Adding these instructions would greatly benefit machine learning and numerical computation libraries in general. |
FMA is definitely not free on Ryzen. The picture isn't clear cut. For example, Ryzen executes up to 5 instructions per cycle:
- addps takes 1 op, has a latency of 3, a reciprocal throughput of 0.5, and executes on pipes 2 or 3
- mulps takes 1 op, has a latency of 3, a reciprocal throughput of 0.5, and executes on pipes 0 or 1
- fmadd takes 1 op, has a latency of 5, a reciprocal throughput of 0.5, and executes on pipes 0 or 1

In comparison, Skylake executes up to 4 instructions per cycle:
- addps takes 1 uop, has a latency of 4, a reciprocal throughput of 0.5, and executes on port 0 or 1
- mulps takes 1 uop, has a latency of 4, a reciprocal throughput of 0.5, and executes on port 0 or 1
- fmadd takes 1 uop, has a latency of 4, a reciprocal throughput of 0.5, and executes on port 0 or 1

In code that does mul/add pairs, on Ryzen we can execute up to 4 per cycle: two adds (pipes 2 and 3) and two muls (pipes 0 and 1). But we can only execute 2 fmadd per cycle, on pipes 0 and 1, and the latency is higher. In code that can mix in other work on the other ports to increase the instructions-per-cycle rate, that is probably fine, but it is definitely not a clear win.

On Skylake, regardless of whether we use mul/add pairs or fmadd, we can only execute 2 instructions per cycle, on ports 0 and 1. Here the latency is the same, and the addition does appear to become more or less free.

If we look one generation earlier, at Broadwell:
- addps takes 1 uop, has a latency of 3, a reciprocal throughput of 1, and executes on port 1
- mulps takes 1 uop, has a latency of 3, a reciprocal throughput of 0.5, and executes on port 0 or 1
- fmadd takes 1 uop, has a latency of 5, a reciprocal throughput of 0.5, and executes on port 0 or 1

Here, at most, with mul/add pairs we'll execute 2 instructions per cycle, same as fmadd, but fmadd now has a much longer latency.

Earlier generations on either platform have similarly poor performance with FMA. It appears that Knights Landing has 1 uop per instruction, a latency of 6 for all 3 instructions, and a reciprocal throughput of 0.5; while add/mul can use either port 0 or 1, fmadd isn't documented by Agner, and it seems a sane assumption that it executes on the same ports, same as Skylake.

Mobile CPUs might similarly have fewer execution ports or longer latencies for some instructions to reduce the power draw and the heat generated; they might not fare as well. FMA can be a net win, clearly, but it is too early IMO to claim it is always so on common hardware today. Most people with a browser will not have a high-end x64 CPU, and most workloads might not be throughput bound. I can't speak for FMA on ARM NEON beyond my observation that it was a net win on the iPad Pro I have. Here as well, though, its performance varies significantly from my other ARM64 devices.

I would consider support for FMA to be unnecessary for now, although I do see the benefit of supporting it. |
With fmadd having the latency of 5 on older CPUs, you still save latency because the mul+add pair has a combined latency of 6 cycles due to the dependency. You don't save as much as you do on Skylake of course. If I take the benchmark I posted earlier and ask llvm-mca to generate timings for Broadwell, I get 48.51 cycles for AVX2 fast-math version and 42.29 cycles for FMA fast-math version, which is still 15% faster. The beauty of this proposal is that even if somehow there are CPUs that execute fma slower than mul+add pairs (although I still don't see how / when this would be possible?), the code generation engine can disable it. Clearly there are demonstrable noticeable wins on multiple workloads on modern Intel chips, it seems like we would not have much to discuss if the faster execution didn't come at a price of having different results on different CPUs. FWIW curiously gcc fuses mul+add into fmadd by default at -O2 when targeting ISAs with FMA at least for x64, without requiring -ffast-math at all. |
Am I right that QFMA cares only about performance and is not correctly rounded according to the true FMA operation in IEEE 754-2008? There is another way to polyfill FMA with respect to correctness. It is much harder than simply RN(a + RN(b * c)), but it is useful for applications with arbitrary-precision arithmetic. |
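For context on what a correct software polyfill involves, here is a minimal sketch of the usual first building block, Dekker's two-product with Veltkamp splitting (this is not necessarily the specific method the commenter refers to, and it assumes strict IEEE-754 double arithmetic with no compiler FP contraction). A fully correctly rounded FMA additionally has to combine the addend with the hi/lo pair and round exactly once, which is where most of the difficulty lies:

```c
/* Computes hi + lo == x * y exactly (barring overflow/underflow),
 * assuming round-to-nearest doubles and no FP contraction by the compiler. */
static void two_product(double x, double y, double *hi, double *lo) {
    const double split = 134217729.0;                 /* 2^27 + 1 */
    double cx = split * x;
    double x_hi = cx - (cx - x), x_lo = x - x_hi;     /* Veltkamp split of x */
    double cy = split * y;
    double y_hi = cy - (cy - y), y_lo = y - y_hi;     /* Veltkamp split of y */
    *hi = x * y;
    *lo = ((x_hi * y_hi - *hi) + x_hi * y_lo + x_lo * y_hi) + x_lo * y_lo;
}
```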
I also want to pile onto this discussion and add support for this proposal. TensorFlow.js released a new WASM backend which greatly lifts the floor for CPU-accelerated machine learning. This QFMA proposal will give us even larger wins on top of SIMD, which starts reducing the divide between CPU and WebGL accelerators (WebGL has a lot of driver issues w.r.t. precision / correctness). This proposal would really be great for machine learning on the web! |
With this Quasi-FMA proposal you are going to bake in a new specification that does not guarantee the same results on different machines, and will fail CI tests not only in the spec tests but also in all other software. You will probably get about the same results in an ML model, because you use matrix multiplication, but this error will grow after each operation, and other algorithms may not be robust at all; warn everybody about this in the manuals.

Why don't you want to allow feature detection and inlined functions? Here it is basically the same, but it gives everyone a raw, predictable API: QFMA(a, b, c) = isFmaAvailable() ? fma(a, b, c) : a * b + c

Other software wants correctness and robustness in execution in general. You solve precision problems on the GPU in machine learning, but you create a huge problem for other software; you literally return FP arithmetic to the 70s, before IEEE 754, when no one could guarantee anything. Also in your proposal

Also check how it is implemented in Java and Julia: they have FMAC (FMA Accurate), but sometimes it is very slow. By the way, it is not as slow as a BigNumber implementation, because it is based on Sylvie Boldo's theorems, at least in Julia. FMAC(a, b, c) = isFmaAvailable() ? fma(a, b, c) : sylvie_boldo_polyfill(a, b, c)

There is also a third variant, which could be popular in Rust, AssemblyScript, and Emscripten, because you don't have
wasm_pow(x) = isFmaAvailable() ? algo_with_fma(x) : algo_without_fma(x) |
Ok, even if we have QFMA with feature detection (#80), we can still create all of these functions, because:
FMAC(a, b, c) = isFmaAvailable() ? QFMA(a, b, c) : sylvie_boldo_polyfill(a, b, c)
wasm_pow(x, e) = isFmaAvailable() ? algo_with_fma(x, e, QFMA) : algo_without_fma(x, e)
But only if you use the real fma when it is available. |
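A minimal C sketch of the dispatch pattern outlined in these two comments (isFmaAvailable and sylvie_boldo_polyfill are hypothetical placeholders, not an existing API; the real mechanism would be a Wasm feature-detection facility such as the one discussed in #80):

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical runtime feature probe. */
extern bool isFmaAvailable(void);

/* Hypothetical correctly rounded software fallback
 * (e.g. built from error-free transformations); not implemented here. */
extern double sylvie_boldo_polyfill(double a, double b, double c);

/* FMAC: always correctly rounded a + b*c; fast when hardware FMA exists.
 * Note the operand order: C's fma(x, y, z) computes x*y + z. */
static double fmac(double a, double b, double c) {
    return isFmaAvailable() ? fma(b, c, a) : sylvie_boldo_polyfill(a, b, c);
}
```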
The performance numbers from the QFMA operations have been quite compelling, and @Maratyszcza's data demonstrates a 30-50% speedup over the current SIMD proposal. In the interest of standardizing the MVP of the SIMD proposal, though, one of the goals is minimizing non-determinism, and different results on different platforms are unfortunately out of scope for the current proposal as it stands. That said, these will be available in a future SIMD proposal that is currently under discussion and biases towards performance over consistency. Until we have a repository for that proposal, please continue to use this issue for discussion. Marking this with a Post-MVP label. |
Introduction
Most modern processors support Fused Multiply-Add (FMA) family instructions. These instructions compute the multiplication and addition of floating-point numbers without intermediate rounding. Elimination of intermediate rounding improves accuracy in most numerical algorithms, and moreover FMA instructions provide performance improvements on all processors which support them.
However, on processors which do not support FMA instructions, it is expensive to emulate these operations exactly, without intermediate rounding.
This PR introduces the Quasi-Fused Multiply-Add instruction, which provides the performance and accuracy benefits of FMA where supported, while preserving compatibility with older processors. The Quasi-Fused Multiply-Add (QFMA) instruction represents a + b * c with optional rounding after the multiplication. WebAssembly implementations are required to be consistent, and either always generate an FMA instruction or always generate a multiplication+addition pair for a QFMA instruction within a module. The QFMA instruction is accompanied by a Quasi-Fused Multiply-Subtract (QFMS) instruction, which represents a - b * c with similar optional rounding after the multiplication.
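In scalar terms, the two evaluations an implementation is allowed to choose between look like this (a minimal sketch of the semantics described above, not normative text from the proposal):

```c
#include <math.h>

/* Option A: fused lowering - no rounding after the multiplication. */
static float qfma_fused(float a, float b, float c) {
    return fmaf(b, c, a);   /* round(a + b*c) */
}

/* Option B: non-fused lowering - the product is rounded, then added. */
static float qfma_unfused(float a, float b, float c) {
    return a + b * c;       /* round(a + round(b*c)) */
}

/* A conforming engine picks one of these lowerings and uses it
 * consistently for QFMA within a module. */
```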
Performance Impact
Fused Multiply-Add instructions improve performance in two ways. First, they perform a multiplication and an addition in a single instruction, which on supporting processors costs no more than the multiplication alone. Secondly, they improve register allocation when the QFMA(a, b, c) result overwrites operand a. This situation is very typical in dense linear algebra and neural network computations, and without FMA the implementation would have to allocate a temporary register t for the result of the multiplication (t <- b * c ; a <- a + t).
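As an illustration of the register-allocation point (a minimal sketch; the kernel is hypothetical):

```c
#include <math.h>
#include <stddef.h>

/* Dot-product style accumulation: the accumulator is both an input and the
 * output of every step. With FMA each step maps to one destructive
 * instruction (e.g. VFMADD231PS or FMLA); without FMA the engine needs a
 * temporary register t for the product (t <- b*c ; acc <- acc + t). */
float dot(const float *b, const float *c, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = fmaf(b[i], c[i], acc);
    }
    return acc;
}
```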
Evaluation on native ARM64 code
To estimate the speedup from FMA in practice, I replaced the NEON FMA intrinsics (vfmaq_f32 and vfmaq_lane_f32) in an ARM64 implementation of neural network inference with non-fused NEON intrinsics (vmlaq_f32 and vmlaq_lane_f32). Both versions were compiled with Clang to native (i.e. not WebAssembly) binaries for the ARM64 architecture, and evaluated in single-threaded mode. Speedups of the FMA-based version compared to the version with separate multiplication + addition are presented in the table below:

Across 3 neural network architectures and 4 mobile devices, the minimum speedup is 1.4X, and the speedup on the most compute-intensive neural network (MobileNet v2) exceeds 2X. I suggest that an improvement this big justifies extending the WebAssembly SIMD spec with 4 new instructions.
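For reference, the substitution at the intrinsics level looks roughly like this (a minimal sketch; only the intrinsic names come from the description above, the kernel shape is hypothetical):

```c
#include <arm_neon.h>

/* Fused step: acc <- acc + b*c with a single rounding (maps to FMLA). */
static inline float32x4_t step_fused(float32x4_t acc, float32x4_t b, float32x4_t c) {
    return vfmaq_f32(acc, b, c);
}

/* Non-fused step: the product is rounded before the addition
 * (compiled to separate multiply and add instructions, or a non-fused MLA). */
static inline float32x4_t step_unfused(float32x4_t acc, float32x4_t b, float32x4_t c) {
    return vmlaq_f32(acc, b, c);
}
```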
[October 3 update] Evaluation of QFMA prototype in V8
@tlively implemented experimental support for the QFMA instruction in LLVM & Clang (commit) and Binaryen (PR), and @ngzhian implemented QFMA lowering in V8 for the x86-64 (commit) and ARM64 (commit) architectures. Due to the experimental nature of the prototype toolchain, it is conservative in leveraging QFMA instructions, and generates them only through an explicit intrinsic.
I ported the most critical micro-kernels in the XNNPACK neural network operator library to use QFMA, and evaluated its performance on 9 neural network models. The table below presents the results:
While the speedup from QFMA in the prototype WebAssembly implementation is smaller than in native code, QFMA improved performance on all 6 evaluated devices, and on modern CPU microarchitectures (Intel Skylake, ARM Cortex-A76, Samsung Exynos-M4) QFMA improves performance on average by one third.
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
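As a compiler-level companion to the assembly patterns below, this is roughly how qfma/qfms map onto x86 FMA3 intrinsics (a minimal sketch for orientation only; it requires an FMA-enabled target, e.g. -mfma, and the lowering lists below remain the reference):

```c
#include <immintrin.h>

/* f32x4.qfma(a, b, c) = a + b*c  ->  VFMADD* */
static inline __m128 qfma_f32x4(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(b, c, a);    /* (b*c) + a */
}

/* f32x4.qfms(a, b, c) = a - b*c  ->  VFNMADD* */
static inline __m128 qfms_f32x4(__m128 a, __m128 b, __m128 c) {
    return _mm_fnmadd_ps(b, c, a);   /* -(b*c) + a */
}
```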
x86/x86-64 processors with FMA3 (but no FMA4) instruction set
These processors include Intel Haswell (and later) and AMD Zen (and later).
- a = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f32x4.qfma(a, b, c) is lowered to one of six options:
  - VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
  - VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
  - VMOVUPS xmm_d, xmm_b + VFMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
  - VMOVUPS xmm_d, xmm_b + VFMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
  - VMOVUPS xmm_d, xmm_c + VFMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
  - VMOVUPS xmm_d, xmm_c + VFMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
- a = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f32x4.qfms(a, b, c) is lowered to one of six options:
  - VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
  - VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
  - VMOVUPS xmm_d, xmm_b + VFNMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
  - VMOVUPS xmm_d, xmm_b + VFNMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
  - VMOVUPS xmm_d, xmm_c + VFNMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
  - VMOVUPS xmm_d, xmm_c + VFNMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
- a = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f64x2.qfma(a, b, c) is lowered to one of six options:
  - VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
  - VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
  - VMOVUPD xmm_d, xmm_b + VFMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
  - VMOVUPD xmm_d, xmm_b + VFMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
  - VMOVUPD xmm_d, xmm_c + VFMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
  - VMOVUPD xmm_d, xmm_c + VFMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)
- a = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f64x2.qfms(a, b, c) is lowered to one of six options:
  - VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
  - VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
  - VMOVUPD xmm_d, xmm_b + VFNMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
  - VMOVUPD xmm_d, xmm_b + VFNMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
  - VMOVUPD xmm_d, xmm_c + VFNMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
  - VMOVUPD xmm_d, xmm_c + VFNMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)
x86/x86-64 processors with FMA3 and FMA4 instruction sets
These processors include AMD Piledriver, AMD Steamroller, AMD Excavator, but not AMD Zen.
- a = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- a = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- a = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- a = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
  - VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
- b = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
  - VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
- c = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
  - VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
- d = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
x86/x86-64 processors with FMA4 (and no FMA3) instruction sets
AMD Bulldozer is the only family of such processors.
- d = f32x4.qfma(a, b, c) is lowered to one of two options:
  - VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- d = f32x4.qfms(a, b, c) is lowered to one of two options:
  - VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- d = f64x2.qfma(a, b, c) is lowered to one of two options:
  - VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
- d = f64x2.qfms(a, b, c) is lowered to one of two options:
  - VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
  - VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
ARM64 processors
All ARM64 application processors support SIMD with FMA.
- a = f32x4.qfma(a, b, c) is lowered to FMLA Va.4S, Vb.4S, Vc.4S
- d = f32x4.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Vd.4S, Vb.4S, Vc.4S
- a = f32x4.qfms(a, b, c) is lowered to FMLS Va.4S, Vb.4S, Vc.4S
- d = f32x4.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Vd.4S, Vb.4S, Vc.4S
- a = f64x2.qfma(a, b, c) is lowered to FMLA Va.2D, Vb.2D, Vc.2D
- d = f64x2.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Vd.2D, Vb.2D, Vc.2D
- a = f64x2.qfms(a, b, c) is lowered to FMLS Va.2D, Vb.2D, Vc.2D
- d = f64x2.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Vd.2D, Vb.2D, Vc.2D
ARMv7 processors with NEONv2 (NEON-FMA) instruction set
Most 32-bit ARM application processors support SIMD (NEON) with FMA instructions.
- a = f32x4.qfma(a, b, c) is lowered to VFMA.F32 q_a, q_b, q_c
- d = f32x4.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F32 q_d, q_b, q_c
- a = f32x4.qfms(a, b, c) is lowered to VFMS.F32 q_a, q_b, q_c
- d = f32x4.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F32 q_d, q_b, q_c
- a = f64x2.qfma(a, b, c) is lowered to VFMA.F64 d_a0, d_b0, d_c0 + VFMA.F64 d_a1, d_b1, d_c1
- d = f64x2.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F64 d_d0, d_b0, d_c0 + VFMA.F64 d_d1, d_b1, d_c1
- a = f64x2.qfms(a, b, c) is lowered to VFMS.F64 d_a0, d_b0, d_c0 + VFMS.F64 d_a1, d_b1, d_c1
- d = f64x2.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F64 d_d0, d_b0, d_c0 + VFMS.F64 d_d1, d_b1, d_c1
ARMv7 processors with NEON (but without FMA) instruction set
ARM Cortex-A8, ARM Cortex-A9, and Qualcomm Scorpion are the only significant cores which support SIMD (NEON), but not the FMA extension.
- a = f32x4.qfma(a, b, c) is lowered to VMLA.F32 q_a, q_b, q_c (note: multiply-add with intermediate rounding)
- d = f32x4.qfma(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VADD.F32 q_d, q_a, q_d
- a = f32x4.qfms(a, b, c) is lowered to VMLS.F32 q_a, q_b, q_c (note: multiply-subtract with intermediate rounding)
- d = f32x4.qfms(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VSUB.F32 q_d, q_a, q_d
- a = f64x2.qfma(a, b, c) is lowered to VMLA.F64 d_a0, d_b0, d_c0 + VMLA.F64 d_a1, d_b1, d_c1 (note: multiply-add with intermediate rounding)
- d = f64x2.qfma(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VADD.F64 d_d0, d_a0, d_d0 + VADD.F64 d_d1, d_a1, d_d1
- a = f64x2.qfms(a, b, c) is lowered to VMLS.F64 d_a0, d_b0, d_c0 + VMLS.F64 d_a1, d_b1, d_c1 (note: multiply-subtract with intermediate rounding)
- d = f64x2.qfms(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VSUB.F64 d_d0, d_a0, d_d0 + VSUB.F64 d_d1, d_a1, d_d1
POWER processors with VSX instruction set
IBM POWER processors starting with POWER7.
- a = f32x4.qfma(a, b, c) is lowered to XVMADDASP x_a, x_b, x_c
- b = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_b, x_c, x_a
- c = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_c, x_b, x_a
- d = f32x4.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDASP x_d, x_b, x_c
- a = f32x4.qfms(a, b, c) is lowered to XVNMSUBASP x_a, x_b, x_c
- b = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_b, x_c, x_a
- c = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_c, x_b, x_a
- d = f32x4.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBASP x_d, x_b, x_c
- a = f64x2.qfma(a, b, c) is lowered to XVMADDADP x_a, x_b, x_c
- b = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_b, x_c, x_a
- c = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_c, x_b, x_a
- d = f64x2.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDADP x_d, x_b, x_c
- a = f64x2.qfms(a, b, c) is lowered to XVNMSUBADP x_a, x_b, x_c
- b = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_b, x_c, x_a
- c = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_c, x_b, x_a
- d = f64x2.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBADP x_d, x_b, x_c
Other processors and instruction sets
- d = f32x4.qfma(a, b, c) is lowered like d = f32x4.add(a, f32x4.mul(b, c))
- d = f32x4.qfms(a, b, c) is lowered like d = f32x4.sub(a, f32x4.mul(b, c))
- d = f64x2.qfma(a, b, c) is lowered like d = f64x2.add(a, f64x2.mul(b, c))
- d = f64x2.qfms(a, b, c) is lowered like d = f64x2.sub(a, f64x2.mul(b, c))
References
[1] Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, by Agner Fog
[2] ARM Cortex-A55 Software Optimization Guide
[3] ARM Cortex-A75 Software Optimization Guide
[4] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[5] MobileNetV2: Inverted Residuals and Linear Bottlenecks
[6] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
[7] Real-Time AR Self-Expression with Machine Learning
[8] Mobile Real-time Video Segmentation
[9] BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
[10] On-Device, Real-Time Hand Tracking with MediaPipe