Add a deterministic FMA #44
Comments
If these instructions were made into a proposal of their own and not tied into relaxed simd, it might make sense for a wasm implementation to provide them only on hardware that has fma natively, and let the wasm program provide its own fallback for other hardware?
Possibly, however what would they do for a fallback? If they'd fall back to non-fused mul+add, that's really a qfma use case. If they'd fall back to a software fma, that's really an fma use case. If they'd fall back to custom logic, that's a slightly different kind of qfma use case: do a qfma with carefully selected operands such that the result tells you whether it did a true fma or not, and if not, run the custom logic.
I think a deterministic FMA is quite nice, and can stand as a proposal on its own, and as time goes by, FMA coverage should get better (not too sure about low-end phones or the embedded space). However, I've heard that a slow FMA is not useful (@Maratyszcza to correct me, probably more so in the ML world?), so programs will choose an alternative: change algorithms, or tweak them to manage with reduced precision. I think if we want to include a deterministic FMA, we also need a way for programs to check if FMA support is available in hardware.
That's why I suggested that the instruction should not be available if the hardware does not support FMA, thus allowing a feature-test of the instruction to be the test for whether the hardware supports FMA. True, some implementations could choose to provide FMA anyway, but perhaps there's a spec-level fix for that?
The qfma instruction, as currently proposed, already supports the feature-detection use case. Users can run a qfma with certain operands, and test whether the result is the fma result. This is easier than managing multiple wasm modules in order to feature-test and then pick the module to instantiate.
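To make the "certain operands" idea concrete, here is a minimal sketch in Python. It is not taken from the proposal: the probe values and the names `fused_mul_add`, `unfused_mul_add`, and `behaves_fused` are hypothetical, and the fused reference is modeled with exact rational arithmetic on finite doubles rather than with a hardware instruction. The product (1 + 2^-27)^2 has a low-order term 2^-54 that is lost if the multiply is rounded before the add, so the result reveals which behavior was used.

```python
from fractions import Fraction

def fused_mul_add(a, b, c):
    """Reference fma for finite doubles: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused_mul_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

# Hypothetical probe operands (an assumption, not from the proposal).
# a*a = 1 + 2^-26 + 2^-54 exactly; the 2^-54 term is below half an ulp of 1.
PROBE_A = 1.0 + 2.0 ** -27
PROBE_C = -(1.0 + 2.0 ** -26)

def behaves_fused(qfma):
    """Return True if the qfma-like callable kept the low product bits."""
    return qfma(PROBE_A, PROBE_A, PROBE_C) == 2.0 ** -54

print(behaves_fused(fused_mul_add))    # True
print(behaves_fused(unfused_mul_add))  # False
```

A real module would run the same probe through the qfma instruction itself, with operands chosen for the lane type in use (the values above are for 64-bit floats).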
To summarize the use cases we want to support:
qfma covers all 3 use cases, since qfma with specific operands can detect hardware support. Users in use case 1 can ship their own fma implementation if they really need it. Deterministic FMA is a convenience for use case 1: instead of shipping their own, they rely on the engine's implementation.
@sunfishcode can you add links to some of these? I am unfamiliar with this domain.
I see a couple of issues with this proposal. First, FMA is not a SIMD-specific instruction, but a fundamental floating-point operation, and all processors which provide a SIMD version of the FMA instruction provide a scalar one too. Thus, it would be more appropriate to have FMA as a standalone proposal featuring both scalar and SIMD versions of the instructions rather than as a part of the Relaxed SIMD proposal, especially given that it is not processor-specific. Secondly, I'd like to see a full version of a SIMD FMA polyfill using either WAsm SIMD128 or x86 SSE4.1 intrinsics. The polyfill suggested in WebAssembly/design#1391 omits too many details, and the complete polyfills in musl and Apple LibM linked from WebAssembly/design#1391 are scalar and clearly not SIMD-friendly. AFAIK, the main application of the real FMA is for simulating higher than double-precision computations using double-double and similar representations. However, given that WAsm SIMD is limited to 128 bits and WAsm natively supports 64-bit computations, I expect that soft-float emulation of quad-precision arithmetic would be faster on pre-FMA processors than double-double SIMD operations with emulated FMA, and thus there isn't really a convincing use-case for exposing guaranteed-fused multiply-add in WAsm.
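For readers unfamiliar with the double-double use case mentioned above, the core building block is an error-free product: a true fma recovers the rounding error of a multiplication exactly. The sketch below is illustrative only; `fma` is a reference model built on exact rationals for finite doubles, and `two_prod` is a conventional name for the technique, not something defined in this thread.

```python
from fractions import Fraction

def fma(a, b, c):
    """Reference fma for finite doubles: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def two_prod(a, b):
    """Return (p, e) with p = round(a*b) and p + e == a*b exactly.

    Relies on a true fused multiply-add; with an unfused mul+add,
    e would always come out as 0.0. Assumes no overflow or underflow.
    """
    p = a * b
    e = fma(a, b, -p)
    return p, e

p, e = two_prod(0.1, 0.3)
# The (p, e) pair represents the product with roughly twice the precision.
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

This is why a quasi-fma is not a substitute here: if the implementation silently falls back to mul+add, the error term is lost and the double-double result degrades to plain double precision.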
None of qfma, relaxed/asymmetric min/max, or reciprocal sqrt approximation, are SIMD-specific either; should those instructions be moved out to a separate proposal as well? Instead, I imagine the "simd" in "relaxed-simd" isn't something important to be strict about. Scalar versions of SIMD operations are useful for remainder loops. As such, it would make sense for relaxed-simd to define scalar versions of qfma and the others as well. It's true that fma isn't "relaxed" either, but my observation above is that there may be value in adding regular fma at the same time as qfma, and I'm interested in feedback on this. A simple polyfill for simd fma would be to extract the elements and evaluate it in scalar. That's Not Fast, but it's already understood that FMA polyfills won't be Fast, so that might be good enough as a minimum-effort implementation technique. C's
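The "extract the elements and evaluate in scalar" polyfill can be sketched as follows. This is a model in Python, not engine code: vectors are represented as 2-tuples of finite doubles standing in for f64x2, and `scalar_fma` and `f64x2_fma` are hypothetical names. The scalar fma is again modeled with exact rationals, which is correct but slow, much like the software fallbacks discussed in this thread.

```python
from fractions import Fraction

def scalar_fma(a, b, c):
    """Reference fma for finite doubles: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def f64x2_fma(a, b, c):
    """Lane-wise fma over 2-tuples of finite doubles (models f64x2).

    Each lane is extracted and evaluated with the scalar fma, then the
    results are packed back into a vector.
    """
    return tuple(scalar_fma(x, y, z) for x, y, z in zip(a, b, c))

a = (2.0, 1.0 + 2.0 ** -27)
c = (1.0, -(1.0 + 2.0 ** -26))
print(f64x2_fma(a, a, c) == (5.0, 2.0 ** -54))  # True
```

Only the f64x2 case is shown because rounding an exact result to a double and then to a 32-bit float would round twice, which a correct f32x4 fma must avoid.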
I believe another advantage of deterministic If possible, this would be a win for spec hygiene (no need for a new kind of non-determinism) and would fully address the concerns @rossberg and I raised at the phase 3 vote. Although, if there are other reasons we need the current "column" approach that I'm missing, please point these out!
I agree with @conrad-watt. Getting rid of the intrusive new form of non-determinism (a.k.a. implementation-dependent semantics) would make me much less concerned about this proposal.
Non-quasi, guaranteed-fused FMA:
This proposal is to add these in addition to qfma, not instead of it.
IEEE 754 fusedMultiplyAdd, and obvious subtract variant, with modifications to NaN and exception behavior as in other floating-point instructions in wasm.
x86-64 and ARM64. Also provide reference implementation in terms of 128-bit Wasm SIMD.
On x86-64 CPUs with FMA3 or FMA4, and ARM64, and other popular architectures, there is a single instruction that does this. On CPUs without an fma instruction, some options are discussed here.
Since wasm hides floating-point exception flags, and NaN bits are already nondeterministic, the only new differences across platforms are timings.
Some floating-point algorithms depend on a true fma, which is a different use case from qfma. And, some use cases want to be able to specify determinism in the wasm module, independently of whether the host implementation is enforcing determinism.
As discussed here, it seems to make more sense to add explicit instructions for these use cases, rather than using profiles to restrict qfma to work for these use cases.
I expect one of the big questions is whether these instructions belong in relaxed-simd or should go in a separate proposal. I'm open to suggestions here. I'm starting by proposing them here, because I expect it would be confusing to users if relaxed-simd is standardized with qfma before a true fma is standardized. If a user needs a true fma, they might be tempted to use qfma if they don't (think they) care about CPUs without fma support.