Support for Fused Multiply-Add (FMA) #102

Closed · dylanede opened this issue Apr 21, 2021 · 16 comments · Fixed by #138

Labels: C-feature-request (Category: a feature request, i.e. not implemented / a PR)
@dylanede commented Apr 21, 2021

From my limited testing, multiplying and then adding SimdF32/SimdF64 does not result in FMA instructions (and rightly so). However, it would be useful to have a method on these types, and a corresponding platform intrinsic, to generate LLVM calls to the llvm.fma.* intrinsics, which support vector types.

Edit: In fact it looks like the simd_fma platform intrinsic is already available, just not used by the crate?

@dylanede added the C-feature-request label on Apr 21, 2021
@calebzulawski (Member)
Yes, it's simply not implemented yet, but it is planned. It's easy enough to add, so I'll do it when I get a chance.

@dylanede (Author)
Well, I'm looking at making a PR for it right now.

Currently it looks like this:

```rust
/// Performs `self * b + c` as a single operation.
#[inline]
pub fn fma(self, b: Self, c: Self) -> Self {
    unsafe { crate::intrinsics::simd_fma(self, b, c) }
}
```

But maybe it should be something like `c.add_mul(a, b)`, which is `c + a*b`? With the code above it would be `a.fma(b, c)`.
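
For comparison, a hedged sketch of what the `add_mul` spelling described above might look like (`add_mul` is a hypothetical name; only the argument roles change, and it still lowers to the same `simd_fma` intrinsic):

```rust
/// Performs `self + a * b` as a single fused operation.
/// (`add_mul` is a hypothetical spelling; the receiver is the addend.)
#[inline]
pub fn add_mul(self, a: Self, b: Self) -> Self {
    // simd_fma(a, b, c) computes `a * b + c`, so pass the receiver last.
    unsafe { crate::intrinsics::simd_fma(a, b, self) }
}
```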

@programmerjake (Member)
We will also want something like llvm.fmuladd.*, where llvm picks whichever is more efficient: separate mul/add or fma.

@dylanede (Author) commented Apr 21, 2021

Yeah, I've just tested the llvm.fma.* version out. Although it somewhat improves performance in some cases when the fma target feature is available on x86_64, without that feature it falls back to a procedure call to fma, which completely tanks performance.

Presumably both options should be made available: as I understand it, fused multiply-add normally comes with the guarantee of machine precision (a single rounding) for the whole operation, whereas llvm.fmuladd.* does not provide that guarantee.

Doesn't look like a platform intrinsic for fmuladd is exposed yet.
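
As an aside, the single-rounding guarantee is easy to see even with the scalar std functions. A minimal, stable-Rust illustration (values chosen so the difference is visible; not related to this crate's API):

```rust
fn main() {
    let a = 1.0f32 + f32::EPSILON; // 1 + 2^-23, exactly representable
    let b = 1.0f32 - f32::EPSILON; // 1 - 2^-23, exactly representable
    let c = -1.0f32;

    // Exact value of a * b + c is -2^-46, which f32 can represent.
    let separate = a * b + c;    // a * b rounds to 1.0 first, so this is 0.0
    let fused = a.mul_add(b, c); // one rounding: result is -2^-46

    println!("separate = {:e}", separate); // 0e0
    println!("fused    = {:e}", fused);    // ~ -1.4210855e-14
}
```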

@calebzulawski (Member)
I believe the function should be called mul_add to match the equivalent scalar functions in std.

Exposing fast versions of intrinsics is probably a separate issue, and it definitely affects other functions too.

@calebzulawski (Member)
@dylanede Regarding the procedure call: this is related to #76. We have a path forward to fixing that, though it's non-trivial.

@workingjubilee (Member)
> ```rust
> /// Performs `self * b + c` as a single operation.
> #[inline]
> pub fn fma(self, b: Self, c: Self) -> Self {
>     unsafe { crate::intrinsics::simd_fma(self, b, c) }
> }
> ```
>
> But maybe it should be something like `c.add_mul(a, b)`, which is `c + a*b`? With the code above it would be `a.fma(b, c)`.

Sounds good, with appropriate tests!
I agree with Caleb that it should be mul_add. Unfortunately it needs to be gated under the std feature due to the current fallback to a procedure call.

I would rather not have two functions that do the same thing, nor another f32 type (SimdFastF32? what?). Resolving this is beyond the scope of this issue, however.

@Lokathor (Contributor)
Yeah, people know the term "multiply add", and f32 already has a mul_add method, so the SIMD types should also stick to mul_add as the method name.

@programmerjake (Member)
maybe:

```rust
impl<const N: usize> f32x<N> {
    pub fn mul_add<const STRICT: bool = true>(self, multiplier: Self, addend: Self) -> Self {
        if STRICT {
            // llvm.fma.*
        } else {
            // llvm.fmuladd.*
        }
    }
}
```

@Lokathor (Contributor)
Since we're not tied to being an exact hardware impl, I think we should just define it from the start to always let llvm pick what's best. If people absolutely want strict ops they can transmute the value to a platform version and use stdarch.

Particularly, I think that const generics in methods and functions, when they aren't simply passed in from the type, are ergonomically very bad at this time and should be avoided if possible.
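
For reference, a hedged sketch of the stdarch escape hatch mentioned above: detect the fma feature at runtime and call the guaranteed-fused x86_64 intrinsic, falling back to scalar mul_add otherwise. The function names are hypothetical, and this is illustrative rather than how the crate would do it:

```rust
// Hypothetical helper: guaranteed-fused f32x4 FMA via stdarch on x86_64.
#[cfg(target_arch = "x86_64")]
fn fma_f32x4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    if is_x86_feature_detected!("fma") {
        // SAFETY: the `fma` target feature was just detected at runtime.
        unsafe { fma_f32x4_hw(a, b, c) }
    } else {
        // Scalar fallback: `f32::mul_add` is also a fused (single-rounding) op.
        [
            a[0].mul_add(b[0], c[0]),
            a[1].mul_add(b[1], c[1]),
            a[2].mul_add(b[2], c[2]),
            a[3].mul_add(b[3], c[3]),
        ]
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma_f32x4_hw(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::{_mm_fmadd_ps, _mm_loadu_ps, _mm_storeu_ps};
    // Per-lane fused multiply-add: (a * b) + c with one rounding per lane.
    let v = _mm_fmadd_ps(
        _mm_loadu_ps(a.as_ptr()),
        _mm_loadu_ps(b.as_ptr()),
        _mm_loadu_ps(c.as_ptr()),
    );
    let mut out = [0.0f32; 4];
    _mm_storeu_ps(out.as_mut_ptr(), v);
    out
}
```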

@calebzulawski (Member)
IEEE 754-2008 specifies FMA to have a single rounding step.

@Lokathor (Contributor)
I can't tell if that means you're for or against having strict fma.

@programmerjake (Member)
I'm for having strict fma by default, alongside another function with the semantics: use separate mul & add (correctly rounded) or use fma (correctly rounded), at the compiler's choice -- no other options (in particular, the afn flag is not allowed). The non-strict function matches the semantics of llvm.fmuladd.*.
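
To make the two proposed semantics concrete, here is a hedged, stable-Rust sketch over plain arrays (names and signatures are illustrative, not the crate's API): a strict variant with one rounding per lane, and a relaxed variant whose intent is that the backend may choose fused or unfused. Note that plain `a * b + c` in Rust is never fused on its own; exposing `llvm.fmuladd.*` is what would make the fusing choice explicit.

```rust
/// Strict variant: one rounding per lane, matching what `simd_fma` /
/// `llvm.fma.*` guarantee. Scalar `mul_add` is also a fused operation.
fn mul_add_strict<const N: usize>(a: [f32; N], b: [f32; N], c: [f32; N]) -> [f32; N] {
    let mut out = [0.0f32; N];
    for i in 0..N {
        out[i] = a[i].mul_add(b[i], c[i]);
    }
    out
}

/// Relaxed variant: written as a separate multiply and add. As plain Rust
/// this is never fused; the real implementation would call `llvm.fmuladd.*`
/// so the backend can pick fused or unfused per target.
fn mul_add_relaxed<const N: usize>(a: [f32; N], b: [f32; N], c: [f32; N]) -> [f32; N] {
    let mut out = [0.0f32; N];
    for i in 0..N {
        out[i] = a[i] * b[i] + c[i];
    }
    out
}
```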

@gilescope
@dylanede Happy to test your PR out if you've got something that hangs together? I am using _mm_maddubs_epi16 currently - am assuming that mul_add will be able to replace this.

@workingjubilee (Member)
ping @dylanede to check if you're still interested in followup or have any questions ❤️

@gilescope
@dylanede If you have something halfway done but don't have time to finish it, feel free to pop the branch on your fork and I can have a crack at finishing it.

workingjubilee added a commit that referenced this issue Jun 23, 2021
Add various fns
- Sum/Product traits
- recip/to_degrees/to_radians/min/max/clamp/signum/copysign; #14
- mul_add: #14, fixes #102