simde_mm256_fmadd_pd as two 128 bit FMA operations? #1196

Open
AlexK-BD opened this issue Jul 16, 2024 · 4 comments

@AlexK-BD
Contributor

simde_mm256_fmadd_pd is defined as follows:

simde__m256d
simde_mm256_fmadd_pd (simde__m256d a, simde__m256d b, simde__m256d c) {
  #if defined(SIMDE_X86_FMA_NATIVE)
    return _mm256_fmadd_pd(a, b, c);
  #else
    return simde_mm256_add_pd(simde_mm256_mul_pd(a, b), c);
  #endif
}

When building for a target that doesn't have native 256-bit FMA support, why not use two 128-bit FMA operations on the two halves of the input?

If that's possible, I would be happy to attempt a patch adding that support. I wanted to check whether there's some behavioral reason that two 128-bit Neon FMA operations wouldn't be appropriate here.

@mr-c
Collaborator

mr-c commented Jul 16, 2024

> When building for a target that doesn't have native 256-bit FMA support, why not use two 128-bit FMA operations on the two halves of the input?

When in doubt, check the compiler output, yeah?

And then double-check the timings for the 128-bit FMA operations versus the alternatives.

Yes, an investigation into this is welcome!

@AlexK-BD
Contributor Author

I investigated further and found that fnmadd was producing some pretty bad disassembly; see PR #1197.

fmadd seems to compile reasonably as-is with -O2 on gcc 10. I haven't checked fmsub, fnmsub, or any of the single-precision variants yet.

@Remnant44

I wanted to add that I've run into further issues with FMA handling, specifically on the Windows/MSVC platform when compiling AVX2+ code down to SSE.

The fallbacks in fma.h vary, but most try to keep using an FMA op where possible, which makes sense when porting from AVX+ level x86 to Neon, WebAssembly, and similar targets. On MSVC in particular, however, this leads to really bad codegen: a single simde__m256 operation gets splayed out into scalars, and each scalar is computed individually.

When porting from AVX2+ (which in practice comes with FMA on x86) to SSE (which does not), the primary fallback should crack the FMA apart into two 128-bit FMAs, which should then crack apart into mul+add. I've made this fix locally for my own purposes, and I'd like to contribute the work back if adding fallbacks like this is acceptable for the project.
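
A rough sketch of that two-level fallback chain, in plain C rather than SIMDe's actual code: `vec128f`, `vec256f`, `sketch_fmadd128`, `sketch_fmadd256`, and the feature macro `SKETCH_HAVE_FMA128` are all hypothetical names invented for this illustration, and the per-lane loops stand in for what would be single vector instructions. The point is only the ordering: the 256-bit operation delegates to the 128-bit one, and the choice between a fused op and mul+add is made there.

```c
#if defined(SKETCH_HAVE_FMA128)  /* hypothetical feature macro */
#include <math.h>
#endif

/* Hypothetical stand-ins for 128-bit and 256-bit float vectors. */
typedef struct { float v[4]; } vec128f;
typedef struct { vec128f half[2]; } vec256f;

/* Level 2: the 128-bit op uses a fused op when one is available,
   and otherwise cracks into a separate multiply and add. */
static vec128f sketch_fmadd128(vec128f a, vec128f b, vec128f c) {
  vec128f r;
  for (int i = 0; i < 4; i++) {
#if defined(SKETCH_HAVE_FMA128)
    r.v[i] = fmaf(a.v[i], b.v[i], c.v[i]);
#else
    r.v[i] = a.v[i] * b.v[i] + c.v[i];
#endif
  }
  return r;
}

/* Level 1: the 256-bit op is cracked into two 128-bit ops. */
static vec256f sketch_fmadd256(vec256f a, vec256f b, vec256f c) {
  vec256f r;
  for (int h = 0; h < 2; h++) {
    r.half[h] = sketch_fmadd128(a.half[h], b.half[h], c.half[h]);
  }
  return r;
}
```

With the macro undefined, as on an SSE-only target in this sketch, the whole chain reduces to two 128-bit multiplies and two 128-bit adds.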

@mr-c
Collaborator

mr-c commented Sep 13, 2024

@Remnant44 , thank you for investigating. Yes, that contribution would be welcome!
