simde_mm256_fmadd_pd as two 128 bit FMA operations? #1196

AlexK-BD · 2024-07-16T14:29:57Z

simde_mm256_fmadd_pd is defined as follows:

simde__m256d
simde_mm256_fmadd_pd (simde__m256d a, simde__m256d b, simde__m256d c) {
  #if defined(SIMDE_X86_FMA_NATIVE)
    return _mm256_fmadd_pd(a, b, c);
  #else
    return simde_mm256_add_pd(simde_mm256_mul_pd(a, b), c);
  #endif
}

When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?

If that's possible, I would be happy to attempt a patch adding that support. I wanted to check if there's some behavioral reason that two Neon 128 bit FMA operations wouldn't be appropriate here.


mr-c · 2024-07-16T14:37:01Z

> When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?

When in doubt, check the compiler output, yeah?

And then double check the timings for the 128 bit FMA operations versus the alternatives.

Yes, an investigation into this is welcome!

AlexK-BD · 2024-07-16T16:50:08Z

I investigated further and found that fnmadd was producing some pretty bad disassembly; see PR #1197.

fmadd seems to compile reasonably as-is with -O2 on GCC 10. I haven't checked fmsub, fnmsub, or any of the single-precision variants yet.

Remnant44 · 2024-09-12T20:44:53Z

I wanted to add a comment here: I've run into some further issues with FMA handling, specifically on Windows/MSVC when compiling AVX2+ code down to SSE.

The fallbacks in fma.h vary, but most try to preserve an FMA op where possible, which makes sense when porting from AVX+ level x86 to Neon/WebAssembly/etc. On MSVC in particular, however, this leads to really bad codegen: a single simde__m256 operation gets splayed out into scalars, and each scalar is then processed individually.

When porting from AVX2+ (which in practice implies FMA on x86) to SSE (which does not), the primary fallback should crack the 256-bit FMA apart into two 128-bit FMAs, which should in turn crack apart into mul+add. I've made this fixup locally for my own purposes, and I'd like to contribute the work back if adding fallbacks like this is kosher for the project.
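
The fallback order described here can be sketched as a two-level dispatch. This is only an illustration: `hyp_m128d`, `hyp_m256d`, `hyp_fmadd_pd128`, `hyp_fmadd_pd256`, and the `HYP_FMA128_NATIVE` macro are hypothetical names and do not reflect the actual contents of fma.h. Scalar loops stand in for real vector instructions.

```c
#include <math.h>

/* Hypothetical models of 128-bit and 256-bit vectors of doubles. */
typedef struct { double v[2]; } hyp_m128d;
typedef struct { hyp_m128d lo, hi; } hyp_m256d;

/* 128-bit level: use a fused op when the (hypothetical) HYP_FMA128_NATIVE
   macro is defined; otherwise crack apart into a multiply and an add. */
static hyp_m128d hyp_fmadd_pd128(hyp_m128d a, hyp_m128d b, hyp_m128d c) {
  hyp_m128d r;
  for (int i = 0; i < 2; i++) {
#if defined(HYP_FMA128_NATIVE)
    r.v[i] = fma(a.v[i], b.v[i], c.v[i]);
#else
    r.v[i] = a.v[i] * b.v[i] + c.v[i];
#endif
  }
  return r;
}

/* 256-bit level without a native 256-bit FMA: delegate each half to the
   128-bit routine instead of processing every lane as a separate scalar. */
static hyp_m256d hyp_fmadd_pd256(hyp_m256d a, hyp_m256d b, hyp_m256d c) {
  hyp_m256d r;
  r.lo = hyp_fmadd_pd128(a.lo, b.lo, c.lo);
  r.hi = hyp_fmadd_pd128(a.hi, b.hi, c.hi);
  return r;
}
```

Defining `HYP_FMA128_NATIVE` at build time would select the fused path at the 128-bit level; leaving it undefined gives the plain mul+add that an SSE-only target needs.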

mr-c · 2024-09-13T06:56:21Z

@Remnant44, thank you for investigating. Yes, that contribution would be welcome!
