
[wip] Creating fmaddsub for these intrinsics #89

Draft · wants to merge 5 commits into master
Conversation

dannys4

@dannys4 dannys4 commented Jul 22, 2021

Ideally, adding support for fmaddsub intrinsics. This is still a work in progress. In connection with #88

@codecov-commenter

codecov-commenter commented Jul 22, 2021

Codecov Report

Merging #89 (5c2560b) into master (ace8eb1) will decrease coverage by 0.53%.
The diff coverage is 0.00%.

❗ Current head 5c2560b differs from pull request most recent head 1e6b689. Consider uploading reports for the commit 1e6b689 to get more accurate results

@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
- Coverage   90.23%   89.70%   -0.54%     
==========================================
  Files           3        3              
  Lines         502      505       +3     
==========================================
  Hits          453      453              
- Misses         49       52       +3     
Impacted Files Coverage Δ
src/LLVM_intrinsics.jl 96.42% <0.00%> (-1.50%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

# ("d.512", 8, Float64), ("s.512", 16, Float32) # These don't seem supported by LLVM yet
]
@eval @generated function fmaddsub(a::LVec{$N, $T}, b::LVec{$N, $T}, c::LVec{$N, $T})
ff = "llvm.x86.fma.vfmaddsub.p"*$t
eschnett (Owner):

I would make the letter p part of the suffix t.

function fmaddsub(a::LVec{N, T}, b::LVec{N, T}, c::LVec{N, T}) where {N, T<:FloatingTypes}
Base.llvmcall("llvm.x86.fma.fmaddsub_pd", LVec{N, T}, (LVec{N, T}, LVec{N, T}, LVec{N, T}), a, b, c)

for (t, N, T) in [("d" , 2, Float64), ("s" , 4, Float32),
eschnett (Owner):

These intrinsics only exist on x86 architectures. This function should be wrapped in an if statement that checks the CPU architecture, and possibly also the Julia and LLVM versions, to see whether the intrinsic is supported.

In case it is not, there should be a reasonably efficient generic fallback. The idea is that one can call fmaddsub all the time and expect a reasonable implementation.
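A minimal sketch of what such gating could look like, assuming a plain NTuple representation; the name `fmaddsub_generic` and the exact gating condition are illustrative, not part of the PR:

```julia
# Hypothetical sketch: architecture-gated intrinsic with a generic fallback.
# Odd lanes (1-based) subtract c, even lanes add c.
@inline function fmaddsub_generic(a::NTuple{N,T}, b::NTuple{N,T},
                                  c::NTuple{N,T}) where {N, T<:AbstractFloat}
    ntuple(n -> isodd(n) ? muladd(a[n], b[n], -c[n]) : muladd(a[n], b[n], c[n]), Val(N))
end

if Sys.ARCH === :x86_64 || Sys.ARCH === :i686
    # define the llvmcall-based methods here, falling back to
    # fmaddsub_generic for sizes without a matching intrinsic
else
    const fmaddsub = fmaddsub_generic
end
```

This keeps `fmaddsub` callable everywhere while letting x86 builds pick up the hardware instruction.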

dannys4 (Author):

I genuinely cannot find when these were introduced into LLVM (I've looked at the release notes for pretty much every release since 3.0). However, they start robustly integrating AVX512 into the language at 4.0, so I'm just going to peg it at that and say FMA/AVX instructions were integrated there. I don't really know where to start with Julia versions, since different installations of the same Julia version might use different LLVM versions, so I suspect that might be tricky. I'll probably steal your implementation below for the fallback; I don't really see any significant improvements.

("d.256", 4, Float64), ("s.256", 8, Float32),
# ("d.512", 8, Float64), ("s.512", 16, Float32) # These don't seem supported by LLVM yet
]
@eval @generated function fmaddsub(a::LVec{$N, $T}, b::LVec{$N, $T}, c::LVec{$N, $T})
eschnett (Owner):

fmaddsub has a simpler cousin, faddsub. Do you want to add support for it as well?

eschnett (Owner):

The function vfmaddsub is sufficiently obscure that it needs documentation. Documentation in the sense of equivalent Julia code would be good; maybe something along the lines of

fmaddsub(a, b, c) = SIMD((isodd(n) ? a[n]*b[n] - c[n] : a[n]*b[n] + c[n] for n in 1:N)...)

Feel free to improve or modify.
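As a sanity check of that alternating sign pattern, here is a plain-tuple model of the semantics (1-based lanes; purely illustrative, not the PR's code):

```julia
# Scalar reference model of fmaddsub: odd lanes subtract, even lanes add.
a = (1.0, 2.0, 3.0, 4.0)
b = (10.0, 10.0, 10.0, 10.0)
c = (1.0, 1.0, 1.0, 1.0)
r = ntuple(n -> isodd(n) ? a[n]*b[n] - c[n] : a[n]*b[n] + c[n], 4)
# r == (9.0, 21.0, 29.0, 41.0)
```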

dannys4 (Author) commented Jul 25, 2021:

re faddsub: I'd be happy to add it, though I have no need for it. That said, I don't see any documentation or commits for it anywhere, and it's definitely not in the llvm.fma namespace (at least, when I try llvm.fma.faddsub or llvm.fma.vaddsub, the call isn't recognized, which makes sense since it's not an fma call), so I'm not sure whether it's in the scope of this PR.

EDIT: I just saw your comment regarding llvm.sse3.addsub, so I see it now, but my question remains whether it's in the scope of this PR.

re documentation: sounds great.

$(Expr(:meta, :inline));
ccall($ff, llvmcall, LVec{$($N), $($T)}, (LVec{$($N), $($T)}, LVec{$($N), $($T)}, LVec{$($N), $($T)}), a, b, c)
)
end
eschnett (Owner):

In the end we will also need test cases.
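A test could compare the intrinsic against a scalar model. A sketch, assuming this PR's `fmaddsub` is callable on tuple-like inputs (the call itself is hypothetical until the API settles):

```julia
using Test

# Scalar model: odd (1-based) lanes subtract, even lanes add.
reference(a, b, c, n) = isodd(n) ? a[n]*b[n] - c[n] : a[n]*b[n] + c[n]

a = (1.0, 2.0, 3.0, 4.0)
b = (5.0, 6.0, 7.0, 8.0)
c = (0.5, 0.5, 0.5, 0.5)

@testset "fmaddsub matches scalar model" begin
    r = fmaddsub(a, b, c)  # hypothetical call into this PR's function
    for n in 1:4
        @test r[n] ≈ reference(a, b, c, n)
    end
end
```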

@KristofferC
Collaborator

KristofferC commented Jul 26, 2021

These are very different from all the other intrinsics we support (which are based on https://llvm.org/docs/LangRef.html). Those are generic in terms of the vector size, and LLVM knows how to translate them to different targets.

Regarding target-specific intrinsics, there are a huge number of them (https://software.intel.com/sites/landingpage/IntrinsicsGuide/), and they are only available on certain architectures. It isn't clear to me how exactly this should be exposed by SIMD.jl. For example, what happens if someone calls this new function on non-x86? Or when the size of the vector is not exactly such that there is an intrinsic for it?

I believe this requires quite a lot of careful planning and thought before just implementing it in exactly the same way as the previous intrinsics. For example, it should probably live in its own file and have some consistent naming based on how it would be called in e.g. C.

@dannys4
Author

dannys4 commented Jul 26, 2021

Yes, I agree. I'll repeat what I said in the comment on the issue: this seems to be stepping into territory covered by VectorizationBase.jl (as that package already detects compatibilities with your system). I'm aware that there's a huge number of other intrinsics, but it's unclear which ones are in LLVM and which ones aren't (and where they might be).

For example, what happens if someone calls this new function on non-x86? Or when the size of the vector is not exactly such that there is an intrinsic for it?

I'm not exactly sure of the answer to the first question, but I would hope that multiple dispatch would solve the second issue. I know @eschnett was saying only to define these functions for x86, and (theoretically) similar functions could be defined for NEON intrinsics. Then there could be default implementations. But this brings back the question of scope. I'm happy to work on this and build out more functionality, but (as I said previously) I'm not sure what the scope is for this. If you decide that this doesn't belong here, that's fine by me; I'd just recommend putting a disclaimer in the description saying that you're only supporting the intrinsics in the Language Reference (and not necessarily all possible intrinsics for all possible architectures).

Either way, I think it's smart for me to hold off until someone tells me what the larger plan is.
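The dispatch idea above could be sketched like this: a catch-all generic method plus narrower methods only for the (N, T) pairs that have a matching intrinsic (all names and signatures illustrative):

```julia
# Generic method: works for any lane count and float type.
fmaddsub(a::NTuple{N,T}, b::NTuple{N,T}, c::NTuple{N,T}) where {N, T<:AbstractFloat} =
    ntuple(n -> isodd(n) ? a[n]*b[n] - c[n] : a[n]*b[n] + c[n], Val(N))

# Narrower method for a size with hardware support; on x86 this would
# dispatch to the llvmcall-based implementation instead.
# fmaddsub(a::NTuple{4,Float64}, b::NTuple{4,Float64}, c::NTuple{4,Float64}) = ...
```

Julia picks the most specific applicable method, so unsupported sizes silently fall through to the generic version.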

@eschnett
Owner

It would be good to define a function mulsubadd, which is generically useful to handle complex numbers. This function can then have efficient specialized implementations for various architectures (if possible), but it would exist and would be efficient on all architectures. The case is similar to muladd, which exists everywhere, independent of whether the CPU offers a specialized implementation.
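To illustrate why an addsub-style operation helps with complex numbers: with interleaved (re, im) storage, a complex multiply needs a subtract in the real lane and an add in the imaginary lane, which is exactly the alternating pattern. A tiny sketch for one complex number:

```julia
# (3 + 4i) * (1 + 2i), stored as interleaved (re, im) pairs.
a = (3.0, 4.0)
b = (1.0, 2.0)
t1 = (a[1]*b[1], a[2]*b[1])          # (ar*br, ai*br)
t2 = (a[2]*b[2], a[1]*b[2])          # (ai*bi, ar*bi)
r  = (t1[1] - t2[1], t1[2] + t2[2])  # addsub: subtract lane 1, add lane 2
# r == (-5.0, 10.0), i.e. -5 + 10i
```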

@KristofferC KristofferC mentioned this pull request Aug 5, 2021