
Proposed semantics for implicit vectorization of primitives #56481

Open
Keno opened this issue Nov 7, 2024 · 2 comments
Labels: compiler:llvm, compiler:simd, design

Comments

Keno (Member) commented Nov 7, 2024

@gbaraldi requested a writeup of the design I had for vectorized primitives.
I have no particular plans to implement this myself anytime soon, so this is
up for grabs.

Background

A longstanding issue (#21454 is the earliest I can find in Julia, but it's been discussed in various places)
is that while LLVM is reasonably good at emitting vector code for straight-line and looped code of arithmetic
primitives, we do not permit LLVM to perform any vectorization of Julia functions. This most prominently shows
up with the trig functions and exp, and in hyper-optimized code it can become the final bottleneck. It has never
been a priority, because there are feasible workarounds (replacing the Julia versions with LLVM intrinsics,
hacking LLVM itself, rewriting the code to explicitly vectorize), but they are all very annoying.

Fixing this issue really has three parts:

  1. Teaching LLVM how to auto-vectorize Julia across function boundaries
  2. Giving Julia frontend semantics that give LLVM license to perform vectorization
  3. Providing an SPMD programming model to allow easily writing vectorized primitives

For this issue, what I'm talking about is part 2 (though a little bit of part 1 is required to
make it work). Fixing part 2 would probably resolve 90% of the cases in which this issue
is the final performance bottleneck. The primary reason for this is that various groups
have already assembled libraries of hand-vectorized implementations of most common math
functions (SLEEF, SVML, etc.). LLVM does in fact have the ability to replace calls to e.g.
llvm.sin by these vectorized versions. However, Julia does not use either this capability
or the LLVM primitives (at least not by default; various people have turned it on for performance
at various points). There are several reasons for this:

  1. In the past we have had trouble controlling the scalar case for these intrinsics, which sometimes
    goes to system math libraries.

  2. LLVM will sometimes constant fold these in ways that are not legal according to our semantics.

  3. The list of vectorizable intrinsics is not extensible and requires a compiler upgrade to extend.
    This is in general counter to our philosophy of making things non-special.

  4. Unless explicitly annotated otherwise, we generally want code to be reproducible across systems.

  5. Our effects system does not understand this replacement, requiring either pessimization or unsoundness.

The basic gist is that we would like a mechanism that explicitly lets the user indicate that the
vectorization is permissible and lets the effects system properly analyze all possible cases without
pessimization.

Proposed design

The proposed design is the following (semantically):

struct OpaqueVecTuple{T, Spec #= ::Tuple{Vararg{Int}} =#} #= Special ABI, see below =#
	shuf::ShuffleOrder
	tup::Tuple{Vararg{T}}
	function OpaqueVecTuple{T, Spec}(tup::NTuple{N, T}) where {N, T, Spec}
		(shuf, tup) = semantic_shuffle(Spec, tup)
		new{T, Spec}(shuf, tup)
	end
	function OpaqueVecTuple(parent::OpaqueVecTuple{<:Any, Spec}, tup::NTuple{N, T}) where {N, T, Spec}
		new{T, Spec}(parent.shuf, tup)
	end
end
Base.getindex(vt::OpaqueVecTuple) = semantic_unshuffle(vt.shuf, vt.tup)

The key new intrinsics here are semantic_shuffle/semantic_unshuffle. These are (non-:consistent)
intrinsics that give the optimizer license to expand tup to any of the sizes in Spec.
ShuffleOrder is a zero-sized token. The interpreter semantics are for semantic_shuffle and
semantic_unshuffle to be no-ops.
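For concreteness, a minimal plain-Julia sketch of those interpreter semantics (these stand-ins
are hypothetical; in the real design they would be compiler intrinsics known to the optimizer):

struct ShuffleOrder end                               # zero-sized token
# At the interpreter level both intrinsics are no-ops; only the optimizer
# is licensed to substitute a different width or lane order.
semantic_shuffle(Spec, tup) = (ShuffleOrder(), tup)
semantic_unshuffle(shuf::ShuffleOrder, tup) = tup

Here's a usage example: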

function sin(x::OpaqueVecTuple{VecElement{Float64}, (1, 2, 4, 8)})
	tup = x.tup
	# Dispatch on the (runtime-constant) width; each branch re-wraps the
	# result with the parent's shuffle token via the second constructor.
	if length(tup) == 1
		return OpaqueVecTuple(x, (VecElement(sin_scalar(tup[1].value)),))
	elseif length(tup) == 2
		return OpaqueVecTuple(x, ccall(:sin_vector_2, NTuple{2, VecElement{Float64}}, (NTuple{2, VecElement{Float64}},), tup))
	elseif length(tup) == 4
		return OpaqueVecTuple(x, ccall(:sin_vector_4, NTuple{4, VecElement{Float64}}, (NTuple{4, VecElement{Float64}},), tup))
	else
		@assert length(tup) == 8
		return OpaqueVecTuple(x, ccall(:sin_vector_8, NTuple{8, VecElement{Float64}}, (NTuple{8, VecElement{Float64}},), tup))
	end
end

sin(x::Float64) = sin(OpaqueVecTuple{VecElement{Float64}, (1, 2, 4, 8)}((VecElement(x),)))[][1].value

This by itself gives the optimizer license to combine several sin calls into one vectorized sin (under appropriate
effect assumptions, in particular :nothrow and :effect_free, but crucially not :consistent). To further improve things,
codegen should recognize OpaqueVecTuple in the signature and generate (LLVM-level) specialized versions for the vector
lengths specified in Spec with the appropriate ABI.
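To illustrate the license this grants, consider a hypothetical caller (sin4! is invented for
illustration): the four sin calls below are independent, so under the effect assumptions above
the vectorizer may batch them into a single call to the width-4 branch of the dispatcher.

# Hypothetical caller: nothing here mentions vectors, but because sin is
# defined via OpaqueVecTuple, the optimizer may fuse these four calls into
# one call to the 4-wide specialization.
function sin4!(out::Vector{Float64}, xs::Vector{Float64})
	@inbounds for i in 1:4
		out[i] = sin(xs[i])
	end
	return out
end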

Some additional work is then required to teach LLVM to perform this operation in the vectorizer. However, crucially,
this is entirely user-defined. It does not matter whether the vector library is some external library of primitives
or whether the vector intrinsics are written in Julia. The semantics are well defined and the compiler can continue to
optimize, so long as it proves correct whatever information it wants across all branches of the dispatcher.

Keno added the compiler:llvm, compiler:simd, and design labels on Nov 7, 2024
vchuravy (Member) commented Nov 7, 2024

Some additional work is then required to teach LLVM to perform this operation in the vectorizer. However, crucially,

That part is surprisingly easy; we just need to encode the vector variants as metadata, see #56188.

OpaqueVecTuple reminds me of the C# vector type that has a runtime-constant but compilation-opaque length.

We would need to do a level of dispatch in our fptr API to then target the "right" vector length, but that seems doable.
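As a sketch of what that width selection might look like (pick_width is a hypothetical helper,
not an existing API), one option is to choose the widest variant in Spec that evenly divides
the requested width and split the call if needed:

# Hypothetical helper: pick the widest available variant that evenly
# divides the requested width. Since Spec contains 1, this always succeeds.
function pick_width(spec::Tuple{Vararg{Int}}, requested::Int)
	best = 1
	for w in spec
		if w <= requested && requested % w == 0
			best = max(best, w)
		end
	end
	return best
end

pick_width((1, 2, 4, 8), 8)  # 8: one 8-wide call
pick_width((1, 2, 4), 8)     # 4: split into two 4-wide calls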

Keno (Member, Author) commented Nov 7, 2024

I considered just making it a statically opaque-length thing, but I thought it was important for the semantics to also allow scrambling the vector lanes. Otherwise I think you need an additional assumption or encoding that requires the vector lanes to behave uniformly.
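To make that concrete, here is a hypothetical executable model (ScrambledOrder,
scrambling_shuffle, and scrambling_unshuffle are invented names) in which the lanes really are
permuted; a conforming program must produce the same result under this model as under the no-op
model above, which is precisely what frees the optimizer from any lane-uniformity assumption:

using Random: randperm

# Variant model in which semantic_shuffle really scrambles the lanes and
# semantic_unshuffle inverts the permutation recorded in the token.
struct ScrambledOrder
	perm::Vector{Int}   # the real token is zero-sized; this is only a model
end
function scrambling_shuffle(Spec, tup)
	perm = randperm(length(tup))
	(ScrambledOrder(perm), ntuple(i -> tup[perm[i]], length(tup)))
end
function scrambling_unshuffle(shuf::ScrambledOrder, tup)
	ip = invperm(shuf.perm)                  # ip[j] = position element j landed in
	ntuple(j -> tup[ip[j]], length(tup))
end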
