Vectorizing sqrt #9672

ArchRobison · 2015-01-07T23:16:45Z

This is issue is for discussing what needs to be done to vectorize sqrt effectively.

There are two separate show-stoppers:

LLVM 3.3's loop vectorizer does not handle sqrt. [Updated 2015-1-8 to say "3.3"]
Julia's lowering of sqrt iintroduces a branch to deal with a negative argument.

The first item is a general LLVM issue and we can take it up with the LLVM community. Oddly, LLVM's "BB Vectorizer" seem to have logic to vectorize intrinsics, but the "Loop Vectorizer" does not.

The second item would seem to be more specific to Julia. Some thoughts on ways to address it:

Often the argument is provably non-negative. E.g. sqrt(x^2+y^2) is a common idiom and some users would be happy to write sqrt(abs(z)) if that improved performance over sqrt(z). It would be straightforward to write a pass to do floating-point range propagation through the SSA graph. That's probably too expensive to do everywhere in a JIT environment, but inside @simd loops it seems worth the effort.
Introduce a @nothrow macro that turns that causes domain checking to be turned off, and NaNs returned instead.
@simd semantics are defined to allow out of order checking of iterations. Exploiting this would let us vectorize other functions with domain checks too. But it's such a radical upgrade of the LLVM loop vectorizer that the LLVM community would likely not accept unless C/C++ adopt similar semantics for SIMD loops. But we could implement it via a "front end vectorizer", i.e. vectorizing before lowering from Julia IR to LLVM IR, much like ISPC does.
Crazy idea: we could have a custom pass that strips the checks inside @simd loops, but leaves metadata to mark their place. After the vectorizer runs, we could reinsert the checks in vectorized form.

Anyway, that's my initial brain dump. What's other people's take on this?

The text was updated successfully, but these errors were encountered:

eschnett · 2015-01-08T00:41:00Z

I personally would use @nothrow, and this should probably be part of @fastmath as well.

Apart from this, it may be straightforward to vectorize the nan-check as well. This would not require a front-end vectorizer. Instead, @simd would introduce a new flag nanfound, initialized to false. The nan-check is then rewritten to

nanfound |= x<0.0
y = sqrt(x)

where sqrt does not perform a nan check any more. @simd should be able to vectorize such a loop. After the loop, @simd checks nanfound and reports an error if necessary.

Alternatively, the loop could also be rewritten to

y = sqrt(x)
nanfound |= isnan(y)

which may be faster since preconditions may be more tedious to check, and several isnan calls may be coalesced.

simonbyrne · 2015-01-08T14:32:18Z

It seems like this is another case where utilising the hardware floating point exception machinery, instead of explicit range checking, would be beneficial (see discussion in #5234).

ArchRobison · 2015-01-08T22:02:08Z

I discovered that LLVM 3.5 can vectorize sqrt if we remove the branch. For example, given:

function foo( y, x )
    @inbounds @simd for i=1:length(y)
        y[i] = sqrt(x[i])
    end
end

on a 4th Generation Core i7, with arrays of 100,000 elements in cache, removing the branch got me almost 4x speedup for Float32 and 2x speedup for Float64.

I've started a discussion in the LLVM developer forum on the subject of removing superfluous floating-point checks.

eschnett · 2015-01-08T22:36:29Z

Note: I plan to have @fastmath remove the check from sqrt, which should then vectorize the loop.

ArchRobison · 2015-01-08T22:51:34Z

Good to hear.

ViralBShah · 2015-01-19T14:11:21Z

Does @fastmath do this now already?

eschnett · 2015-01-19T14:50:53Z

Yes it does. It is now up to LLVM to vectorize the sqrt intrinsic that we generate. LLVM 3.3 does not, I hear that LLVM 3.5 does.

ArchRobison · 2015-01-19T15:21:04Z

Yes, LLVM 3.5.0 will vectorize it. Here is an example.

ViralBShah · 2015-01-19T15:32:41Z

Do we need to keep this issue open then?

ArchRobison · 2015-01-19T15:53:07Z

Let's close it. 😄

Also related, I submitted a patch to the LLVM community that removes the domain check in some obvious cases.

ArchRobison closed this as completed Jan 19, 2015

tkelman added the fixed in LLVM 3.5 label Jan 19, 2015

damiendr mentioned this issue Oct 20, 2015

SIMD performance regression tests #13686

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorizing sqrt #9672

Vectorizing sqrt #9672

ArchRobison commented Jan 7, 2015

eschnett commented Jan 8, 2015

simonbyrne commented Jan 8, 2015

ArchRobison commented Jan 8, 2015

eschnett commented Jan 8, 2015

ArchRobison commented Jan 8, 2015

ViralBShah commented Jan 19, 2015

eschnett commented Jan 19, 2015

ArchRobison commented Jan 19, 2015

ViralBShah commented Jan 19, 2015

ArchRobison commented Jan 19, 2015

Vectorizing sqrt #9672

Vectorizing sqrt #9672

Comments

ArchRobison commented Jan 7, 2015

eschnett commented Jan 8, 2015

simonbyrne commented Jan 8, 2015

ArchRobison commented Jan 8, 2015

eschnett commented Jan 8, 2015

ArchRobison commented Jan 8, 2015

ViralBShah commented Jan 19, 2015

eschnett commented Jan 19, 2015

ArchRobison commented Jan 19, 2015

ViralBShah commented Jan 19, 2015

ArchRobison commented Jan 19, 2015