Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorizing sqrt #9672

Closed
ArchRobison opened this issue Jan 7, 2015 · 10 comments
Closed

Vectorizing sqrt #9672

ArchRobison opened this issue Jan 7, 2015 · 10 comments

Comments

@ArchRobison
Copy link
Contributor

This is issue is for discussing what needs to be done to vectorize sqrt effectively.

There are two separate show-stoppers:

  • LLVM 3.3's loop vectorizer does not handle sqrt. [Updated 2015-1-8 to say "3.3"]
  • Julia's lowering of sqrt iintroduces a branch to deal with a negative argument.

The first item is a general LLVM issue and we can take it up with the LLVM community. Oddly, LLVM's "BB Vectorizer" seem to have logic to vectorize intrinsics, but the "Loop Vectorizer" does not.

The second item would seem to be more specific to Julia. Some thoughts on ways to address it:

  • Often the argument is provably non-negative. E.g. sqrt(x^2+y^2) is a common idiom and some users would be happy to write sqrt(abs(z)) if that improved performance over sqrt(z). It would be straightforward to write a pass to do floating-point range propagation through the SSA graph. That's probably too expensive to do everywhere in a JIT environment, but inside @simd loops it seems worth the effort.
  • Introduce a @nothrow macro that turns that causes domain checking to be turned off, and NaNs returned instead.
  • @simd semantics are defined to allow out of order checking of iterations. Exploiting this would let us vectorize other functions with domain checks too. But it's such a radical upgrade of the LLVM loop vectorizer that the LLVM community would likely not accept unless C/C++ adopt similar semantics for SIMD loops. But we could implement it via a "front end vectorizer", i.e. vectorizing before lowering from Julia IR to LLVM IR, much like ISPC does.
  • Crazy idea: we could have a custom pass that strips the checks inside @simd loops, but leaves metadata to mark their place. After the vectorizer runs, we could reinsert the checks in vectorized form.

Anyway, that's my initial brain dump. What's other people's take on this?

@eschnett
Copy link
Contributor

eschnett commented Jan 8, 2015

I personally would use @nothrow, and this should probably be part of @fastmath as well.

Apart from this, it may be straightforward to vectorize the nan-check as well. This would not require a front-end vectorizer. Instead, @simd would introduce a new flag nanfound, initialized to false. The nan-check is then rewritten to

nanfound |= x<0.0
y = sqrt(x)

where sqrt does not perform a nan check any more. @simd should be able to vectorize such a loop. After the loop, @simd checks nanfound and reports an error if necessary.

Alternatively, the loop could also be rewritten to

y = sqrt(x)
nanfound |= isnan(y)

which may be faster since preconditions may be more tedious to check, and several isnan calls may be coalesced.

@simonbyrne
Copy link
Contributor

It seems like this is another case where utilising the hardware floating point exception machinery, instead of explicit range checking, would be beneficial (see discussion in #5234).

@ArchRobison
Copy link
Contributor Author

I discovered that LLVM 3.5 can vectorize sqrt if we remove the branch. For example, given:

function foo( y, x )
    @inbounds @simd for i=1:length(y)
        y[i] = sqrt(x[i])
    end
end

on a 4th Generation Core i7, with arrays of 100,000 elements in cache, removing the branch got me almost 4x speedup for Float32 and 2x speedup for Float64.

I've started a discussion in the LLVM developer forum on the subject of removing superfluous floating-point checks.

@eschnett
Copy link
Contributor

eschnett commented Jan 8, 2015

Note: I plan to have @fastmath remove the check from sqrt, which should then vectorize the loop.

@ArchRobison
Copy link
Contributor Author

Good to hear.

@ViralBShah
Copy link
Member

Does @fastmath do this now already?

@eschnett
Copy link
Contributor

Yes it does. It is now up to LLVM to vectorize the sqrt intrinsic that we generate. LLVM 3.3 does not, I hear that LLVM 3.5 does.

@ArchRobison
Copy link
Contributor Author

Yes, LLVM 3.5.0 will vectorize it. Here is an example.

@ViralBShah
Copy link
Member

Do we need to keep this issue open then?

@ArchRobison
Copy link
Contributor Author

Let's close it. 😄

Also related, I submitted a patch to the LLVM community that removes the domain check in some obvious cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants