Vectorization of a FOR loop using "@simd" with nested "norm" #11037

Closed
GravityAssisted opened this issue Apr 28, 2015 · 15 comments
Labels: help wanted · performance (Must go faster) · potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@GravityAssisted

Hi All,

I am a new Julia user and am trying to get a performance improvement by using the @simd macro in Julia 0.3. I have the following function with a nested norm call:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    r = Array(Float64,3)
    for i = 1:len
      r = DSMPerihelion[1:3,i]
      OutInd[i] = norm(r)*bySradii
    end
end

It runs on an input 2-D array DSMPerihelion of size (3, 20979), and I get the following performance:

@time prune_range!(DSMPerihelion,SolarRadiiCutoff,DsmR)
elapsed time: 0.00252588 seconds (2181976 bytes allocated)

Now when I try to vectorize the code as follows:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    r = Array(Float64,3)
    @simd for i = 1:len
      @inbounds r = DSMPerihelion[1:3,i]
      @inbounds OutInd[i] = norm(r)*bySradii
    end
end

I get almost the same performance:
elapsed time: 0.002716825 seconds (2181976 bytes allocated)

I looked at the code_llvm output and the code doesn't seem to get vectorized. Is there something I am missing? I followed the Intel blog (https://software.intel.com/en-us/articles/vectorization-in-julia) by @ArchRobison to understand vectorization in Julia.
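For reference, this is roughly how I checked (a minimal sketch; the function form of code_llvm works on 0.3, and a vectorized loop shows up as <4 x double> operations in the IR):

    # Print the LLVM IR generated for these argument types
    code_llvm(prune_range!, (Matrix{Float64}, Float64, Vector{Float64}))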

This function will be called millions of times, so it's important that it performs well; hence my effort to vectorize it.
Thanks for the help.

@ArchRobison
Contributor

There are a couple of issues here that thwart the vectorizer. Unfortunately I was not able to get past all of them, but here is as far as I got.

  1. The type inference system cannot predict the type of SolarRadius unless it is const. I changed it to be a parameter.
  2. The assignment to r copies a subarray into a new object, rather than copying into the existing array r. The vectorizer can't deal with allocating new objects in a loop. Replace r with r[1:3] on the left side of the assignment (a sketch of the in-place copy follows the code below).
  3. Alas, the vectorizer works only if the leftmost subscript is the loop index, so the [1:3,i] subscript pattern is going to confuse it. I worked around this by expanding the norm calculation into scalar arithmetic. (I know, yuck!)
  4. Likewise, DSMPerihelion needs to be stored transposed so that the loop index is the leftmost subscript. (This is probably another yuck for the caller.)
  5. The sqrt operation (from the expanded norm) introduces an implicit check and branch, which confuses the vectorizer. This is fixed in a recent version of LLVM, but not the one Julia uses, so use @fastmath to turn off the check.
function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64},SolarRadius)
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion,1)
    @inbounds @simd for i = 1:len
      r1 = DSMPerihelion[i,1]
      r2 = DSMPerihelion[i,2]
      r3 = DSMPerihelion[i,3]
      @fastmath OutInd[i] = sqrt(r1^2+r2^2+r3^2)*bySradii
    end
end

After these changes the loop generates better code, but to my chagrin, the vectorizer still refuses to vectorize it. It might be a cost-model issue. I'll need to take a closer look when I have time later this week. Though I'm guessing the code above, even unvectorized, is likely faster than the original.
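Incidentally, here is the sketch for item 2: the per-iteration allocation can be avoided without expanding the norm by copying into the preallocated buffer element-wise (the function name is mine; this still won't vectorize because of the norm call, but it kills the garbage):

    function prune_range_copy!(DSMPerihelion::Array{Float64,2},OutInd::Array{Float64},SolarRadius)
        bySradii = 1.0/SolarRadius
        len = size(DSMPerihelion,2)
        r = Array(Float64,3)             # one buffer, reused on every iteration
        for i = 1:len
            for j = 1:3
                @inbounds r[j] = DSMPerihelion[j,i]   # copy into r instead of rebinding it
            end
            OutInd[i] = norm(r)*bySradii
        end
    end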

@pao
Member

pao commented Apr 28, 2015

This question is better suited for julia-users, our mailing list: https://groups.google.com/forum/#!forum/julia-users

I also took a crack at it, though since Arch has already chimed in, I think he's covered everything that I was going to say, plus relevant autovectorizer knowledge.

@pao
Member

pao commented Apr 28, 2015

Follow-up for @ArchRobison, if you don't mind: I tested your code without the transposed DSMPerihelion (since it isn't vectorizing anyway) and it runs at about the same speed (on a Sandy Bridge i7). I was expecting some benefit from not transposing, since then we'd be making contiguous reads rather than strided reads due to the column-major storage order. It doesn't seem to have any clear effect, though. Is this just a consequence of the reads being so predictable?

@GravityAssisted
Author

Thanks for the follow-up, @ArchRobison and @pao.

I also did the following tests. I modified my prune function as follows:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadius::Float64,DSMr::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    @inbounds @simd for i = 1:len
      DSMr[i] = sqrt(DSMPerihelion[1,i]^2+DSMPerihelion[2,i]^2+DSMPerihelion[3,i]^2)*bySradii
    end
end

and got the performance results:
elapsed time: 0.000124806 seconds (80 bytes allocated)
This is a >10x improvement!
Removing the r array and using sqrt instead of the norm function reduced the allocations by a lot. Is this normal, or is this something the compiler should be able to do by itself?
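As a sanity check (a sketch; I believe @allocated is available on 0.3), the slice alone accounts for almost all of it:

    A = rand(3, 20979)
    f(A) = A[1:3,1]
    f(A)               # call once so compilation isn't counted
    @allocated f(A)    # ~100 bytes for a fresh 3-element array; times 20979
                       # iterations, that is roughly the ~2 MB per call above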

Next I also tested @ArchRobison's version:

function prune_range2!(DSMPerihelion_Transpose::Array{Float64,2},SolarRadius::Float64,DSMr::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion_Transpose)[1]
    @inbounds @simd for i = 1:len
      r1 = DSMPerihelion_Transpose[i,1]
      r2 = DSMPerihelion_Transpose[i,2]
      r3 = DSMPerihelion_Transpose[i,3]
      DSMr[i] = sqrt(r1^2+r2^2+r3^2)*bySradii
    end
end

I couldn't use @fastmath since I am on Julia 0.3.
Here I am using a transposed version of the input array (resulting in strided access inside the for loop). I get very similar performance from this code:
elapsed time: 0.000116565 seconds (80 bytes allocated)

So now I am a little confused:

  1. Why is Julia still refusing to vectorize the loop?
  2. Why do we see similar performance from strided array access compared to contiguous (column-major) array access inside the for loop? According to the documentation we should see a performance degradation in `prune_range2!` (@pao brought this up as well).

thanks

@pao
Member

pao commented Apr 28, 2015

Removing the r array and using sqrt instead of the norm function reduced the allocations by a lot. Is this normal, or is this something the compiler should be able to do by itself?

I'm hoping we get to the point where we can do fast, small-vector norms without having to spell them out longhand. If we can reinterpret a slice as a fixed-size array, like those proposed in #7568, then we should be able to dispatch to a specialized implementation.
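Roughly the idea, as a hand-rolled sketch (made-up names, not the proposed API):

    immutable Vec3
        x::Float64
        y::Float64
        z::Float64
    end

    # A norm specialized on a fixed-size immutable: no allocation, no loop,
    # just the scalar arithmetic the vectorizer can cope with.
    norm3(v::Vec3) = sqrt(v.x^2 + v.y^2 + v.z^2)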

As to your question (1), this gets back to what Arch mentions in his reply about the cost model. If you go back to the original article, one of the points he makes is that the transformation must be "profitable", which is to say, there are heuristics involved in deciding whether to vectorize and they may need to be tuned a bit.

@GravityAssisted
Author

It would be great if we could tell the vectorizer to bypass its internal heuristics and always vectorize a loop, maybe something like @forceVectorize @simd .... That way it doesn't have to guess whether the loop vectorization will be profitable. Maybe this should be a separate issue?

I have not played with the ImmutableArrays package, but I hope it will take care of writing longhand, spelled-out code until the compiler becomes smarter.

I plan to do some tests later today and see if ImmutableArrays can be used with the norm function.

@pao
Member

pao commented Apr 28, 2015

ImmutableArrays.jl does provide a norm for its array types, but there's a lot of compiler trickery needed to elide the construction of the 3-vector. It's immutable so it's possible. Would be great if it works out.
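Usage would look something like this (a sketch; I expect the package exports Vector3 and a norm method, but I haven't benchmarked it):

    using ImmutableArrays

    v = Vector3(1.0, 2.0, 3.0)   # fixed-size immutable 3-vector
    norm(v)                      # the package's specialized norm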

@ihnorton ihnorton added the performance label Apr 28, 2015
@simonster
Member

@ArchRobison's code is vectorized for me with LLVM 3.6 but not LLVM 3.3. The untransposed version appears to vectorize as well (with @fastmath).
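(For anyone checking their own build, versioninfo() reports the LLVM version:)

    versioninfo()   # output includes the LLVM library version Julia was built against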

@pao
Member

pao commented Apr 28, 2015

That is both my favorite label and least favorite label! Thanks for checking with a new LLVM.

@GravityAssisted
Author

Awesome... thanks guys. I will try with LLVM 3.6 and get back later today with some benchmark numbers.

@ArchRobison
Contributor

With respect to @GravityAssisted's question about strided vs. unstrided: If the loop is not vectorized, the transposition of DSMPerihelion is not going to make much difference because the difference is between having one incoming stream and three incoming streams of values. The number of cache misses will be identical. This is not to say that the number of streams is always irrelevant. With some stencil computations, you can get spectacular performance degradation when the number of streams exceeds the hardware capacity. Typically this happens when there are more streams than the TLB can handle. There can also be degradation if the number of streams exceeds the number of hardware prefetch units, though in my experience it's been TLB exhaustion that causes the big problems.
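To make the comparison concrete, a rough harness along these lines should do (the SolarRadius value is made up; prune_range! and prune_range2! are the two versions defined earlier in the thread):

    P   = rand(3, 20979)
    Pt  = P'                          # transposed copy, size (20979, 3)
    out = zeros(20979)
    prune_range!(P, 6.96e5, out)      # warm up / compile both versions
    prune_range2!(Pt, 6.96e5, out)
    @time for k = 1:1000; prune_range!(P, 6.96e5, out); end   # one contiguous stream
    @time for k = 1:1000; prune_range2!(Pt, 6.96e5, out); end # three parallel streams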

I concur with @pao's comment that it would be nice to be able to do small-vector norms. I thought it was just a matter of more clever inlining (particularly recognizing short fixed-count loops), but now I see that norm is a fairly tricky beast, since it tries to avoid overflow issues. We might want @fastmath to cause norm to be less tricky.
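To illustrate what I mean by tricky, here is a sketch of the difference (not Base's actual implementation):

    # The naive 2-norm is fast and vectorizer-friendly...
    function naive_norm(x)
        s = 0.0
        for v in x
            s += v*v        # ...but v*v overflows to Inf once |v| exceeds ~1e154
        end
        sqrt(s)
    end

    # ...so the careful version rescales by the largest magnitude first,
    # at the cost of a second pass and a data-dependent branch.
    function scaled_norm(x)
        m = 0.0
        for v in x
            m = max(m, abs(v))
        end
        m == 0.0 && return 0.0
        s = 0.0
        for v in x
            t = v/m
            s += t*t        # rescaled terms stay representable
        end
        m*sqrt(s)
    end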

@GravityAssisted
Author

Hi all,

Sorry, I was on vacation for a week, hence the late reply. Following @ArchRobison's advice I made the switch to LLVM 3.6 and was able to get it vectorized 👍. Thanks for the explanation on cache misses.
I also tried using the ImmutableArrays package; unfortunately it did not help with the norm, though I did notice a small but noticeable improvement in runtime performance.

@tkelman tkelman added the potential benchmark label Jan 11, 2016
@StefanKarpinski StefanKarpinski added this to the 0.5.x milestone Sep 13, 2016
@StefanKarpinski
Sponsor Member

Is this addressed now?

@StefanKarpinski StefanKarpinski added and removed the help wanted label Oct 27, 2016
@KristofferC
Sponsor Member

SIMD tests have been added

@tkelman
Contributor

tkelman commented May 26, 2017

but not this test case
