Vectorization of a FOR loop using "@simd" with nested "norm" #11037

Closed
GravityAssisted opened this issue Apr 28, 2015 · 15 comments
Labels: help wanted · performance (Must go faster) · potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@GravityAssisted

Hi All,

I am a new Julia user and am trying to get a performance improvement by using the @simd macro in Julia 0.3. I have the following function with a nested norm call:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    r = Array(Float64,3)
    for i = 1:len
      r = DSMPerihelion[1:3,i]
      OutInd[i] = norm(r)*bySradii
    end
end

It runs on an input 2-D array DSMPerihelion of size (3, 20979), and I get the following performance:

@time prune_range!(DSMPerihelion,SolarRadiiCutoff,DsmR)
elapsed time: 0.00252588 seconds (2181976 bytes allocated)

Now when I try to vectorize the code as follows:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    r = Array(Float64,3)
    @simd for i = 1:len
      @inbounds r = DSMPerihelion[1:3,i]
      @inbounds OutInd[i] = norm(r)*bySradii
    end
end

I get almost the same performance:
elapsed time: 0.002716825 seconds (2181976 bytes allocated)

I looked at the code_llvm output and the code doesn't seem to get vectorized. Is there something I am missing? I followed the Intel blog (https://software.intel.com/en-us/articles/vectorization-in-julia) by @ArchRobison to understand vectorization in Julia.
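For reference, this is roughly how I checked (a minimal sketch; the function form of code_llvm works on 0.3, and a vectorized loop shows up as <4 x double> operations in the IR):

    # Print the LLVM IR generated for these argument types
    code_llvm(prune_range!, (Matrix{Float64}, Float64, Vector{Float64}))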

This function will be called millions of times, so it's important that it performs well; hence my effort to vectorize it.
Thanks for the help.

@ArchRobison
Contributor

There are a couple of issues here that thwart the vectorizer. Unfortunately I was not able to get past all of them, but here is as far as I got.

  1. The type inference system cannot predict the type of SolarRadius unless it is const. I changed it to be a parameter.
  2. The assignment to r copies a subarray into a new object, rather than copying into the existing array r. The vectorizer can't deal with allocating new objects in a loop. Replace r with r[1:3] on the left side of the assignment (a sketch of the in-place copy follows the code below).
  3. Alas, the vectorizer works only if the leftmost subscript is the loop index, so the [1:3,i] subscript pattern is going to confuse it. I worked around this by expanding the norm calculation into scalar arithmetic. (I know, yuck!)
  4. Likewise, DSMPerihelion needs to be stored transposed so that the loop index is the leftmost subscript. (This is probably another yuck for the caller.)
  5. The sqrt operation (from the expanded norm) introduces an implicit check and branch, which confuses the vectorizer. This is fixed in a recent version of LLVM, but not the one Julia uses, so use @fastmath to turn off the check.
function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadiiCutoff::Float64,OutInd::Array{Float64},SolarRadius)
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion,1)
    @inbounds @simd for i = 1:len
      r1 = DSMPerihelion[i,1]
      r2 = DSMPerihelion[i,2]
      r3 = DSMPerihelion[i,3]
      @fastmath OutInd[i] = sqrt(r1^2+r2^2+r3^2)*bySradii
    end
end

After these changes the loop generates better code, but to my chagrin, the vectorizer still refuses to vectorize it. It might be a cost-model issue. I'll need to take a closer look when I have time later this week. Though I'm guessing the code above, even unvectorized, is likely faster than the original.
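Incidentally, here is the sketch for item 2: the per-iteration allocation can be avoided without expanding the norm by copying into the preallocated buffer element-wise (the function name is mine; this still won't vectorize because of the norm call, but it kills the garbage):

    function prune_range_copy!(DSMPerihelion::Array{Float64,2},OutInd::Array{Float64},SolarRadius)
        bySradii = 1.0/SolarRadius
        len = size(DSMPerihelion,2)
        r = Array(Float64,3)             # one buffer, reused on every iteration
        for i = 1:len
            for j = 1:3
                @inbounds r[j] = DSMPerihelion[j,i]   # copy into r instead of rebinding it
            end
            OutInd[i] = norm(r)*bySradii
        end
    end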

@pao
Member

pao commented Apr 28, 2015

This question is better suited for julia-users, our mailing list: https://groups.google.com/forum/#!forum/julia-users

I also took a crack at it, though since Arch has already chimed in, I think he's covered everything that I was going to say, plus relevant autovectorizer knowledge.

@pao
Member

pao commented Apr 28, 2015

Follow-up for @ArchRobison, if you don't mind: I tested your code without the transposed DSMPerihelion (since it isn't vectorizing anyway) and it runs at about the same speed (on a Sandy Bridge i7). I was expecting some benefit from not transposing, since then we'd be making contiguous reads rather than strided reads due to the column-major storage order. It doesn't seem to have any clear effect, though. Is this just a consequence of the reads being so predictable?

@GravityAssisted
Author

Thanks for the follow-up, @ArchRobison and @pao.

I also did the following tests. I modified my prune function as follows:

function prune_range!(DSMPerihelion::Array{Float64,2},SolarRadius::Float64,DSMr::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    @inbounds @simd for i = 1:len
      DSMr[i] = sqrt(DSMPerihelion[1,i]^2+DSMPerihelion[2,i]^2+DSMPerihelion[3,i]^2)*bySradii
    end
end

and got the performance results:
elapsed time: 0.000124806 seconds (80 bytes allocated)
This is a >10x improvement!
Removing the r array and using sqrt instead of the norm function reduced the allocations by a lot. Is this normal, or is this something the compiler should be able to do by itself?
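As a sanity check (a sketch; I believe @allocated is available on 0.3), the slice alone accounts for almost all of it:

    A = rand(3, 20979)
    f(A) = A[1:3,1]
    f(A)               # call once so compilation isn't counted
    @allocated f(A)    # ~100 bytes for a fresh 3-element array; times 20979
                       # iterations, that is roughly the ~2 MB per call above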

Next I also tested @ArchRobison's version:

function prune_range2!(DSMPerihelion_Transpose::Array{Float64,2},SolarRadius::Float64,DSMr::Array{Float64})
    bySradii=1.0/SolarRadius
    len = size(DSMPerihelion_Transpose)[1]
    @inbounds @simd for i = 1:len
      r1 = DSMPerihelion_Transpose[i,1]
      r2 = DSMPerihelion_Transpose[i,2]
      r3 = DSMPerihelion_Transpose[i,3]
      DSMr[i] = sqrt(r1^2+r2^2+r3^2)*bySradii
    end
end

I couldn't use @fastmath since I am on Julia 0.3.
Here I am using a transposed version of the input array (resulting in strided access inside the for loop). I get very similar performance from this code:
elapsed time: 0.000116565 seconds (80 bytes allocated)

So now I am a little confused:

  1. Why is Julia still refusing to vectorize the loop?
  2. Why do we see similar performance from strided array access compared to contiguous (column-major) array access inside the for loop? According to the documentation we should see a performance degradation in `prune_range2!` (@pao brought this up as well).

thanks

@pao
Member

pao commented Apr 28, 2015

Removing the r array and using sqrt instead of the norm function reduced the allocations by a lot. Is this normal, or is this something the compiler should be able to do by itself?

I'm hoping we get to the point where we can do fast, small-vector norms without having to spell them out longhand. If we can reinterpret a slice as a fixed-size array, like those proposed in #7568, then we should be able to dispatch to a specialized implementation.
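Roughly the idea, as a hand-rolled sketch (made-up names, not the proposed API):

    immutable Vec3
        x::Float64
        y::Float64
        z::Float64
    end

    # A norm specialized on a fixed-size immutable: no allocation, no loop,
    # just the scalar arithmetic the vectorizer can cope with.
    norm3(v::Vec3) = sqrt(v.x^2 + v.y^2 + v.z^2)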

As to your question (1), this gets back to what Arch mentions in his reply about the cost model. If you go back to the original article, one of the points he makes is that the transformation must be "profitable", which is to say, there are heuristics involved in deciding whether to vectorize and they may need to be tuned a bit.

@GravityAssisted
Author

It would be great if we could tell the vectorizer to bypass its internal heuristics and always vectorize a loop, maybe something like @forceVectorize @simd .... That way it doesn't have to guess whether the loop vectorization will be profitable. Maybe this should be a separate issue?

I have not played with the ImmutableArrays package, but I hope it will take care of writing longhand, spelled-out code until the compiler becomes smarter.

I plan to do some tests later today and see if ImmutableArrays can be used with the norm function.

@pao
Member

pao commented Apr 28, 2015

ImmutableArrays.jl does provide a norm for its array types, but there's a lot of compiler trickery needed to elide the construction of the 3-vector. It's immutable so it's possible. Would be great if it works out.
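Usage would look something like this (a sketch; I expect the package exports Vector3 and a norm method, but I haven't benchmarked it):

    using ImmutableArrays

    v = Vector3(1.0, 2.0, 3.0)   # fixed-size immutable 3-vector
    norm(v)                      # the package's specialized norm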

@ihnorton ihnorton added the performance label Apr 28, 2015
@simonster
Member

@ArchRobison's code is vectorized for me with LLVM 3.6 but not LLVM 3.3. The untransposed version appears to vectorize as well (with @fastmath).
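(For anyone checking their own build, versioninfo() reports the LLVM version:)

    versioninfo()   # output includes the LLVM library version Julia was built against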

@pao
Member

pao commented Apr 28, 2015

That is both my favorite label and least favorite label! Thanks for checking with a new LLVM.

@GravityAssisted
Author

Awesome... thanks guys. I will try with LLVM 3.6 and get back later today with some benchmark numbers.

@ArchRobison
Contributor

With respect to @GravityAssisted's question about strided vs. unstrided: If the loop is not vectorized, the transposition of DSMPerihelion is not going to make much difference because the difference is between having one incoming stream and three incoming streams of values. The number of cache misses will be identical. This is not to say that the number of streams is always irrelevant. With some stencil computations, you can get spectacular performance degradation when the number of streams exceeds the hardware capacity. Typically this happens when there are more streams than the TLB can handle. There can also be degradation if the number of streams exceeds the number of hardware prefetch units, though in my experience it's been TLB exhaustion that causes the big problems.
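To make the comparison concrete, a rough harness along these lines should do (the SolarRadius value is made up; prune_range! and prune_range2! are the two versions defined earlier in the thread):

    P   = rand(3, 20979)
    Pt  = P'                          # transposed copy, size (20979, 3)
    out = zeros(20979)
    prune_range!(P, 6.96e5, out)      # warm up / compile both versions
    prune_range2!(Pt, 6.96e5, out)
    @time for k = 1:1000; prune_range!(P, 6.96e5, out); end   # one contiguous stream
    @time for k = 1:1000; prune_range2!(Pt, 6.96e5, out); end # three parallel streams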

I concur with @pao's comment that it would be nice to be able to do small-vector norms. I thought it was just a matter of more clever inlining (particularly recognizing short fixed-count loops), but now I see that norm is a fairly tricky beast, since it tries to avoid overflow issues. We might want @fastmath to cause norm to be less tricky.
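To illustrate what I mean by tricky, here is a sketch of the difference (not Base's actual implementation):

    # The naive 2-norm is fast and vectorizer-friendly...
    function naive_norm(x)
        s = 0.0
        for v in x
            s += v*v        # ...but v*v overflows to Inf once |v| exceeds ~1e154
        end
        sqrt(s)
    end

    # ...so the careful version rescales by the largest magnitude first,
    # at the cost of a second pass and a data-dependent branch.
    function scaled_norm(x)
        m = 0.0
        for v in x
            m = max(m, abs(v))
        end
        m == 0.0 && return 0.0
        s = 0.0
        for v in x
            t = v/m
            s += t*t        # rescaled terms stay representable
        end
        m*sqrt(s)
    end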

@GravityAssisted
Author

Hi all,

Sorry, I was on vacation for a week, hence the late reply. Following @ArchRobison's advice I made the switch to LLVM 3.6 and was able to get it vectorized 👍. Thanks for the explanation on cache misses.
I also tried using the ImmutableArrays package; unfortunately it did not help with the norm, though I did notice a small but noticeable improvement in runtime performance.

@tkelman tkelman added the potential benchmark label Jan 11, 2016
@StefanKarpinski StefanKarpinski added this to the 0.5.x milestone Sep 13, 2016
@StefanKarpinski
Sponsor Member

Is this addressed now?

@StefanKarpinski StefanKarpinski added and removed the help wanted label Oct 27, 2016
@KristofferC
Sponsor Member

SIMD tests have been added

@tkelman
Contributor

tkelman commented May 26, 2017

but not this test case
