
Enable SLPVectorizer and make it work for tuples #6271

Merged
merged 1 commit into from
Dec 4, 2014

Conversation

ArchRobison
Contributor

[Updated 2014-Nov-25] This pull request enables implicit vectorization of tuple math idioms when -O is on the Julia command line.

The changes comprise:

  • Add -O command-line option (per @ViralBShah's suggestion).
  • Add LLVM's BasicAliasAnalysis and SLPVectorizer to the pass list when -O is present.
  • Add InstructionCombiningPass to clean up after SLPVectorizer. Without the cleanup, the code is horrible.
  • Add a patch that makes LLVM 3.3's SLPVectorizer work for tuply idioms of interest.
    The changes to LLVM 3.3 were backported from recent changes developed by me that are now part of LLVM 3.5. The patch includes backported LLVM unit tests.
  • Move position of LoopVectorizer in the pass list to match its positions in Clang 3.3 and 3.5.
    • For LLVM 3.3, it now comes after LoopUnrollPass instead of before.
    • For LLVM 3.5, it now comes almost last.
  • When using LLVM 3.5, add a followup "InstructionCombiningPass". Without the cleanup, the code is horrible.

For background, see Issue #5857 and in particular my comment. Here is an example of what the enhanced SLPVectorizer can do:

$ cat foo.jl
function add( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
    (a[1]+b[1],a[2]+b[2],a[3]+b[3],a[4]+b[4])
end

function mul( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
    (a[1]*b[1],a[2]*b[2],a[3]*b[3],a[4]*b[4])
end

function madd( a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32} )
    add(mul(a,b),c)
end

t = NTuple{4,Float32}

code_llvm(madd,(t,t,t))
$ julia -O foo.jl

define <4 x float> @"julia_madd;63815"(<4 x float>, <4 x float>, <4 x float>) {
top:
  %3 = fmul <4 x float> %0, %1, !dbg !45
  %4 = fadd <4 x float> %3, %2, !dbg !45
  ret <4 x float> %4, !dbg !45
}

Without the patch, the LLVM code for madd is lengthy scalar stuff.

The impact of -O on compilation time is significant with LLVM 3.3, but hardly noticeable with LLVM 3.5. I measured compilation overhead using:

cd base; time -p ../julia sysimg.jl

Here's a table of compilation times (in seconds). "Baseline" is Julia without the PR.

                    LLVM 3.3    LLVM 3.3    LLVM 3.3    LLVM 3.5    LLVM 3.5    LLVM 3.5
                    Baseline    PR          PR -O       Baseline    PR          PR -O
Run 1               46.40       46.83       67.65       53.74       51.57       53.33
Run 2               46.50       46.56       67.63       51.94       51.70       53.13
Run 3               46.40       47.06       67.13       52.32       51.80       52.80
Run 4               46.32       46.24       68.28       52.31       51.83       52.84
Run 5               47.12       46.53       67.79       52.43       51.51       53.01
MEAN                46.55       46.64       67.70       52.55       51.68       53.02
% w.r.t. baseline   0.00%       0.21%       45.43%      0.00%       -1.65%      0.90%
STDEV/MEAN          0.70%       0.67%       0.61%       1.32%       0.27%       0.41%

@StefanKarpinski
Sponsor Member

This is very exciting – it's lovely to see such nice, short code generated.

@jakebolewski
Member

Cool! How does the vectorizer work with vectorized load / stores? Could this be collapsed into a vectorized load / add / store? I assume performance will suffer somewhat due to Julia array alignment issues for larger vector widths.

function vadd_one!(arr::Array{Float64, 1})
    len = length(arr)  # assuming len is a multiple of 4
    one = (1.0, 1.0, 1.0, 1.0)
    for i = 1:4:len
        inp = (arr[i], arr[i+1], arr[i+2], arr[i+3])
        out = vadd(inp, one)   # vadd: elementwise tuple add, as in add() above but for Float64
        for j = 1:4
            arr[i + j - 1] = out[j]   # write the result back into the same 4-element window
        end
    end
end

@ArchRobison
Contributor Author

@jakebolewski : My impression is that SLPVectorizer is supposed to handle that sort of situation if bounds checking is turned off. But alas, even the SLPVectorizer in LLVM trunk failed to deal with this case cleanly: it vectorized the add partially, but scalarized the stores. So apparently there's more work to be done on SLPVectorizer.

Though for vectorizing loops, it's usually better if the programmer does not choose the chunk size, because the optimal chunk size depends on the target platform, and since Julia is a JIT, it can exploit whatever vector lengths the platform offers. Leaving chunking to the compiler simplifies the code, too. See here for an example using the proposed @simd from #5355. To my disappointment, it vectorized for Float32 but not Float64. I suspect there's an issue with the AVX cost model.
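For illustration, a minimal sketch of leaving the chunking to the compiler (assuming the @simd macro from #5355; the function name is made up):

function add_one!(arr::Vector{Float64})
    # No hand-chosen chunk of 4: the JIT picks whatever vector width the target supports.
    @simd for i = 1:length(arr)
        @inbounds arr[i] += 1.0
    end
    return arr
end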

@jakebolewski
Member

@ArchRobison I agree that if possible the chunk size should not be chosen by the programmer, although this will not be the case if Julia gets builtin SIMD types. I saw that you were using a platform that supports AVX, so I tailored my example to see if this patch helps the compiler take advantage of 256-bit SIMD instructions. I've been testing your @simd branch and, as you noticed, I've had a hard time coercing it to produce AVX instructions.

@StefanKarpinski
Sponsor Member

I agree that if possible the chunk size should not be chosen by the programmer, although this will not be the case if Julia gets builtin SIMD types.

The SIMD types proposal (#2299) could be architecture-specific just like the size of Int is.

@jakebolewski
Member

I feel you would want to support all SIMD types available for an architecture. Being able to mix and match 128-bit and 256-bit SIMD types is often useful, so choosing the SIMD size should be left to the user. 128-bit SIMD ops are also more portable between architectures (assuming Julia gets ARM support in the future).

@ArchRobison
Contributor Author

512-bit SIMD types are coming (and are in fact already here on Intel Xeon Phi). Code hardwired to 128-bit SIMD ops will be throwing away 3/4 of the ALU on those machines. The SIMD types should be parameterized by width, and a global constant could indicate the natural machine width. Though alas, "natural width" and "fastest" are not always the same, often because of memory subsystem and pipelining quirks.
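(Purely a sketch of that parameterization, with invented names and an assumed constant, not an actual API:)

immutable Vec{N,T}           # hypothetical width-parameterized SIMD type
    elts::NTuple{N,T}
end
const NATIVE_F32_LANES = 8   # assumed "natural" Float32 width, e.g. 8 on AVX, 4 on SSE, 16 on Xeon Phi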

I've had the same problem getting @simd to produce AVX instructions. On my to-do list is trying a bunch of LLVM 3.3 AVX patches that were recommended to me.

@Keno
Member

Keno commented Apr 4, 2014

LGTM. @JeffBezanson ?

@ArchRobison
Contributor Author

@loladiro Thanks for looking it over. The corresponding patch for LLVM trunk has yet to settle -- it was committed to trunk, but then reverted because it broke something. I'll wait for the dust to settle and then redo the backport.

@simonster
Member

@ArchRobison Complex arithmetic doesn't seem to vectorize for me even with this patch and the corresponding patch to LLVM 3.3. Should that be happening?

@ArchRobison
Contributor Author

I was looking at that yesterday too. The catch is that SLPVectorizer does not know how to vectorize structs, even when they are isomorphic to tuples. See my email exchange for the LLVM details.

What would be the impact if we lowered homogeneous composite types to LLVM vectors? Would that cause a calling convention mismatch with C?
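(To make the struct/tuple distinction concrete, an illustrative sketch rather than code from this PR: Complex{Float32} is a homogeneous immutable that currently lowers to an LLVM struct, whereas the isomorphic tuple form is the idiom this patch vectorizes.)

cadd_struct(a::Complex{Float32}, b::Complex{Float32}) = a + b   # lowers to an LLVM struct; SLPVectorizer skips it
cadd_tuple(a::NTuple{2,Float32}, b::NTuple{2,Float32}) = (a[1] + b[1], a[2] + b[2])   # the tuple idiom targeted here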

@ArchRobison ArchRobison changed the title WIP: Enable SLPVectorizer and make it work for tuples Enable SLPVectorizer and make it work for tuples Apr 24, 2014
@ArchRobison
Contributor Author

I removed the "WIP" from the title because I think this patch is about as ready as it's going to be for a while. I've updated the base note with new compilation slowdown measurements. Compilation slowdown seems to be about 4.5%. (Though as noted in the revised base note, someone else should check this.) I suppose this is good for codes that show improvement, and baggage for those that don't.

Can anyone suggest any realistic workloads that might benefit from this patch? It would need to be code that benefits from using SIMD to implement tuple arithmetic. As @simonster noted, it won't help vectorize complex tuples.

@StefanKarpinski
Sponsor Member

As long as it seems stable, I'm ok with merging this. 4.5% compilation time overhead is not bad at all.

@vtjnash vtjnash added this to the 0.4 milestone Apr 25, 2014
@ViralBShah
Member

+1 to merging, all else being good. I wonder if this shows any improvements in our perf benchmarks.

@JeffBezanson
Sponsor Member

Right now this code pattern is not used much so I don't expect much impact, but this will soon change dramatically. Our tuples are overall not efficient enough to be used this way---especially arrays of them. But that will change in 0.4.

The compiler is starting to get slow and will get even slower with MCJIT. I'd like to see if we can do some simple things to selectively enable these optimizations, e.g. not doing loop vectorization if there are no loops.

@ArchRobison
Contributor Author

I agree it would be worth looking at profiles and looking for fast rejection tests. For profiling, I need to build with symbols in the binaries (for Julia and LLVM) but no extra baggage. What's the easiest way to build that configuration?

The LoopVectorize pass in LLVM 3.3 would appear to already skip functions that don't have loops, since it's driven by this logic:

  virtual bool runOnLoop(Loop *L, LPPassManager &LPM) {
    // We only vectorize innermost loops.
    if (!L->empty())
      return false;

The LowerSIMDLoop pass similarly is driven by runOnLoop.

For LLVM trunk, the story is less clear. LoopVectorize appears to be driven by runOnFunction, which does:

  1. Get a bunch of analyses
  2. Build a work list of inner loops
  3. Process the work list.

I don't know whether the operations in step 1 do any significant work.

@timholy
Sponsor Member

timholy commented Sep 23, 2014

It should be much easier to turn this on now with the new Expr(:meta, ...) mechanism.
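(A rough sketch of what such an opt-in hint might look like; the :vectorize flag is invented for illustration, and #8459 defines the actual interface.)

# Hypothetical sketch: a macro that splices an Expr(:meta, ...) hint into a function
# body, which codegen could check before adding the extra vectorization passes.
macro vectorize_hint()
    Expr(:meta, :vectorize)   # returned by the macro, so it lands literally in the AST
end

function madd4(a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32})
    @vectorize_hint
    (a[1]*b[1]+c[1], a[2]*b[2]+c[2], a[3]*b[3]+c[3], a[4]*b[4]+c[4])
end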

@ArchRobison
Contributor Author

Yes, it sounds worthwhile. Where's the new mechanism documented?

@timholy
Sponsor Member

timholy commented Sep 24, 2014

It's an internal interface, so as usual it's not documented. But see #8459.

@ArchRobison
Contributor Author

Looks like there is a pass-order issue: Clang runs GVN (Global Value Numbering) before SLPVectorizer, but I put SLPVectorizer before GVN, and SLPVectorizer leans on GVN. I'll update my PR to reorder them. Though I also noticed that Clang 3.3 and 3.5.0 put the LoopVectorizer in significantly different places with respect to GVN/SLPVectorizer, so I have some more digging to do.

[Update: just moving GVN does not solve the problem.]

@toivoh
Contributor

toivoh commented Nov 13, 2014

Very nice!
I also had another test where I was trying to get a loop that would be both unrolled and vectorized, but could only get LLVM/Julia to do one of the two. I wonder if changing the order of passes would allow the unrolled loop to be vectorized as well? I guess it's more realistic to hope that the SLPVectorizer would vectorize the unrolled loop than that the loop vectorizer would vectorize it before it gets unrolled.
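(For concreteness, a made-up example of that kind of loop, not my actual test: four independent accumulators that an SLP pass could fuse into one vector accumulator after unrolling.)

function sum4(a::Vector{Float32})   # hypothetical example, not the test referenced above
    s1 = s2 = s3 = s4 = 0.0f0
    @inbounds for i = 1:4:length(a)-3
        s1 += a[i]
        s2 += a[i+1]
        s3 += a[i+2]
        s4 += a[i+3]
    end
    return s1 + s2 + s3 + s4
end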

@ArchRobison
Contributor Author

[Pardon the long answer: some details are here so I can look them up later.]

Do you have an example of the loop that you were trying to both unroll and vectorize? The vectorizer has its own partial unroller.

As far as I know, partial unrolling before vectorization will thwart the vectorizer because it destroys the regular access pattern. For the passes we've been discussing, Clang 3.5's order (per julia/deps/llvm-3.5.0/lib/Transforms/IPO/PassManagerBuilder.cpp) is:

  1. Complete unrolling - run early because it exposes scalar optimization opportunities.
  2. Global value numbering (GVN)
  3. SLPVectorizer
  4. LoopVectorizer (which also has its own unroller)
  5. Partial unrolling - run late so it doesn't thwart the vectorizer.

From what I remember, LLVM 3.3 didn't have the unrolling split, so it could cause problems. In fact, I just noticed that we're missing step (5) in the Julia pass list. Though it's not clear to me that it can gain much, given all the other optimizations that we skip.

@ArchRobison
Contributor Author

I found the root problem: GVN uses LLVM's memory dependence analysis, which in turn seems to require "Basic Alias Analysis" to handle this case. And createBasicAliasAnalysisPass is missing from the Julia pass list. With that and the previously mentioned move of GVN, I got SLPVectorizer to do the right thing for your example.

I'll measure the compile-time cost of adding the additional alias analysis. If it's unacceptably high, we can look at what it takes to fix GVN to handle this case with just the existing type-based alias analysis, which in principle should have been enough. Though hopefully the pass will pay for itself by better performance.

@ArchRobison
Contributor Author

I took @ViralBShah's suggestion and updated the PR to run BasicAliasAnalysis and SLPVectorizer only if -O is on the command line. When built against LLVM 3.5, julia -O now vectorizes rmw! and rmw2!, but not rmw3! 😦 . I need to play around with pass order a bit more.

@ArchRobison ArchRobison force-pushed the adr/slpvector branch 7 times, most recently from ac40e22 to 2cd9e37 Compare November 25, 2014 22:30
@ArchRobison
Contributor Author

I've updated the PR. See revised base note for details and compilation time overheads. When built with LLVM 3.5 and the PR, julia -O now vectorizes all of @toivoh's examples.

Alas Travis is reporting a failure for the parallel test. I need to figure out whether that's a problem that I've introduced.

@tkelman
Contributor

tkelman commented Nov 26, 2014

Alas Travis is reporting a failure for the parallel test. I need to figure out whether that's a problem that I've introduced.

No, I've been seeing that intermittently on master and other PRs as well. I very strongly suspect it's unrelated.

@toivoh
Contributor

toivoh commented Nov 26, 2014

Nice!

@ViralBShah
Member

Looking forward to this being merged. Would love to try out julia -O, even if that means I have to maintain a separate copy of Julia running on LLVM 3.5.

Change pass order to more closely match Clang's.
Add patch to LLVM 3.3 with enhancements for vectorizing tuples.
The patch contains functionality and tests backported from
SLPVectorizer changes added to LLVM 3.5.0.
@ArchRobison
Contributor Author

The AppVeyor/Travis problems went away after I rebased earlier this week. I think this PR is in good shape now.

StefanKarpinski added a commit that referenced this pull request Dec 4, 2014
Enable SLPVectorizer and make it work for tuples
@StefanKarpinski StefanKarpinski merged commit 8d41a69 into JuliaLang:master Dec 4, 2014
@toivoh
Contributor

toivoh commented Dec 4, 2014

Yay!
