
Enable SLPVectorizer and make it work for tuples #6271

Merged
merged 1 commit into from
Dec 4, 2014

Conversation

ArchRobison
Contributor

[Updated 2014-Nov-25] This pull request enables implicit vectorization of tuple math idioms when -O is on the Julia command line.

The changes comprise:

  • Add -O command-line option (per @ViralBShah's suggestion).
  • Add LLVM's BasicAliasAnalysis and SLPVectorizer to the pass list when -O is present.
  • Add InstructionCombiningPass to clean up after SLPVectorizer. Without the cleanup, the code is horrible.
  • Add a patch that makes LLVM 3.3's SLPVectorizer work for tuply idioms of interest.
    The changes to LLVM 3.3 were backported from recent changes developed by me that are now part of LLVM 3.5. The patch includes backported LLVM unit tests.
  • Move position of LoopVectorizer in the pass list to match its positions in Clang 3.3 and 3.5.
    • For LLVM 3.3, it now comes after LoopUnrollPass instead of before.
    • For LLVM 3.5, it now comes almost last.
  • When using LLVM 3.5, add a followup "InstructionCombiningPass". Without the cleanup, the code is horrible.

For background, see Issue #5857 and in particular my comment. Here is an example of what the enhanced SLPVectorizer can do:

$ cat foo.jl
function add( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
    (a[1]+b[1],a[2]+b[2],a[3]+b[3],a[4]+b[4])
end

function mul( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
    (a[1]*b[1],a[2]*b[2],a[3]*b[3],a[4]*b[4])
end

function madd( a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32} )
    add(mul(a,b),c)
end

t = NTuple{4,Float32}

code_llvm(madd,(t,t,t))
$ julia -O foo.jl

define <4 x float> @"julia_madd;63815"(<4 x float>, <4 x float>, <4 x float>) {
top:
  %3 = fmul <4 x float> %0, %1, !dbg !45
  %4 = fadd <4 x float> %3, %2, !dbg !45
  ret <4 x float> %4, !dbg !45
}

Without the patch, the LLVM code for madd is lengthy scalar stuff.

The impact of -O on compilation time is significant with LLVM 3.3, but hardly noticeable with LLVM 3.5. I measured compilation overhead using:

cd base; time -p ../julia sysimg.jl

Here's a table of compilation times (in seconds). "Baseline" is Julia without the PR.

                    LLVM 3.3    LLVM 3.3    LLVM 3.3    LLVM 3.5    LLVM 3.5    LLVM 3.5
                    Baseline    PR          PR -O       Baseline    PR          PR -O
Run 1               46.40       46.83       67.65       53.74       51.57       53.33
Run 2               46.50       46.56       67.63       51.94       51.70       53.13
Run 3               46.40       47.06       67.13       52.32       51.80       52.80
Run 4               46.32       46.24       68.28       52.31       51.83       52.84
Run 5               47.12       46.53       67.79       52.43       51.51       53.01
MEAN                46.55       46.64       67.70       52.55       51.68       53.02
% w.r.t. baseline   0.00%       0.21%       45.43%      0.00%       -1.65%      0.90%
STDEV/MEAN          0.70%       0.67%       0.61%       1.32%       0.27%       0.41%

@StefanKarpinski
Sponsor Member

This is very exciting – it's lovely to see such nice, short code generated.

@jakebolewski
Member

Cool! How does the vectorizer work with vectorized load / stores? Could this be collapsed into a vectorized load / add / store? I assume performance will suffer somewhat due to Julia array alignment issues for larger vector widths.

function vadd_one!(arr::Array{Float64, 1})
    len = length(arr)  # assuming len is a multiple of 4
    one = (1.0, 1.0, 1.0, 1.0)
    for i = 1:4:len
        inp = (arr[i], arr[i+1], arr[i+2], arr[i+3])
        out = vadd(inp, one)   # vadd: elementwise tuple add, as in add() above but for Float64
        for j = 1:4
            arr[i + j - 1] = out[j]   # write the result back into the same 4-element window
        end
    end
end

@ArchRobison
Contributor Author

@jakebolewski : My impression is that SLPVectorizer is supposed to handle that sort of situation if bounds checking is turned off. But alas, even the SLPVectorizer in LLVM trunk failed to deal with this case cleanly: it vectorized the add partially, but scalarized the stores. So apparently there's more work to be done on SLPVectorizer.

Though for vectorizing loops, it's usually better if the programmer does not choose the chunk size, because the optimal chunk size depends on the target platform, and since Julia is a JIT, it can exploit whatever vector lengths the platform offers. Leaving chunking to the compiler simplifies the code, too. See here for an example using the proposed @simd from #5355. To my disappointment, it vectorized for Float32 but not Float64. I suspect there's an issue with the AVX cost model.
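For illustration, a minimal sketch of leaving the chunking to the compiler (assuming the @simd macro from #5355; the function name is made up):

function add_one!(arr::Vector{Float64})
    # No hand-chosen chunk of 4: the JIT picks whatever vector width the target supports.
    @simd for i = 1:length(arr)
        @inbounds arr[i] += 1.0
    end
    return arr
end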

@jakebolewski
Member

@ArchRobison I agree that if possible the chunk size should not be chosen by the programmer, although this will not be the case if Julia gets builtin SIMD types. I saw that you were using a platform that supports AVX, so I tailored my example to see if this patch helps the compiler take advantage of 256-bit SIMD instructions. I've been testing your @simd branch and, as you noticed, I've had a hard time coercing it to produce AVX instructions.

@StefanKarpinski
Sponsor Member

I agree that if possible the chunk size should not be chosen by the programmer, although this will not be the case if Julia gets builtin SIMD types.

The SIMD types proposal (#2299) could be architecture-specific just like the size of Int is.

@jakebolewski
Member

I feel you would want to support all SIMD types available for an architecture. Being able to mix and match 128-bit and 256-bit SIMD types is often useful, so choosing the SIMD size should be left to the user. 128-bit SIMD ops are also more portable between architectures (assuming Julia gets ARM support in the future).

@ArchRobison
Contributor Author

512-bit SIMD types are coming (and are in fact already here on Intel Xeon Phi). Code hardwired to 128-bit SIMD ops will be throwing away 3/4 of the ALU on those machines. The SIMD types should be parameterized by width, and a global constant could indicate the natural machine width. Though alas, "natural width" and "fastest" are not always the same, often because of memory subsystem and pipelining quirks.
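(Purely a sketch of that parameterization, with invented names and an assumed constant, not an actual API:)

immutable Vec{N,T}           # hypothetical width-parameterized SIMD type
    elts::NTuple{N,T}
end
const NATIVE_F32_LANES = 8   # assumed "natural" Float32 width, e.g. 8 on AVX, 4 on SSE, 16 on Xeon Phi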

I've had the same problem getting @simd to produce AVX instructions. On my to-do list is trying a bunch of LLVM 3.3 AVX patches that were recommended to me.

@Keno
Member

Keno commented Apr 4, 2014

LGTM. @JeffBezanson ?

@ArchRobison
Contributor Author

@loladiro Thanks for looking it over. The corresponding patch for LLVM trunk has yet to settle -- it was committed to trunk, but then reverted because it broke something. I'll wait for the dust to settle and then redo the backport.

@simonster
Member

@ArchRobison Complex arithmetic doesn't seem to vectorize for me even with this patch and the corresponding patch to LLVM 3.3. Should that be happening?

@ArchRobison
Contributor Author

I was looking at that yesterday too. The catch is that SLPVectorizer does not know how to vectorize structs, even when they are isomorphic to tuples. See my email exchange for the LLVM details.

What would be the impact if we lowered homogeneous composite types to LLVM vectors? Would that cause a calling convention mismatch with C?
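(To make the struct/tuple distinction concrete, an illustrative sketch rather than code from this PR: Complex{Float32} is a homogeneous immutable that currently lowers to an LLVM struct, whereas the isomorphic tuple form is the idiom this patch vectorizes.)

cadd_struct(a::Complex{Float32}, b::Complex{Float32}) = a + b   # lowers to an LLVM struct; SLPVectorizer skips it
cadd_tuple(a::NTuple{2,Float32}, b::NTuple{2,Float32}) = (a[1] + b[1], a[2] + b[2])   # the tuple idiom targeted here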

@ArchRobison ArchRobison changed the title WIP: Enable SLPVectorizer and make it work for tuples Enable SLPVectorizer and make it work for tuples Apr 24, 2014
@ArchRobison
Contributor Author

I removed the "WIP" from the title because I think this patch is about as ready as it's going to be for a while. I've updated the base note with new compilation slowdown measurements. Compilation slowdown seems to be about 4.5%. (Though as noted in the revised base note, someone else should check this.) I suppose this is good for codes that show improvement, and baggage for those that don't.

Can anyone suggest any realistic workloads that might benefit from this patch? It would need to be code that benefits from using SIMD to implement tuple arithmetic. As @simonster noted, it won't help vectorize complex tuples.

@StefanKarpinski
Sponsor Member

As long as it seems stable, I'm ok with merging this. 4.5% compilation time overhead is not bad at all.

@vtjnash vtjnash added this to the 0.4 milestone Apr 25, 2014
@ViralBShah
Member

+1 to merging, all else being good. I wonder if this shows any improvements in our perf benchmarks.

@JeffBezanson
Sponsor Member

Right now this code pattern is not used much so I don't expect much impact, but this will soon change dramatically. Our tuples are overall not efficient enough to be used this way---especially arrays of them. But that will change in 0.4.

The compiler is starting to get slow and will get even slower with MCJIT. I'd like to see if we can do some simple things to selectively enable these optimizations, e.g. not doing loop vectorization if there are no loops.

@ArchRobison
Contributor Author

I agree it would be worth looking at profiles and looking for fast rejection tests. For profiling, I need to build with symbols in the binaries (for Julia and LLVM) but no extra baggage. What's the easiest way to build that configuration?

The LoopVectorize pass in LLVM 3.3 would appear to already skip functions that don't have loops, since it's driven by this logic:

  virtual bool runOnLoop(Loop *L, LPPassManager &LPM) {
    // We only vectorize innermost loops.
    if (!L->empty())
      return false;

The LowerSIMDLoop pass similarly is driven by runOnLoop.

For LLVM trunk, the story is less clear. LoopVectorize appears to be driven by runOnFunction, which does:

  1. Get a bunch of analyses
  2. Build a work list of inner loops
  3. Process the work list.

I don't know whether the operations in step 1 do any significant work.

@timholy
Sponsor Member

timholy commented Sep 23, 2014

It should be much easier to turn this on now with the new Expr(:meta, ...) mechanism.
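(A rough sketch of what such an opt-in hint might look like; the :vectorize flag is invented for illustration, and #8459 defines the actual interface.)

# Hypothetical sketch: a macro that splices an Expr(:meta, ...) hint into a function
# body, which codegen could check before adding the extra vectorization passes.
macro vectorize_hint()
    Expr(:meta, :vectorize)   # returned by the macro, so it lands literally in the AST
end

function madd4(a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32})
    @vectorize_hint
    (a[1]*b[1]+c[1], a[2]*b[2]+c[2], a[3]*b[3]+c[3], a[4]*b[4]+c[4])
end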

@ArchRobison
Contributor Author

Yes, it sounds worthwhile. Where's the new mechanism documented?

@timholy
Sponsor Member

timholy commented Sep 24, 2014

It's an internal interface, so as usual it's not documented. But see #8459.

@ArchRobison
Contributor Author

Looks like there is a pass-order issue: Clang runs GVN (Global Value Numbering) before SLPVectorizer, but I put SLPVectorizer before GVN, and SLPVectorizer leans on GVN. I'll update my PR to reorder them. Though I also noticed that Clang 3.3 and 3.5.0 put the LoopVectorizer in significantly different places with respect to GVN/SLPVectorizer, so I have some more digging to do.

[Update: just moving GVN does not solve the problem.]

@toivoh
Contributor

toivoh commented Nov 13, 2014

Very nice!
I also had another test where I was trying to get a loop that would be both unrolled and vectorized, but could only get LLVM/Julia to do one of the two. I wonder if changing the order of passes would allow the unrolled loop to be vectorized as well? I guess it's more realistic to hope that the SLPVectorizer would vectorize the unrolled loop than that the loop vectorizer would vectorize it before it gets unrolled.
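(For concreteness, a made-up example of that kind of loop, not my actual test: four independent accumulators that an SLP pass could fuse into one vector accumulator after unrolling.)

function sum4(a::Vector{Float32})   # hypothetical example, not the test referenced above
    s1 = s2 = s3 = s4 = 0.0f0
    @inbounds for i = 1:4:length(a)-3
        s1 += a[i]
        s2 += a[i+1]
        s3 += a[i+2]
        s4 += a[i+3]
    end
    return s1 + s2 + s3 + s4
end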

@ArchRobison
Contributor Author

[Pardon the long answer: some details are here so I can look them up later.]

Do you have an example of the loop that you were trying to both unroll and vectorize? The vectorizer has its own partial unroller.

As far as I know, partial unrolling before vectorization will thwart the vectorizer because it destroys the regular access pattern. For the passes we've been discussing, Clang 3.5's order (per julia/deps/llvm-3.5.0/lib/Transforms/IPO/PassManagerBuilder.cpp) is:

  1. Complete unrolling - run early because it exposes scalar optimization opportunities.
  2. Global value numbering (GVN)
  3. SLPVectorizer
  4. LoopVectorizer (which also has its own unroller)
  5. Partial unrolling - run late so it doesn't thwart the vectorizer.

From what I remember, LLVM 3.3 didn't have the unrolling split, so it could cause problems. In fact, I just noticed that we're missing step (5) in the Julia pass list. Though it's not clear to me that it can gain much, given all the other optimizations that we skip.

@ArchRobison
Contributor Author

I found the root problem: GVN uses LLVM's memory dependence analysis, which in turn seems to require "Basic Alias Analysis" to handle this case. And createBasicAliasAnalysisPass is missing from the Julia pass list. With that and the previously mentioned move of GVN, I got SLPVectorizer to do the right thing for your example.

I'll measure the compile-time cost of adding the additional alias analysis. If it's unacceptably high, we can look at what it takes to fix GVN to handle this case with just the existing type-based alias analysis, which in principle should have been enough. Though hopefully the pass will pay for itself by better performance.

@ArchRobison
Contributor Author

I took @ViralBShah's suggestion and updated the PR to run BasicAliasAnalysis and SLPVectorizer only if -O is on the command line. When built against LLVM 3.5, julia -O now vectorizes rmw! and rmw2!, but not rmw3! 😦 . I need to play around with pass order a bit more.

@ArchRobison ArchRobison force-pushed the adr/slpvector branch 7 times, most recently from ac40e22 to 2cd9e37 Compare November 25, 2014 22:30
@ArchRobison
Contributor Author

I've updated the PR. See revised base note for details and compilation time overheads. When built with LLVM 3.5 and the PR, julia -O now vectorizes all of @toivoh's examples.

Alas Travis is reporting a failure for the parallel test. I need to figure out whether that's a problem that I've introduced.

@tkelman
Contributor

tkelman commented Nov 26, 2014

Alas Travis is reporting a failure for the parallel test. I need to figure out whether that's a problem that I've introduced.

No, I've been seeing that intermittently on master and other PRs as well. I very strongly suspect it's unrelated.

@toivoh
Contributor

toivoh commented Nov 26, 2014

Nice!

@ViralBShah
Member

Looking forward to this being merged. Would love to try out julia -O, even if that means I have to maintain a separate copy of Julia running on LLVM 3.5.

Change pass order to more closely match Clang's.
Add patch to LLVM 3.3 with enhancements for vectorizing tuples.
The patch contains functionality and tests backported from
SLPVectorizer changes added to LLVM 3.5.0.
@ArchRobison
Contributor Author

The AppVeyor/Travis problems went away after I rebased earlier this week. I think this PR is in good shape now.

StefanKarpinski added a commit that referenced this pull request Dec 4, 2014
Enable SLPVectorizer and make it work for tuples
@StefanKarpinski StefanKarpinski merged commit 8d41a69 into JuliaLang:master Dec 4, 2014
@toivoh
Contributor

toivoh commented Dec 4, 2014

Yay!
