
SLP vectorization not working for tuples #11899

Closed
jrevels opened this issue Jun 27, 2015 · 25 comments
Labels
performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@jrevels
Member

jrevels commented Jun 27, 2015

After reading #6271, I wanted to reproduce the vectorization that @ArchRobison displayed in his tuple example. Unfortunately, this doesn't seem to work any longer (just a copy-paste of the example given in the aforementioned PR):

$ julia -O
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5513 (2015-06-22 15:58 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit faae194* (5 days old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> function add( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
           (a[1]+b[1],a[2]+b[2],a[3]+b[3],a[4]+b[4])
       end
add (generic function with 1 method)

julia> function mul( a::NTuple{4,Float32}, b::NTuple{4,Float32} )
           (a[1]*b[1],a[2]*b[2],a[3]*b[3],a[4]*b[4])
       end
mul (generic function with 1 method)

julia> function madd( a::NTuple{4,Float32}, b::NTuple{4,Float32}, c::NTuple{4,Float32} )
           add(mul(a,b),c)
       end
madd (generic function with 1 method)

julia> t = NTuple{4,Float32}
Tuple{Float32,Float32,Float32,Float32}

julia> code_llvm(madd,(t,t,t))

define [4 x float] @julia_madd_21498([4 x float]*, [4 x float]*, [4 x float]*) {
top:
  %3 = getelementptr inbounds [4 x float]* %0, i64 0, i64 0
  %4 = load float* %3, align 4
  %5 = getelementptr inbounds [4 x float]* %1, i64 0, i64 0
  %6 = load float* %5, align 4
  %7 = fmul float %4, %6
  %8 = getelementptr inbounds [4 x float]* %0, i64 0, i64 1
  %9 = load float* %8, align 4
  %10 = getelementptr inbounds [4 x float]* %1, i64 0, i64 1
  %11 = load float* %10, align 4
  %12 = fmul float %9, %11
  %13 = getelementptr inbounds [4 x float]* %0, i64 0, i64 2
  %14 = load float* %13, align 4
  %15 = getelementptr inbounds [4 x float]* %1, i64 0, i64 2
  %16 = load float* %15, align 4
  %17 = fmul float %14, %16
  %18 = getelementptr inbounds [4 x float]* %0, i64 0, i64 3
  %19 = load float* %18, align 4
  %20 = getelementptr inbounds [4 x float]* %1, i64 0, i64 3
  %21 = load float* %20, align 4
  %22 = fmul float %19, %21
  %23 = getelementptr inbounds [4 x float]* %2, i64 0, i64 0
  %24 = load float* %23, align 4
  %25 = fadd float %7, %24
  %26 = insertvalue [4 x float] undef, float %25, 0
  %27 = getelementptr inbounds [4 x float]* %2, i64 0, i64 1
  %28 = load float* %27, align 4
  %29 = fadd float %12, %28
  %30 = insertvalue [4 x float] %26, float %29, 1
  %31 = getelementptr inbounds [4 x float]* %2, i64 0, i64 2
  %32 = load float* %31, align 4
  %33 = fadd float %17, %32
  %34 = insertvalue [4 x float] %30, float %33, 2
  %35 = getelementptr inbounds [4 x float]* %2, i64 0, i64 3
  %36 = load float* %35, align 4
  %37 = fadd float %22, %36
  %38 = insertvalue [4 x float] %34, float %37, 3
  ret [4 x float] %38
}

Caveat: I built v0.4 using LLVM 3.6, which wasn't tested in #6271.

@ScottPJones
Contributor

3.6.0 or 3.6.1? (not sure it would make any difference)

@jrevels
Member Author

jrevels commented Jun 27, 2015

@ScottPJones 3.6.1

I'm currently building with override LLVM_VER=svn to see what pops out (though I doubt that moving to a newer LLVM will actually fix the issue).

@ScottPJones
Contributor

OK, that's 3.7, not 3.6 (I've also been building with both 3.6.1 (release) and 3.7 (svn))

@ScottPJones
Contributor

I've been having problems with LLVM-svn today, and even with the 0.3.10 release 😞

@jrevels
Member Author

jrevels commented Jun 27, 2015

Sorry - when I said "currently", I meant "at this very moment I'm rebuilding with 3.7". The version I was running a couple of minutes ago, when I filed this issue, was 3.6.1.

@ScottPJones
Contributor

Ah, OK, my misunderstanding! Brain tired after having 4 days of drinking from the proverbial fire hose!

@ArchRobison
Contributor

I can replicate the problem. I'll look into it next week.

@ArchRobison
Contributor

I see the problem. Julia 0.3 represents homogeneous tuples as LLVM vectors; Julia 0.4 represents them as LLVM arrays. Doing so torpedoes the SLP vectorizer, since LLVM has no vector instructions for arrays.
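
For reference, a minimal way to observe the two representations (a sketch; ident is just an illustrative name):

# Inspect how a homogeneous tuple lowers. Per the above, on 0.3 the
# argument appears as an LLVM vector (<4 x float>); on 0.4 it appears
# as an LLVM array ([4 x float]), which the SLP vectorizer ignores.
ident(t::NTuple{4,Float32}) = t
code_llvm(ident, (NTuple{4,Float32},))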

What's the motivation for the change? Short homogeneous tuples seem like a nice abstraction for hardware vectors. (My apologies for missing whatever conversation led to the change.)

@StefanKarpinski
Member

cc @vtjnash, @Keno, @JeffBezanson

@ArchRobison
Contributor

Looks like #11187 was the motivation. In the commit I see:

    this additionally prepares for turning back on allocating tuples as vectors,
    since the gc now guarantees 16-byte alignment

And the code comments say:

+                    bool validVectorSize = (ntypes == 2 || ntypes == 4 || ntypes == 6);
+                    if (0 && lasttype->isSingleValueType() && !lasttype->isVectorTy() && validVectorSize) // currently disabled due to load/store alignment issues
+                        jst->struct_decl = VectorType::get(lasttype, ntypes);

Is there any way I can help with the load/store alignment issues? We should also consider refining validVectorSize so that we get the LLVM vector representation for 16-wide float vectors if the target machine has AVX-512, and likewise for 8-wide float vectors on AVX2 machines. The test should perhaps consider the bit width of the tuple instead of the number of elements.
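
To make the bit-width test concrete, here is a sketch in Julia terms (is_hw_vector_width is a hypothetical helper, not anything in codegen):

# Hypothetical criterion: use the vector representation when the tuple's
# total byte size matches a hardware vector width, instead of counting elements.
is_hw_vector_width(nbytes) = nbytes in (16, 32, 64)  # SSE, AVX/AVX2, AVX-512

is_hw_vector_width(sizeof(NTuple{4,Float32}))   # 16 bytes -> true (SSE width)
is_hw_vector_width(sizeof(NTuple{8,Float32}))   # 32 bytes -> true (AVX2 width)
is_hw_vector_width(sizeof(NTuple{16,Float32}))  # 64 bytes -> true (AVX-512 width)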

@carnaval
Contributor

We probably want to wait for Jameson's codegen rewrite to do that, so that we can make choices on storage on a per-case basis instead of having to decide only based on the type itself.

The vector alignment was part of this, I think: if we declare that all homogeneous tuples of size < K have to be llvm vector types, then they also have to be vector types inside a struct, which would prevent us from having a C-compatible layout because of alignment.
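
To illustrate the concern concretely (a sketch; HasTuple is a made-up type): in C, struct { float x; float v[4]; } places v at offset 4, but if v had to lower to an llvm <4 x float> it would need 16-byte alignment, pushing it to offset 16 and breaking the matching layout for ccall.

immutable HasTuple
    x::Float32
    v::NTuple{4,Float32}  # must keep array layout in memory to match the C struct
end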

@carnaval
Contributor

cc @vtjnash to check. But I think with your change we could have tuples be llvm vectors only when we get them out of an array or are passing them as specsig function arguments (and homogeneous, and of a reasonable size, given the buggy handling of weird-sized vectors by llvm)

@IainNZ added the performance and regression labels Jun 30, 2015
@vtjnash
Member

vtjnash commented Jun 30, 2015

> Looks like #11187 was the motivation. In the commit I see:

it was turned off prior to then. i did a bit of work in #11187 trying to make it possible to turn back on, but didn't manage to get it fully working.

> Is there any way I can help with the load/store alignment issues? We should also consider refining validVectorSize so that we get the LLVM vector representation for 16-wide float vectors if the target machine has AVX-512, and likewise for 8-wide float vectors on AVX2 machines. The test should perhaps consider the bit width of the tuple instead of the number of elements.

if you try turning it on, see which tests fail / segfault due to incorrect alignment assumptions

> We probably want to wait for Jameson's codegen rewrite to do that, so that we can make choices on storage on a per-case basis instead of having to decide only based on the type itself.

yes, please! it's almost done.

> The vector alignment was part of this, I think: if we declare that all homogeneous tuples of size < K have to be llvm vector types, then they also have to be vector types inside a struct, which would prevent us from having a C-compatible layout because of alignment.

> cc @vtjnash to check. But I think with your change we could have tuples be llvm vectors only when we get them out of an array or are passing them as specsig function arguments (and homogeneous, and of a reasonable size, given the buggy handling of weird-sized vectors by llvm)

i would love to hear ideas for handling this sanely. currently the behavior of tuples in llvmcall / ccall is officially undefined (even though we have a test for it), since it isn't entirely clear when tuples/structs should turn into vector types, and when they should be array types. maybe we should special-case an MMX type in codegen, like we have for floats, signed, and complex?

additionally, since the new AVX stuff requires > 16 byte alignment, it will be a bit complicated to allocate them since both system malloc and Julia's GC only guarantee 16 byte alignment. as a workaround, it may be best to just force LLVM to use the unaligned vmov instructions.

@ArchRobison
Contributor

I'm looking forward to the codegen rewrite.

I concur with using unaligned vector move instructions for the >16 byte alignment cases unless the compiler can prove the vector is aligned.

@mlubin
Member

mlubin commented Jul 5, 2015

So this is blocked by #11973? (And hence won't be fixed before 0.4)

@ArchRobison
Contributor

It would seem so, unless we want to hack an ad-hoc patch for the current codegen, which is probably not worth the effort.

@ArchRobison
Contributor

I've worked out an ABI and patch. All the tests pass. I'll wait for #11973 to be committed so I can restructure it for that.

The ABI that I came up with uses an LLVM vector for a tuple under the following conditions:

  1. The tuple is homogeneous.
  2. The corresponding element type is legal as an element of an LLVM vector.
  3. The tuple has length 2, 4, or 8.
  4. The tuple is not a sub-object of a larger object.
  5. The tuple is not being passed to C or llvmcall.

Otherwise it is represented as an array.

Rule 4 ensures layouts have the same offsets as before. Rules 4 and 5 sometimes require bitwise conversion between arrays and tuples, though LLVM seems good about removing gratuitous ones.

The net effect of these rules is that LLVM vector types are used when vector-like tuples act as return values, arguments to specialized signatures, local variables, or SSA values.
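
For example (a sketch of how the rules play out; f and Wrap are illustrative names, not code from the patch):

# As a specsig argument and SSA value, v can be an LLVM vector (rules 1-3).
f(v::NTuple{4,Float32}) = (v[1]+1f0, v[2]+1f0, v[3]+1f0, v[4]+1f0)

immutable Wrap
    t::NTuple{4,Float32}  # a sub-object of a larger object, so rule 4 keeps it an array
end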

@ArchRobison
Contributor

Good news/bad news:

  • I got this working with the latest master branch, and it makes SLP vectorization work for tuples again.
  • When SLP vectorization does not kick in, the change slows down obvious examples.
    An example is this toy benchmark. Using an i4770 ("Haswell") and LLVM 3.7, I get:
    • With the changes and -O, the x86 code is vectorized nicely.
    • With the changes and without -O, the x86 code is ugly. The LLVM code is relatively clean, but the x86 backend makes a mess of it.
    • Without the changes and without -O, the x86 code is scalar, but clean. Though the code is scalar, superscalar execution is quite effective at overlapping the operations.

Changes are in my repository under branch adr/simdtuple.

There seem to be three ways to take this forward:

  1. Fix SLP Vectorizer to deal with structs, so homogeneous tuples won't need to be treated as a special case at the Julia level.
  2. Fix LLVM's x86 backend.
  3. Give up on SLP Vectorizer and use llvmcall to get at the vector instructions.

I'm leaning in the direction of 1 since it's the most general solution. E.g., it would enable SIMDization of an immutable type with "red", "green", "blue", and "alpha" fields of type Float32 (sketched below). Though, as an Intel employee, I find option 2 easy to justify for reasons of pride and my time. :-)
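
A sketch of the kind of type and operation option 1 would make vectorizable (RGBA and addc are illustrative names, not code from the patch):

immutable RGBA
    red::Float32
    green::Float32
    blue::Float32
    alpha::Float32
end

# Four independent adds over adjacent Float32 fields: the shape a
# struct-aware SLP vectorizer could turn into a single <4 x float> fadd.
addc(x::RGBA, y::RGBA) = RGBA(x.red + y.red, x.green + y.green, x.blue + y.blue, x.alpha + y.alpha)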

I'll mull this some more.

@simonster
Member

When we switch to LLVM 3.7 (#9336) we can just enable -O all the time, right? In #6271 it looked like the performance difference with LLVM 3.5 was negligible, so unless this regressed between then and LLVM 3.7 we should be okay.

@ArchRobison
Contributor

The problem will be codes that use homogeneous tuples in ways that are not amenable to SIMD instructions, e.g., elementwise addition of the first three elements and a subtraction for the fourth (sketched below). My change as it currently stands would slow down those kinds of codes.
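
For instance (a sketch; mixed is an illustrative name):

# Three adds and one subtract: not expressible as a single uniform SIMD add,
# so SLP gives up, yet the tuple still pays the vector-representation overhead.
function mixed(a::NTuple{4,Float32}, b::NTuple{4,Float32})
    (a[1]+b[1], a[2]+b[2], a[3]+b[3], a[4]-b[4])
end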

@ScottPJones
Contributor

But 1 & 2 are not mutually exclusive, are they? Both sound like they should be done at some point.

@ArchRobison
Contributor

Update: I submitted an LLVM patch yesterday that makes SLPVectorization work with tuples again. Once it passes code review, it shouldn't be difficult to backport as a patch against LLVM 3.7. The patch not only enables SLP vectorization of tuples; it also enables SLP vectorization of immutable types.

This gist shows two examples, one with a tuple and one with an immutable type, that both vectorize with the patch.

@ScottPJones
Contributor

Nice! What sort of results did you get before/after from the examples in your gist?

@jrevels
Member Author

jrevels commented Oct 31, 2015

> The patch not only enables SLP vectorization of tuples; it also enables SLP vectorization of immutable types.

This looks awesome! Looking forward to it.

@KristofferC
Member

KristofferC commented Jan 22, 2017

This vectorizes when using -O3:

julia> code_llvm(madd,(t,t,t))

define void @julia_madd_65810([4 x float]* noalias sret, [4 x float]*, [4 x float]*, [4 x float]*) #0 !dbg !5 {
top:
  %4 = bitcast [4 x float]* %1 to <4 x float>*
  %5 = load <4 x float>, <4 x float>* %4, align 4
  %6 = bitcast [4 x float]* %2 to <4 x float>*
  %7 = load <4 x float>, <4 x float>* %6, align 4
  %8 = fmul <4 x float> %5, %7
  %9 = bitcast [4 x float]* %3 to <4 x float>*
  %10 = load <4 x float>, <4 x float>* %9, align 4
  %11 = fadd <4 x float> %8, %10
  %12 = bitcast [4 x float]* %0 to <4 x float>*
  store <4 x float> %11, <4 x float>* %12, align 4
  ret void
}

@tkelman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Jan 22, 2017
@KristofferC removed the potential benchmark label Oct 31, 2018