
laplace equation benchmark performance #1168

Closed
ViralBShah opened this issue Aug 17, 2012 · 45 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

Track performance of the laplace equation benchmark. Discussion here:

https://groups.google.com/d/topic/julia-dev/VDlr3_6Szos/discussion

@ViralBShah
Member Author

Benchmark has been added to the perf2 suite.

https://github.com/JuliaLang/julia/tree/master/test/perf2

@JeffBezanson
Member

I don't think the vectorized version will ever be as fast as the other one.
These performance issues are getting out of hand. They need to be analyzed and narrowed down.

@ViralBShah
Member Author

Yes, I was just thinking the same. The vectorized version does far too much work, but there is certainly room for improvement.

-viral


ViralBShah added a commit that referenced this issue Aug 20, 2012
@ViralBShah
Member Author

Original multi-language performance comparison:

    Octave (0):  14.9932 seconds
    GNU R  (0):  56.4161 seconds
    Julia  (0):  39.1679 seconds

                  gcc       intel
    1     :    13.4774       *       NumPy
    2     :       *        1.6633    Cilk Plus
    3, 4  :     2.4963     1.9179    Vectorized Fortran (pure)

@ViralBShah
Member Author

Devectorized julia for the same number of iterations (2^15) now runs in 3.8 seconds for me.

Does numpy implement views and end up evaluating lazily in this case, and hence end up being much faster? Since Octave takes about the same amount of time, I presume we simply have a lot of room for improvement in Julia.

@toivoh
Contributor

toivoh commented Aug 20, 2012

Yes, numpy indexing with scalars and slices (ranges) always produces views.
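
In Julia terms the distinction looks roughly like this; a minimal sketch, written with the view function from later Julia versions (at the time the closest equivalent was sub):

A = rand(1000, 1000)

# Range indexing copies the selected block into a fresh array:
B = A[2:end-1, 2:end-1]                      # allocates a new 998x998 array

# A view only wraps A with an index translation, so nothing is copied:
V = view(A, 2:size(A,1)-1, 2:size(A,2)-1)    # essentially allocation-free
V[1, 1] = 0.0                                # writes through to A[2, 2]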

@ViralBShah
Member Author

Does Octave do the same? I didn't think so, and yet it matches numpy's performance. I'll write the benchmark in a few other languages for comparison.

@toivoh
Contributor

toivoh commented Aug 20, 2012

I would imagine that it does copy on write, since it strives to emulate matlab. But since I'm not an octave user, I really don't know.

@StefanKarpinski
Member

I really think it's arguable that our slices should be views.

@toivoh
Contributor

toivoh commented Aug 20, 2012

+1
Except for copy on write (which I'm definitely not suggesting),
I'm not aware of any other sane way to get performance with slices.

@JeffBezanson
Member

I could get on board with that. We either have to make all ops on SubArray fast, or build views into the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.
An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.
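
A sketch of the kind of loop such an abstraction should enable, written here with the eachindex idea that Julia eventually grew for exactly this purpose:

# Linear indexing: great for a plain dense Array, slow for a view or other wrapper.
function sum_linear(A)
    s = zero(eltype(A))
    for i = 1:length(A)
        s += A[i]
    end
    return s
end

# Generic indexing: each array type supplies its own efficient index set
# (plain integers for dense arrays, cartesian indices for views), so the
# same loop stays fast for all of them.
function sum_each(A)
    s = zero(eltype(A))
    for I in eachindex(A)
        s += A[I]
    end
    return s
end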

@timholy
Member

timholy commented Aug 21, 2012

> An issue that immediately comes up is that linear indexing is no longer fast with views

In my currently-stalled rewrite of the Image library, I have a substitute brewing:

type Counter{N}
    max::NTuple{N, Int}
end
Counter(sz) = Counter{length(sz)}(to_tuple(sz))

function start{N}(c::Counter{N})
    state = ones(Int,N)
    state[1] = 0 # because of start/done/next sequence, start out one behind
    return state
end
function done(c::Counter, state)
    # we do the increment as part of "done" to make exit-testing more efficient
    state[1] += 1
    i = 1
    max = c.max
    while state[i] > max[i] && i < length(state)
        state[i] = 1
        i += 1
        state[i] += 1
    end
    state[end] > c.max[end]
end
function next(c::Counter, state)
    return state, state
end

This can be pretty fast, particularly if you handle the inner loop like this:

szv = [sz...]  # vector from tuple
szv[1] = 1
for c = Counter(szv)
    cprod = prod(c)
    for i = 1:sz[1]
        p[index] = i*cprod
        index += 1
    end
end

Then for most applications the performance is essentially indistinguishable from linear indexing. (However, there are traps: this is the usage that prompted me to report performance issue 73d8ca4). Assuming it could be generalized to work on general ranges, and to return a linear index, would that help?

I think in my initial version of this I tried returning both the linear index and the coordinate representation, and also ran into performance problems due to tuple creation, but memory of exactly what was going on is now hazy.
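
(For the "return a linear index" part, a rough hypothetical sketch: carry column-major strides alongside the counter and accumulate the offset from the coordinate state. The helper names here are made up.)

# Hypothetical helper: column-major strides for an array of size sz,
# i.e. (1, sz[1], sz[1]*sz[2], ...).
make_strides(sz) = [prod(sz[1:d-1]) for d = 1:length(sz)]

# Hypothetical helper: map a coordinate state (like the one Counter carries)
# to the corresponding linear index.
function linear_index(state, strides)
    idx = 1
    for d = 1:length(state)
        idx += (state[d] - 1) * strides[d]
    end
    return idx
end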

@timholy
Member

timholy commented Aug 21, 2012

Should have specified: on the inner loop I was demoing how you'd set each element to the product of coordinates, i.e., p[i,j,k] = i*j*k. And you'd need to set index = 1 at the beginning.

@JeffBezanson
Member

This seems to rely on making the first dimension a special case for performance, or at least assuming it is large enough for the inner loop to dominate. I think we need something more drastic. We may also need to depend on better compiler optimizations like allocating tuples on the stack or in registers.

@timholy
Member

timholy commented Aug 21, 2012

Yes, that's basically true.

OK, might one be able to generate a user-friendly version of gen_array_index_map that would make it trivial to write linear-indexing code wherever it might appear? Or is that something that would have to go into the compiler?

@toivoh
Contributor

toivoh commented Aug 21, 2012

This is turning into a full-blown discussion. Perhaps best to move it over to julia-dev to invite some more ideas?

@ViralBShah
Member Author

I would rather have the discussion here, so that all the perf issues can be tracked and addressed in one place.

@ViralBShah
Member Author

On the pure scalar indexing bit, we are still a factor of 1.5x-2x slower than C. So long as we do not close that gap, whether through better code generation or other optimizations, vectorized Julia code will remain that much slower than other languages where the vectorized operations are implemented in C.

@timholy
Member

timholy commented Aug 22, 2012

There was a mailing list discussion a while back that found that bounds-checking accounts for some of this. Code with bounds-checks disabled (by commenting out two lines in codegen.cpp) ran in about 75% of the time of the code with bounds-checks, pretty constantly on three different machines (mine, Krys', and Bill's, if memory serves). Not a huge performance difference, but if you're thinking about factors of 2 that will be some chunk of it.
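
For reference, the same experiment can be done per-loop with the @inbounds macro instead of editing codegen.cpp (this sketch assumes a Julia version that has @inbounds, and it is not the exact benchmark code):

function laplace_devec_inbounds!(u, dx2, dy2)
    c = 1 / (2 * (dx2 + dy2))
    # bounds checks are skipped inside this loop
    @inbounds for j = 2:size(u,2)-1, i = 2:size(u,1)-1
        u[i,j] = ((u[i-1,j] + u[i+1,j]) * dy2 +
                  (u[i,j-1] + u[i,j+1]) * dx2) * c
    end
    return u
end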

@kk49
Contributor

kk49 commented Aug 22, 2012

I think with bounds-checking removed the code generated in loops was about the same as what clang generates for C/C++. You just have to feed the JIT engine enough code to be able to optimize it.

@StefanKarpinski
Member

> I really think it's arguable that our slices should be views.

> I could get on board with that. We either have to make all ops on SubArray fast, or build views into the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.

Using views as the default behavior seems appealing for a few reasons:

  1. It's a very fast default behavior that's extremely powerful if you know how to use it.
  2. It doesn't in practice feel very different from Matlab's behavior; sure there are cases where you assign into a view and that would have an effect, but they're relatively rare and can be fixed with an explicit copy.
  3. Instead of the two ways of slicing things we have now, we would have only one; the other would become a slice + copy, which is exactly what it is. Essentially, we'd be exposing the more fundamental behavior as the core one rather than the other way around (see the sketch below).

> An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.

Now that we have many more examples of types of arrays (sparse, distributed, etc.), we can maybe come up with something that unifies them better and has good performance.
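
To make point 3 above concrete, a small sketch using today's explicit spellings (view plus copy); the exact names are incidental:

A = rand(3, 4)

# If slicing returned views, this would be the one fundamental operation:
r = view(A, 1, :)            # no copy; writing r[i] writes A[1, i]

# and today's copying slice would simply be spelled slice + copy, for the
# (relatively rare) cases that need an independent array:
r_copy = copy(view(A, 1, :))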

@ViralBShah
Member Author

I ran the Laplace benchmark on master:

Julia (vec) - 1874 ms
Matlab (vec) - 220 ms

Julia (devec) - 135 ms
C - 162 ms

It would be nice to figure out if this is a problem of too many temporaries, and to what extent we need to rethink our indexing by using views and such.

@ViralBShah
Member Author

I am marking this as 0.2 just to figure out if there is something we can do to improve our performance for vectorized codes by the next release.

@diegozea
Contributor

Some other issues related to this:
#1392
#2360
#1802

@timholy
Member

timholy commented Mar 18, 2013

It's definitely a problem of too many temporaries, as a simple sprofile run will show: out of 1104 captures on perf2/laplace.jl; laplace_iter_vec; line: 25, there were 970 at array.jl; +; line: 875. That line allocates the output array:

  F = Array(promote_type(S,T), promote_shape(size(A),size(B)))

(Out of those 970, 963 were in jl_gc_collect, so the other function calls on that line have negligible impact.)

However, I'd guess that this is not the kind of thing we're likely to solve in 2-3 weeks. I'd also tentatively suggest that Julia's inlining behavior may be a higher priority. The reason is simple: often a careful programmer can "write around" the library's current tendency to allocate temporaries. It can be inconvenient, but it's usually possible. Case in point: laplace_iter_devec is probably more readable than laplace_iter_vec. In contrast, the mere fact that map(abs, A) is much slower than abs(A)---and there's nothing that one not steeped in LLVM lore can do to change that---influences language design and usage patterns in deep ways.
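
For concreteness, the vectorized update is essentially of this shape (paraphrased, not the verbatim perf2 code); every slice on the right-hand side copies, and every + and * allocates another full-size intermediate, so each iteration churns through roughly ten temporary arrays:

function laplace_step_vec!(u, dx2, dy2, c)
    # four slice copies plus the intermediate results of each + and *
    u[2:end-1, 2:end-1] = ((u[1:end-2, 2:end-1] + u[3:end, 2:end-1]) * dy2 +
                           (u[2:end-1, 1:end-2] + u[2:end-1, 3:end]) * dx2) * c
    return u
end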

But coming from me this is all hot air, because ATM I don't feel up to the challenge of tackling this myself.

@StefanKarpinski
Member

I think that's a good point, @timholy. cc: @JeffBezanson, who is, of course, already following this thread.

@JeffBezanson
Member

I concur that it will not be solved in 2 weeks just by setting a milestone. But I have plans. I have been ruminating a lot about julia's theory of "what is a function", and I may be starting to get somewhere.

@timholy
Member

timholy commented Mar 18, 2013

Sounds exciting! Here, have a nice cup of tea to help your ruminations.

@ViralBShah
Member Author

Right, I didn't expect anything to be solved in 2 weeks, but even being able to give a rough plan on tackling this would be an important signal for everyone.

Also, as @JeffBezanson has often said, the best way to avoid GC is to not allocate memory. Perhaps we can revisit our indexing along these lines, as is discussed in this thread.

@JeffBezanson
Member

57e010e helps both laplace_vec and laplace_devec significantly (but does not fix the gap between them).

@ViralBShah
Member Author

I just reran this benchmark on julia.mit.edu, and find that octave takes 0.8 seconds, whereas julia takes 2.2 seconds.

@ViralBShah
Member Author

I added a subarray version here, and it is significantly slower. I guess we have to wait for the tuple overhaul. In any case, I was wondering if we can characterize why the vectorized version is so much slower (subarrays or otherwise) compared to the devectorized version. I was kind of hoping that @carnaval's GC work would help here, and I wonder if a better GC will help (perhaps by tuning some knobs in the new GC), or if the gains need to come from elsewhere.

julia> @time laplace_vec_sub();
elapsed time: 4.229344435 seconds (1028 MB allocated, 3.58% gc time in 47 pauses with 0 full sweep)

julia> @time laplace_vec();
elapsed time: 1.707854514 seconds (1715 MB allocated, 13.36% gc time in 78 pauses with 0 full sweep)

julia> @time laplace_devec();
elapsed time: 0.162232645 seconds (1 MB allocated)

@JeffBezanson
Member

I thought the new gc already helped here significantly. Though of course it does not close the whole gap. We should work on subarrays.

@ViralBShah
Member Author

If you see my older comment and the new results, the difference went from 13x to 10x. The new GC helped significantly on bench_eu, but not much on laplace.

@timholy
Member

timholy commented Apr 16, 2015

I tried a subarray version a while ago, and it was a little faster. Can you link to your code?

@ViralBShah
Member Author

I have checked in the subarray version into laplace.jl. With tuple overhaul now, there are significant gains, but still too much memory allocation.

julia> @time laplace_vec_sub();
elapsed time: 2.958093484 seconds (1027 MB allocated, 3.30% gc time in 47 pauses with 0 full sweep)

@timholy
Member

timholy commented Apr 17, 2015

It was an indexing problem, not a subarray problem. Fixed in #10858. Will be improved further if #9080 can be fixed.

Almost all of the remaining memory allocation comes from A+B, c*A, etc.

@timholy
Member

timholy commented Apr 17, 2015

I should add: technically, with the new GC the performance problems are presumably more an issue of cache efficiency than memory allocation per se. When you allocate a temporary C = A+B, you traverse A, B, and C; then in D = 2*C, you traverse C again. In contrast, the devectorized version traverses everything once.

@carnaval
Contributor

Cache behavior is especially bad if you are allocating temps in a loop, because you get fresh (read: not cached) memory at each iteration up until collection, and then start all over again.
With preallocation you get very good temporal coherency, of course.
We may have a good solution for this kind of thing at the end of the semester, if all goes well :-)
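
As a toy illustration of the difference (written with the in-place dot syntax from later versions):

# Allocating in the loop: every iteration touches brand-new, cold memory.
function accumulate_fresh(A, B, niter)
    s = 0.0
    for it = 1:niter
        C = A .+ B            # fresh temporary array each time around
        s += sum(C)
    end
    return s
end

# Preallocating: the same buffer is reused every iteration, so it stays
# warm in cache.
function accumulate_prealloc(A, B, niter)
    s = 0.0
    C = similar(A)            # allocated once
    for it = 1:niter
        C .= A .+ B           # in place; no per-iteration allocation
        s += sum(C)
    end
    return s
end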

@timholy
Member

timholy commented Apr 17, 2015

> We may have a good solution for this kind of thing at the end of the semester, if all goes well :-)

Sounds like something to look forward to!

@ViralBShah
Member Author

I am really looking forward to it too!

@stevengj
Member

stevengj commented Oct 6, 2016

Hopefully the performance of the vectorized version will be fixed by #17623, assuming we change the code to use .+ etcetera.

@StefanKarpinski
Member

Syntactic fusion seems like a completely fair solution to this performance issue.
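
Roughly speaking, a fused expression like u0 .= (u1 .+ u2) .* c runs as a single loop over the destination with no intermediate arrays, along the lines of this sketch of the effect (not the actual lowering):

# Assumes u0, u1, u2 have matching shapes.
function fused_update!(u0, u1, u2, c)
    for I in eachindex(u0, u1, u2)
        u0[I] = (u1[I] + u2[I]) * c
    end
    return u0
end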

@stevengj
Member

stevengj commented Dec 20, 2016

Even with loop fusion, this was still taking a lot more allocations than I would have expected. Never mind: it was a typo in my code (I accidentally used a global).

julia> function laplace!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           u0 = @view u[2:Ny-1, 2:Nx-1]
           u1 = @view u[1:Ny-2, 2:Nx-1]
           u2 = @view u[3:Ny, 2:Nx-1]
           u3 = @view u[2:Ny-1,1:Nx-2]
           u4 = @view u[2:Ny-1, 3:Nx]
           for i = 1:Niter
             u0 .= ((u1 .+ u2) .* dy2 .+ (u3 .+ u4) .* dx2) .* c
           end
           return u0
       end
laplace! (generic function with 1 method)

julia> u = zeros(150,150); u[1,:] = 1
1

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.999779 seconds (301.50 k allocations: 10.248 MB, 0.58% gc time)

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.747673 seconds (10 allocations: 496 bytes)

@yuyichao, @pabloferz, any ideas? Is this some problem with views?

stevengj added the "potential benchmark" (Could make a good benchmark in BaseBenchmarks) label on Dec 20, 2016
@stevengj
Member

In comparison, an unrolled version does slightly better, which is not too unexpected — there is some overhead (less than a factor of 2) to using all of those views and generic broadcasting iteration:

julia> function laplace_iter!(u, Niter, dx2, dy2)
                  Ny, Nx = size(u)
                  c = 1 / (2*(dx2+dy2))
                  for iter = 1:Niter
                    for i = 2:size(u,1)-1, j = 2:size(u,2)-1 
                      u[i,j] = ((u[i-1,j]+u[i+1,j])*dx2 + (u[i,j-1]+u[i,j+1])*dy2)*c 
                    end 
                  end
                  return u
              end
laplace_iter! (generic function with 1 method)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.281638 seconds (7.69 k allocations: 327.032 KB)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.221090 seconds (5 allocations: 176 bytes)

But I think the vectorized version is fast enough that we can now close this?
