
Slow loop fusion when multiplying a column vector with a row vector #20875

Closed
zsoerenm opened this issue Mar 3, 2017 · 18 comments
Labels
broadcast (Applying a function over a collection), performance (Must go faster)

Comments

@zsoerenm (Contributor) commented Mar 3, 2017

I am using Julia v0.6.0-pre.alpha.34.
I ran this simple function, which should be fast because of loop fusion:

function test_perf1()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    sum_signal .+=
        steering_vector .*
        cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    return sum_signal
end

This results in:

BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  19
  --------------
  minimum time:     1.328 s (0.98% GC)
  median time:      1.364 s (2.87% GC)
  mean time:        1.363 s (2.99% GC)
  maximum time:     1.398 s (5.34% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

However, when I remove the dot after cis, so that the whole row expression is passed to the vectorized cis method instead of being fused, I get a different result.
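A sketch of that variant, reconstructed from the description (the function name is mine):

function test_perf1_unfused()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    # No dot after cis: the vectorized method builds a 1×N temporary once,
    # so cis runs once per column instead of once per output element.
    sum_signal .+=
        steering_vector .*
        cis((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    return sum_signal
end

With that change I get: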

BenchmarkTools.Trial: 
  memory estimate:  183.12 MiB
  allocs estimate:  97
  --------------
  minimum time:     440.628 ms (17.45% GC)
  median time:      443.680 ms (17.70% GC)
  mean time:        445.445 ms (17.72% GC)
  maximum time:     458.517 ms (19.00% GC)
  --------------
  samples:          12
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

The fused version's memory consumption is lower, but it is about 3x slower.

@KristofferC (Member) commented Mar 3, 2017

Note that the argument to cis is just a scalar.

@zsoerenm (Contributor, Author) commented Mar 3, 2017

It is not a scalar; note that range_transp is inside the cis.() call.
It is slower because cis is unnecessarily calculated 3 times more often than needed, due to the broadcasting with the column vector steering_vector, as @stevengj notes here: https://discourse.julialang.org/t/why-is-this-julia-code-considerably-slower-than-matlab/2365/62

It is better to create a temporary variable first; a sketch of that approach follows below.
I'm closing this for now.
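A minimal sketch of the temporary-variable approach (the function name is mine; sizes as in test_perf1):

function test_perf1_tmp()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    # Materialize the expensive cis values once per column in a 1×N temporary...
    carrier = cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    # ...then fuse only the cheap multiply-accumulate over the 4×N output.
    sum_signal .+= steering_vector .* carrier
    return sum_signal
end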

zsoerenm closed this as completed Mar 3, 2017
@KristofferC (Member) commented Mar 3, 2017

Oh, sorry, yes, I miscounted the parentheses.

@StefanKarpinski (Member)

Why close this if there's a real performance issue? The vectorized cis function is deprecated on 0.6, so I'm a bit confused about the bug report here; the deprecated vectorized cis method is almost certainly horribly slow.

@StefanKarpinski (Member) commented Mar 3, 2017

The output of @code_warntype is quite long and definitely has some ::Any typed variables in it, so I'm going to reopen this as a performance bug.

StefanKarpinski added the performance (Must go faster) and broadcast (Applying a function over a collection) labels Mar 3, 2017
@stevengj (Member) commented Mar 3, 2017

@StefanKarpinski, it's not a real performance issue. The programmer has to decide whether it is worth it to allocate the extra storage to call the cis function fewer times.

If you do X .+= colvec .* cis.(rowvec), then it allocates no temporary storage but it calls cis for every output. If you do tmp = cis.(rowvec); X .+= colvec .* tmp then it allocates more storage but it calls cis fewer times (saving a factor of length(colvec)).
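To make the tradeoff concrete, here is an illustrative sketch (names and sizes are mine, not from the thread) that counts how often a cis-like function runs under each form:

# Count calls to a cis-like function under both broadcast forms.
ncalls = Ref(0)
counted_cis(x) = (ncalls[] += 1; cis(x))

colvec = ones(4)              # stands in for steering_vector
rowvec = collect(1:1000)'     # stands in for the transposed range
X = zeros(Complex{Float64}, 4, 1000)

ncalls[] = 0
X .+= colvec .* counted_cis.(rowvec)   # fully fused: no temporary
@show ncalls[]                         # 4000: once per output element

ncalls[] = 0
tmp = counted_cis.(rowvec)             # allocates a 1×1000 temporary
X .+= colvec .* tmp
@show ncalls[]                         # 1000: once per column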

@StefanKarpinski (Member)

One really nice thing about the new (c)transpose wrappers is that you can just transpose a range without needing to expand it to a full vector, so the collect in collect(range)' can be removed.
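For example, on 0.6 this gives a lazy RowVector with no 16 MB collect (a minimal sketch):

rangeᵀ = (1:2000000)'    # lazy transpose of the range
size(rangeᵀ)             # (1, 2000000)
rangeᵀ[3]                # 3, indexed like a 1×N row vector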

@stevengj (Member) commented Mar 3, 2017

In principle, the optimal thing would be something like

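# Hypothetical names from the discussion above: rowvec is the transposed range,
# colvec is steering_vector, and X is sum_signal. cis is hoisted out of the
# inner loop, so it runs once per column rather than once per output element.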
for j=1:length(rowvec)
    tmp = cis(rowvec[j])
    for i = 1:length(colvec)
        X[i,j] += colvec[i] * tmp
    end
end

which allocates no temporary storage and also calls cis the minimal number of times, but it seems hard for the compiler to automate this transformation starting from X .+= colvec .* cis.(rowvec).

@StefanKarpinski (Member)

What about the Any-typed variables in the @code_warntype output, e.g. T::Any, A_1@_55::Any, and A_1@_152::Any? It doesn't seem like the generated code for this should have any such variables.

@StefanKarpinski (Member)

That sort of optimization is where an @pure annotation would be really useful, if it meant what people think it should mean: calling the method again with the same arguments always gives the same answer, so the compiler is free to call it as few or as many times as it wants and to cache results. Of course, that's not what @vtjnash says @pure means, so cis isn't considered "pure".

zsoerenm changed the title from "Slow loop fusion" to "Slow loop fusion when multiplying a column vector with a row vector" Mar 3, 2017
@stevengj (Member) commented Mar 3, 2017

@StefanKarpinski, you're right, a sufficiently clever compiler could hoist cis(rowvec[j]) out of the loop (after inlining the broadcast) if it knows that both cis and getindex are pure. That's very similar to #14324, however.

That said, the repeated computations of cis are presumably a separate issue from any type-inference failure here. Can we focus this issue on one or the other?

@StefanKarpinski (Member)

Let's focus on the type inference issues then, since those should definitely not occur. Being clever enough to hoist repeated pure computations is out of scope for now.

@yuyichao (Contributor) commented Mar 3, 2017

There isn't an inference issue; the red entries are variables that are not used. I wonder if #20853 already filters them out.

@zsoerenm (Contributor, Author) commented Mar 3, 2017

In version 0.5 I don't get those Any-typed variables from @code_warntype.
It is also considerably faster:

v0.5:
BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  20
  --------------
  minimum time:     1.774 s (0.51% GC)
  median time:      1.830 s (2.06% GC)
  mean time:        1.813 s (1.71% GC)
  maximum time:     1.836 s (2.51% GC)
  --------------
  samples:          3
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

v0.6:
BenchmarkTools.Trial: 
  memory estimate:  122.07 MiB
  allocs estimate:  15
  --------------
  minimum time:     3.675 s (0.27% GC)
  median time:      3.695 s (1.21% GC)
  mean time:        3.695 s (1.21% GC)
  maximum time:     3.715 s (2.13% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Code for both:

@inbounds function test_perf6()
    rangeᵀ = (1:2000000)'
    steering_vectors = complex.(ones(4,11), ones(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        for kc = 1:length(rangeᵀ)
            carrier_signal = cis((2 * pi * 1.023e6 / 4e6) * rangeᵀ[kc] + (40 * pi / 180))
            for ks = 1:4
                sum_signal[ks,kc] += steering_vectors[ks,i] * carrier_signal
            end
        end
    end
    return sum_signal
end

@vtjnash (Member) commented Mar 3, 2017

After inlining, we can probably add a pass to attempt loop hoisting (loop-invariant code motion, LICM). Inference is perfectly capable of doing that without the user being required to apply an incorrect @pure annotation, especially since pure isn't really the right criterion here.

@yuyichao (Contributor) commented Mar 3, 2017

That @inbounds isn't doing anything AFAICT.

@mbauman (Member) commented Apr 23, 2018

This is greatly improved on master (and #25377). That said, the LLVM IR looks like we're still computing cis within the innermost loop… have we just gotten way better at it? On the same machine that reports 1.6s on 0.6.2:

julia> @benchmark test_perf1()
BenchmarkTools.Trial:
  memory estimate:  137.33 MiB
  allocs estimate:  8
  --------------
  minimum time:     393.551 ms (1.90% GC)
  median time:      395.808 ms (1.87% GC)
  mean time:        411.531 ms (4.83% GC)
  maximum time:     463.017 ms (16.12% GC)
  --------------
  samples:          13
  evals/sample:     1

@vtjnash (Member) commented Sep 20, 2018

Closing in favor of the example in #29285 (this one used a now-deprecated method of cis).

vtjnash closed this as completed Sep 20, 2018