
Slow loop fusion when multiplying a column vector with a row vector #20875

Closed
zsoerenm opened this issue Mar 3, 2017 · 18 comments
Labels
broadcast (Applying a function over a collection), performance (Must go faster)

Comments

@zsoerenm (Contributor) commented Mar 3, 2017

I am using Julia v0.6.0-pre.alpha.34.
I ran this simple function, which should be fast because of loop fusion:

function test_perf1()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    sum_signal .+=
        steering_vector .*
        cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    return sum_signal
end

This results in:

BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  19
  --------------
  minimum time:     1.328 s (0.98% GC)
  median time:      1.364 s (2.87% GC)
  mean time:        1.363 s (2.99% GC)
  maximum time:     1.398 s (5.34% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

However, when I remove the dot after cis, so that the whole row expression is passed to the vectorized cis method instead of being fused, I get a different result.
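A sketch of that variant, reconstructed from the description (the function name is mine):

function test_perf1_unfused()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    # No dot after cis: the vectorized method builds a 1×N temporary once,
    # so cis runs once per column instead of once per output element.
    sum_signal .+=
        steering_vector .*
        cis((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    return sum_signal
end

With that change I get: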

BenchmarkTools.Trial: 
  memory estimate:  183.12 MiB
  allocs estimate:  97
  --------------
  minimum time:     440.628 ms (17.45% GC)
  median time:      443.680 ms (17.70% GC)
  mean time:        445.445 ms (17.72% GC)
  maximum time:     458.517 ms (19.00% GC)
  --------------
  samples:          12
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

The fused version's memory consumption is lower, but it is about 3x slower.

@KristofferC (Member) commented Mar 3, 2017

Note that the argument to cis is just a scalar.

@zsoerenm (Contributor, Author) commented Mar 3, 2017

It is not a scalar; note that range_transp is inside the cis.() call.
It is slower because cis is unnecessarily calculated 3 times more often than needed, due to the broadcasting with the column vector steering_vector, as @stevengj notes here: https://discourse.julialang.org/t/why-is-this-julia-code-considerably-slower-than-matlab/2365/62

It is better to create a temporary variable first; a sketch of that approach follows below.
I'm closing this for now.
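A minimal sketch of the temporary-variable approach (the function name is mine; sizes as in test_perf1):

function test_perf1_tmp()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vector = complex.(ones(4,1), ones(4,1))
    sum_signal = zeros(Complex{Float64}, 4, length(range))
    # Materialize the expensive cis values once per column in a 1×N temporary...
    carrier = cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180))
    # ...then fuse only the cheap multiply-accumulate over the 4×N output.
    sum_signal .+= steering_vector .* carrier
    return sum_signal
end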

zsoerenm closed this as completed Mar 3, 2017
@KristofferC (Member) commented Mar 3, 2017

Oh, sorry, yes, I miscounted the parentheses.

@StefanKarpinski (Member)

Why close this if there's a real performance issue? The vectorized cis function is deprecated on 0.6, so I'm a bit confused about the bug report here; the deprecated vectorized cis method is almost certainly horribly slow.

@StefanKarpinski (Member) commented Mar 3, 2017

The output of @code_warntype is quite long and definitely has some ::Any typed variables in it, so I'm going to reopen this as a performance bug.

StefanKarpinski added the performance (Must go faster) and broadcast (Applying a function over a collection) labels Mar 3, 2017
@stevengj (Member) commented Mar 3, 2017

@StefanKarpinski, it's not a real performance issue. The programmer has to decide whether it is worth it to allocate the extra storage to call the cis function fewer times.

If you do X .+= colvec .* cis.(rowvec), then it allocates no temporary storage but it calls cis for every output. If you do tmp = cis.(rowvec); X .+= colvec .* tmp then it allocates more storage but it calls cis fewer times (saving a factor of length(colvec)).
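To make the tradeoff concrete, here is an illustrative sketch (names and sizes are mine, not from the thread) that counts how often a cis-like function runs under each form:

# Count calls to a cis-like function under both broadcast forms.
ncalls = Ref(0)
counted_cis(x) = (ncalls[] += 1; cis(x))

colvec = ones(4)              # stands in for steering_vector
rowvec = collect(1:1000)'     # stands in for the transposed range
X = zeros(Complex{Float64}, 4, 1000)

ncalls[] = 0
X .+= colvec .* counted_cis.(rowvec)   # fully fused: no temporary
@show ncalls[]                         # 4000: once per output element

ncalls[] = 0
tmp = counted_cis.(rowvec)             # allocates a 1×1000 temporary
X .+= colvec .* tmp
@show ncalls[]                         # 1000: once per column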

@StefanKarpinski (Member)

One really nice thing about the new (c)transpose wrappers is that you can just transpose a range without needing to expand it to a full vector, so the collect in collect(range)' can be removed.
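For example, on 0.6 this gives a lazy RowVector with no 16 MB collect (a minimal sketch):

rangeᵀ = (1:2000000)'    # lazy transpose of the range
size(rangeᵀ)             # (1, 2000000)
rangeᵀ[3]                # 3, indexed like a 1×N row vector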

@stevengj (Member) commented Mar 3, 2017

In principle, the optimal thing would be something like

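# Hypothetical names from the discussion above: rowvec is the transposed range,
# colvec is steering_vector, and X is sum_signal. cis is hoisted out of the
# inner loop, so it runs once per column rather than once per output element.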
for j=1:length(rowvec)
    tmp = cis(rowvec[j])
    for i = 1:length(colvec)
        X[i,j] += colvec[i] * tmp
    end
end

which allocates no temporary storage and also calls cis the minimal number of times, but it seems hard for the compiler to automate this transformation starting from X .+= colvec .* cis.(rowvec).

@StefanKarpinski (Member)

What about the Any-typed variables in the @code_warntype output, e.g. T::Any, A_1@_55::Any, and A_1@_152::Any? It doesn't seem like the generated code for this should have any such variables.

@StefanKarpinski (Member)

That sort of optimization is where an @pure annotation would be really useful, if it meant what people think it should mean: calling the method again with the same arguments always gives the same answer, so the compiler is free to call it as few or as many times as it wants and to cache results. Of course, that's not what @vtjnash says @pure means, so cis isn't considered "pure".

zsoerenm changed the title from "Slow loop fusion" to "Slow loop fusion when multiplying a column vector with a row vector" Mar 3, 2017
@stevengj (Member) commented Mar 3, 2017

@StefanKarpinski, you're right, a sufficiently clever compiler could hoist cis(rowvec[j]) out of the loop (after inlining the broadcast) if it knows that both cis and getindex are pure. That's very similar to #14324, however.

That said, the repeated computations of cis are presumably a separate issue from any type-inference failure here. Can we focus this issue on one or the other?

@StefanKarpinski (Member)

Let's focus on the type inference issues then, since those should definitely not occur. Being clever enough to hoist repeated pure computations is out of scope for now.

@yuyichao (Contributor) commented Mar 3, 2017

There isn't an inference issue; the red entries are variables that are not used. I wonder if #20853 already filters them out.

@zsoerenm (Contributor, Author) commented Mar 3, 2017

In version 0.5 I don't get those Any-typed variables from @code_warntype.
It is also considerably faster:

v0.5:
BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  20
  --------------
  minimum time:     1.774 s (0.51% GC)
  median time:      1.830 s (2.06% GC)
  mean time:        1.813 s (1.71% GC)
  maximum time:     1.836 s (2.51% GC)
  --------------
  samples:          3
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

v0.6:
BenchmarkTools.Trial: 
  memory estimate:  122.07 MiB
  allocs estimate:  15
  --------------
  minimum time:     3.675 s (0.27% GC)
  median time:      3.695 s (1.21% GC)
  mean time:        3.695 s (1.21% GC)
  maximum time:     3.715 s (2.13% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Code for both:

@inbounds function test_perf6()
    rangeᵀ = (1:2000000)'
    steering_vectors = complex.(ones(4,11), ones(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        for kc = 1:length(rangeᵀ)
            carrier_signal = cis((2 * pi * 1.023e6 / 4e6) * rangeᵀ[kc] + (40 * pi / 180))
            for ks = 1:4
                sum_signal[ks,kc] += steering_vectors[ks,i] * carrier_signal
            end
        end
    end
    return sum_signal
end

@vtjnash (Member) commented Mar 3, 2017

After inlining, we can probably add a pass to attempt loop hoisting (loop-invariant code motion, LICM). Inference is perfectly capable of doing that without the user being required to apply an incorrect @pure annotation, especially since pure isn't really the right criterion here.

@yuyichao (Contributor) commented Mar 3, 2017

That @inbounds isn't doing anything AFAICT.

@mbauman (Member) commented Apr 23, 2018

This is greatly improved on master (and #25377). That said, the LLVM IR looks like we're still computing cis within the innermost loop… have we just gotten way better at it? On the same machine that reports 1.6s on 0.6.2:

julia> @benchmark test_perf1()
BenchmarkTools.Trial:
  memory estimate:  137.33 MiB
  allocs estimate:  8
  --------------
  minimum time:     393.551 ms (1.90% GC)
  median time:      395.808 ms (1.87% GC)
  mean time:        411.531 ms (4.83% GC)
  maximum time:     463.017 ms (16.12% GC)
  --------------
  samples:          13
  evals/sample:     1

@vtjnash (Member) commented Sep 20, 2018

Closing in favor of the example in #29285 (this one used a now-deprecated method of cis).

vtjnash closed this as completed Sep 20, 2018