Regression with arithmetic operations on sparse matrices #57
We have had several reports of sparse matrix regressions. Probably worth taking a good look at what is going on and expanding the sparse matrix benchmarks.
Results are similar when interpolating:

julia> versioninfo()
Julia Version 0.5.1
Commit 6445c82* (2017-03-05 13:25 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin15.6.0)
CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
julia> A = sprand(10, 10, 0.2);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 849
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 736.00 bytes
allocs estimate: 4
minimum time: 146.00 ns (0.00% GC)
median time: 166.00 ns (0.00% GC)
mean time: 215.95 ns (15.66% GC)
maximum time: 2.27 μs (88.80% GC)
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 282
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.75 kb
allocs estimate: 2
minimum time: 290.00 ns (0.00% GC)
median time: 319.00 ns (0.00% GC)
mean time: 390.04 ns (10.10% GC)
maximum time: 4.27 μs (0.00% GC)
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
samples: 10000
evals/sample: 685
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 736.00 bytes
allocs estimate: 4
minimum time: 186.00 ns (0.00% GC)
median time: 206.00 ns (0.00% GC)
mean time: 260.94 ns (13.07% GC)
maximum time: 2.91 μs (85.06% GC)

and

julia> versioninfo()
Julia Version 0.6.0-pre.beta.305
Commit f56147d* (2017-04-24 18:38 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin15.6.0)
CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
julia> A = sprand(10, 10, 0.2);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 7
--------------
minimum time: 238.269 ns (0.00% GC)
median time: 258.608 ns (0.00% GC)
mean time: 329.116 ns (12.51% GC)
maximum time: 4.580 μs (86.26% GC)
--------------
samples: 10000
evals/sample: 438
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
memory estimate: 2.02 KiB
allocs estimate: 7
--------------
minimum time: 644.115 ns (0.00% GC)
median time: 707.595 ns (0.00% GC)
mean time: 858.236 ns (8.50% GC)
maximum time: 8.213 μs (80.75% GC)
--------------
samples: 10000
evals/sample: 174
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 7
--------------
minimum time: 256.019 ns (0.00% GC)
median time: 281.244 ns (0.00% GC)
mean time: 355.654 ns (11.75% GC)
maximum time: 5.543 μs (89.03% GC)
--------------
samples: 10000
evals/sample: 362

Also note that discrepancies persist in the multiplication and division cases for more reasonably sized (though still small) sparse matrices. On 0.5.1:

julia> A = sprand(1000, 1000, 0.01);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 167.58 kb
allocs estimate: 6
minimum time: 16.95 μs (0.00% GC)
median time: 25.26 μs (0.00% GC)
mean time: 50.63 μs (29.92% GC)
maximum time: 3.43 ms (77.29% GC)
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
samples: 795
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 15.26 mb
allocs estimate: 4
minimum time: 4.10 ms (0.00% GC)
median time: 6.38 ms (28.68% GC)
mean time: 6.28 ms (27.32% GC)
maximum time: 9.35 ms (42.09% GC)
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 167.58 kb
allocs estimate: 6
minimum time: 28.98 μs (0.00% GC)
median time: 36.61 μs (0.00% GC)
mean time: 61.14 μs (22.96% GC)
maximum time: 2.96 ms (75.51% GC)

and on master:

julia> A = sprand(1000, 1000, 0.01);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
memory estimate: 167.75 KiB
allocs estimate: 9
--------------
minimum time: 28.489 μs (0.00% GC)
median time: 37.700 μs (0.00% GC)
mean time: 63.382 μs (23.70% GC)
maximum time: 3.182 ms (97.13% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
memory estimate: 15.27 MiB
allocs estimate: 9
--------------
minimum time: 3.022 ms (0.00% GC)
median time: 5.242 ms (33.51% GC)
mean time: 5.098 ms (31.28% GC)
maximum time: 76.732 ms (94.91% GC)
--------------
samples: 978
evals/sample: 1
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
memory estimate: 167.75 KiB
allocs estimate: 9
--------------
minimum time: 40.635 μs (0.00% GC)
median time: 50.794 μs (0.00% GC)
mean time: 78.106 μs (19.19% GC)
maximum time: 3.396 ms (95.61% GC)
--------------
samples: 10000
evals/sample: 1
The overarching reason for these performance changes: in 0.6, arithmetic operations between sparse matrices and numbers translate to equivalent dot operations (e.g. 3.0 * A becomes 3.0 .* A), which route through generic sparse broadcast; in 0.5 they instead hit hand-written specializations. Note that the former specializations were extremely simple and not universally correct. For example, the specializations that the multiplication and division examples above hit just applied the corresponding operation to the sparse matrix's nonzero vector, yielding incorrect results in corner cases such as the scalar being Inf or NaN. Given the semantic changes, I'm not certain the performance comparison is particularly meaningful. (That said, some tweaks to generic sparse broadcast could likely narrow the gap.)
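To make the corner case concrete, here is a minimal sketch (naive_scale is a hypothetical name, not the actual 0.5 method) of nonzero-only scaling and where it breaks:

# Scale only the stored nonzeros, as the 0.5 specializations effectively did.
naive_scale(s::Number, A::SparseMatrixCSC) =
    SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), s .* A.nzval)

# For finite s this matches elementwise semantics, since s * 0.0 == 0.0.
# For s = Inf or s = NaN it does not: elementwise, every structural zero
# should become e.g. Inf * 0.0 == NaN, but naive_scale leaves them at 0.0.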
Couldn't we have something like this for multiplication and division?

julia> sp_mul_number(A,n) = SparseMatrixCSC(A.m,A.n,A.colptr,A.rowval,n*A.nzval)
sp_mul_number (generic function with 1 method)
julia> @benchmark 3.26*$A
BenchmarkTools.Trial:
memory estimate: 3.67 KiB
allocs estimate: 7
--------------
minimum time: 551.559 ns (0.00% GC)
median time: 614.414 ns (0.00% GC)
mean time: 685.501 ns (8.68% GC)
maximum time: 3.800 μs (68.65% GC)
--------------
samples: 10000
evals/sample: 186
julia> @benchmark sp_mul_number($A,3.26)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 1
--------------
minimum time: 51.266 ns (0.00% GC)
median time: 51.561 ns (0.00% GC)
mean time: 53.618 ns (2.20% GC)
maximum time: 776.661 ns (83.50% GC)
--------------
samples: 10000
evals/sample: 992
julia> isequal(3.26*A,sp_mul_number(A,3.26))
true
That definition is similar in spirit to the above-mentioned specializations existing in 0.5:

(.*)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval .* B)
(.*)(A::Number, B::SparseMatrixCSC) = SparseMatrixCSC(B.m, B.n, copy(B.colptr), copy(B.rowval), A .* B.nzval)
(./)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval ./ B)

(You need the copy calls on colptr and rowval, though; without them the result shares its index structure with A.)
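A quick check of that aliasing concern against the sp_mul_number sketch above, which passes A's colptr and rowval through uncopied:

julia> B = sp_mul_number(A, 3.26);

julia> B.rowval === A.rowval  # B shares its index structure with A
true

Any structural mutation of B (e.g. a setindex! that inserts a new stored entry) would then corrupt A as well.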
Right, sorry I hadn't understood your reply. What would be the correct way to deal with the number being infinite or NaN (etc.)? What is broadcast now doing? I'm assuming that simply coding in the behaviour for these special cases would not work.
No worries! :)

Writing specializations of one form or another for each such operation is of course possible. But the net result would be a collection of ad hoc methods closely resembling both each other and generic sparse broadcast. For example, improving the mechanism that handles scalars in sparse broadcast should benefit all such operations at once; compare capturing the scalar in a closure with passing it through broadcast:

julia> @benchmark broadcast(x -> 3.0 * x, $A)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 4
--------------
minimum time: 225.231 ns (0.00% GC)
median time: 239.024 ns (0.00% GC)
mean time: 297.230 ns (11.20% GC)
maximum time: 3.230 μs (84.41% GC)
--------------
samples: 10000
evals/sample: 455
julia> @benchmark broadcast(*, 3.0, $A)
BenchmarkTools.Trial:
memory estimate: 944 bytes
allocs estimate: 7
--------------
minimum time: 279.579 ns (0.00% GC)
median time: 295.338 ns (0.00% GC)
mean time: 374.913 ns (11.91% GC)
maximum time: 5.898 μs (85.46% GC)
--------------
samples: 10000
evals/sample: 318
julia> @benchmark broadcast(x -> x / 3.0, $A)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 4
--------------
minimum time: 263.210 ns (0.00% GC)
median time: 280.946 ns (0.00% GC)
mean time: 342.693 ns (9.78% GC)
maximum time: 3.941 μs (74.48% GC)
--------------
samples: 10000
evals/sample: 372
julia> @benchmark broadcast(/, $A, 3.0)
BenchmarkTools.Trial:
memory estimate: 944 bytes
allocs estimate: 7
--------------
minimum time: 300.787 ns (0.00% GC)
median time: 320.508 ns (0.00% GC)
mean time: 400.012 ns (10.91% GC)
maximum time: 6.291 μs (86.89% GC)
--------------
samples: 10000
evals/sample: 258
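In the meantime, the closure form benchmarked above can serve as a user-level workaround (scale is a hypothetical helper name, not an existing function):

# Goes through generic sparse broadcast, so the Inf/NaN corner cases keep
# their elementwise semantics, unlike the 0.5-style nzval-only methods.
scale(A::SparseMatrixCSC, s::Number) = broadcast(x -> s * x, A)

Best!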
@Sacha0 Would it be possible to get back to 0.5 performance if structural zeros didn't follow IEEE semantics in these operations?
@Sacha0 how is scaling the vector of nonzero values different from applying the same operation elementwise? If multiplying by zero, all the nonzero values become 0 and the resulting sparse matrix is empty. If multiplying by infinity, nonzero values become infinite. The only case where I can see a difference is NaN, where the whole matrix should become filled with NaNs. Is that correct?
Not quite: when multiplying by zero, Inf and NaN entries become NaN. Also, when multiplying by Inf, zero becomes NaN, so the whole matrix will be filled with a mixture of Infs and NaNs.
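The underlying scalar arithmetic, for reference:

julia> Inf * 1.0, Inf * 0.0   # nonzeros go to Inf, structural zeros to NaN
(Inf, NaN)

julia> 0.0 * Inf, 0.0 * NaN   # multiplying by zero cannot simply empty the matrix
(NaN, NaN)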
Without also nixing the other semantic changes? My (approximately meaningless) guess is no: outperforming [uniformly scaling a vector] while [doing more than uniformly scaling a vector] seems tricky 😉.

It might be worth noting that, insofar as I am aware, these performance regressions are tightly confined, specifically to operations that have (over-)aggressive specializations in 0.5 (e.g. scalar multiplication and division of sparse matrices).

Additionally, using dot ops directly (rather than via * and /) recovers most of the difference. On 0.5:

julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
time: 161.932 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4

while on 0.6:

julia> median(@benchmark 3.0 .* $A)
BenchmarkTools.TrialEstimate:
time: 189.452 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4
julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
time: 233.724 ns
gctime: 0.000 ns (0.00%)
memory: 688 bytes
allocs: 7

Furthermore, if you make the operation even marginally more complex, 0.6 should win. To illustrate, on 0.5:

julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
time: 336.352 ns
gctime: 0.000 ns (0.00%)
memory: 1.25 KiB
allocs: 8

while on 0.6:

julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
time: 192.012 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4

And there still exist opportunities to whittle away at any such gaps without changing semantics. Best!
This is because of the literal. With a variable, it doesn't matter. I'm not sure why the two-argument broadcast form is slower, though.
Yes, exactly :). Sparse dot ops participate in 0.6's syntactic fusion, so the chained expression executes as a single broadcast. If you could shed light on the closure-associated performance issues, I would much appreciate your insight!
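Schematically, the fusion in question (an illustrative lowering, not the parser's literal output):

# 3.0 .* A .* 3.0 fuses into one broadcast kernel over all arguments:
broadcast((s1, x, s2) -> (s1 * x) * s2, 3.0, A, 3.0)

# 3.0 * A * 3.0, by contrast, performs two separate Number-times-sparse
# products, each allocating an intermediate sparse matrix.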
Though not in the case of *. On 0.5:

julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
memory estimate: 1.38 KiB
allocs estimate: 8
--------------
minimum time: 313.444 ns (0.00% GC)
median time: 339.560 ns (0.00% GC)
mean time: 429.646 ns (14.17% GC)
maximum time: 5.812 μs (89.75% GC)
--------------
samples: 10000
evals/sample: 252
julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
memory estimate: 1.38 KiB
allocs estimate: 8
--------------
minimum time: 319.040 ns (0.00% GC)
median time: 342.202 ns (0.00% GC)
mean time: 432.209 ns (14.11% GC)
maximum time: 6.319 μs (89.28% GC)
--------------
samples: 10000
evals/sample: 247

while on 0.6:

julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 7
--------------
minimum time: 266.410 ns (0.00% GC)
median time: 284.621 ns (0.00% GC)
mean time: 346.397 ns (10.20% GC)
maximum time: 5.074 μs (88.84% GC)
--------------
samples: 10000
evals/sample: 383
julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
memory estimate: 1.28 KiB
allocs estimate: 14
--------------
minimum time: 435.910 ns (0.00% GC)
median time: 467.676 ns (0.00% GC)
mean time: 576.297 ns (11.24% GC)
maximum time: 9.801 μs (80.05% GC)
--------------
samples: 10000
evals/sample: 199
Were the closure mechanism above zero-cost, then the former and latter should be comparable (given the former wraps the scalar into a closure and then calls the latter). Any insight would be much appreciated :).
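For reference, the wrapping being discussed looks roughly like this (a schematic, not the literal Base definition):

# scalar-times-sparse defers to broadcast by capturing the scalar in a closure
Base.:(*)(s::Number, A::SparseMatrixCSC) = broadcast(x -> s * x, A)

# were that capture-and-call free, $s * $A * $s (two such calls) should cost
# about the same as the single fused $s .* $A .* $s above

Best!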
I am assuming that this is related (otherwise I can open a new issue), but it seems like structural zeros are not preserved in some cases, for example in

function scale1!(A,b)
A .= A.*(1.0./b)
end
function scale2!(A,b)
tmp = 1.0./b
A .= A.*tmp
end

julia> A0 = sprandn(100,100000,0.001);
julia> b = randn(100);
julia> A = copy(A0);
julia> @time nnz(scale1!(A,b))
0.433009 seconds (225.66 k allocations: 167.108 MiB, 16.55% gc time)
10000000 #Note full matrix!!!
julia> A = copy(A0);
julia> @benchmark scale1!(A,b)
BenchmarkTools.Trial:
memory estimate: 6.44 KiB
allocs estimate: 11
--------------
minimum time: 55.652 ms (0.00% GC)
median time: 56.000 ms (0.00% GC)
mean time: 56.305 ms (0.00% GC)
maximum time: 61.173 ms (0.00% GC)
--------------
samples: 89
evals/sample: 1
julia> A = copy(A0);
julia> @time nnz(scale2!(A,b))
0.034621 seconds (15.75 k allocations: 729.528 KiB)
10028
julia> A = copy(A0);
julia> @benchmark scale2!(A,b)
BenchmarkTools.Trial:
memory estimate: 8.34 KiB
allocs estimate: 36
--------------
minimum time: 13.837 ms (0.00% GC)
median time: 13.986 ms (0.00% GC)
mean time: 14.674 ms (0.00% GC)
maximum time: 21.433 ms (0.00% GC)
--------------
samples: 341
evals/sample: 1

I was not able to understand why scale1! densifies while scale2! does not. A related problem, maybe even worse (if possible), is that

function scale1!(A,b)
A .*= (1.0./b)
end

will result in a full NaN matrix! Ran on:

Edit: verified on the latest version of julia with the same behavior, although slightly less significant time-wise.
Did you mean A ./ b? For example:

julia> A = sprandn(10, 10, 0.1); nnz(A)
9
julia> b = randn(10);
julia> nnz(A ./ b) # like scale1!
100
julia> nnz(A .* b) # like scale2!
9

Explanation: In essence you've hit the issues I mention in the second section of JuliaLang/julia#18590 (comment) and again in JuliaLang/julia#18590 (comment).
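Incidentally, for the row-scaling pattern in scale1!/scale2!, a workaround that touches only the stored entries looks like this (scalerows! is a hypothetical helper, with deliberately different corner-case semantics than A ./ b):

# Divide row i of A by b[i], visiting only stored nonzeros, so the sparsity
# pattern is preserved and no structural zeros are materialized.
function scalerows!(A::SparseMatrixCSC, b::AbstractVector)
    for j in 1:A.n
        for k in A.colptr[j]:(A.colptr[j+1]-1)
            A.nzval[k] /= b[A.rowval[k]]
        end
    end
    return A
end

# NB: structural zeros stay zero here even when some b[i] is 0, Inf, or NaN,
# where IEEE elementwise semantics would produce NaN.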
Not certain I follow --- could you clarify? :)
Yes, the issue with e.g. A .*= (1.0./b) stems from the same underlying behavior. Best!
Yes, sorry.

Sure, but part of the point was to also show that the expression

Thanks for the explanation, I was able to figure out that this is what's probably being done, but the code is not trivial to read. My understanding is that the whole resulting matrix is stored, including the entries that are numerically zero.
Sure, what I meant is that
Might be related to benchmarking at global scope, which generally is to be avoided.
This understanding is correct :).
The tl;dr is that (1) these methods were primarily designed for combinations involving sparse/structured vectors/matrices and scalars, (2) the mechanisms allowing combinations including dense vectors/matrices were something of an afterthought, and (3) consequently these methods are suboptimal for combinations including dense vectors/matrices. For such combinations, substantial room for improvement exists :).
Benchmarking is tricky :). Note that (1) as used above, A and b are referenced as non-const globals inside @benchmark; interpolating them (e.g. $A, $b) avoids measuring that overhead.
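For instance (an illustrative invocation; setup and evals are standard BenchmarkTools keywords, and each sample gets a fresh copy since scale2! mutates its argument):

julia> @benchmark scale2!(A, $b) setup=(A = copy($A0)) evals=1

Best!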
I see. Hopefully I will be able to find some time to contribute more to julia in the near future; I have several issues related to sparse matrices and factorization that I will try to take a stab at when I find the time, and your explanations are very useful for me to understand the language better.
Yes, I actually took several steps to create more reliable results, which I didn't show. I really think that the actual performance is closer to the one reported by
I did run
Could you elaborate? I did put the main code in functions; I thought this would remove most of the problems.
See (1). Note also that for some reason
That would be great! :)
Ref. the relevant section of the BenchmarkTools.jl manual. Best!
Just got hit by the literal vs. variable discrepancy. This issue, along with JuliaLang/julia#21693, makes me think that for now dot ops should really just throw an error instead of silently giving the wrong result...

julia> X = sprand(10,0.1)
10-element SparseVector{Float64,Int64} with 2 stored entries:
[4 ] = 0.492868
[8 ] = 0.0962979
julia> a=0
0
julia> X./0
10-element SparseVector{Float64,Int64} with 10 stored entries:
[1 ] = NaN
[2 ] = NaN
[3 ] = NaN
[4 ] = Inf
[5 ] = NaN
[6 ] = NaN
[7 ] = NaN
[8 ] = Inf
[9 ] = NaN
[10] = NaN
julia> X./a
10-element SparseVector{Float64,Int64} with 2 stored entries:
[4 ] = Inf
[8 ] = Inf
Argh, actually this is another issue, since the generic sparse broadcast was not done for sparse vectors and scalars?
Generic sparse broadcast does cover sparse vectors; what you hit are leftover specializations that shadow it. That set of methods looks like it should be removed in the interest of correctness. I will eke out time for a pull request removing those methods this weekend if possible. Thanks for catching those stragglers @jebej! :)
Hm, how do you call the generic sparse broadcast with sparse vectors? edit: I can manually call broadcast.
As above (https://github.com/JuliaLang/julia/issues/21515#issuecomment-313869996), those specializations are incorrect and should be removed. Best!
…arse vectors and scalars (#21515). (#22715)
Possibly related to JuliaLang/julia#21370.