Use mul! in Diagonal*Matrix #42321
Conversation
Interesting finding! Is there any reason not to implement this with broadcasting, though? Here are the numbers I'm seeing (*̂ is my local binding of the proposed method):

julia> using BenchmarkTools: @btime
julia> using LinearAlgebra: Diagonal, mul!, similar, promote_op, transpose
julia> foreach((10, 50, 100, 500, 1_000, 5_000, 10_000)) do N
M = N + rand(-5:5)
A = rand(N, M)
Dr = Diagonal(float.(1:M))
println((;N, M))
println("Current algorithm")
o1 = @btime $A * $Dr
println("Proposed algorithm")
o2 = @btime $A *̂ $Dr
println("Broadcasting")
o3 = @btime $A .* transpose($Dr.diag)
println()
@assert o1 ≈ o2 ≈ o3
end
(N = 10, M = 14)
Current algorithm
212.190 ns (3 allocations: 1.31 KiB)
Proposed algorithm
192.421 ns (3 allocations: 1.31 KiB)
Broadcasting
141.649 ns (1 allocation: 1.22 KiB)
(N = 50, M = 48)
Current algorithm
2.612 μs (4 allocations: 18.92 KiB)
Proposed algorithm
1.439 μs (4 allocations: 18.92 KiB)
Broadcasting
1.346 μs (2 allocations: 18.83 KiB)
(N = 100, M = 99)
Current algorithm
8.623 μs (4 allocations: 77.55 KiB)
Proposed algorithm
4.386 μs (4 allocations: 77.55 KiB)
Broadcasting
4.286 μs (2 allocations: 77.45 KiB)
(N = 500, M = 495)
Current algorithm
212.060 μs (4 allocations: 1.89 MiB)
Proposed algorithm
124.020 μs (4 allocations: 1.89 MiB)
Broadcasting
126.190 μs (2 allocations: 1.89 MiB)
(N = 1000, M = 999)
Current algorithm
1.336 ms (4 allocations: 7.62 MiB)
Proposed algorithm
799.228 μs (4 allocations: 7.62 MiB)
Broadcasting
801.688 μs (2 allocations: 7.62 MiB)
(N = 5000, M = 5001)
Current algorithm
64.819 ms (4 allocations: 190.77 MiB)
Proposed algorithm
43.458 ms (4 allocations: 190.77 MiB)
Broadcasting
43.699 ms (2 allocations: 190.77 MiB)
(N = 10000, M = 9997)
Current algorithm
266.747 ms (4 allocations: 762.71 MiB)
Proposed algorithm
160.720 ms (4 allocations: 762.71 MiB)
Broadcasting
159.481 ms (2 allocations: 762.71 MiB)

It seems that just using broadcast is simpler and just as fast or faster.

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
JULIA_NUM_THREADS = 26
Normally, matrix multiplication involves a mixture of broadcast and reduction, but multiplication involving a Diagonal is a pure broadcast. Given how much work was put into our broadcast machinery, we might as well use it.
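For concreteness, here is the broadcast identity being alluded to (a minimal sketch; variable names are mine): left-multiplying by a Diagonal scales rows and right-multiplying scales columns, with no reduction involved.

using LinearAlgebra
A = rand(4, 3)
Dl = Diagonal(rand(4)); Dr = Diagonal(rand(3))
Dl * A ≈ Dl.diag .* A             # row scaling: pure broadcast
A * Dr ≈ A .* transpose(Dr.diag)  # column scaling: pure broadcast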
In fact, there should be no performance difference between 3-arg mul! and the broadcast version.
Hm. Then why is there such a big performance difference here?
The difference holds only for small matrices. It might come from:
julia> f(x) = x .+= false
f (generic function with 1 method)
julia> @code_llvm f([1,2,3])
; @ REPL[40]:1 within `f`
; Function Attrs: uwtable
define nonnull {}* @japi1_f_6357({}* %0, {}** %1, i32 %2) #0 {
top:
%3 = alloca {}**, align 8
store volatile {}** %1, {}*** %3, align 8
%4 = load {}*, {}** %1, align 8
ret {}* %4
}
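In other words, a β::Bool branch like the following (a hedged sketch, not the stdlib code) lets the compiler drop the accumulator term entirely when β == false, exactly as the LLVM output above shows for x .+= false:

function scale_store!(C, X, α, β)
    if iszero(β)
        @. C = X * α          # old contents of C are never read
    else
        @. C = X * α + C * β
    end
    return C
end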
Hm. That eager addition of false does seem to be compiled away. In the typed code for the kernel, though, I see:
│ %308 = Base.mul_float(%304, %307)::Float64
│ %309 = Base.copysign_float(0.0, %308)::Float64
│ %310 = Base.ifelse(true, %308, %309)::Float64
│ %311 = Base.add_float(1.0, %310)::Float64
│ %312 = Base.ifelse(false, %311, %310)::Float64

BTW, the ifelse conditions here are constants, so these branches should fold away in LLVM.
The difference for small matrices seems to be entirely due to permutedims:

julia> A = ones(10, 14); DR = Diagonal(ones(size(A, 2)));
julia> @btime permutedims($DR.diag);
98.516 ns (2 allocations: 96 bytes)
julia> @btime transpose($DR.diag);
2.674 ns (0 allocations: 0 bytes)
julia> @btime $A * $DR;
439.354 ns (3 allocations: 1.31 KiB)
julia> @btime $A .* transpose($DR.diag);
344.735 ns (1 allocation: 1.22 KiB)
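The gap is unsurprising: transpose of a vector is a lazy wrapper, while permutedims materializes a new matrix. A quick illustration (values are equal either way):

v = rand(1000)
tv = transpose(v)    # lazy Transpose wrapper, no copy
pv = permutedims(v)  # 1×1000 Matrix, allocates and copies
tv == pv             # true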
julia> @which mul!(zeros(100,100), Diagonal(1:100), Symmetric(randn(100,100)), true, false)
mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\matmul.jl:301
julia> @which mul!(zeros(100,100), Diagonal(1:100), randn(100,100), true, false)
mul!(out::AbstractMatrix, A::Diagonal, in::StridedMatrix{T} where T, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\diagonal.jl:370
Would it make sense to improve permutedims, or to switch to the lazy transpose?
Both seem OK to me.
No, we use permutedims because transpose is recursive.
Non-recursive lazy transpose sounds very nice to have!
That shouldn’t be a problem if the method is restricted to operating on Numbers. I guess if someone made some wacky number type that could be transposed into something different, this would break, but that seems far-fetched.
We could check if the eltype of the diagonal is a Number and only use the lazy transpose in that case.
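Something like this sketch, say (helper name hypothetical, not part of the PR): take the free lazy transpose for plain Numbers and fall back to the non-recursive permutedims otherwise.

_diag_as_row(d::AbstractVector{<:Number}) = transpose(d)  # lazy, no copy, safe for Numbers
_diag_as_row(d::AbstractVector) = permutedims(d)          # materializes, but non-recursive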
Since Triangular * Diagonal is still triangular, we could exploit the structure like this:

@inline function _copy_diag!(data, diag)
for i in 1:length(diag)
data[i,i] = diag[i]
end
data
end
for Tri in (:UpperTriangular, :LowerTriangular)
UTri = Symbol(:Unit, Tri)
@eval (*)(A::$Tri, D::Diagonal) = $Tri(A.data * D)
@eval (*)(A::$UTri, D::Diagonal) = $Tri(_copy_diag!(A.data * D, D.diag))
@eval (*)(D::Diagonal, A::$Tri) = $Tri(D * A.data)
@eval (*)(D::Diagonal, A::$UTri) = $Tri(_copy_diag!(D * A.data, D.diag))
end

Some benchmarks (with this PR):

julia> A = LowerTriangular(randn(1000,1000));
julia> @btime $A * $Diagonal(ones(1000)); # before
2.215 ms (5 allocations: 7.64 MiB)
julia> Revise.track(LinearAlgebra)
julia> @btime $A * $Diagonal(ones(1000)); # after
1.208 ms (5 allocations: 7.64 MiB)
BTW, I found that the above pattern also works for lmul!/rmul! and mul!:

for Tri in (:UpperTriangular, :LowerTriangular)
UTri = Symbol(:Unit, Tri)
for fun in (:*, :rmul!)
@eval $fun(A::$Tri, D::Diagonal) = $Tri($fun(A.data, D))
@eval $fun(A::$UTri, D::Diagonal) = $Tri(_copy_diag!($fun(A.data, D), D.diag))
end
for fun in (:*, :lmul!)
@eval $fun(D::Diagonal, A::$Tri) = $Tri($fun(D, A.data))
@eval $fun(D::Diagonal, A::$UTri) = $Tri(_copy_diag!($fun(D, A.data), D.diag))
end
@eval mul!(out::$Tri, D::Diagonal, A::$Tri) = $Tri(mul!(out.data, D, A.data))
@eval mul!(out::$Tri, D::Diagonal, A::$UTri) = $Tri(_copy_diag!(mul!(out.data, D, A.data), D.diag))
@eval mul!(out::$Tri, A::$Tri, D::Diagonal) = $Tri(mul!(out.data, A.data, D))
@eval mul!(out::$Tri, A::$UTri, D::Diagonal) = $Tri(_copy_diag!(mul!(out.data, A.data, D), D.diag))
end

The above code supports separate optimization for julia/stdlib/LinearAlgebra/src/diagonal.jl, line 271 (at 7ce858c).
The corresponding lmul! has been missing for years, at least since the time this file was created.
@dkarrasch Should this be included in a separate PR?
Perhaps this should be specialized to cases where mul! is actually faster. Currently on master:

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A * $D;
8.931 μs (3 allocations: 12.72 KiB)
julia> @btime $A.data * $D;
2.308 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
61.452 μs (7 allocations: 12.95 KiB)

With this PR, we're hitting that slow mul! path.

Edit: Oops, this benchmark was on Julia 1.6 and not master. On master I obtain:

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A * $D;
3.408 μs (3 allocations: 12.72 KiB)
julia> @btime $A.data * $D;
2.331 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
3.239 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, convert(AbstractArray{eltype($A)}, $D));
1.010 μs (4 allocations: 13.11 KiB)
julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(Int, size(A,2)));
julia> @btime $A * $D;
2.985 μs (3 allocations: 12.72 KiB)
julia> @btime $A.data * $D;
2.396 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
918.500 ns (3 allocations: 12.72 KiB)
julia> A = LowerTriangular(reshape(Float64.(1:1600), 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A * $D;
2.913 μs (3 allocations: 12.72 KiB)
julia> @btime $A.data * $D;
1.998 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
791.733 ns (3 allocations: 12.72 KiB)

So mul! is fast once the eltypes match; the slowdown above comes from the eltype mismatch between A and D.
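A hedged sketch of such a specialization (function name hypothetical): promote the Diagonal to the product eltype up front so mul! dispatches to the fast path.

using LinearAlgebra
function promoted_rmul(A::AbstractMatrix, D::Diagonal)
    T = Base.promote_op(*, eltype(A), eltype(D))
    # convert D so source and destination eltypes agree
    mul!(similar(A, T, size(A)), A, convert(AbstractMatrix{T}, D))
end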
If I have not made a mistake, the remaining difference is due to the destination eltype. BTW, for your example:

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A.data * $D;
2.331 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
3.239 μs (3 allocations: 12.72 KiB)

Their return types are not the same: * promotes to Float64, while similar(A.data, size(A.data)) keeps the Int eltype. Compare:

julia> @btime $A.data * $D;
1.950 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
2.422 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, Float64, size($A.data)), $A.data, $D);
1.250 μs (3 allocations: 12.72 KiB)
I don't follow why we need this strange construction.
@dkarrasch @N5N3 could you take a look now? I have incorporated the changes suggested. I have added some extra methods for the size check and the actual multiplication to avoid replicating code, hope this is fine. I've also fixed the five-term mul! for matrix-valued elements. On master:

julia> A = reshape([[1 2; 3 4], zeros(Int,2,2), zeros(Int, 2, 2), [5 6; 7 8]], 2, 2)
2×2 Matrix{Matrix{Int64}}:
[1 2; 3 4] [0 0; 0 0]
[0 0; 0 0] [5 6; 7 8]
julia> D = Diagonal(A)
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
[1 2; 3 4] ⋅
⋅ [5 6; 7 8]
julia> A * D
2×2 Matrix{Matrix{Int64}}:
[7 10; 15 22] [0 0; 0 0]
[0 0; 0 0] [67 78; 91 106]
julia> mul!(similar(A), A, D)
ERROR: MethodError: no method matching +(::Matrix{Int64}, ::Bool)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

This PR:

julia> mul!(similar(A), A, D)
2×2 Matrix{Matrix{Int64}}:
[7 10; 15 22] [0 0; 0 0]
[0 0; 0 0] [67 78; 91 106]
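My guess at the failure mode, judging from the MethodError above (a sketch, not the actual stdlib code): the old kernel applied the β term unconditionally, eventually forming a Matrix-plus-Bool expression; guarding on iszero(β) sidesteps it for matrix-valued elements.

# The missing method the old path tripped over:
# [1 2; 3 4] + false   # MethodError: no method matching +(::Matrix{Int64}, ::Bool)
# A guarded helper avoids forming that expression (name hypothetical):
muladd_guarded(x, acc, β) = iszero(β) ? x : x + acc * β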
Just some recommendations:
- There seems no reason to retain the l/rmul! in L245, L255. We can make them call 3-arg mul! (to reuse the input check).
- It looks good to have an optimized mul! for Diagonal outputs. (L374 is suboptimal in this case.) With it, we can omit L327 and L328.
- Add require_one_based_indexing in the size check function?
- Edited: L339 seems suboptimal for eltype <: Matrix, as mul!(A,D,B) falls back to A .= (D.diag .* B) .* true, which allocates twice unnecessarily in this case.
No, it doesn't. Broadcasting will fuse all operations, including writing into the destination.
I agree with this one.
I agree with this one, as well.
This is unnecessary. One-based index checking is only needed when we index into an array in a loop whose loop variable runs over a hard-coded 1:n range.
Up to the obsolete multiplication with 1, there is no issue with memory allocation. I don't think we want to introduce more runtime checks here.
It has nothing to do with that method. It is using broadcasting, which internally writes out the loops and writes into the destination array within that loop. So that operation is in-place, as you can check with BenchmarkTools.
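Indeed, a quick check of the fusion claim for Number elements (illustrative; exact timings are machine-dependent):

using BenchmarkTools, LinearAlgebra
B = zeros(100, 100); D = Diagonal(ones(100)); A = rand(100, 100)
@btime $B .= $D.diag .* $A .* true;  # expect 0 allocations: fused and in-place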
For Number elements I agree with you. But for Matrix elements:

julia> a = [randn(3,3) for i in 1:10]; b = [randn(3,3) for i in 1:10];
julia> @btime $a .* $b;
507.216 ns (11 allocations: 1.38 KiB)
julia> @btime $a .* $b .* 1;
895.122 ns (22 allocations: 2.67 KiB)
julia> @btime @. $a * $b * 1;
515.544 ns (11 allocations: 1.38 KiB)

I just noticed that we won't call the special multiplication that elides the literal true here anyway.
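The reason the Matrix-element case differs: for Numbers the compiler can fold * true away inside the fused broadcast, but each Matrix * true is an ordinary array–scalar product that allocates a fresh matrix per element.

x = ones(3, 3)
x * true    # allocates a new 3×3 Matrix; not free
1.0 * true  # plain Number arithmetic; folds away inside a fused broadcast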
Are you sure? IIRC, that special multiplication recognizes a constant true and drops it.
julia> g(a, b) = a .* b .* true
g (generic function with 1 method)
julia> @btime g($a,$b);
880.000 ns (21 allocations: 2.62 KiB)
julia> g2(a, b) = a .* b
g2 (generic function with 1 method)
julia> @btime g2($a,$b);
509.896 ns (11 allocations: 1.38 KiB)

I think the above example is persuasive enough.
julia> a = randn(10,10);
julia> f(a,b) = a * b * true
f (generic function with 1 method)
julia> @btime f($a,$a);
300.000 ns (1 allocation: 896 bytes)
julia> @btime f($a,$1);
206.667 ns (2 allocations: 1.75 KiB)
julia> @btime f($1,$a);
204.071 ns (2 allocations: 1.75 KiB)

So should we also add something like:

*(a::AbstractArray{<:Number},b::Number,c::Number) = b * c * a
*(a::Number,b::AbstractArray{<:Number},c::Number) = a * c * b

(I'm not sure whether changing the order of multiplication is safe in general.)
Better not to change the order of multiplication; use broadcast instead:

*(a::AbstractArray{<:Number},b::Number,c::Number) = a .* b .* c
*(a::Number,b::AbstractArray{<:Number},c::Number) = a .* b .* c

I'm not sure how sustainable such definitions are, though. These would help with the array-of-arrays case, but what about the array of arrays of arrays? Feels a bit weird to mess with such a basic operation.
If we add an isone branch to the *ₛ broadcast:

Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
iszero(beta::Number) ? false :
isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)

we may get rid of the extra allocation:

julia> D
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
[7 15; 10 22] ⋅
⋅ [67 91; 78 106]
julia> A = collect(D)
2×2 Matrix{Matrix{Int64}}:
[7 15; 10 22] [0 0; 0 0]
[0 0; 0 0] [67 91; 78 106]
julia> B = similar(A);
julia> @btime $B .= $D.diag .* $A;
323.026 ns (4 allocations: 384 bytes)
julia> @btime mul!($B, $D, $A, true, false);
336.775 ns (4 allocations: 384 bytes)
I've also specialized things so that one and zero work for Diagonals with matrix-valued (even nested Diagonal) elements:

julia> D = Diagonal([collect(reshape(1:4, 2, 2)), collect(reshape(5:8, 2, 2))])
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
[1 3; 2 4] ⋅
⋅ [5 7; 6 8]
julia> D2 = Diagonal([D, D]);
julia> D2 + zero(D2) == D2
true
julia> D2 * one(D2) == D2
true
I think the behaviour change of *ₛ is fine. BTW, if someone dislikes such a modification, maybe we can use:

@noinline ___muldiag!(out,A,B,f::MulAddMul{<:Any,true}) = @. out = f(A * B) # out might be undef
@noinline ___muldiag!(out,A,B,f::MulAddMul) = @. out = f(A * B,out)
@inline __muldiag!(out, D::Diagonal, B, f) = ___muldiag!(out, D.diag, B, f)
@inline __muldiag!(out, A, D::Diagonal, f) = ___muldiag!(out, A, permutedims(D.diag), f)
@inline function _muldiag!(out, A, B, alpha, beta)
_muldiag_size_check(out, A, B)
iszero(alpha) && return _rmul_or_fill!(out, beta)
__muldiag!(out, A, B, MulAddMul(alpha, beta))
end

On my desktop this dynamic dispatch takes about 20 ns.
@dkarrasch I've made most of the changes suggested, could you take a look at this?
@dkarrasch @jishnub I am observing a pretty substantial slowdown on the LinearAlgebra tests due to this commit. Checking out the commit just before it restores the earlier test times.
Just a quick glance.
Before:

Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
    iszero(beta::Number) ? false : broadcasted(*, out, beta)

After:

Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
    iszero(beta::Number) ? false :
    isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)
This change might introduce more type instability? Usage in this PR should be union-splittable, and the branch should be eliminated in 3-arg mul! and l/rmul!. But it might make a mess elsewhere?
rmul!(A::AbstractMatrix, D::Diagonal) = mul!(A, A, D)
lmul!(D::Diagonal, B::AbstractVecOrMat) = mul!(B, D, B)
rmul! and lmul! won't be vectorized on master, since each is a self-in-place broadcast (4x slower). Adding @inline is useless here. Without a fix like #43185, the only solution is recoding with a for-loop. But this should not be caught by CI, as the test sizes are usually small.
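For reference, a for-loop recoding along those lines (a sketch assuming 1-based strided inputs, not the PR code):

using LinearAlgebra
function rmul_loop!(A::AbstractMatrix, D::Diagonal)
    d = D.diag
    @inbounds for j in axes(A, 2)
        dj = d[j]
        @simd for i in axes(A, 1)
            A[i, j] *= dj   # no unaliasing machinery in the way, so this can vectorize
        end
    end
    return A
end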
Thanks for taking a look, @N5N3! I vaguely remember that you proposed the fix of inlining the rhs in a similar situation. Why doesn't that help here?
IIUC, our broadcast unaliases the src from the dest to avoid memory contention, which prevents LLVM from proving dest === src. Our for-loop-based implementation doesn't include such an unaliasing stage, thus @inline is enough to enable SIMD.
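The unaliasing stage is observable directly (Base internals, shown here only for illustration):

A = ones(4)
Base.mightalias(A, A)      # true: src and dest share memory
Base.unalias(A, A) === A   # false: broadcast would run on a defensive copy of src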
This provides a performance boost for matrices of numbers; see the Discourse post for benchmark results on master vs. after this PR.