Use mul! in Diagonal*Matrix #42321

jishnub · 2021-09-20T18:06:33Z

This provides a performance boost for matrices of numbers. See the discourse post

On master

julia> A = ones(3000, 2000); DR = Diagonal(ones(size(A, 2))); DL = Diagonal(ones(size(A,1)));

julia> @btime $A * $DR;
  41.844 ms (4 allocations: 45.78 MiB)

julia> @btime $DL * $A;
  42.521 ms (2 allocations: 45.78 MiB)

After this PR

julia> @btime $A * $DR;
  28.670 ms (4 allocations: 45.78 MiB)

julia> @btime $DL * $A;
  29.377 ms (2 allocations: 45.78 MiB)

MasonProtter · 2021-09-20T23:48:33Z

Interesting finding! Is there any reason to not implement this with broadcasting though? Here are the numbers I'm seeing

julia> using BenchmarkTools: @btime

julia> using LinearAlgebra: Diagonal, mul!, similar, promote_op, transpose

julia> foreach((10, 50, 100, 500, 1_000, 5_000, 10_000)) do N
           M = N + rand(-5:5)
           A = rand(N, M)
           Dr = Diagonal(float.(1:M))

           println((;N, M))
           println("Current algorithm")
           o1 = @btime $A * $Dr
           println("Proposed algorithm")
           o2 = @btime $A *̂ $Dr
           println("Broadcasting")
           o3 = @btime $A .* transpose($Dr.diag)
           println()

           @assert o1 ≈ o2 ≈ o3
       end
(N = 10, M = 14)
Current algorithm
  212.190 ns (3 allocations: 1.31 KiB)
Proposed algorithm
  192.421 ns (3 allocations: 1.31 KiB)
Broadcasting
  141.649 ns (1 allocation: 1.22 KiB)

(N = 50, M = 48)
Current algorithm
  2.612 μs (4 allocations: 18.92 KiB)
Proposed algorithm
  1.439 μs (4 allocations: 18.92 KiB)
Broadcasting
  1.346 μs (2 allocations: 18.83 KiB)

(N = 100, M = 99)
Current algorithm
  8.623 μs (4 allocations: 77.55 KiB)
Proposed algorithm
  4.386 μs (4 allocations: 77.55 KiB)
Broadcasting
  4.286 μs (2 allocations: 77.45 KiB)

(N = 500, M = 495)
Current algorithm
  212.060 μs (4 allocations: 1.89 MiB)
Proposed algorithm
  124.020 μs (4 allocations: 1.89 MiB)
Broadcasting
  126.190 μs (2 allocations: 1.89 MiB)

(N = 1000, M = 999)
Current algorithm
  1.336 ms (4 allocations: 7.62 MiB)
Proposed algorithm
  799.228 μs (4 allocations: 7.62 MiB)
Broadcasting
  801.688 μs (2 allocations: 7.62 MiB)

(N = 5000, M = 5001)
Current algorithm
  64.819 ms (4 allocations: 190.77 MiB)
Proposed algorithm
  43.458 ms (4 allocations: 190.77 MiB)
Broadcasting
  43.699 ms (2 allocations: 190.77 MiB)

(N = 10000, M = 9997)
Current algorithm
  266.747 ms (4 allocations: 762.71 MiB)
Proposed algorithm
  160.720 ms (4 allocations: 762.71 MiB)
Broadcasting
  159.481 ms (2 allocations: 762.71 MiB)

It seems that just using broadcast is simpler and just as fast or faster.

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 26

MasonProtter · 2021-09-20T23:53:11Z

Normally, matrix multiplication involves a mixture of broadcast and reduction, but involving a Diagonal explicitly removes the reduction part and you're left with just an elementwise operation.

Given how much work was put into our broadcast machinery, we might as well use it.

N5N3 · 2021-09-21T04:25:05Z

In fact mul!()/rmul!()/lmul!() for Diagonal is implemented via broadcast.

There should be no performance difference between 3-arg mul!() and manual broadcast, as the alpha and beta are propagated as const.

MasonProtter · 2021-09-21T04:36:06Z

Hm. Then why is there such a big performance difference here?

N5N3 · 2021-09-21T04:57:44Z

The difference holds only for small martix. It might come from:

You manual broadcast use lazy transpose, while mul! use premutedims, which is create a new Matrix.
~~mul!(z,x,y::Diagnal) calculate z .= x .* permutedims(y.diag) .+ false. Maybe the .+ was not omited by the optimizer?~~
Should not happend, as

julia> f(x) = x .+= false
f (generic function with 1 method)

julia> @code_llvm f([1,2,3])
;  @ REPL[40]:1 within `f`
; Function Attrs: uwtable
define nonnull {}* @japi1_f_6357({}* %0, {}** %1, i32 %2) #0 {
top:
  %3 = alloca {}**, align 8
  store volatile {}** %1, {}*** %3, align 8
  %4 = load {}*, {}** %1, align 8
  ret {}* %4
}

MasonProtter · 2021-09-21T05:12:44Z

Hm. That eager permutedims Should probably be revisited. Seems like a waste of allocations that was done in the past because we didn’t handle structs containing mutable data well.

N5N3 · 2021-09-21T05:50:45Z

@code_typed shows that their loop body have almost the same IR, except some redundant code in mul!, which should be dropped by LLVM and does no harm to performance:

│     %308 = Base.mul_float(%304, %307)::Float64
│     %309 = Base.copysign_float(0.0, %308)::Float64
│     %310 = Base.ifelse(true, %308, %309)::Float64
│     %311 = Base.add_float(1.0, %310)::Float64
│     %312 = Base.ifelse(false, %311, %310)::Float64

BTW, mul! seems to perform more checks before the loop kernal, maybe this also contributes to the speed gap for small matrix.

jishnub · 2021-09-21T05:55:16Z

The difference for small matrices seems to be entirely due to permutedims being an array reshape that allocates:

julia> A = ones(10, 14); DR = Diagonal(ones(size(A, 2)));

julia> @btime permutedims($DR.diag);
  98.516 ns (2 allocations: 96 bytes)

julia> @btime transpose($DR.diag);
  2.674 ns (0 allocations: 0 bytes)

julia> @btime $A * $DR;
  439.354 ns (3 allocations: 1.31 KiB)

julia> @btime $A .* transpose($DR.diag);
  344.735 ns (1 allocation: 1.22 KiB)

N5N3 · 2021-09-21T07:53:04Z

It seems mul!() is done via broadcast only when the non-Diagonal input has strided layout.
For example:

julia> @which mul!(zeros(100,100), Diagonal(1:100), Symmetric(randn(100,100)), true, false)
mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\matmul.jl:301
julia> @which mul!(zeros(100,100), Diagonal(1:100), randn(100,100), true, false)
mul!(out::AbstractMatrix, A::Diagonal, in::StridedMatrix{T} where T, alpha::Number, beta::Number) in LinearAlgebra at C:\Users\MYJ\AppData\Local\Programs\Julia-1.7.0-beta4\share\julia\stdlib\v1.7\LinearAlgebra\src\diagonal.jl:370

Such restrictions might be added to avoid random memory access?
Anyway, the 5-arg mul! should be optimized before or in this PR.
If most Base matrix type is faster with broadcast rather than l/rmul!() + copyto!(), I think it's fine to extend StrideMatrix to AbstractMatrix.
Edit: on master it has been relaxed with #41188.

jishnub · 2021-09-21T09:21:39Z

Would it make sense to improve mul! in a separate PR? Changing StrideMatrix to AbstractMatrix might need to be tested thoroughly.

N5N3 · 2021-09-21T11:20:05Z

Both seems OK to me.
At present, in test.jl, lmul!, rmul! and mul! appears 11, 10, and 13 times, seperately.
While * appears 170 times.
Even if we assume that half of them are the reference for comparison, * is tested much more thoroughly.
Without this PR, we can say that lmul! and rmul! are well tested via *.
After this PR, things turns to be mul! is tested via *, while lmul! and rmul! not.
So it might be ok to do this change in this PR?

stdlib/LinearAlgebra/src/diagonal.jl

dkarrasch · 2021-09-21T17:32:41Z

Hm. That eager permutedims Should probably be revisited. Seems like a waste of allocations that was done in the past because we didn’t handle structs containing mutable data well.

No, we use permutedims because transpose operates recursively, but all we want is to "reshape" the vector into a row vector for broadcasting. If there is another (lazy) way of telling broadcasting that the vector should be interpreted as 1xn, then we can go with that. Otherwise, products with matrices of matrices will just go wrong.

MasonProtter · 2021-09-21T17:34:07Z

Non-recursive lazy transpose sounds very nice to have!

MasonProtter · 2021-09-21T17:37:00Z

No, we use permutedims because transpose operates recursively, but all we want is to "reshape" the vector into a row vector for broadcasting.

That shouldn’t be a problem if the method is restricted to operating on Diagonal{T, Vector{T}} where T <: Number though right?

I guess if someone made some wacky number type that could be transpose’d that could be a problem?

dkarrasch · 2021-09-21T17:38:35Z

We could check if the eltype of D is a Number, and in that case use transpose in the broadcasting statement... or, as you say, use multiple dispatch.

N5N3 · 2021-09-22T01:30:43Z

Since (*)(A::AbstractMatrix, D::Diagonal) is accelerated,
Maybe we can also optimize (*)(A::AbstractTriangular, D::Diagonal) with that in this PR.
For example:

@inline function _copy_diag!(data, diag)
    for i in 1:length(diag)
        data[i,i] = diag[i]
    end
    data
end
for Tri in (:UpperTriangular, :LowerTriangular)
    UTri = Symbol(:Unit, Tri)
    @eval (*)(A::$Tri, D::Diagonal) = $Tri(A.data * D)
    @eval (*)(A::$UTri, D::Diagonal) = $Tri(_copy_diag!(A.data * D, D.diag))
    @eval (*)(D::Diagonal, A::$Tri) = $Tri(D * A.data)
    @eval (*)(D::Diagonal, A::$UTri) = $Tri(_copy_diag!(D * A.data, D.diag))
end

Some Benchmark (With This PR)

julia> A = LowerTriangular(randn(1000,1000));

julia> @btime $A * $Diagonal(ones(1000)); # before
  2.215 ms (5 allocations: 7.64 MiB)

julia> Revise.track(LinearAlgebra)

julia> @btime $A * $Diagonal(ones(1000)); # after
  1.208 ms (5 allocations: 7.64 MiB)

N5N3 · 2021-09-22T04:51:20Z

BTW, I found that the above pattern also works for (l/r)mul!.
We can put all these definitions together to simplify code:

for Tri in (:UpperTriangular, :LowerTriangular)
    UTri = Symbol(:Unit, Tri)
    for fun in (:*, :rmul!)
        @eval $fun(A::$Tri, D::Diagonal) = $Tri($fun(A.data, D))
        @eval $fun(A::$UTri, D::Diagonal) = $Tri(_copy_diag!($fun(A.data, D), D.diag))
    end
    for fun in (:*, :lmul!)
        @eval $fun(D::Diagonal, A::$Tri) = $Tri($fun(D, A.data))
        @eval $fun(D::Diagonal, A::$UTri) = $Tri(_copy_diag!($fun(D, A.data), D.diag))
    end
    @eval mul!(out::$Tri, D::Diagonal, A::$Tri) = $Tri(mul!(out.data, D, A.Data))
    @eval mul!(out::$Tri, D::Diagonal, A::$UTri) = $Tri(_copy_diag!(mul!(out.data, D, A.Data), D.diag))
    @eval mul!(out::$Tri, A::$Tri, D::Diagonal) = $Tri(mul!(out.data, A.Data, D))
    @eval mul!(out::$Tri, A::$UTri, D::Diagonal) = $Tri(_copy_diag!(mul!(out.data, A.Data, D), D.diag))
end

The above code support seperate optimization for *() and (r/l)mul!.
And it's more easy to cover all cases, as at present only rmul! is optimized for Union{LowerTriangular,UpperTriangular}

julia/stdlib/LinearAlgebra/src/diagonal.jl

Line 271 in 7ce858c

    
           rmul!(A::Union{LowerTriangular,UpperTriangular}, D::Diagonal) = typeof(A)(rmul!(A.data, D))

Corresponding lmul! has been missing for years, at least since the time this file was created.

@dkarrasch Should this be included in a seperate PR?

jishnub · 2021-09-22T06:10:40Z

Perhaps this should be specialized to cases where A.data * D is fast? This shouldn't hit slow AbstractArray multiplication methods. The present methods that rely on a copy get around these methods for the cases where A.data is not a Matrix. Even the method added in this PR might need to be restricted (perhaps to StridedMatrix?).

Currently on master

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  8.931 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.308 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  61.452 μs (7 allocations: 12.95 KiB)

With this PR, we're hitting mul! in A.data * D, which will be slow for AbstractArrays in general.

Edit: Oops, this benchmark was on Julia 1.6 and not master. On master I obtain

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  3.408 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.331 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  3.239 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, convert(AbstractArray{eltype($A)}, $D));
  1.010 μs (4 allocations: 13.11 KiB)

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(Int, size(A,2)));

julia> @btime $A * $D;
  2.985 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  2.396 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  918.500 ns (3 allocations: 12.72 KiB)

julia> A = LowerTriangular(reshape(Float64.(1:1600), 40, 40)); D = Diagonal(ones(size(A,2)));

julia> @btime $A * $D;
  2.913 μs (3 allocations: 12.72 KiB)

julia> @btime $A.data * $D;
  1.998 μs (3 allocations: 12.72 KiB)

julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  791.733 ns (3 allocations: 12.72 KiB)

So mul! seems superior if the eltypes match, and converting the eltype of the Diagonal seems to be a good strategy.

N5N3 · 2021-09-22T08:55:35Z

If I have not made a mistake:
The difference between mul! and r/lmul! based *(A::AbstractMatrix,D::Diagnal) is:

r/lmul! --> copyto!(similar(A, Tdest, size(A)), A) .*= D.diag
mul! --> similar(A, Tdest, size(A)) .= A .* D.diag

So r/lmul! pre-promote A if needed, while mul! not. (Both of them promote D during calculation)
Since all elements in A is accessed only once, should pre-promotion offer any performance gain?
(Promotion is vectorizable In most cases.)

BTW, for your example

julia> A = LowerTriangular(reshape(1:1600, 40, 40)); D = Diagonal(ones(size(A,2)));
julia> @btime $A.data * $D;
  2.331 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  3.239 μs (3 allocations: 12.72 KiB)

Their return type are not the same: A.data * D returns Matrix{Float64}, while mul! version returns Matrix{Int64}. So the "regression" comes from extra trunction.
On my PC, mul! is faster for Matrix{Float64}

julia> @btime $A.data * $D;
  1.950 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, size($A.data)), $A.data, $D);
  2.422 μs (3 allocations: 12.72 KiB)
julia> @btime mul!(similar($A.data, Float64, size($A.data)), $A.data, $D);
  1.250 μs (3 allocations: 12.72 KiB)

dkarrasch · 2021-09-22T13:11:41Z

I don't follow why we need this strange Numbers distinction. I think the main difference between mul! and [r/l]mul!-based implementations is that in the first case we only allocate a similar array and write the result into it, whereas in the latter case we allocate a similar matrix, then fill it with values of A and then pass over it again to multiply each value. So, the only difference is whether we fill the destination array with the (promoted) values of A and then modify inplace or whether we write into an undef destination array. But why the distinction with Numbers? Why not just the size checks, the allocation of the destination array and then mul! into it?

jishnub · 2021-09-27T05:13:07Z

@dkarrasch @N5N3 could you take a look now? I have incorporated the changes suggested. I have added some extra methods for size check and the actual multiplication to avoid replicating code, hope this is fine.

I've fixed the five-term mul! by adding a branch instead of adding extra three-term mul! methods. This fixes the bug on master as well:

on master

julia> A = reshape([[1 2; 3 4], zeros(Int,2,2), zeros(Int, 2, 2), [5 6; 7 8]], 2, 2)
2×2 Matrix{Matrix{Int64}}:
 [1 2; 3 4]  [0 0; 0 0]
 [0 0; 0 0]  [5 6; 7 8]

julia> D = Diagonal(A)
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [1 2; 3 4]      ⋅     
     ⋅       [5 6; 7 8]

julia> A * D
2×2 Matrix{Matrix{Int64}}:
 [7 10; 15 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 78; 91 106]

julia> mul!(similar(A), A, D)
ERROR: MethodError: no method matching +(::Matrix{Int64}, ::Bool)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

This PR

julia> mul!(similar(A), A, D)
2×2 Matrix{Matrix{Int64}}:
 [7 10; 15 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 78; 91 106]

N5N3

Just some recommendations:

There seems so reason to retain the l\rmul! in L245, L255. We can make them call 3-arg mul! (to reuse the input check).
It looks good to have an optimized mul! for Diagonal outputs. (L374 is suboptimal in this case). With it, we can omit L327 and L328.
~~Add require_one_based_indexing in the size check function?~~
Edited: L339 seems suboptimal for eltype <: matrix. As mul!(A,D,B) fallbacks to A .= (D.diag .* B) .* true, which allocates twice unnecessarily in this case.

dkarrasch · 2021-09-27T11:03:38Z

4\. As 3-arg `mul!(A,D,B)` calls `A .= D.diag .* B . * 1`, which allocates twice unnecessarily in this case.

No, it doesn't. Broadcasting will fuse all operations, including writing into A.

N5N3 · 2021-09-27T11:11:30Z

~~My fault, didn't notice that we have optimize *(::Matrix,::Matrix,::Number) via mat_mat_scalar~~

dkarrasch · 2021-09-27T11:15:11Z

Just some recommendations:

1. There seems so reason to retain the `l\rmul!` in L245, L255. We can make them call 3-arg `mul!` (to reuse the input check).

I agree with this one.

2. It looks good to have an optimized `mul!` for `Diagonal` outputs. (L374 is suboptimal in this case). With it, we can omit L327 and L328.

I agree with this one, as well.

3. Add `require_one_based_indexing` in the size check function?

This is unnecessary. One-based index checking is only necessary when we index into an array in a loop that has a loop variable i in 1:n. Otherwise, there is no need for such a check.

4. L339 seems suboptimal for `eltype <:  matrix`. As 3-arg `mul!(A,D,B)` calls `A .= D.diag .* B . * 1`, which allocates twice unnecessarily in this case.

Up to the obsolete multiplication with 1 there is no issue with memory allocation. I don't think we want to introduce more runtime checks here.

dkarrasch · 2021-09-27T11:16:29Z

My fault, didn't notice that we have optimize *(::Matrix,::Matrix,::Number) via mat_mat_scalar

It has nothing to do with that method. It is using broadcasting, which internally writes out the loops and writes into the destination array within that loop. So that operation is in-place, as you can check with BenchmarkTools.

N5N3 · 2021-09-27T11:23:52Z

For Number element I aggree with you. But for Matrix element:

julia> a = [randn(3,3) for i in 1:10]; b = [randn(3,3) for i in 1:10];
julia> @btime $a .* $b;
  507.216 ns (11 allocations: 1.38 KiB)
julia> @btime $a .* $b .* 1;
  895.122 ns (22 allocations: 2.67 KiB)
julia> @btime @. $a * $b * 1;
  515.544 ns (11 allocations: 1.38 KiB)

I just notice that we won't call *(::Matrix,::Matrix,::Number) since we use .*s. So the allocation does happen twice.

dkarrasch · 2021-09-27T11:27:45Z

So the allocation does happen twice.

Are you sure? IIRC, that special multiplication recognizes constant trues and avoids performung multiplication by true = 1.

N5N3 · 2021-09-27T11:35:01Z

julia> g(a, b) = a .* b .* true
g (generic function with 1 method)
julia> @btime g($a,$b);
  880.000 ns (21 allocations: 2.62 KiB)
julia> g2(a, b) = a .* b
g2 (generic function with 1 method)
julia> @btime g2($a,$b);
  509.896 ns (11 allocations: 1.38 KiB)

I think the above example is persuasive enough.
At present, replacing a * true with a is done by llvm.
Since matrix * number call broadcast_preserving_zero_d, where materialize definitely call similar , I don't think llvm can prove this * true is droppable.

~~An easy solution is to replace A .* B .*s alpha with .*s(A, B, alpha), and make it call *(A,B,C) if needed.~~
Edit: .*s(A, B, alpha) only works when both A and B has matrix elements. If one of them isa AbstractArray{<:Number}, then *(A,B,true) also falls back to (A*B) * true and allocates twice.

julia> a = randn(10,10);
julia> f(a,b) = a * b * true
f (generic function with 1 method)
julia> @btime f($a,$a);
  300.000 ns (1 allocation: 896 bytes)
julia> @btime f($a,$1);
  206.667 ns (2 allocations: 1.75 KiB)
julia> @btime f($1,$a);
  204.071 ns (2 allocations: 1.75 KiB)

So also add something somthing like?

*(a::AbstractArray{<:Number},b::Number,c::Number) = b * c * a
*(a::Number,b::AbstractArray{<:Number},c::Number) = a * c * b

(I'm not sure whether Number should be restricted toUnion{Real,Complex} here.)

dkarrasch · 2021-09-28T15:56:31Z

So also add something somthing like?

*(a::AbstractArray{<:Number},b::Number,c::Number) = b * c * a
*(a::Number,b::AbstractArray{<:Number},c::Number) = a * c * b

Better not to change the order of multiplication and use broadcast:

*(a::AbstractArray{<:Number},b::Number,c::Number) = a .* b .* c
*(a::Number,b::AbstractArray{<:Number},c::Number) = a .* b .* c

I'm not sure how sustainable such definitions are, though. These would help with the array of arrays case, but what about the array of arrays of arrays? Feels a bit weird to mess with such a basic operation.

jishnub · 2021-10-01T21:49:45Z

If we add an isone check while broadcasting over *ₛ

Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
    iszero(beta::Number) ? false :
    isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)

we may get rid of the extra allocation:

julia> D
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [7 15; 10 22]      ⋅     
     ⋅          [67 91; 78 106]

julia> A = collect(D)
2×2 Matrix{Matrix{Int64}}:
 [7 15; 10 22]  [0 0; 0 0]
 [0 0; 0 0]     [67 91; 78 106]

julia> B = similar(A);

julia> @btime $B .= $D.diag .* $A;
  323.026 ns (4 allocations: 384 bytes)

julia> @btime mul!($B, $D, $A, true, false);
  336.775 ns (4 allocations: 384 bytes)

jishnub · 2021-10-01T21:52:50Z

I've also specialized zero and one for Diagonals, so now these functions work for nested diagonals:

julia> D = Diagonal([collect(reshape(1:4, 2, 2)), collect(reshape(5:8, 2, 2))])
2×2 Diagonal{Matrix{Int64}, Vector{Matrix{Int64}}}:
 [1 3; 2 4]      ⋅     
     ⋅       [5 7; 6 8]

julia> D2 = Diagonal([D, D]);

julia> D2 + zero(D2) == D2
true

julia> D2 * one(D2) == D2
true

N5N3 · 2021-10-02T06:25:52Z

I think the behaviour change of .*ₛ is acceptable.
In most case .*ₛ is used to dispatch alpha and beta, so the performance influence won't be large?

BTW, if someone dislike such modification, maybe we can use MulAddMul to achieve the same goal:

@noinline ___muldiag!(out,A,B,f::MulAddMul{<:Any,true}) = @. out = f(A * B) # out might be undef
@noinline ___muldiag!(out,A,B,f::MulAddMul) = @. out = f(A * B,out) 
@inline __muldiag!(out, D::Diagonal, B, f) = ___muldiag!(out, D.diag, B, f)
@inline __muldiag!(out, A, D::Diagonal, f) = ___muldiag!(out, A, permutedims(D.diag), f)
@inline function _muldiag!(out, A, B, alpha, beta)
    _muldiag_size_check(out, A, B)
    iszero(alpha) && return _rmul_or_fill!(out, beta)
    __muldiag!(out, A, B, MulAddMul(alpha, beta))
end

On my desktop this dynamic dispatch tooks about 20ns.
So the performance does regress slightly compared with this PR in some cases, such as non-const beta == 0.

jishnub · 2021-10-05T07:23:59Z

@dkarrasch I've made most of the changes suggested, could you take a look at this?

staticfloat · 2022-02-17T22:59:09Z

@dkarrasch @jishnub I am observing a pretty substantial slowdown on the LinearAlgebra tests due to this commit.

Checking out a8db7510b97571cb9ad6cc89fc2fae527c741ea6 and the commit before it, I see (for instance) the LinearAlgebra/addmul tests go from 280s to 340s. This is when running on a Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz.

N5N3

Just a quick glance.

N5N3 · 2022-02-18T15:53:45Z

stdlib/LinearAlgebra/src/generic.jl

 Broadcast.broadcasted(::typeof(*ₛ), out, beta) =
-    iszero(beta::Number) ? false : broadcasted(*, out, beta)
+    iszero(beta::Number) ? false :
+    isone(beta::Number) ? broadcasted(identity, out) : broadcasted(*, out, beta)


This change might introduce more instability?
Usage in this PR should be union-splitable, and the branch should be eliminated in 3-args mul! and l/rmul!
But it might mess else where?

N5N3 · 2022-02-18T16:00:14Z

stdlib/LinearAlgebra/src/diagonal.jl

+rmul!(A::AbstractMatrix, D::Diagonal) = mul!(A, A, D)
+lmul!(D::Diagonal, B::AbstractVecOrMat) = mul!(B, D, B)


rmul! and lmul! won't be vectorlized on master as it is a self-inplace broadcast. (4x slower)
Add @inline is useless here. Without fix like #43185, the only solution is recoding with for-loop.

But this should not be caught by CI as the test size is usually small.

Thanks for taking a look, @N5N3! I vaguely remember that you proposed in a similar situation the fix by inlining the rhs. Why doesn't that help here?

IIUC, our broadcast unaliases the src from the dest to avoid memory contention, which prevents LLVM to prove dest === src.
Our for-loop based implement doesn't include such unaliasing stage, thus @inline is enough to enable simd.

jishnub added 2 commits September 20, 2021 21:59

use mul! in Diagonal*Matrix of numbers

bc0aa8a

Add comment

211a574

oscardssmith approved these changes Sep 20, 2021

View reviewed changes

runtime checks to avoid ambiguities

e237daa

Force destination to be a Matrix

fcdc986

N5N3 reviewed Sep 21, 2021

View reviewed changes

stdlib/LinearAlgebra/src/diagonal.jl Outdated Show resolved Hide resolved

Require one-based indices

7ce858c

dkarrasch added linear algebra Linear algebra performance Must go faster labels Sep 21, 2021

N5N3 mentioned this pull request Sep 22, 2021

Add missing optimization for *,/ between Diagonal and Triangular #42343

Merged

jishnub added 2 commits September 24, 2021 22:16

simplify size mismatch error check

36a0109

fix mul! broadcasting

0a268c2

N5N3 reviewed Sep 27, 2021

View reviewed changes

mul! with Diagonal output and add zero and one

9761a58

jishnub added 3 commits October 2, 2021 11:54

bugfix in mul for diagonal output

bf17e13

Specialize oneunit for Diagonal

44d19fe

replace conflicting definition for one(::Diagonal)

1f66595

jishnub changed the title ~~Use mul! in Diagonal*Matrix of numbers~~ Use mul! in Diagonal*Matrix Oct 5, 2021

dkarrasch approved these changes Oct 5, 2021

View reviewed changes

dkarrasch merged commit a8db751 into JuliaLang:master Oct 5, 2021

jishnub deleted the muldiag branch October 5, 2021 09:47

N5N3 reviewed Feb 18, 2022

View reviewed changes

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022

Use mul! in Diagonal*Matrix (JuliaLang#42321)

66302f8

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022

Use mul! in Diagonal*Matrix (JuliaLang#42321)

85429bd

dkarrasch mentioned this pull request Mar 17, 2022

Fix performance issue with diagonal multiplication #44651

Merged

dkarrasch mentioned this pull request Jul 7, 2024

Fix (l/r)mul! with Diagonal/Bidiagonal #55052

Merged

		rmul!(A::AbstractMatrix, D::Diagonal) = mul!(A, A, D)
		lmul!(D::Diagonal, B::AbstractVecOrMat) = mul!(B, D, B)

Use mul! in Diagonal*Matrix #42321

Use mul! in Diagonal*Matrix #42321

Conversation

jishnub commented Sep 20, 2021

MasonProtter commented Sep 20, 2021 • edited Loading

MasonProtter commented Sep 20, 2021 • edited Loading

N5N3 commented Sep 21, 2021 • edited Loading

MasonProtter commented Sep 21, 2021

N5N3 commented Sep 21, 2021 • edited Loading

MasonProtter commented Sep 21, 2021 • edited Loading

N5N3 commented Sep 21, 2021 • edited Loading

jishnub commented Sep 21, 2021

N5N3 commented Sep 21, 2021 • edited Loading

jishnub commented Sep 21, 2021

N5N3 commented Sep 21, 2021

dkarrasch commented Sep 21, 2021

MasonProtter commented Sep 21, 2021

MasonProtter commented Sep 21, 2021

dkarrasch commented Sep 21, 2021

N5N3 commented Sep 22, 2021 • edited Loading

N5N3 commented Sep 22, 2021 • edited Loading

jishnub commented Sep 22, 2021 • edited Loading

N5N3 commented Sep 22, 2021 • edited Loading

dkarrasch commented Sep 22, 2021

jishnub commented Sep 27, 2021 • edited Loading

N5N3 left a comment • edited Loading

Choose a reason for hiding this comment

dkarrasch commented Sep 27, 2021

N5N3 commented Sep 27, 2021 • edited Loading

dkarrasch commented Sep 27, 2021

dkarrasch commented Sep 27, 2021

N5N3 commented Sep 27, 2021 • edited Loading

dkarrasch commented Sep 27, 2021

N5N3 commented Sep 27, 2021 • edited Loading

dkarrasch commented Sep 28, 2021

jishnub commented Oct 1, 2021 • edited Loading

jishnub commented Oct 1, 2021

N5N3 commented Oct 2, 2021 • edited Loading

jishnub commented Oct 5, 2021

staticfloat commented Feb 17, 2022

N5N3 left a comment

Choose a reason for hiding this comment

N5N3 Feb 18, 2022 • edited Loading

Choose a reason for hiding this comment

N5N3 Feb 18, 2022

Choose a reason for hiding this comment

dkarrasch Mar 11, 2022

Choose a reason for hiding this comment

N5N3 Mar 11, 2022 • edited Loading

Choose a reason for hiding this comment

MasonProtter commented Sep 20, 2021 •

edited

Loading

MasonProtter commented Sep 20, 2021 •

edited

Loading

N5N3 commented Sep 21, 2021 •

edited

Loading

N5N3 commented Sep 21, 2021 •

edited

Loading

MasonProtter commented Sep 21, 2021 •

edited

Loading

N5N3 commented Sep 21, 2021 •

edited

Loading

N5N3 commented Sep 21, 2021 •

edited

Loading

N5N3 commented Sep 22, 2021 •

edited

Loading

N5N3 commented Sep 22, 2021 •

edited

Loading

jishnub commented Sep 22, 2021 •

edited

Loading

N5N3 commented Sep 22, 2021 •

edited

Loading

jishnub commented Sep 27, 2021 •

edited

Loading

N5N3 left a comment •

edited

Loading

N5N3 commented Sep 27, 2021 •

edited

Loading

N5N3 commented Sep 27, 2021 •

edited

Loading

N5N3 commented Sep 27, 2021 •

edited

Loading

jishnub commented Oct 1, 2021 •

edited

Loading

N5N3 commented Oct 2, 2021 •

edited

Loading

N5N3 Feb 18, 2022 •

edited

Loading

N5N3 Mar 11, 2022 •

edited

Loading