Remove buffer pool for tiled GEMM #42309
Conversation
cc: @chriselrod
```
@@ -797,8 +792,7 @@ function _generic_matmatmul!(C::AbstractVecOrMat{R}, tA, tB, A::AbstractVecOrMat
end
end
else
# FIXME: This code is completely invalid!!!
```
🎉
I don't follow this example. FWIW, Octavian.jl avoids the GC issues for However, for
Note that the main concern here (discussed in the issue sections) is correctness and not performance. I discussed performance a bit to point out that the impact is small for previously working cases. But I think the point on reentrancy can be clarified a bit more. Consider

```julia
struct LargeMatrix <: AbstractMatrix{Float64}
    id::UInt64
end

struct LargeMatrix2 <: AbstractMatrix{LargeMatrix}
    id::UInt64
end
```

The implementation of `*` for `LargeMatrix2` can internally call `*` on its `LargeMatrix` elements, which is itself a matrix multiplication. So the point is that, while the outer multiplication is still using the pooled buffers, the nested call on the same thread hands out the very same buffers and corrupts the result.
After this PR, we'll still use the buffers; the point is to not pool them.
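To make the hazard concrete, here is a minimal, self-contained sketch; the names `POOL` and `kernel!` are made up (the real code is the tiled branch of `_generic_matmatmul!`), but the shape of the bug is the same:

```julia
# A per-thread pooled buffer is handed to a computation that re-enters itself,
# so the inner call clobbers the outer call's staged data.
const POOL = [Vector{Float64}(undef, 4) for _ in 1:Threads.nthreads()]

function kernel!(out, x; depth = 1)
    buf = POOL[Threads.threadid()]       # grab the pooled buffer
    buf .= x                             # stage intermediate results
    if depth == 1
        kernel!(out, 2 .* x; depth = 2)  # "nested *": reuses the same buffer
    end
    out .= buf                           # the outer call now reads clobbered data
    return out
end

kernel!(zeros(4), ones(4))  # returns [2.0, 2.0, 2.0, 2.0], not the staged ones
```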
Got it, thanks for the explanation. I understood that this needed changes because of possible task migration as well as the threadpools PR, but missed the example of nested `isbits`-but-large matrices.
This clearly has to be done at some point, and the benchmarks show that this doesn't have a significant performance impact (with perhaps the exception of cases that shouldn't use tiling anyway), so LGTM.
This is a bugfix (the bug I illustrated in the comment above can be triggered). So, I suppose this is a target for backporting?
(cherry picked from commit 6893f21)
This causes quite a bit of breakage on 1.6, so I'll remove this from backports for now. This is of course the packages' fault, but we try to be very conservative with patch releases.
Ah, I didn't think at all that this could trigger breakage in packages. But yes, it makes sense to play it safe in backports. Thanks for hunting down backport issues! (This makes me wonder whether it makes sense to delay backports, or at least PkgEval, until the related patches land on the latest release. That would give packages some chance to fix their bugs. Probably the only disadvantage is that we can't backport things while the patches are still fresh in the PR authors' memory.)
This PR is an RFC for removing the pooled buffer for the generic tiled matrix multiplication. The current implementation has several issues and a quick benchmark does not suggest that allocation and/or GC are the bottleneck.
Issue 1: Reentrancy
Inside of `_generic_matmatmul!`, we have the check in `julia/stdlib/LinearAlgebra/src/matmul.jl` (lines 771 to 776 at `eb83c4d`), which then uses the pre-allocated arrays defined at lines 728 to 732, via lines 779 to 780.
I think the major and practical issue here is that `isbitstype` is not enough for ensuring non-reentrancy of `*` (so that the buffers are not re-used by multiple functions). I can quite easily define an arbitrarily large matrix with a small `isbitstype` representation (see the `LargeMatrix`/`LargeMatrix2` example in the discussion above). This example is rather silly, but it's conceivable to have a Julia wrapper of externally managed matrices that is just a `Ptr{Cvoid}` as a Julia object. If these `isbits`-but-large matrices are used in nested/blocked matrices, the current implementation can corrupt the result.
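As a hedged illustration of that `Ptr{Cvoid}` wrapper idea (the `ExternalMatrix` type below is made up for this sketch, not part of the PR):

```julia
# An isbits struct wrapping an externally managed matrix: the struct itself is
# tiny and isbits, but the data it stands for can be arbitrarily large.
struct ExternalMatrix <: AbstractMatrix{Float64}
    handle::Ptr{Cvoid}   # opaque handle owned by an external library
    m::Int
    n::Int
end
Base.size(A::ExternalMatrix) = (A.m, A.n)
# getindex/setindex! omitted; the relevant trait is just:
isbitstype(ExternalMatrix)  # true
```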
Issue 2: `@async`- and `@spawn`-safety

A method of `*` can `yield` (e.g., a wrapper type that dumps all operations to a file). We then have the same problem as with reentrancy. Furthermore, if the current task is a `@spawn`ed task, it can be migrated to another worker thread upon a `yield`.
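A minimal sketch of that migration point, assuming a session with more than one thread and a Julia version where spawned tasks can migrate across worker threads:

```julia
using Base.Threads

t = @spawn begin
    before = threadid()
    sleep(0.1)               # any yield point, e.g. a `*` method doing I/O
    after = threadid()
    (before, after)          # the two ids may differ after the yield
end
fetch(t)
```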
Issue 3: `nthreads` may not be constant in the future

After a change in the scheduler like #42302, it will be impossible to rely on `nthreads()` not changing after `julia` is started. Supporting a pattern like this would also require a proper "static" initialization API (e.g., using the double-checked locking pattern sketched below).
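For reference, a hedged sketch of what such double-checked locking could look like with Julia 1.7+ atomics (all names are hypothetical; note that it still bakes in `nthreads()` at first use, which is exactly the assumption that would break):

```julia
mutable struct LazyPool
    @atomic pool::Union{Nothing,Vector{Vector{Float64}}}
end
const _LAZY = LazyPool(nothing)
const _LOCK = ReentrantLock()

function get_pool()
    p = @atomic :acquire _LAZY.pool          # fast path: lock-free check
    p === nothing || return p
    lock(_LOCK) do                           # slow path: take the lock
        q = @atomic :acquire _LAZY.pool      # second check under the lock
        if q === nothing
            # Still sized by nthreads() at first use.
            q = [Vector{Float64}(undef, 1024) for _ in 1:Threads.nthreads()]
            @atomic :release _LAZY.pool = q  # publish the initialized pool
        end
        q
    end
end
```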
Issue 4: GC

Also, as noted in #23914, this code has a problem due to missing `GC.@preserve`.
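For context, a minimal sketch of the pattern that `GC.@preserve` enforces (unrelated to the actual matmul.jl code): once only a raw pointer is live, the array backing it may be collected, so pointer-based access must keep the array rooted.

```julia
buf = Vector{Float64}(undef, 4)
p = pointer(buf)
GC.@preserve buf begin
    unsafe_store!(p, 1.0, 1)   # safe only while `buf` is kept alive
    unsafe_load(p, 1)
end
```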
Performance impact
It also seems that the allocation is not the bottleneck. For inputs large enough that the tiling matters, the effect of allocation/GC is not observable. For small matrices where you can see the effect of GC, it's faster to not use the tiled version.
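As a hedged sketch of the kind of benchmark this refers to (the exact setup is not shown in this excerpt; an `Int` element type is chosen here only because it avoids BLAS and exercises the generic tiled kernel):

```julia
using LinearAlgebra, BenchmarkTools   # BenchmarkTools.jl is a package

n = 512
A = rand(-10:10, n, n)   # Int eltype: no BLAS, so mul! hits the generic kernel
B = rand(-10:10, n, n)
C = zeros(Int, n, n)
@btime mul!($C, $A, $B)  # compare timings with and without the buffer pool
```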
(benchmark plot)
"No tile" means applying this diff