Remove buffer pool for tiled GEMM #42309
Conversation
cc: @chriselrod
```
@@ -797,8 +792,7 @@ function _generic_matmatmul!(C::AbstractVecOrMat{R}, tA, tB, A::AbstractVecOrMat
end
end
else
# FIXME: This code is completely invalid!!!
```
🎉
I don't follow this example. FWIW, Octavian.jl avoids the GC issues for However, for
Note that the main concern here (discussed in the issue sections) is correctness and not performance. I discussed performance a bit to point out that the impact is small for previously working cases. But I think the point on reentrancy can be clarified a bit more. Consider

```julia
struct LargeMatrix <: AbstractMatrix{Float64}
    id::UInt64
end

struct LargeMatrix2 <: AbstractMatrix{LargeMatrix}
    id::UInt64
end
```

The implementation of `*` for `LargeMatrix2` can internally call `*` on its `LargeMatrix` elements, which is itself a matrix multiplication. So the point is that, while the outer multiplication is still using the pooled buffers, the nested call on the same thread hands out the very same buffers and corrupts the result.
After this PR, we'll still use the buffers; the point is to not pool them.
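To make the hazard concrete, here is a minimal, self-contained sketch; the names `POOL` and `kernel!` are made up (the real code is the tiled branch of `_generic_matmatmul!`), but the shape of the bug is the same:

```julia
# A per-thread pooled buffer is handed to a computation that re-enters itself,
# so the inner call clobbers the outer call's staged data.
const POOL = [Vector{Float64}(undef, 4) for _ in 1:Threads.nthreads()]

function kernel!(out, x; depth = 1)
    buf = POOL[Threads.threadid()]       # grab the pooled buffer
    buf .= x                             # stage intermediate results
    if depth == 1
        kernel!(out, 2 .* x; depth = 2)  # "nested *": reuses the same buffer
    end
    out .= buf                           # the outer call now reads clobbered data
    return out
end

kernel!(zeros(4), ones(4))  # returns [2.0, 2.0, 2.0, 2.0], not the staged ones
```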
Got it, thanks for the explanation. I understood that this needed changes because of possible task migration as well as the threadpools PR, but missed the example of nested `isbits`-but-large matrices.
This clearly has to be done at some point, and the benchmarks show that this doesn't have a significant performance impact (with perhaps the exception of cases that shouldn't use tiling anyway), so LGTM.
This is a bugfix (the bug I illustrated in the comment above can be triggered). So, I suppose this is a target for backporting?
(cherry picked from commit 6893f21)
This causes quite a bit of breakage on 1.6, so I'll remove this from backports for now. This is of course the packages' fault, but we try to be very conservative with patch releases.
Ah, I didn't think at all that this could trigger breakage in packages. But yes, it makes sense to play it safe in backports. Thanks for hunting down backport issues! (This makes me wonder whether it makes sense to delay backports, or at least PkgEval, until the related patches land on the latest release. That would give packages some chance to fix their bugs. Probably the only disadvantage is that we can't backport things while the patches are still fresh in the PR authors' memory.)
This PR is an RFC for removing the pooled buffer for the generic tiled matrix multiplication. The current implementation has several issues and a quick benchmark does not suggest that allocation and/or GC are the bottleneck.
Issue 1: Reentrancy
Inside of `_generic_matmatmul!`, we have the check in `julia/stdlib/LinearAlgebra/src/matmul.jl` (lines 771 to 776 at `eb83c4d`), which then uses the pre-allocated arrays defined at lines 728 to 732, via lines 779 to 780.
I think the major and practical issue here is that `isbitstype` is not enough for ensuring non-reentrancy of `*` (so that the buffers are not re-used by multiple functions). I can quite easily define an arbitrarily large matrix with a small `isbitstype` representation (see the `LargeMatrix`/`LargeMatrix2` example in the discussion above). This example is rather silly, but it's conceivable to have a Julia wrapper of externally managed matrices that is just a `Ptr{Cvoid}` as a Julia object. If these `isbits`-but-large matrices are used in nested/blocked matrices, the current implementation can corrupt the result.
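As a hedged illustration of that `Ptr{Cvoid}` wrapper idea (the `ExternalMatrix` type below is made up for this sketch, not part of the PR):

```julia
# An isbits struct wrapping an externally managed matrix: the struct itself is
# tiny and isbits, but the data it stands for can be arbitrarily large.
struct ExternalMatrix <: AbstractMatrix{Float64}
    handle::Ptr{Cvoid}   # opaque handle owned by an external library
    m::Int
    n::Int
end
Base.size(A::ExternalMatrix) = (A.m, A.n)
# getindex/setindex! omitted; the relevant trait is just:
isbitstype(ExternalMatrix)  # true
```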
Issue 2: `@async`- and `@spawn`-safety

A method of `*` can `yield` (e.g., a wrapper type that dumps all operations to a file). We then have the same problem as with reentrancy. Furthermore, if the current task is a `@spawn`ed task, it can be migrated to another worker thread upon a `yield`.
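A minimal sketch of that migration point, assuming a session with more than one thread and a Julia version where spawned tasks can migrate across worker threads:

```julia
using Base.Threads

t = @spawn begin
    before = threadid()
    sleep(0.1)               # any yield point, e.g. a `*` method doing I/O
    after = threadid()
    (before, after)          # the two ids may differ after the yield
end
fetch(t)
```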
Issue 3: `nthreads` may not be constant in the future

After a change in the scheduler like #42302, it will be impossible to rely on `nthreads()` not changing after `julia` is started. Supporting a pattern like this would also require a proper "static" initialization API (e.g., using the double-checked locking pattern sketched below).
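For reference, a hedged sketch of what such double-checked locking could look like with Julia 1.7+ atomics (all names are hypothetical; note that it still bakes in `nthreads()` at first use, which is exactly the assumption that would break):

```julia
mutable struct LazyPool
    @atomic pool::Union{Nothing,Vector{Vector{Float64}}}
end
const _LAZY = LazyPool(nothing)
const _LOCK = ReentrantLock()

function get_pool()
    p = @atomic :acquire _LAZY.pool          # fast path: lock-free check
    p === nothing || return p
    lock(_LOCK) do                           # slow path: take the lock
        q = @atomic :acquire _LAZY.pool      # second check under the lock
        if q === nothing
            # Still sized by nthreads() at first use.
            q = [Vector{Float64}(undef, 1024) for _ in 1:Threads.nthreads()]
            @atomic :release _LAZY.pool = q  # publish the initialized pool
        end
        q
    end
end
```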
Issue 4: GC

Also, as noted in #23914, this code has a problem due to missing `GC.@preserve`.
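For context, a minimal sketch of the pattern that `GC.@preserve` enforces (unrelated to the actual matmul.jl code): once only a raw pointer is live, the array backing it may be collected, so pointer-based access must keep the array rooted.

```julia
buf = Vector{Float64}(undef, 4)
p = pointer(buf)
GC.@preserve buf begin
    unsafe_store!(p, 1.0, 1)   # safe only while `buf` is kept alive
    unsafe_load(p, 1)
end
```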
Performance impact
It also seems that the allocation is not the bottleneck. For inputs large enough that the tiling matters, the effect of allocation/GC is not observable. For small matrices where you can see the effect of GC, it's faster to not use the tiled version.
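As a hedged sketch of the kind of benchmark this refers to (the exact setup is not shown in this excerpt; an `Int` element type is chosen here only because it avoids BLAS and exercises the generic tiled kernel):

```julia
using LinearAlgebra, BenchmarkTools   # BenchmarkTools.jl is a package

n = 512
A = rand(-10:10, n, n)   # Int eltype: no BLAS, so mul! hits the generic kernel
B = rand(-10:10, n, n)
C = zeros(Int, n, n)
@btime mul!($C, $A, $B)  # compare timings with and without the buffer pool
```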
(benchmark plot)
"No tile" means applying this diff