Stabilize MulAddMul strategically #52439

Conversation
Force-pushed from 847b9c7 to 5f39c10
Bump, what is the status here?

This is internal, but breaking. It affects a function that we encouraged data-storage packages such as SparseArrays.jl, GPUArrays.jl, etc. to overload. If we want to do that, then it requires organized action. I thought that would be best done right after the v1.11 cut, since I didn't have time to follow up with all the packages before the branching.
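For context, here is a minimal sketch of the idea behind such a wrapper, assuming (as the discussion suggests) that it records isone(alpha) and iszero(beta) in its type parameters so that kernels can specialize on them. The names MulAddMulSketch and update are invented for illustration; they are not the actual LinearAlgebra definitions, which is exactly why overloading packages are exposed to changes here.

```julia
# Illustrative sketch only (not the actual LinearAlgebra definition): a
# MulAddMul-style wrapper that records isone(alpha) and iszero(beta) as type
# parameters, so a kernel specialized on the wrapper type can drop the scaling
# work at compile time.
struct MulAddMulSketch{ais1,bis0,TA,TB}
    alpha::TA
    beta::TB
end

MulAddMulSketch(alpha, beta) =
    MulAddMulSketch{isone(alpha),iszero(beta),typeof(alpha),typeof(beta)}(alpha, beta)

# How a kernel would apply it: these branches are resolved statically once the
# wrapper's type parameters are known to the compiler.
update(p::MulAddMulSketch{true,true},   x, _) = x                        # C = A*B
update(p::MulAddMulSketch{false,true},  x, _) = p.alpha * x              # C = alpha*A*B
update(p::MulAddMulSketch{true,false},  x, c) = x + p.beta * c           # C = A*B + beta*C
update(p::MulAddMulSketch{false,false}, x, c) = p.alpha * x + p.beta * c # general case
```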
Force-pushed from 9ebeeb3 to 21e3f79
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
Force-pushed from 21e3f79 to 8033ae4
@nanosoldier

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

Awesome, this seems to speed up multiplication of small matrices significantly.

@nanosoldier
Let's see what the pkgeval run says, but I think we should slowly start to think collectively about whether we want to do this, and what needs to be taken care of to finish it. The clear upside is that for packages that have their own 5-arg [...]. The downside is that those pathways that run by [...].

I'd like to call for some support here, especially with running some benchmarks of your favorite workload and reporting issues.

@maleadt On the above PRs, the nightly runs seem to fail (independently of my changes?). Is there a way to test this PR here together with the PRs from GPUArrays.jl and CUDA/oneAPI/Metal, respectively?

@chriselrod Would you be able to run some of your compilation benchmarks? I'm afraid this will undo some of the speed-ups we achieved in v1.10, but let's see the numbers.

If we agree to include this, I believe it would be good to have it early in the cycle, so that chances are high to catch packages that rely on these internal functions.
The package evaluation job you requested has completed - possible new issues were detected.

No, there isn't.

@nanosoldier

@nanosoldier

The package evaluation job you requested has completed - possible new issues were detected.

The package evaluation job you requested has completed - possible new issues were detected.

The package evaluation job you requested has completed - possible new issues were detected.
@nanosoldier

Most failing tests were due to time-outs. From what I've seen, there are no failures due to method errors or the like. One possible source of excessive runtimes is that SparseArrays.jl is run without adaptation, so multiplication runs through the most generic kernel rather than the sparse kernels.
The package evaluation job you requested has completed - possible new issues were detected.
A few packages throw some signal in the most generic multiplication kernel, in the SIMD loop applying [...].
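To make the reference to that generic kernel concrete, here is a rough, hypothetical sketch of such a fallback loop; generic_mul_sketch! is an invented name, and the real LinearAlgebra kernel differs in detail. The point is only where the alpha/beta application sits, since that innermost update is where eltype-specific errors tend to surface.

```julia
# Hypothetical sketch of a generic fallback kernel (invented name; the real
# LinearAlgebra kernel differs in detail): a naive triple loop where the
# alpha/beta scaling is applied to every destination element.
function generic_mul_sketch!(C, A, B, alpha, beta)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("A and B have incompatible sizes"))
    for j in axes(B, 2), i in axes(A, 1)
        acc = zero(C[i, j])
        @simd for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        # the alpha/beta application inside the kernel loop; with exotic
        # eltypes, this update is where conversion or overflow errors appear
        C[i, j] = alpha * acc + beta * C[i, j]
    end
    return C
end
```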
The following compile-time benchmark from @chriselrod

```julia
using ForwardDiff, LinearAlgebra

d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

function dualify(A, n, j)
    if n > 0
        A = d.(A, n)
        if (j > 0)
            A = d.(A, j)
        end
    end
    A
end

@time for n = 0:8, j = (n != 0):4
    A = dualify.(randn(5, 5), n, j);
    B = dualify.(randn(5, 5), n, j);
    C = similar(A);
    mul!(C, A, B);
    mul!(C, A', B);
    mul!(C, A, B');
    mul!(C, A', B');
    mul!(C, transpose(A), B);
    mul!(C, A, transpose(B));
    mul!(C, transpose(A), transpose(B));
end
```

doesn't show a dramatic increase, something like 10%. Testing with this:
(Julia Version 1.12.0-DEV.394 (2024-04-22), branch dk/muladdmul/0562602986, fork: 6 commits, 7 days)
```julia
julia> using LinearAlgebra

julia> A = randn(BigFloat, 5, 5);

julia> B = randn(BigFloat, 5, 5);

julia> @time C = A * B;
  0.358988 seconds (988.50 k allocations: 49.872 MiB, 31.52% gc time, 99.89% compilation time)  # PR
  0.188448 seconds (671.25 k allocations: 32.915 MiB, 99.79% compilation time)  # v1.11-beta1

julia> @time mul!(C, A, B, true, false);
  0.169249 seconds (6.58 k allocations: 306.602 KiB, 99.96% compilation time: 100% of which was recompilation)  # PR
  0.036884 seconds (1.60 k allocations: 76.930 KiB, 99.78% compilation time)  # v1.11-beta1

julia> @time mul!(C, A, B, true, true);
  0.000080 seconds (600 allocations: 30.469 KiB)  # PR
  0.115821 seconds (252.68 k allocations: 12.372 MiB, 13.96% gc time, 99.80% compilation time)  # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, false);
  0.313045 seconds (804.67 k allocations: 40.479 MiB, 5.31% gc time, 99.97% compilation time)  # PR
  0.139149 seconds (496.09 k allocations: 24.297 MiB, 99.82% compilation time)  # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, 2.8);
  0.384808 seconds (804.58 k allocations: 40.463 MiB, 24.57% gc time, 99.97% compilation time)  # PR
  0.142509 seconds (504.06 k allocations: 24.781 MiB, 9.86% gc time, 99.81% compilation time)  # v1.11-beta1
```

The recompilation in the mul!(C, A, B, true, false) case, however, surprises me a bit. Why does it need to recompile [...]?

I'd like to suggest leaving this open for a few more days to wait for comments and, absent objections, merging it.
@nanosoldier

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@nanosoldier
@dkarrasch thanks for picking this up! Did you consider keeping an inference barrier at the call to generic_matmatmul to avoid compile-time overhead in BLAS or small-matrix cases, at the expense of slightly slowing down the generic_matmatmul case due to always-on runtime dispatch? The last version of my PR made that choice; I assumed generic_matmatmul is typically a slow path anyway. Not sure if that's right in all cases, though!
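A minimal sketch of the inference-barrier option being asked about, reusing the hypothetical MulAddMulSketch type from the sketch further up; slow_kernel! and mul_barrier! are invented stand-ins for generic_matmatmul and its caller, not the actual code.

```julia
# Inference-barrier variant (hypothetical names): the wrapper is built from
# runtime alpha/beta values, so its concrete type is unknown to the caller and
# the kernel call is a single runtime dispatch. Cheap to compile, with a
# per-call cost that only matters on the already-slow generic path.
slow_kernel!(C, A, B, add::MulAddMulSketch) =
    (C .= add.alpha .* (A * B) .+ add.beta .* C)  # stand-in for the generic kernel

# The caller stays simple and is compiled once; only slow_kernel! gets
# specialized, and only for the (alpha, beta) shapes that actually occur.
mul_barrier!(C, A, B, alpha, beta) = slow_kernel!(C, A, B, MulAddMulSketch(alpha, beta))
```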
@nanosoldier

Could you please point me to that commit?

@nanosoldier

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
It's in this commit. It just means not using the macro for the generic_matmatmul call (see comment). I decided not to do it for the vector case (see here).
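For contrast, a hedged sketch of roughly what keeping the macro at a call site amounts to, again with the hypothetical MulAddMulSketch and slow_kernel! names from the sketches above (not the actual macro expansion): branching on the alpha/beta values up front hands the kernel a concretely typed wrapper in every branch, so no runtime dispatch remains, at the price of compiling the kernel call into the caller for each (alpha, beta) shape.

```julia
# Value-based branching variant (hypothetical sketch, not the real macro
# expansion): each branch constructs a concretely typed wrapper, so the kernel
# call is statically dispatched, but every branch is compiled into the caller.
function mul_branched!(C, A, B, alpha, beta)
    if isone(alpha) && iszero(beta)
        slow_kernel!(C, A, B, MulAddMulSketch(true, false))
    elseif iszero(beta)
        slow_kernel!(C, A, B, MulAddMulSketch(alpha, false))
    elseif isone(alpha)
        slow_kernel!(C, A, B, MulAddMulSketch(true, beta))
    else
        slow_kernel!(C, A, B, MulAddMulSketch(alpha, beta))
    end
    return C
end
```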
I'll try that out. Thanks!
OK, I included that suggestion, but kept the macro call at the most generic version (for now). I believe this should not affect the usual small-matrix or BLAS paths. I'll double-check and then re-evaluate.

ADDENDUM: I made the inference barrier complete. That generic method may get called even for BLAS-type matrices and hence gets compiled. So, for generic or mixed eltypes, this PR doesn't change anything about the runtime dispatch.
All green, let's go!
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
Manual rebase of #47206. Closes #47206.
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>