Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize MulAddMul strategically #52439

Merged
merged 11 commits into from
May 8, 2024
Merged

Stabilize MulAddMul strategically #52439

merged 11 commits into from
May 8, 2024

Conversation

dkarrasch
Copy link
Member

Manual rebase of #47206. Closes #47206.

Co-authored-by: Ashley Milsted ashmilsted@gmail.com

@brenhinkeller brenhinkeller added performance Must go faster linear algebra Linear algebra labels Dec 7, 2023
@KristofferC
Copy link
Member

Bump, what is the status here?

@dkarrasch
Copy link
Member Author

This is internal, but breaking. It affects a function that we encouraged data storage packages such as SparseArrays.jl, GPUArrays.jl etc. to overload. If we want to do that, then it requires organized action. I thought that would be best done right after the v1.11 cut, since I didn't have time to follow up all the packages before the branching.

@dkarrasch dkarrasch force-pushed the dk/muladdmul branch 3 times, most recently from 9ebeeb3 to 21e3f79 Compare February 27, 2024 11:01
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@dkarrasch
Copy link
Member Author

Awesome, this seems to speed up multiplication of small matrices significantly.

@dkarrasch
Copy link
Member Author

@nanosoldier runtests()

@dkarrasch
Copy link
Member Author

Let's see what the pkgeval run says, but I think we should/could slowly start to collectively think about whether we want to do this, or what should be taken care of to finish this.

The clear upside is that for packages that have their own 5-arg mul! implementations like the packages that have already received PRs (except for GPUArrays.jl), there is no MulAddMul in the pathway anymore, so no issues with dynamic calls to generic_mat[vec/mat]mul! due to type instability of the _add::MulAddMul argument. Those pathways should be somewhat lighter on the compiler, because alpha and beta are just pushed forward.

The downside is that those pathways that run by MulAddMul constructors (including the *diagonal ones and the most generic _generic_matmatmul!) need to compile four different multiplication kernel methods. So, initially, the load on the compiler is increased, for the benefit of avoiding dynamic calls/allocation of MulAddMul objects.

I'd like to call for some support here, especially with running some benchmarks of your favorite workload and reporting issues.

@maleadt On the above PRs, the nightly runs seem to fail (independently from my changes?). Is there a way to test this PR here together with the PRs from GPUArrays.jl and CUDA/oneAPI/Metal respectively?

@chriselrod Would you be able to run some of your compilation benchmarks? I'm afraid this will undo some of the speed-ups we achieved in v1.10, but let's see the numbers.

If we agree to include this, I believe it would be good to have it early in the cycle, so that chances are high to catch packages that rely on these internal functions.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@maleadt
Copy link
Member

maleadt commented Mar 1, 2024

On the above PRs, the nightly runs seem to fail (independently from my changes?). Is there a way to test this PR here together with the PRs from GPUArrays.jl and CUDA/oneAPI/Metal respectively?

No, there isn't.

@dkarrasch
Copy link
Member Author

@nanosoldier runtests(["FloatTracker", "EulerAngles", "NumericalAlgorithms", "ConvexHulls2d", "SubSIt", "CompressedSparseBlocks", "MLKernels", "NMF", "VibrationGEPHelpers", "JSOSolvers", "ExtendableSparse", "EcologicalNetworksPlots", "CrystalNets", "OptimalPortfolios", "GLPK", "PlantRayTracer", "ProximalOperators", "Gtk4", "SkyDomes", "BasisMatrices", "TaylorIntegration", "TransitionsInTimeseries", "MixedModels", "ClimaCoreSpectra", "ClimaCorePlots", "LowLevelParticleFilters", "MultiStateSystems", "ManifoldDiffEq", "ConScape", "BLASBenchmarksCPU", "LowRankIntegrators", "SpiDy", "AiidaDFTK", "Petri", "StirredReactor", "BatchReactor", "ONSAS", "ChargeTransport", "SMLMFrameConnection", "MathepiaModels", "NamedTrajectories", "BloqadeGates", "Population"])

@dkarrasch
Copy link
Member Author

@nanosoldier runtests(["FloatTracker", "EulerAngles", "NumericalAlgorithms", "ConvexHulls2d", "SubSIt", "CompressedSparseBlocks", "MLKernels", "NMF", "VibrationGEPHelpers", "JSOSolvers", "ExtendableSparse", "EcologicalNetworksPlots", "CrystalNets", "OptimalPortfolios", "GLPK", "PlantRayTracer", "ProximalOperators", "Gtk4", "SkyDomes", "BasisMatrices", "TaylorIntegration", "TransitionsInTimeseries", "MixedModels", "ClimaCoreSpectra", "ClimaCorePlots", "LowLevelParticleFilters", "MultiStateSystems", "ManifoldDiffEq", "ConScape", "BLASBenchmarksCPU", "LowRankIntegrators", "SpiDy", "AiidaDFTK", "Petri", "StirredReactor", "BatchReactor", "ONSAS", "ChargeTransport", "SMLMFrameConnection", "MathepiaModels", "NamedTrajectories", "BloqadeGates", "Population"])

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Copy link
Member Author

@nanosoldier runtests(["SubSIt", "ReduceWindows", "CompressedSparseBlocks", "EasyCurl", "IterativeSolvers", "LinearMaps", "VibrationGEPHelpers", "PreallocationTools", "ParameterizedQuantumControl", "CrystalNets", "MRICoilSensitivities", "CDDLib", "TaylorIntegration", "BasisMatrices", "ReservoirComputing", "MixedModels", "StructuralEquationModels", "LongwaveModePropagator", "HierarchicalGaussianFiltering", "LowRankIntegrators", "MimiRFFSPs", "Plots", "Knockoffs", "ConceptualClimateModels", "AstrodynamicalModels", "MinimallyDisruptiveCurves", "Biofilm", "SMLMSim"])

Most failing tests were due to time-out. From what I've seen there are no failures due to method errors or things like that. One possible source of extensive runtimes is that SparseArrays.jl is run without adaptation, so multiplication runs by the most generic kernel, and not the sparse kernels.

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Copy link
Member Author

A few packages throw some signal in the most generic multiplication kernel, in the simd-loop applying muladd. It could be that this is related to sparse arrays, which don't get caught by the status quo of SparseArrays.jl. Failures, however, are limited to a handful of packages, so if we decide to go with this, we could first merge the SparseArrays.jl PR, then bump the stdlib, then update this branch, and rerun pkgeval.

@dkarrasch
Copy link
Member Author

dkarrasch commented Apr 30, 2024

The following compile time benchmark from @chriselrod

using ForwardDiff, LinearAlgebra

d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

function dualify(A, n, j)
  if n > 0
    A = d.(A, n)
    if (j > 0)
      A = d.(A, j)
    end
  end
  A
end

@time for n = 0:8, j = (n!=0):4
  A = dualify.(randn(5,5), n, j);
  B = dualify.(randn(5,5), n, j);
  C = similar(A);
  mul!(C, A, B);
  mul!(C, A', B);
  mul!(C, A, B');
  mul!(C, A', B');
  mul!(C, transpose(A), B);
  mul!(C, A, transpose(B));
  mul!(C, transpose(A), transpose(B));
end

doesn't show a dramatic increase, something like 10%. Testing with this:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0-DEV.394 (2024-04-22)
 _/ |\__'_|_|_|\__'_|  |  dk/muladdmul/0562602986* (fork: 6 commits, 7 days)
|__/                   |

julia> using LinearAlgebra

julia> A = randn(BigFloat, 5, 5);

julia> B = randn(BigFloat, 5, 5);

julia> @time C = A * B;
  0.358988 seconds (988.50 k allocations: 49.872 MiB, 31.52% gc time, 99.89% compilation time) # PR
  0.188448 seconds (671.25 k allocations: 32.915 MiB, 99.79% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, true, false);
  0.169249 seconds (6.58 k allocations: 306.602 KiB, 99.96% compilation time: 100% of which was recompilation) # PR
  0.036884 seconds (1.60 k allocations: 76.930 KiB, 99.78% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, true, true);
  0.000080 seconds (600 allocations: 30.469 KiB) # PR
  0.115821 seconds (252.68 k allocations: 12.372 MiB, 13.96% gc time, 99.80% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, false);
  0.313045 seconds (804.67 k allocations: 40.479 MiB, 5.31% gc time, 99.97% compilation time) # PR
  0.139149 seconds (496.09 k allocations: 24.297 MiB, 99.82% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, 2.8);
  0.384808 seconds (804.58 k allocations: 40.463 MiB, 24.57% gc time, 99.97% compilation time) # PR
  0.142509 seconds (504.06 k allocations: 24.781 MiB, 9.86% gc time, 99.81% compilation time) # v1.11-beta1

however, surprises me a bit. Why does it need to recompile mul!(C, A, B, true, false)? Other than that, the increase in compile times is less than factor 4, which one would naively expect by how the macro is set up. OTOH, this may be outbalanced by the fact that external packages don't need to compile through MulAddMul anymore, only w.r.t. the types of the alpha and beta arguments. Plus: small matrix multiplication is dramatically sped up!

I'd like to suggest to leave this for a few more days and wait for comments, and absent objections merge this.

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":release-v1.11")

@amilsted
Copy link
Contributor

@dkarrasch thanks for picking this up! Did you consider keeping an inference barrier at the call to generic_matmatmul to avoid compile-time overhead in BLAS or small-matrix cases, at the expense of slightly slowing down the generic_matmatmul case due to always-on runtime dispatch? The last version of my PR made that choice - I assumed generic_matmatmul is typically a slow path anyway. Not sure if that's right in all cases though!

@dkarrasch
Copy link
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":release-v1.10")

@dkarrasch
Copy link
Member Author

The last version of my PR

Could you please point me to that commit?

@nanosoldier runbenchmarks("linalg", vs = ":release-1.11")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@amilsted
Copy link
Contributor

amilsted commented Apr 30, 2024

Could you please point me to that commit?

It's in this commit. It just means not using the macro for the generic_matmatmul call (see comment).

I decided not to do it for the vector case (see here).

@dkarrasch
Copy link
Member Author

I'll try that out. Thanks!

@dkarrasch
Copy link
Member Author

dkarrasch commented May 1, 2024

Ok, I included that suggestion, but kept the macro call at the most generic version (for now). I believe this should not affect the usual small-matrix- or blas-paths. I'll double-check, and then re-evaluate.

ADDENDUM: I made the inference barrier complete. That generic method may get called even by BLAS-type matrices and hence gets compiled. So, for generic or mixed eltypes this PR doesn't change anything about the runtime dispatch.

@dkarrasch
Copy link
Member Author

All green, let's go!

@dkarrasch dkarrasch merged commit 29ced9e into master May 8, 2024
7 checks passed
@dkarrasch dkarrasch deleted the dk/muladdmul branch May 8, 2024 07:57
xlxs4 pushed a commit to xlxs4/julia that referenced this pull request May 9, 2024
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Jul 12, 2024
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants