GC/Parallel marking #44643
Conversation
I don't have access to the paper unfortunately. I wish more people used researchgate.net. My main concern is that work is hidden from the other threads in a private stack and therefore we are missing out on parallelism. If I pop from my private stack and then generate work, is the new work pushed on the private or public stack? There is some chance that steals happened and the public stack now has room. |
Private stack. Yes, it's possible to hide some GC work in such cases. The public queue size is chosen to be sufficiently large that the private queue is unlikely to be used, though. Edit: it's pushed into the private queue in this implementation, but in fact it can/should be pushed into the public one (if there is space). Should be fixed in the next commit. |
I was digging into the long GC times we were seeing (30-40% of duration, GC pauses of 20 seconds, 15 minutes of GC time over a run) and @Sacha0 pointed out this PR. We will give it a try. One thing I found might be relevant: visiting objects in a random memory layout, the memory accesses were taking 55 ns per object. But if you add
not sure if this is something you can make use of. In any case, looking forward to trying your parallel marking. |
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
@nanosoldier |
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. |
Please either remove the formatting changes or squash your commits such that functional changes are in a separate commit. It will make it easier to review this PR.
I've been trying this PR out on some (often GC-heavy) proprietary code, running 1000 fits to get a better idea of the mean.

PR:
julia> extrema(ts), 1e3mean(ts)
((0.037808062, 33.898002256), 466.66967408400006)

Master:
julia> extrema(ts), 1e3mean(ts)
((0.03921645, 0.255453254), 59.963547983)

A 7.8x regression on this PR; the average time on master is about 60 ms, vs 466 ms on this PR. This PR seems to cause extreme GC pauses, e.g. one of the fits took >30 seconds on this PR, while the slowest fit on master was little more than a quarter second!

EDIT:
julia> extrema(ts), 1e3mean(ts)
((0.041272219, 41.353805079), 672.1340294480002)

More than 11x slower on average. |
Compile times also regress severely under this PR:

julia> using DifferentialEquations
julia> function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
f (generic function with 1 method)
julia> function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
g (generic function with 1 method)
julia> p = [1.5,1.0,0.1,0.1];
julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
julia> function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
prob_func (generic function with 1 method)
julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
144.477110 seconds (24.37 M allocations: 1.706 GiB, 0.65% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Master:

julia> using DifferentialEquations
julia> function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
f (generic function with 1 method)
julia> function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
g (generic function with 1 method)
julia> p = [1.5,1.0,0.1,0.1];
julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
julia> function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
prob_func (generic function with 1 method)
julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
16.090327 seconds (24.01 M allocations: 1.668 GiB, 3.95% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats} |
Running:

using DifferentialEquations
function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
p = [1.5,1.0,0.1,0.1];
prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
include("../../utils.jl")
@gctime solve(ensemble_prob,SRIW1(),trajectories=100_000).u[end].u[end]

Most of the time spent is on loading and precompilation, so I'll up the trajectories. 5 runs with 36 threads on a 36-thread system yields:

Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 18032 │ 1627 │ 817 │ 0 │ 2752 │ 9 │
│ median │ 18090 │ 1699 │ 856 │ 0 │ 2833 │ 9 │
│ maximum │ 19679 │ 1894 │ 877 │ 12 │ 2867 │ 10 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 25287 │ 1026 │ 739 │ 404 │ 1210 │ 4 │
│ median │ 29303 │ 1882 │ 906 │ 496 │ 2950 │ 6 │
│ maximum │ 32776 │ 2028 │ 969 │ 690 │ 3006 │ 7 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

Note the 50% regression in total time, but the claimed GC time is about the same. |
Reducing the number of threads to 18 (matching the number of physical cores) and increasing the number of trajectories to 1_000_000, so that the solve actually spends a decent chunk of the total time:

Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 44143 │ 18857 │ 9822 │ 0 │ 34017 │ 43 │
│ median │ 44160 │ 19057 │ 9919 │ 0 │ 34118 │ 43 │
│ maximum │ 44429 │ 19152 │ 9948 │ 0 │ 34202 │ 43 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
________________________________________________________
Executed in 362.57 secs fish external
usr time 62.67 mins 325.00 micros 62.67 mins
sys time 3.41 mins 85.00 micros 3.41 mins
PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 171546 │ 25212 │ 14311 │ 324 │ 34140 │ 4 │
│ median │ 200472 │ 28749 │ 15988 │ 4548 │ 34569 │ 15 │
│ maximum │ 567746 │ 30408 │ 16248 │ 31160 │ 34794 │ 16 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
________________________________________________________
Executed in 25.31 mins fish external
usr time 68.37 mins 312.00 micros 68.37 mins
sys time 318.83 mins 84.00 micros 318.83 mins

A 3.8x to >12x regression in runtime, but the GC time only increased by about 50%. |
Also, on a 64-thread Epyc system, Julia master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
18.253071 seconds (19.22 M allocations: 1.313 GiB, 5.46% gc time, 99.90% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Versus the PR:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1035073 chrisel+ 20 0 19.8g 716008 127416 S 4098 0.1 2700:53 julia

It's put in 2700 CPU-minutes (that's 45 CPU-hours) without finishing yet.

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
4530.529931 seconds (19.19 M allocations: 1.310 GiB, 0.12% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Comparing single solves with 1 million trajectories post-compilation, master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
124.598627 seconds (691.60 M allocations: 65.804 GiB, 67.09% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

PR:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
305.889817 seconds (691.35 M allocations: 65.782 GiB, 25.51% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats} |
Also, it is important to test with many threads, that is, with more than four. With only four threads, the regression is minimal.

Master, running:
julia_master --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5977 │ 2810 │ 127 │ 0 │ 459 │ 47 │
│ median │ 5988 │ 2833 │ 128 │ 0 │ 461 │ 47 │
│ maximum │ 6122 │ 2963 │ 130 │ 0 │ 462 │ 48 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:
julia_parallel_gc --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5828 │ 2626 │ 103 │ 0 │ 462 │ 42 │
│ median │ 6018 │ 2716 │ 104 │ 0 │ 464 │ 46 │
│ maximum │ 6247 │ 2872 │ 106 │ 88 │ 472 │ 46 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

While with 18 threads:
julia_master --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5546 │ 3744 │ 330 │ 0 │ 958 │ 67 │
│ median │ 5601 │ 3793 │ 332 │ 0 │ 965 │ 68 │
│ maximum │ 5626 │ 3819 │ 334 │ 3 │ 973 │ 68 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:
julia_parallel_gc --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 9231 │ 5790 │ 468 │ 0 │ 932 │ 48 │
│ median │ 10908 │ 5809 │ 483 │ 4 │ 944 │ 54 │
│ maximum │ 12263 │ 6281 │ 523 │ 71 │ 951 │ 63 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

This PR causes about a 2x regression. |
I believe that there's a bug in the benchmark harness and you are not really running with multiple threads.
|
Chris and I have confirmed (by looking at htop) that he was. |
Can we move the work-stealing queues to their own file and add some debugging checks? Having work items get missed in production would be a bear to track down; stats would show us if we are missing a fence or something. Back in the day we were missing a barrier that was needed in TSO memory models. I'm imagining that in the future there might be similar cases where knowing that every item that was pushed was popped would be comforting.

Christine
I added the start of a deque stats tracker to the stats branch of https://github.com/chflood/parallelmarking, but the numbers aren't adding up. I'm new to git, so if I did this in the wrong place please let me know.
|
As a side note, you would typically not create a completely new repository for this, just add that branch to your already existing fork at https://github.com/chflood/julia. |
[As noted in the comments, this seems to cause performance regressions with a large number of threads; superseded by #45639.]
This PR extends #41760 by using the deque from #43366 to implement work-stealing in the GC mark loop.
The design is inspired by Horie et al. (https://dl.acm.org/doi/pdf/10.1145/3299706.3210570). At a high level, two queues (public/private) are maintained by each thread. The public queue has a fixed size and thieves may steal from it. In case of overflow, elements are pushed into the private queue (which, in turn, can be expanded with no need for synchronization, since thieves won't access it). For the example below (chosen because it spends ~70% of its runtime in the mark loop)
we have
The time spent in the mark loop for the example above is artificially large, so these speedups in GC time won't necessarily be achieved in practice.
TODO:
- macOS (current implementation hangs in the Mach exception handler).
- Mark-loop termination policy (e.g. try to steal jl_n_threads times and leave the mark loop on failure).
- src/gc-debug.c.