GC/Parallel marking #44643
Conversation
I don't have access to the paper unfortunately. I wish more people used researchgate.net. My main concern is that work is hidden from the other threads in a private stack and therefore we are missing out on parallelism. If I pop from my private stack and then generate work, is the new work pushed on the private or public stack? There is some chance that steals happened and the public stack now has room. |
Private stack. Yes, it's possible to hide some GC work in such cases. The public queue size is chosen to be sufficiently large that the private queue is unlikely to be used, though. Edit: it's pushed into the private queue in this implementation, but in fact it can/should be pushed into the public one (if there is space). Should be fixed in the next commit. |
I was digging into the long GC times we were seeing (30-40% of duration, GC pauses of 20 seconds, 15 minutes of GC time over a run) and @Sacha0 pointed out this PR. We will give it a try. One thing I found might be relevant: visiting objects in a random memory layout, the memory accesses were taking 55 ns per object. But if you add
not sure if this is something you can make use of. In any case, looking forward to trying your parallel marking. |
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
@nanosoldier |
Your package evaluation job has completed - possible new issues were detected. A full report can be found here. |
Please either remove the formatting changes or squash your commits such that functional changes are in a separate commit. It will make it easier to review this PR.
I've been trying this PR out on some (often GC-heavy) proprietary code, running 1000 fits to get a better idea of the mean.

PR:
julia> extrema(ts), 1e3mean(ts)
((0.037808062, 33.898002256), 466.66967408400006)

Master:
julia> extrema(ts), 1e3mean(ts)
((0.03921645, 0.255453254), 59.963547983)

A 7.8x regression on this PR; the average time on master is about 60 ms, vs 466 ms on this PR. This PR seems to cause extreme GC pauses, e.g. one of the fits took >30 seconds on this PR, while the slowest fit on master was little more than a quarter second!

EDIT:
julia> extrema(ts), 1e3mean(ts)
((0.041272219, 41.353805079), 672.1340294480002)

More than 11x slower on average. |
Compile times also regress severely under this PR:

julia> using DifferentialEquations
julia> function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
f (generic function with 1 method)
julia> function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
g (generic function with 1 method)
julia> p = [1.5,1.0,0.1,0.1];
julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
julia> function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
prob_func (generic function with 1 method)
julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
144.477110 seconds (24.37 M allocations: 1.706 GiB, 0.65% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Master:

julia> using DifferentialEquations
julia> function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
f (generic function with 1 method)
julia> function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
g (generic function with 1 method)
julia> p = [1.5,1.0,0.1,0.1];
julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
julia> function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
prob_func (generic function with 1 method)
julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
16.090327 seconds (24.01 M allocations: 1.668 GiB, 3.95% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats} |
Running:

using DifferentialEquations
function f(du,u,p,t)
du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
du[2] = -3 * u[2] + u[1]*u[2]
end
function g(du,u,p,t)
du[1] = p[3]*u[1]
du[2] = p[4]*u[2]
end
p = [1.5,1.0,0.1,0.1];
prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);
function prob_func(prob,i,repeat)
x = 0.3rand(2)
remake(prob,p=[p[1:2];x])
end
ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);
include("../../utils.jl")
@gctime solve(ensemble_prob,SRIW1(),trajectories=100_000).u[end].u[end]

Most of the time spent is on loading and precompilation, so I'll up the trajectories. 5 runs with 36 threads on a 36-thread system yields:

Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 18032 │ 1627 │ 817 │ 0 │ 2752 │ 9 │
│ median │ 18090 │ 1699 │ 856 │ 0 │ 2833 │ 9 │
│ maximum │ 19679 │ 1894 │ 877 │ 12 │ 2867 │ 10 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 25287 │ 1026 │ 739 │ 404 │ 1210 │ 4 │
│ median │ 29303 │ 1882 │ 906 │ 496 │ 2950 │ 6 │
│ maximum │ 32776 │ 2028 │ 969 │ 690 │ 3006 │ 7 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

Note the 50% regression in total time, but the claimed GC time is about the same. |
Reducing the number of threads to 18 (matching the number of physical cores) and increasing the number of trajectories to 1_000_000, so that the solve actually spends a decent chunk of the total time:

Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 44143 │ 18857 │ 9822 │ 0 │ 34017 │ 43 │
│ median │ 44160 │ 19057 │ 9919 │ 0 │ 34118 │ 43 │
│ maximum │ 44429 │ 19152 │ 9948 │ 0 │ 34202 │ 43 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
________________________________________________________
Executed in 362.57 secs fish external
usr time 62.67 mins 325.00 micros 62.67 mins
sys time 3.41 mins 85.00 micros 3.41 mins
PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 171546 │ 25212 │ 14311 │ 324 │ 34140 │ 4 │
│ median │ 200472 │ 28749 │ 15988 │ 4548 │ 34569 │ 15 │
│ maximum │ 567746 │ 30408 │ 16248 │ 31160 │ 34794 │ 16 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘
________________________________________________________
Executed in 25.31 mins fish external
usr time 68.37 mins 312.00 micros 68.37 mins
sys time 318.83 mins 84.00 micros 318.83 mins

A 3.8x to >12x regression in runtime, but the GC time only increased by about 50%. |
Also, on a 64-thread Epyc system, Julia master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
18.253071 seconds (19.22 M allocations: 1.313 GiB, 5.46% gc time, 99.90% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Versus the PR:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1035073 chrisel+ 20 0 19.8g 716008 127416 S 4098 0.1 2700:53 julia

It's put in 2700 CPU-minutes (that's 45 CPU-hours) without finishing yet.

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
4530.529931 seconds (19.19 M allocations: 1.310 GiB, 0.12% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Comparing single solves with 1 million trajectories post-compilation, master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
124.598627 seconds (691.60 M allocations: 65.804 GiB, 67.09% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

PR:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
305.889817 seconds (691.35 M allocations: 65.782 GiB, 25.51% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats} |
Also, it is important to test with many threads, that is, with more than four. With only four threads, the regression is minimal.

Master, running:
julia_master --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5977 │ 2810 │ 127 │ 0 │ 459 │ 47 │
│ median │ 5988 │ 2833 │ 128 │ 0 │ 461 │ 47 │
│ maximum │ 6122 │ 2963 │ 130 │ 0 │ 462 │ 48 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:
julia_parallel_gc --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5828 │ 2626 │ 103 │ 0 │ 462 │ 42 │
│ median │ 6018 │ 2716 │ 104 │ 0 │ 464 │ 46 │
│ maximum │ 6247 │ 2872 │ 106 │ 88 │ 472 │ 46 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

While with 18 threads:
julia_master --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 5546 │ 3744 │ 330 │ 0 │ 958 │ 67 │
│ median │ 5601 │ 3793 │ 332 │ 0 │ 965 │ 68 │
│ maximum │ 5626 │ 3819 │ 334 │ 3 │ 973 │ 68 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:
julia_parallel_gc --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│ │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│ │ ms │ ms │ ms │ ms │ MB │ % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │ 9231 │ 5790 │ 468 │ 0 │ 932 │ 48 │
│ median │ 10908 │ 5809 │ 483 │ 4 │ 944 │ 54 │
│ maximum │ 12263 │ 6281 │ 523 │ 71 │ 951 │ 63 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

This PR causes about a 2x regression. |
I believe that there's a bug in the benchmark harness and you are not really running with multiple threads.
|
Chris and I have confirmed (by looking at htop) that he was. |
Can we move the work-stealing queues to their own file and add some debugging checks? Having work items get missed in production would be a bear to track down; stats would show us if we are missing a fence or something. Back in the day we were missing a barrier that was needed in TSO memory models. I'm imagining that in the future there might be similar cases where knowing that every item that was pushed was popped would be comforting.

Christine
I added the start of a deque stats tracker to the stats branch of https://github.com/chflood/parallelmarking, but the numbers aren't adding up. I'm new to git, so if I did this in the wrong place please let me know.
|
As a side note, you would typically not create a completely new repository for this, just add that branch to your already existing fork at https://github.com/chflood/julia. |
[As noted in the comments, this seems to cause performance regressions with a large number of threads; superseded by #45639.]
This PR extends #41760 by using the deque from #43366 to implement work-stealing in the GC mark loop.
The design is inspired by Horie et al. (https://dl.acm.org/doi/pdf/10.1145/3299706.3210570). At a high level, two queues (public/private) are maintained by each thread. The public queue has a fixed size and thieves may steal from it. In case of overflow, elements are pushed into the private queue (which, in turn, can be expanded with no need for synchronization, since thieves won't access it). For the example below (chosen because it spends ~70% of its runtime in the mark loop)
we have
The time spent in the mark loop for the example above is artificially large, so these speedups in GC time won't necessarily be achieved in practice.
TODO:
- macOS (current implementation hangs in the Mach exception handler).
- Mark-loop termination policy (e.g. try to steal jl_n_threads times and leave the mark loop on failure).
- src/gc-debug.c.