GC/Parallel marking #44643

Closed
wants to merge 28 commits into from

d-netto (Member) commented Mar 16, 2022

[As noted below, this PR seems to cause performance regressions with large numbers of threads. It is superseded by https://github.com//pull/45639.]

This PR extends #41760 by using the deque from #43366 to implement work-stealing in the GC mark loop.

The design is inspired by Horie et al. (https://dl.acm.org/doi/pdf/10.1145/3299706.3210570).

At a high level, each thread maintains two queues, one public and one private. The public queue has a fixed size, and thieves may steal from it. On overflow, elements are pushed onto the private queue, which can grow without any synchronization because thieves never access it.
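
The push policy described above can be sketched roughly as follows. This is a single-threaded model only: a real work-stealing deque needs atomic operations on the public queue, which are left out here. The names `mark_queues_t`, `mq_push`, `mq_pop` and the capacity `PUBLIC_CAP` are hypothetical and are not taken from the PR, and draining the private stack first in `mq_pop` is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PUBLIC_CAP 4 /* hypothetical fixed size, chosen small for illustration */

typedef struct {
    void  *pub[PUBLIC_CAP]; /* fixed-size public queue (thieves may steal from it) */
    size_t pub_len;
    void **priv;            /* growable private overflow stack (owner only) */
    size_t priv_len;
    size_t priv_cap;
} mark_queues_t;

/* Push one work item: public queue first, private stack on overflow.
 * Returns 1 if the item went public, 0 if private, -1 on allocation failure. */
static int mq_push(mark_queues_t *q, void *item)
{
    assert(q != NULL);
    if (q->pub_len < PUBLIC_CAP) {
        q->pub[q->pub_len++] = item;
        return 1;
    }
    if (q->priv_len == q->priv_cap) {
        /* Growing needs no synchronization: only the owner touches priv. */
        size_t cap = q->priv_cap ? 2 * q->priv_cap : 8;
        void **p = realloc(q->priv, cap * sizeof *p);
        if (p == NULL)
            return -1;
        q->priv = p;
        q->priv_cap = cap;
    }
    q->priv[q->priv_len++] = item;
    return 0;
}

/* Owner-side pop. Assumption: drain the private stack first so that
 * public work stays visible to thieves for as long as possible. */
static void *mq_pop(mark_queues_t *q)
{
    assert(q != NULL);
    if (q->priv_len > 0)
        return q->priv[--q->priv_len];
    if (q->pub_len > 0)
        return q->pub[--q->pub_len];
    return NULL;
}
```

Because `mq_push` checks the public queue on every call, work generated after a steal has freed a public slot becomes visible to thieves again.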

For the example below (chosen because it spends ~70% of runtime in the mark loop)

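using BenchmarkTools
using Statistics  # for mean

gctimes = Float64[]

for i in 1:25
    # Time three back-to-back full collections and record the GC time.
    stat = @timed begin
        GC.gc(); GC.gc(); GC.gc()
    end
    push!(gctimes, stat.gctime)
end

@show Threads.nthreads()
@show mean(gctimes)
println("~~~~~~~~")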

we have

for nt in {1..4}; do ../julia/julia -t$nt gc_scrub.jl; done
Threads.nthreads() = 1
mean(gctimes) = 0.35867786796000006
~~~~~~~~
Threads.nthreads() = 2
mean(gctimes) = 0.20352844991999997
~~~~~~~~
Threads.nthreads() = 3
mean(gctimes) = 0.16981833383999997
~~~~~~~~
Threads.nthreads() = 4
mean(gctimes) = 0.14483000495999998
~~~~~~~~

The time spent in the mark loop for the example above is artificially large, so these speedups in GC time won't necessarily be achieved in practice.

TODO:

  • Fix thread recruitment on macOS (current implementation hangs in the Mach exception handler).
  • Improve heuristics to decide when threads should give up work-stealing or wake up and become thieves (as of now, threads try to steal jl_n_threads times and leave the mark loop on failure).
  • Fix GC debugging infrastructure in src/gc-debug.c.
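
The current give-up heuristic from the TODO list (try to steal `jl_n_threads` times, then leave the mark loop) can be illustrated with the sketch below. It is a single-threaded stand-in, not the PR's code: `pub_queue_t`, `pq_steal`, and `try_steal` are hypothetical names, the round-robin victim order is an assumption, and a real steal would need an atomic compare-and-swap on the queue head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one thread's public queue. */
typedef struct {
    void **items;
    size_t head; /* thieves take from the head */
    size_t tail; /* the owner pushes at the tail */
} pub_queue_t;

/* Single-threaded stand-in for a steal: take the oldest item, if any. */
static void *pq_steal(pub_queue_t *q)
{
    return q->head < q->tail ? q->items[q->head++] : NULL;
}

/* Make at most n_threads steal attempts, visiting victims round-robin
 * starting after `self` (the attempt that lands on `self` is skipped).
 * Returns a stolen item, or NULL to signal that the caller should
 * leave the mark loop. */
static void *try_steal(pub_queue_t *queues, int n_threads, int self)
{
    assert(n_threads > 0 && self >= 0 && self < n_threads);
    for (int attempt = 1; attempt <= n_threads; attempt++) {
        int victim = (self + attempt) % n_threads;
        if (victim == self)
            continue;
        void *item = pq_steal(&queues[victim]);
        if (item != NULL)
            return item;
    }
    return NULL;
}
```

A fixed attempt budget like this is simple, but it cannot tell "no work left anywhere" apart from "work is about to be published", which is why the heuristic is listed as something to improve.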

@vchuravy vchuravy requested review from chflood and yuyichao March 16, 2022 17:39

chflood (Member) commented Mar 16, 2022

I don't have access to the paper unfortunately. I wish more people used researchgate.net.

My main concern is that work is hidden from the other threads in a private stack and therefore we are missing out on parallelism. If I pop from my private stack and then generate work, is the new work pushed on the private or public stack? There is some chance that steals happened and the public stack now has room.

@jpsamaroo jpsamaroo added performance Must go faster GC Garbage collector labels Mar 16, 2022

d-netto (Member, Author) commented Mar 16, 2022

> I don't have access to the paper unfortunately. I wish more people used researchgate.net.
>
> My main concern is that work is hidden from the other threads in a private stack and therefore we are missing out on parallelism. If I pop from my private stack and then generate work, is the new work pushed on the private or public stack? There is some chance that steals happened and the public stack now has room.

Private stack. Yes, it's possible to hide some GC work in such cases. However, the public queue size is chosen to be large enough that the private queue is unlikely to be used.

Edit: in this implementation the new work is pushed into the private queue, but it can and should be pushed into the public one if there is space. This should be fixed in the next commit.

@vchuravy vchuravy requested a review from vtjnash April 10, 2022 12:32
@vchuravy vchuravy changed the title [WIP] GC/Parallel marking GC/Parallel marking Apr 10, 2022

tveldhui commented

I was digging into long gc times we were seeing (30-40% of duration, gc pauses of 20 seconds, 15 minutes of gc time over a run) and @Sacha0 pointed out this PR. We will give it a try.

One thing I found that might be relevant: when visiting objects in a random memory layout, the memory accesses took 55ns per object. But if you add __builtin_prefetch(o) to prefetch the (i+4)th object while marking the ith object, the time drops to 16ns per node. I was benchmarking loops like

for (int i=0; i < n2; ++i)
    {
       /* Prefetch the node four iterations ahead; the bounds check keeps
          visit_order from being read past its end. */
       if (i+4 < n2)
           __builtin_prefetch(visit_order[i+4]);
       TreeNode* node = visit_order[i];
       check += node->mark;
       node->mark = 1;
    }

not sure if this is something you can make use of. In any case looking forward to trying your parallel marking.

vchuravy (Member) commented

@nanosoldier runtests(ALL, vs = ":master")

@vchuravy vchuravy requested a review from kpamnany May 31, 2022 13:16

nanosoldier (Collaborator) commented

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

kpamnany (Contributor) left a comment

Please either remove the formatting changes or squash your commits such that functional changes are in a separate commit. It will make it easier to review this PR.

chriselrod (Contributor) commented Jun 6, 2022

I've been trying this PR out on some (often GC-heavy) proprietary code.
If needed, I could try and make a reproducible open source example.

Running 1000 fits to get a better idea of the mean; PR:

julia> extrema(ts), 1e3mean(ts)
((0.037808062, 33.898002256), 466.66967408400006)

Master:

julia> extrema(ts), 1e3mean(ts)
((0.03921645, 0.255453254), 59.963547983)

That is a 7.8x regression on this PR: the average time on master is about 60 ms, vs. about 467 ms on this PR.

This PR seems to cause extreme GC pauses, e.g. one of the fits took >30 seconds on this PR, while the slowest fit on master was little more than a quarter second!

EDIT:
Rerunning on the PR for another 1000 iterations:

julia> extrema(ts), 1e3mean(ts)
((0.041272219, 41.353805079), 672.1340294480002)

More than 11x slower on average.

chriselrod (Contributor) commented

Compile times also regress severely under this PR:

julia> using DifferentialEquations

julia> function f(du,u,p,t)
         du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
         du[2] = -3 * u[2] + u[1]*u[2]
       end
f (generic function with 1 method)

julia> function g(du,u,p,t)
         du[1] = p[3]*u[1]
         du[2] = p[4]*u[2]
       end
g (generic function with 1 method)

julia> p = [1.5,1.0,0.1,0.1];

julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);

julia> function prob_func(prob,i,repeat)
         x = 0.3rand(2)
         remake(prob,p=[p[1:2];x])
       end
prob_func (generic function with 1 method)

julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
144.477110 seconds (24.37 M allocations: 1.706 GiB, 0.65% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Master:

julia> using DifferentialEquations

julia> function f(du,u,p,t)
         du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
         du[2] = -3 * u[2] + u[1]*u[2]
       end
f (generic function with 1 method)

julia> function g(du,u,p,t)
         du[1] = p[3]*u[1]
         du[2] = p[4]*u[2]
       end
g (generic function with 1 method)

julia> p = [1.5,1.0,0.1,0.1];

julia> prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);

julia> function prob_func(prob,i,repeat)
         x = 0.3rand(2)
         remake(prob,p=[p[1:2];x])
       end
prob_func (generic function with 1 method)

julia> ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
 16.090327 seconds (24.01 M allocations: 1.668 GiB, 3.95% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

chriselrod (Contributor) commented Jun 6, 2022

Running:

using DifferentialEquations

function f(du,u,p,t)
  du[1] = p[1] * u[1] - p[2] * u[1]*u[2]
  du[2] = -3 * u[2] + u[1]*u[2]
end

function g(du,u,p,t)
  du[1] = p[3]*u[1]
  du[2] = p[4]*u[2]
end

p = [1.5,1.0,0.1,0.1];
prob = SDEProblem(f,g,[1.0,1.0],(0.0,10.0),p);

function prob_func(prob,i,repeat)
  x = 0.3rand(2)
  remake(prob,p=[p[1:2];x])
end

ensemble_prob = EnsembleProblem(prob,prob_func=prob_func);

include("../../utils.jl")

@gctime solve(ensemble_prob,SRIW1(),trajectories=100_000).u[end].u[end]

Most of the time is spent on loading and precompilation. I'll up the trajectories to 1_000_000.

5 runs with 36 threads on a 36-thread system yield:

# Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      18032 │    1627 │          817 │                 0 │     2752 │          9 │
│  median │      18090 │    1699 │          856 │                 0 │     2833 │          9 │
│ maximum │      19679 │    1894 │          877 │                12 │     2867 │         10 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

# PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      25287 │    1026 │          739 │               404 │     1210 │          4 │
│  median │      29303 │    1882 │          906 │               496 │     2950 │          6 │
│ maximum │      32776 │    2028 │          969 │               690 │     3006 │          7 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

Note the regression in total time (about 62% at the median, 18090 ms vs. 29303 ms), even though the reported gc time is about the same.

chriselrod (Contributor) commented

Reducing the number of threads to 18, matching the number of physical cores, and increasing the number of trajectories to 1_000_000 so that the solve actually accounts for a decent chunk of the total time:

# Master:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      44143 │   18857 │         9822 │                 0 │    34017 │         43 │
│  median │      44160 │   19057 │         9919 │                 0 │    34118 │         43 │
│ maximum │      44429 │   19152 │         9948 │                 0 │    34202 │         43 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

________________________________________________________
Executed in  362.57 secs    fish           external
   usr time   62.67 mins  325.00 micros   62.67 mins
   sys time    3.41 mins   85.00 micros    3.41 mins

# PR:
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │     171546 │   25212 │        14311 │               324 │    34140 │          4 │
│  median │     200472 │   28749 │        15988 │              4548 │    34569 │         15 │
│ maximum │     567746 │   30408 │        16248 │             31160 │    34794 │         16 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

________________________________________________________
Executed in   25.31 mins    fish           external
   usr time   68.37 mins  312.00 micros   68.37 mins
   sys time  318.83 mins   84.00 micros  318.83 mins

A 3.8 to >12x regression in runtime, but the GC time only increased by about 50%.
Just looking at the GC time vastly understates the degree to which this PR regresses performance.

chriselrod (Contributor) commented Jun 6, 2022

Also, on a 64-thread Epyc system, Julia master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
 18.253071 seconds (19.22 M allocations: 1.313 GiB, 5.46% gc time, 99.90% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Versus the PR:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1035073 chrisel+  20   0   19.8g 716008 127416 S  4098   0.1   2700:53 julia

It's put 2700 CPU minutes in (that's 45 CPU*hours) without finishing yet.
EDIT:
It has finished!

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=10)
4530.529931 seconds (19.19 M allocations: 1.310 GiB, 0.12% gc time, 100.00% compilation time)
EnsembleSolution Solution of length 10 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

Comparing single solves with 1 million trajectories after compilation, master:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
124.598627 seconds (691.60 M allocations: 65.804 GiB, 67.09% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

PR:

julia> sim = @time solve(ensemble_prob,SRIW1(),trajectories=1_000_000)
305.889817 seconds (691.35 M allocations: 65.782 GiB, 25.51% gc time)
EnsembleSolution Solution of length 1000000 with uType:
RODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, NoiseProcess{Float64, 2, Float64, Vector{Float64}, Vector{Float64}, Vector{Vector{Float64}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, ResettableStacks.ResettableStack{Tuple{Float64, Vector{Float64}, Vector{Float64}}, true}, RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, Nothing, SDEFunction{true, typeof(f), typeof(g), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, typeof(g), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Nothing}, SRIW1, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float64}}, Vector{Float64}}, DiffEqBase.DEStats}

chriselrod (Contributor) commented Jun 7, 2022

Also, it is important to test with many threads, that is, with more than four.
With only four threads, the regression is minimal.
Master, running

julia_master --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5977 │    2810 │          127 │                 0 │      459 │         47 │
│  median │       5988 │    2833 │          128 │                 0 │      461 │         47 │
│ maximum │       6122 │    2963 │          130 │                 0 │      462 │         48 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:

julia_parallel_gc --project=. run_benchmarks.jl -n5 -t4 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5828 │    2626 │          103 │                 0 │      462 │         42 │
│  median │       6018 │    2716 │          104 │                 0 │      464 │         46 │
│ maximum │       6247 │    2872 │          106 │                88 │      472 │         46 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

While with 18 threads:

julia_master --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5546 │    3744 │          330 │                 0 │      958 │         67 │
│  median │       5601 │    3793 │          332 │                 0 │      965 │         68 │
│ maximum │       5626 │    3819 │          334 │                 3 │      973 │         68 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:

julia_parallel_gc --project=. run_benchmarks.jl -n5 -t18 --bench=binary_tree/tree_immutable.jl
┌─────────┬────────────┬─────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │           ms │                ms │       MB │          % │
├─────────┼────────────┼─────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       9231 │    5790 │          468 │                 0 │      932 │         48 │
│  median │      10908 │    5809 │          483 │                 4 │      944 │         54 │
│ maximum │      12263 │    6281 │          523 │                71 │      951 │         63 │
└─────────┴────────────┴─────────┴──────────────┴───────────────────┴──────────┴────────────┘

This PR causes about a 2x regression.

chflood (Member) commented Jun 7, 2022 via email

oscardssmith (Member) commented

Chris and I have confirmed (by looking at htop) that he was.

@d-netto d-netto mentioned this pull request Jun 7, 2022
3 tasks

chflood (Member) commented Jun 8, 2022 via email

chflood (Member) commented Jun 8, 2022 via email

KristofferC (Member) commented

> I added the start of a deque stats tracker to the stats branch of https://github.com/chflood/parallelmarking

As a side note, you would typically not create a completely new repository for this. Instead, just add that branch to your existing fork at https://github.com/chflood/julia.

@d-netto d-netto mentioned this pull request Jun 10, 2022
3 tasks
@d-netto d-netto closed this Jun 10, 2022