Remove the coordinator (controller) thread #1062
Conversation
Found a case where workers may sleep forever. Needs to be fixed later.
Mutators no longer trigger GC by adding `ScheduleCollection` directly.
I ran lusearch at 2.5x min heap size for 40 invocations. The following plot is normalised to build1:

The following plot is not normalised:

From the plots, it looks like this PR reduces the STW time when the absolute (not normalised) number of GCs is large. There is a 1% to 2% STW time reduction for GenCopy, GenImmix, SemiSpace and StickyImmix. For Immix, the STW time increased by 0.3%, but the error bar is much larger than the difference. From the individual data, it looks like some outlier invocations triggered more GCs than others. The following plot shows the number of GCs executed in each invocation when running Immix, sorted and grouped by build. It shows that the distribution of the number of GCs is the same, but "build2" (yellow) contains some outliers at the right end of the curve.

We see this PR makes GC slightly faster, and it mainly impacts the plans that trigger GC often (mainly generational plans). I think the main reason is that this PR removes one context switch every time all workers have stopped (because there is no need to switch to the coordinator). If a plan does more GCs, but less work in each GC (mostly nursery GCs), the synchronisation overhead will be significant. It is hard to explain why SemiSpace becomes faster this way, though, because it is not generational. Maybe SemiSpace simply triggered GC too many times at this heap size.
The same setting (lusearch, 2.5x min heap size, 40 invocations) was run on skunk.moma. This Coffee Lake machine does not have small ("efficiency") cores, so it is expected to have smaller variance. STW times are reduced for all plans, and Mutator (other) times are increased for all plans.
The total time reflects the actual speed-up. There is a 0.3% slowdown for Immix, likely due to a slightly increased number of GCs caused by random effects. The same data without normalisation: At this heap size, GenCopy and GenImmix spent more time in STW than in mutator (other).
Here are results from running more benchmarks: 3.0x min heap size, 20 invocations. I split the tests across two machines with identical hardware and identical binaries.
From the geometric means of the total time, this PR speeds up all the plans tested, but only by 0.1% to 0.3%.
I added a […]. Results on skunk: all the same settings as before, except adding […].
I am closing this PR in favor of #1067 (Remove coordinator and support forking). The reason is that after we remove the coordinator thread, we need a way for workers to coordinate themselves, and the concrete way they coordinate themselves is dictated by the requirement of supporting forking. That is, we must implement a coordination mechanism that (1) does not need a dedicated coordinator thread, and (2) makes it easy to bring down all GC workers when forking. The code in this PR accomplishes (1), but it is clearly not suitable for (2), and #1067 has already done non-trivial refactoring over this PR. Therefore, it would be a waste of effort to merge this PR before applying #1067.
DRAFT: Need polishing. Need performance evaluation. Need to update bindings.
This PR removes the coordinator (a.k.a. controller) thread. GC worker threads now coordinate themselves.
Major changes include:
- `GCController` is removed. All synchronization mechanisms for the communication between the controller and workers are removed, too.
- `WorkerMonitorSync::should_schedule_gc` is added as the communication channel. To trigger a GC, a mutator shall acquire the mutex `WorkerMonitor::sync`, set `sync.should_schedule_gc` to `true`, and notify one worker. The last parked worker always checks this variable before blocking, and adds the `ScheduleCollection` work packet when a GC is requested. (See the sketch after this list.)
- `EndOfGC` is removed. The code is now executed by the last parked worker when GC finishes.
- `GlobalState::gc_status` is now `Atomic`, and it is read by the last parked worker to decide whether GC is in progress. Previously, it was not used in Rust MMTk, and was only used for assertions in the MMTk in JikesRVM.
- Added `GlobalState::gc_start_time` to record the start time of a GC. It was previously a local variable in `GCController`.
- Added `WorkBucket::add_no_notify` to add work packets while holding the mutex `WorkerMonitor::sync`.

Fixes: #1053
Closes: #1061
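For illustration, here is a minimal sketch of the trigger/park handshake described in the change list, built on `std::sync::{Mutex, Condvar}`. The type and field names `WorkerMonitor`, `WorkerMonitorSync` and `should_schedule_gc` mirror the list above, but the methods (`request_gc`, `park_until_gc_pending`), the `parked_workers` counter, and the overall shape are simplifying assumptions for this sketch, not mmtk-core's actual API:

```rust
use std::sync::{Condvar, Mutex};

/// State guarded by the monitor mutex (simplified; `parked_workers`
/// is an assumed field for this sketch).
struct WorkerMonitorSync {
    should_schedule_gc: bool, // set by a mutator to request a GC
    parked_workers: usize,    // number of workers currently parked
}

struct WorkerMonitor {
    sync: Mutex<WorkerMonitorSync>,
    cond: Condvar,
    num_workers: usize,
}

impl WorkerMonitor {
    /// Mutator side: request a GC without a coordinator thread.
    /// Acquire the mutex, raise the flag, and wake one parked worker.
    fn request_gc(&self) {
        let mut sync = self.sync.lock().unwrap();
        sync.should_schedule_gc = true;
        self.cond.notify_one();
    }

    /// Worker side: park when out of work. Returns once this worker
    /// has become responsible for scheduling a pending GC, i.e. it is
    /// the last parked worker and `should_schedule_gc` is set.
    fn park_until_gc_pending(&self) {
        let mut sync = self.sync.lock().unwrap();
        sync.parked_workers += 1;
        // The last worker to park checks the flag *before* blocking,
        // so a request raised while all workers sleep is never lost.
        while !(sync.parked_workers == self.num_workers && sync.should_schedule_gc) {
            sync = self.cond.wait(sync).unwrap();
            // (A real worker would also re-check its work queues here;
            // this sketch models only the GC-request path.)
        }
        sync.should_schedule_gc = false;
        sync.parked_workers -= 1;
        // At this point the real PR adds the `ScheduleCollection`
        // packet while still holding the mutex (hence something like
        // `WorkBucket::add_no_notify`), then wakes the other workers.
        self.cond.notify_all();
    }
}
```

Because the last parked worker both schedules the collection when `should_schedule_gc` is set and, per the list above, runs the former `EndOfGC` code when `GlobalState::gc_status` indicates a GC in progress, no context switch to a separate coordinator thread is needed, which is the effect the STW measurements above try to quantify.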