
Add SweepChunk to native MarkSweepSpace #1158

Merged: 11 commits, Jun 28, 2024

Conversation

@wks (Collaborator) commented Jun 26, 2024

This PR re-introduces the SweepChunk work packet to the native MarkSweepSpace when using eager sweeping. The main intention of this PR is to fix a pathological case where there is only one mutator and the single ReleaseMutator work packet cannot be parallelized.

The algorithm is unchanged for lazy sweeping.

When doing eager sweeping,

  1. We first release all unmarked blocks in MarkSweepSpace::abandoned and each mutator.
  2. And then we sweep blocks in parallel using SweepChunk work packets.
  3. We then go through all "consumed" blocks and move them into "available" lists if they have any free cells.

In step (1), we release blocks for each mutator in ReleaseMutator. Releasing blocks is very fast, so the lack of parallelism is not a bottleneck. During step (2), all blocks are either unallocated or marked, so we don't need to move any block out of linked lists, avoiding the data race we observed in #1103. Step (3) is done by one thread, but it is fast enough not to be a performance bottleneck.
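
To make the three steps concrete, here is a self-contained toy sketch; the `Block` and `BlockLists` types and the helper functions below are invented for illustration and do not correspond to the actual mmtk-core code:

```rust
// Toy model of the eager-sweeping flow; types and lists are invented here.
#[derive(Clone, Copy)]
struct Block {
    marked: bool,
    free_cells: usize,
}

struct BlockLists {
    available: Vec<Block>,
    consumed: Vec<Block>,
}

// Step 1 (ReleaseMutator / ReleaseMarkSweepSpace): release unmarked blocks.
// This only unlinks blocks from the lists, so it is cheap even without parallelism.
fn release_unmarked(lists: &mut BlockLists) {
    lists.available.retain(|b| b.marked);
    lists.consumed.retain(|b| b.marked);
}

// Step 2 (SweepChunk, one packet per chunk, run in parallel): sweep the
// surviving blocks in place; no block is moved between lists here.
fn sweep_blocks(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        b.free_cells = 4; // stand-in for the real cell-by-cell sweep
    }
}

// Step 3 (single thread): move "consumed" blocks that regained free cells
// back into the "available" list so mutators can allocate from them again.
fn recycle_blocks(lists: &mut BlockLists) {
    let mut still_consumed = Vec::new();
    for b in lists.consumed.drain(..) {
        if b.free_cells > 0 {
            lists.available.push(b);
        } else {
            still_consumed.push(b);
        }
    }
    lists.consumed = still_consumed;
}
```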

We also introduced a work packet ReleaseMarkSweepSpace which does what MarkSweepSpace::release did, but is executed concurrently with ReleaseMutator work packets. In this way, we don't do too much work in the Release work packet, which is a problem we discussed in #1147.

We removed the constant ABANDON_BLOCKS_IN_RESET because abandoning blocks at that point is clearly necessary: otherwise one mutator would keep too many blocks locally, preventing other mutators from getting available blocks to use.

We added a USDT tracepoint in SweepChunk in both MarkSweepSpace and ImmixSpace so that we can see the number of allocated blocks a SweepChunk work packet visited in the timeline generated by the eBPF tracing tools.
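
For illustration, such a tracepoint can be fired from Rust with the `probe` crate's `probe!` macro (which mmtk-core uses for its USDT probes); the function, provider, and probe names below are a sketch rather than the exact code added by this PR:

```rust
use probe::probe;

// Sketch of a SweepChunk body reporting how many allocated blocks it visited.
fn sweep_chunk(blocks_in_chunk: &[usize]) {
    let mut allocated_blocks = 0usize;
    for _block in blocks_in_chunk {
        // ... sweep the block ...
        allocated_blocks += 1;
    }
    // USDT tracepoint: eBPF tracing tools pick this up and attach the count
    // to the SweepChunk work-packet event on the timeline.
    probe!(mmtk, sweep_chunk, allocated_blocks);
}
```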

We also changed immixspace::SweepChunk so that it now asserts that the chunk is allocated rather than skipping unallocated chunks, since ChunkMap::generate_tasks already filters out unallocated chunks.

Fixes: #1146

@wks added the "PR-extended-testing" label (Run extended tests for the pull request) on Jun 26, 2024
@wks (Collaborator, Author) commented Jun 27, 2024

DaCapo

I ran lusearch from DaCapo Chopin. There are four builds: master and this PR, with lazy and eager sweeping for each of them. Three heap factors (3.023x, 3.678x, 4.392x w.r.t. G1 in OpenJDK), 10 invocations each. (plotty link)

image

The STW time of lazy sweeping is not obviously changed.

The mutator time is not obviously changed.

The mutator time of master-eager is significantly lower at the lowest heap factor. That is a result of overly frequent GCs (as seen in the GC counts). Some mutator time is counted into STW time after one mutator triggers a GC but before all mutators come to a stop, as seen in the timeline below. This behavior is only reproducible if I run multiple iterations; the timeline below is collected with the -n5 command-line option of DaCapo, and it is normal with -n1. It may be a result of some existing bug that causes MarkSweep not to release some memory. When running pr-eager, -n10 still gives a reasonable result.

image

As a result, the difference in STW time is mainly a result of the increased number of GCs and the memory-leak bug. The STW time per GC is also unreasonably high for master-eager.

eBPF

The following timeline is from master-eager, lusearch. Although there are many mutators and ReleaseMutator can be parallelized, the Release work packet is busy releasing blocks in the abandoned list, blocking the ReleaseMutator work packets.

image

This is pr-eager running lusearch. Using SweepChunk is not faster than ReleaseMutator: it is not as well load-balanced when the number of chunks is small and some chunks contain unallocated blocks, and the total execution time of all SweepChunk packets is greater than that of all ReleaseMutator work packets.

image

This is master running Liquid using mmtk-ruby (which must use eager sweeping). Because there is only one mutator, the single ReleaseMutator packet cannot be parallelized, and it spent most of its time doing sweeping.

image

This is this PR running Liquid. Sweeping is still slow because of naive_brute_force_sweep, but it is much more load-balanced.

image

Conclusion

  1. There may be some bug in the master branch that causes MarkSweepSpace not to make some memory available to the mutators, making the heap tighter as we increase the number of iterations. This PR solves that problem.
  2. Although SweepChunk is not necessarily faster than doing the sweeping in ReleaseMutator (depending on the workload), it avoids the pathological case of single-threaded applications.

@wks (Collaborator, Author) commented Jun 27, 2024

From the CI tests of the mmtk-ruby repo, this PR does not accelerate the test with MarkSweep. See the execution time of the "Build and test (release, MarkSweep)" job in the following two PRs:

So the execution speed difference is more likely due to other factors, including the mutator speed (lack of an allocation fast path), the GC speed (the time needed to run a whole GC), and the impact of the MemBalancer.

@wks requested a review from qinsoon on June 27, 2024 02:32
```rust
}
}

fn sweep<VM: VMBinding>(&mut self, space: &MarkSweepSpace<VM>) {
fn recategorize_blocks(&mut self) {
```
Member

'recycle' is more clear for what it does.

Suggested change:

```diff
-fn recategorize_blocks(&mut self) {
+fn recycle_blocks(&mut self) {
```

Collaborator Author

I made the change as you suggested.

```rust
abandoned.sweep_later(self);
let num_mutators = VM::VMActivePlan::number_of_mutators();
// all ReleaseMutator work packets plus the ReleaseMarkSweepSpace packet
self.pending_release_packets
```
Member

I suggest just using this mechanism for both eager and lazy sweeping. We always maintain the pending packets count. Just for eager sweeping, when all the packets are done, you generate sweep tasks. This could make things cleaner -- performance wise, the impact should be minimal. Otherwise, we see the conditionally compiled code for eager sweeping scattered everywhere to achieve this one thing.

Collaborator Author

I moved what's in MarkSweepSpace::end_of_gc into recycle_blocks. That is now used by both lazy and eager sweeping. When using lazy sweeping, we directly call recycle_blocks after all ReleaseMutator packets and the ReleaseMarkSweepSpace packet have finished.
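
A rough sketch of that counting mechanism, assuming an atomic counter that every ReleaseMutator packet and the ReleaseMarkSweepSpace packet decrement (the names and structure below are illustrative, not the exact mmtk-core code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counter shared by all release packets of one GC:
// one per mutator (ReleaseMutator) plus one ReleaseMarkSweepSpace packet.
struct PendingReleasePackets {
    remaining: AtomicUsize,
}

impl PendingReleasePackets {
    // Each release packet calls this when it finishes.
    fn packet_done(&self, eager_sweeping: bool) {
        if self.remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            // This was the last release packet of this GC.
            if eager_sweeping {
                // Eager: schedule the SweepChunk packets now;
                // recycle_blocks runs after they all finish.
            } else {
                // Lazy: nothing to sweep in bulk; call recycle_blocks directly.
            }
        }
    }
}
```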


```rust
fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
    let space = unsafe { &*(self as *const Self) };
    let epilogue = Arc::new(RecategorizeBlocks {
```
Member

I have a PR here that extracts the epilogue mechanism outside of the space code: #877. Do you want me to tidy it up and get it merged so you can base this PR on it?

Collaborator Author

I tried tidying it up locally, but I feel that #877 may not be general enough. One example is the pending_release_packets field I put into MarkSweepSpace. Due to our current mechanism of creating the ReleaseMutator work packets, we can't wrap ReleaseMutator into a WorkPacketWithEpilogue, so I just use the space instance as the shared data. Another thing I worry about is that WorkPacketEpilogue wraps a Box<dyn Fn> inside it (it would be better to use FnOnce instead). We may want to mutate the epilogue in some ways after it is created, according to the work packets that will count down the counter.

I also think the "epilogue" may have some capability of collecting data in addition to counting. For example, each unit of work could "abandon" some blocks to a pool which also plays the role of the "epilogue". Maybe the most general mechanism is a simple wrapper around an AtomicUsize that provides a few intuitively named methods and implements something like a countdown latch.

BTW, I have a utility class AfterAll in the Ruby binding. It is more special purposed, but it's not that hard to implement or use.
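
As a minimal sketch of the countdown-latch idea, assuming a plain AtomicUsize wrapper that runs a caller-supplied FnOnce when the last packet counts down (illustrative only; this is not the mmtk-core, #877, or mmtk-ruby AfterAll API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct CountdownEpilogue {
    remaining: AtomicUsize,
    on_finish: Mutex<Option<Box<dyn FnOnce() + Send>>>,
}

impl CountdownEpilogue {
    fn new(count: usize, f: impl FnOnce() + Send + 'static) -> Self {
        Self {
            remaining: AtomicUsize::new(count),
            on_finish: Mutex::new(Some(Box::new(f))),
        }
    }

    // Called by each work packet when it finishes; the last caller runs the epilogue.
    fn count_down(&self) {
        if self.remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            if let Some(f) = self.on_finish.lock().unwrap().take() {
                f();
            }
        }
    }
}
```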

```diff
@@ -78,6 +79,9 @@ pub struct MarkSweepSpace<VM: VMBinding> {
    /// these block lists in the space. These lists are only filled in the release phase,
    /// and will be moved to the abandoned lists above at the end of a GC.
    abandoned_in_gc: Mutex<AbandonedBlockLists>,
    /// Count the number of pending release packets during the `Release` stage.
```
Member

The comment seems too general. There might be other work packets that the plan is not aware of. The plan only cares about packets that mutate block lists, not all the packets in the release stage.

Collaborator Author

Yes. That makes sense. I'll explicitly mention ReleaseMarkSweepSpace and ReleaseMutator in the comment.

@qinsoon (Member) left a comment

LGTM

@wks added this pull request to the merge queue on Jun 28, 2024
Merged via the queue into mmtk:master with commit 06adfb0 Jun 28, 2024
25 checks passed
@wks deleted the feature/ms-sweep-chunk branch on June 28, 2024 02:22
github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2024
The purpose of this PR includes:

1. Providing a systematic way to add additional attributes to events on
the timeline generated from eBPF traces.
2. Adding additional USDT probes in mmtk-core and display more useful
information on the timeline.
3. Allowing users, especially VM binding developers, to extend the
tracing scripts to visualize additional USDT probes in the binding.

In previous PRs #1154 and
#1158, we use "meta" events to add
attributes to work packets, including the number of slots processed by
ProcessEdgesWork and the number of allocated blocks visited by
SweepChunk. This PR refactors the scripts to standardize the use of
"meta" events. Specifically

- The Rust program executes USDT trace points to expose additional
details (such as the number of slots processed) during the execution of
a work packet,
- `capture.bt` emits "meta" events for such USDTs, and
- `visualize.py` adds attributes to the current unfinished GC event
and/or work packet event of the current thread.

In this way, the information about a GC or a work packet can be
gradually added by multiple USDT trace points, providing more
flexibility than having to provide all information at the beginning or
the end of the work packet.

Using this mechanism, we added a few more USDT trace points in
mmtk-core, including:

- Generational plans trigger a USDT trace point when they have decided if
the current GC is a full-heap GC. The information is displayed on the
big "GC" bar (in "Thread 0") for each GC on the timeline.
- Immix-based plans trigger a USDT trace point when they have decided if
the current GC is a defrag GC. Displayed on the "GC" bar, too.
- Root-scanning packets now have "roots" attributes showing how many
root slots or root nodes are reported.
- ScanObjects work packets (as well as ProcessEdgesWork packets that
scan objects immediately) now display the number of objects scanned.

We also introduce extension mechanisms to `capture.py` and
`visualize.py`. Both of them now accept the `-x` or `--extra` command
line argument. Specifically,

- `capture.py` uses `-x` to specify an additional script to be appended
after the `capture.bt` script.
- `visualize.py` uses `-x` to specify a Python script to handle unknown
event names.

The extension mechanism allows VM binding developers to insert custom
USDT trace points in the VM binding, and extend the timeline tools to
visualize those custom events.
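
As a hedged illustration of the "meta" event mechanism described in that commit, the Rust side of such gradual annotation might look like the sketch below; the provider and probe names and the helper function are made up for this example:

```rust
use probe::probe;

// Illustrative only: one packet reports details at several points during its
// execution, and the visualizer merges them into attributes of that packet's
// event on the timeline.
fn scan_roots_then_objects(root_slots: &[usize], objects: &[usize]) {
    // Attribute 1: how many root slots this packet reported.
    probe!(mmtk, roots, root_slots.len());

    let mut scanned = 0usize;
    for _object in objects {
        // ... scan the object, enqueue its children ...
        scanned += 1;
    }
    // Attribute 2: how many objects this packet scanned.
    probe!(mmtk, scan_objects, scanned);
}
```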
Development

Successfully merging this pull request may close these issues.

Native MarkSweep: bad load balancing with single-threaded workloads