
Add SweepChunk to native MarkSweepSpace #1158

Merged: 11 commits, Jun 28, 2024

Conversation

@wks (Collaborator) commented Jun 26, 2024

This PR re-introduces the SweepChunk work packet to the native MarkSweepSpace when using eager sweeping. The main intention of this PR is to fix a pathological case where there is only one mutator and the single ReleaseMutator work packet cannot be parallelized.

The algorithm is unchanged for lazy sweeping.

When doing eager sweeping,

  1. We first release all unmarked blocks in MarkSweepSpace::abandoned and each mutator.
  2. And then we sweep blocks in parallel using SweepChunk work packets.
  3. We then go through all "consumed" blocks and move them into "available" lists if they have any free cells.

In step (1), we release blocks for each mutator in ReleaseMutator. Releasing blocks is very fast, so the lack of parallelism is not a bottleneck. During step (2), all blocks are either unallocated or marked, so we don't need to move any block out of linked lists, avoiding the data race we observed in #1103. Step (3) is done by one thread, but it is fast enough not to be a performance bottleneck.
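
To make the three steps concrete, here is a self-contained toy sketch; the `Block` and `BlockLists` types and the helper functions below are invented for illustration and do not correspond to the actual mmtk-core code:

```rust
// Toy model of the eager-sweeping flow; types and lists are invented here.
#[derive(Clone, Copy)]
struct Block {
    marked: bool,
    free_cells: usize,
}

struct BlockLists {
    available: Vec<Block>,
    consumed: Vec<Block>,
}

// Step 1 (ReleaseMutator / ReleaseMarkSweepSpace): release unmarked blocks.
// This only unlinks blocks from the lists, so it is cheap even without parallelism.
fn release_unmarked(lists: &mut BlockLists) {
    lists.available.retain(|b| b.marked);
    lists.consumed.retain(|b| b.marked);
}

// Step 2 (SweepChunk, one packet per chunk, run in parallel): sweep the
// surviving blocks in place; no block is moved between lists here.
fn sweep_blocks(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        b.free_cells = 4; // stand-in for the real cell-by-cell sweep
    }
}

// Step 3 (single thread): move "consumed" blocks that regained free cells
// back into the "available" list so mutators can allocate from them again.
fn recycle_blocks(lists: &mut BlockLists) {
    let mut still_consumed = Vec::new();
    for b in lists.consumed.drain(..) {
        if b.free_cells > 0 {
            lists.available.push(b);
        } else {
            still_consumed.push(b);
        }
    }
    lists.consumed = still_consumed;
}
```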

We also introduced a work packet ReleaseMarkSweepSpace which does what MarkSweepSpace::release did, but is executed concurrently with ReleaseMutator work packets. In this way, we don't do too much work in the Release work packet, which is a problem we discussed in #1147.

We removed the constant ABANDON_BLOCKS_IN_RESET because abandoning blocks at that point is clearly necessary: otherwise one mutator would keep too many blocks locally, preventing other mutators from getting available blocks to use.

We added a USDT tracepoint in SweepChunk in both MarkSweepSpace and ImmixSpace so that we can see the number of allocated blocks a SweepChunk work packet visited in the timeline generated by the eBPF tracing tools.
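
For illustration, such a tracepoint can be fired from Rust with the `probe` crate's `probe!` macro (which mmtk-core uses for its USDT probes); the function, provider, and probe names below are a sketch rather than the exact code added by this PR:

```rust
use probe::probe;

// Sketch of a SweepChunk body reporting how many allocated blocks it visited.
fn sweep_chunk(blocks_in_chunk: &[usize]) {
    let mut allocated_blocks = 0usize;
    for _block in blocks_in_chunk {
        // ... sweep the block ...
        allocated_blocks += 1;
    }
    // USDT tracepoint: eBPF tracing tools pick this up and attach the count
    // to the SweepChunk work-packet event on the timeline.
    probe!(mmtk, sweep_chunk, allocated_blocks);
}
```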

We also changed immixspace::SweepChunk so that it now asserts that the chunk is allocated rather than skipping unallocated chunks, since ChunkMap::generate_tasks already filters out unallocated chunks.

Fixes: #1146

@wks added the "PR-extended-testing" label (Run extended tests for the pull request) on Jun 26, 2024
@wks (Collaborator, Author) commented Jun 27, 2024

DaCapo

I ran lusearch from DaCapo Chopin. There are four builds: master and this PR, with lazy and eager sweeping for each of them. Three heap factors (3.023x, 3.678x, 4.392x w.r.t. G1 in OpenJDK), 10 invocations each. (plotty link)

image

The STW time of lazy sweeping is not obviously changed.

The mutator time is not obviously changed.

The mutator time of master-eager is significantly lower at the lowest heap factor. That is a result of overly frequent GCs (as seen in the GC counts). Some mutator time is counted into STW time after one mutator triggers a GC but before all mutators come to a stop, as seen in the timeline below. This behavior is only reproducible if I run multiple iterations; the timeline below is collected with the -n5 command-line option of DaCapo, and it is normal with -n1. It may be a result of some existing bug that causes MarkSweep not to release some memory. When running pr-eager, -n10 still gives a reasonable result.

image

As a result, the difference in STW time is mainly a result of the increased number of GCs and the memory-leak bug. The STW time per GC is also unreasonably high for master-eager.

eBPF

The following timeline is from master-eager, lusearch. Although there are many mutators and ReleaseMutator can be parallelized, the Release work packet is busy releasing blocks in the abandoned list, blocking the ReleaseMutator work packets.

image

This is pr-eager running lusearch. Using SweepChunk is not faster than ReleaseMutator: it is not as well load-balanced when the number of chunks is small and some chunks contain unallocated blocks, and the total execution time of all SweepChunk packets is greater than that of all ReleaseMutator work packets.

image

This is master running Liquid using mmtk-ruby (which must use eager sweeping). Because there is only one mutator, the single ReleaseMutator packet cannot be parallelized, and it spent most of its time doing sweeping.

image

This is this PR running Liquid. Sweeping is still slow because of naive_brute_force_sweep, but it is much more load-balanced.

image

Conclusion

  1. There may be some bug in the master branch that causes MarkSweepSpace not to make some memory available to the mutators, making the heap tighter as we increase the number of iterations. This PR solves that problem.
  2. Although SweepChunk is not necessarily faster than doing the sweeping in ReleaseMutator (depending on the workload), it avoids the pathological case of single-threaded applications.

@wks (Collaborator, Author) commented Jun 27, 2024

From the CI tests of the mmtk-ruby repo, this PR does not accelerate the test with MarkSweep. See the execution time of the "Build and test (release, MarkSweep)" job in the following two PRs:

So the execution speed difference is more likely due to other factors, including the mutator speed (lack of an allocation fast path), the GC speed (the time needed to run a whole GC), and the impact of the MemBalancer.

@wks requested a review from qinsoon on June 27, 2024 02:32
```rust
}
}

fn sweep<VM: VMBinding>(&mut self, space: &MarkSweepSpace<VM>) {
fn recategorize_blocks(&mut self) {
```
Member

'recycle' is more clear for what it does.

Suggested change:

```diff
-fn recategorize_blocks(&mut self) {
+fn recycle_blocks(&mut self) {
```

Collaborator Author

I made the change as you suggested.

```rust
abandoned.sweep_later(self);
let num_mutators = VM::VMActivePlan::number_of_mutators();
// all ReleaseMutator work packets plus the ReleaseMarkSweepSpace packet
self.pending_release_packets
```
Member

I suggest just using this mechanism for both eager and lazy sweeping. We always maintain the pending packets count. Just for eager sweeping, when all the packets are done, you generate sweep tasks. This could make things cleaner -- performance wise, the impact should be minimal. Otherwise, we see the conditionally compiled code for eager sweeping scattered everywhere to achieve this one thing.

Collaborator Author

I moved what's in MarkSweepSpace::end_of_gc into recycle_blocks. That is now used by both lazy and eager sweeping. When using lazy sweeping, we directly call recycle_blocks after all ReleaseMutator packets and the ReleaseMarkSweepSpace packet have finished.
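
A rough sketch of that counting mechanism, assuming an atomic counter that every ReleaseMutator packet and the ReleaseMarkSweepSpace packet decrement (the names and structure below are illustrative, not the exact mmtk-core code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counter shared by all release packets of one GC:
// one per mutator (ReleaseMutator) plus one ReleaseMarkSweepSpace packet.
struct PendingReleasePackets {
    remaining: AtomicUsize,
}

impl PendingReleasePackets {
    // Each release packet calls this when it finishes.
    fn packet_done(&self, eager_sweeping: bool) {
        if self.remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            // This was the last release packet of this GC.
            if eager_sweeping {
                // Eager: schedule the SweepChunk packets now;
                // recycle_blocks runs after they all finish.
            } else {
                // Lazy: nothing to sweep in bulk; call recycle_blocks directly.
            }
        }
    }
}
```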


```rust
fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
    let space = unsafe { &*(self as *const Self) };
    let epilogue = Arc::new(RecategorizeBlocks {
```
Member

I have a PR here that extracts the epilogue mechanism outside of the space code: #877. Do you want me to tidy it up and get it merged so you can base this PR on it?

Collaborator Author

I tried tidying it up locally, but I feel that #877 may not be general enough. One example is the pending_release_packets field I put into MarkSweepSpace. Due to our current mechanism of creating the ReleaseMutator work packets, we can't wrap ReleaseMutator into a WorkPacketWithEpilogue, so I just use the space instance as the shared data. Another thing I worry about is that WorkPacketEpilogue wraps a Box<dyn Fn> inside it (it would be better to use FnOnce instead). We may want to mutate the epilogue in some ways after it is created, according to the work packets that will count down the counter.

I also think the "epilogue" may have some capability of collecting data in addition to counting. For example, each unit of work could "abandon" some blocks to a pool which also plays the role of the "epilogue". Maybe the most general mechanism is a simple wrapper around an AtomicUsize that provides a few intuitively named methods and implements something like a countdown latch.

BTW, I have a utility class AfterAll in the Ruby binding. It is more special purposed, but it's not that hard to implement or use.
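
As a minimal sketch of the countdown-latch idea, assuming a plain AtomicUsize wrapper that runs a caller-supplied FnOnce when the last packet counts down (illustrative only; this is not the mmtk-core, #877, or mmtk-ruby AfterAll API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct CountdownEpilogue {
    remaining: AtomicUsize,
    on_finish: Mutex<Option<Box<dyn FnOnce() + Send>>>,
}

impl CountdownEpilogue {
    fn new(count: usize, f: impl FnOnce() + Send + 'static) -> Self {
        Self {
            remaining: AtomicUsize::new(count),
            on_finish: Mutex::new(Some(Box::new(f))),
        }
    }

    // Called by each work packet when it finishes; the last caller runs the epilogue.
    fn count_down(&self) {
        if self.remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            if let Some(f) = self.on_finish.lock().unwrap().take() {
                f();
            }
        }
    }
}
```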

```diff
@@ -78,6 +79,9 @@ pub struct MarkSweepSpace<VM: VMBinding> {
    /// these block lists in the space. These lists are only filled in the release phase,
    /// and will be moved to the abandoned lists above at the end of a GC.
    abandoned_in_gc: Mutex<AbandonedBlockLists>,
    /// Count the number of pending release packets during the `Release` stage.
```
Member

The comment seems too general. There might be other work packets that the plan is not aware of. The plan only cares about packets that mutate block lists, not all the packets in the release stage.

Collaborator Author

Yes. That makes sense. I'll explicitly mention ReleaseMarkSweepSpace and ReleaseMutator in the comment.

@qinsoon (Member) left a comment

LGTM

@wks added this pull request to the merge queue on Jun 28, 2024
Merged via the queue into mmtk:master with commit 06adfb0 Jun 28, 2024
25 checks passed
@wks deleted the feature/ms-sweep-chunk branch on June 28, 2024 02:22
github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2024
The purpose of this PR includes:

1. Providing a systematic way to add additional attributes to events on
the timeline generated from eBPF traces.
2. Adding additional USDT probes in mmtk-core and display more useful
information on the timeline.
3. Allowing users, especially VM binding developers, to extend the
tracing scripts to visualize additional USDT probes in the binding.

In previous PRs #1154 and
#1158, we use "meta" events to add
attributes to work packets, including the number of slots processed by
ProcessEdgesWork and the number of allocated blocks visited by
SweepChunk. This PR refactors the scripts to standardize the use of
"meta" events. Specifically

- The Rust program executes USDT trace points to expose additional
details (such as the number of slots processed) during the execution of
a work packet,
- `capture.bt` emits "meta" events for such USDTs, and
- `visualize.py` adds attributes to the current unfinished GC event
and/or work packet event of the current thread.

In this way, the information about a GC or a work packet can be
gradually added by multiple USDT trace points, providing more
flexibility than having to provide all information at the beginning or
the end of the work packet.

Using this mechanism, we added a few more USDT trace points in
mmtk-core, including:

- Generational plans trigger a USDT trace point when they have decided if
the current GC is a full-heap GC. The information is displayed on the
big "GC" bar (in "Thread 0") for each GC on the timeline.
- Immix-based plans trigger a USDT trace point when they have decided if
the current GC is a defrag GC. Displayed on the "GC" bar, too.
- Root-scanning packets now have "roots" attributes showing how many
root slots or root nodes are reported.
- ScanObjects work packets (as well as ProcessEdgesWork packets that
scan objects immediately) now display the number of objects scanned.

We also introduce extension mechanisms to `capture.py` and
`visualize.py`. Both of them now accept the `-x` or `--extra` command
line argument. Specifically,

- `capture.py` uses `-x` to specify an additional script to be appended
after the `capture.bt` script.
- `visualize.py` uses `-x` to specify a Python script to handle unknown
event names.

The extension mechanism allows VM binding developers to insert custom
USDT trace points in the VM binding, and extend the timeline tools to
visualize those custom events.
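
As a hedged illustration of the "meta" event mechanism described in that commit, the Rust side of such gradual annotation might look like the sketch below; the provider and probe names and the helper function are made up for this example:

```rust
use probe::probe;

// Illustrative only: one packet reports details at several points during its
// execution, and the visualizer merges them into attributes of that packet's
// event on the timeline.
fn scan_roots_then_objects(root_slots: &[usize], objects: &[usize]) {
    // Attribute 1: how many root slots this packet reported.
    probe!(mmtk, roots, root_slots.len());

    let mut scanned = 0usize;
    for _object in objects {
        // ... scan the object, enqueue its children ...
        scanned += 1;
    }
    // Attribute 2: how many objects this packet scanned.
    probe!(mmtk, scan_objects, scanned);
}
```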
Development

Successfully merging this pull request may close these issues.

Native MarkSweep: bad load balancing with single-threaded workloads