Add SweepChunk to native MarkSweepSpace #1158
Conversation
DaCapo

I ran lusearch from DaCapo Chopin. There are four builds: master and this PR, with lazy and eager sweeping for each of them. Three heap factors (3.023x, 3.678x, 4.392x w.r.t. G1 in OpenJDK), 10 invocations each. (plotty link)

The STW time of lazy sweeping is not obviously changed. The mutator time is not obviously changed, either.

The mutator time of master-eager is significantly lower at the lowest heap factor. That is a result of too frequent GCs (as seen in the number of GCs). Some mutator time is counted into STW time after one mutator triggers a GC but before all mutators come to a stop, as seen in the timeline below. This behavior is only reproducible if I run multiple iterations, and the timeline below is collected with the …

As a result, the difference in STW time is mainly a result of the increased number of GCs and the memory leak bug. The STW time per GC is also unreasonably high for master-eager.

eBPF

The following timeline is from master-eager, lusearch. Although there are many mutators and …

This is pr-eager running lusearch. Using …

This is master running Liquid using mmtk-ruby (must be eager). Because there is only one mutator, the …

This is this PR running Liquid. Sweeping is still slow because of …

Conclusion
From the CI tests of the mmtk-ruby repo, this PR does not accelerate the test with MarkSweep. See the execution time of the "Build and test (release, MarkSweep)" job in the following two PRs: …

So the execution speed difference is more likely due to other factors, including the mutator speed (lack of an allocation fast path), the GC speed (the time needed to run a whole GC), and the impact of the MemBalancer.
```diff
     }
 }

-    fn sweep<VM: VMBinding>(&mut self, space: &MarkSweepSpace<VM>) {
+    fn recategorize_blocks(&mut self) {
```
'recycle' is clearer about what it does.

```diff
-    fn recategorize_blocks(&mut self) {
+    fn recycle_blocks(&mut self) {
```
I made the change as you suggested.
```rust
abandoned.sweep_later(self);
let num_mutators = VM::VMActivePlan::number_of_mutators();
// all ReleaseMutator work packets plus the ReleaseMarkSweepSpace packet
self.pending_release_packets
```
I suggest just using this mechanism for both eager and lazy sweeping. We always maintain the pending-packet count; just for eager sweeping, when all the packets are done, you generate sweep tasks. This could make things cleaner, and performance-wise the impact should be minimal. Otherwise, we end up with conditionally compiled code for eager sweeping scattered everywhere to achieve this one thing.
I moved what's in `MarkSweepSpace::end_of_gc` into `recycle_blocks`. That is now used by both lazy and eager sweeping. When using lazy sweeping, we directly call `recycle_blocks` after all `ReleaseMutator` packets and the `ReleaseMarkSweepSpace` packet have finished.
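A minimal sketch of that wiring, under simplified types: the field name `pending_release_packets` and the method `recycle_blocks` come from this PR, but everything else here is illustrative, not the actual mmtk-core code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct SpaceSketch {
    pending_release_packets: AtomicUsize,
}

impl SpaceSketch {
    /// Called at the end of each ReleaseMutator packet and of the
    /// ReleaseMarkSweepSpace packet.
    fn release_packet_done(&self) {
        // fetch_sub returns the previous value, so 1 means this call
        // retired the last pending packet.
        if self.pending_release_packets.fetch_sub(1, Ordering::SeqCst) == 1 {
            // Lazy sweeping: recycle blocks directly. Eager sweeping would
            // instead generate SweepChunk packets here and recycle after
            // the last SweepChunk finishes.
            self.recycle_blocks();
        }
    }

    fn recycle_blocks(&self) {
        // Move blocks from the per-GC lists back to the usable lists.
    }
}
```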
```rust
fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
    let space = unsafe { &*(self as *const Self) };
    let epilogue = Arc::new(RecategorizeBlocks {
```
I have a PR here that moves the epilogue mechanism out of the space code: #877. Do you want me to tidy it up and get it merged so you can base this PR on it?
I tried tidying it up locally, but I feel that #877 may not be general enough. One example is the `pending_release_packets` field I put into `MarkSweepSpace`. Due to our current mechanism of creating the `ReleaseMutator` work packets, we can't wrap `ReleaseMutator` in a `WorkPacketWithEpilogue`, so I just use the space instance as the shared data. Another thing I worry about is that `WorkPacketEpilogue` wraps a `Box<dyn Fn>` inside it (it would be better to use `FnOnce` instead). We may want to mutate the epilogue in some way after it is created, according to the work packets that will count down the counter.

I also think the "epilogue" could have some capability of collecting data in addition to counting. For example, each unit of work could "abandon" some blocks to a pool which also plays the role of the "epilogue". Maybe the most general mechanism is a simple wrapper of `AtomicUsize` that provides a few intuitively named methods and implements something like a countdown latch, as sketched below.

BTW, I have a utility class `AfterAll` in the Ruby binding. It is more special-purpose, but it's not that hard to implement or use.
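A minimal sketch of that countdown-latch idea, assuming nothing from mmtk-core or the Ruby binding's `AfterAll` (all names here are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Runs `epilogue` once after `count_down` has been called `count` times.
struct CountDownLatch {
    count: AtomicUsize,
    // Mutex<Option<...>> lets us take the FnOnce out exactly once.
    epilogue: Mutex<Option<Box<dyn FnOnce() + Send>>>,
}

impl CountDownLatch {
    fn new(count: usize, epilogue: impl FnOnce() + Send + 'static) -> Self {
        Self {
            count: AtomicUsize::new(count),
            epilogue: Mutex::new(Some(Box::new(epilogue))),
        }
    }

    fn count_down(&self) {
        if self.count.fetch_sub(1, Ordering::SeqCst) == 1 {
            // We are the last unit of work; run the epilogue exactly once.
            if let Some(f) = self.epilogue.lock().unwrap().take() {
                f();
            }
        }
    }
}
```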
```diff
@@ -78,6 +79,9 @@ pub struct MarkSweepSpace<VM: VMBinding> {
     /// these block lists in the space. These lists are only filled in the release phase,
     /// and will be moved to the abandoned lists above at the end of a GC.
     abandoned_in_gc: Mutex<AbandonedBlockLists>,
+    /// Count the number of pending release packets during the `Release` stage.
```
The comment seems too general. There might be other work packets that the plan is not aware of. The plan only cares about packets that mutate block lists, not all the packets in the release stage.
Yes, that makes sense. I'll explicitly mention `ReleaseMarkSweepSpace` and `ReleaseMutator` in the comment.
LGTM
The purpose of this PR includes:

1. Providing a systematic way to add additional attributes to events on the timeline generated from eBPF traces.
2. Adding additional USDT probes in mmtk-core and displaying more useful information on the timeline.
3. Allowing users, especially VM binding developers, to extend the tracing scripts to visualize additional USDT probes in the binding.

In previous PRs #1154 and #1158, we used "meta" events to add attributes to work packets, including the number of slots processed by `ProcessEdgesWork` and the number of allocated blocks visited by `SweepChunk`. This PR refactors the scripts to standardize the use of "meta" events. Specifically,

- the Rust program executes USDT trace points to expose additional details (such as the number of slots processed) during the execution of a work packet,
- `capture.bt` emits "meta" events for such USDTs, and
- `visualize.py` adds attributes to the current unfinished GC event and/or work packet event of the current thread.

In this way, the information about a GC or a work packet can be added gradually by multiple USDT trace points, providing more flexibility than having to provide all the information at the beginning or the end of the work packet. A sketch of such a trace point follows this description.

Using this mechanism, we added a few more USDT trace points in mmtk-core:

- Generational plans trigger a USDT trace point when they have decided whether the current GC is a full-heap GC. The information is displayed on the big "GC" bar (in "Thread 0") for each GC on the timeline.
- Immix-based plans trigger a USDT trace point when they have decided whether the current GC is a defrag GC. This is displayed on the "GC" bar, too.
- Root-scanning packets now have "roots" attributes showing how many root slots or root nodes are reported.
- `ScanObjects` work packets (as well as `ProcessEdgesWork` packets that scan objects immediately) now display the number of objects scanned.

We also introduced extension mechanisms for `capture.py` and `visualize.py`. Both of them now accept the `-x` or `--extra` command-line argument. Specifically,

- `capture.py` uses `-x` to specify an additional script to be appended after the `capture.bt` script, and
- `visualize.py` uses `-x` to specify a Python script to handle unknown event names.

The extension mechanism allows VM binding developers to insert custom USDT trace points in the VM binding and extend the timeline tools to visualize those custom events.
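As a hedged illustration of such a trace point: the `probe!` macro comes from the `probe` crate, which mmtk-core uses for USDT trace points, but the provider/probe names and the surrounding function here are assumptions for this sketch, not the actual mmtk-core code.

```rust
use probe::probe;

/// A work-packet-like function that reports extra details mid-execution.
fn process_slots(slots: &[usize]) {
    for _slot in slots {
        // ... trace each slot ...
    }
    // Fire a USDT trace point carrying the number of slots processed.
    // capture.bt would emit a "meta" event for it, and visualize.py would
    // attach the count as an attribute of the current work packet event.
    probe!(mmtk, process_slots, slots.len());
}

fn main() {
    process_slots(&[0; 16]);
}
```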
This PR re-introduces the `SweepChunk` work packet to the native `MarkSweepSpace` when using eager sweeping. The main intention of this PR is fixing a pathological case where there is only one mutator and the single `ReleaseMutator` work packet cannot be parallelized. The algorithm is unchanged for lazy sweeping.

When doing eager sweeping, we sweep in three steps: (1) release unmarked blocks from `MarkSweepSpace::abandoned` and each mutator, (2) sweep the remaining allocated blocks in parallel using `SweepChunk` work packets, and (3) recycle the swept blocks. In step (1), we release blocks for each mutator in `ReleaseMutator`. Releasing blocks is very fast, so parallelism is not a bottleneck. During step (2), all blocks are either unallocated or marked, so we don't need to move any block out of linked lists, avoiding the data race we observed in #1103. Step (3) is done by one thread, but it is fast enough not to be a performance bottleneck.
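For illustration, here is a simplified sketch of how a per-chunk sweeping packet for step (2) might be shaped. The types and methods below are stand-ins, not the actual mmtk-core API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(PartialEq)]
enum BlockState {
    Unallocated,
    Marked,
}

struct Block {
    state: BlockState,
}

impl Block {
    fn sweep(&self) {
        // Sweep dead cells inside a marked block; no block list is touched.
    }
}

struct Chunk {
    blocks: Vec<Block>,
}

/// One packet sweeps every allocated block in one chunk, so step (2)
/// parallelizes across chunks even when there is only one mutator.
struct SweepChunk {
    chunk: Chunk,
    pending: Arc<AtomicUsize>, // shared by all SweepChunk packets of this GC
}

impl SweepChunk {
    fn do_work(&self) {
        let mut allocated_blocks = 0;
        for block in &self.chunk.blocks {
            // After step (1), every block is either unallocated or marked,
            // so nothing is unlinked from block lists here (no data race).
            if block.state != BlockState::Unallocated {
                block.sweep();
                allocated_blocks += 1;
            }
        }
        // A USDT trace point could report `allocated_blocks` for the
        // eBPF timeline, e.g. probe!(mmtk, sweep_chunk, allocated_blocks).
        let _ = allocated_blocks;
        if self.pending.fetch_sub(1, Ordering::SeqCst) == 1 {
            // Last packet finished: run step (3) (recycling) on one thread.
        }
    }
}
```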
We also introduced a work packet `ReleaseMarkSweepSpace`, which does what `MarkSweepSpace::release` did, but is executed concurrently with the `ReleaseMutator` work packets. In this way, we don't do too much work in the `Release` work packet, which is a problem we discussed in #1147.
We removed the constant `ABANDON_BLOCKS_IN_RESET` because abandoning blocks there is obviously necessary: otherwise one mutator would keep too many blocks locally, preventing other mutators from getting available blocks to use.

We added a USDT tracepoint in `SweepChunk` in both `MarkSweepSpace` and `ImmixSpace` so that we can see, in the timeline generated by the eBPF tracing tools, the number of allocated blocks a `SweepChunk` work packet visited.

We also changed `immixspace::SweepChunk` so that it now asserts that the chunk is allocated rather than skipping unallocated chunks. `ChunkMap::generate_tasks` already filters out unallocated chunks.

Fixes: #1146