Clear side forwarding bits properly #1138

Merged: 9 commits merged into mmtk:master on Jun 3, 2024

Conversation

@wks wks (Collaborator) commented May 22, 2024

This PR fixes bugs related to the clearing of side forwarding bits.

The side forwarding bits of ImmixSpace are now cleared after moving GCs instead of before every major GC. This eliminates redundant clearing operations during non-moving full-heap GCs (i.e. when not doing defrag), and also ensures that subsequent nursery GCs do not see stale forwarding bits when using StickyImmix, where young objects are also allocated in the ImmixSpace.

This PR also fixes the code for clearing side forwarding bits in a discontiguous CopySpace.
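
As a rough illustration of the intended shape, here is a minimal sketch based on the code fragments quoted later in this conversation (not the PR's exact diff; in particular, bzero_metadata as the bulk-zeroing call is an assumption):

    // During Release of a moving GC: walk only the regions this CopySpace has
    // actually allocated (the space may be discontiguous) and bulk-zero the
    // side forwarding bits there, so the next GC starts with no stale bits.
    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        for (start, size) in self.pr.iterate_allocated_regions() {
            side.bzero_metadata(start, size);
        }
    }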

wks added 5 commits May 22, 2024 13:08
When using StickyImmix, we should clear side forwarding bits in the ImmixSpace in each GC, too. We move the clearing from the start of major GCs to the end of each GC so that we don't add additional preparation work when it is not a nursery GC.
@wks wks marked this pull request as ready for review May 23, 2024 05:40
@wks wks requested a review from qinsoon May 23, 2024 05:40
@qinsoon qinsoon (Member) left a comment

Can you run benchmarks for Immix and SemiSpace? We should expect no perf difference for SemiSpace, and speedup/no diff for Immix, right?

src/policy/copyspace.rs (review thread, resolved)
    // In the beginning of the next GC, no side forwarding bits shall be set.
    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        if is_moving_gc {
            let objects_may_move = !is_defrag_gc || block.is_defrag_source();
Review comment (Member):
Probably add some comments here to make it clearer: if this is a defrag GC, we only need to clear forwarding bits for defrag source blocks; if this is a moving GC but not a defrag GC, we have no other information, and we have to clear forwarding bits for all the blocks.
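
For example, the commented version might read roughly as follows (a sketch only; the clearing call shown at the end, bzero_metadata over the block's byte range, is an assumption rather than the PR's final code):

    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        if is_moving_gc {
            // In a defrag GC, only objects in defrag source blocks can have been
            // forwarded, so only those blocks need their forwarding bits cleared.
            // In a moving GC that is not a defrag GC, we have no such information,
            // so we have to clear the forwarding bits of every block.
            let objects_may_move = !is_defrag_gc || block.is_defrag_source();
            if objects_may_move {
                side.bzero_metadata(block.start(), Block::BYTES);
            }
        }
    }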

src/policy/immix/immixspace.rs (review thread, resolved)
@wks wks (Collaborator, Author) commented May 23, 2024

> Can you run benchmarks for Immix and SemiSpace? We should expect no perf difference for SemiSpace, and speedup/no diff for Immix, right?

This PR only changes the code that involves on-the-side forwarding bits. If the forwarding bits are in the header, as in the default configuration of OpenJDK, there should be no performance difference.

If the forwarding bits are on the side, this PR fixes correctness issues: without it, both CopySpace and ImmixSpace (in StickyImmix) would crash.

I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

@qinsoon qinsoon (Member) commented May 23, 2024

> I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

If we can use side forwarding bits for OpenJDK, it is worth measuring the performance.

@wks wks (Collaborator, Author) commented May 23, 2024

> I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

> If we can use side forwarding bits for OpenJDK, it is worth measuring the performance.

I'll add a feature to the mmtk-openjdk binding to force the forwarding bits to be on the side, and it shall be added to minimal-tests-openjdk. Otherwise the CI has no way to check whether this PR is correct.
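
A rough sketch of what such a feature could look like in the binding's ObjectModel (the feature name matches the forwarding_bits_on_side feature mentioned later in this thread; the spec constructors and the offset constant are assumptions about the mmtk-core / mmtk-openjdk APIs, not the actual patch):

    // In the OpenJDK binding's ObjectModel implementation: put the forwarding
    // bits on the side when the feature is enabled, otherwise keep them in the
    // object header as usual.
    #[cfg(feature = "forwarding_bits_on_side")]
    const LOCAL_FORWARDING_BITS_SPEC: VMLocalForwardingBitsSpec =
        VMLocalForwardingBitsSpec::side_first();
    #[cfg(not(feature = "forwarding_bits_on_side"))]
    const LOCAL_FORWARDING_BITS_SPEC: VMLocalForwardingBitsSpec =
        VMLocalForwardingBitsSpec::in_header(FORWARDING_BITS_OFFSET); // illustrative offset constant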

@wks wks (Collaborator, Author) commented May 24, 2024

binding-refs
OPENJDK_BINDING_REPO=wks/mmtk-openjdk
OPENJDK_BINDING_REF=fix/side-fwd-bits

@wks wks (Collaborator, Author) commented May 24, 2024

There is a deadlock happening in the minimal binding test. It happens when running lusearch from DaCapo 2006. The bug is reproducible locally, and it does not require the forwarding bits to be on the side (i.e. it is not related to this PR or the OpenJDK PR). I have reported it in mmtk/mmtk-openjdk#278. Since the deadlock is so easy to trigger, I'll dig into that bug before continuing with this PR.

@wks wks (Collaborator, Author) commented May 31, 2024

The benchmark results: lusearch from DaCapo Chopin, 3x min heap size, 40 invocations, 3 builds:

  • build1: master (in-header forwarding bits)
  • build2: this PR, with in-header forwarding bits
  • build3: this PR, with side forwarding bits

Note: there is no build of master with side forwarding bits because that would crash.

Plots (normalised to build1): plotty link


Plots (not normalised): plotty link


Note that the number of GCs is stuck at 2047 for GenImmix (the counter overflowed); the actual number of GCs should be much higher.

build1 vs build2

The difference between build1 and build2 is the effect of this PR without enabling side forwarding bits.

Plots (normalised to build1): plotty link


Mostly unchanged. For some plans, such as GenCopy and SemiSpace, the STW time becomes higher while the other time becomes lower. It's likely that some code changes influenced how STW time is measured, counting some other time as STW time. The total time differs with non-overlapping error bars, but the difference is within 1%.

build2 vs build3

The difference between build2 and build3 indicates the cost of enabling side forwarding bits.

Plots (normalised to build2): plotty link

Obviously, the number of GCs increases. That's because the side forwarding bits occupy part of the heap space; as a result, the effective heap size for objects becomes smaller. The STW time increases due to the increased number of GCs and the cost of clearing side forwarding bits. The plans involving CopySpace (SemiSpace, GenCopy and GenImmix) are impacted to a greater degree. The time.other is mostly unchanged, with GenImmix as an exception. But the total time is consistently increased.


@qinsoon qinsoon (Member) commented May 31, 2024

The slowdown of using side forwarding bits is significant (build2 vs build3). It would be good to measure the overhead of side forwarding bits on master -- if it cannot run with Immix or with discontiguous spaces, you can try to find a configuration that runs (e.g. no compressed pointers, SemiSpace).

Also, please include all the benchmarks in the evaluation. It is okay to focus on one benchmark to address specific issues, but we need results on more benchmarks in order to know whether the PR has a performance regression.

@wks wks (Collaborator, Author) commented May 31, 2024

> The slowdown of using side forwarding bits is significant (build2 vs build3).

Yes. Side forwarding bits are not free. The overhead is significant for CopySpace because the entire CopySpace is copied away, so the clearing covers the entire space; it is much lower for ImmixSpace. For Immix and StickyImmix, the total time overhead is within 1%, and the STW time overhead is about 2%-4%.

> It would be good to measure the overhead of side forwarding bits on master -- if it cannot run with Immix or with discontiguous spaces, you can try to find a configuration that runs (e.g. no compressed pointers, SemiSpace).

I am doing a 4-way comparison, comparing master-header, master-side, PR-header and PR-side, using SemiSpace and GenCopy, without compressed oops. Let's wait for the results.

@wks wks (Collaborator, Author) commented May 31, 2024

I added build4: master with side forwarding bits enabled. To avoid confusion, I'll use "master-header", "PR-header", "master-side", "PR-side" as the labels of the builds. I re-ran lusearch, but only using SemiSpace and GenCopy which only involve CopySpace.

Plot (un-normalised): plotty


Plot (normalised to "master-header"): plotty


The increase in the number of GCs and in the times due to moving the forwarding bits to the side (i.e. the ratio of side / header for each number) is similar for both the master branch and this PR. (P.S. I wonder why the number of GCs didn't overflow this time.) The -XX:-UseCompressedOops option affects the number of GCs triggered: when using compressed oops, more GCs are triggered.

For SemiSpace, there is a 1.7% increase in STW time and a 0.5% increase in total time, which is consistent with the result of the previous 3-way comparison.

@wks wks (Collaborator, Author) commented May 31, 2024

I had a look at the timelines using eBPF. The following plots are from running lusearch, 64M heap size (the same as on lynx.moma), on my laptop.

This PR, header, no compressed oops, SemiSpace

This PR, side, no compressed oops, SemiSpace

There is one obvious problem with the current implementation: the clearing of the side forwarding bits is not parallelised. But for lusearch alone, that is not the bottleneck; any load imbalance in the Prepare and Closure stages has a greater impact on STW time.

This PR, side, no compressed oops, GenCopy nursery collection

The impact should be greater for GenCopy because the un-parallelised Release work packet matters more for nursery collections. However, the benchmark results from lynx.moma showed that the impact of side forwarding bits on STW time is greater for SemiSpace. I think that means the un-parallelised Release work packet is only a secondary factor; the primary factors are the decreased effective heap size (because the side metadata takes space) and the cache locality problems of side metadata.

@wks wks (Collaborator, Author) commented Jun 1, 2024

I ran other benchmarks on lynx.moma and bobcat.moma (identical hardware and identical binary), at 3x min heap size w.r.t. G1 in OpenJDK, with no compressed oops (even when running StickyImmix).

Note that all four builds use the mmtk-openjdk binding from mmtk/mmtk-openjdk#277, but only master-side and PR-side have the forwarding_bits_on_side Cargo feature enabled in mmtk-openjdk.

SemiSpace (plotty). Note that biojava and fop failed to run at 3x min heap size with SemiSpace.


StickyImmix (plotty).


Comparing master-header and PR-header, most differences are within the error bars. For SemiSpace, the stw.time for PR-header is noticeably (though only slightly) higher than for master-header. That could be a result of executing the loop for (start, size) in self.pr.iterate_allocated_regions() unconditionally in Release, even though the loop body is a no-op when the VO bit is not enabled either. I'll verify whether that's the cause, but I am not sure it is worth optimising. For StickyImmix, the time.stw for PR-header is sometimes higher and sometimes lower, but within the error bars. I think those differences are just noise, because the clearing of side forwarding bits for ImmixSpace is always guarded by an if statement with a constant condition and can be eliminated by the compiler; the generated code should be identical when the forwarding bits are in the header.
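
For reference, a sketch of the kind of guard being discussed (illustrative only, not the change that later became build3 / wks@b6d24bf; vo_bit_in_use is a stand-in flag):

    // Skip the chunk iteration entirely when neither the VO bit nor on-the-side
    // forwarding bits are in use, so Release stays a no-op in the common
    // in-header configuration.
    let side_forwarding_bits_in_use = matches!(
        *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC,
        MetadataSpec::OnSide(_)
    );
    if vo_bit_in_use || side_forwarding_bits_in_use {
        for (start, size) in self.pr.iterate_allocated_regions() {
            // ... bulk-clear the relevant side metadata for [start, start + size) ...
        }
    }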

Comparing XXX-header and XXX-side, we see the overhead of forcing the forwarding bits to be on the side. The overhead is significantly higher for SemiSpace (only CopySpace) than for StickyImmix (only ImmixSpace). One reason may be that ImmixSpace distributes the clearing operation to SweepChunk, which can be parallelised. Another reason may be that even though StickyImmix executes full-heap GCs often (which is the reality for a tight heap), it does not do copying GC unless it is a defrag collection, and even then only blocks that are defrag sources are copied, which should greatly reduce the amount of clearing to do. However, from the plot of "majorGC / GC" for StickyImmix, I don't see any connection between "majorGC / GC" and the increase in time.stw.

MajorGC per GC: (plotty)


There are two interesting benchmarks: "avrora" and "jython". Their side-metadata overhead is quite large in SemiSpace, but "avrora" sped up when using side forwarding bits, and "jython" only had a very small overhead when using side forwarding bits. The speed-up could be noise (as their error bars are large).

@qinsoon qinsoon (Member) left a comment

LGTM. The performance looks reasonable to me. If you would like to investigate further, that's fine. The PR in its current state is good to merge.

@wks wks (Collaborator, Author) commented Jun 2, 2024

Running jython locally, with SemiSpace, 76M heap size (3x min heap size w.r.t. G1 in OpenJDK), and no compressed oops:

master-header (number of GC = 1697, time.stw = 6489.81 ms, time = 19773ms)

pr-side (number of GC > 2047 (counter overflowed), time.stw = 11954.42 ms, time = 24658ms)

It looks like the 76M heap size is still too small for jython when using SemiSpace. Individual GCs don't look very different, and the Release work packet is trivial compared to the Closure stage. But when using side metadata, it is simply doing more GCs. Let's zoom in to the middle of the execution.

master-header

During some periods of execution (maybe when the program demands more memory), the program spends almost 100% of the time doing GC, and the mutators hardly have any time to execute. From time to time, the mutators get an extended period of execution without triggering GC (probably when the program keeps fewer objects alive). If the heap gets even a little bit smaller (or some portion of the heap is occupied by the side metadata), the thrashing becomes significantly worse, amplifying the overhead of side forwarding bits.

Given that SemiSpace can only make use of 50% of the heap size, I doubled the heap size to 152M, and the problem became much milder.

master-header (GC = 141, time.stw = 590.08 ms, time = 13926 ms)

pr-side (GC = 149, time.stw = 735.66 ms, time = 14202 ms)

The STW time still increased significantly when using side metadata, but the total time is less affected. At this heap size, there are still occasional bursts of frequent GCs. See the following plot:

master-header

Each individual GC still takes a similar time (about 3.5 ms) as at 76M, but the GCs are much less dense. I think it is the nature of the jython benchmark to have short bursts of allocation, while the amount of memory it keeps alive is not very high (that's why each GC can still finish in about 3.5 ms regardless of heap size). Ideally, this kind of workload could benefit from MemBalancer.

FYI, the following timelines are recorded from StickyImmix with 76M heap size:

master-header (GC = 178, time.stw = 458.02, time = 14110)

pr-side (GC = 186, time.stw = 423.02, time = 13823) (Note: I recorded it on my laptop and only ran once. The speed-up may just be noise, but the shape of the timeline is important.)

zoom-in of master-header

The allocation is still bursty, and there are some long GCs (which take up to 9.8ms). But overall it only spent a small amount of time doing GC.

@wks wks (Collaborator, Author) commented Jun 2, 2024

avrora is simpler. GC happens at regular intervals. The shape of the timeline of individual GCs doesn't change, but every GC becomes longer, and it runs more GCs. It simply takes longer to do the Closure. I think that is a result of the poorer cache locality of side metadata.

master-header SemiSpace (GC = 137, time.stw = 283 ms, time = 6078 ms)

pr-side SemiSpace (GC = 164, time.stw = 387 ms, time = 6105 ms)

The 100th GC of master-header

The 100th GC of PR-side

@wks wks (Collaborator, Author) commented Jun 2, 2024

I re-ran lusearch with SemiSpace, but this time with three builds:

  • build1: master
  • build2: this PR
  • build3: this PR plus a conditional if statement so that the Release work packet won't iterate through the chunks if not using VO bits or side forwarding bits. (wks@b6d24bf)

All builds use in-header forwarding bits. I added build3 to see whether the unconditional iteration over chunks is the cause of the small STW time overhead.

But experiments show that build3 has even greater overhead than build2. (plotty)


I think the effect of the if statement is negligible, and inlining decisions made by the compiler probably have a more significant effect than that.

@wks wks added this pull request to the merge queue Jun 3, 2024
Merged via the queue into mmtk:master with commit a9b619a Jun 3, 2024
24 checks passed
@wks wks deleted the fix/side-fwd-bits branch June 3, 2024 02:58
mmtkgc-bot added a commit to mmtk/mmtk-openjdk that referenced this pull request Jun 3, 2024
Added a feature to force using on-the-side forwarding bits. This is for
testing mmtk-core. The CI script for minimal test is also extended to
run several plans involving CopySpace and ImmixSpace on OpenJDK with
side forwarding bits.

Related PR: mmtk/mmtk-core#1138. The feature introduced in this PR can be used to check the correctness of that PR.

---------

Co-authored-by: mmtkgc-bot <mmtkgc.bot@gmail.com>
@wks wks mentioned this pull request Jun 3, 2024