Clear side forwarding bits properly #1138

Merged: 9 commits merged into mmtk:master on Jun 3, 2024

Conversation

@wks wks (Collaborator) commented May 22, 2024

This PR fixes bugs related to the clearing of side forwarding bits.

The side forwarding bits of ImmixSpace are now cleared after moving GCs instead of before every major GC. This eliminates redundant clearing operations during non-moving full-heap GCs (i.e. when not doing defrag), and also ensures that subsequent nursery GCs do not see stale forwarding bits when using StickyImmix, where young objects are also allocated in the ImmixSpace.

This PR also fixes the code for clearing side forwarding bits in a discontiguous CopySpace.
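
As a rough illustration of the intended shape, here is a minimal sketch based on the code fragments quoted later in this conversation (not the PR's exact diff; in particular, bzero_metadata as the bulk-zeroing call is an assumption):

    // During Release of a moving GC: walk only the regions this CopySpace has
    // actually allocated (the space may be discontiguous) and bulk-zero the
    // side forwarding bits there, so the next GC starts with no stale bits.
    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        for (start, size) in self.pr.iterate_allocated_regions() {
            side.bzero_metadata(start, size);
        }
    }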

wks added 5 commits May 22, 2024 13:08
When using StickyImmix, we should clear side forwarding bits in the ImmixSpace in each GC, too. We move the clearing from the start of major GCs to the end of each GC so that we don't add additional preparation work when it is not a nursery GC.
@wks wks marked this pull request as ready for review May 23, 2024 05:40
@wks wks requested a review from qinsoon May 23, 2024 05:40
@qinsoon qinsoon (Member) left a comment

Can you run benchmarks for Immix and SemiSpace? We should expect no perf difference for SemiSpace, and speedup/no diff for Immix, right?

src/policy/copyspace.rs (review thread, resolved)
    // In the beginning of the next GC, no side forwarding bits shall be set.
    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        if is_moving_gc {
            let objects_may_move = !is_defrag_gc || block.is_defrag_source();
Review comment (Member):
Probably add some comments here to make it clearer: if this is a defrag GC, we only need to clear forwarding bits for defrag source blocks; if this is a moving GC but not a defrag GC, we have no other information, and we have to clear forwarding bits for all the blocks.
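
For example, the commented version might read roughly as follows (a sketch only; the clearing call shown at the end, bzero_metadata over the block's byte range, is an assumption rather than the PR's final code):

    if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC {
        if is_moving_gc {
            // In a defrag GC, only objects in defrag source blocks can have been
            // forwarded, so only those blocks need their forwarding bits cleared.
            // In a moving GC that is not a defrag GC, we have no such information,
            // so we have to clear the forwarding bits of every block.
            let objects_may_move = !is_defrag_gc || block.is_defrag_source();
            if objects_may_move {
                side.bzero_metadata(block.start(), Block::BYTES);
            }
        }
    }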

src/policy/immix/immixspace.rs (review thread, resolved)
@wks wks (Collaborator, Author) commented May 23, 2024

> Can you run benchmarks for Immix and SemiSpace? We should expect no perf difference for SemiSpace, and speedup/no diff for Immix, right?

This PR only changes the code that involves on-the-side forwarding bits. If the forwarding bits are in the header, as in the default configuration of OpenJDK, there should be no performance difference.

If the forwarding bits are on the side, this PR fixes correctness issues: without it, both CopySpace and ImmixSpace (in StickyImmix) would crash.

I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

@qinsoon qinsoon (Member) commented May 23, 2024

> I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

If we can use side forwarding bits for OpenJDK, it is worth measuring the performance.

@wks wks (Collaborator, Author) commented May 23, 2024

> I'll run benchmarks using the default settings (in-header forwarding bits) to verify that there is no performance difference.

> If we can use side forwarding bits for OpenJDK, it is worth measuring the performance.

I'll add a feature to the mmtk-openjdk binding to force the forwarding bits to be on the side, and it shall be added to minimal-tests-openjdk. Otherwise the CI has no way to check whether this PR is correct.
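
A rough sketch of what such a feature could look like in the binding's ObjectModel (the feature name matches the forwarding_bits_on_side feature mentioned later in this thread; the spec constructors and the offset constant are assumptions about the mmtk-core / mmtk-openjdk APIs, not the actual patch):

    // In the OpenJDK binding's ObjectModel implementation: put the forwarding
    // bits on the side when the feature is enabled, otherwise keep them in the
    // object header as usual.
    #[cfg(feature = "forwarding_bits_on_side")]
    const LOCAL_FORWARDING_BITS_SPEC: VMLocalForwardingBitsSpec =
        VMLocalForwardingBitsSpec::side_first();
    #[cfg(not(feature = "forwarding_bits_on_side"))]
    const LOCAL_FORWARDING_BITS_SPEC: VMLocalForwardingBitsSpec =
        VMLocalForwardingBitsSpec::in_header(FORWARDING_BITS_OFFSET); // illustrative offset constant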

@wks wks (Collaborator, Author) commented May 24, 2024

binding-refs
OPENJDK_BINDING_REPO=wks/mmtk-openjdk
OPENJDK_BINDING_REF=fix/side-fwd-bits

@wks wks (Collaborator, Author) commented May 24, 2024

There is a deadlock happening in the minimal binding test. It happens when running lusearch from DaCapo 2006. The bug is reproducible locally, and it does not require the forwarding bits to be on the side (i.e. it is not related to this PR or the OpenJDK PR). I have reported it in mmtk/mmtk-openjdk#278. Since the deadlock is so easy to trigger, I'll dig into that bug before continuing with this PR.

@wks wks (Collaborator, Author) commented May 31, 2024

The benchmark results: lusearch from DaCapo Chopin, 3x min heap size, 40 invocations, 3 builds:

  • build1: master (in-header forwarding bits)
  • build2: this PR, with in-header forwarding bits
  • build3: this PR, with side forwarding bits

Note: there is no build of master with side forwarding bits because that would crash.

Plots (normalised to build1): plotty link


Plots (not normalised): plotty link


Note that the number of GCs is stuck at 2047 for GenImmix (the counter overflowed); the actual number of GCs should be much higher.

build1 vs build2

The difference between build1 and build2 is the effect of this PR without enabling side forwarding bits.

Plots (normalised to build1): plotty link


Mostly unchanged. For some plans, such as GenCopy and SemiSpace, the STW time becomes higher while the other time becomes lower. It's likely that some code changes influenced how STW time is measured, counting some other time as STW time. The total time differs with non-overlapping error bars, but the difference is within 1%.

build2 vs build3

The difference between build2 and build3 indicates the cost of enabling side forwarding bits.

Plots (normalised to build2): plotty link

Obviously, the number of GCs increases. That's because the side forwarding bits occupy part of the heap space; as a result, the effective heap size for objects becomes smaller. The STW time increases due to the increased number of GCs and the cost of clearing side forwarding bits. The plans involving CopySpace (SemiSpace, GenCopy and GenImmix) are impacted to a greater degree. The time.other is mostly unchanged, with GenImmix as an exception. But the total time is consistently increased.


@qinsoon qinsoon (Member) commented May 31, 2024

The slowdown of using side forwarding bits is significant (build2 vs build3). It would be good to measure the overhead of side forwarding bits on master -- if it cannot run with Immix or with discontiguous spaces, you can try to find a configuration that runs (e.g. no compressed pointers, SemiSpace).

Also, please include all the benchmarks in the evaluation. It is okay to focus on one benchmark to address specific issues, but we need results on more benchmarks in order to know whether the PR has a performance regression.

@wks wks (Collaborator, Author) commented May 31, 2024

> The slowdown of using side forwarding bits is significant (build2 vs build3).

Yes. Side forwarding bits are not free. The overhead is significant for CopySpace because the entire CopySpace is copied away, so the clearing covers the entire space; it is much lower for ImmixSpace. For Immix and StickyImmix, the total time overhead is within 1%, and the STW time overhead is about 2%-4%.

> It would be good to measure the overhead of side forwarding bits on master -- if it cannot run with Immix or with discontiguous spaces, you can try to find a configuration that runs (e.g. no compressed pointers, SemiSpace).

I am doing a 4-way comparison, comparing master-header, master-side, PR-header and PR-side, using SemiSpace and GenCopy, without compressed oops. Let's wait for the results.

@wks wks (Collaborator, Author) commented May 31, 2024

I added build4: master with side forwarding bits enabled. To avoid confusion, I'll use "master-header", "PR-header", "master-side", "PR-side" as the labels of the builds. I re-ran lusearch, but only using SemiSpace and GenCopy which only involve CopySpace.

Plot (un-normalised): plotty


Plot (normalised to "master-header"): plotty


The increase in the number of GCs and in the times due to moving the forwarding bits to the side (i.e. the ratio of side / header for each number) is similar for both the master branch and this PR. (P.S. I wonder why the number of GCs didn't overflow this time.) The -XX:-UseCompressedOops option affects the number of GCs triggered: when using compressed oops, more GCs are triggered.

For SemiSpace, there is a 1.7% increase in STW time and a 0.5% increase in total time, which is consistent with the result of the previous 3-way comparison.

@wks wks (Collaborator, Author) commented May 31, 2024

I had a look at the timelines using eBPF. The following plots are from running lusearch, 64M heap size (the same as on lynx.moma), on my laptop.

This PR, header, no compressed oops, SemiSpace

This PR, side, no compressed oops, SemiSpace

There is one obvious problem with the current implementation: the clearing of the side forwarding bits is not parallelised. But for lusearch alone, that is not the bottleneck; any load imbalance in the Prepare and Closure stages has a greater impact on STW time.

This PR, side, no compressed oops, GenCopy nursery collection

The impact should be greater for GenCopy because the un-parallelised Release work packet matters more for nursery collections. However, the benchmark results from lynx.moma showed that the impact of side forwarding bits on STW time is greater for SemiSpace. I think that means the un-parallelised Release work packet is only a secondary factor; the primary factors are the decreased effective heap size (because the side metadata takes space) and the cache locality problems of side metadata.

@wks wks (Collaborator, Author) commented Jun 1, 2024

I ran other benchmarks on lynx.moma and bobcat.moma (identical hardware and identical binary), at 3x min heap size w.r.t. G1 in OpenJDK, with no compressed oops (even when running StickyImmix).

Note that all four builds use the mmtk-openjdk binding from mmtk/mmtk-openjdk#277, but only master-side and PR-side have the forwarding_bits_on_side Cargo feature enabled in mmtk-openjdk.

SemiSpace (plotty). Note that biojava and fop failed to run at 3x min heap size with SemiSpace.


StickyImmix (plotty).


Comparing master-header and PR-header, most differences are within the error bars. For SemiSpace, the stw.time for PR-header is noticeably (though only slightly) higher than for master-header. That could be a result of executing the loop for (start, size) in self.pr.iterate_allocated_regions() unconditionally in Release, even though the loop body is a no-op when the VO bit is not enabled either. I'll verify whether that's the cause, but I am not sure it is worth optimising. For StickyImmix, the time.stw for PR-header is sometimes higher and sometimes lower, but within the error bars. I think those differences are just noise, because the clearing of side forwarding bits for ImmixSpace is always guarded by an if statement with a constant condition and can be eliminated by the compiler; the generated code should be identical when the forwarding bits are in the header.
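
For reference, a sketch of the kind of guard being discussed (illustrative only, not the change that later became build3 / wks@b6d24bf; vo_bit_in_use is a stand-in flag):

    // Skip the chunk iteration entirely when neither the VO bit nor on-the-side
    // forwarding bits are in use, so Release stays a no-op in the common
    // in-header configuration.
    let side_forwarding_bits_in_use = matches!(
        *VM::VMObjectModel::LOCAL_FORWARDING_BITS_SPEC,
        MetadataSpec::OnSide(_)
    );
    if vo_bit_in_use || side_forwarding_bits_in_use {
        for (start, size) in self.pr.iterate_allocated_regions() {
            // ... bulk-clear the relevant side metadata for [start, start + size) ...
        }
    }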

Comparing XXX-header and XXX-side, we see the overhead of forcing the forwarding bits to be on the side. The overhead is significantly higher for SemiSpace (only CopySpace) than for StickyImmix (only ImmixSpace). One reason may be that ImmixSpace distributes the clearing operation to SweepChunk, which can be parallelised. Another reason may be that even though StickyImmix executes full-heap GCs often (which is the reality for a tight heap), it does not do copying GC unless it is a defrag collection, and even then only blocks that are defrag sources are copied, which should greatly reduce the amount of clearing to do. However, from the plot of "majorGC / GC" for StickyImmix, I don't see any connection between "majorGC / GC" and the increase in time.stw.

MajorGC per GC: (plotty)


There are two interesting benchmarks: "avrora" and "jython". Their side-metadata overhead is quite large in SemiSpace, but "avrora" sped up when using side forwarding bits, and "jython" only had a very small overhead when using side forwarding bits. The speed-up could be noise (as their error bars are large).

@qinsoon qinsoon (Member) left a comment

LGTM. The performance looks reasonable to me. If you would like to investigate further, that's fine. The PR in its current state is good to merge.

@wks wks (Collaborator, Author) commented Jun 2, 2024

Running jython locally, with SemiSpace, 76M heap size (3x min heap size w.r.t. G1 in OpenJDK), and no compressed oops:

master-header (number of GC = 1697, time.stw = 6489.81 ms, time = 19773ms)

pr-side (number of GC > 2047 (counter overflowed), time.stw = 11954.42 ms, time = 24658ms)

It looks like the 76M heap size is still too small for jython when using SemiSpace. Individual GCs don't look very different, and the Release work packet is trivial compared to the Closure stage. But when using side metadata, it is simply doing more GCs. Let's zoom in to the middle of the execution.

master-header

During some periods of execution (maybe when the program demands more memory), the program spends almost 100% of the time doing GC, and the mutators hardly have any time to execute. From time to time, the mutators get an extended period of execution without triggering GC (probably when the program keeps fewer objects alive). If the heap gets even a little bit smaller (or some portion of the heap is occupied by the side metadata), the thrashing becomes significantly worse, amplifying the overhead of side forwarding bits.

Given that SemiSpace can only make use of 50% of the heap size, I doubled the heap size to 152M, and the problem became much milder.

master-header (GC = 141, time.stw = 590.08 ms, time = 13926 ms)

pr-side (GC = 149, time.stw = 735.66 ms, time = 14202 ms)

The STW time still increased significantly when using side metadata, but the total time is less affected. At this heap size, there are still occasional bursts of frequent GCs. See the following plot:

master-header

Each individual GC still takes a similar time (about 3.5 ms) as at 76M, but the GCs are much less dense. I think it is the nature of the jython benchmark to have short bursts of allocation, while the amount of memory it keeps alive is not very high (that's why each GC can still finish in about 3.5 ms regardless of heap size). Ideally, this kind of workload could benefit from MemBalancer.

FYI, the following timelines are recorded from StickyImmix with 76M heap size:

master-header (GC = 178, time.stw = 458.02, time = 14110)

pr-side (GC = 186, time.stw = 423.02, time = 13823) (Note: I recorded it on my laptop and only ran once. The speed-up may just be noise, but the shape of the timeline is important.)

zoom-in of master-header

The allocation is still bursty, and there are some long GCs (which take up to 9.8ms). But overall it only spent a small amount of time doing GC.

@wks wks (Collaborator, Author) commented Jun 2, 2024

avrora is simpler. GC happens at regular intervals. The shape of the timeline of individual GCs doesn't change, but every GC becomes longer, and it runs more GCs. It simply takes longer to do the Closure. I think that is a result of the poorer cache locality of side metadata.

master-header SemiSpace (GC = 137, time.stw = 283 ms, time = 6078 ms)

pr-side SemiSpace (GC = 164, time.stw = 387 ms, time = 6105 ms)

The 100th GC of master-header

The 100th GC of PR-side

@wks wks (Collaborator, Author) commented Jun 2, 2024

I re-ran lusearch with SemiSpace, but this time with three builds:

  • build1: master
  • build2: this PR
  • build3: this PR plus a conditional if statement so that the Release work packet won't iterate through the chunks if not using VO bits or side forwarding bits. (wks@b6d24bf)

All builds use in-header forwarding bits. I added build3 to see whether the unconditional iteration over chunks is the cause of the small STW time overhead.

But experiments show that build3 has even greater overhead than build2. (plotty)


I think the effect of the if statement is negligible, and inlining decisions made by the compiler probably have a more significant effect than that.

@wks wks added this pull request to the merge queue Jun 3, 2024
Merged via the queue into mmtk:master with commit a9b619a Jun 3, 2024
24 checks passed
@wks wks deleted the fix/side-fwd-bits branch June 3, 2024 02:58
mmtkgc-bot added a commit to mmtk/mmtk-openjdk that referenced this pull request Jun 3, 2024
Added a feature to force using on-the-side forwarding bits. This is for
testing mmtk-core. The CI script for minimal test is also extended to
run several plans involving CopySpace and ImmixSpace on OpenJDK with
side forwarding bits.

Related PR: mmtk/mmtk-core#1138. The feature introduced in this PR can be used to check the correctness of that PR.

---------

Co-authored-by: mmtkgc-bot <mmtkgc.bot@gmail.com>
@wks wks mentioned this pull request Jun 3, 2024