Remove descriptor_map #952
Conversation
The refactoring looks good. We need some performance data before we can confidently merge the PR.
I tested on …
Is this measured with a 64-bit build? If that's the case, the new code in …
I built and ran it on x86-64, but I used the following patch to force it to use `Map32`:

```diff
diff --git a/src/util/heap/layout/vm_layout.rs b/src/util/heap/layout/vm_layout.rs
index ddf4472a5..66a7ee083 100644
--- a/src/util/heap/layout/vm_layout.rs
+++ b/src/util/heap/layout/vm_layout.rs
@@ -178,14 +178,14 @@ impl std::default::Default for VMLayout {
#[cfg(target_pointer_width = "64")]
fn default() -> Self {
- Self::new_64bit()
+ Self::new_32bit()
}
}
#[cfg(target_pointer_width = "32")]
static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
#[cfg(target_pointer_width = "64")]
-static mut VM_LAYOUT: VMLayout = VMLayout::new_64bit();
+static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
static VM_LAYOUT_FETCHED: AtomicBool = AtomicBool::new(false);
```

The log shows it is using the new code.
I ran again on bobcat.moma, 2.5x min heap size, 40 iterations, with several plans (MarkSweep and MarkCompact couldn't run with that heap size), and used the patch above to force it to use `Map32`. The result shows a noticeable slowdown for all plans. Immix has the smallest slowdown. The others have about 1% slowdown in STW time.
Your build should be using …
We can't use ….

I guess the 128-bit atomic load may also be a reason for the slowdown. I'll check that, too.
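For context on that 128-bit load: an SFT entry is a `&dyn` fat pointer, which is two words on a 64-bit target, so loading it atomically requires a 128-bit operation. A minimal sketch demonstrating the sizes (the `Sft` trait and `SomeSpace` type here are hypothetical stand-ins, not mmtk-core's definitions):

```rust
trait Sft {}
struct SomeSpace;
impl Sft for SomeSpace {}

fn main() {
    // On a 64-bit target, a thin reference is one word (8 bytes), while a
    // `&dyn` fat pointer carries a data pointer plus a vtable pointer.
    assert_eq!(std::mem::size_of::<&SomeSpace>(), 8);
    assert_eq!(std::mem::size_of::<&dyn Sft>(), 16);
}
```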
The address does not have to be in an MMTk space. It just needs an entry in the SFT (which could be mapped to …).
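To illustrate that point, here is a toy sketch of such a map; all names (`SftMap`, `EmptySpaceSft`, the 4 MiB chunk size) are assumptions for illustration, not the actual mmtk-core implementation. Every address resolves to some entry, and addresses outside any space resolve to an empty placeholder:

```rust
trait Sft {
    fn is_in_space(&self) -> bool;
}

struct EmptySpaceSft; // placeholder entry for addresses outside every space
impl Sft for EmptySpaceSft {
    fn is_in_space(&self) -> bool { false }
}

const LOG_CHUNK_BYTES: usize = 22; // assumed 4 MiB chunk granularity

struct SftMap {
    entries: Vec<&'static dyn Sft>, // one entry per chunk
}

impl SftMap {
    fn get_checked(&self, addr: usize) -> &'static dyn Sft {
        static EMPTY: EmptySpaceSft = EmptySpaceSft;
        // Any address may be queried: chunks without a space-specific entry
        // yield the empty placeholder instead of failing.
        match self.entries.get(addr >> LOG_CHUNK_BYTES) {
            Some(e) => *e,
            None => &EMPTY as &'static dyn Sft,
        }
    }
}

fn main() {
    let map = SftMap { entries: vec![] };
    assert!(!map.get_checked(0x1000_0000).is_in_space());
}
```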
This time I added build3 and build4. Build3 uses ….

From the plot, build3 is noticeably faster than build2 in STW time. The STW time of build4 is about the same as build3's for StickyImmix, but slightly higher than build3's for Immix. It may indicate that the "check" is the bottleneck. But it is hard to explain why build4 is slower than build3, since it does strictly less memory loading. It may be noise. Since the numbers are still a bit noisy, I'll do some extra experiments with micro-benchmarks.
lusearch

This is the same setting, but with added tests for other plans, and the number of invocations increased to 40. From this plot, we can see …
Microbenchmark

I also tested with a microbenchmark. It is similar to GCBench, but …

```java
class Node {
int n;
Node left;
Node right;
Node(int n, Node left, Node right) {
this.n = n;
this.left = left;
this.right = right;
}
static Node makeTree(int depth) {
if (depth == 0) {
return null;
} else {
return new Node(depth, makeTree(depth - 1), makeTree(depth - 1));
}
}
}
public class TraceTest {
public static void main(String[] args) {
int depth = Integer.parseInt(args[0]);
int iterations = Integer.parseInt(args[1]);
int warmups = Integer.parseInt(args[2]);
long[] gctimes = new long[iterations];
Node tree = Node.makeTree(depth);
for (int i = 0; i < warmups; i++) {
System.gc();
}
for (int i = 0; i < iterations; i++) {
long time1 = System.nanoTime();
System.gc();
long time2 = System.nanoTime();
gctimes[i] = time2 - time1;
}
for (long gctime: gctimes) {
System.out.println(gctime);
}
}
}
```

I ran it on bobcat.moma with the following script:

```bash
for plan in SemiSpace GenCopy GenImmix StickyImmix Immix; do
for j in {1..5}; do
for i in {1..4}; do
echo $plan build$i iter$j
MMTK_THREADS=1 MMTK_PLAN=${plan} ~/compare/build${i}/openjdk/build/linux-x86_64-normal-server-release/images/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:ParallelGCThreads=1 -XX:MetaspaceSize=100M -Xm{s,x}500M TraceTest 22 100 10 > out/result-${plan}-build${i}-iter${j}.txt
done
done
done
```

The number of GC workers is set to 1. For each plan-build pair, the script runs 5 times. Each run creates a 22-level tree, triggers GC 10 times for warm-up, and then triggers GC 100 more times, recording the time of each GC.

The results are plotted in the following violin plot + scattered point plot. Each cell corresponds to a plan-build pair and contains 5 bars, each corresponding to one of the 5 runs. The horizontal dash "-" in the middle of each "violin" is the median.

[plot: GC time distributions per plan-build pair]

With outliers (z-score >= 3) removed, the result is:

[plot: GC time distributions per plan-build pair, outliers removed]

The GC times exhibit an interesting bi-modal distribution in SemiSpace and StickyImmix.

For the two non-generational plans, namely SemiSpace and Immix, the median of build2 is significantly greater than that of build1. build3 is slightly faster than build2, but still noticeably slower than build1. Build4 is close to build1 in both plans: slightly faster than build1 in SemiSpace, but slightly slower than build1 in Immix.

For the two GenXxxxx plans, namely GenCopy and GenImmix, the plot doesn't show significant differences in GC time. The result varies in each run, and the noise is more significant than the differences between medians.

StickyImmix is a bit interesting. The bi-modal distribution disappeared in build3 (like this PR but using …).

This result is hard to interpret, but it looks like the cost of the 128-bit load is significant, and the check in …
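As an aside, the outlier removal above is a plain z-score filter. A small sketch of that step (my own illustration, not the actual analysis script):

```rust
// Drop samples whose z-score (distance from the mean in standard
// deviations) is >= 3. Assumes at least two distinct samples.
fn remove_outliers(samples: &[f64]) -> Vec<f64> {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let sd = var.sqrt();
    samples
        .iter()
        .copied()
        .filter(|x| ((x - mean) / sd).abs() < 3.0)
        .collect()
}
```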
```rust
// TODO: For `DenseChunkMap`, it is possible to reduce one level of dereferencing by
// letting each space remember its index in `DenseChunkMap::index_map`.
let self_ptr = self as *const _ as *const ();
let other_ptr = SFT_MAP.get_checked(start) as *const _ as *const ();
```
Better to use `as *const Self` instead of `*const _`. We had a bug (#750) where, if the return type of `get_checked` is changed, the `*const _` cast may still compile but can silently fail the pointer comparison or dereference later.

Same for L229 as well.
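A hypothetical sketch of that failure mode (invented names, not the actual #750 code): with `*const _`, the compiler infers whatever type the expression happens to have, so a later signature change silently changes what is being compared, whereas an explicit `*const Self` pins the intended type:

```rust
trait Sft {}
struct SpaceA;
impl Sft for SpaceA {}

static A: SpaceA = SpaceA;

// Imagine this once returned `&'static SpaceA` and was later changed
// to return `&'static dyn Sft`.
fn get_checked(_addr: usize) -> &'static dyn Sft {
    &A
}

impl SpaceA {
    fn address_in_space(&self, addr: usize) -> bool {
        // `self as *const _` would still compile after such a change; the
        // explicit `*const Self` documents and enforces the intended type.
        let self_ptr = self as *const Self as *const ();
        let other_ptr = get_checked(addr) as *const dyn Sft as *const ();
        std::ptr::eq(self_ptr, other_ptr)
    }
}

fn main() {
    assert!(A.address_in_space(0));
}
```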
Yes. We can do that on line 229. But on L230, the returned SFT instance may not have the same type as `Self`.

But I think a deeper problem is that `SFT_MAP.get_checked(start) as *const _` is a `*const dyn SFT`, which is 16 bytes long, while `*const ()` is 8 bytes long. Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting pointer to pointer. This means casting pointers this way is really no better than loading only half of the 128-bit `&dyn` from the SFT, which is what build4 did.
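To make the size mismatch concrete, a small self-contained check (the trait name is assumed) showing that the thin cast keeps only the data-pointer half of the fat pointer:

```rust
trait Sft {}
struct S(u8);
impl Sft for S {}

fn main() {
    let s = S(0);
    let fat: *const dyn Sft = &s as *const S as *const dyn Sft;
    assert_eq!(std::mem::size_of_val(&fat), 16); // data + vtable (x86-64)

    // Casting fat -> thin discards the vtable half of the pointer.
    let thin: *const () = fat as *const ();
    assert_eq!(std::mem::size_of_val(&thin), 8);
    assert_eq!(thin, &s as *const S as *const ());
}
```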
> Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting pointer to pointer.

Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?
> Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?

The link you provided points to an ancient version of the Rust Book, for v1.25. The newest version of the Rust Book does not contain that section.

I did mean the casting itself. It is not a no-op bits-preserving cast like `transmute`, as it changes the number of bits.
Ah right. That was the first link on Google for me when I searched for "raw pointer casting Rust".
Could you run a smaller experiment on a different machine? bobcat is asymmetrical, so scheduling may affect the results. (Or just run on the performance cores on bobcat.)
To further investigate the bi-modal distribution in my microbenchmark, I plotted the GC time of each GC in the first SemiSpace run with build2.

[plot: per-GC time, SemiSpace, build2, run 1]

The GC time is jumping back and forth between two values. I suspect the difference is the cost of a failed …, so it will call ….

The curve for StickyImmix is different. The following is the first run of build 2:

[plot: per-GC time, StickyImmix, build2, run 1]

This may indicate that the bimodal distribution is caused by something else that is periodic.

And the first run of build 3:

[plot: per-GC time, StickyImmix, build3, run 1]

This indicates the GC time still oscillates, but with a lower "AC" amplitude and a higher "DC" component. This is hard to explain, because omitting a check should only make things faster.
I ran the microbenchmark again on …. The overall result is consistent with ….

Semispace build2 run2:

[plot: per-GC time, SemiSpace, build2, run 2]

StickyImmix build2 run2:

[plot: per-GC time, StickyImmix, build2, run 2]
From our previous discussion, although the …
DRAFT: This PR has two problems:

- `address_in_space` in this PR has noticeable performance overhead. From the benchmark results, it looks like the check for having an SFT entry and the 128-bit load are a bottleneck.
- The semantics of casting `*const dyn SFT` to `*const ()` is unclear, and may be unreliable.

So I decided to postpone this PR until two other things are done:

- Refactor the `SFT_MAP` implementation to eliminate the 128-bit atomic read. See "Should we avoid using fat pointers for SFT?" #945

Description
This PR removes the `descriptor_map` from both `Map32` and `Map64`.

Currently, the `descriptor_map` is only used by `Space::address_in_space` when using `Map32`, and is not used (except in some assertions about newly acquired pages) when using `Map64`. With `descriptor_map` removed, `Space::address_in_space` will use the SFT to find the space of a given address.

Performance: The `Space::in_space` function (which calls `Space::address_in_space`) is used by the derive macro of the trait `PlanTraceObject`. Therefore, this PR will affect the performance of tracing for plans that use `PlanTraceObject` when using `Map32`. This needs to be tested.

Related PRs:

- Let Map32 use proper synchronization. #951: If the `descriptor_map` were moved into the `Mutex`, it would be inefficient. This PR removes `descriptor_map` instead, so that #951 can move other fields into the `Mutex`.
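Putting the pieces together, here is a hedged sketch of the SFT-based `address_in_space` described above, reusing the toy names from the earlier sketches (not the real mmtk-core API); it reduces the membership test to a pointer-identity comparison, matching the reviewed code in the diff:

```rust
trait Sft {}
struct SomeSpace;
impl Sft for SomeSpace {}

struct SftMap {
    entries: Vec<&'static dyn Sft>, // one entry per chunk
}

impl SftMap {
    fn get_checked(&self, addr: usize) -> &'static dyn Sft {
        self.entries[addr >> 22] // assumed 4 MiB chunks; index assumed valid
    }
}

// `start` is in `space` iff the SFT entry for `start` is that very
// space object (compared by data-pointer identity).
fn address_in_space(map: &SftMap, space: &'static SomeSpace, start: usize) -> bool {
    let self_ptr = space as *const SomeSpace as *const ();
    let other_ptr = map.get_checked(start) as *const dyn Sft as *const ();
    std::ptr::eq(self_ptr, other_ptr)
}
```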