Remove descriptor_map #952
Conversation
The refactoring looks good. We need some performance data before we can confidently merge the PR.
I tested on …
Is this measured with a 64-bit build? If that's the case, the new code in …
I built and ran it on x86-64, but I used the following patch to force it to use `Map32`:

```diff
diff --git a/src/util/heap/layout/vm_layout.rs b/src/util/heap/layout/vm_layout.rs
index ddf4472a5..66a7ee083 100644
--- a/src/util/heap/layout/vm_layout.rs
+++ b/src/util/heap/layout/vm_layout.rs
@@ -178,14 +178,14 @@ impl std::default::Default for VMLayout {
#[cfg(target_pointer_width = "64")]
fn default() -> Self {
- Self::new_64bit()
+ Self::new_32bit()
}
}
#[cfg(target_pointer_width = "32")]
static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
#[cfg(target_pointer_width = "64")]
-static mut VM_LAYOUT: VMLayout = VMLayout::new_64bit();
+static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
static VM_LAYOUT_FETCHED: AtomicBool = AtomicBool::new(false);
```

The log shows it is using the new code.
I ran again on bobcat.moma, 2.5x min heap size, 40 iterations, with several plans (MarkSweep and MarkCompact couldn't run with that heap size), and used the patch above to force it to use `Map32`. The result shows a noticeable slowdown for all plans. Immix has the smallest slowdown. The others have about 1% slowdown in STW time.
Your build should be using …
We can't use ….

I guess the 128-bit atomic load may also be a reason for the slowdown. I'll check that, too.
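For context on that 128-bit load: an SFT entry is a `&dyn` fat pointer, which is two words on a 64-bit target, so loading it atomically requires a 128-bit operation. A minimal sketch demonstrating the sizes (the `Sft` trait and `SomeSpace` type here are hypothetical stand-ins, not mmtk-core's definitions):

```rust
trait Sft {}
struct SomeSpace;
impl Sft for SomeSpace {}

fn main() {
    // On a 64-bit target, a thin reference is one word (8 bytes), while a
    // `&dyn` fat pointer carries a data pointer plus a vtable pointer.
    assert_eq!(std::mem::size_of::<&SomeSpace>(), 8);
    assert_eq!(std::mem::size_of::<&dyn Sft>(), 16);
}
```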
The address does not have to be in an MMTk space. It just needs an entry in the SFT (which could be mapped to …).
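To illustrate that point, here is a toy sketch of such a map; all names (`SftMap`, `EmptySpaceSft`, the 4 MiB chunk size) are assumptions for illustration, not the actual mmtk-core implementation. Every address resolves to some entry, and addresses outside any space resolve to an empty placeholder:

```rust
trait Sft {
    fn is_in_space(&self) -> bool;
}

struct EmptySpaceSft; // placeholder entry for addresses outside every space
impl Sft for EmptySpaceSft {
    fn is_in_space(&self) -> bool { false }
}

const LOG_CHUNK_BYTES: usize = 22; // assumed 4 MiB chunk granularity

struct SftMap {
    entries: Vec<&'static dyn Sft>, // one entry per chunk
}

impl SftMap {
    fn get_checked(&self, addr: usize) -> &'static dyn Sft {
        static EMPTY: EmptySpaceSft = EmptySpaceSft;
        // Any address may be queried: chunks without a space-specific entry
        // yield the empty placeholder instead of failing.
        match self.entries.get(addr >> LOG_CHUNK_BYTES) {
            Some(e) => *e,
            None => &EMPTY as &'static dyn Sft,
        }
    }
}

fn main() {
    let map = SftMap { entries: vec![] };
    assert!(!map.get_checked(0x1000_0000).is_in_space());
}
```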
This time I added build3 and build4. Build3 uses ….

From the plot, build3 is noticeably faster than build2 in STW time. The STW time of build4 is about the same as build3's for StickyImmix, but slightly higher than build3's for Immix. It may indicate that the "check" is the bottleneck. But it is hard to explain why build4 is slower than build3, since it does strictly less memory loading. It may be noise. Since the numbers are still a bit noisy, I'll do some extra experiments with micro-benchmarks.
lusearch

This is the same setting, but with added tests for other plans, and the number of invocations increased to 40. From this plot, we can see …
Microbenchmark

I also tested with a microbenchmark. It is similar to GCBench, but …

```java
class Node {
int n;
Node left;
Node right;
Node(int n, Node left, Node right) {
this.n = n;
this.left = left;
this.right = right;
}
static Node makeTree(int depth) {
if (depth == 0) {
return null;
} else {
return new Node(depth, makeTree(depth - 1), makeTree(depth - 1));
}
}
}
public class TraceTest {
public static void main(String[] args) {
int depth = Integer.parseInt(args[0]);
int iterations = Integer.parseInt(args[1]);
int warmups = Integer.parseInt(args[2]);
long[] gctimes = new long[iterations];
Node tree = Node.makeTree(depth);
for (int i = 0; i < warmups; i++) {
System.gc();
}
for (int i = 0; i < iterations; i++) {
long time1 = System.nanoTime();
System.gc();
long time2 = System.nanoTime();
gctimes[i] = time2 - time1;
}
for (long gctime: gctimes) {
System.out.println(gctime);
}
}
}
```

I ran it on bobcat.moma with the following script:

```bash
for plan in SemiSpace GenCopy GenImmix StickyImmix Immix; do
for j in {1..5}; do
for i in {1..4}; do
echo $plan build$i iter$j
MMTK_THREADS=1 MMTK_PLAN=${plan} ~/compare/build${i}/openjdk/build/linux-x86_64-normal-server-release/images/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:ParallelGCThreads=1 -XX:MetaspaceSize=100M -Xm{s,x}500M TraceTest 22 100 10 > out/result-${plan}-build${i}-iter${j}.txt
done
done
done
```

The number of GC workers is set to 1. For each plan-build pair, the script runs 5 times. Each run creates a 22-level tree, triggers GC 10 times for warm-up, and then triggers GC 100 more times, recording the time of each GC.

The results are plotted in the following violin plot + scattered point plot. Each cell corresponds to a plan-build pair and contains 5 bars, each corresponding to one of the 5 runs. The horizontal dash "-" in the middle of each "violin" is the median.

[plot: GC time distributions per plan-build pair]

With outliers (z-score >= 3) removed, the result is:

[plot: GC time distributions per plan-build pair, outliers removed]

The GC times exhibit an interesting bi-modal distribution in SemiSpace and StickyImmix.

For the two non-generational plans, namely SemiSpace and Immix, the median of build2 is significantly greater than that of build1. build3 is slightly faster than build2, but still noticeably slower than build1. Build4 is close to build1 in both plans: slightly faster than build1 in SemiSpace, but slightly slower than build1 in Immix.

For the two GenXxxxx plans, namely GenCopy and GenImmix, the plot doesn't show significant differences in GC time. The result varies in each run, and the noise is more significant than the differences between medians.

StickyImmix is a bit interesting. The bi-modal distribution disappeared in build3 (like this PR but using …).

This result is hard to interpret, but it looks like the cost of the 128-bit load is significant, and the check in …
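As an aside, the outlier removal above is a plain z-score filter. A small sketch of that step (my own illustration, not the actual analysis script):

```rust
// Drop samples whose z-score (distance from the mean in standard
// deviations) is >= 3. Assumes at least two distinct samples.
fn remove_outliers(samples: &[f64]) -> Vec<f64> {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let sd = var.sqrt();
    samples
        .iter()
        .copied()
        .filter(|x| ((x - mean) / sd).abs() < 3.0)
        .collect()
}
```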
```rust
// TODO: For `DenseChunkMap`, it is possible to reduce one level of dereferencing by
// letting each space remember its index in `DenseChunkMap::index_map`.
let self_ptr = self as *const _ as *const ();
let other_ptr = SFT_MAP.get_checked(start) as *const _ as *const ();
```
Better to use `as *const Self` instead of `*const _`. We had a bug (#750) where, if the return type of `get_checked` is changed, the `*const _` cast may still compile but can silently fail the pointer comparison or dereference later.

Same for L229 as well.
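A hypothetical sketch of that failure mode (invented names, not the actual #750 code): with `*const _`, the compiler infers whatever type the expression happens to have, so a later signature change silently changes what is being compared, whereas an explicit `*const Self` pins the intended type:

```rust
trait Sft {}
struct SpaceA;
impl Sft for SpaceA {}

static A: SpaceA = SpaceA;

// Imagine this once returned `&'static SpaceA` and was later changed
// to return `&'static dyn Sft`.
fn get_checked(_addr: usize) -> &'static dyn Sft {
    &A
}

impl SpaceA {
    fn address_in_space(&self, addr: usize) -> bool {
        // `self as *const _` would still compile after such a change; the
        // explicit `*const Self` documents and enforces the intended type.
        let self_ptr = self as *const Self as *const ();
        let other_ptr = get_checked(addr) as *const dyn Sft as *const ();
        std::ptr::eq(self_ptr, other_ptr)
    }
}

fn main() {
    assert!(A.address_in_space(0));
}
```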
Yes. We can do that on line 229. But on L230, the returned SFT instance may not have the same type as `Self`.

But I think a deeper problem is that `SFT_MAP.get_checked(start) as *const _` is a `*const dyn SFT`, which is 16 bytes long, while `*const ()` is 8 bytes long. Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting pointer to pointer. This means casting pointers this way is really no better than loading only half of the 128-bit `&dyn` from the SFT, which is what build4 did.
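To make the size mismatch concrete, a small self-contained check (the trait name is assumed) showing that the thin cast keeps only the data-pointer half of the fat pointer:

```rust
trait Sft {}
struct S(u8);
impl Sft for S {}

fn main() {
    let s = S(0);
    let fat: *const dyn Sft = &s as *const S as *const dyn Sft;
    assert_eq!(std::mem::size_of_val(&fat), 16); // data + vtable (x86-64)

    // Casting fat -> thin discards the vtable half of the pointer.
    let thin: *const () = fat as *const ();
    assert_eq!(std::mem::size_of_val(&thin), 8);
    assert_eq!(thin, &s as *const S as *const ());
}
```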
> Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting pointer to pointer.

Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?
> Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?

The link you provided points to an ancient version of the Rust Book, for v1.25. The newest version of the Rust Book does not contain that section.

I did mean the casting itself. It is not a no-op bits-preserving cast like `transmute`, as it changes the number of bits.
Ah right. That was the first link on Google for me when I searched for "raw pointer casting Rust".
Could you run a smaller experiment on a different machine? bobcat is asymmetrical, so scheduling may affect the results. (Or just run on the performance cores on bobcat.)
To further investigate the bi-modal distribution in my microbenchmark, I plotted the GC time of each GC in the first SemiSpace run with build2.

[plot: per-GC time, SemiSpace, build2, run 1]

The GC time is jumping back and forth between two values. I suspect the difference is the cost of a failed …, so it will call ….

The curve for StickyImmix is different. The following is the first run of build 2:

[plot: per-GC time, StickyImmix, build2, run 1]

This may indicate that the bimodal distribution is caused by something else that is periodic.

And the first run of build 3:

[plot: per-GC time, StickyImmix, build3, run 1]

This indicates the GC time still oscillates, but with a lower "AC" amplitude and a higher "DC" component. This is hard to explain, because omitting a check should only make things faster.
I ran the microbenchmark again on …. The overall result is consistent with ….

Semispace build2 run2:

[plot: per-GC time, SemiSpace, build2, run 2]

StickyImmix build2 run2:

[plot: per-GC time, StickyImmix, build2, run 2]
From our previous discussion, although the …
DRAFT: This PR has two problems:

- `address_in_space` in this PR has noticeable performance overhead. From the benchmark results, it looks like the check for having an SFT entry and the 128-bit load are a bottleneck.
- The semantics of casting `*const dyn SFT` to `*const ()` is unclear, and may be unreliable.

So I decided to postpone this PR until two other things are done:

- Refactor the `SFT_MAP` implementation to eliminate the 128-bit atomic read. See "Should we avoid using fat pointers for SFT?" #945

Description
This PR removes the `descriptor_map` from both `Map32` and `Map64`.

Currently, the `descriptor_map` is only used by `Space::address_in_space` when using `Map32`, and is not used (except in some assertions about newly acquired pages) when using `Map64`. With `descriptor_map` removed, `Space::address_in_space` will use the SFT to find the space of a given address.

Performance: The `Space::in_space` function (which calls `Space::address_in_space`) is used by the derive macro of the trait `PlanTraceObject`. Therefore, this PR will affect the performance of tracing for plans that use `PlanTraceObject` when using `Map32`. This needs to be tested.

Related PRs:

- Let Map32 use proper synchronization. #951: If the `descriptor_map` were moved into the `Mutex`, it would be inefficient. This PR removes `descriptor_map` instead, so that #951 can move other fields into the `Mutex`.
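Putting the pieces together, here is a hedged sketch of the SFT-based `address_in_space` described above, reusing the toy names from the earlier sketches (not the real mmtk-core API); it reduces the membership test to a pointer-identity comparison, matching the reviewed code in the diff:

```rust
trait Sft {}
struct SomeSpace;
impl Sft for SomeSpace {}

struct SftMap {
    entries: Vec<&'static dyn Sft>, // one entry per chunk
}

impl SftMap {
    fn get_checked(&self, addr: usize) -> &'static dyn Sft {
        self.entries[addr >> 22] // assumed 4 MiB chunks; index assumed valid
    }
}

// `start` is in `space` iff the SFT entry for `start` is that very
// space object (compared by data-pointer identity).
fn address_in_space(map: &SftMap, space: &'static SomeSpace, start: usize) -> bool {
    let self_ptr = space as *const SomeSpace as *const ();
    let other_ptr = map.get_checked(start) as *const dyn Sft as *const ();
    std::ptr::eq(self_ptr, other_ptr)
}
```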