Fix VO bits for Immix #849
Conversation
Implemented the ReconstructByTracing strategy
This PR only makes changes to the VO bit in the immix space, while the VO bit is global metadata and applies to all the policies. Some changes, such as removing set_vo_bit from forward_object, will break other policies. Some changes, such as introducing a VO bit update strategy for immix only, will make the update strategy inconsistent among different policies, which makes the update strategy less useful (we don't expect a binding to check which space an object lives in before calling is_mmtk_object).
Getting all these correct would need some non-trivial effort. The following is the minimum that is needed for this PR:
- We need a complete and coherent design for the update strategy: the update strategy needs to be global and work for all the policies. We need to be clear about what each policy does in terms of complying with the strategy. We should support different strategies for different policies if it is simple, or leave some unimplemented if they are non-trivial.
- We use assertions to check VO bits (any object that is traced/reachable should have its VO bit set), and in the OpenJDK binding, we test a few plans with this assertion. With this PR (and the update strategy), we should have some assertions like this (e.g. the assertion you put in sanity GC looks good, but in the OpenJDK tests we do not run sanity GC), and the OpenJDK tests should pass with those assertions. A sketch of such an assertion is shown at the end of this comment.
I might have missed some of your discussions on this topic. Let me know if what I said contradicts what you have discussed earlier.
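For illustration, the kind of assertion described above could look like the following minimal, self-contained sketch. All types and names here are hypothetical stand-ins, not the actual mmtk-core API: whenever tracing reaches an object, its VO bit must already be set.

```rust
/// Toy on-the-side VO bit metadata: one bit per possible object-start slot.
/// Everything here is a hypothetical stand-in, not the real mmtk-core types.
struct VoBitmap {
    bits: Vec<u8>,
}

impl VoBitmap {
    fn is_set(&self, slot: usize) -> bool {
        self.bits[slot / 8] & (1u8 << (slot % 8)) != 0
    }
}

/// Called for every object reached during tracing: any traced/reachable object
/// should already have its VO bit set, otherwise the VO bit metadata is out of sync.
fn on_object_traced(vo: &VoBitmap, slot: usize) {
    debug_assert!(
        vo.is_set(slot),
        "traced object at slot {} does not have its VO bit set",
        slot
    );
}

fn main() {
    let vo = VoBitmap { bits: vec![0b0000_0001] };
    on_object_traced(&vo, 0); // passes: slot 0 has its VO bit set
}
```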
```rust
/// * `start`: The starting address of a memory region.
/// * `size`: The size of the memory region.
/// * `other`: The other metadata to copy from.
pub fn bcopy_metadata_contiguous(&self, start: Address, size: usize, other: &SideMetadataSpec) {
```
We need some test cases for this. Also, it needs to update the side metadata sanity table (once you have some tests and run them with extreme_assertions, you should find some assertion failures).
Yes. I'll add some unit tests for bcopy_metadata_contiguous.
I have just added a unit test for bcopy_metadata_contiguous.
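The actual test is not quoted in this thread. Purely as an illustration, a unit test for a bulk metadata copy typically fills a source buffer, copies one region, and checks that only that region changed. The names below are made up for the example; the real bcopy_metadata_contiguous works on SideMetadataSpec and Address ranges.

```rust
/// A toy byte-granularity bulk copy, only to illustrate the shape of such a test.
fn bulk_copy(dst: &mut [u8], src: &[u8], start: usize, size: usize) {
    dst[start..start + size].copy_from_slice(&src[start..start + size]);
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn copies_only_the_requested_region() {
        let src = vec![0xFFu8; 16];
        let mut dst = vec![0u8; 16];
        bulk_copy(&mut dst, &src, 4, 8);
        assert!(dst[..4].iter().all(|&b| b == 0)); // bytes before the region are untouched
        assert!(dst[4..12].iter().all(|&b| b == 0xFF)); // the copied region matches the source
        assert!(dst[12..].iter().all(|&b| b == 0)); // bytes after the region are untouched
    }
}
```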
This is a good point. I'll make sure VO bits remain usable for existing non-immix plans.
The reason why the "VO bits updating strategies" are only applicable to Immix is that currently Immix (and its variants) is the only GC algorithm that may leave dead objects in the memory without cleaning them up. On the contrary,
Immix is special because we only sweep completely free lines. If a line contains both live and dead objects, and it is not part of the defrag source, we don't sweep it. During tracing, Immix traces through live objects in such lines, but unlike MarkSweep, Immix does not know where the dead objects are. To work around this, we either clear all VO bits before tracing and reconstruct VO bits when we mark or forward objects (because we do know where live objects are), or we copy the on-the-side mark bits over to the VO bits, because the on-the-side mark bits are always cleared before GC. Using either one of those two strategies, dead objects in such lines will have VO bits of 0.
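As a minimal sketch of the two strategies described above (toy bitmaps and made-up names; not the code in this PR), the difference is only in when VO bits are written:

```rust
/// Toy side metadata: one byte per object slot, for illustration only.
struct Bitmaps {
    vo: Vec<u8>,   // valid-object (VO) bits
    mark: Vec<u8>, // on-the-side mark bits, always cleared before each GC
}

enum VOBitUpdateStrategy {
    ClearAndReconstruct,
    CopyFromMarkBits,
}

impl Bitmaps {
    /// Prepare phase, before tracing.
    fn prepare(&mut self, strategy: &VOBitUpdateStrategy) {
        self.mark.fill(0); // mark bits are always cleared before GC
        if let VOBitUpdateStrategy::ClearAndReconstruct = strategy {
            self.vo.fill(0); // clear all VO bits; rebuild them while tracing
        }
    }

    /// Called when tracing marks (or forwards) a live object.
    fn on_object_marked(&mut self, strategy: &VOBitUpdateStrategy, slot: usize) {
        self.mark[slot] = 1;
        if let VOBitUpdateStrategy::ClearAndReconstruct = strategy {
            self.vo[slot] = 1; // reconstruct the VO bit for this live object
        }
    }

    /// Release phase, after tracing.
    fn release(&mut self, strategy: &VOBitUpdateStrategy) {
        if let VOBitUpdateStrategy::CopyFromMarkBits = strategy {
            // The mark bitmap now records exactly the live objects,
            // so it can be copied over the VO bitmap sequentially.
            self.vo.copy_from_slice(&self.mark);
        }
        // With either strategy, dead objects in unswept lines end up with VO bits of 0.
    }
}

fn main() {
    let strategy = VOBitUpdateStrategy::CopyFromMarkBits;
    let mut b = Bitmaps { vo: vec![1, 1, 1], mark: vec![0, 0, 0] };
    b.prepare(&strategy);
    b.on_object_marked(&strategy, 0); // only slot 0 is reached during tracing
    b.release(&strategy);
    assert_eq!(b.vo, vec![1, 0, 0]); // slots 1 and 2 (dead) lost their VO bits
}
```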
Can you point me to the "assertions to check vo bits" you mentioned? The
This is not true. As you said, MarkCompact leaves dead objects in the memory. MarkSweep with lazy sweeping also leaves dead objects in the memory. These all affect VO bits. But anyway, the important thing is not the update strategy. It is about what semantics we allow, and the semantics should apply to all the policies. For example, in this PR, if we set the update strategy to
The assertion in
The key part is "not cleaning them up". I actually meant metadata, not the content of the memory. MarkCompact leaves dead objects in the memory just like SemiSpace does. If an object is in the rear portion of the MarkCompactSpace, we leave it behind. But after compaction, we know the region of MarkCompactSpace that contains live objects, and we can clear the metadata (including VO bits) for the rear portion that contains dead objects. It's just like SemiSpace releasing the from-space. We may leave the memory content intact, but we clear its metadata and mark the memory region as available.
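As a rough illustration of that point (toy code, not the real side metadata API): when a whole region is released, its side metadata can be cleared in bulk while the memory contents are left untouched.

```rust
/// Toy side metadata covering a space: one VO byte per object slot.
struct SideMetadata {
    vo: Vec<u8>,
}

impl SideMetadata {
    /// Bulk-clear the metadata for a region being released wholesale,
    /// e.g. a SemiSpace from-space, or the rear portion of a MarkCompact space
    /// after compaction. The objects' memory itself may be left intact.
    fn clear_region(&mut self, start_slot: usize, num_slots: usize) {
        self.vo[start_slot..start_slot + num_slots].fill(0);
    }
}

fn main() {
    let mut meta = SideMetadata { vo: vec![1; 8] };
    meta.clear_region(4, 4); // release the rear half of the space
    assert_eq!(meta.vo, vec![1, 1, 1, 1, 0, 0, 0, 0]);
}
```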
If lazy sweeping is the default, it may explain why I observed some dead objects still having VO bits set in the EndOfGC phase. Currently, I am testing the VO bits by recording dead objects in one GC (by recording the finalised objects in Ruby, because Ruby never resurrects any objects), and injecting them in a subsequent GC. I added an assertion that "in the EndOfGC phase of a GC, the finalised objects recorded in this GC should not have VO bits set", and that failed.
For MarkSweep, MarkCompact, SemiSpace and GenCopy, the VO bits are always available during tracing. Their most efficient updating strategies happen to preserve VO bits during tracing. Anyway, if the VM binding sets
In ImmixSpace, the assertion also exists in
Add a helper function in rust_util, instead.
I updated the semantics of VO bits so that a VO bit is set from when the object is allocated until the object is considered dead by the GC. The reason is that if an object is dead, it will not be reached again by the mutator or by the GC, and it will contain stale data, including dangling references (not forwarded, because the GC never visits that object again) and other data that no longer makes sense. If VO bits are used for conservative stack scanning, and VO bits are set on some dead objects, those dead objects may be brought back to life. If the GC traces from those objects through dangling edges, it will crash. So we must clear the VO bits once the GC can no longer reach that object. I think the reason why I still observe dead objects having VO bits set during
Update comment
Co-authored-by: Yi Lin <qinsoon@gmail.com>
If I merge #830 and turn on the "eager_sweeping" feature, I no longer observe dead objects whose VO bits remain set at EndOfGC. For this PR, I'll make the "vo_bit" feature depend on the "eager_sweeping" feature.
Otherwise native mark-sweep will not clear VO bits for all dead objects during GC.
I added support for StickyImmix for both strategies. The
It is the proportion of execution time doing GC. Have a look at the un-normalised data. GenImmix, for some reason, is spending about 90% of the time doing GC. Therefore, the STW time dominates the total time. From Rifat's paper, the STW time is greatly impacted by conservative GC, but the total time is not... as long as the STW time doesn't dominate. BTW, here is a preview of the results for other benchmarks.
This time the heap size is between 2x and 3x. One possibility that ClearAndReconstruct is slower may be that it is clearing more blocks than necessary. Using the ClearAndReconstruct strategy, it clears the VO bits for whole chunks, regardless of whether the blocks (and lines) inside are allocated or not. (Link: https://github.com/wks/mmtk-core/blob/336908526459dc9d1b66e779ada48eaf81d9f61c/src/policy/immix/immixspace.rs#L1088) On the other hand, when using the CopyFromMarkBits strategy, it only copies VO bits for marked lines. Maybe the bottleneck is simply the memory bandwidth. I'll look into it tomorrow.
You definitely need more invocations to reduce the effect of possible noise. I usually use at least 10 invocations when I try to draw any conclusion (20 or 40 invocations is also very common), and use 5 invocations when I am in the process of debugging performance. You can't tell much from 1 invocation, as the data you are looking at could be just a very noisy run.
We thought VO bits were cheap based on Yiluo's report and the PR #390. However, as @wks found out, the VO bits were not cleared promptly, and dead objects may still have their VO bits set. This PR fixes that. So our understanding about the cost of VO bits was based on an inaccurate implementation. We do not know the actual cost of VO bits before this PR. The cost of

So I would think 4% overhead might be reasonable, but it certainly makes VO bits less appealing. If you believe there is any performance mistake in the PR, please leave a comment. Otherwise, the remaining question is how we can optimise it, and how much effort we put into optimising it.
@wks Yes, I agree GenImmix's total time is being dominated by its STW time, which is not tuned. But then even in the STW time we have a ~14% overhead for the
From Rifat's paper Figure 4(a), the overhead of adding conservatism for 2x min heap is 2.7%. So I guess it's not a big stretch that the PR above is adding 4% overhead for StickyImmix and 1% overhead for Immix. But Rifat's paper implemented
TL;DR: The culprit that made ClearAndReconstruct slow is setting VO bits during tracing. It may be a result of cache misses. I'll not fix it in this PR. We should optimise metadata access in the future, especially use
If I force the CopyFromMarkBits strategy to set VO bits when marking objects, it will have the same cost as ClearAndReconstruct. Replacing

(p.s. I tested with a variant of GCBench which allocates a long-lived tree and triggers GC 100 times. The execution time is almost 100% GC time.)

When an object is forwarded, both strategies have to set the VO bit for the to-space object. There is no performance difference. I think the reason why it takes so much time to set VO bits while tracing is that the on-the-side bitmap may result in cache misses if it is accessed randomly. Similar to VO bits, the operation to atomically set the mark bit (i.e.

I think it is just the nature of the on-the-side metadata that makes the ClearAndReconstruct strategy slower than the CopyFromMarkBits strategy. ClearAndReconstruct writes to two bitmaps simultaneously during tracing, while CopyFromMarkBits writes to one bitmap (it occasionally writes VO bits when forwarding, but Immix doesn't move many objects), and then copies it to the other during the Release stage, and the copying is sequential, which means it is cache-friendly. I'll not attempt to solve this problem in this PR because it is just the nature of metadata access.

(BTW, if I also modify

I think there may be multiple reasons why Rifat's result showed lower overhead.
```java
public static void setNewBit(ObjectReference object) {
  Address address = VM.objectModel.refToAddress(object);
  Address base = getMetaDataBase(address).plus(Chunk.NEW_DATA_OFFSET);
  Word index = address.toWord().rshl(Chunk.OBJECT_LIVE_SHIFT).and(Chunk.NEW_DATA_BIT_MASK);
  base.atomicSetBit(index);
}
```

It uses the atomicSetBit method directly. On the contrary, in mmtk-core, the corresponding operation goes through fetch_or_atomic:

```rust
pub fn fetch_or_atomic<T: MetadataValue>(
    &self,
    data_addr: Address,
    val: T,
    order: Ordering,
) -> T {
    self.side_metadata_access::<T, _, _, _>(
        data_addr,
        Some(val),
        || {
            let meta_addr = address_to_meta_address(self, data_addr);
            if self.log_num_of_bits < 3 {
                let lshift = meta_byte_lshift(self, data_addr);
                let mask = meta_byte_mask(self) << lshift;
                // We do not need to use fetch_ops_on_bits(), we can just set irrelevant bits to 0, and do fetch_or
                let rhs = (val.to_u8().unwrap() << lshift) & mask;
                let old_raw_byte =
                    unsafe { <u8 as MetadataValue>::fetch_or(meta_addr, rhs, order) };
                let old_val = (old_raw_byte & mask) >> lshift;
                FromPrimitive::from_u8(old_val).unwrap()
            } else {
                unsafe { T::fetch_or(meta_addr, val, order) }
            }
        },
        |_old_val| {
            #[cfg(feature = "extreme_assertions")]
            sanity::verify_update::<T>(self, data_addr, _old_val, _old_val.bitor(val))
        },
    )
}
```

Those
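For comparison, a direct atomic bit-set in Rust can be nearly as lean as the JikesRVM version above. The sketch below is illustrative only: it hard-codes a one-bit-per-slot layout and omits the generic dispatch and extreme_assertions bookkeeping that side_metadata_access provides.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Set one bit in an on-the-side bitmap with a single atomic fetch_or.
/// `bitmap` is the metadata area; `slot` identifies the object (one bit per slot).
fn set_bit_atomic(bitmap: &[AtomicU8], slot: usize) {
    let byte = slot / 8;
    let mask = 1u8 << (slot % 8);
    bitmap[byte].fetch_or(mask, Ordering::SeqCst);
}

fn main() {
    let bitmap: Vec<AtomicU8> = (0..4).map(|_| AtomicU8::new(0)).collect();
    set_bit_atomic(&bitmap, 10); // sets bit 2 of byte 1
    assert_eq!(bitmap[1].load(Ordering::SeqCst), 0b0000_0100);
}
```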
Plotty links for Immix and StickyImmix

This time I ran Immix and StickyImmix for multiple benchmarks. Heap size is between 2x and 3x (running runbms ... 8 3).
- STW time, Immix, normalised to build2
- Total time, Immix, normalised to build2
- STW time, StickyImmix, normalised to build2
- Total time, StickyImmix, normalised to build2

For Immix, the total time overhead is OK for all benchmarks. For StickyImmix, h2o and xalan are outliers. h2o has almost a 70% increase in total time, but its increase in STW time is not that significant. The mutator time is greatly increased. This is a bit disturbing.
Preliminary experiments show that the abnormal mutator time of "h2o" and "xalan" is not specific to StickyImmix. It is reproducible with GenCopy, too, but not SemiSpace. With GenCopy, enabling "vo_bit" and running "h2o" almost doubles the mutator time, too, just like StickyImmix. I guess it has something to do with the interaction between generational GC and the VO bits metadata. I also tried multiple heap sizes and the problem is the same for all heap sizes. I am running more invocations to get more data points.
Looking at your results, it seems there is a slowdown in the mutator time in all the benchmarks for both VO bit strategies. For most benchmarks, it is about 5% or less, but it seems visible.
The slowdown may come from the invocations of
Running h2o at different heap sizes using SemiSpace and GenCopy: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-06-21-Wed-170232&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^hfac& It looks like the mutator time is constant, while both the number of GCs and the STW time decrease as the heap size goes up.
Yeah. Right. You can run
Looks good. Just some minor issues.
```rust
/// Select a strategy for the VM. It is a `const` function so it always returns the same strategy
/// for a given VM.
const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
```
We can use CopyFromMarkBits as the default if the mark bit is on the side, as it is faster and complies with the semantics in both cases. We only need to use ClearAndReconstruct if the mark bit is in the header.
Done. I let it simply return CopyFromMarkBits for now, and mentioned in the comments when we may need to choose other strategies.
We should use ClearAndReconstruct when the mark bit is in the header. Our dummy VM tests crashed because it uses the header mark bit, and validation failed for CopyFromMarkBits.
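A sketch of the selection logic discussed here might look like the following. MARK_BIT_ON_SIDE is a hypothetical stand-in for querying the VM binding's mark-bit metadata spec, and the real function is generic over the VM binding.

```rust
#[derive(Debug, PartialEq)]
enum VOBitUpdateStrategy {
    ClearAndReconstruct,
    CopyFromMarkBits,
}

/// Hypothetical stand-in: in the real code this would be derived from the
/// VM binding's metadata specs (whether the mark bit is on the side or in the header).
const MARK_BIT_ON_SIDE: bool = true;

const fn strategy() -> VOBitUpdateStrategy {
    if MARK_BIT_ON_SIDE {
        // Faster: the on-the-side mark bits can be bulk-copied into VO bits during Release.
        VOBitUpdateStrategy::CopyFromMarkBits
    } else {
        // Header mark bits cannot be bulk-copied, so clear VO bits before tracing
        // and reconstruct them while marking/forwarding.
        VOBitUpdateStrategy::ClearAndReconstruct
    }
}

fn main() {
    assert_eq!(strategy(), VOBitUpdateStrategy::CopyFromMarkBits);
}
```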
So the gist is that our metadata access costs too much? Hm.
Our current implementation of metadata access is sub-optimal, and there is much room for improvement. See: #840
Need to implement both strategies:
Other tasks:
This PR fixes the valid object (VO) bits metadata for Immix. Partially fixes #848
The expected result of this PR is that, for Immix and GenImmix, after each GC, the VO bit will be set for an address if and only if there is a live object at that address. For this reason, we can let Sanity GC verify that all live objects have their VO bits set.
Known issues:
This PR does not fix StickyImmix because the only user of conservative stack scanning (Ruby) still cannot use StickyImmix, and therefore cannot test it. Update: I added StickyImmix support and it is tested on OpenJDK with Sanity GC. However, we still need to test it with a VM that uses conservative GC (such as Ruby) in the future.