Fix VO bits for Immix #849
Conversation
Implemented the ReconstructByTracing strategy
This PR only makes changes to the VO bit in the immix space, while the VO bit is global metadata and applies to all the policies. Some changes, such as removing set_vo_bit from forward_object, will break other policies. Some changes, such as introducing a VO bit update strategy for immix only, will make the update strategy inconsistent among different policies, which makes the update strategy less useful (we don't expect a binding to check which space an object lives in before calling is_mmtk_object).
Getting all these correct would need some non-trivial effort. The following is the minimum that is needed for this PR:
- We need a complete and coherent design for the update strategy: the update strategy needs to be global and work for all the policies. We need to be clear about what each policy does in terms of complying with the strategy. We should support different strategies for different policies if it is simple, or leave some unimplemented if they are non-trivial.
- We use assertions to check VO bits (any object that is traced/reachable should have its VO bit set), and in the OpenJDK binding, we test a few plans with this assertion. With this PR (and the update strategy), we should have some assertions like this (e.g. the assertion you put in sanity GC looks good, but in the OpenJDK tests we do not run sanity GC), and the OpenJDK tests should pass with those assertions. A sketch of such an assertion is shown at the end of this comment.
I might have missed some of your discussions on this topic. Let me know if what I said contradicts what you have discussed earlier.
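For illustration, the kind of assertion described above could look like the following minimal, self-contained sketch. All types and names here are hypothetical stand-ins, not the actual mmtk-core API: whenever tracing reaches an object, its VO bit must already be set.

```rust
/// Toy on-the-side VO bit metadata: one bit per possible object-start slot.
/// Everything here is a hypothetical stand-in, not the real mmtk-core types.
struct VoBitmap {
    bits: Vec<u8>,
}

impl VoBitmap {
    fn is_set(&self, slot: usize) -> bool {
        self.bits[slot / 8] & (1u8 << (slot % 8)) != 0
    }
}

/// Called for every object reached during tracing: any traced/reachable object
/// should already have its VO bit set, otherwise the VO bit metadata is out of sync.
fn on_object_traced(vo: &VoBitmap, slot: usize) {
    debug_assert!(
        vo.is_set(slot),
        "traced object at slot {} does not have its VO bit set",
        slot
    );
}

fn main() {
    let vo = VoBitmap { bits: vec![0b0000_0001] };
    on_object_traced(&vo, 0); // passes: slot 0 has its VO bit set
}
```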
```rust
/// * `start`: The starting address of a memory region.
/// * `size`: The size of the memory region.
/// * `other`: The other metadata to copy from.
pub fn bcopy_metadata_contiguous(&self, start: Address, size: usize, other: &SideMetadataSpec) {
```
We need some test cases for this. Also, it needs to update the side metadata sanity table (once you have some tests and run them with extreme_assertions, you should find some assertion failures).
Yes. I'll add some unit tests for bcopy_metadata_contiguous.
I have just added a unit test for bcopy_metadata_contiguous.
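The actual test is not quoted in this thread. Purely as an illustration, a unit test for a bulk metadata copy typically fills a source buffer, copies one region, and checks that only that region changed. The names below are made up for the example; the real bcopy_metadata_contiguous works on SideMetadataSpec and Address ranges.

```rust
/// A toy byte-granularity bulk copy, only to illustrate the shape of such a test.
fn bulk_copy(dst: &mut [u8], src: &[u8], start: usize, size: usize) {
    dst[start..start + size].copy_from_slice(&src[start..start + size]);
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn copies_only_the_requested_region() {
        let src = vec![0xFFu8; 16];
        let mut dst = vec![0u8; 16];
        bulk_copy(&mut dst, &src, 4, 8);
        assert!(dst[..4].iter().all(|&b| b == 0)); // bytes before the region are untouched
        assert!(dst[4..12].iter().all(|&b| b == 0xFF)); // the copied region matches the source
        assert!(dst[12..].iter().all(|&b| b == 0)); // bytes after the region are untouched
    }
}
```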
This is a good point. I'll make sure VO bits remain usable for existing non-immix plans.
The reason why the "VO bits updating strategies" are only applicable to Immix is that currently Immix (and its variants) is the only GC algorithm that may leave dead objects in the memory without cleaning them up. On the contrary,
Immix is special because we only sweep completely free lines. If a line contains both live and dead objects, and it is not part of the defrag source, we don't sweep it. During tracing, Immix traces through live objects in such lines, but unlike MarkSweep, Immix does not know where the dead objects are. To work around this, we either clear all VO bits before tracing and reconstruct VO bits when we mark or forward objects (because we do know where live objects are), or we copy the on-the-side mark bits over to the VO bits, because the on-the-side mark bits are always cleared before GC. Using either one of those two strategies, dead objects in such lines will have VO bits of 0.
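As a minimal sketch of the two strategies described above (toy bitmaps and made-up names; not the code in this PR), the difference is only in when VO bits are written:

```rust
/// Toy side metadata: one byte per object slot, for illustration only.
struct Bitmaps {
    vo: Vec<u8>,   // valid-object (VO) bits
    mark: Vec<u8>, // on-the-side mark bits, always cleared before each GC
}

enum VOBitUpdateStrategy {
    ClearAndReconstruct,
    CopyFromMarkBits,
}

impl Bitmaps {
    /// Prepare phase, before tracing.
    fn prepare(&mut self, strategy: &VOBitUpdateStrategy) {
        self.mark.fill(0); // mark bits are always cleared before GC
        if let VOBitUpdateStrategy::ClearAndReconstruct = strategy {
            self.vo.fill(0); // clear all VO bits; rebuild them while tracing
        }
    }

    /// Called when tracing marks (or forwards) a live object.
    fn on_object_marked(&mut self, strategy: &VOBitUpdateStrategy, slot: usize) {
        self.mark[slot] = 1;
        if let VOBitUpdateStrategy::ClearAndReconstruct = strategy {
            self.vo[slot] = 1; // reconstruct the VO bit for this live object
        }
    }

    /// Release phase, after tracing.
    fn release(&mut self, strategy: &VOBitUpdateStrategy) {
        if let VOBitUpdateStrategy::CopyFromMarkBits = strategy {
            // The mark bitmap now records exactly the live objects,
            // so it can be copied over the VO bitmap sequentially.
            self.vo.copy_from_slice(&self.mark);
        }
        // With either strategy, dead objects in unswept lines end up with VO bits of 0.
    }
}

fn main() {
    let strategy = VOBitUpdateStrategy::CopyFromMarkBits;
    let mut b = Bitmaps { vo: vec![1, 1, 1], mark: vec![0, 0, 0] };
    b.prepare(&strategy);
    b.on_object_marked(&strategy, 0); // only slot 0 is reached during tracing
    b.release(&strategy);
    assert_eq!(b.vo, vec![1, 0, 0]); // slots 1 and 2 (dead) lost their VO bits
}
```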
Can you point me to the "assertions to check vo bits" you mentioned? The
This is not true. As you said, MarkCompact leaves dead objects in the memory. MarkSweep with lazy sweeping also leaves dead objects in the memory. These all affect VO bits. But anyway, the important thing is not the update strategy. It is about what semantics we allow, and the semantics should apply to all the policies. For example, in this PR, if we set the update strategy to
The assertion in
The key part is "not cleaning them up". I actually meant metadata, not the content of the memory. MarkCompact leaves dead objects in the memory just like SemiSpace does. If an object is in the rear portion of the MarkCompactSpace, we leave it behind. But after compaction, we know the region of MarkCompactSpace that contains live objects, and we can clear the metadata (including VO bits) for the rear portion that contains dead objects. It's just like SemiSpace releasing the from-space. We may leave the memory content intact, but we clear its metadata and mark the memory region as available.
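As a rough illustration of that point (toy code, not the real side metadata API): when a whole region is released, its side metadata can be cleared in bulk while the memory contents are left untouched.

```rust
/// Toy side metadata covering a space: one VO byte per object slot.
struct SideMetadata {
    vo: Vec<u8>,
}

impl SideMetadata {
    /// Bulk-clear the metadata for a region being released wholesale,
    /// e.g. a SemiSpace from-space, or the rear portion of a MarkCompact space
    /// after compaction. The objects' memory itself may be left intact.
    fn clear_region(&mut self, start_slot: usize, num_slots: usize) {
        self.vo[start_slot..start_slot + num_slots].fill(0);
    }
}

fn main() {
    let mut meta = SideMetadata { vo: vec![1; 8] };
    meta.clear_region(4, 4); // release the rear half of the space
    assert_eq!(meta.vo, vec![1, 1, 1, 1, 0, 0, 0, 0]);
}
```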
If lazy sweeping is the default, it may explain why I observed some dead objects still having VO bits set in the EndOfGC phase. Currently, I am testing the VO bits by recording dead objects in one GC (by recording the finalised objects in Ruby, because Ruby never resurrects any objects), and injecting them in a subsequent GC. I added an assertion that "in the EndOfGC phase of a GC, the finalised objects recorded in this GC should not have VO bits set", and that failed.
For MarkSweep, MarkCompact, SemiSpace and GenCopy, the VO bits are always available during tracing. Their most efficient updating strategies happen to preserve VO bits during tracing. Anyway, if the VM binding sets
In ImmixSpace, the assertion also exists in
Add a helper function in rust_util, instead.
I updated the semantics of VO bits so that a VO bit is set from when the object is allocated until the object is considered dead by the GC. The reason is that if an object is dead, it will not be reached again by the mutator or by the GC, and it will contain stale data, including dangling references (not forwarded, because the GC never visits that object again) and other data that no longer makes sense. If VO bits are used for conservative stack scanning, and VO bits are set on some dead objects, those dead objects may be brought back to life. If the GC traces from those objects through dangling edges, it will crash. So we must clear the VO bits once the GC can no longer reach that object. I think the reason why I still observe dead objects having VO bits set during
Update comment
Co-authored-by: Yi Lin <qinsoon@gmail.com>
If I merge #830 and turn on the "eager_sweeping" feature, I no longer observe dead objects whose VO bits remain set at EndOfGC. For this PR, I'll make the "vo_bit" feature depend on the "eager_sweeping" feature.
Otherwise native mark-sweep will not clear VO bits for all dead objects during GC.
I added support for StickyImmix for both strategies. The
It is the proportion of execution time doing GC. Have a look at the un-normalised data. GenImmix, for some reason, is spending about 90% of the time doing GC. Therefore, the STW time dominates the total time. From Rifat's paper, the STW time is greatly impacted by conservative GC, but the total time is not... as long as the STW time doesn't dominate. BTW, here is a preview of the results for other benchmarks.
This time the heap size is between 2x and 3x. One possibility that ClearAndReconstruct is slower may be that it is clearing more blocks than necessary. Using the ClearAndReconstruct strategy, it clears the VO bits for whole chunks, regardless of whether the blocks (and lines) inside are allocated or not. (Link: https://github.com/wks/mmtk-core/blob/336908526459dc9d1b66e779ada48eaf81d9f61c/src/policy/immix/immixspace.rs#L1088) On the other hand, when using the CopyFromMarkBits strategy, it only copies VO bits for marked lines. Maybe the bottleneck is simply the memory bandwidth. I'll look into it tomorrow.
You definitely need more invocations to reduce the effect of possible noise. I usually use at least 10 invocations when I try to draw any conclusion (20 or 40 invocations is also very common), and use 5 invocations when I am in the process of debugging performance. You can't tell much from 1 invocation, as the data you are looking at could be just a very noisy run.
We thought VO bits were cheap based on Yiluo's report and the PR #390. However, as @wks found out, the VO bits were not cleared promptly, and dead objects may still have their VO bits set. This PR fixes that. So our understanding about the cost of VO bits was based on an inaccurate implementation. We do not know the actual cost of VO bits before this PR. The cost of

So I would think 4% overhead might be reasonable, but it certainly makes VO bits less appealing. If you believe there is any performance mistake in the PR, please leave a comment. Otherwise, the remaining question is how we can optimise it, and how much effort we put into optimising it.
@wks Yes, I agree GenImmix's total time is being dominated by its STW time, which is not tuned. But then even in the STW time we have a ~14% overhead for the
From Rifat's paper Figure 4(a), the overhead of adding conservatism for 2x min heap is 2.7%. So I guess it's not a big stretch that the PR above is adding 4% overhead for StickyImmix and 1% overhead for Immix. But Rifat's paper implemented
TL;DR: The culprit that made ClearAndReconstruct slow is setting VO bits during tracing. It may be a result of cache misses. I'll not fix it in this PR. We should optimise metadata access in the future, especially use
If I force the CopyFromMarkBits strategy to set VO bits when marking objects, it will have the same cost as ClearAndReconstruct. Replacing

(p.s. I tested with a variant of GCBench which allocates a long-lived tree and triggers GC 100 times. The execution time is almost 100% GC time.)

When an object is forwarded, both strategies have to set the VO bit for the to-space object. There is no performance difference. I think the reason why it takes so much time to set VO bits while tracing is that the on-the-side bitmap may result in cache misses if it is accessed randomly. Similar to VO bits, the operation to atomically set the mark bit (i.e.

I think it is just the nature of the on-the-side metadata that makes the ClearAndReconstruct strategy slower than the CopyFromMarkBits strategy. ClearAndReconstruct writes to two bitmaps simultaneously during tracing, while CopyFromMarkBits writes to one bitmap (it occasionally writes VO bits when forwarding, but Immix doesn't move many objects), and then copies it to the other during the Release stage, and the copying is sequential, which means it is cache-friendly. I'll not attempt to solve this problem in this PR because it is just the nature of metadata access.

(BTW, if I also modify

I think there may be multiple reasons why Rifat's result showed lower overhead.
```java
public static void setNewBit(ObjectReference object) {
  Address address = VM.objectModel.refToAddress(object);
  Address base = getMetaDataBase(address).plus(Chunk.NEW_DATA_OFFSET);
  Word index = address.toWord().rshl(Chunk.OBJECT_LIVE_SHIFT).and(Chunk.NEW_DATA_BIT_MASK);
  base.atomicSetBit(index);
}
```

It uses the atomicSetBit method directly. On the contrary, in mmtk-core, the corresponding operation goes through fetch_or_atomic:

```rust
pub fn fetch_or_atomic<T: MetadataValue>(
    &self,
    data_addr: Address,
    val: T,
    order: Ordering,
) -> T {
    self.side_metadata_access::<T, _, _, _>(
        data_addr,
        Some(val),
        || {
            let meta_addr = address_to_meta_address(self, data_addr);
            if self.log_num_of_bits < 3 {
                let lshift = meta_byte_lshift(self, data_addr);
                let mask = meta_byte_mask(self) << lshift;
                // We do not need to use fetch_ops_on_bits(), we can just set irrelevant bits to 0, and do fetch_or
                let rhs = (val.to_u8().unwrap() << lshift) & mask;
                let old_raw_byte =
                    unsafe { <u8 as MetadataValue>::fetch_or(meta_addr, rhs, order) };
                let old_val = (old_raw_byte & mask) >> lshift;
                FromPrimitive::from_u8(old_val).unwrap()
            } else {
                unsafe { T::fetch_or(meta_addr, val, order) }
            }
        },
        |_old_val| {
            #[cfg(feature = "extreme_assertions")]
            sanity::verify_update::<T>(self, data_addr, _old_val, _old_val.bitor(val))
        },
    )
}
```

Those
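For comparison, a direct atomic bit-set in Rust can be nearly as lean as the JikesRVM version above. The sketch below is illustrative only: it hard-codes a one-bit-per-slot layout and omits the generic dispatch and extreme_assertions bookkeeping that side_metadata_access provides.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Set one bit in an on-the-side bitmap with a single atomic fetch_or.
/// `bitmap` is the metadata area; `slot` identifies the object (one bit per slot).
fn set_bit_atomic(bitmap: &[AtomicU8], slot: usize) {
    let byte = slot / 8;
    let mask = 1u8 << (slot % 8);
    bitmap[byte].fetch_or(mask, Ordering::SeqCst);
}

fn main() {
    let bitmap: Vec<AtomicU8> = (0..4).map(|_| AtomicU8::new(0)).collect();
    set_bit_atomic(&bitmap, 10); // sets bit 2 of byte 1
    assert_eq!(bitmap[1].load(Ordering::SeqCst), 0b0000_0100);
}
```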
Plotty links for Immix and StickyImmix

This time I ran Immix and StickyImmix for multiple benchmarks. Heap size is between 2x and 3x (running runbms ... 8 3).
- STW time, Immix, normalised to build2
- Total time, Immix, normalised to build2
- STW time, StickyImmix, normalised to build2
- Total time, StickyImmix, normalised to build2

For Immix, the total time overhead is OK for all benchmarks. For StickyImmix, h2o and xalan are outliers. h2o has almost a 70% increase in total time, but its increase in STW time is not that significant. The mutator time is greatly increased. This is a bit disturbing.
Preliminary experiments show that the abnormal mutator time of "h2o" and "xalan" is not specific to StickyImmix. It is reproducible with GenCopy, too, but not SemiSpace. With GenCopy, enabling "vo_bit" and running "h2o" almost doubles the mutator time, too, just like StickyImmix. I guess it has something to do with the interaction between generational GC and the VO bits metadata. I also tried multiple heap sizes and the problem is the same for all heap sizes. I am running more invocations to get more data points.
Looking at your results, it seems there is a slowdown in the mutator time in all the benchmarks for both VO bit strategies. For most benchmarks, it is about 5% or less, but it seems visible.
The slowdown may come from the invocations of
Running h2o at different heap sizes using SemiSpace and GenCopy: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-06-21-Wed-170232&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^hfac& It looks like the mutator time is constant, while both the number of GCs and the STW time decrease as the heap size goes up.
Yeah. Right. You can run
Looks good. Just some minor issues.
```rust
/// Select a strategy for the VM. It is a `const` function so it always returns the same strategy
/// for a given VM.
const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
```
We can use CopyFromMarkBits as the default if the mark bit is on the side, as it is faster and complies with the semantics in both cases. We only need to use ClearAndReconstruct if the mark bit is in the header.
Done. I let it simply return CopyFromMarkBits for now, and mentioned in the comments when we may need to choose other strategies.
We should use ClearAndReconstruct when the mark bit is in the header. Our dummy VM tests crashed because it uses the header mark bit, and validation failed for CopyFromMarkBits.
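A sketch of the selection logic discussed here might look like the following. MARK_BIT_ON_SIDE is a hypothetical stand-in for querying the VM binding's mark-bit metadata spec, and the real function is generic over the VM binding.

```rust
#[derive(Debug, PartialEq)]
enum VOBitUpdateStrategy {
    ClearAndReconstruct,
    CopyFromMarkBits,
}

/// Hypothetical stand-in: in the real code this would be derived from the
/// VM binding's metadata specs (whether the mark bit is on the side or in the header).
const MARK_BIT_ON_SIDE: bool = true;

const fn strategy() -> VOBitUpdateStrategy {
    if MARK_BIT_ON_SIDE {
        // Faster: the on-the-side mark bits can be bulk-copied into VO bits during Release.
        VOBitUpdateStrategy::CopyFromMarkBits
    } else {
        // Header mark bits cannot be bulk-copied, so clear VO bits before tracing
        // and reconstruct them while marking/forwarding.
        VOBitUpdateStrategy::ClearAndReconstruct
    }
}

fn main() {
    assert_eq!(strategy(), VOBitUpdateStrategy::CopyFromMarkBits);
}
```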
So the gist is that our metadata access costs too much? Hm.
Our current implementation of metadata access is sub-optimal, and there is much room for improvement. See: #840
Need to implement both strategies:
Other tasks:
This PR fixes the valid object (VO) bits metadata for Immix. Partially fixes #848
The expected result of this PR is that, for Immix and GenImmix, after each GC, the VO bit will be set for an address if and only if there is a live object at that address. For this reason, we can let Sanity GC verify that all live objects have their VO bits set.
Known issues:
This PR does not fix StickyImmix because the only user of conservative stack scanning (Ruby) still cannot use StickyImmix, and therefore cannot test it. Update: I added StickyImmix support and it is tested on OpenJDK with Sanity GC. However, we still need to test it with a VM that uses conservative GC (such as Ruby) in the future.