Fix VO bits for Immix #849

Merged: 28 commits, Jun 24, 2023
Conversation

@wks (Collaborator) commented Jun 7, 2023

Need to implement both strategies:

  • Clear after stack scanning and reconstruct during tracing (default)
  • Copy mark bits to VO bits (alternative, making VO bits available during tracing)
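
As a rough sketch, the two strategies could be expressed as an enum like the following (hedged; the variant names match the discussion later in this PR, but the exact definition in the code may differ):

    // Sketch of the two VO bit update strategies discussed in this PR.
    pub enum VOBitUpdateStrategy {
        // Clear all VO bits after stack scanning; set the bit again for each
        // object as it is marked or forwarded. VO bits are unavailable during
        // tracing with this strategy.
        ClearAndReconstruct,
        // Keep VO bits intact during tracing (so they remain available), and
        // bulk-copy the on-the-side mark bits over the VO bits at Release time.
        CopyFromMarkBits,
    }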

Other tasks:

  • Need to ensure the changes don't break existing spaces, including SemiSpace, GenCopy and MarkCompact.
  • Support StickyImmix
  • Add unit tests for copying metadata

This PR fixes the valid object (VO) bits metadata for Immix. Partially fixes #848

The expected result of this PR is that, for Immix and GenImmix, after each GC the VO bit is set for an address if and only if there is a live object at that address. For this reason, we can let Sanity GC verify that all live objects have their VO bits set.
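
For illustration, the Sanity GC check could take roughly this form (a hedged sketch; the helper name is_vo_bit_set and the call site are simplified from the actual code):

    // Sketch: called by the sanity tracer for every reachable object.
    #[cfg(feature = "vo_bit")]
    fn sanity_check_vo_bit(object: ObjectReference) {
        assert!(
            crate::util::metadata::vo_bit::is_vo_bit_set(object),
            "Live object {:?} does not have its VO bit set",
            object
        );
    }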

Known issues:

  1. Originally, this PR did not fix StickyImmix, because the only user of conservative stack scanning (Ruby) still cannot use StickyImmix and therefore cannot test it. I have since added StickyImmix support, and it is tested on OpenJDK with Sanity GC. However, we still need to test it with a VM that uses conservative GC (such as Ruby) in the future.
  2. When using MarkSweep, some addresses of dead objects still have VO bits set. We need to investigate this, but it is beyond the scope of this PR.

@wks wks marked this pull request as ready for review June 14, 2023 12:46
@wks wks requested a review from qinsoon June 14, 2023 12:46
@qinsoon (Member) left a comment

This PR only makes changes to the VO bit in the Immix space, while the VO bit is global metadata that applies to all policies. Some changes, such as removing set_vo_bit from forward_object, will break other policies. Others, such as introducing a VO bit update strategy only for Immix, will make the update strategy inconsistent among policies, which makes it less useful (we don't expect a binding to check which space an object lives in before calling is_mmtk_object).

Getting all of this correct will need non-trivial effort. The following is the minimum needed for this PR:

  • We need to have a complete and coherent design for the update strategy: the update strategy needs to be global and work for all policies. We need to be clear about what each policy does in terms of complying with the strategy. We should support different strategies for different policies if that is simple, or leave some unimplemented if they are non-trivial.
  • We use assertions to check VO bits (any object that is traced/reachable should have its VO bit set), and in the OpenJDK binding, we test a few plans with this assertion. With this PR (and the update strategy), we should have some assertions like this (e.g. the assertion you put in sanity GC looks good, but in the OpenJDK tests we do not run sanity GC), and the OpenJDK tests should pass with those assertions.

I might have missed some of your discussions on this topic. Let me know if what I said contradicts with what you have discussed earlier.

    /// * `start`: The starting address of a memory region.
    /// * `size`: The size of the memory region.
    /// * `other`: The other metadata to copy from.
    pub fn bcopy_metadata_contiguous(&self, start: Address, size: usize, other: &SideMetadataSpec) {
Member:

We need some test cases for this. It also needs to update the side metadata sanity table (once you have some tests and run them with extreme_assertions, you should find some assertion failures).

Collaborator Author:

Yes. I'll add some unit tests for bcopy_metadata_contiguous.

Collaborator Author:

I have just added a unit test for bcopy_metadata_contiguous.

@wks (Collaborator Author) commented Jun 15, 2023

> This PR only makes changes to the VO bit in the Immix space, while the VO bit is global metadata that applies to all policies. Some changes, such as removing set_vo_bit from forward_object, will break other policies.

This is a good point. I'll make sure VO bits remain usable for existing non-Immix plans.

> Others, such as introducing a VO bit update strategy only for Immix, will make the update strategy inconsistent among policies, which makes it less useful (we don't expect a binding to check which space an object lives in before calling is_mmtk_object).
>
> Getting all of this correct will need non-trivial effort. The following is the minimum needed for this PR:
>
> * We need to have a complete and coherent design for the update strategy: the update strategy needs to be global and work for all policies. We need to be clear about what each policy does in terms of complying with the strategy. We should support different strategies for different policies if that is simple, or leave some `unimplemented` if they are non-trivial.

The reason why the "VO bits updating strategies" are only applicable to Immix is that currently Immix (and its variants) is the only GC algorithm that may leave dead objects in memory without cleaning them up. In contrast:

  • MarkSweep explicitly sweeps cells that contain dead objects. At that time, we clear the VO bits.
  • SemiSpace and GenCopy are fully evacuating GCs. All dead objects are in the from-space or the nursery, and the entire CopySpaces for the from-space and the nursery are released after GC. Meanwhile, VO bits for to-space objects are gradually set during evacuation. This makes the "updating strategies" unnecessary. It is unnecessary to clear VO bits before tracing because they can be cleared after tracing. It is unnecessary to copy from mark bits because the to-space is a fresh space without any objects, so the VO bits are all 0 for the to-space before GC.
  • MarkCompact is a compacting GC. It effectively reconstructs all the objects in the MarkCompactSpace, moving all objects to one end. All dead objects are either overwritten (if in the front portion of the space) or abandoned (if in the rear portion). As a result, the VO bits for MarkCompactSpace should be cleared and reconstructed during compaction (or during CalculateForward, or the second transitive closure, i.e. whenever we know the destination of all objects).

Immix is special because we only sweep completely free lines. If a line contains both live and dead objects, and it is not part of the defrag source, we don't sweep it. During tracing, Immix traces through live objects in such lines, but unlike MarkSweep, Immix does not know where dead objects are. To work around this, we either clear all VO bits before tracing and reconstruct them when we mark or forward objects (because we do know where live objects are), or we copy the on-the-side mark bits over to the VO bits (because the on-the-side mark bits are always cleared before GC). With either strategy, dead objects in such lines will have VO bits of 0, while live objects will have VO bits of 1.
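
To make the two options concrete, here is a rough sketch of where each strategy does its work in the GC cycle (hedged pseudocode; helper names like clear_vo_bits_for_space and bcopy_vo_bits_from_mark_bits are illustrative, not the exact mmtk-core API):

    // Strategy 1: ClearAndReconstruct.
    fn prepare_clear_and_reconstruct(space: &ImmixSpace) {
        // After conservative stack scanning, bulk-clear all VO bits.
        clear_vo_bits_for_space(space); // illustrative helper
    }
    fn on_object_marked(object: ObjectReference) {
        // This object is live, so restore its VO bit during tracing.
        set_vo_bit(object);
    }

    // Strategy 2: CopyFromMarkBits.
    fn release_copy_from_mark_bits(space: &ImmixSpace) {
        // Mark bits were cleared before GC and set only on live objects, so
        // copying them over the VO bits yields "VO bit set iff object live".
        bcopy_vo_bits_from_mark_bits(space); // illustrative helper
    }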

> * We use assertions to check VO bits (any object that is traced/reachable should have its VO bit set), and in the OpenJDK binding, we test a few plans with this assertion. With this PR (and the update strategy), we should have some assertions like this (e.g. the assertion you put in sanity GC looks good, but in the OpenJDK tests we do not run sanity GC), and the OpenJDK tests should pass with those assertions.

Can you point me to the "assertions to check vo bits" you mentioned? The vo_bit_helper::on_trace_object method asserts that the VO bit is set if it is available, but that assertion happens just before the object is traced, not after the reachability of objects is established.

> I might have missed some of your discussions on this topic. Let me know if what I said contradicts with what you have discussed earlier.

@qinsoon (Member) commented Jun 15, 2023

> The reason why the "VO bits updating strategies" are only applicable to Immix is that currently Immix (and its variants) is the only GC algorithm that may leave dead objects in memory without cleaning them up.

This is not true. As you said, MarkCompact leaves dead objects in the memory. MarkSweep with lazy sweeping also leaves dead objects in the memory. These all affect VO bits.

But anyway, the important thing is not about update strategy. It is about what semantics we allow, and the semantics should apply to all the policies. For example, in this PR, if we set update strategy to CopyMarkBits, that means we can use VO bits during tracing for immix objects. But for objects in other spaces, are VO bits available? This is unclear.

> * We use assertions to check VO bits (any object that is traced/reachable should have its VO bit set), and in the OpenJDK binding, we test a few plans with this assertion. With this PR (and the update strategy), we should have some assertions like this (e.g. the assertion you put in sanity GC looks good, but in the OpenJDK tests we do not run sanity GC), and the OpenJDK tests should pass with those assertions.
>
> Can you point me to the "assertions to check vo bits" you mentioned? The vo_bit_helper::on_trace_object method asserts that the VO bit is set if it is available, but that assertion happens just before the object is traced, not after the reachability of objects is established.

The assertion in trace_object in all policies is what I meant.

@wks (Collaborator Author) commented Jun 15, 2023

> This is not true. As you said, mark compact leaves dead objects in the memory...

The key part is "not cleaning them up". I actually meant metadata, not the content of the memory.

MarkCompact leaves dead objects in the memory just like SemiSpace does. If an object is in the rear portion of the MarkCompactSpace, we leave it behind. But after compaction, we know the region of MarkCompactSpace that contains live objects, and we can clear the metadata (including VO bits) for the rear portion that contains dead objects. It's just like SemiSpace releasing the from-space. We may leave the memory contents intact, but we clear the metadata and mark the memory region as available.

> MarkSweep with lazy sweeping also leaves dead objects in the memory. These all affect VO bits.

If lazy sweeping is the default, that may explain why I observed some dead objects still having their VO bits set in the EndOfGC phase.

Currently, I am testing the VO bits by recording dead objects in one GC (by recording the finalised objects in Ruby, because Ruby never resurrects any objects) and injecting them into a subsequent GC. I added an assertion that "in the EndOfGC phase of a GC, the finalised objects recorded in this GC should not have VO bits set", and it failed.

> But anyway, the important thing is not about update strategy. It is about what semantics we allow, and the semantics should apply to all the policies. For example, in this PR, if we set update strategy to CopyMarkBits, that means we can use VO bits during tracing for immix objects. But for objects in other spaces, are VO bits available? This is unclear.

For MarkSweep, MarkCompact, SemiSpace and GenCopy, the VO bits are always available during tracing. Their most efficient updating strategies happen to preserve VO bits during tracing.

Anyway, if the VM binding sets const MAKE_VO_BITS_AVAILABLE_IN_TRACING: bool to true, the semantics are trivially guaranteed for those plans, too.

> The assertion in trace_object in all policies is what I meant.

In ImmixSpace, the assertion also exists in trace_object_without_moving and trace_object_with_opportunistic_copy. When using the ClearAndReconstruct strategy, the VO bits are not available during tracing, so those assertions are not performed. If the strategy is CopyFromMarkBits, the assertions are performed.

@wks (Collaborator Author) commented Jun 15, 2023

I updated the semantics of VO bits so that the bit is set from the time the object is allocated until the object is considered dead by the GC. The reason is that if an object is dead, it will not be reached again by the mutator or by the GC, and it will contain stale data, including dangling references (not forwarded, because the GC never visits that object again) and other data that no longer makes sense. If VO bits are used for conservative stack scanning, and VO bits are set on some dead objects, those dead objects may be brought back to life. If the GC traces from those objects through dangling edges, it will crash. So we must clear the VO bit once the GC can no longer reach an object.
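
For illustration, conservative stack scanning filters candidate words through the VO bit roughly like this (a hedged sketch; the scanning loop and the is_mmtk_object signature are simplified and may differ between mmtk-core versions):

    // Sketch: treat each word on the stack as a potential object reference.
    fn scan_stack_conservatively(stack_words: &[Address], roots: &mut Vec<ObjectReference>) {
        for &addr in stack_words {
            // is_mmtk_object consults the VO bit. A stale VO bit on a dead
            // object would wrongly resurrect that object here, and tracing
            // through its dangling references could crash the GC.
            if memory_manager::is_mmtk_object(addr) {
                roots.push(ObjectReference::from_raw_address(addr));
            }
        }
    }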

I think the reason why I still observe dead objects having VO bits set during EndOfGC when using MarkSweep is lazy sweeping. But even with the "eager_sweeping" feature enabled, some dead objects still have VO bits set. I'll look deeper into the behaviour of MarkSweepSpace.

@wks (Collaborator Author) commented Jun 16, 2023

If I merge #830 and turn on the "eager_sweeping" feature, I no longer observe dead objects whose VO bits remain set at EndOfGC.

For this PR, I'll let the "vo_bit" feature depend on the "eager_sweeping" feature.
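
In Cargo.toml terms, the feature dependency would look like this (a sketch; mmtk-core's actual manifest has more features and comments):

    [features]
    eager_sweeping = []
    # "vo_bit" requires eager sweeping; otherwise native mark-sweep will not
    # clear VO bits for all dead objects during GC.
    vo_bit = ["eager_sweeping"]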

wks added 2 commits June 16, 2023 17:06
Commit message excerpt: "Otherwise native mark-sweep will not clear VO bits for all dead objects during GC."
@wks (Collaborator Author) commented Jun 16, 2023

I added support for StickyImmix for both strategies. The CopyFromMarkBits strategy trivially works for StickyImmix. The ClearAndReconstruct strategy needs to be implemented for StickyImmix by clearing only unmarked lines (or blocks, if BLOCK_ONLY == true). I am exploiting the fact that StickyImmix never allocates nursery objects in lines occupied by mature objects.
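
A rough sketch of the StickyImmix variant of ClearAndReconstruct (hedged; the block/line iteration and helper names are simplified from the actual mmtk-core types):

    // For a nursery GC in StickyImmix, mature live lines keep their VO bits;
    // we only clear VO bits for unmarked lines (or unmarked blocks when
    // BLOCK_ONLY == true).
    fn clear_vo_bits_for_nursery_gc(block: Block, line_mark_state: u8) {
        if BLOCK_ONLY {
            if !block.is_marked() { // illustrative predicate
                clear_vo_bits_in_region(block.start(), Block::BYTES);
            }
        } else {
            for line in block.lines() {
                if !line.is_marked(line_mark_state) {
                    clear_vo_bits_in_region(line.start(), Line::BYTES);
                }
            }
        }
    }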

@wks (Collaborator Author) commented Jun 20, 2023

> Looking at CopyFromMarkBits, it has a 1-4% slowdown in time for Immix and StickyImmix and a 17% slowdown for GenImmix. Why is the slowdown for GenImmix so high? Yes, GenImmix isn't tuned, but the costs are relative to build2, which is not tuned either. It's obviously not going to be free (given we're doing more work), so the 1-4% slowdown in time for {Sticky,}Immix makes sense, but have you figured out why GenImmix is slower? Also is 4% expected? It seems pretty high imo, but that may be the choice of benchmark coloring the results.

It is the proportion of execution time doing GC. Have a look at the un-normalised data. GenImmix, for some reason, is spending about 90% of the time doing GC. Therefore, the STW time dominates the total time. From Rifat's paper, the STW time is greatly impacted by conservative GC, but the total time is not... as long as the STW time doesn't dominate.

BTW, here is a preview of the results for other benchmarks.

This time the heap size is between 2x and 3x (running runbms ... ... 8 3; the last two numbers of runbms are a bit confusing for me, to be honest). But I only ran one iteration for each benchmark (running 20 iterations will take about 20 hours, and I just spent quite some time working out the min heap size on mole and vole). For lusearch and Immix, it is a 3% increase in total time. But the impact is higher for h2, h2o and spring using Immix. Those benchmarks also have a greater number of GCs when VO bits are enabled. I guess the VO bits metadata effectively reduces the effective heap size, causing the program to do more GC. The more GC it does, the more time it spends in STW.

One possible reason that ClearAndReconstruct is slower is that it clears more blocks than necessary. Using the ClearAndReconstruct strategy, it clears the VO bits for whole chunks, regardless of whether the blocks (and lines) inside are allocated or not. (Link: https://github.com/wks/mmtk-core/blob/336908526459dc9d1b66e779ada48eaf81d9f61c/src/policy/immix/immixspace.rs#L1088) On the other hand, when using the CopyFromMarkBits strategy, it only copies VO bits for marked lines. Maybe the bottleneck is simply the memory bandwidth. I'll look into it tomorrow.

@qinsoon (Member) commented Jun 21, 2023

> But I only ran one iteration for each benchmark (running 20 iterations will take about 20 hours, and I just spent quite some time working out the min heap size on mole and vole).

You definitely need more invocations to reduce the effect of possible noise. I usually use at least 10 invocations when I try to draw any conclusion (20 or 40 invocations is also very common), and 5 invocations when I am in the process of debugging performance. You can't tell much from 1 invocation, as the data you are looking at could be just a very noisy run.

@qinsoon (Member) commented Jun 21, 2023

> Also is 4% expected? It seems pretty high imo, but that may be the choice of benchmark coloring the results.

We thought VO bits were cheap based on Yiluo's report and PR #390. However, as @wks found out, the VO bits were not cleared promptly, and dead objects may still have their VO bits set. This PR fixes that. So our understanding of the cost of VO bits was based on an inaccurate implementation. We did not know the actual cost of VO bits before this PR.

The cost of ClearAndReconstruct should have been identical to what Rifat reported in the conservative RC Immix paper. However, what he reported is the performance difference between conservative RC Immix and exact RC Immix, and the overhead from conservatism is not trivial if you look at the GC time when the heap size is small. Moreover, the overhead of conservatism could come from multiple sources; maintaining the VO bit is just one of them. So, still, we do not know the exact cost of VO bits.

So I would think 4% overhead might be reasonable, but it certainly makes VO bits less appealing. If you believe there is any performance mistake in the PR, please leave a comment. Otherwise, the remaining question is how we can optimise it, and how much effort we put into optimizing it.

@k-sareen (Collaborator):
@wks Yes I agree GenImmix's total time is being dominated by its STW time which is not tuned. But then even in the STW time we have a ~14% overhead for the CopyFromMarkBits strategy relative to build 2. I don't quite understand this either. Is the idea that since it's not tuned + has >>> more GCs than the others, the actual small change in STW time by adding VO bits is magnified? I still feel something is off with GenImmix even discounting that it's not tuned.

@k-sareen (Collaborator):
> Also is 4% expected? It seems pretty high imo, but that may be the choice of benchmark coloring the results.
>
> We thought VO bits were cheap based on Yiluo's report and PR #390. However, as @wks found out, the VO bits were not cleared promptly, and dead objects may still have their VO bits set. This PR fixes that. So our understanding of the cost of VO bits was based on an inaccurate implementation. We did not know the actual cost of VO bits before this PR.

From Rifat's paper, Figure 4(a), the overhead of adding conservatism at 2x min heap is 2.7%. So I guess it's not a big stretch that the PR above adds 4% overhead for StickyImmix and 1% overhead for Immix. But Rifat's paper implemented ClearAndReconstruct instead of CopyFromMarkBits. GenImmix is a different story. We really need to run an experiment to work out the min nursery size. I feel like powers of 2 <= 16 should be fine, so we only need to try 2, 4, 8, and 16 MB. 16 MB might be too large, given some benchmarks have min heap sizes < 10 MB (like avrora with a min heap of 7 MB), but it is still worth testing.

@wks (Collaborator Author) commented Jun 21, 2023

TL;DR: The culprit that makes ClearAndReconstruct slow is setting VO bits during tracing. It may be a result of cache misses. I'll not fix it in this PR. We should optimise metadata access in the future, especially by using fetch_or instead of fetch_update or even cmpxchg.

perf shows that the main difference between ClearAndReconstruct and CopyFromMarkBits is on_object_marked. It is called when Immix marks an object (but does not forward). For ClearAndReconstruct, it also sets the VO bit of the marked object (to reconstruct the VO bits); for CopyFromMarkBits, it's a no-op (because the VO bit was not cleared).

perf shows that the sole operation SideMetadataSpec::store_atomic (inlined) that sets the VO bit in on_object_marked takes 15% of the STW time, among which the underlying fetch_update (inlined) takes up 9.9% of the STW time, among which the underlying compare_exchange_weak (inlined) takes up 7.3% of the STW time.

If I force the CopyFromMarkBits strategy to set VO bits when marking objects, it will have the same cost as ClearAndReconstruct.

Replacing fetch_update with fetch_or_atomic reduces the time of on_object_marked from 15% to 6.73%, but it is still visible in perf. By contrast, the work packet ClearVOBitsAfterPrepare is completely invisible in perf.
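
The difference boils down to a compare-exchange retry loop versus a single atomic OR instruction. A minimal sketch with std atomics (not the mmtk-core metadata API):

    use std::sync::atomic::{AtomicU8, Ordering};

    // What a fetch_update-based store does underneath: a CAS retry loop.
    fn set_bit_with_fetch_update(byte: &AtomicU8, mask: u8) {
        byte.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| Some(old | mask))
            .unwrap(); // the closure never returns None, so this cannot fail
    }

    // What a direct fetch_or does: a single atomic OR (e.g. LOCK OR on x86).
    fn set_bit_with_fetch_or(byte: &AtomicU8, mask: u8) {
        byte.fetch_or(mask, Ordering::SeqCst);
    }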

(p.s. I tested with a variant of GCBench which allocates a long-lived tree and triggers GC 100 times. The execution time is almost 100% GC time.)

When an object is forwarded, both strategies have to set the VO bit for the to-space object. There is no performance difference.

I think the reason why it takes so much time to set VO bits while tracing is that the on-the-side bitmap may result in cache misses if it is accessed randomly. Similar to VO bits, the operation to atomically set the mark bit (i.e. attempt_mark which is inlined) also takes up 14.6% of STW time, among which the underlying compare_exchange_metadata takes up 12.9% of the STW time.

I think it is just the nature of on-the-side metadata that makes the ClearAndReconstruct strategy slower than the CopyFromMarkBits strategy. ClearAndReconstruct writes to two bitmaps simultaneously during tracing, while CopyFromMarkBits writes to one bitmap during tracing (it occasionally writes VO bits when objects are forwarded, but Immix doesn't move many objects) and then copies it to the other bitmap during the Release stage; the copying is sequential, which means it is cache-friendly.

I'll not attempt to solve this problem in this PR because it is just the nature of metadata access.

(BTW, if I also modify attempt_mark to make it use fetch_or_metadata, it will reduce the time of attempt_mark to 13.3% of STW time, with the underlying fetch_or taking 7.7% of the STW time. If I assume the mark bits are on the side and call LOCAL_MARK_BIT_SPEC.extract_side_spec().fetch_or_atomic directly, it can further reduce the time of attempt_mark to 10% of STW time, with the underlying fetch_or taking 9.12% of the STW time. This means there is still room for improving the metadata access by using fetch_and and fetch_or, and some overhead within the MetadataSpec::fetch_or_metadata method which matches the metadata to test whether it is on the side or in the header.)

I think there may be multiple reasons why Rifat's result showed lower overhead.

  1. Architecture difference. The "3.4 GHz, 22 nm Core i7-4770 Haswell with 4 cores and 2-way SMT" machine Rifat used may be less susceptible to cache misses than {m,v}ole.moma, which are both "Ryzen 9 7950X Zen 4 (Family 25 Model 97)".
  2. Hand-crafted bitmap. Rifat made the bitmap specifically for his needs. As shown above, the metadata access functions may contain overhead. Here is the code from Rifat's branch:
  public static void setNewBit(ObjectReference object) {
    Address address = VM.objectModel.refToAddress(object);
    Address base = getMetaDataBase(address).plus(Chunk.NEW_DATA_OFFSET);
    Word index = address.toWord().rshl(Chunk.OBJECT_LIVE_SHIFT).and(Chunk.NEW_DATA_BIT_MASK);
    base.atomicSetBit(index);
  }

It uses the atomicSetBit operation to access the bitmap. It is a VM Magic which the compiler compiles to a LOCK BTS instruction on ia32, and it only involves a few additions and bit shifts. Therefore, it has minimal overhead.

In mmtk-core, by contrast, the set_vo_bit function is implemented with VO_BIT_SIDE_METADATA_SPEC.store_atomic, which is implemented with fetch_update, which uses cmpxchg underneath. Even if we change it to fetch_or_atomic, the metadata access is still relatively dynamic.

    pub fn fetch_or_atomic<T: MetadataValue>(
        &self,
        data_addr: Address,
        val: T,
        order: Ordering,
    ) -> T {
        self.side_metadata_access::<T, _, _, _>(
            data_addr,
            Some(val),
            || {
                let meta_addr = address_to_meta_address(self, data_addr);
                if self.log_num_of_bits < 3 {
                    let lshift = meta_byte_lshift(self, data_addr);
                    let mask = meta_byte_mask(self) << lshift;
                    // We do not need to use fetch_ops_on_bits(); we can just set irrelevant bits to 0 and do fetch_or
                    let rhs = (val.to_u8().unwrap() << lshift) & mask;
                    let old_raw_byte =
                        unsafe { <u8 as MetadataValue>::fetch_or(meta_addr, rhs, order) };
                    let old_val = (old_raw_byte & mask) >> lshift;
                    FromPrimitive::from_u8(old_val).unwrap()
                } else {
                    unsafe { T::fetch_or(meta_addr, val, order) }
                }
            },
            |_old_val| {
                #[cfg(feature = "extreme_assertions")]
                sanity::verify_update::<T>(self, data_addr, _old_val, _old_val.bitor(val))
            },
        )
    }

Those ifs and elses make the code generic w.r.t. the number of metadata bits per object etc. It also works if val is not all zeroes or all ones, which is nice to have for generic metadata access, but comes at a run-time cost.
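
For comparison, a hand-crafted VO bit fast path in the spirit of Rifat's atomicSetBit might look like this in Rust (purely illustrative; the layout constants and the metadata base address are hypothetical, not mmtk-core's actual layout):

    use std::sync::atomic::{AtomicU8, Ordering};

    // Hypothetical layout: one VO bit per 8-byte region of the heap.
    const LOG_BYTES_PER_VO_BIT_REGION: usize = 3;
    const METADATA_BASE: usize = 0x2000_0000; // hypothetical bitmap base address

    fn set_vo_bit_fast(object_addr: usize) {
        let bit_index = object_addr >> LOG_BYTES_PER_VO_BIT_REGION;
        let byte_addr = METADATA_BASE + (bit_index >> 3);
        let mask = 1u8 << (bit_index & 7);
        // A single atomic OR: the moral equivalent of LOCK BTS.
        let byte = unsafe { &*(byte_addr as *const AtomicU8) };
        byte.fetch_or(mask, Ordering::SeqCst);
    }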

@wks (Collaborator Author) commented Jun 21, 2023

Plotty links for Immix and StickyImmix

This time I ran Immix and StickyImmix for multiple benchmarks. Heap size is between 2x and 3x (running runbms ... 8 3).

STW time, Immix, normalised to build2 [image]

Total time, Immix, normalised to build2 [image]

STW time, StickyImmix, normalised to build2 [image]

Total time, StickyImmix, normalised to build2 [image]

For Immix, the total time overhead is OK for all benchmarks.

For StickyImmix, h2o and xalan are outliers. h2o has an almost 70% increase in total time, but its increase in STW time is not that significant. The mutator time is greatly increased. This is a bit disturbing.

Mutator time, StickyImmix, normalised to build2 [image]

@wks (Collaborator Author) commented Jun 21, 2023

Preliminary experiments show that the abnormal mutator time of "h2o" and "xalan" is not specific to StickyImmix. It is reproducible with GenCopy, too, but not with SemiSpace. With GenCopy, enabling "vo_bit" and running "h2o" almost doubles the mutator time, too, just like StickyImmix. I guess it has something to do with the interaction between generational GC and the VO bits metadata.

I also tried multiple heap sizes, and the problem is the same at all heap sizes.

I am running more invocations to get more data points.

@qinsoon (Member) commented Jun 22, 2023

> Preliminary experiments show that the abnormal mutator time of "h2o" and "xalan" is not specific to StickyImmix. It is reproducible with GenCopy, too, but not with SemiSpace. With GenCopy, enabling "vo_bit" and running "h2o" almost doubles the mutator time, too, just like StickyImmix. I guess it has something to do with the interaction between generational GC and the VO bits metadata.
>
> I also tried multiple heap sizes, and the problem is the same at all heap sizes.
>
> I am running more invocations to get more data points.

Looking at your results, it seems there is a slowdown for the mutator time in all the benchmarks for both VO bit strategies. For most benchmarks, it is around 5% or less, but it seems visible. h2o and xalan are just extremely worse. Is this correlated with the number of GCs in the benchmark?

@wks (Collaborator Author) commented Jun 22, 2023

> Looking at your results, it seems there is a slowdown for the mutator time in all the benchmarks for both VO bit strategies.

The slowdown may come from the invocations of set_vo_bit in post_alloc. That happens at every allocation.
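
For context, this is roughly what happens on every allocation when the "vo_bit" feature is enabled (a hedged sketch; in the real code this lives in the space's object-metadata initialization, and the exact signatures may differ):

    // Sketch: post_alloc sets the VO bit for the new object's address so that
    // conservative scanning can recognise it later. This is one atomic
    // side-metadata write per allocation: the per-allocation cost at issue.
    #[cfg(feature = "vo_bit")]
    fn initialize_object_metadata(object: ObjectReference) {
        crate::util::metadata::vo_bit::set_vo_bit(object); // call site simplified
    }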

> For most benchmarks, it is around 5% or less, but it seems visible. h2o and xalan are just extremely worse. Is this correlated with the number of GCs in the benchmark?

Running h2o at different heap sizes using SemiSpace and GenCopy: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-06-21-Wed-170232&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^hfac&

[images]

It looks like the mutator time is roughly constant, while both the number of GCs and the STW time decrease as the heap size goes up.

@qinsoon (Member) commented Jun 22, 2023

>> Looking at your results, it seems there is a slowdown for the mutator time in all the benchmarks for both VO bit strategies.
>
> The slowdown may come from the invocations of set_vo_bit in post_alloc. That happens at every allocation.

Yeah. Right.

You can run master with VO bits enabled, and see if the mutator time is slower for those two benchmarks. If they are also slower, then the problem is not introduced in this PR.

@wks (Collaborator Author) commented Jun 22, 2023

> You can run master with VO bits enabled, and see if the mutator time is slower for those two benchmarks. If they are also slower, then the problem is not introduced in this PR.

Plotty link: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2023-06-22-Thu-041357&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|70&build^SS-GenCopy-h2o|40&Histogram%20(with%20CI)^build^benchmark&

This experiment runs h2o and xalan with SemiSpace and GenCopy GC using both the master branch and this PR.

STW and Mutator time, normalised to "master no-VO" [image]

We can see that the h2o and xalan benchmarks had 1.7x the mutator time when using GenCopy, but the increase in mutator time is moderate when using SemiSpace. This phenomenon manifests on both the master branch and this PR, which means it is a pre-existing problem.

@qinsoon (Member) left a comment

Looks good. Just some minor issues.


    /// Select a strategy for the VM. It is a `const` function so it always returns the same strategy
    /// for a given VM.
    const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
Member:

We can use CopyFromMarkBits as default if the mark bit is on the side, as it is faster, and complies with the semantics in both cases. We only need to use ClearAndReconstruct if the mark bit is in the header.

Collaborator Author:

Done. I let it simply return CopyFromMarkBits for now, and mentioned in the comments when we may need to choose other strategies.

Member:

We should use ClearAndReconstruct when the mark bit is in the header. Our dummy VM tests crashed because they use a header mark bit, and validation failed for CopyFromMarkBits.
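
Putting these review comments together, the selection logic presumably ends up along these lines (a hedged sketch; the real function in mmtk-core may differ in detail):

    // Sketch: choose the strategy based on where the binding keeps the mark bit.
    const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
        // CopyFromMarkBits needs on-the-side mark bits to copy from. If the
        // binding keeps the mark bit in the object header (like the dummy VM
        // tests), fall back to ClearAndReconstruct.
        match VM::VMObjectModel::LOCAL_MARK_BIT_SPEC.as_spec() {
            MetadataSpec::OnSide(_) => VOBitUpdateStrategy::CopyFromMarkBits,
            MetadataSpec::InHeader(_) => VOBitUpdateStrategy::ClearAndReconstruct,
        }
    }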

@qinsoon qinsoon added the PR-testing Run binding tests for the pull request (deprecated: use PR-extended-testing instead) label Jun 23, 2023
@k-sareen (Collaborator):
So the gist is that our metadata access costs too much? Hm.

@wks (Collaborator Author) commented Jun 23, 2023

> So the gist is that our metadata access costs too much? Hm.

Our current implementation of metadata access is sub-optimal, but there is much room for improvement. See: #840

Labels: PR-testing (Run binding tests for the pull request; deprecated: use PR-extended-testing instead)

Successfully merging this pull request may close these issues: Fix valid object (VO) bits for ImmixSpace

3 participants