Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono][interp] Reduce false pinning from interp stack #100400

Merged
merged 3 commits into from
Apr 5, 2024

Conversation

BrzVlad
Copy link
Member

@BrzVlad BrzVlad commented Mar 28, 2024

Interpreter opcodes operate on the interp stack, an area of memory separately allocated. Each interp var will have an allocated stack offset in the current interpreter stack frame. When we allocate the storage for an interp var we can take into account the var type. If the type can represent a potential ref to an object or an interior ref then we mark the pointer slot as potentially containing refs, for the method that is being compiled.

During GC, we used to conservatively scan the entire interp stack space used by each thread. After this change, in the first stage, we do a stack walkwhere we detect slots in each interp frame where no refs can reside. We mark these slots in a bit array. Afterwards we conservatively scan the interp stack of the thread, while ignoring slots that were previously marked as not containing any refs.

System.Runtime.Tests suite was used for testing the effectiveness of the change, by computing the cumulative number of pinned objects throughout all GCs (about 1100).
minijit - avg 702000 pinned objects
old-interp - avg 641000 pinned objects
precise-interp - avg 578000 pinned objects

This resulted in 10% reduction in the number of pinned objects during collection. This change is meant to reduce memory usage of apps by making objects die earlier. We could further improve by being more precise. For example, for call sites we could reuse liveness information to precisely know which slots actually contain refs. This is a bit more complex to implement and it is unclear yet how impactful it would be.

Comment on lines +8569 to +8576

int slot_index = 0;
for (gpointer *p = (gpointer*)context->stack_start; p < (gpointer*)context->stack_pointer; p++) {
if (context->no_ref_slots && (context->no_ref_slots [slot_index / 8] & (1 << (slot_index % 8))))
;// This slot is marked as no ref, we don't scan it
else
func (p, gc_data);
slot_index++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just summarizing so I understand how this all works.

The way this works is that sgen calls interp_mark_stack twice: once during the conservative phase and once during the precise phase (if the precise phase is enabled). This code runs during the conservative phase. The no_ref_slots bits are conservative: we're either sure there's definitely no managed pointer in the slot, or we're not sure what's in the slot. This is because sometimes we push managed pointers with MONO_TYPE_I.

So if we're sure the slot isn't a pointer, we don't scan it at all. otherwise if we're not sure - we scan it, and possibly create false pinning.

So it's not important that this PR precisely tracks every single managed pointer opcode. But if we see a slot that can't possibly contain a managed pointer, we may mark it and potentially avoid some false pinning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @cshung FYI

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add for clarity that the ambiguity of MONO_TYPE_I is not really that relevant to the pinning story. Let's say we have two vars, one int32 and one of type object that are allocated at the same offset, since the liveness doesn't interesect. Because we mark these ref bits for the whole scope of the method, the offset will be marked as potentially containing a ref. At the moment of suspend, in theory, we could tell exactly whether the int32 var or the object var is alive at the moment of execution, or maybe none of them. This would probably reduce most of the remaining false pinning.

@lambdageek lambdageek requested a review from cshung April 2, 2024 23:21
@@ -412,6 +412,8 @@ get_context (void)
if (context == NULL) {
context = g_new0 (ThreadContext, 1);
context->stack_start = (guchar*)mono_valloc_aligned (INTERP_STACK_SIZE, MINT_STACK_ALIGNMENT, MONO_MMAP_READ | MONO_MMAP_WRITE, MONO_MEM_ACCOUNT_INTERP_STACK);
// A bit for every pointer sized slot in the stack. FIXME don't allocate whole bit array
context->no_ref_slots = (guchar*)mono_valloc (NULL, INTERP_STACK_SIZE / (8 * sizeof (gpointer)), MONO_MMAP_READ | MONO_MMAP_WRITE, MONO_MEM_ACCOUNT_INTERP_STACK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is INTERP_STACK_SIZE the max stack size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is 1MB statically allocated and never changed / increased.

gsize global_slot_index = current - (gpointer*)context->stack_start;
gsize table_index = global_slot_index / 8;
int bit_index = global_slot_index % 8;
context->no_ref_slots [table_index] |= 1 << bit_index;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use mono_bitset_set_fast here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but the use case was simple enough that I went with the manual implementation.

src/mono/mono/mini/interp/transform.c Outdated Show resolved Hide resolved
guint32 old_size = td->ref_slots ? (guint32)td->ref_slots->size : 0;
guint32 new_size = old_size ? old_size * 2 : 32;

gpointer mem = mono_mempool_alloc0 (td->mempool, mono_bitset_alloc_size (new_size, 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you decide whether to use mono_mem_manager_alloc0 or mono_mem_manager_alloc0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

td->mempool is a mempool used by interpreter compiler for any temporary data used during compilation. The mempool is destroyed once method is finished compiling. mono_mem_manager_alloc0 is used for memory used during execution, so it remains alive for the entire duration of the application.

BrzVlad added 3 commits April 4, 2024 11:46
Interpreter opcodes operate on the interp stack, an area of memory separately allocated. Each interp var will have an allocated stack offset in the current interpreter stack frame. When we allocate the storage for an interp var we can take into account the var type. If the type can represent a potential ref to an object or an interior ref then we mark the pointer slot as potentially containing refs, for the method that is being compiled.

During GC, we used to conservatively scan the entire interp stack space used by each thread. After this change, in the first stage, we do a stack walkwhere we detect slots in each interp frame where no refs can reside. We mark these slots in a bit array. Afterwards we conservatively scan the interp stack of the thread, while ignoring slots that were previously marked as not containing any refs.

System.Runtime.Tests suite was used for testing the effectiveness of the change, by computing the cumulative number of pinned objects throughout all GCs (about 1100).
minijit		- avg 702000 pinned objects
old-interp	- avg 641000 pinned objects
precise-interp	- avg 578000 pinned objects

This resulted in 10% reduction in the number of pinned objects during collection. This change is meant to reduce memory usage of apps by making objects die earlier. We could further improve by being more precise. For example, for call sites we could reuse liveness information to precisely know which slots actually contain refs. This is a bit more complex to implement and it is unclear yet how impactful it would be.
A lot of times, when we were pushing a byref type on the stack during compilation, we would first get the mint_type which would be MINT_TYPE_I4/I8. From the mint_type we would then obtain the STACK_TYPE_I4/I8, losing information because it should have been STACK_TYPE_MP. Because of this, the underlying interp var would end up being created as MONO_TYPE_I4/I8 instead of MONO_TYPE_I. Add another method for pushing directly a MonoType, with less confusing indirections. Code around here could further be refactored. This is only relevant for GC stack scanning, since we would want to scan only slots containing MONO_TYPE_I.
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
* [mono][interp] Reduce false pinning from interp stack

Interpreter opcodes operate on the interp stack, an area of memory separately allocated. Each interp var will have an allocated stack offset in the current interpreter stack frame. When we allocate the storage for an interp var we can take into account the var type. If the type can represent a potential ref to an object or an interior ref then we mark the pointer slot as potentially containing refs, for the method that is being compiled.

During GC, we used to conservatively scan the entire interp stack space used by each thread. After this change, in the first stage, we do a stack walkwhere we detect slots in each interp frame where no refs can reside. We mark these slots in a bit array. Afterwards we conservatively scan the interp stack of the thread, while ignoring slots that were previously marked as not containing any refs.

System.Runtime.Tests suite was used for testing the effectiveness of the change, by computing the cumulative number of pinned objects throughout all GCs (about 1100).
minijit		- avg 702000 pinned objects
old-interp	- avg 641000 pinned objects
precise-interp	- avg 578000 pinned objects

This resulted in 10% reduction in the number of pinned objects during collection. This change is meant to reduce memory usage of apps by making objects die earlier. We could further improve by being more precise. For example, for call sites we could reuse liveness information to precisely know which slots actually contain refs. This is a bit more complex to implement and it is unclear yet how impactful it would be.

* [mono][interp] Add option to disable precise scanning of stack

* [mono][interp] Fix pushing of byrefs on execution stack

A lot of times, when we were pushing a byref type on the stack during compilation, we would first get the mint_type which would be MINT_TYPE_I4/I8. From the mint_type we would then obtain the STACK_TYPE_I4/I8, losing information because it should have been STACK_TYPE_MP. Because of this, the underlying interp var would end up being created as MONO_TYPE_I4/I8 instead of MONO_TYPE_I. Add another method for pushing directly a MonoType, with less confusing indirections. Code around here could further be refactored. This is only relevant for GC stack scanning, since we would want to scan only slots containing MONO_TYPE_I.
@github-actions github-actions bot locked and limited conversation to collaborators May 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants