Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new synchronization mechanism for DATAS #90726

Merged
merged 3 commits into from
Aug 31, 2023
Merged

new synchronization mechanism for DATAS #90726

merged 3 commits into from
Aug 31, 2023

Conversation

Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Aug 17, 2023

The current mechanism has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems. For example, when a thread gets here in gc_thread_function -

if (n_heaps <= heap_number)

if it's true it's supposed to wait. But its execution could be delayed so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count changes happens again and it requires this thread to participating. And now this thread does the comparison and discovers that it needs to wait so it goes idle and all other threads will just be waiting for this thread to join.

Another example is it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the threads participating will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while. Then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent which is used by SuspendEE/RestartEE which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idling easily - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" to "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

  1. We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count.

  2. We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes it means the order can change the invariant of the first region never being deleted no longer holds true and we can AV in this method.

I added an new method delay_free_segments to perform both tasks.

  1. The accounting of generation_free_list_space is slightly off for LOH which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size.

I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fired while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.

@ghost
Copy link

ghost commented Aug 17, 2023

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

the current one has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. this causes all sorts of problems. also the way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent.

the new mechanism is much easier to keep track of and we can track exactly how many threads are going idle and when they wake up so we can run till we know they are at a good place.

I've been running tests for a while now and there's still a problem which is an AV in background_delay_delete_uoh_segments - I don't believe this is related to my changes. but I'll make a fix so I can run stress for longer.

I'll add more detail to the description later.

Author: Maoni0
Assignees: -
Labels:

area-GC-coreclr

Milestone: -

@mrsharm mrsharm changed the title [WIP] new synchronization mechnism for DATAS [WIP] new synchronization mechanism for DATAS Aug 30, 2023
Copy link
Member

@mrsharm mrsharm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfwise tested via ASP.NET Benchmarks and there were no hangs / failures and regressions that stood out.

@Maoni0 Maoni0 changed the title [WIP] new synchronization mechanism for DATAS new synchronization mechanism for DATAS Aug 31, 2023
@@ -49461,13 +49649,14 @@ void gc_heap::do_pre_gc()
#ifdef TRACE_GC
size_t total_allocated_since_last_gc = get_total_allocated_since_last_gc();
#ifdef BACKGROUND_GC
dprintf (1, (ThreadStressLog::gcDetailedStartMsg(),
//dprintf (1, (ThreadStressLog::gcDetailedStartMsg(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code? Can we remove this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for noticing! unfortunately it did not show there were new messages when I looked at the issue (only after I merged already so I will address this in another PR).

@@ -823,6 +823,11 @@ class t_join
join_struct.r_join_lock = n_th;
}

int get_num_threads()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we inline this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is taken care of by the compiler.

@Maoni0 Maoni0 merged commit e4e95de into dotnet:main Aug 31, 2023
Maoni0 added a commit to Maoni0/runtime that referenced this pull request Sep 20, 2023
The current mechanism has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems. For example, when a thread gets here in gc_thread_function -

`if (n_heaps <= heap_number)`

if it's true it's supposed to wait. But its execution could be delayed so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count changes happens again and it requires this thread to participating. And now this thread does the comparison and discovers that it needs to wait so it goes idle and all other threads will just be waiting for this thread to join.

Another example is it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the threads participating will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while. Then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent which is used by SuspendEE/RestartEE which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idling easily - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" to "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count.

We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes it means the order can change the invariant of the first region never being deleted no longer holds true and we can AV in this method.

I added an new method delay_free_segments to perform both tasks.

The accounting of generation_free_list_space is slightly off for LOH which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size.
I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fired while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.
carlossanlop pushed a commit that referenced this pull request Sep 20, 2023
* new synchronization mechanism for DATAS (#90726)

The current mechanism has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems. For example, when a thread gets here in gc_thread_function -

`if (n_heaps <= heap_number)`

if it's true it's supposed to wait. But its execution could be delayed so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count changes happens again and it requires this thread to participating. And now this thread does the comparison and discovers that it needs to wait so it goes idle and all other threads will just be waiting for this thread to join.

Another example is it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the threads participating will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while. Then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent which is used by SuspendEE/RestartEE which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idling easily - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" to "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count.

We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes it means the order can change the invariant of the first region never being deleted no longer holds true and we can AV in this method.

I added an new method delay_free_segments to perform both tasks.

The accounting of generation_free_list_space is slightly off for LOH which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size.
I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fired while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.

* fixed problems with how sampling is done and how we suspend to change heap count in DATAS (#91712)

+ Moved the sample recording into when we are suspended. The way we were calculating the throughput cost was in check_heap_count (which is called right after we restart EE on heap0), we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So what can happen is the allocating threads are already accumulated more wait time which is attributed to this GC but it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which is calculating the tcp as (msl wait + whatever GC that was finished before this BGC) so obviously incorrect.

+ Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

1) changed the way we calculated the median

2) moved where this is calculated to again avoid timing issues

3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.

+ Changed when check_heap_count is called - the previous place is right after a suspension which does not help with spacing the suspension time out (it was "suspend for GC" then "immediately suspend to change heap count"). And it caused a problem with BGC which was it always tried to change heap count when it couldn't because BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this time out. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event due to a GC started, we change the heap count right before we do a GC. So this also helps with the BGC problem.

* gen0_bricks_cleared flag needs to be propagated when we change heap count (#90457)

when we change the heap count, in heap X we get a region from heap Y and the gen0_bricks_cleared flag from Y says false but heap X says true. So when we check the bricks on heap X, we assume it’s true but it’s not.

the fix is to detect if any heap has this flag as false and if so make all heaps’ flag false (tracking which region is moved from which other heap is something we need additional recording for and it’s not really worth doing just for this)

* a logging change

---------

Co-authored-by: Maoni0 <maoni@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Sep 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants