
Some refactoring of thread suspension. #34666

Merged: 18 commits merged into dotnet:master from the suspend branch on Oct 27, 2020
Conversation

@VSadov (Member) commented Apr 8, 2020

The original goal was to understand the suspension code better and to add comments at tricky points.
There were a few places that could benefit from some refactoring. In fact, there were already comments in a few places recommending changes.

The major part here is merging 2 nearly identical suspend/hijack/redirect loops into one loop and changing it a bit to avoid unnecessary work. In particular, when suspension of a thread is already in progress, we were too eager to retry without giving the thread enough chances to park itself.

There are some minor cleanups as well, like removing code related to fibers.
I also added a few comments in places where the purpose or invariants of the code were not obvious.

The end effect on suspension latencies is:

  • some improvement in scenarios where user threads aggressively poll for GC suspension, and
  • substantial improvement when threads are not polling (up to 4X faster).

Real-world cases will be some mix of the above, so results will vary, but should improve overall.
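
To illustrate the retry change, here is a minimal C++ sketch with hypothetical helper names standing in for the real suspension machinery (this is not the actual CoreCLR code): after updating hijacks/redirects once, the loop keeps observing as long as it sees progress, and only redoes the heavy work when progress stalls.

    #include <cstddef>

    // Hypothetical helpers standing in for the real suspension machinery.
    size_t CountThreadsNotYetParked();
    void   UpdateHijacksAndRedirects();
    void   WaitBriefly();

    void SuspendLoopSketch()
    {
        UpdateHijacksAndRedirects();                 // heavy work, done once up front
        size_t remaining = CountThreadsNotYetParked();

        while (remaining != 0)
        {
            size_t nowRemaining = CountThreadsNotYetParked();
            if (nowRemaining < remaining)
            {
                // Progress: threads are parking themselves; just keep observing.
                remaining = nowRemaining;
                continue;
            }

            // No progress: give threads a chance to reach a safe point,
            // then refresh hijacks/redirects and observe again.
            WaitBriefly();
            UpdateHijacksAndRedirects();
            remaining = CountThreadsNotYetParked();
        }
    }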

@ghost commented Apr 8, 2020

Tagging @ViktorHofer as an area owner

@davidwrighton (Member) commented:

@VSadov are you planning to do anything with this?

@VSadov (Member, Author) commented Oct 8, 2020

I did this partially for my own education - to see why suspension is so complex for what it is doing. There was indeed some room to make things simpler, and I made some changes. I think it could be simplified a bit more.
There was some improvement in GC pauses as well, but not huge, so I started having doubts about whether we would take this kind of change, especially so close to shipping.

Let me dust it off and measure again if there are gains.

@VSadov force-pushed the suspend branch 2 times, most recently from 2b537d5 to ce94005 on October 12, 2020 14:55
@VSadov (Member, Author) commented Oct 12, 2020

The test program:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

class Program
{
    static object[] arr1 = new Exception[10000];

    enum E1
    {
        a 
    }

    static object box = E1.a;

    static void DoSomeWork()
    {
        var a1 = arr1;

        ArgumentNullException val = new ArgumentNullException();
        for (; ; )
        {
            Assign(a1, 1, val);
            Assign(a1, 2, val);
            Assign(a1, 3, val);
            Assign(a1, 4, val);
            Assign(a1, 5, val);
            Assign(a1, 6, val);
            Assign(a1, 7, val);
        }
    }

    static bool doPoll;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Assign(object[] a, int i, ArgumentNullException val)
    {
        if (val != null)
        {
            a[i] = val;
        }

        if (doPoll)
        {
            // enum->int  unbox will perform a GC-polling FCALL   (easy case for suspension)
            int unboxed = (int)box;
        }

        a[i] = null;
    }

    static void Main()
    {
        for (int i = 0; i < Environment.ProcessorCount-1; i++)
        {
            new Thread(() =>
            {
                DoSomeWork();
            }).Start();
        }

        Thread.Sleep(10);

        for (; ; )
        {
            long longest = 0;
            int iterations = 0;
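            // Toggle whether worker threads execute the GC-polling FCALL in Assign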
            doPoll ^= true;

            // do 5sec of suspension/resume
            var sw = Stopwatch.StartNew();
            while (sw.ElapsedMilliseconds < 5000)
            {
                long oneIter = sw.ElapsedMilliseconds;

                // Suspend/Resume EE
                GC.GetTotalAllocatedBytes(precise:true);

                oneIter = sw.ElapsedMilliseconds - oneIter;

                longest = Math.Max(longest, oneIter);
                iterations++;

                // Let other threads actually resume
                Thread.Yield();
            }
            sw.Stop();

            {
                Console.WriteLine($"frequent GC-polling        : {doPoll}");
                Console.WriteLine($"#suspensions in 5 sec.     : {iterations}");
                Console.WriteLine($"Average suspension took    : {(double)sw.ElapsedMilliseconds / iterations}");
                Console.WriteLine($"longest suspend  was (msec): {longest}");
                Console.WriteLine();
            }
        }
    }
}

@VSadov changed the title from "[WIP] thread suspension" to "Some refactoring of thread suspension." on Oct 12, 2020
@VSadov (Member, Author) commented Oct 12, 2020

Regular desktop (i7-8700, 12 logical cores, Windows 10)

The numbers vary somewhat, especially "longest suspend", but these are typical:

=== baseline (5.0)

frequent GC-polling : False
#suspensions in 5 sec. : 543
Average suspension took : 9.250460405156538
longest suspend was (msec): 62

frequent GC-polling : True
#suspensions in 5 sec. : 21179
Average suspension took : 0.23608291231880638
longest suspend was (msec): 22

frequent GC-polling : False
#suspensions in 5 sec. : 597
Average suspension took : 8.393634840871021
longest suspend was (msec): 63

frequent GC-polling : True
#suspensions in 5 sec. : 25457
Average suspension took : 0.19640963192834976
longest suspend was (msec): 22

=== after the change

frequent GC-polling : False
#suspensions in 5 sec. : 2008
Average suspension took : 2.4955179282868527
longest suspend was (msec): 62

frequent GC-polling : True
#suspensions in 5 sec. : 32085
Average suspension took : 0.15596072931276297
longest suspend was (msec): 17

frequent GC-polling : False
#suspensions in 5 sec. : 1915
Average suspension took : 2.613054830287206
longest suspend was (msec): 47

frequent GC-polling : True
#suspensions in 5 sec. : 48662
Average suspension took : 0.10274957872672721
longest suspend was (msec): 15

@VSadov (Member, Author) commented Oct 12, 2020

Azure Standard_F16s_v2 VM (16 virtual cores, Ubuntu Server 16.04 LTS)

=== baseline (5.0)

frequent GC-polling : False
#suspensions in 5 sec. : 6575
Average suspension took : 0.7604562737642585
longest suspend was (msec): 7

frequent GC-polling : True
#suspensions in 5 sec. : 33631
Average suspension took : 0.148672355862151
longest suspend was (msec): 7

=== after the change

frequent GC-polling : False
#suspensions in 5 sec. : 22560
Average suspension took : 0.22163120567375885
longest suspend was (msec): 7

frequent GC-polling : True
#suspensions in 5 sec. : 34491
Average suspension took : 0.14496535328056595
longest suspend was (msec): 9

@VSadov (Member, Author) commented Oct 12, 2020

Server (AMD Rome, 256 logical cores, 64 per NUMA node, Windows Server)
The test uses the default config and thus runs on a single NUMA node (64 cores).

=== baseline (5.0)

frequent GC-polling : False
#suspensions in 5 sec. : 314
Average suspension took : 15.945859872611464
longest suspend was (msec): 63

frequent GC-polling : True
#suspensions in 5 sec. : 2173
Average suspension took : 2.300966405890474
longest suspend was (msec): 3

=== after the change

frequent GC-polling : False
#suspensions in 5 sec. : 1647
Average suspension took : 3.036429872495446
longest suspend was (msec): 37

frequent GC-polling : True
#suspensions in 5 sec. : 3838
Average suspension took : 1.3027618551328817
longest suspend was (msec): 2

@VSadov marked this pull request as ready for review on October 12, 2020 18:45
@davidwrighton (Member) commented:

  1. The performance numbers here are pretty good. Particularly compelling are the improvements in the frequent GC-polling scenarios. I agree we would not have taken this as .NET 5 is winding down, but as a piece of .NET 6 work, I think this meets the bar.
  2. This looks pretty hard to review for correctness. I look at this, and there is a LOT of removal of volatile loads/stores, and I think we should be pretty careful here. If you want me to review this for correctness, please schedule a meeting or several so we can go through the change line by line.

@jkotas (Member) left a comment

I like the change, but it needs very careful scrutiny. I fully agree with David's comment.

@@ -4016,9 +4016,6 @@ BasicBlock* Compiler::fgCreateGCPoll(GCPollType pollType, BasicBlock* block)
value = gtNewIndOfIconHandleNode(TYP_INT, (size_t)addrTrap, GTF_ICON_GLOBAL_PTR, false);
}

// Treat the reading of g_TrapReturningThreads as volatile.
@jkotas (Member) Oct 12, 2020

Does this need to be marked as non-CSEable at least? #Resolved

@VSadov (Member, Author) Oct 12, 2020

non-CSEable would be good to have. I thought this was injected by the JIT after optimizations.

In C++ we use compiler-only fences to make sure the read gets emitted in program order (the only guarantee that we need on the read side).
Some equivalent would be sufficient in the JIT code as well.
#Resolved
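
For illustration, a minimal sketch of the C++ read-side pattern described above, assuming a plain global flag (the real CoreCLR macros and types may differ): a volatile-qualified load constrains only the compiler, keeping the read in program order without emitting any hardware barrier.

    extern long g_TrapReturningThreads;   // flag name from the diff above

    inline bool TrapReturningThreads()
    {
        // Volatile-qualified access: the compiler must emit this load in
        // program order and cannot cache or CSE it; no CPU fence is generated.
        volatile long* pTrap = &g_TrapReturningThreads;
        return *pTrap != 0;
    }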

@VSadov (Member, Author)

Added GTF_DONT_CSE

@@ -6107,14 +5869,10 @@ void ThreadSuspend::SuspendEE(SUSPEND_REASON reason)
// I wonder how much confusion this has caused....)
// It seems like much of the above is redundant. We should investigate reducing the number
// of mechanisms we use to indicate that a suspension is in progress.
//
// +1
@jkotas (Member) Oct 12, 2020

Unintentional change? #Resolved

@VSadov (Member, Author) Oct 12, 2020

I was agreeing with the comment, but decided not to deal with that in this change and forgot to remove it.
#Resolved

{
if (thread->HasThreadState(Thread::TS_GCSuspendPending))
for (Thread *thread = ThreadStore::GetThreadList(NULL); thread != NULL; thread = ThreadStore::GetThreadList(thread))
@jkotas (Member) Oct 12, 2020

We use the while loop pattern in all places that iterate over the threadstore today. I do not see value in changing to a for loop in some of the places. #Resolved

@VSadov (Member, Author) Oct 15, 2020
Reverted this and 2 other places back to:

    Thread* thread = NULL;
    while ((thread = ThreadStore::GetThreadList(thread)) != NULL)
    {
 
    }


@VSadov (Member, Author)

The for syntax could be a bit more robust for loops like this, since it scopes the iteration variable (thread here) to the loop body. That makes it a bit easier to track who changes it and who uses it. It is a mild preference though.

@VSadov (Member, Author) Oct 16, 2020

Made it a while loop. #Resolved

//
// also unblock any thread waiting around for this thread to suspend. This prevents us from completely
// starving other suspension clients, such as the debugger, which we otherwise would do because of
// the priority we just established.
Member:

There are non-obvious interactions between thread suspend and the debugger. I am not sure whether this is safe to delete.

Either way, it may be a good idea to run debugger tests on this.

@VSadov (Member, Author) Oct 12, 2020

There is a single place where we wait on this event, and it has a very short timeout. It times out often, and the waiter should be able to handle timeouts. Signaling the event, on the other hand, is an indication that there was some progress (a thread has suspended) and could be more confusing, since here there was no progress made.

Signaling probably made more sense in the past when the suspension ping value was much longer.

I completely agree with the "debugger test" part. Should run some stress tests too.
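
A sketch of the waiter pattern being described, with hypothetical handle and timeout names: the wait uses a very short timeout and treats timing out as a normal outcome, so not signaling the event when no progress was made stays safe for the waiter.

    #include <windows.h>

    extern HANDLE g_hSuspendEvent;        // hypothetical: the event in question
    const DWORD SUSPEND_PING_MS = 1;      // hypothetical: the short ping timeout

    void WaitForThreadToPark()
    {
        DWORD res = WaitForSingleObject(g_hSuspendEvent, SUSPEND_PING_MS);
        if (res == WAIT_TIMEOUT)
        {
            // Common path: nothing was signaled within the ping interval.
            // Re-examine thread states and retry.
        }
        else if (res == WAIT_OBJECT_0)
        {
            // Signaled: a thread made progress toward suspension.
        }
    }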

@VSadov (Member, Author) commented Oct 12, 2020

Yes, I will schedule an in-person meeting. Some volatile removals (like those on locals that are not shared with anybody) may be easy to review.
Other variables may need explanation/comments and could use careful examination. Reviewing those is definitely welcome!! #Resolved

// return the current thread here, because we NULL it out after obtaining the thread store lock. </REVISIT>
// If we have just updated hijacks/redirects, then do a pass while only observing.
// Repeat observing only as long as we see progress. Most threads react to hijack/redirect very fast and
// typically we can avoid waiting on an event. (except on uniporocessor where we do not spin)
@danmoseley (Member) Oct 12, 2020

nit typo #Resolved

@VSadov (Member, Author) Oct 12, 2020

Read it like 10 times - where? #Resolved

@VSadov (Member, Author) Oct 12, 2020

Ah, unipOrocessor :-) #Resolved

@@ -404,26 +389,18 @@ Thread::SuspendThreadResult Thread::SuspendThread(BOOL fOneTryOnly, DWORD *pdwSu
}
#endif // _DEBUG

if (doSwitchToThread)
@VSadov (Member, Author) Oct 15, 2020

Code is simpler here primarily because the fibers support was removed. For example, without fibers doSwitchToThread would always be TRUE, so it is no longer needed, and so on...
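
A rough before/after sketch of what this simplification looks like (simplified, not the exact diff; the fiber check is hypothetical):

    // Before: with fiber support, switching to another thread was conditional.
    //
    //     BOOL doSwitchToThread = ...;        // FALSE only in fiber mode
    //     if (doSwitchToThread)
    //         __SwitchToThread(0, ++dwSwitchCount);
    //
    // After: fibers are gone, so doSwitchToThread is always TRUE and both the
    // flag and the branch can be deleted.
    //
    //     __SwitchToThread(0, ++dwSwitchCount);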

@@ -2054,6 +2004,7 @@ void ThreadSuspend::LockThreadStore(ThreadSuspend::SUSPEND_REASON reason)
// runtime threads for a GC that depends on the fact that we
// do an EnablePreemptiveGC and a DisablePreemptiveGC around
// taking this lock.
// besides, if the suspension is in progress, the lock is taken, we better be in premptive mode while blocked on this lock.
@VSadov (Member, Author) Oct 15, 2020

premptive typo #Resolved

@VSadov (Member, Author) commented Oct 25, 2020

Regarding trying this change with debugger tests: that was done, and it revealed a few regressions in stress scenarios, which were related to an incomplete removal of SWITCHOUT_HANDLE_VALUE.

After that was fixed, debugger tests do not indicate any regressions.

@jkotas jkotas added the tenet-performance Performance related issue label Oct 26, 2020
// Must let the profiler know that this thread is aborting its attempt at suspending
{
BEGIN_PIN_PROFILER(CORProfilerTrackSuspends());
g_profControlBlock.pProfInterface->RuntimeSuspendAborted();
Member:

With this new location of the suspend abort, it doesn't appear that it protects against a deadlock on the threadstore lock. Could you explain the reasoning here?

@VSadov (Member, Author) Oct 26, 2020

I do not think the suspension-yielding behavior was done to guarantee that GC suspension will preempt other suspensions, because it did not do that.
The thread that requests GC suspension is always in preemptive mode and would be ignored by the thread doing suspension. The GC suspender can signal its intentions by setting a global flag, but that could easily be too late for the current suspending thread to notice.

I believe the purpose is a mitigation for situations where we have a lot of non-GC suspensions. I am not sure how common/real such scenarios are, but they could potentially starve the GC, so we have this suspension-yielding behavior.
The other part of this mechanism is an event (s_hAbortEvt) that non-GC suspenders could be forced to wait on while GC suspension is trying to get the TSL. That is also racy - when we reset the event, we may already have non-GC suspenders waiting on the TSL and not affected by the event.

Basically - this mechanism clearly does not guarantee that any non-GC suspension will immediately yield to GC suspension. At best it can guarantee that if we have 100 non-GC suspenders, then an incoming GC suspension will gain higher priority and get ahead of most of them.
For such a purpose, checking once before starting the suspension loop is just as good as checking at each iteration (see the sketch after the list below).

I.e., we may allow one non-GC suspension that is already half-way or almost done, but we will deprioritize the others, and this should still serve the starvation-prevention purpose.

Doing the check at this point has a number of benefits though:

  • suspending the EE is a relatively heavy operation. It makes sense not to abort a suspension that is nearly done (and we can't abort one that is completely done anyway).
  • it reduces the number of states we have to deal with when restarting the EE after an aborted suspension - we would not have, for example, threads that are already redirected or hijacked and asynchronously moving toward suspension as we decide to resume the EE, or suspend again later.
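
A sketch of the check placement described above (the member name is taken from this discussion; the surrounding code is hypothetical, not the actual SuspendEE):

    class Thread;
    extern Thread* m_pThreadAttemptingSuspendForGC;  // name from the discussion

    void SuspendOtherThreadsSketch(Thread* pCurThread)
    {
        // Checked once, before any hijack/redirect work, instead of on
        // every iteration of the suspend loop.
        if (m_pThreadAttemptingSuspendForGC != nullptr &&
            m_pThreadAttemptingSuspendForGC != pCurThread)
        {
            // A GC suspension is pending; abort this non-GC suspension
            // attempt and let the GC go first.
            return;
        }

        // ... proceed with the single suspend/hijack/redirect loop;
        //     the abort check is not repeated inside it ...
    }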

Member:

The removed comment from the previous implementation discusses a deadlock, where the GC thread attempts to get the threadstore lock, but a separate suspended thread already holds the lock. Effectively, I think I'm asking: can that deadlock occur?

@VSadov (Member, Author) Oct 26, 2020

It looks like this code has had a complicated history, and some comments no longer match what the code does.
I assume the question is about:

            // When the debugger is synchronizing, trying to perform a GC could deadlock. The GC has the
            // threadstore lock and synchronization cannot complete until the debugger can get the
            // threadstore lock. However the GC can not complete until it sends the BeforeGarbageCollection
            // event, and the event can not be sent until the debugger is synchronized. In order to break
            // this deadlock cycle the GC must give up the threadstore lock, allow the debugger to synchronize,
            // then try again.

I do not think the scenario described in the comment exists.

The GC does not send a BeforeGarbageCollection event. At least I could not find where it does. Perhaps it did in the past.

Regardless, the debugger should not be synchronizing while the GC holds the lock.
I have added an assert, just in case, but I think it is not possible for a GC suspension to be happening while the debugger IsSynchronizing.

Right now, switching into IsSynchronizing mode happens when the debugger already owns the TSL, here:

m_trappingRuntimeThreads = TRUE;

It is done via an ordinary write, but it is done while holding the TSL, so that is OK. All places that handle this flag appear to hold the TSL as well, except when it is reset - not sure if we hold the TSL at that point, but contracts do not require that.

There is only one case when the GC could yield the suspension - when the debugger has threads in an inconsistent state, like stopped on a breakpoint. Then a thread may appear stopped on a safe point, but actually it is not.
This is the only case where everyone, including the GC, must retry suspending - since the suspension did not really succeed.
We still handle this scenario.
The state is indicated by g_pDebugInterface->ThreadsAtUnsafePlaces(), which is an interlocked-updated counter. We read the value after the EE is suspended, and thus we read a stable value.
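
A sketch of why such a read is stable (hypothetical names, not the actual debugger interface): the counter is only changed via interlocked operations on mutator threads, so once the EE is suspended no writer can run and a plain load returns a settled value.

    #include <windows.h>

    static volatile LONG s_threadsAtUnsafePlaces = 0;

    // Called by mutator threads while the EE is running.
    void EnterUnsafePlace() { InterlockedIncrement(&s_threadsAtUnsafePlaces); }
    void LeaveUnsafePlace() { InterlockedDecrement(&s_threadsAtUnsafePlaces); }

    // Called only after the EE is suspended; no writer can be running,
    // so an ordinary load observes a stable value.
    bool AnyThreadsAtUnsafePlaces()
    {
        return s_threadsAtUnsafePlaces != 0;
    }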

@VSadov (Member, Author) Oct 26, 2020

Regardless, debugger should not be synchronizing when GC holds the lock.

If the above is actually not true, we should check for that before doing the suspension loop - the same place where we check for GC vs. non-GC suspension collisions. At that point we hold the TSL, so the value will not change, but we have not done much work yet.
I do not think it is a case that we need to handle though.

@VSadov (Member, Author):

Imagine a context switch delaying the GC thread by XX time right before setting m_pThreadAttemptingSuspendForGC.
Since the GC thread is preemptive, the other thread could start, continue, finish the suspension, or even finish what it was going to do and resume the EE, if XX is long enough.

That is the case before or after the change.

Member:

The issue is that the thread that sets m_pThreadAttemptingSuspendForGC is in cooperative mode, not preemptive mode. When a GC is triggering a suspend, it is in cooperative mode (as it's in the process of manipulating GC data structures), and the LockThreadStore routine does not transition to preemptive mode when attempting to suspend for a GC. (See the toggleGC boolean in ThreadSuspend::LockThreadStore.) Thus there is a potential for deadlock.

The old logic had this abort routine, which would kick in at any point in time during suspend, and if it ever triggered, it would stop attempting to suspend. Effectively, this makes the whole threadstore lock a deadlock-detecting lock, with a policy favoring the GC thread suspension logic. There was a race here, but it is protected by the abort process being inside the loop, and being able to abort suspension whenever the thread-attempting-suspend status became true. Effectively, the check could race, but only for one extra iteration of the suspension loop.

Now, the previous logic would simply abort the suspend immediately and leave the various threads that it had managed in a state where they were suspended for some other purpose; then ThreadSuspend::SuspendEE will call RestartEE and then use a goto to exit the threadstore lock, and then hopefully the GC thread would come along and finish the suspend from its end. That logic all seems a bit sketchy to me, but it looks like it would have worked in the old approach. With your new code that moved the abort logic out of the loop, I don't see how this works.

@VSadov (Member, Author):

I think LockThreadStore does not transition to preemptive mode for GC threads because they are already supposed to be preemptive. Otherwise we could block on the TSL while in coop mode, which would deadlock.

Let me check. It may be worth adding asserts/contracts to be explicit about this.

@VSadov (Member, Author):

Actually, I have added a contract to ThreadStore::Enter() for exactly this purpose already; I just forgot.

Probably worth adding a comment to LockThreadStore on why we do not bother about GC threads.
Or better, not special-casing them in the toggleGC check at all.
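
A hypothetical form of that contract (the exact CoreCLR contract items and lock member are sketched, not quoted): a thread must be in preemptive mode before it blocks on the threadstore lock.

    void ThreadStore::Enter()
    {
        CONTRACTL
        {
            // Blocking here in cooperative mode could deadlock a suspension:
            // a coop-mode thread blocked on a native lock cannot be brought
            // to a safe point.
            PRECONDITION(GetThreadNULLOk() == NULL ||
                         !GetThreadNULLOk()->PreemptiveGCDisabled());
        }
        CONTRACTL_END;

        m_Crst.Enter();   // the underlying lock; member name is a sketch
    }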

Member:

We discussed this offline, and I'm convinced that the GC thread is preemptive when suspending.

// event, and the event can not be sent until the debugger is synchronized. In order to break
// this deadlock cycle the GC must give up the threadstore lock, allow the debugger to synchronize,
// then try again.
(g_pDebugInterface->ThreadsAtUnsafePlaces() || g_pDebugInterface->IsSynchronizing()))
@VSadov (Member, Author) Oct 27, 2020
Since the GC does call BeforeGarbageCollection, I am less sure that IsSynchronizing while suspending is an impossible state. I think I will just bring the IsSynchronizing check back.
It may be worth examining closely whether it is needed. However, we would do this check rarely (once per suspension) and only if a debugger is attached, so even if it is an impossible state, the check won't be a big deal.

Need to fix an ASSERT to account for the possibility of NULL.
@VSadov (Member, Author) commented Oct 27, 2020

Thanks!!!

@ghost commented Oct 27, 2020

Hello @VSadov!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit 73a7993 into dotnet:master Oct 27, 2020
@VSadov VSadov deleted the suspend branch October 27, 2020 22:58
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Labels: area-VM-coreclr, tenet-performance