Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Porting SuspendAllThreads from the NativeAOT to CoreCLR. #101782

Merged
merged 26 commits into from
May 10, 2024

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented May 1, 2024

A step towards making EE suspension support similar between NativeAOT and CoreCLR and possibly eventually share.

The main goal of this change is to port SuspendAllThreads and ResumeAllThreads from the NativeAOT. More steps will follow on this path, but this looks like a good state to commit the changes.

The CoreCLR implementation of SuspendAllThreads is now roughly equivalent to the one in NativeAOT modulo different helpers to iterate threads or to figure if they are in coop mode.

This also introduces Thread::Hijack entry point through which SuspendAllThreads nudges threads into preempt mode. The implementation of Thread::Hijack is very different between runtimes right now. Unifying the design of Thread::Hijack will be the goal of further changes.

Introducing SuspendAllThreads pulled a thread of other required or good-to-have changes:

  • g_pGCSuspendEvent is now gone together with PING_JIT_TIMEOUT
    The timeout for the event was 1 millisecond, which is too long. What's worse is that it could still take up to 16 milliseconds, depending on OS, to timeout if threads need to be re-hijacked. We are looking for sub-millisecond timings for suspension. Spending 16 milliseconds per hijack iteration could introduce really bad outliers.

  • TS_GCSuspendPending thread state is removed. It was a redundant way to specify whether a coop thread needs to be stopped for GC. Having more than one way only brings confusion and concerns about ordering (what needs to be set/unset/checked, in what order).

  • We now trap threads on preempt->coop transition only.
    In addition to suspension trapping, CoreCLR has other reasons to trap threads (ThreadAbort, Debugger...). The checks for those conditions were done either on transition to coop or on transition from coop - for no good reason. A program will see the same number of forward and reverse coop transition, so trapping at either edge would work, except that preempt->coop must have a trap for GC purposes and coop->preemt does not.
    In short - this removes RareEnablePreemptiveGC, because having RareDisablePreemptiveGC is enough. Exit to preemptive mode is now an unconditional set of m_fPreemptiveGCDisabled.

  • trap flag for EE suspension is now a single bit that is set/unset atomically and also is the "source of truth" on whether coop threads need to be suspended.
    We still have the trap counter for scenarios like ThreadAbort as there could be multiple at a time (NativeAOT may need something similar if ThreadAbort is implemented). There could be only one suspension at a time though, thus it is possible and useful to have a single dedicated bit for that.

  • avoid posting new suspension signals when previous one is still in progress.
    This was done on Windows, but not on Unix. Redundant signals can be harmful on Unix too.

@VSadov
Copy link
Member Author

VSadov commented May 1, 2024

CC @tommcdon - I may need to run this through debugger tests. I'll contact you separately for that.


// exponential spinwait with an approximate time limit for waiting in microsecond range.
// when iteration == -1, only usecLimit is used
void SpinWait(int iteration, int usecLimit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this existing logic or a new mechanism? Changing the spinwait duration has shown to impact startup when DATAS is enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not new, it is copied from NativeAOT. It is not related to spinwaits used by GC in any way. Here we just need to keep the thread that performs the suspension busy for a few microseconds. System timers do not offer pauses with such granularity, not in a portable way, anyways.

Copy link
Member Author

@VSadov VSadov May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is going on here:

  • We set a flag telling threads to suspend themselves. (post a frame suitable for a stackwalk start and block on an event).
  • Most threads do some kind of allocations or calls into runtime, so they will notice the flag and suspend.
  • Some threads may not be doing that (could be computing something in a loop), so they need to be hijacked.

Hijack may either catch a thread in interruptible code (then we are done with the thread) or hijack the return address, so that when a thread returns from the current call, it will suspend itself.

Hijack typically leads to thread suspension, but not always (a thread may go deeper into call tree), so we might need to redo the hijack a few times. We should eventually "corner" the thread as hijacked return will be moving only lower in the call tree with every try.
Here is a conundrum though - after hijacking we must let the thread to run for a while, so it has a chance to observe what we did and suspend itself. We can't be too aggressive with this. If we keep interrupting the thread to check what is happening, nothing will happen. So after a hijack cycle we back off for a few microseconds and then check if we need to hijack again.

We will also increase the time we give to the thread with every iteration (to make sure we are not starving it with our interruptions), but up to a limit.

Timings here are derived from the general pause expectations - 1/60 second (16 msec) pause could be perceptible in 60fps animation, 1/15 second is certainly noticeable in interactive apps. But what we do here is just the suspension part, we need to leave most of that time for the GC to run. While we can't guarantee the upper bound, we strive for suspension to happen in sub-millisecond time.
Thus the pauses between retries are measured in microseconds.

@VSadov
Copy link
Member Author

VSadov commented May 2, 2024

For the debugger interaction this should work the same as before, unless I messed up something while moving code around.

@tommcdon
Copy link
Member

tommcdon commented May 2, 2024

Adding @kouvel to review IsInForbidSuspendForDebuggerRegion refactoring changes

@@ -328,14 +328,6 @@ bool Thread::IsGCSpecial()
return IsStateSet(TSF_IsGcSpecialThread);
}

bool Thread::CatchAtSafePoint()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was dead code already. Noone called this.

@@ -394,29 +393,6 @@ endif
retn 8
_CallJitEHFinallyHelper@8 ENDP

;-----------------------------------------------------------------------
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We stopped doing thread trapping on transitions from coop in JIT helpers a while ago.
It looks like x86 stubs were left behind and were still doing it.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks OK but given the complexity of thread suspension I wouldn't feel particularly confident that I would catch issues :)

Its possible that you are going to find subtle dependencies the debugger had on suspending during the coop->preempt transition during your testing but I am not aware of any explicit dependency.

if (ThreadStore::HoldingThreadStore(this))
// A thread that performs GC may switch modes inside GC and come here.
// We will not try suspending a thread that is responsible for the suspension.
if (this == ThreadSuspend::GetSuspensionThread())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.

Copy link
Member Author

@VSadov VSadov May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.

The difference between these two checks is that this one is quite understandable - a thread that performs suspension certainly does not want itself being blocked. Places that perform mode switches (like inside GC) could specialcase the suspending thread, but it is easier to just handle the scenario here.

The one below seems a bit more dangerous. That is a random thread holding TSL and trying to get into coop mode. If such scenario happens by accident and then the thread allocates, causes GC... etc, not sure what would happen.
I initially had an assert instead of a check, but the assert was hit in tests, so I changed it to a condition. I am not happy about that. Maybe there is no way around this and some scenarios must do it (very carefully), but it seems fragile.

You are right though - the next check that tests for TSL ownership subsumes the check for thread driving the suspension. I guess, I can remove the this == ThreadSuspend::GetSuspensionThread() check for now.

I will try to follow up separately from this PR and see if we can avoid holding TSL in coop mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine if you want to keep the check or convert it to an assert inside the TSL check. Either way comments might be nice to preserve the context you just described here in the code. Thanks!

@VSadov VSadov force-pushed the suspAllThrCore branch from a38b54a to 48e1c78 Compare May 5, 2024 01:42
@kouvel
Copy link
Member

kouvel commented May 6, 2024

Adding @kouvel to review IsInForbidSuspendForDebuggerRegion refactoring changes

The refactoring looks fine to me.

Copy link
Member

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and looks mostly similar to before, though I may have missed some subtleties.

@VSadov
Copy link
Member Author

VSadov commented May 10, 2024

Thanks!!

@VSadov VSadov merged commit 00a8973 into dotnet:main May 10, 2024
88 of 90 checks passed
@VSadov VSadov deleted the suspAllThrCore branch May 10, 2024 04:12
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
…1782)

* port suspension algo from NativeAOT

* PING_JIT_TIMEOUT gone

* CatchAtSafePoint is always opportunistic

* current

* removed RareEnablePreemptiveGC

* cleanup RareDisablePreemptiveGC

* fix for x86

* factored out Thread::Hijack

* fix build for non-x64 windows

* assert noone holds TSL into coop mode

* activation safety check is always for the current thread

* undo comment

* PulseGCMode should not check for CatchAtSafePointOpportunistic

* not disabling preempt while holding TSL

* tweak

* dead assert

* tweak RareDisablePreemptiveGC

* RareDisablePreemptiveGC avoid GetSuspensionThread()

* updated Thread::Hijack

* fix typo

* allow coop mode while holding TS lock

* Refactored into SuspendAllThreads/ResumeAllThreads

* SetThreadTrapForSuspension

* deleted TS_GCSuspendPending

* tweaks

* PR feedback
@github-actions github-actions bot locked and limited conversation to collaborators Jun 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants