Porting SuspendAllThreads from the NativeAOT to CoreCLR.
#101782
CC @tommcdon - I may need to run this through debugger tests. I'll contact you separately for that.
// exponential spinwait with an approximate time limit for waiting in microsecond range.
// when iteration == -1, only usecLimit is used
void SpinWait(int iteration, int usecLimit)
Is this existing logic or a new mechanism? Changing the spinwait duration has been shown to impact startup when DATAS is enabled.
This is not new, it is copied from NativeAOT. It is not related to spinwaits used by GC in any way. Here we just need to keep the thread that performs the suspension busy for a few microseconds. System timers do not offer pauses with such granularity, not in a portable way, anyways.
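For readers who want a picture of what such a helper can look like, here is a minimal sketch of an exponential spin with an approximate time cap. It is not the code from this PR: `SpinWaitSketch`, `CpuRelaxSketch` and the use of `std::chrono::steady_clock` are assumptions made for illustration, and a runtime would issue a pause/yield instruction instead of bumping a counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a CPU pause hint; a runtime would use a
// pause/yield instruction here instead of touching memory.
static std::atomic<unsigned> g_spinSink{0};
inline void CpuRelaxSketch() { g_spinSink.fetch_add(1, std::memory_order_relaxed); }

// Spins in rounds of 1, 2, 4, ... pause hints and stops once the round
// budget is used up or the time limit has passed, whichever comes first.
// iteration == -1 selects the maximum round count, so in practice only
// usecLimit applies. Returns the number of rounds completed (illustration only).
int SpinWaitSketch(int iteration, int usecLimit)
{
    assert(usecLimit >= 0);
    using Clock = std::chrono::steady_clock;
    const auto deadline = Clock::now() + std::chrono::microseconds(usecLimit);

    // Cap the exponent so that (1 << i) below cannot overflow.
    const int rounds = (iteration >= 0) ? std::min(iteration, 30) : 30;

    int completed = 0;
    for (int i = 0; i < rounds; i++)
    {
        for (int j = 0; j < (1 << i); j++)
            CpuRelaxSketch();

        completed++;

        // The clock is read once per round, so the limit is approximate.
        if (Clock::now() > deadline)
            break;
    }
    return completed;
}
```

Reading the clock only between rounds keeps the cost of the time check small relative to the spinning, at the price of overshooting the limit by up to one round.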
What is going on here:
- We set a flag telling threads to suspend themselves. (post a frame suitable for a stackwalk start and block on an event).
- Most threads do some kind of allocations or calls into runtime, so they will notice the flag and suspend.
- Some threads may not be doing that (could be computing something in a loop), so they need to be hijacked.
Hijack may either catch a thread in interruptible code (then we are done with the thread) or hijack the return address, so that when a thread returns from the current call, it will suspend itself.
Hijack typically leads to thread suspension, but not always (a thread may go deeper into call tree), so we might need to redo the hijack a few times. We should eventually "corner" the thread as hijacked return will be moving only lower in the call tree with every try.
Here is a conundrum though - after hijacking we must let the thread run for a while, so it has a chance to observe what we did and suspend itself. We can't be too aggressive with this. If we keep interrupting the thread to check what is happening, nothing will happen. So after a hijack cycle we back off for a few microseconds and then check if we need to hijack again.
We will also increase the time we give to the thread with every iteration (to make sure we are not starving it with our interruptions), but up to a limit.
Timings here are derived from the general pause expectations - 1/60 second (16 msec) pause could be perceptible in 60fps animation, 1/15 second is certainly noticeable in interactive apps. But what we do here is just the suspension part, we need to leave most of that time for the GC to run. While we can't guarantee the upper bound, we strive for suspension to happen in sub-millisecond time.
Thus the pauses between retries are measured in microseconds.
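The loop described above can be sketched as a small deterministic simulation. This is only an illustration under stated assumptions, not the PR's implementation: `SimThread`, `SuspendStats`, `SuspendAllThreadsSketch`, the 1 usec starting pause and the doubling policy are all hypothetical, and the backoff is counted instead of actually waited.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct SimThread
{
    bool pollsRuntime;      // allocates or calls into the runtime, so it sees the flag itself
    int  hijacksNeeded;     // hijack attempts before the thread is finally cornered
    bool suspended = false;
};

struct SuspendStats
{
    int retries = 0;          // how many times we had to back off and look again
    int totalBackoffUsec = 0; // simulated time given to the threads between looks
};

SuspendStats SuspendAllThreadsSketch(std::vector<SimThread>& threads, int maxBackoffUsec)
{
    assert(maxBackoffUsec >= 1);
    SuspendStats stats;
    const bool trapFlag = true; // step 1: ask threads to suspend themselves
    int backoffUsec = 1;        // assumed starting pause

    for (;;)
    {
        bool allSuspended = true;
        for (SimThread& t : threads)
        {
            if (t.suspended)
                continue;

            if ((trapFlag && t.pollsRuntime) || t.hijacksNeeded == 0)
            {
                // The thread noticed the flag on its own, or an earlier hijack caught it.
                t.suspended = true;
                continue;
            }

            // Still running: hijack (again) and give it time to react.
            t.hijacksNeeded--;
            allSuspended = false;
        }

        if (allSuspended)
            return stats;

        // Back off, growing the pause up to a limit; a runtime would spin-wait here.
        stats.retries++;
        stats.totalBackoffUsec += backoffUsec;
        backoffUsec = std::min(backoffUsec * 2, maxBackoffUsec);
    }
}
```

With one polling thread and one thread that needs three hijack attempts, the sketch backs off three times for 1 + 2 + 4 = 7 simulated microseconds before everything is suspended.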
For the debugger interaction this should work the same as before, unless I messed up something while moving code around.
Adding @kouvel to review IsInForbidSuspendForDebuggerRegion refactoring changes
@@ -328,14 +328,6 @@ bool Thread::IsGCSpecial()
return IsStateSet(TSF_IsGcSpecialThread);
}

bool Thread::CatchAtSafePoint()
This was dead code already. No one called this.
@@ -394,29 +393,6 @@ endif
retn 8
_CallJitEHFinallyHelper@8 ENDP

;-----------------------------------------------------------------------
We stopped doing thread trapping on transitions from coop in JIT helpers a while ago.
It looks like x86 stubs were left behind and were still doing it.
I think this looks OK but given the complexity of thread suspension I wouldn't feel particularly confident that I would catch issues :)
It's possible that you are going to find subtle dependencies the debugger had on suspending during the coop->preempt transition during your testing, but I am not aware of any explicit dependency.
src/coreclr/vm/threadsuspend.cpp
Outdated
if (ThreadStore::HoldingThreadStore(this))
// A thread that performs GC may switch modes inside GC and come here.
// We will not try suspending a thread that is responsible for the suspension.
if (this == ThreadSuspend::GetSuspensionThread())
Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.
> Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.
The difference between these two checks is that this one is quite understandable - a thread that performs suspension certainly does not want to be blocked itself. Places that perform mode switches (like inside GC) could special-case the suspending thread, but it is easier to just handle the scenario here.
The one below seems a bit more dangerous. That is a random thread holding TSL and trying to get into coop mode. If such a scenario happens by accident and then the thread allocates, causes GC... etc, not sure what would happen.
I initially had an assert instead of a check, but the assert was hit in tests, so I changed it to a condition. I am not happy about that. Maybe there is no way around this and some scenarios must do it (very carefully), but it seems fragile.
You are right though - the next check that tests for TSL ownership subsumes the check for the thread driving the suspension. I guess I can remove the `this == ThreadSuspend::GetSuspensionThread()` check for now.
I will try to follow up separately from this PR and see if we can avoid holding TSL in coop mode.
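To make the "subsumes" point concrete, here is a toy model of the two checks. It assumes the suspending thread takes the TSL before it becomes the suspension thread; `RuntimeStateSketch`, `BeginSuspension` and `ShouldBlockOnEnteringCoop` are hypothetical names for illustration, not runtime APIs.

```cpp
#include <cassert>

using ThreadId = int; // hypothetical thread identity; 0 means "no thread"

struct RuntimeStateSketch
{
    ThreadId threadStoreLockHolder = 0;
    ThreadId suspensionThread = 0;
};

// Assumption: the lock is taken first, then the thread becomes the suspension
// thread. That ordering is what lets the lock check cover the other check.
void BeginSuspension(RuntimeStateSketch& rt, ThreadId self)
{
    rt.threadStoreLockHolder = self;
    rt.suspensionThread = self;
}

// Decides whether a thread entering coop mode should stop for the suspension.
bool ShouldBlockOnEnteringCoop(const RuntimeStateSketch& rt, ThreadId self, bool suspensionRequested)
{
    // Any lock holder is let through, which includes the suspending thread.
    if (rt.threadStoreLockHolder == self)
        return false;

    // Past the lock check, the current thread cannot be the suspending one.
    assert(rt.suspensionThread != self);
    return suspensionRequested;
}
```

The single lock check lets through both the suspending thread and any other thread that happens to hold the lock, which is exactly the broader (and more fragile) case discussed above.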
I'm fine if you want to keep the check or convert it to an assert inside the TSL check. Either way comments might be nice to preserve the context you just described here in the code. Thanks!
The refactoring looks fine to me.
LGTM, and looks mostly similar to before, though I may have missed some subtleties.
Thanks!!
…1782)
* port suspension algo from NativeAOT
* PING_JIT_TIMEOUT gone
* CatchAtSafePoint is always opportunistic
* current
* removed RareEnablePreemptiveGC
* cleanup RareDisablePreemptiveGC
* fix for x86
* factored out Thread::Hijack
* fix build for non-x64 windows
* assert noone holds TSL into coop mode
* activation safety check is always for the current thread
* undo comment
* PulseGCMode should not check for CatchAtSafePointOpportunistic
* not disabling preempt while holding TSL
* tweak
* dead assert
* tweak RareDisablePreemptiveGC
* RareDisablePreemptiveGC avoid GetSuspensionThread()
* updated Thread::Hijack
* fix typo
* allow coop mode while holding TS lock
* Refactored into SuspendAllThreads/ResumeAllThreads
* SetThreadTrapForSuspension
* deleted TS_GCSuspendPending
* tweaks
* PR feedback
A step towards making EE suspension support similar between NativeAOT and CoreCLR and possibly eventually shared.
The main goal of this change is to port `SuspendAllThreads` and `ResumeAllThreads` from the NativeAOT. More steps will follow on this path, but this looks like a good state to commit the changes.
The CoreCLR implementation of `SuspendAllThreads` is now roughly equivalent to the one in NativeAOT modulo different helpers to iterate threads or to figure if they are in coop mode.
This also introduces `Thread::Hijack` entry point through which `SuspendAllThreads` nudges threads into preempt mode. The implementation of `Thread::Hijack` is very different between runtimes right now. Unifying the design of `Thread::Hijack` will be the goal of further changes.
Introducing `SuspendAllThreads` pulled a thread of other required or good-to-have changes:
- `g_pGCSuspendEvent` is now gone together with `PING_JIT_TIMEOUT`.
The timeout for the event was 1 millisecond, which is too long. What's worse is that it could still take up to 16 milliseconds, depending on OS, to time out if threads need to be re-hijacked. We are looking for sub-millisecond timings for suspension. Spending 16 milliseconds per hijack iteration could introduce really bad outliers.
- `TS_GCSuspendPending` thread state is removed. It was a redundant way to specify whether a coop thread needs to be stopped for GC. Having more than one way only brings confusion and concerns about ordering (what needs to be set/unset/checked, in what order).
- We now trap threads on preempt->coop transition only.
In addition to suspension trapping, CoreCLR has other reasons to trap threads (ThreadAbort, Debugger...). The checks for those conditions were done either on transition to coop or on transition from coop - for no good reason. A program will see the same number of forward and reverse coop transitions, so trapping at either edge would work, except that preempt->coop must have a trap for GC purposes and coop->preempt does not.
In short - this removes `RareEnablePreemptiveGC`, because having `RareDisablePreemptiveGC` is enough. Exit to preemptive mode is now an unconditional store to `m_fPreemptiveGCDisabled`.
- The trap flag for EE suspension is now a single bit that is set/unset atomically and also is the "source of truth" on whether coop threads need to be suspended.
We still have the trap counter for scenarios like ThreadAbort as there could be multiple at a time (NativeAOT may need something similar if ThreadAbort is implemented). There could be only one suspension at a time though, thus it is possible and useful to have a single dedicated bit for that.
- Avoid posting new suspension signals when the previous one is still in progress.
This was done on Windows, but not on Unix. Redundant signals can be harmful on Unix too.
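As a rough illustration of the trap-bit and transition changes listed above, the sketch below keeps a dedicated suspension bit next to a counter for the other trap reasons, and checks traps only when entering coop mode. The layout (bit 0 for suspension, counter in steps of 2) and the names `TrapStateSketch` and `ThreadSketch` are assumptions for illustration, not the actual CoreCLR definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 0 is the suspension trap, the remaining bits
// count other trap requests (thread abort, debugger, ...) in steps of 2.
constexpr uint32_t kSuspendBit = 1u;
constexpr uint32_t kOtherTrapUnit = 2u;

struct TrapStateSketch
{
    std::atomic<uint32_t> bits{0};

    void SetSuspendTrap()   { bits.fetch_or(kSuspendBit); }
    void ClearSuspendTrap() { bits.fetch_and(~kSuspendBit); }
    void AddOtherTrap()     { bits.fetch_add(kOtherTrapUnit); }
    void RemoveOtherTrap()
    {
        uint32_t before = bits.fetch_sub(kOtherTrapUnit);
        assert(before >= kOtherTrapUnit); // must pair with AddOtherTrap
        (void)before;
    }

    bool SuspendRequested() const { return (bits.load() & kSuspendBit) != 0; }
    bool AnyTrap() const          { return bits.load() != 0; }
};

struct ThreadSketch
{
    std::atomic<bool> preemptiveGCDisabled{false}; // true means cooperative mode
    int slowPathHits = 0;

    // coop -> preempt: no trap check, just an unconditional store.
    void EnablePreemptiveGC()
    {
        preemptiveGCDisabled.store(false, std::memory_order_release);
    }

    // preempt -> coop: the only edge that looks at the traps.
    void DisablePreemptiveGC(const TrapStateSketch& traps)
    {
        preemptiveGCDisabled.store(true);
        if (traps.AnyTrap())
            RareDisablePreemptiveGCSketch();
    }

    void RareDisablePreemptiveGCSketch()
    {
        // A runtime would leave coop mode and wait here until the suspension is
        // over, or service the other trap reasons. The sketch only counts visits.
        slowPathHits++;
    }
};
```

Because only one suspension can be in flight, a single bit is enough for it, while the counter still allows several other trap reasons to be active at once.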