Apply tiering's call counting delay more broadly #18610
Conversation
All cores, before:
(chart not captured)
After:
(chart not captured)
This is with single-proc affinity. No improvements can be seen in these tests; the regressions are due to it taking longer in some cases to reach steady state. Single core, before:
(chart not captured)
After:
(chart not captured)
Policy-wise this seems like a step in the right direction (though I suspect it's not the end of the road).
Implementation-wise the multi-threaded complexity is getting fairly intense and I'm worried bugs are lurking. I made a few suggestions for how you might be able to shed some complexity.
Thanks!
break;
}

DecrementWorkerThreadCount();
I think there is a race condition hiding here. It's convoluted and I'm not sure you'd ever get the timing to work like this in practice, but the fact that I found one makes me worried that we've got a few too many moving parts. There might be other, easier-to-hit ones lurking. Consider:
Thread A - call AsyncPromoteToTier1, queue threadpool worker, increment worker thread count, insert method into optimization queue
Thread A - call AsyncPromoteTier1 again, insert 2nd method into optimization queue, stop just before checking m_methodsPendingCountingForTier1
Thread B - threadpool thread processes the 2 methods in the queue and loops back to top of this while true loop to run again
Thread C - call OnMethodCalled, m_methodsPendingCountingForTier1 becomes non-NULL
Thread A - still within that 2nd call to AsyncPromoteTier1, because m_methodsPendingCountingForTier1 != NULL, m_hasMethodsToOptimizeAfterDelay is set to TRUE.
Thread B - threadpool exits here because there are no methods in optimization queue, worker count decremented to 0
Thread D - timer callback thread runs, because m_hasMethodsToOptimizeAfterDelay = TRUE it calls OptimizeMethods(). However there are no methods in the queue so it comes here, decrement worker thread count to -1. Invariant broken.
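The broken invariant at the end of this scenario can be sketched in isolation (a hypothetical stand-in, not the coreclr code): one increment matched by two decrements, one from the thread pool worker exiting and one from the timer callback exiting, drives the count to -1.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the worker-count bookkeeping described above
// (not the actual coreclr code). The count tracks running optimization
// workers and is expected to stay >= 0.
class WorkerThreadCount {
public:
    void Increment() { ++m_count; }

    // Returns the new count. In the scenario above this runs twice
    // (thread pool worker exit + timer callback exit) for one increment.
    int Decrement() { return --m_count; }

    int Current() const { return m_count.load(); }

private:
    std::atomic<int> m_count{0};
};
```

Walking the scenario through this sketch: Thread A increments once, Thread B's worker decrements back to 0, and Thread D's timer callback then decrements again, leaving the count at -1.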
A few complexities you might be able to simplify:
a) Using the timer callback thread optionally as a background method compilation thread introduces multiple code flow paths into the same async work. Either we should keep the timer callback fully separate, or clearly define the invariants that are shared across all worker threads and if possible use a shared code path to deal with shared invariants.
b) We've got two different locks protecting different pieces of state, which makes it trickier to reason about the allowable states. I suspect we could converge to a single spin lock? For example, m_methodsToOptimize is protected by a spin lock and m_hasMethodsToOptimizeAfterDelay is protected under the Crst. If there was a single lock, I think you could get rid of m_hasMethodsToOptimizeAfterDelay and just check whether or not queued work exists.
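A minimal sketch of the single-lock shape suggested in (b), with hypothetical class and method names and std::mutex standing in for the runtime lock: once the queue and the delay state live under one lock, "is a worker needed" can be recomputed on demand instead of being cached in a flag like m_hasMethodsToOptimizeAfterDelay.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical sketch, not the coreclr implementation: one lock guards both
// the optimization queue and the delay state, so "work is pending and may
// run" is derived rather than stored in a separate boolean.
class SingleLockScheduler {
public:
    void EnqueueMethod(int methodId) {
        std::lock_guard<std::mutex> hold(m_lock);
        m_methodsToOptimize.push_back(methodId);
    }

    void SetDelayActive(bool active) {
        std::lock_guard<std::mutex> hold(m_lock);
        m_delayActive = active;
    }

    // Recomputed under the same lock as every mutation; cannot go stale.
    bool ShouldRunWorker() {
        std::lock_guard<std::mutex> hold(m_lock);
        return !m_delayActive && !m_methodsToOptimize.empty();
    }

private:
    std::mutex m_lock;
    std::deque<int> m_methodsToOptimize;
    bool m_delayActive = false;
};
```

The point of the shape is that no code path has to remember to update a cached "has work after delay" flag; every reader derives it from the two facts the lock already protects.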
Thread A - call AsyncPromoteTier1 again, insert 2nd method into optimization queue, stop just before checking m_methodsPendingCountingForTier1
Before checking m_methodsPendingCountingForTier1, it would check the thread-running count inside the same lock it used to add the method to the optimization queue:
if (0 == m_countOptimizationThreadsRunning && !m_isAppDomainShuttingDown)
And would just return?
My goal was that the delay-active check only happens when the thread-running count == 1, such that either it sets m_hasMethodsToOptimizeAfterDelay inside the lock (ensuring that the timer callback will optimize methods) or it queues to the thread pool.
yep you are right... let me see if that sinks it or there is a modified repro ; )
a) Using the timer callback thread optionally as a background method compilation thread introduces multiple code flow paths into the same async work. Either we should keep the timer callback fully separate, or clearly define the invariants that are shared across all worker threads and if possible use a shared code path to deal with shared invariants.
It just felt unnecessary to queue to the thread pool again when we already have a thread pool thread ready to optimize. The invariants at the entry to the shared code path (OptimizeMethods), I think, are:
- thread-running count is 1
- the thread has already entered the app domain
I should add asserts for those to OptimizeMethods to state the preconditions. Do you have other ideas?
b) We've got two different locks protecting different pieces of state, which makes it trickier to reason about the allowable states. I suspect we could converge to a single spin lock? For example, m_methodsToOptimize is protected by a spin lock and m_hasMethodsToOptimizeAfterDelay is protected under the Crst. If there was a single lock, I think you could get rid of m_hasMethodsToOptimizeAfterDelay and just check whether or not queued work exists.
I had to change the spin lock for the call counting delay into a Crst because apparently you can't enter a Crst from inside a spin lock. That lock protects the fields immediately following it in the .h file. Currently the two locks also protect distinct things, such that they would not need to be nested. They could be combined for simplicity; I doubt it would make any difference since both locks are typically held for a short duration, but the combined lock would have to be a Crst. A Crst probably wouldn't perform much differently from a spin lock here either, since it's unlikely to be contended for long or too frequently.
Another thought is maybe all of the call counting delay stuff can be separated out into a separate class
But I think merging the locks would be fine for now
I think I am getting a better grasp on what the delayed queueing invariants are. I'm still thinking about whether I have more precise suggestions on what to change, or maybe it's just a matter of comments to explain the invariants. I'll keep thinking on it, but have a good vacation!
By the end I felt fairly convinced that what you had was correct; it just felt hard to reason about, or to predict how it would be affected by further modifications. I messed around with some refactoring in my fork in the TierFix branch. Aside from breaking a few methods down into smaller pieces, I also merged the locks and eliminated m_hasMethodsToOptimizeAfterDelay in favor of being able to recalculate at any time whether another worker thread is needed. I haven't tested it, nor am I saying you should definitely do it that way, but I think it's worth a look. The main things I liked about refactoring this way:
a) single lock feels easier to reason about state changes
b) worker thread count again represents threads that are actively running (or queued to run imminently)
c) it seems closer to what we would need if we wanted to increase parallelism or drive Pause/Resume with other triggering mechanisms.
Thanks @noahfalk. I have folded some suggestions from your fork into the change:
- Merged locks as Crst, see comment above call to CreateTimerQueueTimer. Perf doesn't seem to be affected.
- Refactored thread count increment and queuing to thread pool into separate functions
- Eliminated m_hasMethodsToOptimizeAfterDelay and used your mechanism instead
- If we would need to add manual pause/resume capabilities, there may be more things to take care of, such as:
  - Keeping track of nested manual pause requests such that tiering is not resumed until all pausers have requested to resume
  - Creating/deleting the timer at the appropriate times
  - Syncing manual resumes with the automatic resume from the timer
- It would be possible to do if we need that capability, but it seems like it would add complication by introducing issues that don't exist currently and may or may not exist in the future, so I have left that out
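The nested-pause bookkeeping mentioned above could look something like this (a hypothetical sketch, not code from the PR): a refcount under a lock, so tiering resumes only after every pauser has called Resume.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical sketch of nested pause tracking: each Pause() must be
// matched by a Resume() before tiering is considered resumed.
class TieringPauseTracker {
public:
    void Pause() {
        std::lock_guard<std::mutex> hold(m_lock);
        ++m_pauseCount;
    }

    void Resume() {
        std::lock_guard<std::mutex> hold(m_lock);
        if (m_pauseCount > 0) {
            --m_pauseCount;
        }
    }

    bool IsPaused() {
        std::lock_guard<std::mutex> hold(m_lock);
        return m_pauseCount > 0;
    }

private:
    std::mutex m_lock;
    int m_pauseCount = 0;
};
```

This covers only the first bullet; syncing with the timer-driven automatic resume would still need separate handling.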
Looks good to me!
src/vm/tieredcompilation.cpp
Outdated
@@ -285,6 +337,19 @@ void TieredCompilationManager::AsyncPromoteMethodToTier1(MethodDesc* pMethodDesc
    }
}

if (m_methodsPendingCountingForTier1 != nullptr)
I assume the intent of this check is along the lines of
if(IsDelayActive())
If so, it might be useful to make a tiny inlinable wrapper and use that. At some point, when we better understand the circumstances in which the delay is useful, we might want it to activate for conditions that don't have any methods pending call counting.
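The suggested wrapper might look like this (a sketch with a simplified stand-in class; only the field name m_methodsPendingCountingForTier1 comes from the PR):

```cpp
#include <cassert>

// Simplified stand-in for the manager class. The wrapper names the policy
// ("is the delay active?") so callers don't depend on its current
// representation (a non-null pending-counting list).
class TieredCompilationManagerSketch {
public:
    bool IsTieringDelayActive() const {
        return m_methodsPendingCountingForTier1 != nullptr;
    }

    void SetPendingCountingList(void* list) {
        m_methodsPendingCountingForTier1 = list;
    }

private:
    void* m_methodsPendingCountingForTier1 = nullptr;
};
```

If the activation condition later changes, only the wrapper body needs to be updated, not every call site.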
Yep will do
Your first set of before/after numbers (the multi-core ones) - can you check if those got posted right? As far as I can tell, before and after are perfect copies, and I would expect some minimal amount of random variation.
Ah, I copied the wrong one. I'll have to run it again, but I'm running out of time at the moment. I'll finish up this PR when I'm back.
Issues
- When some time passes between process startup and first significant use of the app, startup perf with tiering can be slower because the call counting delay is no longer in effect
- This is especially true when the process is affinitized to one cpu

Fixes
- Initiate and prolong the call counting delay upon tier 0 activity (jitting or r2r code lookup for a new method)
- Stop call counting for a called method when the delay is in effect
- Stop (and don't start) tier 1 jitting when the delay is in effect
- After the delay, resume call counting and tier 1 jitting
- If the process is affinitized to one cpu at process startup, multiply the delay by 10

No change in benchmarks.
Updated perf numbers above inline
src/vm/tieredcompilation.cpp
Outdated
EX_TRY
{
    if (ThreadpoolMgr::ChangeTimerQueueTimer(
        m_tieringDelayTimerHandle,
does this need to be lockless access?
It doesn't need to be locked or lock-free, but I can add a lock
src/vm/tieredcompilation.cpp
Outdated
// Reschedule the timer if there has been recent tier 0 activity (when a new eligible method is called the first time) to
// further delay call counting
if (m_tier1CallCountingCandidateMethodRecentlyRecorded)
does this need to be lock-free access?
[EDIT]: I don't think it's buggy, I just get cautious about doing anything lock-free if it doesn't need to be.
It doesn't need to be locked or lock-free, but I can add a lock. Will change the update of this variable to be locked as well since it's convenient (though it's not necessary).
src/vm/tieredcompilation.cpp
Outdated
// Reschedule the timer if a tier 0 JIT has been invoked since the timer was started to further delay call counting
if (m_wasTier0JitInvokedSinceCountingDelayReset)
// It's possible for the timer to tick before it is recorded that the delay is in effect, so wait for that to complete
while (!IsTieringDelayActive())
Any reason to do this lock-free? Acquiring m_lock would eliminate the need for this improvised wait.
True, adding a lock
src/vm/tieredcompilation.cpp
Outdated
{
    WRAPPER_NO_CONTRACT;
    _ASSERTE(m_tieringDelayTimerHandle != nullptr);
Doing this assert lock-free could theoretically trigger because of memory access races.
The timer handle is set before the timer is scheduled, so there shouldn't be any race, but I'll add a lock here to simplify the other things.
Is that a memory ordering guarantee the OS/threadpool typically makes (real question, trying to inform myself)? I was approaching from the pessimistic point of view... if I couldn't prove there was a memory barrier or lock in between the write and the read I assumed it wasn't there.
In general, when background work is queued, the background work must (when it runs) be able to see changes to memory made prior to queuing. Otherwise ordering bugs would be too easy to introduce (and unreliable to catch), and the contract would force users to add memory barriers just to ensure ordering, redundantly with barriers the subsystem may already need internally.
That aside, it is kind of subtle because it's not always guaranteed that the timer object/handle/etc. is returned and stored in the right memory location before the timer may tick. In this case it is, but otherwise some synchronization would be necessary. I prefer a timer API to have a Start() call that would completely eliminate that issue. We could create a timer with an infinite due time and change it later, but unfortunately changing the timer here may also fail (ideally changing a timer should not fail).
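In portable C++ terms, the guarantee described here corresponds to the synchronizes-with edge created when work is handed to another thread: plain writes made before the handoff are visible to the queued work without extra barriers. A minimal illustration, with std::thread standing in for the threadpool/timer queue:

```cpp
#include <thread>

// Writes made before a thread (or work item) is started are visible to it:
// thread creation synchronizes-with the start of the new thread's execution,
// so no explicit barrier is needed for sharedState.
int ObserveStateFromQueuedWork() {
    int sharedState = 0;
    sharedState = 42; // plain write before the "queuing" step

    int observed = 0;
    std::thread worker([&] { observed = sharedState; });
    worker.join();
    return observed;
}
```

The subtlety raised above still applies: this covers writes made before the handoff, not writes racing with it (such as storing the timer handle after the timer may already have ticked).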
src/vm/tieredcompilation.cpp
Outdated
_ASSERTE(g_pConfig->TieredCompilation());
_ASSERTE(g_pConfig->TieredCompilation_Tier1CallCountingDelayMs() != 0);

if (IsTieringDelayActive())
I think this check is unneeded, the condition was checked just prior to making the call?
It's generally not needed at all; it's just a shortcut to avoid unnecessary allocation during races. I'll remove it, since the shortcut would rarely pay off.
I think there are a couple lock related issues I commented on, but otherwise LGTM, thanks!
Port of dotnet#18610 to 2.2

Issues
- When some time passes between process startup and first significant use of the app, startup perf with tiering can be slower because the call counting delay is no longer in effect
- This is especially true when the process is affinitized to one cpu

Fixes
- Initiate and prolong the call counting delay upon tier 0 activity (jitting or r2r code lookup for a new method)
- Stop call counting for a called method when the delay is in effect
- Stop (and don't start) tier 1 jitting when the delay is in effect
- After the delay, resume call counting and tier 1 jitting
- If the process is affinitized to one cpu at process startup, multiply the delay by 10

No change in benchmarks.
This is a port of several changes that went into master after 2.2 forked, including dependencies for, and enabling, tiered compilation by default in 2.2. A quick summary of the commits is below; see the commit descriptions and PRs for more info.
- Commit 1 - Fix nested spin locks in thread pool etw firing (#17677)
  - Fixes a lock nesting issue when there is an ETW listener, which can occur without tiering, but is almost deterministic with tiering enabled because the first event that is fired typically hits this code path
- Commit 2 - Don't close the JIT func info file on shutdown (#18060)
  - Fixes a crash during shutdown that only occurs when JIT logging is enabled (typically in the coreclr tests and CI). More frequent with tiering enabled because of different JIT timing and background jitting.
- Commit 3 - Apply tiering's call counting delay more broadly (#18610)
  - Fixes a perf issue when tiering is enabled in server first-request scenarios where there is a significant gap between process startup and first request
- Commit 4 - Eliminate arm64 contract asserts (#19015) - changes only affect debug builds
  - Fixes some incorrect asserts that trigger more frequently with tiering
- Commit 5 - Use 16 bytes to spill SIMD12 (#19237)
  - Fixes a crash in corefx System.Numerics.Tests.Vector3Tests.Vector3EqualsTest. Occurs with minopt JIT or with tiering.
- Commit 6 - Fix an apartment state issue (partial port of #19384)
  - This is a partial port of the PR (only the portion that addresses issue #17822)
  - This is a breaking change, though a minor one that we have concluded is an acceptable risk to take for 2.2
  - Fixes a behavioral difference, seen more easily with tiering enabled, in APIs on the `Thread` class relevant to apartment state. The issue can also be seen in some cases when tiering is disabled.
- Commit 7 - Enable Tiered Compilation by default (#19525)
  - Enables tiering by default; it can be disabled through the environment, or through .csproj/.json when using dotnet
  - Removes the deprecated config variable (EXPERIMENTAL_TieredCompilation) that was previously exposed in 2.1 alongside the current config variable (TieredCompilation), along with miscellaneous test fixes
- Commit 8 - Fix tiered compilation option for case-sensitive systems (#19567) - changes only affect tests
  - Fixes tiering environment variable casing for non-Windows platforms
- Commit 9 - Disable tiered compilation on arm64
  - There is an open issue that may be partly related to minopts on arm64 (https://github.com/dotnet/coreclr/issues/18895). Disabling tiering by default on arm64 limits exposing new issues. This change would be followed up with dotnet/corefx#31822
- Adds tests for Commit 6 - Fix an apartment state issue (partial port of #19384) - changes only affect tests

Closes https://github.com/dotnet/coreclr/issues/18973