This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Apply tiering's call counting delay more broadly #18610

Merged

kouvel merged 6 commits into dotnet:master from the TierFix branch on Jul 17, 2018

Conversation

kouvel
Member

@kouvel kouvel commented Jun 22, 2018

Issues

  • When some time passes between process startup and first significant use of the app, startup perf with tiering can be slower because the call counting delay is no longer in effect
  • This is especially true when the process is affinitized to one cpu

Fixes

  • Initiate and prolong the call counting delay upon tier 0 activity (jitting or r2r code lookup for a new method)
  • Stop call counting for a called method when the delay is in effect
  • Stop (and don't start) tier 1 jitting when the delay is in effect
  • After the delay resume call counting and tier 1 jitting
  • If the process is affinitized to one cpu at process startup, multiply the delay by 10

No change in benchmarks.
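As a rough illustration of the policy described above, here is a hedged sketch with hypothetical names (CallCountingDelaySketch, OnTier0Activity, ScheduleTimer); it is not the actual TieredCompilationManager code.

#include <atomic>

// Sketch only: hypothetical names, not the CoreCLR implementation.
class CallCountingDelaySketch
{
    std::atomic<bool> m_delayActive{false};
    std::atomic<bool> m_tier0ActivitySinceTimerSet{false};
    unsigned m_delayMs;

public:
    explicit CallCountingDelaySketch(bool singleProcAffinity, unsigned baseDelayMs)
        // If the process is affinitized to one CPU at startup, multiply the delay by 10
        : m_delayMs(singleProcAffinity ? baseDelayMs * 10 : baseDelayMs)
    {
    }

    // Called on tier 0 activity (jitting or R2R code lookup for a new method)
    void OnTier0Activity()
    {
        if (!m_delayActive.exchange(true))
        {
            ScheduleTimer(m_delayMs);                // initiate the delay
        }
        else
        {
            m_tier0ActivitySinceTimerSet = true;     // prolong the delay at the next tick
        }
    }

    // Call counting and tier 1 jitting are suppressed while the delay is in effect
    bool ShouldCountCallsAndJitTier1() const
    {
        return !m_delayActive;
    }

    // Timer callback: reschedule if there was more tier 0 activity, otherwise resume
    void OnTimerTick()
    {
        if (m_tier0ActivitySinceTimerSet.exchange(false))
        {
            ScheduleTimer(m_delayMs);
            return;
        }
        m_delayActive = false;
        // resume call counting and tier 1 jitting for methods recorded during the delay
    }

private:
    void ScheduleTimer(unsigned dueTimeMs) { (void)dueTimeMs; /* one-shot timer */ }
};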

@kouvel kouvel added the area-VM label Jun 22, 2018
@kouvel kouvel added this to the 3.0 milestone Jun 22, 2018
@kouvel kouvel self-assigned this Jun 22, 2018
@kouvel kouvel requested a review from noahfalk June 22, 2018 20:01
@kouvel
Member Author

kouvel commented Jun 22, 2018

All cores, before:

       Benchmark                Metric                Default               Tiering              Minopts
-----------------------  --------------------  ---------------------  -------------------  --------------------
Dotnet_Build_HelloWorld         Duration (ms)       1230 (1219-1244)     1110 (1102-1117)    1066.5 (1065-1068)
        Csc_Hello_World         Duration (ms)      586 (585.5-586.5)      459 (457-460.5)   443.5 (442.5-445.5)
      Csc_Roslyn_Source         Duration (ms)       2410 (2392-2460)     2428 (2404-2458)  2245.5 (2241-2250.5)
             MusicStore          Startup (ms)    562.5 (558.5-567.5)        506 (499-508)         478 (470-484)
             MusicStore    First Request (ms)        569.5 (568-572)    435.5 (434.5-436)     402.5 (400-404.5)
             MusicStore  Median Response (ms)       2.48 (2.465-2.5)   1.93 (1.865-1.935)   2.765 (2.735-2.785)
               AllReady          Startup (ms)     1284.5 (1284-1291)     1130 (1129-1132)      1031 (1024-1078)
               AllReady    First Request (ms)      468 (465.5-474.5)      344 (343-344.5)         328 (320-350)
               AllReady  Median Response (ms)      3.48 (3.44-3.535)   2.745 (2.73-2.785)     3.88 (3.845-3.92)
               Word2Vec         Training (ms)    32895 (32650-33412)  35034 (34768-35215)   35470 (34910-36134)
               Word2Vec     First Search (ms)           25 (25-25.5)       87.5 (86.5-88)       102 (100.5-107)
               Word2Vec    Median Search (ms)  21.775 (21.735-21.81)  21.81 (21.76-22.56)      101 (99.7-106.1)

After:

       Benchmark                Metric               Default                Tiering                Minopts
-----------------------  --------------------  --------------------  ----------------------  -------------------
Dotnet_Build_HelloWorld         Duration (ms)    1208 (1205.5-1211)  1110.5 (1106.5-1114.5)     1088 (1070-1092)
        Csc_Hello_World         Duration (ms)         590 (588-599)         460 (457.5-463)      446 (445-446.5)
      Csc_Roslyn_Source         Duration (ms)      2389 (2381-2401)        2469 (2460-2474)     2247 (2236-2255)
             MusicStore          Startup (ms)         562 (553-568)           508 (502-514)      479 (477.5-485)
             MusicStore    First Request (ms)       565.5 (564-567)       437 (434.5-438.5)    403.5 (400.5-406)
             MusicStore  Median Response (ms)   2.505 (2.475-2.515)       1.94 (1.865-1.95)   2.775 (2.76-2.795)
               AllReady          Startup (ms)    1301.5 (1300-1306)        1128 (1123-1134)   1032.5 (1030-1038)
               AllReady    First Request (ms)         478 (468-490)           348 (346-360)        322 (318-329)
               AllReady  Median Response (ms)        3.56 (3.5-3.6)       2.77 (2.765-2.77)    4.02 (3.945-4.03)
               Word2Vec         Training (ms)   33272 (32900-33485)     35788 (35524-36118)  35992 (35682-36476)
               Word2Vec     First Search (ms)            25 (25-25)             92 (90-101)        101 (101-101)
               Word2Vec    Median Search (ms)  21.84 (21.84-21.855)   21.505 (21.475-21.52)  99.61 (99.58-99.66)

@kouvel kouvel mentioned this pull request Jun 22, 2018
@kouvel
Member Author

kouvel commented Jun 22, 2018

This is with single-proc affinity. No improvements can be seen in these tests; the regressions are due to it taking longer in some cases to reach steady state.

Single core, before:

       Benchmark                Metric                Default                 Tiering                 Minopts
-----------------------  --------------------  ----------------------  ----------------------  ----------------------
Dotnet_Build_HelloWorld         Duration (ms)        1796 (1788-1799)        1790 (1784-1812)      1777.5 (1774-1783)
        Csc_Hello_World         Duration (ms)           616 (605-632)         474.5 (472-477)       460.5 (458.5-461)
      Csc_Roslyn_Source         Duration (ms)        4668 (4647-4687)        4986 (4970-5020)        4524 (4517-4537)
             MusicStore          Startup (ms)           610 (604-624)         543 (541-545.5)           512 (509-528)
             MusicStore    First Request (ms)         673 (671.5-678)           521 (518-535)       468 (466.5-469.5)
             MusicStore  Median Response (ms)          1.98 (1.975-2)      1.495 (1.485-1.52)       2.21 (2.205-2.21)
               AllReady          Startup (ms)      1445 (1441.5-1448)        1232 (1226-1276)        1122 (1120-1135)
               AllReady    First Request (ms)     521.5 (520.5-522.5)       390.5 (387.5-397)     352.5 (351.5-353.5)
               AllReady  Median Response (ms)      3.365 (3.355-3.37)     2.545 (2.535-2.555)      3.685 (3.635-3.73)
               Word2Vec         Training (ms)  120354 (119706-120462)  124220 (120775-124990)  127307 (126506-130018)
               Word2Vec     First Search (ms)            25 (25-25.5)            90 (87.5-92)         103 (102-103.5)
               Word2Vec    Median Search (ms)     22.36 (22.28-22.43)     22.38 (22.02-22.42)       99.6 (99.4-104.8)

After:

       Benchmark                Metric                Default                 Tiering                 Minopts
-----------------------  --------------------  ----------------------  ----------------------  ----------------------
Dotnet_Build_HelloWorld         Duration (ms)        1614 (1376-1812)        1600 (1345-1852)        1766 (1460-2265)
        Csc_Hello_World         Duration (ms)         594 (593-596.5)           466 (464-482)           450 (450-451)
      Csc_Roslyn_Source         Duration (ms)        4593 (4575-4612)        4878 (4876-4891)      4463 (4458.5-4464)
             MusicStore          Startup (ms)           600 (594-602)     520.5 (517.5-525.5)           506 (504-506)
             MusicStore    First Request (ms)       664.5 (662.5-665)       509.5 (508.5-510)         458 (456.5-459)
             MusicStore  Median Response (ms)       1.92 (1.92-1.925)      1.52 (1.515-1.525)      2.125 (2.115-2.13)
               AllReady          Startup (ms)        1430 (1426-1444)    1199.5 (1197-1206.5)    1109 (1108.5-1109.5)
               AllReady    First Request (ms)         514.5 (513-516)         383 (382-384.5)         346 (345-347.5)
               AllReady  Median Response (ms)      3.255 (3.23-3.285)     2.505 (2.485-2.585)        3.6 (3.58-3.605)
               Word2Vec         Training (ms)  116018 (115800-116184)  117818 (117585-118184)  122346 (121541-122896)
               Word2Vec     First Search (ms)              25 (25-25)             96 (91-101)         100.5 (100-101)
               Word2Vec    Median Search (ms)     21.81 (21.74-21.92)              38 (22-58)      99.52 (99.5-99.62)

Member

@noahfalk noahfalk left a comment

Policy-wise this seems like a step in the right direction (though I suspect it's not the end of the road).

Implementation-wise, the multi-threaded complexity is getting fairly intense and I'm worried bugs are lurking. I made a few suggestions on how you might be able to shed some complexity.

Thanks!

break;
}

DecrementWorkerThreadCount();
Member

I think there is a race condition hiding here. It's convoluted and I'm not sure you'd ever get the timing to work like this in practice, but the fact that I found one makes me worried that we've got a few too many moving parts. There might be other, easier-to-hit ones lurking. Consider:

Thread A - call AsyncPromoteToTier1, queue threadpool worker, increment worker thread count, insert method into optimization queue
Thread A - call AsyncPromoteTier1 again, insert 2nd method into optimization queue, stop just before checking m_methodsPendingCountingForTier1
Thread B - threadpool thread processes the 2 methods in the queue and loops back to top of this while true loop to run again
Thread C - call OnMethodCalled, m_methodsPendingCountingForTier1 becomes non-NULL
Thread A - still within that 2nd call to AsyncPromoteTier1, because m_methodsPendingCountingForTier1 != NULL, m_hasMethodsToOptimizeAfterDelay is set to TRUE.
Thread B - threadpool exits here because there are no methods in optimization queue, worker count decremented to 0
Thread D - timer callback thread runs, because m_hasMethodsToOptimizeAfterDelay = TRUE it calls OptimizeMethods(). However there are no methods in the queue so it comes here, decrement worker thread count to -1. Invariant broken.

A few complexities you might be able to simplify:
a) Using the timer callback thread optionally as a background method compilation thread introduces multiple code flow paths into the same async work. Either we should keep the timer callback fully separate, or clearly define the invariants that are shared across all worker threads and if possible use a shared code path to deal with shared invariants.
b) We've got two different locks protecting different pieces of state which makes it trickier to reason about the allowable states. I suspect we could converge to a single spin lock? For example m_methodsToOptimize is protected by the spin lock and m_hasMethodsToOptimizeAfterDelay is protected under the Crst. If there was a single lock I think you could get rid of m_hasMethodsToOptimizeAfterDelay and just check whether or not queued work exists.

Member Author

Thread A - call AsyncPromoteTier1 again, insert 2nd method into optimization queue, stop just before checking m_methodsPendingCountingForTier1

Before checking m_methodsPendingCountingForTier1, it would check the thread-running count inside the same lock it used to add the method to the optimization queue:

if (0 == m_countOptimizationThreadsRunning && !m_isAppDomainShuttingDown)

And would just return?

My goal was that it only checks whether the delay is active when the thread-running count == 1, so that it will either set m_hasMethodsToOptimizeAfterDelay inside a lock (ensuring that the timer callback will optimize methods) or queue to the thread pool.
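A simplified model of that ordering (a hypothetical, self-contained sketch using std::mutex in place of the VM's spin lock; the names only mirror the fields discussed here):

#include <deque>
#include <mutex>

// Hypothetical model of the ordering described above; not the actual CoreCLR code.
struct AsyncPromoteSketch
{
    std::mutex lock;                          // stands in for the spin lock on the queue
    std::deque<void*> methodsToOptimize;      // m_methodsToOptimize
    int  countOptimizationThreadsRunning = 0; // m_countOptimizationThreadsRunning
    bool isAppDomainShuttingDown = false;     // m_isAppDomainShuttingDown
    bool delayIsActive = false;               // m_methodsPendingCountingForTier1 != NULL
    bool hasMethodsToOptimizeAfterDelay = false;

    void QueueToThreadPool() { /* schedule OptimizeMethods() on a worker thread */ }

    void AsyncPromote(void* method)
    {
        {
            std::lock_guard<std::mutex> hold(lock);
            methodsToOptimize.push_back(method);

            // If a worker is already running (or reserved), it will drain the new entry.
            if (countOptimizationThreadsRunning != 0 || isAppDomainShuttingDown)
            {
                return;
            }
            countOptimizationThreadsRunning = 1;

            // Delay active: let the timer callback optimize after the delay instead.
            if (delayIsActive)
            {
                hasMethodsToOptimizeAfterDelay = true;
                return;
            }
        }
        QueueToThreadPool();
    }
};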

Member

Yep, you are right... let me see if that sinks it or whether there is a modified repro ; )

Member Author

a) Using the timer callback thread optionally as a background method compilation thread introduces multiple code flow paths into the same async work. Either we should keep the timer callback fully separate, or clearly define the invariants that are shared across all worker threads and if possible use a shared code path to deal with shared invariants.

It just felt unnecessary to queue to the thread pool again when we already have a thread pool thread ready to optimize. I think the invariants at the entry to the shared code path (OptimizeMethods) are:

  • thread-running count is 1
  • the thread has already entered the app domain

I should add asserts for those to OptimizeMethods to state the preconditions. Do you have other ideas?
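For reference, the asserts might look something like this at the top of OptimizeMethods (a sketch only; the second precondition is harder to assert cheaply and may just warrant a comment):

// Sketch only -- preconditions stated at the top of OptimizeMethods():
WRAPPER_NO_CONTRACT;
_ASSERTE(m_countOptimizationThreadsRunning == 1); // exactly one worker (this thread) is running
// Precondition: the caller has already entered the target app domain.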

b) We've got two different locks protecting different pieces of state which makes it trickier to reason about the allowable states. I suspect we could converge to a single spin lock? For example m_methodsToOptimize is protected by the spin lock and m_hasMethodsToOptimizeAfterDelay is protected under the Crst. If there was a single lock I think you could get rid of m_hasMethodsToOptimizeAfterDelay and just check whether or not queued work exists.

I had to change the spin lock for the call counting delay into a Crst because apparently you can't enter a Crst from inside a spin lock. That lock protects the fields immediately following it in the .h file. Currently the two locks protect distinct things, such that they would not need to be nested. They could be combined for simplicity; I doubt it would make any difference since both locks are typically held for a short duration, but it would have to be a Crst. A Crst probably wouldn't make much difference compared to a spin lock either, since it's unlikely to be contended for very long or too frequently.

Member Author

Another thought is that maybe all of the call counting delay stuff could be separated out into a separate class.

Member Author

But I think merging the locks would be fine for now

Member

I think I am getting a better grasp on what the delayed queueing invariants are. I'm still thinking about whether I've got more precise suggestions on what to change, or maybe it's just a matter of comments to explain the invariants. I'll keep thinking on it but have a good vacation!

Member

By the end I felt fairly convinced that what you had was correct; it just felt hard to reason about it or how it would be affected by further modifications. I messed around with some refactoring in my fork in the TierFix branch. Aside from just breaking down a few methods into smaller pieces, I also merged the locks and eliminated m_hasMethodsToOptimizeAfterDelay in favor of being able to recalculate at any time whether another worker thread is needed. I haven't tested it, nor am I saying you should definitely do it that way, but I think it's worth a look. The main things I liked about refactoring this way (see the sketch below):
a) a single lock feels easier to reason about for state changes
b) the worker thread count again represents threads that are actively running (or queued to run imminently)
c) it seems closer to what we would need if we wanted to increase parallelism or drive Pause/Resume with other triggering mechanisms.
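A condensed sketch of that shape (hypothetical names; the actual TierFix branch may differ): with one lock, whether another worker is needed can be recomputed from the queue and delay state whenever it matters, instead of carrying m_hasMethodsToOptimizeAfterDelay.

#include <deque>
#include <mutex>

// Hypothetical sketch of recomputing "is another worker needed?" under one lock;
// not taken from the actual TierFix branch.
struct SingleLockSketch
{
    std::mutex lock;                      // the single lock (a Crst in the VM)
    std::deque<void*> methodsToOptimize;  // m_methodsToOptimize
    bool delayActive = false;             // m_methodsPendingCountingForTier1 != NULL
    int  workersRunning = 0;              // threads actively running or queued to run

    // Called after enqueuing work or after the delay expires.
    bool TryReserveWorker()
    {
        std::lock_guard<std::mutex> hold(lock);
        if (methodsToOptimize.empty() || delayActive || workersRunning != 0)
        {
            return false;                 // no additional worker needed right now
        }
        ++workersRunning;                 // reserve before queuing to the thread pool
        return true;
    }
};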

Member Author

Thanks @noahfalk. I have folded some suggestions from your fork into the change:

  • Merged the locks into a Crst; see the comment above the call to CreateTimerQueueTimer. Perf doesn't seem to be affected.
  • Refactored the thread count increment and queuing to the thread pool into separate functions
  • Eliminated m_hasMethodsToOptimizeAfterDelay and used your mechanism instead
  • If we need to add manual pause/resume capabilities later, there may be more things to take care of
    • Such as:
      • Keeping track of nested manual pause requests such that tiering is not resumed until all pausers have requested to resume
      • Creating/deleting the timer at the appropriate times
      • Syncing manual resumes with the automatic resume from the timer
    • It would be possible to do if we need that capability, but it would add complication by introducing issues that don't exist currently and may or may not exist in the future, so I have left that out

Member

Looks good to me!

@@ -285,6 +337,19 @@ void TieredCompilationManager::AsyncPromoteMethodToTier1(MethodDesc* pMethodDesc
}
}

if (m_methodsPendingCountingForTier1 != nullptr)
Member

I assume the intent of this check is along the lines of

if(IsDelayActive())

If so, it might be useful to make a tiny inlinable wrapper and use that. At some point, when we better understand the circumstances in which the delay is useful, we might want it to activate under conditions that don't have any methods pending call counting.
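Something like the following, presumably (a sketch; the later snippets in this PR do show a member named IsTieringDelayActive, though its exact body isn't visible here):

// Sketch of the tiny inlinable wrapper; placed in the class declaration.
bool IsTieringDelayActive()
{
    LIMITED_METHOD_CONTRACT;
    return m_methodsPendingCountingForTier1 != nullptr;
}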

Member Author

Yep will do

@noahfalk
Member

Your first set of before/after numbers (the multi-core ones) - can you check if those got posted right? As far as I can tell before and after are perfect copies and I would expect some minimal amount of random variation.

@kouvel
Member Author

kouvel commented Jun 23, 2018

before and after are perfect copies

Ah I copied the wrong one. I'll have to run it again, but I'm running out of time at the moment. I'll finish up this PR when I'm back.

kouvel added 5 commits July 11, 2018 13:33
Issues
- When some time passes between process startup and first significant use of the app, startup perf with tiering can be slower because the call counting delay is no longer in effect
- This is especially true when the process is affinitized to one cpu

Fixes
- Initiate and prolong the call counting delay upon tier 0 activity (jitting or r2r code lookup for a new method)
- Stop call counting for a called method when the delay is in effect
- Stop (and don't start) tier 1 jitting when the delay is in effect
- After the delay resume call counting and tier 1 jitting
- If the process is affinitized to one cpu at process startup, multiply the delay by 10

No change in benchmarks.
@kouvel
Member Author

kouvel commented Jul 13, 2018

Updated perf numbers above inline

EX_TRY
{
if (ThreadpoolMgr::ChangeTimerQueueTimer(
m_tieringDelayTimerHandle,
Member

does this need to be lockless access?

Member Author

It doesn't need to be locked or lock-free, but I can add a lock


// Reschedule the timer if there has been recent tier 0 activity (when a new eligible method is called the first time) to
// further delay call counting
if (m_tier1CallCountingCandidateMethodRecentlyRecorded)
Member

@noahfalk noahfalk Jul 14, 2018

does this need to be lock-free access?
[EDIT]: I don't think it's buggy, I just get cautious about doing anything lock-free if it doesn't need to be.

Member Author

It doesn't need to be locked or lock-free, but I can add a lock. Will change the update of this variable to be locked as well since it's convenient (though it's not necessary).

// Reschedule the timer if a tier 0 JIT has been invoked since the timer was started to further delay call counting
if (m_wasTier0JitInvokedSinceCountingDelayReset)
// It's possible for the timer to tick before it is recorded that the delay is in effect, so wait for that to complete
while (!IsTieringDelayActive())
Member

Any reason to do this lock-free? Acquiring m_lock would eliminate the need for this improvised wait.

Member Author

True, adding a lock
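The locked version might look roughly like this (illustrative only; it assumes the thread that schedules the timer records the delay state while holding m_lock):

// Illustrative sketch: taking m_lock replaces the while (!IsTieringDelayActive()) spin.
{
    CrstHolder holder(&m_lock);   // blocks until the scheduling thread finishes recording
    _ASSERTE(IsTieringDelayActive());
    // ... handle the tick: either reschedule the timer or deactivate the delay ...
}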

{
WRAPPER_NO_CONTRACT;
_ASSERTE(m_tieringDelayTimerHandle != nullptr);
Member

Doing this assert lock-free could theoretically trigger because of memory access races.

Member Author

@kouvel kouvel Jul 16, 2018

The timer handle is set before the timer is scheduled, so there shouldn't be any race, but I'll add a lock here to simplify the other things.

Member

Is that a memory ordering guarantee the OS/threadpool typically makes (real question, trying to inform myself)? I was approaching from the pessimistic point of view... if I couldn't prove there was a memory barrier or lock in between the write and the read I assumed it wasn't there.

Member Author

In general, when background work is queued, the background work must be able to see changes to memory made prior to queuing when it runs; otherwise it would be too easy to get this wrong (unreliable), and the contract would end up requiring redundant memory barriers from users just to ensure ordering, when those barriers may already be necessary inside the subsystem anyway.

That aside, it is kind of subtle because it's not always guaranteed that the timer object/handle/etc. is returned and stored in the right memory location before the timer may tick. In this case it is, but otherwise some synchronization would be necessary. I prefer a timer API to have a Start() call that would completely eliminate that issue. We could create a timer with an infinite due time and change it later, but unfortunately changing the timer here may also fail (ideally changing a timer should not fail).
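A sketch of the kind of timer API being wished for here (hypothetical; ThreadpoolMgr does not expose this): constructing and publishing the timer object first, then calling a separate Start(), removes the window where a tick can race with storing the handle.

// Hypothetical wrapper illustrating "create, publish, then Start()"; not a real API.
class OneShotTimerSketch
{
public:
    typedef void (*Callback)(void* context);

    OneShotTimerSketch(Callback callback, void* context)
        : m_callback(callback), m_context(context)
    {
    }

    // Nothing can tick before Start(), so the owner can store the timer object
    // wherever it likes first, with no extra synchronization.
    bool Start(unsigned dueTimeMs)
    {
        (void)dueTimeMs;
        return true; // schedule the one-shot callback via the underlying timer queue
    }

private:
    Callback m_callback;
    void*    m_context;
};

// Usage sketch (m_tieringDelayTimer is hypothetical; the actual code stores a raw handle):
//   m_tieringDelayTimer = new OneShotTimerSketch(TieringDelayTimerCallback, this);
//   m_tieringDelayTimer->Start(delayMs);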

_ASSERTE(g_pConfig->TieredCompilation());
_ASSERTE(g_pConfig->TieredCompilation_Tier1CallCountingDelayMs() != 0);

if (IsTieringDelayActive())
Member

I think this check is unneeded; the condition was checked just prior to making the call?

Member Author

It's generally not needed at all; it's just a shortcut to avoid unnecessary allocation during races. I'll remove it since the check would rarely succeed anyway.

Member

@noahfalk noahfalk left a comment

I think there are a couple of lock-related issues I commented on, but otherwise LGTM, thanks!

@kouvel kouvel merged commit 6b403ca into dotnet:master Jul 17, 2018
@kouvel kouvel deleted the TierFix branch July 17, 2018 05:04
kouvel added a commit to kouvel/coreclr that referenced this pull request Aug 16, 2018
Port of dotnet#18610 to 2.2

Issues
- When some time passes between process startup and first significant use of the app, startup perf with tiering can be slower because the call counting delay is no longer in effect
- This is especially true when the process is affinitized to one cpu

Fixes
- Initiate and prolong the call counting delay upon tier 0 activity (jitting or r2r code lookup for a new method)
- Stop call counting for a called method when the delay is in effect
- Stop (and don't start) tier 1 jitting when the delay is in effect
- After the delay resume call counting and tier 1 jitting
- If the process is affinitized to one cpu at process startup, multiply the delay by 10

No change in benchmarks.
kouvel added further commits to kouvel/coreclr that referenced this pull request on Aug 16, Aug 20, Aug 22 (twice), Aug 24, and Aug 27, 2018, each a port of dotnet#18610 to 2.2 with the same description as above.
kouvel added a commit that referenced this pull request Aug 30, 2018
This is a port of several changes that went into master after 2.2 forked, including the dependencies for, and the change enabling, tiered compilation by default in 2.2. A quick summary of the commits is below; see the commit descriptions and PRs for more info.
- Commit 1 - Fix nested spin locks in thread pool etw firing (#17677)
  - Fixes a lock nesting issue when there is an ETW listener, which can occur without tiering, but is almost deterministic with tiering enabled because the first event that is fired typically hits this code path
- Commit 2 - Don't close the JIT func info file on shutdown (#18060)
  - Fixes a crash during shutdown that only occurs when JIT logging is enabled (typically in the coreclr tests and CI). More frequent with tiering enabled because of different JIT timing and background jitting.
- Commit 3 - Apply tiering's call counting delay more broadly (#18610)
  - Fixes a perf issue when tiering is enabled in server first-request scenarios where there is a significant gap between process startup and first request
- Commit 4 - Changes only affect debug builds - Eliminate arm64 contract asserts (#19015)
  - Fixes some incorrect asserts that trigger more frequently with tiering
- Commit 5 - Use 16 bytes to spill SIMD12 (#19237)
  - Fixes a crash in corefx System.Numerics.Tests.Vector3Tests.Vector3EqualsTest. Occurs with minopt JIT or with tiering.
- Commit 6 - Fix an apartment state issue (partial port of #19384)
  - This is a partial port of this PR (only the portion that addresses issue #17822)
  - This is a breaking change, though a minor one that we have concluded is an acceptable risk to take for 2.2
  - Fixes a behavioral difference that can be seen more easily with tiering enabled in APIs on the `Thread` class relevant to apartment state. The issue can also be seen in some cases when tiering is disabled.
- Commit 7 - Enable Tiered Compilation by default (#19525)
  - Enables tiering by default; it can be disabled through the environment, or through .csproj/.json when using dotnet
  - Removes the deprecated config variable (EXPERIMENTAL_TieredCompilation) that was previously exposed in 2.1 alongside the current config variable (TieredCompilation), and includes miscellaneous test fixes
- Commit 8 - Changes only affect tests - Fix tiered compilation option for case-sensitive systems (#19567)
  - Fixes tiering environment variable casing for non-Windows platforms
- Commit 9 - Disable tiered compilation on arm64
  - There is an open issue that may be partly related to minopts on arm64 (https://github.com/dotnet/coreclr/issues/18895). Disabling tiering by default on arm64 to limit exposing new issues.

This change would be followed up with dotnet/corefx#31822
- Adds tests for Commit 6 - Fix an apartment state issue (partial port of #19384)
  - Changes only affect tests

Closes https://github.com/dotnet/coreclr/issues/18973