Improve call counting mechanism #1457

Merged
merged 11 commits into from
Jan 28, 2020
6 changes: 1 addition & 5 deletions docs/design/features/code-versioning.md
@@ -330,11 +330,7 @@ to update the active child at either of those levels (ReJIT uses SetActiveILCode
In order to do step 3 the `CodeVersionManager` relies on one of three different mechanisms, a `FixupPrecode`, a `JumpStamp`, or backpatching entry point slots. In [method.hpp](https://github.com/dotnet/coreclr/blob/master/src/vm/method.hpp) these mechanisms are described in the `MethodDesc::IsVersionableWith*()` functions, and all methods have been classified to use at most one of the techniques, based on the `MethodDesc::IsVersionableWith*()` functions.
Member

The comments did a great job sketching out the design you ended up with, but I think the rationale as to why you arrived at this design as opposed to something different could be equally illuminating. Typically before making a change of this scale there would be some discussion about options or a write up in a doc so I'm not sure if I just missed that? At this point I am not expecting that we'd make large deviations from the approach in this PR unless we found a serious issue given how much effort I assume you have invested in it and the perf gains. However it would still be useful to understand among the other design options what has already been eliminated via thought experiment or performance testing and what is still interesting to experiment with in the future.

Some design alternatives that come to mind:

  1. Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)
  2. Reaching the call threshold doesn't remove the call counting stub, it persists until the tier1 code is published
  3. Call counting stubs are never deleted
  4. Call counting is integrated into a pre-existing stub (Precode?) rather than being a unique stub
  5. Call counts are maintained at a more granular level such as per ILCodeVersion or per-MethodDef
  6. Call counting completion is done fine-grained synchronously rather than coarse-grained asynchronously.

Member Author

Sure thing. There hasn't been a doc, most of the discussions happened in-person.

Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)

  • It would be difficult to suggest that there would not be a startup time loss from placing counters in tier 0 code; a regression seems likely
    • It's evident that there is a noticeable startup time loss from just the additional precodes with tiering, though it's a different type of perf hit. Counting in the method code probably would have less overhead than doing so in a stub, by avoiding an extra hop.
    • The cumulative additional JIT time, memory overhead, and size overhead from including counters in prejitted code would also have to be considered, along with the (relatively less expensive but not free) cumulative time it would take to register methods that reach the threshold for promotion during startup
    • There is not much direct compensation. The overhead of precodes and other overhead from tiering is typically compensated for by the faster tier 0 JIT, and although it may also compensate for the additional overhead from call counting, the compensation is only valid for "no tiering vs tiering" comparisons, the overhead would still exist in "tiering vs tiering" comparisons. The only direct compensation would be from code optimized during startup running faster than the tier 0 / prejitted code, and the degree to which it would improve startup time can vary greatly between scenarios.
    • If there is a startup regression there may not be a good solution. It would be possible to avoid counting during startup with counting in tier 0 code as well, by for example checking something before counting or by patching the code, but it may not eliminate the problem.
  • The types of code that run during startup are often different from the types that run at steady-state
    • It's possible that startup code also contains pieces of hot code that would benefit from being rejitted. As mentioned above, how much it would help could vary heavily between scenarios, and I haven't seen a clear indication so far that counting at startup would help startup time. Since the rejits take some time to happen (with other overhead), the method would have to be rejitted soon enough to compensate (the method would have to be called many times, much more than the call count threshold, to benefit from a rejit). On the other hand, the effect of the tiering delay can be seen by comparison as increased time to reach steady state or time taken to perform a large operation spanning several seconds. Of course we may choose to not rejit during startup anyway, but in that case I don't see any benefit in counting calls during startup.
    • Many methods would reach the threshold during startup; it's something like ~2000 methods in MusicStore and ~5000 methods in CscRoslyn. Some of them may be rejitted later anyway and some may not. On CscRoslyn there's not much difference in steady-state rejit counts when the tiering delay is disabled, but on MusicStore an additional ~3000 methods are rejitted.
    • Using call counting stubs allows the flexibility to choose when to start call counting for a method based on other information/heuristics and with minimal overhead that only affects those methods, if there happens to be a need to rejit something during startup
  • Stubs offer a flexible design. In addition to the above, they allow disabling and enabling counting at a code version level at any time
  • I'll talk about the memory overhead separately
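
To illustrate the flexibility argument in the list above, here is a minimal C++ sketch. It is not the runtime's implementation: `Method`, `entryPoint`, `EnableCounting` and `DisableCounting` are hypothetical names, and a `std::function` stands in for a patchable entry point slot. It only shows that counting can be switched on and off for one method by redirecting its entry point, so a method that is not being counted pays nothing per call.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical model of a method whose callers go through an entry point slot.
struct Method {
    int callsSeen = 0;                    // incremented only while counting is on
    std::function<int(int)> code;         // the method body (e.g. tier 0 code)
    std::function<int(int)> entryPoint;   // what callers actually invoke

    explicit Method(std::function<int(int)> body)
        : code(std::move(body)), entryPoint(code) {}

    // The counting stub captures `this`, so copying is disabled in this sketch.
    Method(const Method&) = delete;
    Method& operator=(const Method&) = delete;

    // Route calls through a counting stub that bumps the counter, then forwards.
    void EnableCounting() {
        entryPoint = [this](int x) { ++callsSeen; return code(x); };
    }

    // Point the entry point straight at the code again; the extra hop is gone.
    void DisableCounting() { entryPoint = code; }
};
```

Callers always invoke `m.entryPoint(arg)`; whether the call is counted depends only on what the slot currently points at.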

Reaching the call threshold doesn't remove the call counting stub, it persists until the tier1 code is published

The most expensive parts of reaching the call count threshold are:

  1. Promoting - Checking for a tier 1 code version, adding a tier 1 code version, and recording the method
  2. Looking up the active code version and setting the entry point to stop counting
  3. Overhead of calling the helper function - This can be improved if necessary

#1 and #2 are now done in the background; otherwise they were showing up in the spike when methods reach the call count threshold. Doing or avoiding #2 is a tradeoff between some background work and some foreground work. I wasn't particularly trying to change the way it works currently and preferred to avoid the extra foreground work.
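
The split between foreground and background work described above can be sketched as follows. This is an illustrative model under stated assumptions, not the code in this PR: `CountedMethod`, `CallCountingModel`, `OnCall` and `CompletePending` are hypothetical names, the counter is decremented without synchronization as if by one thread, and "promotion" is reduced to setting two flags. The foreground path only records that the threshold was reached; promotion (#1) and resetting the entry point (#2) happen later for the whole batch.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical per-method state for the sketch.
struct CountedMethod {
    int remainingCalls;            // counts down from the call count threshold
    bool viaCountingStub = true;   // true while calls still go through the stub
    bool promoted = false;         // true once a higher tier was requested
    explicit CountedMethod(int threshold) : remainingCalls(threshold) {}
};

class CallCountingModel {
    std::mutex lock_;
    std::vector<CountedMethod*> pending_;  // methods that reached the threshold
public:
    // Foreground: what the stub does on each call. Reaching zero only
    // records the method; no promotion work happens on this path.
    void OnCall(CountedMethod& m) {
        if (!m.viaCountingStub || m.remainingCalls <= 0) return;
        if (--m.remainingCalls == 0) {
            std::lock_guard<std::mutex> guard(lock_);
            pending_.push_back(&m);
        }
    }

    // Background: promote (#1) and stop counting (#2) for everything that
    // completed since the last pass. Returns how many methods were handled.
    std::size_t CompletePending() {
        std::vector<CountedMethod*> batch;
        {
            std::lock_guard<std::mutex> guard(lock_);
            batch.swap(pending_);
        }
        for (CountedMethod* m : batch) {
            m->promoted = true;
            m->viaCountingStub = false;
        }
        return batch.size();
    }
};
```

Because methods tend to reach the threshold in bursts, one background pass in a model like this typically handles many methods at once.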

Call counting stubs are never deleted

Will talk about this separately

Call counting is integrated into a pre-existing stub (Precode?) rather than being a unique stub

Separate stubs allow counting only tiered methods and at specific times. Changing an existing stub like a precode would increase the size unnecessarily.

Call counts are maintained at a more granular level such as per ILCodeVersion or per-MethodDef

At the moment it doesn't make much difference. If there were to be a tier 0.5 in the future then it would probably want to be counted separately from tier 0, and due to the unlocked counting it's not easy to deterministically reset the remaining call count. The stub could be per-MethodDesc instead but that would entail making it larger and slower such that it's patchable.
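
As a rough illustration of counting at a finer granularity than the method, the sketch below keys the remaining count by a (method, version) pair, so that a hypothetical intermediate tier would start from a fresh count without having to reset the old one. `PerVersionCallCounts`, `MethodId` and `VersionId` are invented for this example and do not correspond to runtime types.

```cpp
#include <cassert>
#include <map>
#include <utility>

using MethodId = int;    // stand-in for a method identity
using VersionId = int;   // stand-in for a native code version identity

class PerVersionCallCounts {
    int threshold_;
    std::map<std::pair<MethodId, VersionId>, int> remaining_;
public:
    explicit PerVersionCallCounts(int threshold) : threshold_(threshold) {}

    // Returns true exactly once per (method, version): on the call that
    // reaches the threshold. A version seen for the first time starts with
    // a full count, independent of other versions of the same method.
    bool OnCall(MethodId method, VersionId version) {
        auto it = remaining_.emplace(std::make_pair(method, version), threshold_).first;
        if (it->second == 0) return false;   // already completed (or threshold is 0)
        return --it->second == 0;
    }
};
```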

Call counting completion is done fine-grained synchronously rather than coarse-grained asynchronously.

I did fine-grained synchronously first and, after seeing the large spikes, changed it to coarse-grained asynchronous. It's not very fine-grained though; methods typically reach the call count threshold in bursts.

Member Author

Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)

Forgot to add, I'm not familiar with Andy's JIT prototypes for OSR, but it may make sense to count calls in the jitted code for that. There is direct compensation in that jitting methods containing loops at tier 0 would decrease startup time, and since it's a loop it likely will be called at least a few times anyway and may be more worth optimizing.

Member

Thanks @kouvel! My goal was to capture your existing thinking on the topic while it is fresh and that has now been done : )


### Thread-safety ###
-CodeVersionManager is designed for use in a free-threaded environment, in many cases by requiring the caller to acquire a lock before calling. This lock can be acquired by constructing an instance of the
-
-```
-CodeVersionManager::TableLockHolder(CodeVersionManager*)
-```
+CodeVersionManager is designed for use in a free-threaded environment, in many cases by requiring the caller to acquire a lock before calling. This lock can be acquired by constructing an instance of `CodeVersionManager::LockHolder`.

Member

Ideally the commit description wouldn't need to be so large and all the text that describes how the feature works is added to appropriate places in the docs, either for code versioning, for tiered compilation, or a brand new doc specific to call counting.

Member Author

I'll fold some summary-level comments from the description into the code

in some scope for the CodeVersionManager being operated on. CodeVersionManagers from different domains should not have their locks taken by the same thread with one exception, it is OK to take the shared domain manager lock and one AppDomain manager lock in that order. The lock is required to change the shape of the tree or traverse it but not to read/write configuration properties from each node. A few special cases:
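
The change from `TableLockHolder(CodeVersionManager*)` to a parameterless `LockHolder`, together with the static `IsLockOwnedByCurrentThread()` that appears in the code diff, suggests a lock that is no longer tied to one manager instance. The sketch below illustrates that pattern only and is an assumption about the shape, not the runtime's code: `CodeVersioningLockModel` is a hypothetical class, a `std::mutex` stands in for the Crst, and the lock is not recursive.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class CodeVersioningLockModel {
    // Function-local statics keep the sketch header-only.
    static std::mutex& Lock() {
        static std::mutex m;
        return m;
    }
    static std::atomic<std::thread::id>& Owner() {
        static std::atomic<std::thread::id> owner{std::thread::id()};
        return owner;
    }
public:
    // RAII holder: the lock is held for the lifetime of the object.
    class LockHolder {
    public:
        LockHolder() {
            Lock().lock();
            Owner().store(std::this_thread::get_id());
        }
        ~LockHolder() {
            Owner().store(std::thread::id());  // "no owner" before unlocking
            Lock().unlock();
        }
        LockHolder(const LockHolder&) = delete;
        LockHolder& operator=(const LockHolder&) = delete;
    };

    // Lets callees assert that their caller already took the lock.
    static bool IsLockOwnedByCurrentThread() {
        return Owner().load() == std::this_thread::get_id();
    }
};
```

A callee that requires the lock can then assert `CodeVersioningLockModel::IsLockOwnedByCurrentThread()` instead of taking a manager pointer just to check ownership.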

8 changes: 4 additions & 4 deletions src/coreclr/src/debug/daccess/request.cpp
@@ -4281,7 +4281,7 @@ HRESULT ClrDataAccess::GetPendingReJITID(CLRDATA_ADDRESS methodDesc, int *pRejit
PTR_MethodDesc pMD = PTR_MethodDesc(TO_TADDR(methodDesc));

CodeVersionManager* pCodeVersionManager = pMD->GetCodeVersionManager();
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;
ILCodeVersion ilVersion = pCodeVersionManager->GetActiveILCodeVersion(pMD);
if (ilVersion.IsNull())
{
Expand Down Expand Up @@ -4313,7 +4313,7 @@ HRESULT ClrDataAccess::GetReJITInformation(CLRDATA_ADDRESS methodDesc, int rejit
PTR_MethodDesc pMD = PTR_MethodDesc(TO_TADDR(methodDesc));

CodeVersionManager* pCodeVersionManager = pMD->GetCodeVersionManager();
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;
ILCodeVersion ilVersion = pCodeVersionManager->GetILCodeVersion(pMD, rejitId);
if (ilVersion.IsNull())
{
@@ -4365,7 +4365,7 @@ HRESULT ClrDataAccess::GetProfilerModifiedILInformation(CLRDATA_ADDRESS methodDe
PTR_MethodDesc pMD = PTR_MethodDesc(TO_TADDR(methodDesc));

CodeVersionManager* pCodeVersionManager = pMD->GetCodeVersionManager();
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;
ILCodeVersion ilVersion = pCodeVersionManager->GetActiveILCodeVersion(pMD);
if (ilVersion.GetRejitState() != ILCodeVersion::kStateActive || !ilVersion.HasDefaultIL())
{
@@ -4398,7 +4398,7 @@ HRESULT ClrDataAccess::GetMethodsWithProfilerModifiedIL(CLRDATA_ADDRESS mod, CLR

PTR_Module pModule = PTR_Module(TO_TADDR(mod));
CodeVersionManager* pCodeVersionManager = pModule->GetCodeVersionManager();
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;

LookupMap<PTR_MethodTable>::Iterator typeIter(&pModule->m_TypeDefToMethodTableMap);
for (int i = 0; typeIter.Next(); i++)
2 changes: 1 addition & 1 deletion src/coreclr/src/debug/ee/debugger.cpp
@@ -3634,7 +3634,7 @@ HRESULT Debugger::SetIP( bool fCanSetIPOnly, Thread *thread,Module *module,

CodeVersionManager *pCodeVersionManager = module->GetCodeVersionManager();
{
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;
ILCodeVersion ilCodeVersion = pCodeVersionManager->GetActiveILCodeVersion(module, mdMeth);
if (!ilCodeVersion.IsDefaultVersion())
{
8 changes: 4 additions & 4 deletions src/coreclr/src/debug/ee/functioninfo.cpp
@@ -933,7 +933,7 @@ void DebuggerJitInfo::LazyInitBounds()
LOG((LF_CORDB,LL_EVERYTHING, "DJI::LazyInitBounds: this=0x%x GetBoundariesAndVars success=0x%x\n", this, fSuccess));

// SetBoundaries uses the CodeVersionManager, need to take it now for lock ordering reasons
-CodeVersionManager::TableLockHolder lockHolder(mdesc->GetCodeVersionManager());
+CodeVersionManager::LockHolder codeVersioningLockHolder;
Debugger::DebuggerDataLockHolder debuggerDataLockHolder(g_pDebugger);

if (!m_fAttemptInit)
@@ -1059,7 +1059,7 @@ void DebuggerJitInfo::SetBoundaries(ULONG32 cMap, ICorDebugInfo::OffsetMapping *
// Pick a unique initial value (-10) so that the 1st doesn't accidentally match.
int ilPrevOld = -10;

-_ASSERTE(m_nativeCodeVersion.GetMethodDesc()->GetCodeVersionManager()->LockOwnedByCurrentThread());
+_ASSERTE(CodeVersionManager::IsLockOwnedByCurrentThread());

InstrumentedILOffsetMapping mapping;

@@ -1606,8 +1606,8 @@ DebuggerJitInfo *DebuggerMethodInfo::FindOrCreateInitAndAddJitInfo(MethodDesc* f
NativeCodeVersion nativeCodeVersion;
if (fd->IsVersionable())
{
-CodeVersionManager::TableLockHolder lockHolder(fd->GetCodeVersionManager());
CodeVersionManager *pCodeVersionManager = fd->GetCodeVersionManager();
+CodeVersionManager::LockHolder codeVersioningLockHolder;
nativeCodeVersion = pCodeVersionManager->GetNativeCodeVersion(fd, startAddr);
if (nativeCodeVersion.IsNull())
{
@@ -2087,7 +2087,7 @@ void DebuggerMethodInfo::CreateDJIsForMethodDesc(MethodDesc * pMethodDesc)
CodeVersionManager* pCodeVersionManager = pMethodDesc->GetCodeVersionManager();
// grab the code version lock to iterate available versions of the code
{
-CodeVersionManager::TableLockHolder lock(pCodeVersionManager);
+CodeVersionManager::LockHolder codeVersioningLockHolder;
NativeCodeVersionCollection nativeCodeVersions = pCodeVersionManager->GetNativeCodeVersions(pMethodDesc);

for (NativeCodeVersionIterator itr = nativeCodeVersions.Begin(), end = nativeCodeVersions.End(); itr != end; itr++)
9 changes: 5 additions & 4 deletions src/coreclr/src/inc/CrstTypes.def
@@ -280,7 +280,7 @@ Crst NativeImageCache
End

Crst GCCover
-AcquiredBefore LoaderHeap ReJITDomainTable
+AcquiredBefore LoaderHeap CodeVersioning
End

Crst GCMemoryPressure
@@ -486,7 +486,7 @@ Crst Reflection
End

// Used to synchronize all rejit information stored in a given AppDomain.
-Crst ReJITDomainTable
+Crst CodeVersioning
AcquiredBefore LoaderHeap SingleUseLock DeadlockDetection JumpStubCache DebuggerController FuncPtrStubs
AcquiredAfter ReJITGlobalRequest ThreadStore GlobalStrLiteralMap SystemDomain DebuggerMutex MethodDescBackpatchInfoTracker
End
@@ -495,7 +495,7 @@ End
// new functions to rejit tables, or request Reverts on existing functions in the rejit
// tables. One of these crsts exist per runtime.
Crst ReJITGlobalRequest
-AcquiredBefore ThreadStore ReJITDomainTable SystemDomain JitInlineTrackingMap
+AcquiredBefore ThreadStore CodeVersioning SystemDomain JitInlineTrackingMap
End

// ETW infrastructure uses this crst to protect a hash table of TypeHandles which is
@@ -679,7 +679,7 @@ Crst InlineTrackingMap
End

Crst JitInlineTrackingMap
-AcquiredBefore ReJITDomainTable ThreadStore LoaderAllocator
+AcquiredBefore CodeVersioning ThreadStore LoaderAllocator
End

Crst EventPipe
@@ -695,6 +695,7 @@ Crst ReadyToRunEntryPointToMethodDescMap
End

Crst TieredCompilation
+AcquiredAfter CodeVersioning
AcquiredBefore ThreadpoolTimerQueue
End
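
The `AcquiredBefore` / `AcquiredAfter` lines above declare a lock order: `CodeVersioning` may be held while `TieredCompilation` is taken, and `TieredCompilation` may be held while `ThreadpoolTimerQueue` is taken. The sketch below shows how such an order could be checked for one thread. It is a simplification for illustration: `LockLevel` and `LockOrderChecker` are hypothetical, only three of the locks are modeled, and the partial order from the `.def` file is flattened into a single ranking.

```cpp
#include <cassert>
#include <vector>

// Lower value = must be acquired earlier. Only three locks are modeled.
enum class LockLevel {
    CodeVersioning = 0,
    TieredCompilation = 1,
    ThreadpoolTimerQueue = 2
};

// Tracks the locks held by one thread and rejects out-of-order acquisitions.
class LockOrderChecker {
    std::vector<LockLevel> held_;
public:
    // Returns false, and records nothing, if taking `level` while holding
    // the most recently taken lock would violate the declared order.
    bool TryAcquire(LockLevel level) {
        if (!held_.empty() &&
            static_cast<int>(level) <= static_cast<int>(held_.back())) {
            return false;
        }
        held_.push_back(level);
        return true;
    }

    // Releases the most recently acquired lock.
    void Release() {
        assert(!held_.empty());
        held_.pop_back();
    }
};
```

In this model, taking `CodeVersioning` and then `TieredCompilation` succeeds, while the reverse order is rejected.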

7 changes: 7 additions & 0 deletions src/coreclr/src/inc/clrconfigvalues.h
@@ -633,10 +633,17 @@ RETAIL_CONFIG_DWORD_INFO(INTERNAL_HillClimbing_GainExponent,
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_TieredCompilation, W("TieredCompilation"), 1, "Enables tiered compilation")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_TC_QuickJit, W("TC_QuickJit"), 1, "For methods that would be jitted, enable using quick JIT when appropriate.")
RETAIL_CONFIG_DWORD_INFO(UNSUPPORTED_TC_QuickJitForLoops, W("TC_QuickJitForLoops"), 0, "When quick JIT is enabled, quick JIT may also be used for methods that contain loops.")
+RETAIL_CONFIG_DWORD_INFO(EXTERNAL_TC_AggressiveTiering, W("TC_AggressiveTiering"), 0, "Transition through tiers aggressively.")
RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_CallCountThreshold, W("TC_CallCountThreshold"), 30, "Number of times a method must be called in tier 0 after which it is promoted to the next tier.")
RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_CallCountingDelayMs, W("TC_CallCountingDelayMs"), 100, "A perpetual delay in milliseconds that is applied call counting in tier 0 and jitting at higher tiers, while there is startup-like activity.")
RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_DelaySingleProcMultiplier, W("TC_DelaySingleProcMultiplier"), 10, "Multiplier for TC_CallCountingDelayMs that is applied on a single-processor machine or when the process is affinitized to a single processor.")
RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_CallCounting, W("TC_CallCounting"), 1, "Enabled by default (only activates when TieredCompilation is also enabled). If disabled immediately backpatches prestub, and likely prevents any promotion to higher tiers")
+RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_UseCallCountingStubs, W("TC_UseCallCountingStubs"), 1, "Uses call counting stubs for faster call counting.")
+#ifdef _DEBUG
+RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_DeleteCallCountingStubsAfter, W("TC_DeleteCallCountingStubsAfter"), 1, "Deletes call counting stubs after this many have completed. Zero to disable deleting.")
+#else
+RETAIL_CONFIG_DWORD_INFO(INTERNAL_TC_DeleteCallCountingStubsAfter, W("TC_DeleteCallCountingStubsAfter"), 4096, "Deletes call counting stubs after this many have completed. Zero to disable deleting.")
+#endif
#endif

///
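
The description of `TC_DeleteCallCountingStubsAfter` ("Deletes call counting stubs after this many have completed. Zero to disable deleting.") can be read as a batching policy. The sketch below is one possible interpretation and is an assumption, since the deletion logic itself is not shown in this part of the diff: `StubDeletionPolicy` and its members are hypothetical names, and "deleting" is reduced to moving a count.

```cpp
#include <cassert>
#include <cstddef>

class StubDeletionPolicy {
    std::size_t deleteAfter_;     // configured threshold; 0 disables deletion
    std::size_t completed_ = 0;   // completed stubs not yet deleted
    std::size_t deleted_ = 0;     // total stubs deleted so far
public:
    explicit StubDeletionPolicy(std::size_t deleteAfter)
        : deleteAfter_(deleteAfter) {}

    // Called when a stub finishes counting. Returns true when this
    // completion reaches the threshold and triggers a bulk delete.
    bool OnStubCompleted() {
        ++completed_;
        if (deleteAfter_ == 0 || completed_ < deleteAfter_) return false;
        deleted_ += completed_;   // delete the whole batch at once
        completed_ = 0;
        return true;
    }

    std::size_t CompletedNotDeleted() const { return completed_; }
    std::size_t TotalDeleted() const { return deleted_; }
};
```

Under this reading, the debug default of 1 would exercise the deletion path on every completion, while the release default of 4096 would amortize it over large batches.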