Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve call counting mechanism #1457

Merged
merged 11 commits into from
Jan 28, 2020
Merged

Improve call counting mechanism #1457

merged 11 commits into from
Jan 28, 2020

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Jan 8, 2020

  • Call counting through the prestub is fairly expensive and can be seen immediately after call counting begins
  • Added call counting stubs. When starting call counting for a method:
    • A CallCountingInfo is created and initializes a remaining call count with a threshold
    • A CallCountingStub is created. It contains a small amount of code that decrements the remaining call count and checks for zero. When nonzero, it jumps to the code version's native code entry point. When zero, it forwards to a helper function that handles tier promotion.
    • When the call count threshold is reached, the helper call enqueues completion of call counting for background processing
    • When completing call counting, the code version is enqueued for promotion, and the call counting stub is removed from the call chain
    • Once all work queued for promotion is completed and methods transitioned to optimized tier, call counting stubs are deleted based on some heuristics and under runtime suspension
  • The CallCountingManager is the main class with most of the logic. Its private subclasses are just simple data structures.
  • Call counting is done at a NativeCodeVersion level (stub association is with the code version)
  • The code versioning lock is used for data structures used for call counting. Since installing a call counting stub requires that we know what the currently active code version is, it made sense to use the same lock.
  • Call counting stubs have hardcoded code. x64 has short and long stubs, short stubs are used when possible (often) and use IP-relative branches to the method's code and helper stub. Other archs have only one type of stub (a short stub).
  • For tiered methods that don't have a precode (virtual and interface methods), a forwarder stub (a precode) is created and it forwards to the call counting stub. This is so that the call counting stub can be safely and easily deleted. The forwarder stubs are only used when counting calls, there is one per method (not per code version), and they are not deleted. See CallCountingManager::SetCodeEntryPoint() for more info.
  • The OnCallCountThresholdReachedStub() takes a "stub-identifying token". The helper call gets a stub address from it, and tells whether it's a short or long stub. From the stub, the remaining call count pointer is used to get the CallCountingInfo, and from it gets the NativeCodeVersion associated with the stub.
  • The CallCountingStubManager traces through a call counting stub so that VS-like debuggers can step into a method through the call counting stub
  • Call counting is no longer stopped when the tiering delay is reactivated (often happens in larger and more realistic scenarios). The overhead necessary to stop and restart call counting (among other things, many methods will have to go through the prestub again) is greater than the overhead of completing call counting + calling the threshold-reached helper function, even for very high call count thresholds. While it may be (somewhat) desirable to not count method calls during startup phases, there would be a fair bit of additional overhead to stop counting. On the other hand, it may be beneficial in the future to rejit some methods during startup, there is indication on a few tests that that would be beneficial. So I think it's reasonable that for now, only newly called methods during the current tiering delay would not be counted, any that already started counting will continue (their delay already expired).
  • Exceptions (OOM)
    • On foreground threads, exceptions are propagated unless they can be handled without any compromise
    • On background threads, exceptions are caught and logged as before. Tried to limit scope of exception to one per method or code version such that if there is a loop over many, an exception would not abort the entire loop, rather just one iteration of the loop.
  • Fixed a latent race where a method is recorded for call counting and then the method's code entry point is set to tier 0 code
    • With that order, the tiering delay may expire and the method's entry point may be updated for call counting in the background before the code entry point is set by the recording thread, and that last action would disable call counting for the method and cause it to not be optimized. The only thing protecting from this happening was the delay itself, but a configured shorter delay increases the possibility of this happening.
    • Inverted the order such that the method's code entry point is set before recording it for call counting, both on first and subsequent calls
    • Changed the tiered compilation lock to be an any-GC-mode lock so that it can be taken inside the code versioning lock, as some things were more naturally placed inside the code versioning lock where we know the active code version, like checking for the tiering delay to delay call counting and promoting the code version when the call count threshold is reached
      • Unfortunately, that makes code inside the lock a GC-no-trigger scope and things like scheduling a timer or queuing a work item to the thread pool could not be done inside that scope. This tradeoff seems to be better than alternatives, so refactored those pieces to occur outside that scope.
  • Publishing an entry point after changing the active code version now takes call counting into account, fixes https://github.com/dotnet/coreclr/issues/22426
  • Added a config variable TC_AggressiveTiering that may be used in some benchmarks to tier-up more quickly. For now, it disables the tiering delay and sets the call count threshold to 1 (some calls are not counted, so it may take more to actually trigger a rejit).
  • After the changes:
    • Call counting overhead is much smaller and is not many orders of magnitude greater than a method call
    • Some config modes for tuning tiering are now much more reasonable and do not affect perf negatively nearly as much as before - increasing call count threshold, disabling or decreasing the tiering delay. Enables dynamic thresholds in the future, which was not feasible due to the call counting overhead.
    • No change to startup or steady-state perf (by design, and verified on several tests; if anything startup perf is very slightly better than before on some tests, otherwise even)
  • Left for later
    • Eventing work to report call counting stub code ranges and method name (also needs to be done for other stubs)
    • Some tests that consume events to verify run-time behavior in a few config modes
    • Debugger test to verify debugging while call-counting. Debugger tests also need to be fixed for tiering.
    • The call count threshold has not been changed for now. As we don't have many tests that measure the performance in-between startup and steady-state, some will need to be created maybe from existing tests, to determine the effects.
  • Fixes https://github.com/dotnet/coreclr/issues/23596

@kouvel kouvel added this to the 5.0 milestone Jan 8, 2020
@kouvel kouvel self-assigned this Jan 8, 2020
@kouvel
Copy link
Member Author

kouvel commented Jan 9, 2020

To show the difference clearly, I wrote a test case that calls 10000 methods once each per iteration, for a number of iterations, on one and several threads, and sleeping for 100 ms after each iteration to get timings aligned. The methods are empty, so this measures only the call overhead, and background work is mostly not represented. Here are results on x64 and arm64, with default settings.

image

image

image

image

After the change, perf with call counting is much closer to without. There is one spike (after the startup spike) and that is when all of the methods complete call counting at roughly the same time. In benchmarks and more generally scenarios that repeatedly do roughly the same thing, especially with a low call count threshold, a bunch of methods will reach the threshold at about the same time (perhaps multiple times), and the spike is due to the overhead in calling out to the helper function.

Here's how it looks for the first 100+ requests to MusicStore, with default settings (call count 30), one with call count 100, and one with call count 30 and no tiering delay:

image

The spikes are a combination of new methods being called (new methods get called for a while in this test case, it stabilizes only after about 500 requests), and methods reaching the call count threshold in bursts. The overhead from call counting is still clearly visible.

image

With more iterations (and at a higher call count threshold), in the baseline there are separate periods of call counting due to the tiering delay (new methods being called). After the change, other than the isolated spikes of somewhat lower magnitude, perf is fairly stable compared to the baseline despite the same tiering delay behavior.

image

When the tiering delay is disabled, the call counting overhead is much more clear. It's not seen in the graph but the startup time after the change is much closer to default settings, similarly on other test cases such as CscRoslynSource, and both test cases perform decently very quickly following startup with some but not much startup time loss.

@kouvel kouvel requested a review from noahfalk January 9, 2020 08:39
@kouvel
Copy link
Member Author

kouvel commented Jan 14, 2020

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a big step for tiered compilation : ) I've spent a good bit of time trying to understand it and it seems reasonable to me, though I get a little nervous thinking we are probably growing portions of the runtime that are complex enough that very few people are going to feel comfortable changing them. I don't think its burned us yet, but I feel like we should be wary of letting the complexity go much further.

In terms of bugs nothing jumped out at me, but if there was anything subtle its unlikely I would have spotted it. I also don't historically have a lot of experience with creating stubs or new helper frames so @davidwr or @jkotas might be more effective in spotting issues there.

On the perf front I had a few questions:

  1. How this changes our memory usage relative to before?
  2. How this changes the startup perf of an app running on a single core?

Thanks Kount!

```
CodeVersionManager::TableLockHolder(CodeVersionManager*)
```
CodeVersionManager is designed for use in a free-threaded environment, in many cases by requiring the caller to acquire a lock before calling. This lock can be acquired by constructing an instance of `CodeVersionManager::LockHolder`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the commit description wouldn't need to be so large and all the text that describes how the feature works is added to appropriate places in the docs, either for code versioning, for tiered compilation, or a brand new doc specific to call counting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fold some summary-level comments from the description into the code

@@ -330,11 +330,7 @@ to update the active child at either of those levels (ReJIT uses SetActiveILCode
In order to do step 3 the `CodeVersionManager` relies on one of three different mechanisms, a `FixupPrecode`, a `JumpStamp`, or backpatching entry point slots. In [method.hpp](https://github.com/dotnet/coreclr/blob/master/src/vm/method.hpp) these mechanisms are described in the `MethodDesc::IsVersionableWith*()` functions, and all methods have been classified to use at most one of the techniques, based on the `MethodDesc::IsVersionableWith*()` functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments did a great job sketching out the design you ended up with, but I think the rationale as to why you arrived at this design as opposed to something different could be equally illuminating. Typically before making a change of this scale there would be some discussion about options or a write up in a doc so I'm not sure if I just missed that? At this point I am not expecting that we'd make large deviations from the approach in this PR unless we found a serious issue given how much effort I assume you have invested in it and the perf gains. However it would still be useful to understand among the other design options what has already been eliminated via thought experiment or performance testing and what is still interesting to experiment with in the future.

Some design alternatives that come to mind:

  1. Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)
  2. Reaching the call threshold doesn't remove the call counting stub, it persists until the tier1 code is published
  3. Call counting stubs are never deleted
  4. Call counting is integrated into a pre-existing stub (Precode?) rather being a unique stub
  5. Call counts are maintained at a more granular level such as per ILCodeVersion or per-MethodDef
  6. Call counting completion is done fine-grained synchronously rather than course grain asynchronously.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. There hasn't been a doc, most of the discussions happened in-person.

Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)

  • It would be difficult to suggest that there would not be a startup time loss from placing counters in tier 0 code, it seems likely that there would be a regression
    • It's evident that there is a noticeable startup time loss from just the additional precodes with tiering, though it's a different type of perf hit. Counting in the method code probably would have less overhead than doing so in a stub, by avoiding an extra hop.
    • The cumulative additional JIT time, memory overhead, and size overhead from including counters in prejitted code would also have to be considered, along with the (relatively less expensive but not free) cumulative time it would take to register methods that reach the threshold for promotion during startup
    • There is not much direct compensation. The overhead of precodes and other overhead from tiering is typically compensated for by the faster tier 0 JIT, and although it may also compensate for the additional overhead from call counting, the compensation is only valid for "no tiering vs tiering" comparisons, the overhead would still exist in "tiering vs tiering" comparisons. The only direct compensation would be from code optimized during startup running faster than the tier 0 / prejitted code, and the degree to which it would improve startup time can vary greatly between scenarios.
    • If there is a startup regression there may not be a good solution. It would be possible to avoid counting during startup with counting in tier 0 code as well, by for example checking something before counting or by patching the code, but it may not eliminate the problem.
  • The types of code that run during startup are often different from the types that runs at steady-state
    • It's possible that startup code also contains pieces of hot code that would benefit from being rejitted. As mentioned above, how much it would help it could vary heavily between sceanrios, and I haven't seen a clear indication so far that counting at startup would help startup time, and since the rejits take some time to happen (with other overhead), the method would have to be rejitted soon enough to compensate (the method would have to be called many times, much more than the call count threshold, to benefit from a rejit). On the other hand, the effect of the tiering delay can be seen by comparison as increased time to reach steady state or time taken to perform a large operation spanning several seconds. Of course we may choose to not rejit during startup anyway but in that case I don't see any benefit in counting calls during startup.
    • Many methods would reach the threshold during startup, it's something like ~2000 methods in MusicStore and ~5000 methods in CscRoslyn. Some of them may be rejitted anyway later and some may not, on CscRoslyn there's not much difference in steady-state rejit counts when the tiering delay is disabled, but on MusicStore an additional ~3000 methods are rejitted.
    • Using call counting stubs allows the flexibility to choose when to start call counting for a method based on other information/heuristics and with minimal overhead that only affects those methods, if there happens to be a need to rejit something during startup
  • Stubs offer a flexible design, other than above, it can allow disabling and enabling counting at a code version level at any time
  • I'll talk about the memory overhead separately

Reaching the call threshold doesn't remove the call counting stub, it persists until the tier1 code is published

The most expensive parts of reaching the call count threshold are:

  1. Promoting - Checking for a tier 1 code version, adding a tier 1 code version, and recording the method
  2. Looking up the active code version and setting the entry point to stop counting
  3. Overhead of calling the helper function - This can be improved if necessary

#1 and #2 are now done in the background, otherwise they were showing up in the spike when methods reach the call count threshold. Doing or avoiding #2 is a tradeoff between some background work and some foreground work, I wasn't particularly trying to change the way it works currently and favored to avoid the extra foreground work.

Call counting stubs are never deleted

Will talk about this separately

Call counting is integrated into a pre-existing stub (Precode?) rather being a unique stub

Separate stubs allow counting only tiered methods and at specific times. Changing an existing stub like a precode would increase the size unnecessarily.

Call counts are maintained at a more granular level such as per ILCodeVersion or per-MethodDef

At the moment it doesn't make much difference. If there were to be a tier 0.5 in the future then it would probably want to be counted separately from tier 0, and due to the unlocked counting it's not easy to deterministically reset the remaining call count. The stub could be per-MethodDesc instead but that would entail making it larger and slower such that it's patchable.

Call counting completion is done fine-grained synchronously rather than course grain asynchronously.

I did fine-grained synchronously first and after seeing the large spikes, changed it to coarse-grained asynchronous. It's not very fine-grained though, methods typically reach the call count threshold in bursts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call counter is part of tier0 jitted code, no stub is used (Andy's JIT prototypes for OSR likely go that direction)

Forgot to add, I'm not familiar with Andy's JIT prototypes for OSR, but it may make sense to count calls in the jitted code for that. There is direct compensation in that jitting methods containing loops at tier 0 would decrease startup time, and since it's a loop it likely will be called at least a few times anyway and may be more worth optimizing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kouvel! My goal was to capture your existing thinking on the topic while it is fresh and that has now been done : )

_ASSERTE(!codeVersion.IsNull());
_ASSERTE(callCountThreshold != 0);
_ASSERTE(IsCallCountingEnabled());
_ASSERTE(GetCallCountThreshold() == callCountThreshold);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is storing the callCountThreshold in m_state needed? The only place I noticed it being read back is in this assert here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment, will remove

PCODE callCountingCodeEntryPoint = callCountingStub->GetEntryPoint();
if (methodDesc->MayHaveEntryPointSlotsToBackpatch())
{
// The call counting stub should not be the entry point that is called first in the process of a call
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if I am messing up the analysis but it isn't clear to me that this is saving memory vs. putting a pointer to the call counter stub in the slot and then never deleting it.
On X64 the short stub is 24 bytes. A MethodDescForwarderStub is 16 bytes and the precode it points to is another 16 bytes. Assuming a capacity factor of 2 on the MethodDescForwarderStub hashtable that gives a total of 2*16+16 = 48 bytes. Being a little leaner on the capacity factor would make them closer, but we are still better off not doing it. On ARM the call counting stub is a bit larger so it looks like it could be profitable there.

Separately any reason not to use mechanisms built-in to MethodDesc to allocate Precodes for entrypoints or FuncPtrStubs if we need to do it dynamically? Is there a reason neither of those work and we need to create a 3rd way to assign precodes? I believe both of those mechanisms also have storage schemes that are more compact than this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forwarder stubs are only created for methods that don't already have a precode and currently that's only virtual and interface methods.

Actually after call counting is complete, the forwarder stub can be removed from the hash table since the method typically won't be counted anymore (only if a profiler creates a new IL code version, then it may create another forwarder stub). Also realized that the hash table doesn't need to contain the MethodDesc since the forwarder stub already does, so I'll go ahead and make those changes.

Ignoring hash table storage for the moment, I agree it's not too profitable for virtual/interface methods on x64 (saves only 8 bytes). It's more profitable for non-virtual/interface methods on x64 and x86, and for all methods on arm64 and arm32.

Stubs, CallCountingInfos, and hash table storage will be there for methods that haven't yet reached the threshold, and until the 4 K completed count is reached (trigger for deleting). I'll post numbers on memory separately.

Regarding your second question, the temporary entry points that are reused for precodes for regular methods (these have a more compact storage scheme), for virtual/interface methods they need to always point to the prestub in order to discover new vtable slots over the same MethodDesc. I thought about using FuncPtrStubs (they are just a precode as well, not any better optimized for storage), and I didn't find a good way of mixing their current use with my intended use. It might be something that can be optimized later with more careful thought.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about using FuncPtrStubs and I didn't find a good way of mixing their current use with my intended use

Could you add some comments in the code describing what the mismatch is? It wasn't immediately obvious to me why it wasn't a viable scenario and someone else seeing the same code later may wonder the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are just a precode as well, not any better optimized for storage

The FuncPtr dictionary only stores a Precode*, not a MethodDesc+Precode* pair as this code was doing. I think you noticed the same optimization so once you do that then storage efficiency would match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, makes sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a comment. It many not be that beneficial to merge them anyway, as for virtual/interface methods FuncPtrStubs are only created for devirtualized cases or cases where a single entry point is needed for multi-call purposes to a virtual/interface method.

// - Stubs should be deletable. Many methods will have call counting stubs associated with them, and although the memory
// involved is typically insignificant compared to the average memory overhead per method, by steady-state it would
// otherwise be unnecessary memory overhead serving no purpose.
// - In order to be able to delete a stub, the jitted code of a method cannot be allowed to load the stub as the entry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it didn't feel like a foregone conclusion that we have to delete the stubs. It does appear there may be some memory savings to be had, but on the other side of the scale is presumably extra indirections/instructions/cache lines on each method call, periodic runtime suspensions to delete the stubs, and then extra calls to the Prestub to restore them. I don't know how the perf numbers shake out, but this might be one of the spots where its worth checking what happens if we don't bother with deletion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had discussed this with Sergiy, I'll post some numbers separately. Most of the noticeable overhead from deleting stubs seems to be in restoring some of them through the prestub, that's the main reason for the 4 K minimum completed stubs before deleting, and perhaps it could be a lazier heuristic than that, anyway I believe that's the main risk factor. It seems to be working decently for now.


#ifndef DACCESS_COMPILE

void CallCountingManager::DisableCallCounting(NativeCodeVersion codeVersion)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need to do this any longer? The initial premise of disabling call counting was that it was expensive. If it is now cheap can we simplify our solution by allowing call counting to occur always and only delay when jitting happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used as storage for the default code version to indicate that it won't be tiered and is optimized. For example, a method containing a loop would be optimized by the JIT and call counting would be disabled for the default code version, then call counting and other tiering activities would not be attempted on it, also for event reporting purposes to get the correct tier for the default code version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have put this comment elsewhere to be clearer. What I was really angling for is that for methods that we know we will need to call count eventually, do we need a mechanism that allows the call counter to be deferred and applied later. I suspected that it might be simpler and/or more performant to install the call counter immediately at startup and I was curious what your thoughts were on it.

break;
}

// Fully completing call counting for a method is relative expensive. Call counting with stubs is relatively cheap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it well understood what makes it so expensive? I think of it as adding an entry to the pending jit list, walking a list of entrypoints updating each of them with a new pointer, and taking a couple of locks. Am I forgetting part or underestimating how much part of that costs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I described the expensive parts in an answer to one of your questions above, see following "The most expensive parts of reaching the call count threshold are"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kouvel. I'll probably need to look at some call trees and numbers to ground my understanding further, but I don't want to spend any more of your time than I already have on this. If it becomes relevant again in the future we can always look at it then.

@noahfalk
Copy link
Member

cc @brianrob - If you hadn't already seen it, this seems like a perf change to be aware of

////////////////////////////////////////////////////////////////
// X64

#if defined(_TARGET_AMD64_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to have these platform specific details in platform specific files (putting them into the existing platform specific files e.g. cgencpu.h is fine)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callcounting.h has too many dependencies to include into something like cgencpu.h. I can put the arch-specific definitions in cgen<arch>.cpp files. They may not be inlined on Unixes, but it might be ok. Alternatively I can separate the definitions into arch-specific and call counting-specific .h files that are included into the callcounting.cpp. I'm leaning towards the latter, let me know if the former would be preferable instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be:

  • Factor the platform-specific stubs so that they have minimal dependencies on the rest of the system. ie the higher level platform neutral call counting stuff should depend on the low level stuff. Not the other way around. Once you do that, there should not be a problem with having it in cgencpu.h.
  • Avoid tiny files. E.g. I would merge the existing callcounting*.asm files that just contain a single method into one of the larger asm files.

@davidwrighton
Copy link
Member

Could you explain the logic around DeleteAllCallCountingStubs. Why do we do this, under what situations, what data drove doing that, etc, what stress modes need to be written to ensure that deleting stubs won't put us in a bad state, etc. For instance, I don't have complete confidence that we have tracking for all cases where a virtual function pointer is captured somewhere, and we could potentially be in a situation where we end up deleting code underneath ourselves. How do I force this to happen for a set of methods and test that the deleting didn't cause a problem.

@@ -439,17 +439,17 @@ class LoaderHeap : public UnlockedLoaderHeap, public ILoaderHeapBackout
LoaderHeap(DWORD dwReserveBlockSize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoaderHeap [](start = 4, length = 10)

Why add a new parameter here instead of making the caller just allocate an UnlocekdLoaderHeap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnlockedLoaderHeap requires the heap to be created by the user, when m_fExplicitControl == true it doesn't reserve its own pages (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/utilcode/loaderheap.cpp#L1093-L1096). Also LoaderHeap offers a backout scheme that I'm using to backout allocation of a short stub if it can't be used, it calls into the protected UnlockedBackoutMem but is only supported through LoaderHeap, which has tracker APIs for convenient backout.

@noahfalk
Copy link
Member

For instance, I don't have complete confidence that we have tracking for all cases where a virtual function pointer is captured somewhere

Kount probably has a more complete description, but my understanding reading the change is that the pointer to the call counting stub is never given out as an entrypoint. The stub is always reached by first indirecting through a Precode and the Precode is never deleted.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You and @brianrob probably want to agree if any additional performance regression testing is necessary. My suggestion is to confirm memory usage is acceptable and that single-threaded startup performance is OK. Other than that and a few of the little tweaks in the comments, feature looks good to me 👍 Nice work : )

@kouvel kouvel added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 20, 2020
@kouvel
Copy link
Member Author

kouvel commented Jan 23, 2020

@noahfalk: How this changes our memory usage relative to before?

The latest commit removed the Completed stage and instead deletes the call counting infos, freeing that memory sooner and bringing it a bit closer to getting a fair comparison between deleting and not deleting stubs.

When not deleting stubs, I also made a local change to not create forwarder stubs. This is not included in the change, since I didn't intend changes to the config var to change too much of what happens underneath.

Here's some data on memory usage before the change, and after the change with and without deleting stubs where applicable. Committed memory was averaged over 8 runs after removing upper outliers and following a forced GC.

Call 10,000 empty nonvirtual methods, repeat 50 times in total with 100 ms gaps in-between

  Before No delete Diff Delete Diff
Committed (KB) 21128 21337 209 21119 -10

~210 KB higher when not deleting stubs, no change when deleting stubs. Stubs are deleted once.

Stub counts:

  No delete Delete
Count total 10382 10418
Completed total 10052 10052
Count at end 10382 87
Completed at end 10052 4
Active at end 330 83
Forwarders 0 36

CscRoslynSource, repeat 16 times in the same process

  Before No delete Diff Delete Diff
Committed (KB) 68503 69169 666 68561 58

~660 KB higher when not deleting stubs, ~60 KB higher when deleting stubs. Stubs are deleted once.

Stub counts:

  No delete Delete
Count total 17279 19882
Completed total 12380 13493
Count at end 17279 2654
Completed at end 12380 862
Active at end 4902 1792
Forwarders 0 5643

The slight memory overhead when deleting stubs probably would linger, as it would likely not reach the threshold for deleting stubs again for a while at least. The heuristics for deleting can be improved later if desired by using a timer for coalescing and decreasing the threshold.

Msix-catalog shortly followiing startup.

This is a bit interesting because the point when the app starts is also the point when call counting begins, so not much gets a chance to complete call counting by then and stubs are not deleted immediately.

  Before After Diff
Committed (KB) 81922 82587 664

~660 KB higher after the change (with or without deleting stubs).

Stub counts:

  After
Count total 13081
Completed total 3
Forwarders 2845

Using the app slightly causes most of the stubs to be deleted fairly quickly. After that, the total stub count ranges between 2000 and 6000. Improving the deletion heuristics as described above might keep the memory overhead lower if desired. However, the committed memory was not stable after using the app a bit, so I didn't include numbers for that. It would probably be measurable with many samples and some analysis, scaling with the total stub count, but it seems to fall within error range.


@davidwrighton: Could you explain the logic around DeleteAllCallCountingStubs
Why do we do this
what data drove doing that

Added some data above. It was done to make a reasonable attempt to keep memory utilization similar to before and not leak the usually temporary memory from call counting stubs, as the memory overhead from stubs can be noticeable. In a larger app with ~100,000 methods being called, the overhead would reach into the several-MB range. The main and intended tradeoff is that methods that have not completed counting but are still being called (just less frequently) will have to go through the prestub again to reinstall the call counter.

under what situations

Every time the background JIT queue empties, if the count of call counting stubs that have completed is >= 4 K, call counting stubs are deleted. It occurs fairly sparsely and only once if at all in the larger benchmarks that I have run.

what stress modes need to be written to ensure that deleting stubs won't put us in a bad state

I think the larger perf tests like CscRoslynSource might be a good candidate for stressing, running multiple iterations in a process, and multiple processes over for period of time, with thresholds decreased to delete stubs more frequently and to transition tiers more quickly. I'll kick that off for a sanity test.

For instance, I don't have complete confidence that we have tracking for all cases where a virtual function pointer is captured somewhere, and we could potentially be in a situation where we end up deleting code underneath ourselves. How do I force this to happen for a set of methods and test that the deleting didn't cause a problem.

The main issues I was concerned with were:

  1. The vtable slot value may be fetched already in a GC-safe point (possible on arm64/arm32, may be possible on x64/x86 depending on code gen)
  • This is prevented by using forwarder stubs (precodes) for vtable methods. The vtable slot never points to a call counting stub and instead points to a forwarder stub. Since the runtime cannot be suspended inside a forwarder stub or a call counting stub, suspension in the caller or callee in any place would allow deleting stubs.
  1. The processor may have cached precode targets
  • This is prevented by issuing the necessary barriers after targets are updated and before resuming the runtime. Threads in preemptive GC mode use a lock for synchronizing the relevant data.

For any case that requires a multi-callable entry point to a vtable method, a FuncPtrStub is used for tiering purposes. For single-callable cases, the call goes through the vtable slot.


@noahfalk: How this changes the startup perf of an app running on a single core?

MusicStore affinitized to 1 processor

Similar to when not affinitized except call counting happens a bit later (delay is increased) and with much larger spikes. The y-axis is on a log10 scale.

image

Although the graph only shows two samples, there is no change to startup perf in this mode over multiple runs.

The call counting overhead is decreased, but otherwise it's not much better or worse than before. There are a couple of improvements I have in mind for these types of cases for the future.


Some normal startup numbers

  Before (ms) After (ms) Diff %
MusicStore 963.250 964.250 0.1%
CscRoslyn 1675.733 1674.090 -0.1%
Msix-catalog 1076.580 1076.011 -0.1%

@noahfalk: You and @brianrob probably want to agree if any additional performance regression testing is necessary.

@brianrob, thoughts?


@noahfalk if you haven't already, I would appreciate if you could review the last 3 commits.

Hopefully I have answered all of the questions above. Please let me know if I missed anything. Thanks!

@kouvel kouvel removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 23, 2020
kouvel added 10 commits January 23, 2020 02:27
- Call counting through the prestub is fairly expensive and can be seen immediately after call counting begins
- Added call counting stubs. When starting call counting for a method:
  - A `CallCountingInfo` is created and initializes a remaining call count with a threshold
  - A `CallCountingStub` is created. It contains a small amount of code that decrements the remaining call count and checks for zero. When nonzero, it jumps to the code version's native code entry point. When zero, it forwards to a helper function that handles tier promotion.
  - When the call count threshold is reached, the helper call enqueues completion of call counting for background processing
  - When completing call counting, the code version is enqueued for promotion, and the call counting stub is removed from the call chain
  - Once all work queued for promotion is completed and methods transitioned to optimized tier, call counting stubs are deleted based on some heuristics and under runtime suspension
- The `CallCountingManager` is the main class with most of the logic. Its private subclasses are just simple data structures.
- Call counting is done at a `NativeCodeVersion` level (stub association is with the code version)
- The code versioning lock is used for data structures used for call counting. Since installing a call counting stub requires that we know what the currently active code version is, it made sense to use the same lock.
- Call counting stubs have hardcoded code. x64 has short and long stubs, short stubs are used when possible (often) and use IP-relative branches to the method's code and helper stub. Other platforms have only one type of stub (a short stub).
- For tiered methods that don't have a precode (virtual and interface methods), a forwarder stub (a precode) is created and it forwards to the call counting stub. This is so that the call counting stub can be safely and easily deleted. The forwarder stubs are only used when counting calls, there is one per method (not per code version), and they are not deleted. See `CallCountingManager::SetCodeEntryPoint()` for more info.
- The `OnCallCountThresholdReachedStub()` takes a "stub-identifying token". The helper call gets a stub address from it, and tells whether it's a short or long stub. From the stub, the remaining call count pointer is used to get the `CallCountingInfo`, and from it gets the `NativeCodeVersion` associated with the stub.
- The `CallCountingStubManager` traces through a call counting stub so that VS-like debuggers can step into a method through the call counting stub
- Exceptions (OOM)
  - On foreground threads, exceptions are propagated unless they can be handled without any compromise
  - On background threads, exceptions are caught and logged as before. Tried to limit scope of exception to one per method or code version such that a loop over many would not all be aborted by one exception.
- Fixed a latent race where a method is recorded for call counting and then the method's code entry point is set to tier 0 code
  - With that order, the tiering delay may expire and the method's entry point may be updated for call counting in the background before the code entry point is set by the recording thread, and that last action would disable call counting for the method and cause it to not be optimized. The only thing protecting from this happening was the delay itself, but a configured shorter delay increases the possibility of this happening.
  - Inverted the order such that the method's code entry point is set before recording it for call counting, both on first and subsequent calls
  - Changed the tiered compilation lock to be an any-GC-mode lock so that it can be taken inside the code versioning lock, as some things were more naturally placed inside the code versioning lock where we know the active code version, like checking for the tiering delay to delay call counting and promoting the code version when the call count threshold is reached
    - Unfortunately, that makes code inside the lock a GC-no-trigger scope and things like scheduling a timer or queuing a work item to the thread pool could not be done inside that scope. This tradeoff seems to be better than alternatives, so refactored those pieces to occur outside that scope.
- Publishing an entry point after changing the active code version now takes call counting into account, fixes https://github.com/dotnet/coreclr/issues/22426
- After the changes:
  - Call counting overhead is much smaller and is not many orders of magnitude greater than a method call
  - Some config modes for tuning tiering are now much more reasonable and do not affect perf negatively nearly as much as before - increasing call count threshold, disabling or decreasing the tiering delay. Enables dynamic thresholds in the future, which is not feasible due to the overhead currently.
  - No change to startup or steady-state perf
- Left for later
  - Eventing work to report call counting stub code ranges and method name (also needs to be done for other stubs)
  - Some tests that consume events to verify run-time behavior in a few config modes
  - Debugger test to verify debugging while call-counting. Debugger tests also need to be fixed for tiering.
  - The call count threshold has not been changed for now. As we don't have many tests that measure the performance in-between startup and steady-state, some will need to be created maybe from existing tests, to determine the effects
- Fixes https://github.com/dotnet/coreclr/issues/23596
@kouvel
Copy link
Member Author

kouvel commented Jan 23, 2020

Rebased to fix conflict

@kouvel
Copy link
Member Author

kouvel commented Jan 24, 2020

I ran CscRoslynSource overnight with aggressive tiering and a delete-after threshold of completed call counting stubs of 1 (after JIT queue empties, if there is at least one completed stub, stubs will be deleted), in checked and release builds on x64 and arm32. The test was run 16 times per process in a loop. In that mode deletes occur several times and most of the created stubs are deleted several iterations before the end of the test run in a process. Didn't find any issues.

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@noahfalk
Copy link
Member

@noahfalk if you haven't already, I would appreciate if you could review the last 3 commits.
Hopefully I have answered all of the questions above. Please let me know if I missed anything. Thanks!

👍 Yep looks good to me. Thanks Kount!

@kouvel kouvel merged commit 3a457cb into dotnet:master Jan 28, 2020
@kouvel
Copy link
Member Author

kouvel commented Jan 28, 2020

Thanks all!

@kouvel kouvel deleted the CallCounting branch January 28, 2020 22:19
kouvel added a commit to kouvel/runtime that referenced this pull request Jan 28, 2020
stephentoub pushed a commit that referenced this pull request Jan 29, 2020
This reverts commit 3a2dc0f.

Fixed by #1457. Closes #129.
@jkotas
Copy link
Member

jkotas commented Feb 1, 2020

We are seeing a lot more instability in the CI, and everything is pointing to this change.

#29934 has details.

I am sorry, but I will have to revert this to get CI health on track.

@jkotas
Copy link
Member

jkotas commented Feb 1, 2020

Hmm, maybe it is not this change, but #2380 . I will check...

jkotas added a commit that referenced this pull request Feb 1, 2020
* Revert "Revert "Disable test based on #129 (#130)" (#2310)"

This reverts commit 427cd91.

* Revert "Fail FuncEval if slot backpatching lock is held by any thread (#2380)"

This reverts commit fc06054.

* Revert "Improve call counting mechanism (#1457)"

This reverts commit 3a457cb.
@jkotas
Copy link
Member

jkotas commented Feb 1, 2020

Reverted in #30105 .

kouvel added a commit to kouvel/runtime that referenced this pull request Mar 2, 2020
- Commit 1
  - Reverts commit f954c6b, which reverted PR dotnet#1457 due to issues
- Commit 2
  - Fixes crashes and assertion failures seen by the original change, fixes dotnet#29934
  - The crashes were caused by commit dotnet@6aa3c70 in the original PR
  - Call counting infos cannot be deleted when the corresponding call counting stubs may still run, because:
    - The remaining call count decremented by the stub is in the call counting info
    - The only way to get a code version / method desc from a stub is to go through the call counting info
  - Got one repro of the assertion failure in dotnet#22786 and it is most likely caused by the same issue, following heap corruption from modifying a deleted call counting info where the memory is reused for a `NativeCodeVersionNode`, messing up the method desc pointer
  - Fixed with a partial revert of the above commit. Added back the `Complete` stage and then call counting infos are deleted only after it's ensured that call counting stubs won't be used (shortly before deleting them).
- Commit 3
  - Public static functions of `CallCountingManager` that may be called through the debugger may occur before static initialization, added a check for null as suggested in dotnet#29892
kouvel added a commit that referenced this pull request Mar 3, 2020
* Improve call counting mechanism

- Commit 1
  - Reverts commit f954c6b, which reverted PR #1457 due to issues
- Commit 2
  - Fixes crashes and assertion failures seen by the original change, fixes #29934
  - The crashes were caused by commit 6aa3c70 in the original PR
  - Call counting infos cannot be deleted when the corresponding call counting stubs may still run, because:
    - The remaining call count decremented by the stub is in the call counting info
    - The only way to get a code version / method desc from a stub is to go through the call counting info
  - Got one repro of the assertion failure in #22786 and it is most likely caused by the same issue, following heap corruption from modifying a deleted call counting info where the memory is reused for a `NativeCodeVersionNode`, messing up the method desc pointer
  - Fixed with a partial revert of the above commit. Added back the `Complete` stage and then call counting infos are deleted only after it's ensured that call counting stubs won't be used (shortly before deleting them).
- Commit 3
  - Public static functions of `CallCountingManager` that may be called through the debugger may occur before static initialization, added a check for null as suggested in #29892

* Fix crashes and assertion failures seen by the original change

* Add check for null for some functions callable from the debugger
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants