@jkotas @davidwrighton - are you guys the best reviewers for this stuff or is there someone else I should be asking? Thanks!
For the overall approach: @dotnet/jit-contrib For the integration with the rest of the VM: @kouvel @janvorli @gkhanna79
clr.coreclr.props
Outdated
@@ -13,6 +13,7 @@
<FeatureDbiOopDebugging_HostOneCorex86 Condition="'$(TargetArch)' == 'i386' or '$(TargetArch)' == 'arm'">true</FeatureDbiOopDebugging_HostOneCorex86>
<FeatureDbiOopDebugging_HostOneCoreamd64 Condition="'$(TargetArch)' == 'amd64'">true</FeatureDbiOopDebugging_HostOneCoreamd64>
<FeatureEventTrace>true</FeatureEventTrace>
<FeatureFitJit>true</FeatureFitJit>
Can we call it something more self-describing, like FEATURE_TIERED_JIT?
Sure, that was a holdover from some internal naming
Apologies if this is a stupid question, but why not interpreted first, then non-optimised, followed by optimised? There's already an Interpreter available, or is it not considered suitable for production code? How different is the overhead between non-optimised and optimised JITting?
It's a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn't the easiest place to start.
On my machine non-optimized jitting used about 65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.
@noahfalk thanks for the response, I'd not even considered profiling/debugging, that's useful to know.
Interesting, so there's some decent saving to be made, that's cool
src/vm/appdomain.hpp
Outdated
#if defined(FEATURE_FITJIT)

public:
    TieredCompilationManager & GetTieredCompilationManager()
In coreclr runtime, pointers are used instead of references in most places. I would prefer returning pointer here and from the GetCallCounter below. In AppDomain::Init(), which is the only caller of this method, you need a pointer anyways.
Sure thing
src/vm/callcounter.cpp
Outdated
}
CONTRACTL_END;

SpinLockHolder holder(&m_lock);
Is the spinlock really needed here? It looks like just making the m_pTieredCompilationManager VolatilePtr and using its Store / Load methods for accessing it would be sufficient.
src/vm/method.hpp
Outdated
// pointer need to be very careful about if and when they cache it
// if it is not stable.
//
// The stability of the native code pointer is seperate from the
A nit: seperate -> separate
src/vm/methodtablebuilder.cpp
Outdated
#ifdef FEATURE_FITJIT
// Keep in-sync with MethodDesc::IsEligibleForTieredCompilation()
if (g_pConfig->TieredCompilation() &&
    !GetModule()->HasNativeOrReadyToRunImage() &&
A nit - the formatting here is somehow strange, you have tabs here instead of spaces.
src/vm/tieredcompilation.cpp
Outdated
// and complicating the code to narrow an already rare error case isn't desirable.
{
    SpinLockHolder holder(&m_lock);
    SListElem<MethodDesc*>* pMethodListItem = new (nothrow) SListElem<MethodDesc*>(pMethodDesc);
It would be better to move the allocation out of the spinlock to minimize the amount of work done inside of it.
Actually, it seems to me that you really need to use the spinlock just for the m_methodsToOptimize list access and you don't need to use it for the m_countOptimizationThreadsRunning, m_isAppDomainShuttingDown and m_domainId access.
You can use Volatile<...> for m_isAppDomainShuttingDown and m_domainId access and Interlocked operations for incrementing and decrementing the m_countOptimizationThreadsRunning. Please correct me if I am wrong, but it doesn't look like the check for m_isAppDomainShuttingDown and m_countOptimizationThreadsRunning increment / decrement needs to be a single atomic operation.
Agreed on moving the allocation.
You are correct, there is no requirement of atomicity between the various field updates. However I'm not sure that changing to lockless volatile access for the other fields would be an improvement? Unless this lock proved to be a performance hotspot I think we are better off optimizing the code for simplicity.
Ok, let's leave the spinlock usage as it is. I guess the hottest path is the OnMethodCalled function and it needs the spinlock anyways for syncing access to the m_methodsToOptimize list.
If we see that the lock is a perf issue here in the future, it seems we could even get rid of the lock completely by using a simple lockfree list (push one / pop all style that is trivial to make lockfree).
Thanks @janvorli ! If I don't hear anything further I'll squash and commit tomorrow (technically later today now)
One reason for minopts is to do as little as possible in case there is a bug in non-minopts, e.g. if we hit a noway_assert, or tell customers to try using minopts to avoid hitting a bug in the field. So we really don't want it doing inlining, e.g.
I'm not saying we should get rid of minopts or change what it does. I'm saying that the initial jit attempt should not be minopts, but something new that we don't have a flag for today, e.g. fastopts. As an initial cut fastopts can be mapped by the jit onto minopts. Over time fastopts should diverge from minopts and enable some optimization. And if fastopts hits an issue the jit or user can always fall back to minopts.
@AndyAyersMS / @BruceForstall, I think you're touching on a larger question of what's the right set of optimization levels/flags, which is something we've been meaning to address; I've just opened #10560 for discussion about that.
@AndyAyersMS / @JosephTremoulet There is certainly some exploration required in order to arrive at the right opt/speed trade-offs. I agree we shouldn't hard code to MinOpts to imply Tier 0 in the JIT and this does overlap with your opt levels Joe, however I don't think it's clear even now how many tiers we might need. We said 3 might be the right thing to shoot for out of the gate. Probably best to start with some notion of an actual level counter and then virtualize it behind the JIT interface to imply the set of on/off and limits per optimization.
Have profiling scenarios been considered for this change? Specifically, I mean a profiler which uses JitCompilationStarted to rewrite IL. If IL code is interpreted at first and jitted only later on, then code already runs before the profiler gets a chance to instrument it.
@AndyAyersMS @cmckinsey @JosephTremoulet @BruceForstall - I think we are all in agreement about the desirability of a jit mode which obtains the best set of perf tradeoffs for tier 0. My above mention of min-opt jit was only to the extent that it is the best pre-existing approximation. Thanks for raising the clarification. How about this as a proposal:
@cmckinsey - I hesitate to add a level counter 'right out of the gate' because we don't yet have machinery to track the progression of a method through multiple levels. Adding a counter at this point would be a placeholder only. The JIT would only be called with two of the levels.
@discostu105 - Yep! As much as possible we want diagnostic tools to continue to work with the tiered jitting support we are building. I'm looking to do it in a way that keeps those tools working as-is, or with relatively minor updates, but given the low-level interactions profilers and debuggers have with the runtime it's hard to keep significant runtime changes 100% abstracted. For instance I think we'll need to reveal that there are additional method jittings which didn't occur before, but we can preserve semantics that if you update IL when you get the first JitCompilationStarted notification then that modification will correctly apply to every form of the code that gets eventually run. We should continue to evaluate the impact as some additional work comes online that aims to make this change work more smoothly with the profiler. If there are further opportunities to mitigate compat issues by making runtime changes I'm glad to discuss it.
There is no short term plan to add such an interpreter, and one of the considerations in that was an expectation that it would cause trouble for diagnostic tools.
works for me.
Tiered compilation is a new feature we are experimenting with that aims to improve startup times. Initially we jit methods non-optimized, then switch to an optimized version once the method has been called a number of times. More details about the current feature operation are in the comments of TieredCompilation.cpp.

This is only the first step in a longer process building the feature. The primary goal for now is to avoid regressing any runtime behavior in the shipping configuration in which the complus variable is OFF, while putting enough code in place that we can measure performance in the daily builds and make incremental progress visible to collaborators and reviewers. The design of the TieredCompilationManager is likely to change substantively, and the call counter may also change.
This is fantastic. It's going to be a big leap in the long run for hot code performance and startup time.
This means that optimizations currently only slow compilation down by a factor of 100/65 = 1.5x. If we JIT only hot code then the time spent on optimization can be increased greatly. I don't see why 5x slower compilation would be a problem if that is done on the top 5% of methods only. This would increase the proportion of those 5% to only 25%, which is still covered by the gains of compiling cold code faster.
Note that this feature is only enabling slow or fast JIT; a 'no JIT' (interpreted) option isn't currently possible because the .NET interpreter isn't considered production ready, see #10478 (comment)
@mattwarren thanks for letting me know. A fast JIT should be similar in consequences to an interpreter I think. So that seems very good still. At the very least this should remove the (correct) reluctance of the team to add expensive optimizations. |