
Tiered Compilation step 1 #10478

Merged
merged 1 commit into dotnet:master on Mar 30, 2017
Conversation

noahfalk
Member

Tiered compilation is a new feature we are experimenting with that aims to improve startup times. Initially we jit methods non-optimized, then switch to an optimized version once the method has been called a number of times. More details about the current feature operation are in the comments of TieredCompilation.cpp.

This is only the first step in a longer process of building the feature. The primary goal for now is to avoid regressing any runtime behavior in the shipping configuration, in which the COMPlus variable is OFF, while putting enough code in place that we can measure performance in the daily builds and make incremental progress visible to collaborators and reviewers. The design of the TieredCompilationManager is likely to change substantively, and the call counter may also change.
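As a rough illustration of the mechanism (a sketch only - the threshold and helper names here are made up, not this PR's actual code, which lives in TieredCompilation.cpp):

```cpp
#include <atomic>

struct MethodState
{
    std::atomic<int>   callCount{0};
    std::atomic<void*> codeEntry{nullptr};  // current native entry point
};

constexpr int kPromotionThreshold = 30;     // hypothetical value

void* OnMethodCalled(MethodState& m)
{
    // Count calls; on crossing the threshold, queue a background
    // optimized rejit that later publishes the new entry point, e.g.
    //   m.codeEntry.store(pOptimizedCode, std::memory_order_release);
    if (m.callCount.fetch_add(1, std::memory_order_relaxed) + 1 == kPromotionThreshold)
    {
        // QueueForOptimizedRejit(m);  // hypothetical helper
    }
    return m.codeEntry.load(std::memory_order_acquire);
}
```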

@noahfalk
Member Author

@jkotas @davidwrighton - are you guys the best reviewers for this stuff or is there someone else I should be asking? Thanks!

@jkotas
Member

jkotas commented Mar 26, 2017

are you guys the best reviewers for this stuff or is there someone else I should be asking?

For the overall approach: @dotnet/jit-contrib

For the integration with the rest of the VM: @kouvel @janvorli @gkhanna79

@@ -13,6 +13,7 @@
<FeatureDbiOopDebugging_HostOneCorex86 Condition="'$(TargetArch)' == 'i386' or '$(TargetArch)' == 'arm'">true</FeatureDbiOopDebugging_HostOneCorex86>
<FeatureDbiOopDebugging_HostOneCoreamd64 Condition="'$(TargetArch)' == 'amd64'">true</FeatureDbiOopDebugging_HostOneCoreamd64>
<FeatureEventTrace>true</FeatureEventTrace>
<FeatureFitJit>true</FeatureFitJit>
Member

Can we call it something more self-describing, like FEATURE_TIERED_JIT?

Member Author

Sure, that was a holdover from some internal naming

@mattwarren

Initially we jit methods non-optimized, then switch to an optimized version once the method has been called a number of times.

Apologies if this is a stupid question, but why not interpreted first, then non-optimised, followed by optimised? There's already an Interpreter available, or is it not considered suitable for production code?

How different is the overhead between non-optimised and optimised JITting?

@noahfalk
Member Author

There's already an Interpreter available, or is it not considered suitable for production code?

It's a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn't the easiest place to start.

How different is the overhead between non-optimised and optimised JITting?

On my machine non-optimized jitting used roughly 65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

@mattwarren

mattwarren commented Mar 27, 2017

@noahfalk thanks for the response, I'd not even considered profiling/debugging, that's useful to know.

On my machine non-optimized jitting used roughly 65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

Interesting, so there's some decent saving to be made, that's cool

#if defined(FEATURE_FITJIT)

public:
TieredCompilationManager & GetTieredCompilationManager()
Member

In the CoreCLR runtime, pointers are used instead of references in most places. I would prefer returning a pointer here and from GetCallCounter below. In AppDomain::Init(), which is the only caller of this method, you need a pointer anyway.

Member Author

Sure thing

}
CONTRACTL_END;

SpinLockHolder holder(&m_lock);
Member

Is the spinlock really needed here? It looks like just making the m_pTieredCompilationManager VolatilePtr and using its Store / Load methods for accessing it would be sufficient.
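Roughly the lock-free shape being suggested, with std::atomic standing in for the runtime's Volatile<T> Load/Store wrapper (illustrative, not the PR's code):

```cpp
#include <atomic>

class TieredCompilationManager;

static std::atomic<TieredCompilationManager*> s_pManager{nullptr};

TieredCompilationManager* GetTieredCompilationManager()
{
    return s_pManager.load(std::memory_order_acquire);  // no spinlock needed
}

void SetTieredCompilationManager(TieredCompilationManager* p)
{
    s_pManager.store(p, std::memory_order_release);     // publish once
}
```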

// pointer need to be very careful about if and when they cache it
// if it is not stable.
//
// The stability of the native code pointer is seperate from the
Member

A nit: seperate -> separate

#ifdef FEATURE_FITJIT
// Keep in-sync with MethodDesc::IsEligibleForTieredCompilation()
if (g_pConfig->TieredCompilation() &&
!GetModule()->HasNativeOrReadyToRunImage() &&
Member

A nit: the formatting here is a bit odd - you have tabs here instead of spaces.

// and complicating the code to narrow an already rare error case isn't desirable.
{
SpinLockHolder holder(&m_lock);
SListElem<MethodDesc*>* pMethodListItem = new (nothrow) SListElem<MethodDesc*>(pMethodDesc);
Member

It would be better to move the allocation out of the spinlock to minimize the amount of work done inside of it.
Actually, it seems to me that you really need the spinlock just for the m_methodsToOptimize list access, and you don't need it for the m_countOptimizationThreadsRunning, m_isAppDomainShuttingDown and m_domainId access.
You can use Volatile<...> for the m_isAppDomainShuttingDown and m_domainId access and Interlocked operations for incrementing and decrementing m_countOptimizationThreadsRunning. Please correct me if I am wrong, but it doesn't look like the check of m_isAppDomainShuttingDown and the m_countOptimizationThreadsRunning increment / decrement need to be a single atomic operation.
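Concretely, the first part of the suggestion would look roughly like this (field names taken from the diff above; the list-insertion call is approximate):

```cpp
// Allocate before taking the lock so the spinlock only covers the insert.
SListElem<MethodDesc*>* pMethodListItem =
    new (nothrow) SListElem<MethodDesc*>(pMethodDesc);
if (pMethodListItem != nullptr)
{
    SpinLockHolder holder(&m_lock);
    m_methodsToOptimize.InsertTail(pMethodListItem);  // approximate call
}
```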

Member Author

Agreed on moving the allocation.

You are correct, there is no requirement of atomicity between the various field updates. However, I'm not sure that changing to lockless volatile access for the other fields would be an improvement. Unless this lock proves to be a performance hotspot, I think we are better off optimizing the code for simplicity.

Member

Ok, let's leave the spinlock usage as it is. I guess the hottest path is the OnMethodCalled function and it needs the spinlock anyway for syncing access to the m_methodsToOptimize list.
If we see that the lock is a perf issue here in the future, it seems we could even get rid of the lock completely by using a simple lock-free list (push one / pop all style that is trivial to make lock-free).
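For reference, a "push one / pop all" list is roughly this shape (an illustrative sketch, not code from this PR):

```cpp
#include <atomic>

template <typename T>
struct PushPopAllList
{
    struct Node { T value; Node* next; };
    std::atomic<Node*> m_head{nullptr};

    void Push(Node* n)
    {
        // Standard CAS loop: on failure, n->next is refreshed with the
        // current head and we retry.
        n->next = m_head.load(std::memory_order_relaxed);
        while (!m_head.compare_exchange_weak(n->next, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed))
        {
        }
    }

    Node* PopAll()
    {
        // The consumer detaches the whole list in one atomic exchange.
        return m_head.exchange(nullptr, std::memory_order_acquire);
    }
};
```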

@noahfalk
Member Author

Thanks @janvorli ! If I don't hear anything further I'll squash and commit tomorrow (technically later today now)

@AndyAyersMS
Member

I think minopts (as it currently exists) is a plausible starting place for the initial method jit, but it is something we will want to change fairly soon.

What we want in the initial jit attempt is to have the jit generate code as fast as possible, not to generate code with minimal optimizations. Those are not the same thing: some optimizations will actually make jitting faster. We haven't really explored this space very well and I don't have anything concrete to recommend here yet. It should be the case that some optimization more than pays for itself.

Second, minopts does no inlining whatsoever, and this will both cause larger than normal counter overhead and kick off jitting for methods that arguably never need to be jitted on their own (e.g. methods marked with aggressive inlining).

I have some data that shows inlining is one of the optimizations that may make jitting faster, at least for very simple inlinees. It is not an open-and-shut case, because those measurements were made with the rest of the jit running its normal optimization passes, and the measurements do not fully capture possible additional costs from class loading (which are tricky to account for since it's somewhat unfair to pin them on any particular inlining decision). Here's a plot of the data for the jit-time impact of individual inlines as a function of IL size. Vertical units are microseconds; lower/less than zero means the jit is faster if we inline than if we don't.
[Plot: jit-time impact of individual inlines vs. inlinee IL size]
This data shows jitting is faster when the jit inlines methods with IL sizes 0-4, and is a decent bet to be faster or as fast even up to methods as large as 10 IL bytes.

The current inlining policy is to always inline methods that are 16 bytes of IL or less. There is an alternative policy (the "size policy") that might be a good alternative for initial jitting, as it tries to minimize overall method size (it also honors aggressive inlines). For the jit, jit time is typically proportional to the size of the generated code.

All of this impacts policy and tradeoff -- enabling some optimization initially can make the initial jitting faster and make the initially jitted code run faster. So it might buy us more time to use that initially jitted code until we decide to rejit, at which point we can possibly be somewhat more aggressive.

So it would be nice, even now, to generalize the notion of "please jit fast" by passing in a new flag instead of reusing an old one. Initially the jit can map this to minopts, but in the future we can experiment with alternatives.

@BruceForstall
Member

So it would be nice, even now, to generalize the notion of "please jit fast" by passing in a new flag instead of reusing an old one.

One reason for minopts is to do as little as possible in case there is a bug in non-minopts, e.g. if we hit a noway_assert, or to be able to tell customers to try minopts to avoid hitting a bug in the field. So we really don't want it doing inlining, for example.

@AndyAyersMS
Member

I'm not saying we should get rid of minopts or change what it does.

I'm saying that the initial jit attempt should not be minopts, but something new that we don't have a flag for today, eg fastopts. As an initial cut fastopts can be mapped by the jit onto minopts.

Over time fastopts should diverge from minopts and enable some optimization. And if fastops hits an issue the jit or user can always fall back to minopts.

@JosephTremoulet

@AndyAyersMS / @BruceForstall, I think you're touching on a larger question of what's the right set of optimization levels/flags, which is something we've been meaning to address; I've just opened #10560 for discussion about that.

@cmckinsey

@AndyAyersMS / @JosephTremoulet There is certainly some exploration required in order to arrive at the right opt/speed trade-offs. I agree we shouldn't hard-code MinOpts to imply Tier 0 in the JIT, and this does overlap with your opt levels, Joe. However, I don't think it's clear even now how many tiers we might need. We said 3 might be the right thing to shoot for out of the gate. Probably best to start with some notion of an actual level counter and then virtualize it behind the JIT interface to imply the set of on/off switches and limits per optimization.

@discostu105

Have profiling scenarios been considered for this change? Specifically, I mean a profiler which uses the JitCompilationStarted callback to exchange IL code for instrumentation. We use this feature heavily in our product.

If IL code is interpreted at first, and jitted only later on, then code already runs before JitCompilationStarted is called. So an IL code modification is only possible "eventually".

@noahfalk
Member Author

@AndyAyersMS @cmckinsey @JosephTremoulet @BruceForstall - I think we are all in agreement about the desirability of a jit mode which obtains the best set of perf tradeoffs for tier 0. My above mention of min-opt jit was only to the extent that it is the best pre-existing approximation. Thanks for raising the clarification.

How about this as a proposal:

  1. I will make a small follow-on change very shortly that adds a new flag and changes the code here to use it. I will alias the flag to minopts because that is the closest configuration that currently exists.
  2. At some point when it is convenient, a new optimization policy can be developed for this flag, and it can be unaliased from min-opt.
  3. As we gain experience working on tiered compilation in general we can continue to collaborate on what additional configuration knobs are appropriate, be it a level number, tracing info, block counts, type test results, etc.

@cmckinsey - I hesitate to add a level counter 'right out of the gate' because we don't yet have machinery to track the progression of a method through multiple levels. Adding a counter at this point would be a placeholder only; the JIT would only be called with two of the levels.

@noahfalk
Member Author

Have profiling scenarios been considered for this change?

@discostu105 - Yep! As much as possible we want diagnostic tools to continue to work with the tiered jitting support we are building. I'm looking to do it in a way that keeps those tools working as-is, or with relatively minor updates, but given the low-level interactions profilers and debuggers have with the runtime, it's hard to keep significant runtime changes 100% abstracted. For instance, I think we'll need to reveal that there are additional method jittings which didn't occur before, but we can preserve the semantics that if you update IL when you get the first JitCompilationStarted notification, then that modification will correctly apply to every form of the code that eventually runs. We should continue to evaluate the impact as some additional work comes online that aims to make this change work more smoothly with the profiler. If there are further opportunities to mitigate compat issues by making runtime changes, I'm glad to discuss it.
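For context, the pattern being discussed is roughly this (MyProfiler, m_pProfilerInfo and RewriteIL are hypothetical names; JITCompilationStarted, GetFunctionInfo and SetILFunctionBody are the real ICorProfilerCallback / ICorProfilerInfo entry points):

```cpp
#include <corprof.h>

HRESULT MyProfiler::JITCompilationStarted(FunctionID functionId,
                                          BOOL fIsSafeToBlock)
{
    ClassID  classId;
    ModuleID moduleId;
    mdToken  methodToken;
    HRESULT hr = m_pProfilerInfo->GetFunctionInfo(functionId, &classId,
                                                  &moduleId, &methodToken);
    if (FAILED(hr))
        return hr;

    // Hand back instrumented IL before the jit runs; under tiering the
    // same body should be picked up again when the method is rejitted.
    LPCBYTE pNewILBody = RewriteIL(moduleId, methodToken);  // hypothetical
    if (pNewILBody != nullptr)
        hr = m_pProfilerInfo->SetILFunctionBody(moduleId, methodToken, pNewILBody);
    return hr;
}
```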

If IL code is interpreted at first, and jitted only later on, then code already runs before JitCompilationStarted is called. So an IL code modification is only possible "eventually".

There is no short-term plan to add such an interpreter, in part because of the additional work it would take to integrate it with the current set of profiling and diagnostic tools.

@JosephTremoulet

How about this as a proposal...

works for me.

@noahfalk noahfalk merged commit bf6a03a into dotnet:master Mar 30, 2017
@noahfalk noahfalk mentioned this pull request Mar 30, 2017
@GSPP

GSPP commented May 6, 2017

This is fantastic. It's going to be a big leap in the long run for hot code performance and startup time.

On my machine non-optimized jitting used roughly 65% of the time that optimized jitting took for similar IL input sizes

This means that optimizations currently only slow compilation down by a factor of 100/65 ≈ 1.5x. If we JIT only hot code, then the time spent on optimization can be increased greatly. I don't see why 5x slower compilation would be a problem if that is done on the top 5% of methods only. That would put those 5% at only 25% of the old jit-time baseline, which is still covered by the gains from compiling cold code faster.
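Spelling that estimate out (illustrative numbers only - it assumes uniform jit cost per method and builds on the ~65% figure quoted above):

```
baseline - all methods jitted optimized:            100
tiered   - 95% of methods, fast jit:  0.95 * 65   =  61.75
         -  5% of methods, 5x cost:   0.05 * 500  =  25.00
total                                             ~=  86.75
```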

@mattwarren

@GSPP

If we JIT only hot code, then the time spent on optimization can be increased greatly.

Note that this feature is only enabling slow or fast JIT; a 'no JIT' (interpreted) option isn't currently possible because the .NET interpreter isn't considered production-ready, see #10478 (comment)

@GSPP

GSPP commented May 8, 2017

@mattwarren thanks for letting me know. A fast JIT should be similar in its consequences to an interpreter, I think. So that still seems very good.

At the very least this should remove the (correct) reluctance of the team to add expensive optimizations.

@karelz karelz added this to the 2.0.0 milestone Aug 28, 2017