-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Force inline small BasicBlock methods #95484
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsSeveral small methods aren't inlined in very large methods (such as
|
cc @dotnet/jit-contrib. My intent was to limit this to roughly one-line helper methods, but I did see some non-inlined calls to If we think the TP diffs are worth it, I'll add some comments explaining the decision (and maybe a TODO to remove these eventually?). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could collect a per-method TP diff, it would be interesting to see where this makes a difference.
// "retless" BBJ_CALLFINALLY blocks due to a requirement to use the BBJ_ALWAYS for | ||
// generating code. | ||
// | ||
bool BasicBlock::isBBCallAlwaysPair() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving methods from a .cpp
file to a header is a takeback w.r.t. maintainability (the coding convention encourages placing methods in implementation files). Does it make a good enough difference to pay for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Locally on Windows x64, I observed an additional 0.01% or 0.02% TP improvement; technically a 25-33% improvement over not force-inlining these methods, but only because the TP improvements were small on Windows x64 to begin with. As for maintainability, I agree I'd rather not move code previously in implementation files into header files, but I'm curious where we draw the line for "very small inline member function implementations" in the coding convention -- just one-line methods? Small methods with no function calls to avoid size increases from inlining?
src/coreclr/jit/block.h
Outdated
@@ -538,7 +538,7 @@ struct BasicBlock : private LIR::Range | |||
return bbJumpKind; | |||
} | |||
|
|||
void SetJumpKind(BBjumpKinds jumpKind) | |||
__forceinline void SetJumpKind(BBjumpKinds jumpKind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__forceinline
is an MSVC-original construct. It works, but it would be more idiomatic to use the FORCEINLINE
macro.
I think like @jkotas has mentioned that it does not make sense to try to optimize inlining decisions in non PGO enabled builds since we expect PGO to be enabled and seriously influence these decisions. If we want to do that we should wait until we have native PGO enabled again and then do one-off TP collection in a PGO build with and without the change. FWIW, there's some MSVC switches like |
Good idea, I'll wait for #91297 to be merged in. I think this should happen soon, since #95510 was recently opened.
Thanks for mentioning that. I suspect our exceptionally large callers are exhausting the inlining budget pretty quickly, leaving out the easy inlining candidates. I'll take another look once we have PGO data to see if this is an issue. |
By the way, we have some existing uses of |
Are we currently enabling and using PGO on all platforms? I had thought we still had some where it was not and could not be trivially enabled?
We are still using I still think this is a bit backwards since you'd likely be more inclined to debug By using
However, we're missing out on:
Additionally, we aren't enabling other perf optimizations that might be beneficial (we do notably enable
|
I agree with this sentiment, but could switching to /O2 impact how we support users facing issues in production? In other words, are there situations where we only have a Release build we can debug? Perhaps such cases are rare enough (and already tricky to debug) that they're worth the performance boost. |
|
Do you know when we last did a comparison? If we haven't tried the new inlining heuristics, I'd be interested in trying them. |
That's surprising. I wouldn't expect that given as it looks like we're only really missing I think it'd be worth testing again and then testing with |
It has been a while.
/GL should be enabled by CMAKE_INTERPROCEDURAL_OPTIMIZATION. |
I tested O2 here #93336 and didn't detect any TP diffs (for jit) |
We absolutely should change MSVC compilation switches for Release to use -O2 instead of -Ox. #53849. Checked and Release should have the same compiler optimization switches. There are bugs for doing that for x86. I might have done it for non-x86 but testing Release is "hard" -- our CI systems don't do it, and you'd want to do some perf testing on it also. |
I disagree. /GL makes sense for Release, but it does not make sense for Checked. |
Why do you think this? The goal for Checked is to run as fast as possible but still have DEBUG checking (asserts, etc.) enabled. Is it because you want Checked builds to compile faster (and maybe be incremental?) for faster dev inner-loop? One optimization I admit would not make sense in Checked is native PGO, since we'll presumably never collect Checked PGO data. |
Yes. Also, the speed up from /GY for code with debug asserts and no PGO data is not going to be dramatic. |
Several small methods aren't inlined in very large methods (such as
Compiler::impImportBlockCode
) by MSVC, probably due to its inlining budget. Force-inlining some of these methods, starting with a fewBasicBlock
methods, may slightly improve TP (at least this was the case locally).