[mono][interp] Disable inlining into bblocks that are detected as dead early during codegen #97514
Conversation
If both types are immediately known and loaded as constants
During the initial bblock formation pass, we detect all bblocks that are targets of branches and increment their ref count. As we import IL code, we might eagerly optimize some conditional branches into unconditional branches (in which case we mark that the following bblock in IL order is no longer reachable from the current bblock), or we completely optimize out the branch (in which case we reduce the ref count of the target bblock). As we continue emitting code, we can detect whether the current bblock is dead (it is not a jump target and is either not linked to the previous bblock or linked to a dead bblock). This liveness detection is not exact, but it should handle typical code with if/else conditionals.
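To make the rule concrete, here is a small C# sketch of the bookkeeping described above (the interpreter itself implements this in C; the type and member names below are invented for illustration):

sealed class BasicBlock
{
    public int JumpTargets;     // how many branches still target this block (the ref count)
    public bool LinkedToPrev;   // reachable by falling through from the previous block in IL order
    public bool Dead;           // once set, no inlining happens into this block
}

static class DeadBlockSketch
{
    // Called while importing IL, when a conditional branch folds to a constant condition.
    static void OnFoldedConditionalBranch(BasicBlock fallThrough, BasicBlock target, bool alwaysTaken)
    {
        if (alwaysTaken)
            fallThrough.LinkedToPrev = false;  // branch became unconditional: the next block is no longer a fall-through successor
        else
            target.JumpTargets--;              // branch optimized out entirely: the target loses one incoming edge
    }

    // Called when we start emitting code for a block.
    static void OnStartBlock(BasicBlock bb, BasicBlock prev)
    {
        // Dead when nothing branches here and we cannot fall through from a live predecessor.
        // Conservative: loop back-edges are never folded, so loop bodies always stay live.
        bb.Dead = bb.JumpTargets == 0 && (!bb.LinkedToPrev || prev.Dead);
    }
}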
Tagging subscribers to this area: @BrzVlad, @kotlarmilos
if (td->cbb->no_inlining && long_op != MINT_CALL_HANDLER)
    target_bb->jump_targets--;
Just wondering, why not just always eliminate dead blocks early if possible?
Presumably even something like this is worth simplifying, particularly for larger block bodies:
if (someExpressionThatIsTriviallyConstantFalse)
{
    // logic that can never be hit
}
else
{
    // logic that will always be hit
}
I'd expect it reduces memory overhead, improves overall throughput due to fewer nodes to traverse, and may even make other optimizations easier to identify.
There are all sorts of correctness checks during code emit, for example on the stack state. So let's say you have a dead bblock that ends up falling through to a live bblock. If you just ignore all IL instructions in the dead bblock without additional handling, then, when you enter the live bblock, the IL stack state will not contain correct information and we will throw an invalid-code exception. I'm sure it is doable, but it might not be as trivial as it seems, making the gained benefit questionable.
👍, I'm not sure exactly what RyuJIT does for this scenario, but @EgorBo would likely know.
The general consideration comes in for code like this: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs,2007
Where, even completely ignoring inlining, there exists a large block of code that is always dead for Mono today and will always be dead for most non-xarch platforms.
This type of pattern is pretty common outside the BCL in perf critical code, and that is often some of the more complex/infrastructural code for frameworks/libraries.
Actually, what is worse in the code sample you provided is the existence of the do/while loop. So in addition to traversing all that code, we will still inline.
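For reference, a drastically simplified sketch of the kind of guarded block being discussed (illustrative only, not the actual SpanHelpers code): an IsHardwareAccelerated guard around a vectorized do/while loop, with a scalar fallback. On platforms where the guard folds to false, the entire vectorized block is dead, yet a naive importer still walks it and inlines into it.

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class IndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<byte> span, byte value)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            // Vectorized search, statically unreachable when SIMD is not supported.
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector128<byte> target = Vector128.Create(value);
            do
            {
                Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, (nuint)i);
                uint mask = Vector128.ExtractMostSignificantBits(Vector128.Equals(chunk, target));
                if (mask != 0)
                    return i + BitOperations.TrailingZeroCount(mask);
                i += Vector128<byte>.Count;
            } while (i <= span.Length - Vector128<byte>.Count);
        }

        // Scalar fallback handles the remaining tail (or the whole span).
        for (; i < span.Length; i++)
            if (span[i] == value)
                return i;
        return -1;
    }
}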
This PR addresses compilation overhead for methods containing patterns like:
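A minimal illustrative sketch of this kind of pattern (hypothetical method names, not an exact snippet from the BCL):

using System.Runtime.CompilerServices;

static class DispatchSketch
{
    // Each branch compares typeof(T) against a constant type, so once T is known
    // only one branch can ever be taken; the rest are dead, yet every MethodX call
    // is marked for aggressive inlining.
    public static void Dispatch<T>()
    {
        if (typeof(T) == typeof(byte))
            MethodByte();
        else if (typeof(T) == typeof(short))
            MethodShort();
        else if (typeof(T) == typeof(int))
            MethodInt();
        else
            MethodFallback();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodByte() { /* byte-specialized path, itself calling more inlined helpers */ }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodShort() { }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodInt() { }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodFallback() { }
}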
where MethodX are methods marked as aggressive inlining. This pattern is common in intrinsics-related code. Before this PR we would inline all such methods, together with all their subcalls, even if the code path is obviously unreachable, leading to significant pressure on the compiler.
We do inlining on the fly, in the IL code traversal stage. BBlock and constant folding optimizations are run afterwards, once the full IL code has been processed. In order to be able to disable inlining into these dead bblocks, we implement a subset of cfolding that is applied as soon as we import IL code, so we can detect dead bblocks before we actually generate their code. An alternative would be to move inlining to a later phase, after the standard full bblock/cfold optimization passes have run, but that is a more invasive approach.
This works by adding jump counters for each bblock, which are set before we actually start generating code for the IL. As we generate code, we optimize out some forward branches and decrement the jump counter of the target bblocks. When we then get to emit code in a bblock that is no longer a jump target for any other instruction, and the bblock is also not a fall-through successor of the previous bblock, it means it is dead and no inlining will happen into it. Also, when processing branches in such dead bblocks, we additionally decrement the jump counters of the target_bb. This isn't able to detect dead code involving loops, but it should cover most scenarios in the BCL.
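Concretely, the extra decrement for branches inside dead bblocks looks roughly like this (illustrative C# with made-up names; the actual change is the td->cbb->no_inlining check in the C diff shown earlier):

static class DeadBranchSketch
{
    sealed class Block
    {
        public int JumpTargets;   // remaining incoming branches (the jump counter)
        public bool Dead;         // detected as dead, so nothing is inlined into it
    }

    // When emitting a branch that lives inside an already-dead block, its target
    // loses one incoming edge as well (call-handler branches are excluded, mirroring
    // the MINT_CALL_HANDLER check), which lets dead-ness propagate to later blocks.
    static void OnBranchInBlock(Block current, Block target, bool isCallHandler)
    {
        if (current.Dead && !isCallHandler)
            target.JumpTargets--;
    }
}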
This makes System.Numerics.Tensors.Tests go from a run time of 300s with a memory peak of 6GB to 100s with a memory peak of 600MB.