Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono][interp] Disable inlining into bblocks that are detected as dead early during codegen #97514

Merged
merged 5 commits into from
Jan 31, 2024

Conversation

BrzVlad
Copy link
Member

@BrzVlad BrzVlad commented Jan 25, 2024

This PR addresses compilation overhead for methods containing patterns like:

if (false)
    Method1 ();
else if (false)
    Method2 ();
....

where MethodX are methods marked as aggressive inlining. This pattern is common in intrinsics related code. Before this PR we would inline all methods, together with all subcalls, even if the code path is obviously unreachable, leading to significant pressure on the compiler.

We do inlining, on the fly, in the IL code traversal stage. BBlock and constant folding optimizations are run afterwards once the full IL code has been processed. In order to be able to disable inlining in these dead bblocks, we implement some subset of cfolding to be applied as soon as we import IL code, so we can detect dead bblocks before we actually generate their code. An alternative would be to move inlining as a later phase, after the standard full bblock/cfold optimization passes are run, but that is a more invasive approach.

This works by adding jump counters for each bblock, which are set before we actually start generating code for the IL. As we generate code we optimize out some forward branches and decrement the jump counter for the target bblocks. When we then get to emit code in a bblock that is no longer a jump target for any other instruction, if the bblock is also not a fall through bblock from the previous bblock, it means it is dead and no inlining will happen into it. Also, when processing branches in such dead bblocks, we additionally decrement the jump counters for the target_bb. This isn't able to detect dead code involving loops, but it should cover most scenarios in the BCL.

This makes System.Numerics.Tensors.Tests go from taking 300s with mempeak of 6GB to a run time of 100s with mempeak of 600MB.

If both types are immediately known and loaded as constants
During initial bblock formation pass, we detect all bblocks that are targets of branches and we increment their ref count. As we import IL code, we might eagerly optimize out some conditional branches to unconditional branches (in which case we mark that the following bblock in IL order is no longer reachable from current bblock) or we completely optimize out the branch (in which case we reduce the ref count of the target bb). As we continue emitting code, we can detect if the current bblock is dead (if it is not a jump target and either is not linked to prev bblock or it is linked to a dead bblock). This liveness is not exact, but it should handle typical code with if/else conditionals.
@ghost
Copy link

ghost commented Jan 25, 2024

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR addresses compilation overhead for methods containing patterns like:

if (false)
    Method1 ();
else if (false)
    Method2 ();
....

where MethodX are methods marked as aggressive inlining. This pattern is common in intrinsics related code. Before this PR we would inline all methods, together with all subcalls, even if the code path is oviously unreachable, leading to significant pressure on the compiler.

We do inlining, on the fly, in the IL code traversal stage. BBlock and constant folding optimizations are run afterwards once the full IL code has been processed. In order to be able to disable inlining in these dead bblocks, we implement some subset of cfolding to be applied as soon as we import IL code, so we can detect dead bblocks before we actually generate their code. An alternative would be to move inlining as a later phase, after the standard full bblock/cfold optimization passes are run, but that is a more invasive approach.

This works by adding jump counters for each bblock, which are set before we actually start generating code for the IL. As we generate code we optimize out some forward branches and decrement the jump counter for the target bblocks. When we then get to emit code in a bblock that is no longer a jump target for any other instruction, if the bblock is also not a fall through bblock from the previous bblock, it means it is dead and no inlining will happen into it. Also, when processing branch es in such dead bblocks, we additionally decrement the jump counters for the target_bb. This isn't able to detect dead code involving loops, but it should most scenarios in the BCL.

This makes System.Numerics.Tensors.Tests go from taking 300s with mempeak of 6GB to a run time of 100s with mempeak of 600MB.

Author: BrzVlad
Assignees: BrzVlad
Labels:

area-Codegen-Interpreter-mono

Milestone: -

@BrzVlad
Copy link
Member Author

BrzVlad commented Jan 25, 2024

cc @stephentoub @tannergooding

Comment on lines +676 to +677
if (td->cbb->no_inlining && long_op != MINT_CALL_HANDLER)
target_bb->jump_targets--;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, why not just always eliminate dead blocks early if possible?

Presumably even something like this is worth simplifying, particularly for larger block bodies:

if (someExpressionThatIsTriviallyConstantFalse)
{
    // logic that can never be hit
}
else
{
    // logic that will always be hit
}

I'd expect it reduces memory overhead, improves overall throughput due to less nodes to traverse, and may even make other optimizations easier to identify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are all sorts of correctness checks during code emit, for example the stack state. So let's say you have a dead bblock that ends up falling through a live bblock. For example, if you just ignore all IL instrutions in the bblock without additional handling, then, when you enter the live bblock, the IL stack state will not contain correct information and we will throw invalid code exception. I'm sure it is doable, but it might not be as trivial as it would seem, making the gained benefit questionable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, I'm not sure exactly what RyuJIT does for this scenario, but @EgorBo would likely know.

The general consideration comes in for code like this: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs,2007

Where, even completely ignoring inlining, there exists a large block of code that is always dead for Mono today and will always be dead for most non-xarch platforms.

This type of pattern is pretty common outside the BCL in perf critical code, and that is often some of the more complex/infrastructural code for frameworks/libraries.

Copy link
Member Author

@BrzVlad BrzVlad Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, what is worse in the code sample you provided is the existence of the do/while loop. So in addition to traversing all that code, we will still inline.

@BrzVlad BrzVlad merged commit 05d0922 into dotnet:main Jan 31, 2024
111 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants