[mono][interp] Disable inlining into bblocks that are detected as dead early during codegen #97514
Conversation
If both types are immediately known and loaded as constants
During the initial bblock formation pass, we detect all bblocks that are targets of branches and increment their ref count. As we import IL code, we might eagerly optimize some conditional branches into unconditional branches (in which case we mark that the following bblock in IL order is no longer reachable from the current bblock), or we completely optimize out the branch (in which case we reduce the ref count of the target bblock). As we continue emitting code, we can detect whether the current bblock is dead (it is not a jump target and is either not linked to the previous bblock or linked to a dead bblock). This liveness detection is not exact, but it should handle typical code with if/else conditionals.
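To make the rule concrete, here is a small C# sketch of the bookkeeping described above (the interpreter itself implements this in C; the type and member names below are invented for illustration):

sealed class BasicBlock
{
    public int JumpTargets;     // how many branches still target this block (the ref count)
    public bool LinkedToPrev;   // reachable by falling through from the previous block in IL order
    public bool Dead;           // once set, no inlining happens into this block
}

static class DeadBlockSketch
{
    // Called while importing IL, when a conditional branch folds to a constant condition.
    static void OnFoldedConditionalBranch(BasicBlock fallThrough, BasicBlock target, bool alwaysTaken)
    {
        if (alwaysTaken)
            fallThrough.LinkedToPrev = false;  // branch became unconditional: the next block is no longer a fall-through successor
        else
            target.JumpTargets--;              // branch optimized out entirely: the target loses one incoming edge
    }

    // Called when we start emitting code for a block.
    static void OnStartBlock(BasicBlock bb, BasicBlock prev)
    {
        // Dead when nothing branches here and we cannot fall through from a live predecessor.
        // Conservative: loop back-edges are never folded, so loop bodies always stay live.
        bb.Dead = bb.JumpTargets == 0 && (!bb.LinkedToPrev || prev.Dead);
    }
}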
Tagging subscribers to this area: @BrzVlad, @kotlarmilos
if (td->cbb->no_inlining && long_op != MINT_CALL_HANDLER)
    target_bb->jump_targets--;
Just wondering, why not just always eliminate dead blocks early if possible?
Presumably even something like this is worth simplifying, particularly for larger block bodies:
if (someExpressionThatIsTriviallyConstantFalse)
{
    // logic that can never be hit
}
else
{
    // logic that will always be hit
}
I'd expect it reduces memory overhead, improves overall throughput due to fewer nodes to traverse, and may even make other optimizations easier to identify.
There are all sorts of correctness checks during code emit, for example on the stack state. So let's say you have a dead bblock that ends up falling through to a live bblock. If you just ignore all IL instructions in the dead bblock without additional handling, then, when you enter the live bblock, the IL stack state will not contain correct information and we will throw an invalid-code exception. I'm sure it is doable, but it might not be as trivial as it seems, making the gained benefit questionable.
👍, I'm not sure exactly what RyuJIT does for this scenario, but @EgorBo would likely know.
The general consideration comes in for code like this: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs,2007
Where, even completely ignoring inlining, there exists a large block of code that is always dead for Mono today and will always be dead for most non-xarch platforms.
This type of pattern is pretty common outside the BCL in perf critical code, and that is often some of the more complex/infrastructural code for frameworks/libraries.
Actually, what is worse in the code sample you provided is the existence of the do/while loop. So in addition to traversing all that code, we will still inline.
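For reference, a drastically simplified sketch of the kind of guarded block being discussed (illustrative only, not the actual SpanHelpers code): an IsHardwareAccelerated guard around a vectorized do/while loop, with a scalar fallback. On platforms where the guard folds to false, the entire vectorized block is dead, yet a naive importer still walks it and inlines into it.

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class IndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<byte> span, byte value)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            // Vectorized search, statically unreachable when SIMD is not supported.
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector128<byte> target = Vector128.Create(value);
            do
            {
                Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, (nuint)i);
                uint mask = Vector128.ExtractMostSignificantBits(Vector128.Equals(chunk, target));
                if (mask != 0)
                    return i + BitOperations.TrailingZeroCount(mask);
                i += Vector128<byte>.Count;
            } while (i <= span.Length - Vector128<byte>.Count);
        }

        // Scalar fallback handles the remaining tail (or the whole span).
        for (; i < span.Length; i++)
            if (span[i] == value)
                return i;
        return -1;
    }
}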
This PR addresses compilation overhead for methods containing patterns like:
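A minimal illustrative sketch of this kind of pattern (hypothetical method names, not an exact snippet from the BCL):

using System.Runtime.CompilerServices;

static class DispatchSketch
{
    // Each branch compares typeof(T) against a constant type, so once T is known
    // only one branch can ever be taken; the rest are dead, yet every MethodX call
    // is marked for aggressive inlining.
    public static void Dispatch<T>()
    {
        if (typeof(T) == typeof(byte))
            MethodByte();
        else if (typeof(T) == typeof(short))
            MethodShort();
        else if (typeof(T) == typeof(int))
            MethodInt();
        else
            MethodFallback();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodByte() { /* byte-specialized path, itself calling more inlined helpers */ }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodShort() { }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodInt() { }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void MethodFallback() { }
}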
where MethodX are methods marked as aggressive inlining. This pattern is common in intrinsics-related code. Before this PR we would inline all such methods, together with all their subcalls, even if the code path is obviously unreachable, leading to significant pressure on the compiler.
We do inlining on the fly, in the IL code traversal stage. BBlock and constant folding optimizations are run afterwards, once the full IL code has been processed. In order to be able to disable inlining into these dead bblocks, we implement a subset of cfolding that is applied as soon as we import IL code, so we can detect dead bblocks before we actually generate their code. An alternative would be to move inlining to a later phase, after the standard full bblock/cfold optimization passes have run, but that is a more invasive approach.
This works by adding jump counters for each bblock, which are set before we actually start generating code for the IL. As we generate code, we optimize out some forward branches and decrement the jump counter of the target bblocks. When we then get to emit code in a bblock that is no longer a jump target for any other instruction, and the bblock is also not a fall-through successor of the previous bblock, it means it is dead and no inlining will happen into it. Also, when processing branches in such dead bblocks, we additionally decrement the jump counters of the target_bb. This isn't able to detect dead code involving loops, but it should cover most scenarios in the BCL.
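Concretely, the extra decrement for branches inside dead bblocks looks roughly like this (illustrative C# with made-up names; the actual change is the td->cbb->no_inlining check in the C diff shown earlier):

static class DeadBranchSketch
{
    sealed class Block
    {
        public int JumpTargets;   // remaining incoming branches (the jump counter)
        public bool Dead;         // detected as dead, so nothing is inlined into it
    }

    // When emitting a branch that lives inside an already-dead block, its target
    // loses one incoming edge as well (call-handler branches are excluded, mirroring
    // the MINT_CALL_HANDLER check), which lets dead-ness propagate to later blocks.
    static void OnBranchInBlock(Block current, Block target, bool isCallHandler)
    {
        if (current.Dead && !isCallHandler)
            target.JumpTargets--;
    }
}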
This makes System.Numerics.Tensors.Tests go from a run time of 300s with a memory peak of 6GB to 100s with a memory peak of 600MB.