JIT: initial implementation of profile synthesis #82926
Conversation
Implements a profile synthesis algorithm based on the classic Wu-Larus paper (Static branch frequency and program profile analysis, Micro-27, 1994), with a simple set of heuristics.

First step is construction of a depth-first spanning tree (DFST) for the flowgraph, and corresponding reverse postorder (RPO). Together these drive loop recognition; currently we only recognize reducible loops. We use DFST (non-)ancestry as a proxy for (non-)domination: the dominator of a node is required to be a DFST ancestor, so no explicit dominance computation is needed. Irreducible loops are noted but ignored. Loop features like entry, back, and exit edges, body sets, and nesting are computed and saved.

Next step is assignment of edge likelihoods. Here we use some simple local heuristics based on loop structure, returns, and throws. A final heuristic gives slight preference to conditional branches that fall through to the next IL offset.

After that we use loop nesting to compute the "cyclic probability" $cp$ for each loop, working inner to outer across loops and in RPO within loops. $cp$ summarizes the effect of flow through the loop and around loop back edges. We cap $cp$ at no more than 1000. When finding $cp$ for outer loops we use $cp$ for inner loops.

Once all $cp$ values are known, we assign "external" input weights to method and EH entry points, and then a final RPO walk computes the expected weight of each block (and, via edge likelihoods, each edge).

We use the existing DFS code to build the DFST and the RPO, augmented by some fixes to ensure all blocks (even ones in isolated cycles) get numbered.

This initial version is intended to establish the right functionality, enable wider correctness testing, and provide a foundation for refinement of the heuristics. It is not yet as efficient as it could be.

The loop recognition and recording done here overlaps with similar code elsewhere in the JIT. The version here is structural and not sensitive to IL patterns, so it is arguably more general and, I believe, a good deal simpler than the lexically driven recognition we use for the current loop optimizer. I aspire to reconcile this somehow in future work.

All this is disabled by default; a new config option enables using synthesis to set block weights either for all root methods or just for those without PGO data. Synthesis for inlinees is not yet enabled; progress here is blocked by #82755.
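As a rough sketch of the cyclic-probability step described above (illustrative code, not the JIT's): if the combined likelihood of returning to a loop header via its back edges is $p$, flow entering the header is amplified by $cp = 1/(1-p)$, capped so loops that almost never exit do not produce unbounded weights.

```cpp
#include <cassert>

// Illustrative sketch only (not JIT code): cyclic probability for a loop
// whose back edges carry a combined likelihood `backLikelihood` of
// returning to the header. Flow entering the header is amplified by
//   cp = 1 / (1 - backLikelihood),
// capped (here at 1000, matching the change) so loops that almost never
// exit do not produce unbounded weights.
double CyclicProbability(double backLikelihood)
{
    const double cap = 1000.0;
    assert((backLikelihood >= 0.0) && (backLikelihood <= 1.0));
    if (backLikelihood >= 1.0 - (1.0 / cap))
    {
        return cap;
    }
    return 1.0 / (1.0 - backLikelihood);
}
```

For example, a loop whose back edge is taken with likelihood 0.5 gets $cp = 2$; as the likelihood approaches 1, $cp$ saturates at the cap.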
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
@BruceForstall PTAL. Was about to post a sample result but realized to my dismay the final weight computation is off. So will fix that and then put up some examples.
This looks like a bug in the DFS traversal (which is surprising, since we rely on it for …). The RPO from this is:
Note that BB7 appears too early in the RPO; it should appear after both its pred BB6 and its pred BB4. So when we do our final RPO walk to set profile counts, we don't include the count from BB4. The issue seems to be that when we initially visit BB6 we push BB7 onto the pending visit stack and mark it, so when we reach BB4 we don't visit BB7.
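A minimal sketch of the distinction at play (toy code, not the JIT's DFS): marking a block "visited" when it is pushed is fine for preventing re-visits, but its postorder position must be assigned only when it is finished (popped), after all of its successors have been explored; otherwise a join block like BB7 can land ahead of one of its preds in the RPO.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy iterative DFS (not the JIT's implementation): blocks are marked
// "visited" when pushed so each is explored once, but a postorder number
// is assigned only when a block is popped, i.e. after all of its
// successors are done. Reversing the result yields an RPO in which every
// block follows all of its non-back-edge preds.
std::vector<int> PostOrder(const std::vector<std::vector<int>>& succs, int entry)
{
    std::vector<char> visited(succs.size(), 0);
    std::vector<int> postorder;
    std::vector<std::pair<int, size_t>> stack; // (block, next successor index)
    visited[entry] = 1;
    stack.push_back({entry, 0});
    while (!stack.empty())
    {
        const int    blk  = stack.back().first;
        const size_t next = stack.back().second;
        if (next < succs[blk].size())
        {
            stack.back().second = next + 1;
            const int succ = succs[blk][next];
            if (!visited[succ])
            {
                visited[succ] = 1;
                stack.push_back({succ, 0});
            }
        }
        else
        {
            postorder.push_back(blk); // numbered at finish time, not discovery
            stack.pop_back();
        }
    }
    return postorder;
}
```

For a diamond (block 0 branching to 1 and 2, both reaching 3), the join block 3 finishes first, so in the reverse postorder it correctly follows both of its preds.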
src/coreclr/jit/jitconfigvalues.h
Outdated
@@ -593,7 +593,10 @@ CONFIG_INTEGER(JitCrossCheckDevirtualizationAndPGO, W("JitCrossCheckDevirtualiza
CONFIG_INTEGER(JitNoteFailedExactDevirtualization, W("JitNoteFailedExactDevirtualization"), 0)
CONFIG_INTEGER(JitRandomlyCollect64BitCounts, W("JitRandomlyCollect64BitCounts"), 0) // Collect 64-bit counts randomly
                                                                                     // for some methods.
#endif // debug
// 1: profile synthesis for root methods
// 2: profile synthesis for foot methods w/o pgo data
// 2: profile synthesis for foot methods w/o pgo data
// 2: profile synthesis for root methods w/o pgo data
@@ -113,6 +113,7 @@ set( JIT_SOURCES
fginline.cpp
fgopt.cpp
fgprofile.cpp
fgprofilesynthesis.cpp
I think you need to add fgprofilesynthesis.h in the JIT_HEADERS section, below
added
src/coreclr/jit/fgprofile.cpp
Outdated
{
    JITDUMP("Synthesizing profile data\n");
    ProfileSynthesis::Run(this, ProfileSynthesisOption::AssignLikelihoods);
    fgPgoHaveWeights = false;
Should `ProfileSynthesis::Run()` set this instead?
Yeah, that makes more sense.
Over time this flag will either change into something more complex or go away.
//------------------------------------------------------------------------
// AssignLikelihoods: update edge likelihoods and block counts based
// entrely on heuristics.
// entrely on heuristics.
// entirely on heuristics.
// Returns:
//   loop headed by block, or nullptr
//
SimpleLoop* ProfileSynthesis::IsLoopHeader(BasicBlock* block)
An `Is` prefix implies a bool return. Maybe `GetLoopFromHeaderBlock`?
Sure
}
//------------------------------------------------------------------------
// AssignLikelihoodCond: update edge likelihood for a block that
// AssignLikelihoodCond: update edge likelihood for a block that
// AssignLikelihoodSwitch: update edge likelihood for a block that
{
    // Assume each switch case is equally probable
    //
    const unsigned n = block->NumSucc();
I presume we have no degenerate (`n == 0`) cases here?
Seems like we don't? Even if a switch is just default it has 1 case. I'll add an assert for now.
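For illustration (toy code, not the JIT's): with n jump-table entries each case contributes likelihood 1/n, and an edge shared by several cases is scaled by its dup count, so the unique successor edges' likelihoods sum to 1.

```cpp
#include <cassert>

// Toy illustration of the equal-probability switch heuristic (not JIT
// code): with `numCases` jump-table entries, each case contributes
// 1/numCases, and an edge targeted by `dupCount` cases gets likelihood
// (1 / numCases) * dupCount. Summed over unique successors this is 1.
double SwitchEdgeLikelihood(unsigned numCases, unsigned dupCount)
{
    assert(numCases >= 1); // even a default-only switch has one case
    assert((dupCount >= 1) && (dupCount <= numCases));
    return (1.0 / numCases) * dupCount;
}
```

So a 4-case switch where two cases share a target gives that edge likelihood 0.5 and the other two edges 0.25 each.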
{
    // Assume each switch case is equally probable
    //
    const unsigned n = block->NumSucc();
const unsigned n = block->NumSucc();
const unsigned n = block->NumSucc(m_comp);
?
This one is intentional -- if there are
Ok, makes sense. Might want to add a comment to ensure that is perfectly clear, since we normally pair `NumSucc()/GetSucc()/Succs()` and `NumSucc(Compiler*)/GetSucc(Compiler*,x)/Succs(Compiler*)`, and seeing them "unbalanced" is weird.
for (BasicBlock* const succ : block->Succs(m_comp))
{
    FlowEdge* const edge = m_comp->fgGetPredForBlock(succ, block);
    edge->setLikelihood(p * edge->getDupCount());
Wouldn't all the heuristics in `AssignLikelihoodCond` apply? And need to get blended somehow?
Probably something like that. I kept the heuristics fairly simple for now. Plan is to revise them once we start comparing the synthesized results to real profile data.
{
    BasicBlock* const loopBlock = m_bbNumToBlockMap[bbNum];

    for (BasicBlock* const succBlock : loopBlock->Succs())
for (BasicBlock* const succBlock : loopBlock->Succs())
for (BasicBlock* const succBlock : loopBlock->Succs(m_comp))
?
Fixed
When you assign likelihoods from the heuristics, you have a fixed ordering of heuristics (and fixed likelihoods), and the first heuristic that matches determines the resulting likelihoods. In Wu/Larus, if I read it correctly, they combine likelihoods from all applicable heuristics. Is that something that should be done?
My main goal here is to get the solver framework in place; to me, that's the interesting part of the change. The heuristics will evolve, but whether they end up looking like they do now, or more like what's in the paper, or something else remains to be determined. For instance, we're currently running before importation so (for the most part) can't use any heuristic that relies on knowing the computations in each block. This is somewhat intentional, as the importer is sensitive to profile data, so we need some notion of profile beforehand, but also somewhat unfortunate, because once we start importing we'll likely uncover things that would lead us to make different likelihood projections. I haven't yet pinned down when or how often we might revisit / rerun synthesis.
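For reference, the evidence combination Wu-Larus describe (a Dempster-Shafer rule) looks roughly like the sketch below; this is illustrative only, as the change as-is uses first-match-wins.

```cpp
#include <cassert>

// Sketch of Dempster-Shafer combination as used by Wu-Larus to blend
// multiple applicable heuristics (illustrative, not part of this change):
// two independent predictions p1 and p2 that a branch is taken combine as
//   p = (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2)).
// A neutral prediction (0.5) leaves the other unchanged; two predictions
// that agree reinforce each other.
double CombineLikelihoods(double p1, double p2)
{
    const double agree    = p1 * p2;
    const double disagree = (1.0 - p1) * (1.0 - p2);
    assert(agree + disagree > 0.0);
    return agree / (agree + disagree);
}
```

For example, combining 0.5 with 0.8 yields roughly 0.8, while combining 0.8 with 0.8 yields roughly 0.94.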