
JIT: initial work on general count reconstruction #99992

Merged
merged 4 commits into dotnet:main on Mar 26, 2024

Conversation

@AndyAyersMS
Member

Implements a Gauss-Seidel solver for cases where methods have irreducible loops.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 20, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member Author

Currently this kicks in whenever the initial profile is inconsistent... which turns out to be fairly often. Doing a "blend" may be too radical; we should perhaps do a "repair" instead.

// Visit blocks in reverse postorder, so that each block's predecessors
// along forward (non-back) edges are weighted before the block itself.
for (unsigned i = m_dfsTree->GetPostOrderCount(); i != 0; i--)
{
    BasicBlock* block = m_dfsTree->GetPostOrder(i - 1);
    ComputeBlockWeight(block);
}

// The reconstruction is only approximate if we capped any cyclic
// probability or the method has irreducible (improper) loop headers.
m_approximate = (m_cappedCyclicProbabilities) || (m_improperLoopHeaders > 0);
@jakobbotsch
Member

jakobbotsch commented Mar 20, 2024

Note that there can be cycles in the flow graph even without any improper loop headers or natural loops (it's the reason why I originally removed m_improperLoopHeaders from FlowGraphNaturalLoops). #96153 has some more information. I'm guessing it doesn't affect whether or not the computation is an approximation here because those cases happen only due to exceptional flow?

@AndyAyersMS
Member Author

Right, we assume exceptional flow is sufficiently rare that any cycle it might form can be ignored. Also we don't model flow in and out of handlers, except for normally invoked finallys.

This might need some tweaks when we run this later after the importer rewires the flow.

@AndyAyersMS
Member Author

Looks like the failures above are all timeouts. I think they're unrelated; SPMI TP doesn't show much impact.

@AndyAyersMS
Member Author

Assert failure(PID 13812 [0x000035f4], Thread: 14128 [0x3730]): promoted_bytes (heap_number) == promoted

CORECLR! SVR::gc_heap::mark_phase + 0x167C (0x00007fffc6c7149c)
CORECLR! SVR::gc_heap::gc1 + 0x240 (0x00007fffc6c63220)
CORECLR! SVR::gc_heap::garbage_collect + 0x99B (0x00007fffc6c62f8b)
CORECLR! SVR::gc_heap::gc_thread_function + 0x504 (0x00007fffc6c64934)
CORECLR! SVR::gc_heap::gc_thread_stub + 0x7E (0x00007fffc6c64dde)
CORECLR! <lambda_69e1faed69ef1eb68c88ad291f228203>::<lambda_invoker_cdecl> + 0x74 (0x00007fffc692e204)
KERNEL32! BaseThreadInitThunk + 0x14 (0x00007ffff85484d4)
NTDLL! RtlUserThreadStart + 0x21 (0x00007ffff8a01791)
File: D:\a_work\1\s\src\coreclr\gc\gc.cpp:24234

@AndyAyersMS
Member Author

AndyAyersMS commented Mar 22, 2024

Numeric Solvers for Profile Count Reconstruction

It may not be readily apparent how count reconstruction works. Perhaps these notes will shed some light on things.

In our flowgraph model we assume that the edge likelihoods are trustworthy and well formed (meaning each edge's likelihood is in [0,1] and the sum of all likelihoods for a block's successor edges is 1).

The appeal of edge well-formedness is that it is easy to check and relatively easy to maintain during various optimizations. It is a local property.

We will use $p_{i,j}$ to denote the likelihood that block $i$ transfers control to block $j$. Thus local consistency means:

$$ 0 \le p_{i,j} \le 1 $$ and, for blocks with successors: $$ \sum_j p_{i,j} = 1 $$

By contrast, block weight consistency requires that the flow into a block be balanced by the flow out of a block. It is a global property and harder to maintain during optimizations. It may also not be true initially.

We will use $w_j$ for the weight of block $j$. We will also assume there is an external source and sink of weight for some blocks (method entry and exit points), $e_j$. Then block consistency means:

$$ e_j + \sum_i w_i p_{i,j} = \sum_k w_j p_{j,k} $$

where the LHS is flow in and the RHS is flow out of block $j$. But

$$ \sum_k w_j p_{j,k} = w_j \sum_k p_{j,k} = w_j $$

so we can restate this as saying the external flow plus the flow into the block must equal the block weight:

$$ e_j + \sum_i w_i p_{i,j} = w_j$$

The goal of this work is to explore methods for reconstructing a set of consistent block weights $w_j$ from the external weight sources and sinks $e_j$ and edge likelihoods $p_{i,j}$.

General Solution

The above can be summarized in matrix-vector form as

$$ \boldsymbol w = \boldsymbol e + \boldsymbol P \boldsymbol w $$

where to be able to express the sum of incoming flow as a standard matrix-vector product we have:

$$ \boldsymbol P_{i,j} = { p_{j,i} } $$

(that is, in $\boldsymbol P$, the flow from block $i$ is described by the entries in the $i\text{th}$ column, and the flow into block $i$ by the $i\text{th}$ row). A bit of rearranging puts this into the standard linear equation form

$$ (\boldsymbol I - \boldsymbol P) \boldsymbol w = \boldsymbol e$$

and this can be solved (in principle) for $\boldsymbol w$ by computing the inverse of $\boldsymbol I - \boldsymbol P$ (assuming this exists), giving

$$ \boldsymbol w = {(\boldsymbol I - \boldsymbol P)}^{-1} \boldsymbol e $$

For example, given the following graph with edge likelihoods as shown (the figure depicts A→B with likelihood 1.0, B→C and B→D each with likelihood 0.5, C→B with likelihood 0.8, and C→D with likelihood 0.2):

we have

$$\boldsymbol P = \begin{bmatrix} 0 & 0 & 0 & 0 \cr 1 & 0 & 0.8 & 0 \cr 0 & 0.5 & 0 & 0 \cr 0 & 0.5 & 0.2 & 0 \end{bmatrix}$$

Note each column save the last sums to 1.0, representing the fact that the outgoing likelihoods from each block must sum to 1.0, unless there are no successors.

Thus

$$(\boldsymbol I - \boldsymbol P) = \begin{bmatrix} 1 & 0 & 0 & 0 \\\\ -1 & 1 & -0.8 & 0 \\\\ 0 & -0.5 & 1 & 0 \\\\ 0 & -0.5 & -0.2 & 1 \end{bmatrix}$$

and so (details of computing the inverse are left as an exercise for the reader)

$${(\boldsymbol I - \boldsymbol P)}^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\\\ 1.67 & 1.67 & 1.33 & 0 \\\\ 0.83 & 0.83 & 1.67 & 0 \\\\ 1 & 1 & 1 & 1 \end{bmatrix}$$

Note the elements of ${(\boldsymbol I - \boldsymbol P)}^{-1}$ are all non-negative; intuitively, if we increase flow anywhere in the graph, it can only cause weights to increase or stay the same.

If we feed 6 units of flow into A, we have

$$\boldsymbol w = \begin{bmatrix} 6 \\\ 10 \\\ 5 \\\ 6 \end{bmatrix}$$

or, graphically, the same weights annotated on the flow graph.
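As a check, these weights are consistent: at block $B$ the inflow is

$$ e_B + p_{A,B} \, w_A + p_{C,B} \, w_C = 0 + 1.0 \cdot 6 + 0.8 \cdot 5 = 10 = w_B $$

and at block $D$

$$ e_D + p_{B,D} \, w_B + p_{C,D} \, w_C = 0 + 0.5 \cdot 10 + 0.2 \cdot 5 = 6 = w_D $$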

However, explicit computation of the inverse of a matrix is computationally expensive.

Also note (though it's not fully obvious from such a small example) that the matrix $(\boldsymbol I - \boldsymbol P)$ is sparse: a typical block has only 1 or 2 successors, so the number of nonzero entries in each column will generally be either 2 or 3, no matter how many nodes we have. The inverse of a sparse matrix is typically not sparse, so computing it is not only costly in time but also in space.

So solution techniques that can leverage sparseness are of particular interest.

A More Practical Solution

Note the matrix $\boldsymbol I - \boldsymbol P$ has non-negative diagonal elements and non-positive off-diagonal elements, since all entries of $\boldsymbol P$ are in the range [0,1].

If we further restrict ourselves to the case where $p_{i,i} \lt 1$ (meaning there are no infinite self-loops) then all the diagonal entries are positive and the matrix has an inverse with no negative elements.

Such matrices are known as M-matrices.

It is well known that for an M-matrix $(\boldsymbol I - \boldsymbol P)$ the inverse can be computed as the limit of an infinite series

$$ {(\boldsymbol I - \boldsymbol P)}^{-1} = \boldsymbol I + \boldsymbol P + \boldsymbol P^2 + \dots $$

This gives rise to a simple iterative procedure for computing an approximate value of $\boldsymbol w$ (here superscripts on $\boldsymbol w$ are successive iterates, not powers)

$$ \boldsymbol w^{(0)} = \boldsymbol e $$ $$ \boldsymbol w^{(1)} = (\boldsymbol I + \boldsymbol P) \boldsymbol e = \boldsymbol e + \boldsymbol P \boldsymbol w^{(0)} $$ $$ \boldsymbol w^{(2)} = (\boldsymbol I + \boldsymbol P + \boldsymbol P^2) \boldsymbol e = \boldsymbol e + \boldsymbol P \boldsymbol w^{(1)}$$ $$ \dots$$ $$ \boldsymbol w^{(k + 1)} = \boldsymbol e + \boldsymbol P \boldsymbol w^{(k)} $$

where we can achieve any desired precision for $\boldsymbol w$ by iterating until the successive $\boldsymbol w$ differ by a small amount.

Intuitively this should make sense: we are effectively pouring weight into the entry block(s) and letting the weights flow around in the graph until they reach a fixed point. If we do this for the example above, we get the following sequence of values for $\boldsymbol w^{(k)}$:

$$\boldsymbol w^{(0)} = \begin{bmatrix} 6 \\\ 0 \\\ 0 \\\ 0 \end{bmatrix}, \boldsymbol w^{(1)} = \begin{bmatrix} 6 \\\ 6 \\\ 0 \\\ 0 \end{bmatrix}, \boldsymbol w^{(2)} = \begin{bmatrix} 6 \\\ 6 \\\ 3 \\\ 3 \end{bmatrix}, \boldsymbol w^{(3)} = \begin{bmatrix} 6 \\\ 8.4 \\\ 3 \\\ 3.6 \end{bmatrix}, \boldsymbol w^{(4)} = \begin{bmatrix} 6 \\\ 8.4 \\\ 4.2 \\\ 3.6 \end{bmatrix}, \boldsymbol w^{(5)} = \begin{bmatrix} 6 \\\ 9.36 \\\ 4.2 \\\ 3.6 \end{bmatrix}, \dots, \boldsymbol w^{(20)} = \begin{bmatrix} 6 \\\ 9.9990 \\\ 4.9995 \\\ 5.9992 \end{bmatrix}, \dots$$

and the process converges to the weights found using the inverse. However, convergence is fairly slow.

Classically this approach is known as Jacobi's method. At each iterative step, the new values are based only on the old values.
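To make the scheme concrete, here is a minimal standalone C++ sketch of this iteration (not the JIT's implementation; the Edge/predecessor representation is invented for illustration):

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Edge
{
    size_t from;       // index of the predecessor block
    double likelihood; // p_{from, this block}
};

// Jacobi iteration for w = e + P w: repeatedly form w^(k+1) = e + P w^(k),
// stopping once successive iterates agree to within 'tol'.
std::vector<double> JacobiSolve(const std::vector<double>&            e,
                                const std::vector<std::vector<Edge>>& preds,
                                double tol = 1e-4, int maxIters = 1000)
{
    std::vector<double> w = e; // w^(0) = e
    for (int k = 0; k < maxIters; k++)
    {
        std::vector<double> wNext(e.size());
        double              maxDelta = 0.0;
        for (size_t i = 0; i < e.size(); i++)
        {
            double sum = e[i];
            for (const Edge& p : preds[i]) // inflow from predecessors
            {
                sum += p.likelihood * w[p.from];
            }
            wNext[i]  = sum;
            maxDelta  = std::max(maxDelta, std::fabs(wNext[i] - w[i]));
        }
        w = std::move(wNext);
        if (maxDelta < tol)
        {
            break;
        }
    }
    return w;
}

On the example graph (preds for B holding edges from A with likelihood 1.0 and from C with likelihood 0.8, and e = {6, 0, 0, 0}) this reaches four decimal places in roughly 20 iterations, matching the table above.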

Jacobi's Method

If you read the math literature on iterative solvers, Jacobi's method is often described as follows. Given a linear system $\boldsymbol A \boldsymbol x = \boldsymbol b$, a splitting of $\boldsymbol A$ is $\boldsymbol A = \boldsymbol M - \boldsymbol N$, where $\boldsymbol M^{-1}$ exists. Then the iteration matrix $\boldsymbol H$ is given by $\boldsymbol H = \boldsymbol M^{-1} \boldsymbol N$. Given some initial guess at an answer $\boldsymbol x^{(0)}$ the iteration scheme is:

$$ \boldsymbol x^{(k+1)} = \boldsymbol H \boldsymbol x^{(k)} + \boldsymbol M^{-1}\boldsymbol b$$

And provided that $\rho(\boldsymbol H) \lt 1$,

$$\lim_{k \to \infty} \boldsymbol x^{(k)}=\boldsymbol A^{-1} \boldsymbol b$$

In our case $\boldsymbol A = \boldsymbol I - \boldsymbol P$ and so the splitting is simply $\boldsymbol M = \boldsymbol I$ and $\boldsymbol N = \boldsymbol P$. Since $\boldsymbol M = \boldsymbol I$, $\boldsymbol M^{-1} = \boldsymbol I$ (the identity matrix is its own inverse), $\boldsymbol H = \boldsymbol P$, $\boldsymbol x = \boldsymbol w$ and $\boldsymbol b = \boldsymbol e$, we end up with

$$ \boldsymbol w^{(k+1)} = \boldsymbol P \boldsymbol w^{(k)} + \boldsymbol e$$

as we derived above.

As an alternative we could split $\boldsymbol A = (\boldsymbol I - \boldsymbol P)$ into a diagonal part $\boldsymbol M = \boldsymbol D$ and a remainder part $\boldsymbol N$. This only leads to differences from the splitting above when there are self-loops; otherwise the diagonal of $\boldsymbol P$ is all zeros.

With that splitting,

$$\boldsymbol D^{-1}_{i,i} = 1/a_{i,i} = 1/(1 - p_{i,i})$$

so as $p_{i,i}$ gets close to 1.0 the value can be quite large: these are the count amplifications caused by self-loops. If we write things out component-wise we get the classic formulation for Jacobi iteration:

$$x^{(k+1)}_i = \frac{1}{a_{i,i}} \left (b_i - \sum_{j \ne i} a_{i,j} x^{(k)}_j \right)$$

or in our block weight and edge likelihood notation

$$w^{(k+1)}_i = \frac{1}{(1 - p_{i,i})} \left (e_i + \sum_{j \ne i} p_{j,i} w^{(k)}_j \right)$$

Intuitively this reads: the new value of node $i$ is the sum of the external input (if any) plus the weights flowing in from (non-self) predecessors, with the sum scaled up by the self-loop factor.
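For example, a block whose self-loop likelihood is $p_{i,i} = 0.9$ (an illustrative value) scales its incoming flow tenfold:

$$ \frac{1}{1 - p_{i,i}} = \frac{1}{1 - 0.9} = 10 $$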

On Convergence and Stability

While the iterative method above is guaranteed to converge when $\boldsymbol A$ is an M-matrix, its rate of convergence is potentially problematic. For an iterative scheme, the asymptotic rate of convergence can be shown to be $R \approx -\log_{10} \rho(\boldsymbol H)$ digits / iteration.

Here the spectral radius $\rho(\boldsymbol H)$ is the magnitude of the largest eigenvalue of $\boldsymbol H$. For the example above $\boldsymbol H = \boldsymbol P$ and $\rho(\boldsymbol P) \approx 0.63$, giving $R = 0.2$. So to converge to $4$ decimal places takes about $20$ iterations, as the table of data above indicates.

It is also worth noting that for synthesis the matrix $(\boldsymbol I - \boldsymbol P)$ is often ill-conditioned, meaning that small changes in the input vector $\boldsymbol e$ (or small inaccuracies in the likelihoods $p_{i,j}$) can lead to large changes in the solution vector $\boldsymbol w$. In some sense this is a feature; we know that blocks in flow graphs can have widely varying weights, with some blocks rarely executed and others executed millions of times per call to the method. So it must be possible for $(\boldsymbol I - \boldsymbol P)$ to amplify the magnitude of a "small" input (say 1 call to the method) into large block counts.

Accelerating Convergence I: Gauss-Seidel and Reverse Postorder

It's also well-known that Gauss-Seidel iteration often converges faster than Jacobi iteration. Here, instead of always using the old iteration values, we use the new iteration values as soon as they are available, where we presume each update happens in order of increasing $i$:

$$x^{(k+1)}_i = \frac{1}{a_{i,i}} \left(b_i - \sum_{j \lt i} a_{i,j} x^{(k+1)}_j - \sum_{j \gt i} a_{i,j} x^{(k)}_j \right)$$

or again in our notation

$$w^{(k+1)}_i = \frac{1}{(1 - p_{i,i})} \left(e_i + \sum_{j \lt i} p_{j,i} w^{(k + 1)}_j + \sum_{j \gt i} p_{j,i} w^{(k)}_j \right)$$

In the above scheme the order of visiting successive blocks is left unspecified, and (in principle) any order can be used. But by using a reverse postorder to index the blocks, we can ensure a maximal amount of forward propagation per iteration. Note that if a block has an incoming edge from a node that appears later in the reverse postorder, that block is a loop header.

If we do this, the formula above nicely corresponds to our notion of forward and backward edges in the RPO:

$$w^{(k+1)}_i = \frac{1}{\underbrace{(1 - p_{i,i})}_\text{self edge}} \left(e_i + \underbrace{\sum_{j \lt i} p_{j,i} w^{(k + 1)}_j}_\text{forward edges in RPO} + \underbrace{\sum_{j \gt i} p_{j,i} w^{(k)}_j}_\text{backward edges in RPO} \right)$$

Note because of the order of reads and writes, $\boldsymbol w^{(k+1)}$ can share storage with $\boldsymbol w^{(k)}$.
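In code, a single pass of this scheme could look like the following sketch (same invented Edge representation as the Jacobi sketch above; $\boldsymbol w$ is updated in place):

// One Gauss-Seidel pass over the blocks in reverse postorder. Since w is
// updated in place, predecessors earlier in the RPO (forward edges) already
// hold their k+1 values, while later ones (backward edges) still hold k values.
void GaussSeidelPass(const std::vector<double>&            e,
                     const std::vector<std::vector<Edge>>& preds,
                     std::vector<double>&                  w)
{
    for (size_t i = 0; i < w.size(); i++) // i indexes blocks in RPO
    {
        double sum      = e[i];
        double selfLike = 0.0;
        for (const Edge& p : preds[i])
        {
            if (p.from == i)
            {
                selfLike = p.likelihood; // p_{i,i}: handled by the scaling below
            }
            else
            {
                sum += p.likelihood * w[p.from];
            }
        }
        w[i] = sum / (1.0 - selfLike); // scale up for any self-loop
    }
}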

On the example above this results in:

$$ \boldsymbol w^{(0)} = \begin{bmatrix} 6 \\ 6 \\ 3 \\ 3 \end{bmatrix}, \boldsymbol w^{(1)} = \begin{bmatrix} 6 \\ 8.4 \\ 4.2 \\ 5.04 \end{bmatrix}, \boldsymbol w^{(2)} = \begin{bmatrix} 6 \\ 9.36 \\ 4.68 \\ 5.62 \end{bmatrix}, \boldsymbol w^{(3)} = \begin{bmatrix} 6 \\ 9.74 \\ 4.87 \\ 5.85 \end{bmatrix}, \boldsymbol w^{(4)} = \begin{bmatrix} 6 \\ 9.90 \\ 4.95 \\ 5.94 \end{bmatrix}, \boldsymbol w^{(5)} = \begin{bmatrix} 6 \\ 9.96 \\ 4.98 \\ 5.98 \end{bmatrix}, \dots, \boldsymbol w^{(9)} = \begin{bmatrix} 6 \\ 9.9990 \\ 4.9995 \\ 5.9994 \end{bmatrix}, \dots $$

So it is converging about twice as fast. As with the Jacobi method, one can re-express this as a splitting, determine an iteration matrix $\boldsymbol H$ and its dominant eigenvalue, and from this the rate of convergence, but we will not do so here.

Accelerating Convergence II: Cyclic Probabilities

A flow graph is reducible (or is said to have reducible loops) if every cycle in the graph has a block in the cycle that dominates the other blocks in the cycle. We will call such cycles natural loops, distinguished by their entry blocks.

For reducible loops we can compute the amount by which they amplify flow using a technique described by Wu and Larus: given a loop head $h$ we classify its predecessor edges into two sets: input edges that do not come from a block within the loop, and back edges that come from a block within the loop. We then inject one unit of flow into $h$ and propagate it through the loop, and compute the sum of the weights on the back edges. This value will be some $p$ where $0 \le p \le 1$. Then the cyclic probability $C_p$ for $h$ is $C_p(h) = 1 / (1 - p)$. To avoid dividing by zero we artificially cap $p$ at some value less than $1$.
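For the example graph above, $B$ heads a natural loop whose only back edge is $C \to B$. Injecting one unit of flow at $B$ sends $0.5$ to $C$, and $C$ returns $0.5 \cdot 0.8 = 0.4$ along the back edge, so

$$ C_p(B) = \frac{1}{1 - 0.4} = \frac{5}{3} \approx 1.67 $$

which matches the corresponding diagonal entry of ${(\boldsymbol I - \boldsymbol P)}^{-1}$ computed earlier.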

If we add this refinement to our algorithm we end up with:

$$w^{(k+1)}_i = \begin{cases} C_p(i) \left(e_i + \sum_{j \lt i} p_{j,i} w^{(k + 1)}_j \right), \text{ block } i \text{ is a natural loop head} \\\ \frac{1}{(1 - p_{i,i})} \left(e_i + \sum_{j \lt i} p_{j,i} w^{(k + 1)}_j + \sum_{j \gt i} p_{j,i} w^{(k)}_j \right) \end{cases}$$

the second clause includes both blocks without any back edges and blocks with back edges that are not headers of natural loops.

On an example like the one above this converges in one pass. If any $C_p$ was capped then the solution will be approximate and we will have failed to achieve a global balance. But we will also (generally) have avoided creating infinite or very large counts.

One can imagine that if we cap some $C_p$ we could also try to alter some of the $p_{j,i}$ to bring things back into balance, but this seems tricky if there are multiple paths through the loop. And we're basically deciding at that point that consistency is more important than accuracy.

Since the remainder of the JIT is going to have to cope with lack of global balance anyway (recall it is hard to preserve), for now we are going to try to tolerate reconstruction inconsistencies.

The algorithm described above is implemented in the code as the GaussSeidel solver.

Cycles That Are Not Natural Loops, More Sophisticated Solvers, and Deep Nests

If the flow graph has cycles that are not natural loops (irreducible loops) the above computations will converge but again may converge very slowly. On a sample of about 500 graphs with irreducible loops the modified Gauss-Seidel approach above required more than 20 iterations in 120 cases and more than 50 iterations in 70 cases, with worst-case around 500 iterations.

Successive over-relaxation (SOR) is a classic technique for altering convergence, but unfortunately, for M-matrices SOR can only safely be used to slow down convergence.

There does not seem to be a good analog of $C_p$ for such cases, though it's possible that "block diagonal" solvers may be tackling exactly that problem.

It's possible that more sophisticated solution techniques like BiCGstab or CGS might be worth consideration. Or perhaps a least-squares solution, if we're forced to be approximate, to try and minimize the overall approximation error.

In very deep loop nests even $C_p$ is not enough to prevent creation of large counts. We could try and adjust the cap level downwards as the loops get deeper, or distribute the $C_p$ "tax" across all the loops. This tends to only be a problem for stress test cases.

References

Carl D. Meyer. Matrix Analysis and Applied Linear Algebra; in particular section 7.10.

Nick Higham. What Is an M-Matrix?

Youfeng Wu and James R. Larus. Static branch frequency and program profile analysis. MICRO-27 (1994).

@AndyAyersMS AndyAyersMS marked this pull request as ready for review March 23, 2024 19:00
@AndyAyersMS
Member Author

@jakobbotsch ptal
cc @dotnet/jit-contrib

Think repair works better than blend here.

Fair number of diffs, likely from cases where PGO counts were not consistent (and likely much of that from the approximate count probes).

Diffs

@AndyAyersMS AndyAyersMS merged commit ba84d1e into dotnet:main Mar 26, 2024
108 of 110 checks passed
@BruceForstall BruceForstall mentioned this pull request Mar 27, 2024
@BruceForstall
Member

Some notes:

  1. This makes no sense:

$$ 0 \ge p_{i,j} \ge 1 $$

  2. Shouldn't the extensive comments in this issue (JIT: initial work on general count reconstruction #99992 (comment)) be encapsulated in the code as a comment, or adjacent as markdown?

  3. I'm worried about the math-y "rocket science" complexity here: will anyone be able to debug a failure?

@AndyAyersMS
Member Author

Yeah I will copy this writeup somewhere.

This is a standard iterative linear equation solver; the only real nuance is using RPO to ensure maximum forward propagation (hopefully a familiar concept to anyone working on the jit) and then using the cyclic probabilities.

There's also a fairly obvious intuitive explanation: we pour some counts in the top and iteratively flow them around using the edge likelihoods until we reach a fixed point or run out of iterations. Most of what's above is just an explanation of why this works, why it might take a long time, and why there are cases it can't solve exactly.

I have held off on implementing something more "rocket science" state of the art (like BiCGSTAB or CGS) which rely heavily on less familiar bits of linear algebra.

@AndyAyersMS
Member Author

Some issues that have come up:

  • Assertion failed 'change >= 0' #100350 (fixed by JIT: fix count reconstruction problem #100385)
  • If a natural loop contains an improper header, that natural loop's $C_p$ value, and those of all enclosing natural loops, may be inaccurate (too low), so we can't rely on them alone for convergence acceleration. It is probably OK to use them sparingly (perhaps initially), but there needs to be at least one trailing iteration that does not use them.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Apr 3, 2024
I left a long note on the algorithm as a comment on dotnet#99992.
Move it to the doc folder.
AndyAyersMS added a commit that referenced this pull request Apr 3, 2024
I left a long note on the algorithm as a comment on #99992.
Move it to the doc folder.
@github-actions github-actions bot locked and limited conversation to collaborators Apr 28, 2024