
Spanning tree instrumentation #47959

Merged (7 commits) Feb 9, 2021

Conversation

AndyAyersMS (Member) commented Feb 6, 2021:

@dotnet/jit-contrib think this is ready for some review.

See #46882 for notes on the overall approach.

This turned out to be more involved than I'd expected, between the work to uncover and then cope with (or remove) complications that come from running analyses this early in the jit's phase pipeline, and the need to navigate between several different representations of the data.

A rough guide to reviewing the code:

  • pgo.cpp -- possible fix for #47930 (PGO: seeing nonsensical counts and corrupted type handles)
  • corjit.h, PgoFormat.cs -- new record type in PGO schemas for edge counts
  • block.h -- another scratch field, used by reconstruction
  • compiler.h -- made some data public so the new classes in fgprofile can have access; new methods
  • compphases.h, phase.cpp, compiler.cpp -- new phase to allow "early" preparation for instrumentation (that is, before importation). Edge profiling needs this so it sees the "same" flow graph that profile incorporation sees.
  • jitconfigvalues.h -- new option to enable sparse edge instrumentation
  • fgbasic.cpp -- allow fgSplitEdge to be called before we have pred lists, defer computation of inlinee scale until profile incorporation
  • fgprofile.cpp -- the crux of the change. Two large clumps (instrumentation and reconstruction) with a shared utility (spanning tree walker), and various supporting changes to enable and cope with the new mode.

For sparse edge profiling, both instrumentation and reconstruction rely on being able to create a spanning tree, so this part is factored out into a walker/visitor pattern. The tree is built by a DFS that is mostly conventional but has a few odd aspects:

  • we try to bias the DFS into visiting certain successors first, as an attempt at creating a "maximum weight" spanning tree
  • we add some pseudo-edges from blocks with no successors to the method and/or handler entries. Edge instrumentation generally prefers to instrument region exits rather than region entries, and these pseudo-edges are what trigger that preference.
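
The biased DFS can be sketched roughly as follows. This is a minimal Python illustration, not the jit's C++; the function name, the graph representation, and the weight map are all invented for the example:

```python
# Sketch: build a spanning tree by DFS, visiting heavier successors first
# so the tree tends toward "maximum weight". The non-tree edges (the ones
# that get instrumented) then tend to be the cooler edges.
def build_spanning_tree(succs, weight, entry):
    """succs:  block -> list of successor blocks
       weight: (src, dst) -> estimated edge weight (e.g. static heuristics)
       Returns the set of tree edges."""
    tree_edges = set()
    visited = {entry}

    def dfs(block):
        # Bias the DFS: try the heaviest successor first.
        for succ in sorted(succs.get(block, []),
                           key=lambda s: weight[(block, s)], reverse=True):
            if succ not in visited:
                visited.add(succ)
                tree_edges.add((block, succ))
                dfs(succ)

    dfs(entry)
    return tree_edges
```

On a diamond-shaped graph where the A→B→D path is hot, the hot edges land in the tree and only the cold edges would need probes.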

The reconstruction algorithm uses a fair amount of auxiliary data and maps. There may be less costly ways to encode some of this; suggestions welcome. It also repeatedly iterates over blocks when ideally it would be using some kind of priority queue or other more focused organization of the remaining work.
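
In outline, the reconstruction works like this. Below is a hedged Python sketch under simplifying assumptions (plain dicts, the naive repeated-sweep organization described above, and an invented sweep bound); the actual jit code is C++ and considerably more involved:

```python
# Sketch of count reconstruction by flow conservation: instrumented
# (non-tree) edge counts are known; each sweep solves any edge that is
# the only unknown incident edge on one side of a block whose other
# side is fully known (in-count == out-count == block count).
def solve_edge_counts(blocks, preds, succs, known, max_sweeps=10):
    counts = dict(known)
    unknown = ({(p, b) for b in blocks for p in preds.get(b, [])} |
               {(b, s) for b in blocks for s in succs.get(b, [])}) - set(counts)
    for _ in range(max_sweeps):
        if not unknown:
            break  # converged
        for b in blocks:
            ins = [(p, b) for p in preds.get(b, [])]
            outs = [(b, s) for s in succs.get(b, [])]
            for side, other in ((ins, outs), (outs, ins)):
                pending = [e for e in side if e in unknown]
                if (len(pending) == 1 and other and
                        not any(e in unknown for e in other)):
                    # The flow balance determines the one unknown edge.
                    counts[pending[0]] = (sum(counts[e] for e in other) -
                                          sum(counts[e] for e in side
                                              if e not in unknown))
                    unknown.discard(pending[0])
    return counts if not unknown else None  # None: failed to converge
```

A priority queue keyed on blocks that just became solvable would avoid the repeated full sweeps, at the cost of extra bookkeeping.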

There are a number of tricky points:

  • using BlockSet in inlinees is tricky as basic blocks always live in the root compiler's "space"
  • likewise, in the more comprehensive NumSucc, we always use the root compiler instance (mainly so any switch descriptor work happens up at the root) -- one could imagine hiding this detail in the implementation of this machinery instead
  • OSR causes trouble
    • on reconstruction the flowgraph shape doesn't match the base method shape, as we add a block to handle the initial OSR control flow
    • for instrumentation we really need pseudo-edges from patchpoints to method entry, otherwise we will have incomplete counts
    • so for now we don't allow edge instrumentation if OSR is enabled
  • I had to rework how we compute profile scale, because for edge profiling there is no entry block count probe. So scale computation has to wait until we have solved for the entry block count. I'll likely refactor all this again when I do early normalization by marking the return block or pseudo-edge records as special, and summing those early to determine fgCalledCount.
  • there are a number of failure modes possible
    • Running PRI-1 with all this enabled in tiered PGO with strong assertion checks showed a handful of tests ending up with negative edge counts (which we currently just reset to zero), and one test failed to converge. Currently we toss out the PGO data if there's apparent bad code, a schema/graph mismatch, or the solver fails to converge; we proceed if we merely end up with negative counts (which are summarily set to zero). Eventually we may have negative counts trigger a smoothing/reconciliation pass.
    • the failures above did not repro when running tests in isolation. We are using live in-process data so the counts naturally will vary from run to run and there is some dependence of tiering on wall-clock time, so heavily loaded machines (like we see when running tests) may get quite different count data.
  • I added a CI leg for this so the only testing result here is proving that adding this didn't break something else.
  • I also plan to gather some perf data showing the runtime impact of various instrumentation options.

Add a new instrumentation mode that only instruments a subset of the edges in
the control flow graph. This reduces the total number of counters and so has
less compile time and runtime overhead than instrumenting every block.

Add a matching count reconstruction algorithm that recovers the missing edge
counts and all block counts.

See dotnet#46882 for more details on this approach.
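
As a back-of-envelope illustration of the savings (an invented sketch, not measured data): block instrumentation needs one counter per block, while sparse edge instrumentation needs one counter per non-tree edge, and a spanning tree of a connected graph with V nodes has V - 1 edges:

```python
# Probe-count arithmetic for the two instrumentation modes.
def block_probes(num_blocks):
    # Block instrumentation: one counter per basic block.
    return num_blocks

def sparse_edge_probes(num_blocks, num_edges):
    # Sparse edge instrumentation: only non-tree edges get counters;
    # a spanning tree covers num_blocks - 1 of the edges.
    return num_edges - (num_blocks - 1)
```

For the flow graph quoted in the solver logs below (54 blocks, 78 edges, 53 tree edges), that works out to 25 edge probes instead of 54 block probes.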
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2021
AndyAyersMS (Member, Author):

Wrote a simple checker to compare block-probe-based counts with the block counts reconstructed from sparse edge probes. The algorithm is generally doing quite well at arriving at the same normalized block count (absolute counts vary from run to run based on the vagaries of tier1 timing).

Eg in the deserialization benchmark, only 4 methods show mismatches (to two decimal places), and only by 0.01 each time:

normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB02: block 8.56, edge 8.55
normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB04: block 8.56, edge 8.55
normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB05: block 9.56, edge 9.55
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB05: block 0.13, edge 0.12
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB06: block 0.13, edge 0.12
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB07: block 0.13, edge 0.12
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB06: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB08: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB28: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.JsonSerializer:GetMatchingConverter(System.Collections.Generic.IList`1[[Newtonsoft.Json.JsonConverter, Newtonsoft.Json, Version=12.0.0.0, Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed]],System.Type):Newtonsoft.Json.JsonConverter (4a2111e2)[-1] BB03: block 0.01, edge 0
normalized mismatch in Newtonsoft.Json.JsonSerializer:GetMatchingConverter(System.Collections.Generic.IList`1[[Newtonsoft.Json.JsonConverter, Newtonsoft.Json, Version=12.0.0.0, Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed]],System.Type):Newtonsoft.Json.JsonConverter (4a2111e2)[-1] BB05: block 0.01, edge 0
4 mismatched methods out of 1848 total methods
11 mismatched blocks out of 5462 total blocks
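
The checker's comparison can be imagined along these lines (an invented sketch; the function name, data shapes, and two-decimal tolerance are assumptions, not the actual tool):

```python
# Compare each block's normalized count from block probes against the
# count reconstructed from sparse edge probes; flag blocks whose values
# differ after rounding to `places` decimal places.
def find_mismatches(block_counts, edge_counts, places=2):
    mismatches = []
    for block, b in block_counts.items():
        e = edge_counts.get(block, 0.0)
        if round(b, places) != round(e, places):
            mismatches.append((block, round(b, places), round(e, places)))
    return mismatches
```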

Spot checking, the counts also appear to be fairly consistent, so the copying the runtime does before handing data back to the jit seems fairly effective at preventing counter skew.

However, I'm also consistently seeing a convergence failure in System.Text.Unicode.Utf16Utility:GetPointerToFirstInvalidChar so will need to debug why that happens.

AndyAyersMS (Member, Author):

The flowgraph / spanning tree for GetPointerToFirstInvalidChar. Tree edges are in bold; these have unknown counts. Instrumented edges are fainter (return pseudo-edges in orange, lexically backward edges in green, fall-through edges in blue). The solver's visitation order is the culprit: on those long, nearly linear chains (eg BB11-BB28) the solver basically works backwards, and so takes quite a few iterations to solve (it would need about 16 in all).

I may also look into why we are creating linear flow here; it seems like we should be smart enough not to start a new basic block for non-joins...

[image: flow graph / spanning tree for GetPointerToFirstInvalidChar]

BruceForstall (Member) left a comment:

Looking good

BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate
// Dominator) used to compute the dominance tree.
void* bbProbeList; // Used early on by fgInstrument
void* bbInfo; // Used early on by fgIncorporate
Member:

nit: Could you use a more specific name than Info?

Member Author:

Renamed.

@@ -450,7 +453,10 @@ void Compiler::fgReplaceSwitchJumpTarget(BasicBlock* blockSwitch, BasicBlock* ne
{
// Remove the old edge [oldTarget from blockSwitch]
//
fgRemoveAllRefPreds(oldTarget, blockSwitch);
if (fgComputePredsDone)
Member:

Update the header comment that says "We also must update the predecessor lists for 'oldTarget' and 'newPred'."?

It seems like our basic manipulation routines need to document if they can run without preds, maybe with an appropriate assert if they can't.

Member Author:

Missed this one in my first review feedback commit.

}
case BBJ_EHFILTERRET:
{
// Ignore filters; they are single block and only
Member:

There is no limitation that a filter be a single basic block

Member Author:

Not sure why I thought this (maybe a language restriction?) We can handle filter like other handler regions, then.

// Since our keying scheme is IL based and this
// block has no IL offset, we'd need to invent
// some new keying scheme. For now we just
// ignore this (rare) case.
Member:

Are you saying it's a rare case that a finally has a throw (for instance) and hence doesn't return? I suppose that's true, though certainly possible.

Member Author:

Yes. We're not going to get an accurate picture of flow in methods that throw, as the point at which control resumes is ambiguous. But we likely don't need an accurate picture; if a method throws with any frequency then it's unlikely to be performance sensitive.

//
// Note if the throw is caught locally this will over-state the profile
// count for method entry. But we likely don't care too much about
// profiles for methods that throw lots of exceptions.
Member:

I guess you don't want to (or can't) add an edge to all possible in-function handlers as well as a pseudo-edge to the method entry?

Member Author:

If there are multiple out-edges we'll end up wanting to instrument some of them, which we can't do. In the paper they suggest (in the context of longjmp) fixing this at runtime by having the runtime increment the appropriate edge counter once it resolves the stack target. Again, not an option for us. So, as noted above, we'll just not worry about these cases, as they only arise when methods throw.

{
continue;
sourceOffset = block->bbNum | IL_OFFSETX_CALLINSTRUCTIONBIT;
Member:

Seems like you need to define your own high bit here, and maybe even accessors/setters, for extreme clarity.

//
// Solving is done in four steps:
// * Prepare
// * walk the blocks settting up per block info, and a map
Member:

typo: settting

// add in an unknown count edge.
// * Solve
// * repeatedly walk blocks, looking for blocks where all
// incoming our outgoing edges are known. This determines
Member:

our => or?

private:
Compiler* m_comp;
CompAllocator m_allocator;
uint32_t m_blocks;
Member:

why are you using uint32_t / int32_t here? Can't we use the normal C++ types?

Member Author:

Sure. There will be a bit left over, as some of it comes from the instrumentation schema declaration in corjit.h.

union {
BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate
// Dominator) used to compute the dominance tree.
void* bbProbeList; // Used early on by fgInstrument
Member:

Do you have something that nulls these out before the phases that use them? Or initializes them before they are read? I see somewhere that you assert bbInfo != nullptr, for instance.

Member Author:

For instrumentation this happens in the VisitBlock method during the tree walk. For reconstruction the value is set during Prepare before it is read.

AndyAyersMS (Member, Author):

Realized that the ideal solver order is likely reverse post order over the depth first spanning tree. We could either materialize that order somehow or perhaps even integrate the solve and traversal passes together.

For now, a simpler fix is to just solve for the blocks from last to first, instead of first to last, as most edges are lexically forward. With that change the example above (GetPointerToFirstInvalidChar) converges in 2 passes:

;;; before (front to back)
Solver: 54 blocks, 54 unknown; 78 edges, 53 unknown, 0 zero (and so ignored)
Pass [1]: 54 unknown blocks, 53 unknown edges
Pass [2]: 38 unknown blocks, 38 unknown edges
...
Pass [10]: 8 unknown blocks, 8 unknown edges
Solver: failed to converge in 10 passes, 6 blocks and 6 edges remain unsolved
;;; after (back to front)
Solver: 54 blocks, 54 unknown; 78 edges, 53 unknown, 0 zero (and so ignored)
Pass [1]: 54 unknown blocks, 53 unknown edges
Pass [2]: 16 unknown blocks, 15 unknown edges
Solver: converged in 2 passes
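
The effect of sweep direction is easy to reproduce on a toy model (invented, not jit code): a chain of unknowns where each u[i] is determined by u[i+1] and only the last value is known, mimicking a lexically forward chain solved by flow balance. A front-to-back sweep resolves one new value per pass; a back-to-front sweep resolves the whole chain in one pass:

```python
# Toy model of solver sweep direction: count how many sweeps it takes to
# solve a backward-dependent chain of n values, visiting indices in
# `order` on each sweep. u[n-1] starts known; u[i] needs u[i+1] known.
def sweeps_to_solve(n, order):
    known = [False] * n
    known[-1] = True
    sweeps = 0
    while not all(known):
        sweeps += 1
        for i in order:
            if not known[i] and known[i + 1]:
                known[i] = True
    return sweeps
```

On a 16-element chain (echoing the "about 16" estimate above), front to back needs 15 sweeps while back to front needs just 1.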

AndyAyersMS (Member, Author):

One last note on that example. In my test run most of the blocks had zero count, because the method has an early return if all the strings it sees are ASCII, and that's what happened in the test. So the solver is doing a fair amount of work just to set most blocks to zero:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight     IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           16893. 16893    [000..016)-> BB03 ( cond )                     IBC 
BB02 [0001]  1                             0        0    [016..01C)-> BB04 (always)                     rare IBC 
BB03 [0002]  1                           16893. 16893    [01C..01D)                                     IBC 
BB04 [0003]  2                           16893. 16893    [01D..035)-> BB06 ( cond )                     IBC 
BB05 [0004]  1                           16893. 16893    [035..03E)-> BB07 (always)                     IBC 
BB06 [0005]  1                             0        0    [03E..03F)                                     rare IBC 
BB07 [0006]  2                           16893. 16893    [03F..056)-> BB09 ( cond )                     IBC 
BB08 [0007]  1                           16893. 16893    [056..05F)        (return)                     IBC 
BB09 [0008]  1                             0        0    [05F..06E)-> BB29 ( cond )                     rare IBC 
BB10 [0009]  1                             0        0    [06E..079)-> BB53 ( cond )                     rare IBC 
BB11 [0010]  1                             0        0    [079..0AA)-> BB12 ( cond )                     rare IBC 
... remaining blocks all zero ...

There is an optimization I contemplated that leaves zero-weight edges out of the solve graph, which might also reduce the amount of work needed to reach a solution. I left it out because, if we subsequently want to deduce edge likelihoods or perhaps run count synthesis, it seemed prudent to have edge objects for every successor edge.

AndyAyersMS (Member, Author):

@BruceForstall think I covered most of your feedback.

AndyAyersMS (Member, Author):

/azp run runtime-jit-experimental

azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS (Member, Author):

Mono build failure:

##[error]/Users/runner/.dotnet/sdk/5.0.102/NuGet.RestoreEx.targets(19,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to download package 'Microsoft.DotNet.GenAPI.6.0.0-beta.21105.5' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/1a5f89f6-d8da-4080-b15f-242650c914a8/nuget/v3/flat2/microsoft.dotnet.genapi/6.0.0-beta.21105.5/microsoft.dotnet.genapi.6.0.0-beta.21105.5.nupkg'.

@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Feb 8, 2021
@AndyAyersMS AndyAyersMS mentioned this pull request Feb 8, 2021
54 tasks
BruceForstall (Member):

@BruceForstall think I covered most of your feedback.

Thanks. Presumably when you take this out of "Draft" mode you'll ask for "final" reviews.

AndyAyersMS (Member, Author):

No longer a draft PR.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review February 8, 2021 21:48
AndyAyersMS (Member, Author):

OSX build also hitting nuget issues:

/Users/runner/.dotnet/sdk/5.0.102/NuGet.RestoreEx.targets(19,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to retrieve information about 'Microsoft.DotNet.Build.Tasks.Installers

Am planning on ignoring this and the mono failure.

Also expect jit-experimental to fail overall with OSR/EhWriteThru issues. None of the PGO legs should fail.

AndyAyersMS (Member, Author):

Windows jit-experimental had only the "expected failures" -- waiting on Linux. Also, the failures noted in #47930 are fixed.

AndyAyersMS (Member, Author):

Windows jit-experimental had only the "expected failures"

Ditto for Linux.

@AndyAyersMS AndyAyersMS merged commit 95f8f00 into dotnet:master Feb 9, 2021
@AndyAyersMS AndyAyersMS deleted the SpanningTreeInstrumentation branch February 9, 2021 01:59
@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2021