
Spanning tree instrumentation #47959

Merged (7 commits) Feb 9, 2021

Conversation

AndyAyersMS (Member) commented Feb 6, 2021:

@dotnet/jit-contrib think this is ready for some review.

See #46882 for notes on the overall approach.

This turned out to be more involved than I'd expected, between the work to uncover and then cope with (or remove) complications that come from running analyses this early in the jit's phase pipeline, and the need to navigate between several different representations of the data.

A rough guide to reviewing the code:

  • pgo.cpp -- possible fix for #47930 (PGO: seeing nonsensical counts and corrupted type handles)
  • corjit.h, PgoFormat.cs -- new record type in PGO schemas for edge counts
  • block.h -- another scratch field, used by reconstruction
  • compiler.h -- made some data public so the new classes in fgprofile can have access; new methods
  • compphases.h, phase.cpp, compiler.cpp -- new phase to allow "early" preparation for instrumentation (that is, before importation). Edge profiling needs this so it sees the "same" flow graph that profile incorporation sees.
  • jitconfigvalues.h -- new option to enable sparse edge instrumentation
  • fgbasic.cpp -- allow fgSplitEdge to be called before we have pred lists, defer computation of inlinee scale until profile incorporation
  • fgprofile.cpp -- the crux of the change. Two large clumps (instrumentation and reconstruction) with a shared utility (spanning tree walker), and various supporting changes to enable and cope with the new mode.

For sparse edge profiling, both instrumentation and reconstruction rely on being able to create a spanning tree, so this part is factored out into a walker/visitor pattern. The tree is built by a DFS that is mostly conventional but has a few odd aspects:

  • we try to bias the DFS into visiting certain successors first, as an attempt at creating a "maximum weight" spanning tree
  • we add some pseudo-edges from blocks with no successors to the method and/or handler entries. Edge instrumentation generally prefers to instrument region exits rather than region entries, and these pseudo-edges are what trigger that preference.
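
The biased DFS can be sketched roughly as follows. This is a minimal Python illustration, not the jit's C++; the function name, the graph representation, and the weight map are all invented for the example:

```python
# Sketch: build a spanning tree by DFS, visiting heavier successors first
# so the tree tends toward "maximum weight". The non-tree edges (the ones
# that get instrumented) then tend to be the cooler edges.
def build_spanning_tree(succs, weight, entry):
    """succs:  block -> list of successor blocks
       weight: (src, dst) -> estimated edge weight (e.g. static heuristics)
       Returns the set of tree edges."""
    tree_edges = set()
    visited = {entry}

    def dfs(block):
        # Bias the DFS: try the heaviest successor first.
        for succ in sorted(succs.get(block, []),
                           key=lambda s: weight[(block, s)], reverse=True):
            if succ not in visited:
                visited.add(succ)
                tree_edges.add((block, succ))
                dfs(succ)

    dfs(entry)
    return tree_edges
```

On a diamond-shaped graph where the A→B→D path is hot, the hot edges land in the tree and only the cold edges would need probes.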

The reconstruction algorithm uses a fair amount of auxiliary data and maps. There may be less costly ways to encode some of this; suggestions welcome. It also repeatedly iterates over blocks when ideally it would be using some kind of priority queue or other more focused organization of the remaining work.
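
In outline, the reconstruction works like this. Below is a hedged Python sketch under simplifying assumptions (plain dicts, the naive repeated-sweep organization described above, and an invented sweep bound); the actual jit code is C++ and considerably more involved:

```python
# Sketch of count reconstruction by flow conservation: instrumented
# (non-tree) edge counts are known; each sweep solves any edge that is
# the only unknown incident edge on one side of a block whose other
# side is fully known (in-count == out-count == block count).
def solve_edge_counts(blocks, preds, succs, known, max_sweeps=10):
    counts = dict(known)
    unknown = ({(p, b) for b in blocks for p in preds.get(b, [])} |
               {(b, s) for b in blocks for s in succs.get(b, [])}) - set(counts)
    for _ in range(max_sweeps):
        if not unknown:
            break  # converged
        for b in blocks:
            ins = [(p, b) for p in preds.get(b, [])]
            outs = [(b, s) for s in succs.get(b, [])]
            for side, other in ((ins, outs), (outs, ins)):
                pending = [e for e in side if e in unknown]
                if (len(pending) == 1 and other and
                        not any(e in unknown for e in other)):
                    # The flow balance determines the one unknown edge.
                    counts[pending[0]] = (sum(counts[e] for e in other) -
                                          sum(counts[e] for e in side
                                              if e not in unknown))
                    unknown.discard(pending[0])
    return counts if not unknown else None  # None: failed to converge
```

A priority queue keyed on blocks that just became solvable would avoid the repeated full sweeps, at the cost of extra bookkeeping.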

There are a number of tricky points:

  • using BlockSet in inlinees is tricky as basic blocks always live in the root compiler's "space"
  • likewise, in the more comprehensive NumSucc, we always use the root compiler instance (mainly so any switch descriptor work happens up at the root) -- one could imagine hiding this detail in the implementation of this machinery instead
  • OSR causes trouble
    • on reconstruction the flowgraph shape doesn't match the base method shape, as we add a block to handle the initial OSR control flow
    • for instrumentation we really need pseudo-edges from patchpoints to method entry, otherwise we will have incomplete counts
    • so for now we don't allow edge instrumentation if OSR is enabled
  • I had to rework how we compute profile scale, because for edge profiling there is no entry block count probe. So scale computation has to wait until we have solved for the entry block count. I'll likely refactor all this again when I do early normalization by marking the return block or pseudo-edge records as special, and summing those early to determine fgCalledCount.
  • there are a number of failure modes possible
    • Running PRI-1 with all this enabled in tiered PGO with strong assertion checks showed a handful of tests ending up with negative edge counts (which we currently just reset to zero), and one test failed to converge. Currently we toss out the PGO data if there's apparent bad code, a schema/graph mismatch, or the solver fails to converge; we proceed if we merely end up with negative counts (which are summarily set to zero). Eventually we may have negative counts trigger a smoothing/reconciliation pass.
    • the failures above did not repro when running tests in isolation. We are using live in-process data so the counts naturally will vary from run to run and there is some dependence of tiering on wall-clock time, so heavily loaded machines (like we see when running tests) may get quite different count data.
  • I added a CI leg for this so the only testing result here is proving that adding this didn't break something else.
  • I also plan to gather some perf data showing the runtime impact of various instrumentation options.

Add a new instrumentation mode that only instruments a subset of the edges in
the control flow graph. This reduces the total number of counters and so has
less compile time and runtime overhead than instrumenting every block.

Add a matching count reconstruction algorithm that recovers the missing edge
counts and all block counts.

See dotnet#46882 for more details on this approach.
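
As a back-of-envelope illustration of the savings (an invented sketch, not measured data): block instrumentation needs one counter per block, while sparse edge instrumentation needs one counter per non-tree edge, and a spanning tree of a connected graph with V nodes has V - 1 edges:

```python
# Probe-count arithmetic for the two instrumentation modes.
def block_probes(num_blocks):
    # Block instrumentation: one counter per basic block.
    return num_blocks

def sparse_edge_probes(num_blocks, num_edges):
    # Sparse edge instrumentation: only non-tree edges get counters;
    # a spanning tree covers num_blocks - 1 of the edges.
    return num_edges - (num_blocks - 1)
```

For the flow graph quoted in the solver logs below (54 blocks, 78 edges, 53 tree edges), that works out to 25 edge probes instead of 54 block probes.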
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2021
AndyAyersMS (Member, Author):

Wrote a simple checker to compare block-probe-based counts with the block counts reconstructed from sparse edge probes. The algorithm is generally doing quite well at arriving at the same normalized block count (absolute counts vary from run to run based on the vagaries of tier1 timing).

Eg in the deserialization benchmark, only 4 methods show mismatches (to two decimal places), and only by 0.01 each time:

normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB02: block 8.56, edge 8.55
normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB04: block 8.56, edge 8.55
normalized mismatch in System.Xml.XmlUTF8TextReader:ReadEndElement():this (bb07f8e3)[-1] BB05: block 9.56, edge 9.55
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB05: block 0.13, edge 0.12
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB06: block 0.13, edge 0.12
normalized mismatch in System.Xml.XmlBaseReader:IsStartElement():bool:this (8192e4a7)[-1] BB07: block 0.13, edge 0.12
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB06: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB08: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.Bson.BsonDataReader:ReadType(byte):this (2d68de06)[-1] BB28: block 0.12, edge 0.13
normalized mismatch in Newtonsoft.Json.JsonSerializer:GetMatchingConverter(System.Collections.Generic.IList`1[[Newtonsoft.Json.JsonConverter, Newtonsoft.Json, Version=12.0.0.0, Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed]],System.Type):Newtonsoft.Json.JsonConverter (4a2111e2)[-1] BB03: block 0.01, edge 0
normalized mismatch in Newtonsoft.Json.JsonSerializer:GetMatchingConverter(System.Collections.Generic.IList`1[[Newtonsoft.Json.JsonConverter, Newtonsoft.Json, Version=12.0.0.0, Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed]],System.Type):Newtonsoft.Json.JsonConverter (4a2111e2)[-1] BB05: block 0.01, edge 0
4 mismatched methods out of 1848 total methods
11 mismatched blocks out of 5462 total blocks
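
The checker's comparison can be imagined along these lines (an invented sketch; the function name, data shapes, and two-decimal tolerance are assumptions, not the actual tool):

```python
# Compare each block's normalized count from block probes against the
# count reconstructed from sparse edge probes; flag blocks whose values
# differ after rounding to `places` decimal places.
def find_mismatches(block_counts, edge_counts, places=2):
    mismatches = []
    for block, b in block_counts.items():
        e = edge_counts.get(block, 0.0)
        if round(b, places) != round(e, places):
            mismatches.append((block, round(b, places), round(e, places)))
    return mismatches
```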

Spot checking, the counts also appear to be fairly consistent, so the copying the runtime does before handing data back to the jit seems fairly effective at preventing counter skew.

However, I'm also consistently seeing a convergence failure in System.Text.Unicode.Utf16Utility:GetPointerToFirstInvalidChar so will need to debug why that happens.

AndyAyersMS (Member, Author):

The flowgraph / spanning tree for GetPointerToFirstInvalidChar. Tree edges are in bold; these have unknown counts. Instrumented edges are fainter (return pseudo-edges in orange, lexically backward edges in green, fall-through edges in blue). The solver's visitation order is the culprit: on those long, nearly linear chains (eg BB11-BB28) the solver basically works backwards, and so takes quite a few iterations to solve (it would need about 16 in all).

I may also look into why we are creating linear flow here; it seems like we should be smart enough not to start a new basic block for non-joins...

[image: flow graph / spanning tree for GetPointerToFirstInvalidChar]

BruceForstall (Member) left a comment:

Looking good

BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate
// Dominator) used to compute the dominance tree.
void* bbProbeList; // Used early on by fgInstrument
void* bbInfo; // Used early on by fgIncorporate
Member:

nit: Could you use a more specific name than Info?

Member Author:

Renamed.

@@ -450,7 +453,10 @@ void Compiler::fgReplaceSwitchJumpTarget(BasicBlock* blockSwitch, BasicBlock* ne
{
// Remove the old edge [oldTarget from blockSwitch]
//
fgRemoveAllRefPreds(oldTarget, blockSwitch);
if (fgComputePredsDone)
Member:

Update the header comment that says "We also must update the predecessor lists for 'oldTarget' and 'newPred'."?

It seems like our basic manipulation routines need to document if they can run without preds, maybe with an appropriate assert if they can't.

Member Author:

Missed this one in my first review feedback commit.

}
case BBJ_EHFILTERRET:
{
// Ignore filters; they are single block and only
Member:

There is no limitation that a filter be a single basic block

Member Author:

Not sure why I thought this (maybe a language restriction?) We can handle filter like other handler regions, then.

// Since our keying scheme is IL based and this
// block has no IL offset, we'd need to invent
// some new keying scheme. For now we just
// ignore this (rare) case.
Member:

Are you saying it's a rare case that a finally has a throw (for instance) and hence doesn't return? I suppose that's true, though certainly possible.

Member Author:

Yes. We're not going to get an accurate picture of flow in methods that throw, as the point at which control resumes is ambiguous. But we likely don't need an accurate picture; if a method throws with any frequency then it's unlikely to be performance sensitive.

//
// Note if the throw is caught locally this will over-state the profile
// count for method entry. But we likely don't care too much about
// profiles for methods that throw lots of exceptions.
Member:

I guess you don't want to (or can't) add an edge to all possible in-function handlers as well as a pseudo-edge to the method entry?

Member Author:

If there are multiple out-edges we'll end up wanting to instrument some of them, which we can't do. In the paper they suggest (in the context of longjmp) fixing this at runtime by having the runtime increment the appropriate edge counter once it resolves the stack target. Again, not an option for us. So, as noted above, we'll just not worry about these cases, as they only arise when methods throw.

{
continue;
sourceOffset = block->bbNum | IL_OFFSETX_CALLINSTRUCTIONBIT;
Member:

Seems like you need to define your own high bit here, and maybe even accessors/setters, for extreme clarity.

//
// Solving is done in four steps:
// * Prepare
// * walk the blocks settting up per block info, and a map
Member:

typo: settting

// add in an unknown count edge.
// * Solve
// * repeatedly walk blocks, looking for blocks where all
// incoming our outgoing edges are known. This determines
Member:

our => or?

private:
Compiler* m_comp;
CompAllocator m_allocator;
uint32_t m_blocks;
Member:

why are you using uint32_t / int32_t here? Can't we use the normal C++ types?

Member Author:

Sure. There will be a bit left over, as some of it comes from the instrumentation schema declaration in corjit.h.

union {
BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate
// Dominator) used to compute the dominance tree.
void* bbProbeList; // Used early on by fgInstrument
Member:

Do you have something that nulls these out before the phases that use them? Or initializes them before they are read? I see somewhere that you assert bbInfo != nullptr, for instance.

Member Author:

For instrumentation this happens in the VisitBlock method during the tree walk. For reconstruction the value is set during Prepare before it is read.

AndyAyersMS (Member, Author):

Realized that the ideal solver order is likely reverse post order over the depth first spanning tree. We could either materialize that order somehow or perhaps even integrate the solve and traversal passes together.

For now, a simpler fix is to just solve for the blocks from last to first, instead of first to last, as most edges are lexically forward. With that change the example above (GetPointerToFirstInvalidChar) converges in 2 passes:

;;; before (front to back)
Solver: 54 blocks, 54 unknown; 78 edges, 53 unknown, 0 zero (and so ignored)
Pass [1]: 54 unknown blocks, 53 unknown edges
Pass [2]: 38 unknown blocks, 38 unknown edges
...
Pass [10]: 8 unknown blocks, 8 unknown edges
Solver: failed to converge in 10 passes, 6 blocks and 6 edges remain unsolved
;;; after (back to front)
Solver: 54 blocks, 54 unknown; 78 edges, 53 unknown, 0 zero (and so ignored)
Pass [1]: 54 unknown blocks, 53 unknown edges
Pass [2]: 16 unknown blocks, 15 unknown edges
Solver: converged in 2 passes
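
The effect of sweep direction is easy to reproduce on a toy model (invented, not jit code): a chain of unknowns where each u[i] is determined by u[i+1] and only the last value is known, mimicking a lexically forward chain solved by flow balance. A front-to-back sweep resolves one new value per pass; a back-to-front sweep resolves the whole chain in one pass:

```python
# Toy model of solver sweep direction: count how many sweeps it takes to
# solve a backward-dependent chain of n values, visiting indices in
# `order` on each sweep. u[n-1] starts known; u[i] needs u[i+1] known.
def sweeps_to_solve(n, order):
    known = [False] * n
    known[-1] = True
    sweeps = 0
    while not all(known):
        sweeps += 1
        for i in order:
            if not known[i] and known[i + 1]:
                known[i] = True
    return sweeps
```

On a 16-element chain (echoing the "about 16" estimate above), front to back needs 15 sweeps while back to front needs just 1.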

AndyAyersMS (Member, Author):

One last note on that example. In my test run most of the blocks had zero count, because the method has an early return if all the strings it sees are ASCII, and that's what happened in the test. So the solver is doing a fair amount of work just to set most blocks to zero:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight     IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           16893. 16893    [000..016)-> BB03 ( cond )                     IBC 
BB02 [0001]  1                             0        0    [016..01C)-> BB04 (always)                     rare IBC 
BB03 [0002]  1                           16893. 16893    [01C..01D)                                     IBC 
BB04 [0003]  2                           16893. 16893    [01D..035)-> BB06 ( cond )                     IBC 
BB05 [0004]  1                           16893. 16893    [035..03E)-> BB07 (always)                     IBC 
BB06 [0005]  1                             0        0    [03E..03F)                                     rare IBC 
BB07 [0006]  2                           16893. 16893    [03F..056)-> BB09 ( cond )                     IBC 
BB08 [0007]  1                           16893. 16893    [056..05F)        (return)                     IBC 
BB09 [0008]  1                             0        0    [05F..06E)-> BB29 ( cond )                     rare IBC 
BB10 [0009]  1                             0        0    [06E..079)-> BB53 ( cond )                     rare IBC 
BB11 [0010]  1                             0        0    [079..0AA)-> BB12 ( cond )                     rare IBC 
... remaining blocks all zero ...

There is an optimization I contemplated that leaves zero-weight edges out of the solve graph, which might also reduce the amount of work needed to reach a solution. I left it out because, if we subsequently want to deduce edge likelihoods or perhaps run count synthesis, it seemed prudent to have edge objects for every successor edge.

AndyAyersMS (Member, Author):

@BruceForstall think I covered most of your feedback.

AndyAyersMS (Member, Author):

/azp run runtime-jit-experimental

azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS (Member, Author):

Mono build failure:

##[error]/Users/runner/.dotnet/sdk/5.0.102/NuGet.RestoreEx.targets(19,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to download package 'Microsoft.DotNet.GenAPI.6.0.0-beta.21105.5' from 'https://pkgs.dev.azure.com/dnceng/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_packaging/1a5f89f6-d8da-4080-b15f-242650c914a8/nuget/v3/flat2/microsoft.dotnet.genapi/6.0.0-beta.21105.5/microsoft.dotnet.genapi.6.0.0-beta.21105.5.nupkg'.

@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Feb 8, 2021
@AndyAyersMS AndyAyersMS mentioned this pull request Feb 8, 2021
54 tasks
BruceForstall (Member):

@BruceForstall think I covered most of your feedback.

Thanks. Presumably when you take this out of "Draft" mode you'll ask for "final" reviews.

AndyAyersMS (Member, Author):

No longer a draft PR.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review February 8, 2021 21:48
AndyAyersMS (Member, Author):

OSX build also hitting nuget issues:

/Users/runner/.dotnet/sdk/5.0.102/NuGet.RestoreEx.targets(19,5): error : (NETCORE_ENGINEERING_TELEMETRY=Restore) Failed to retrieve information about 'Microsoft.DotNet.Build.Tasks.Installers

Am planning on ignoring this and the mono failure.

Also expect jit-experimental to fail overall with OSR/EhWriteThru issues. None of the PGO legs should fail.

AndyAyersMS (Member, Author):

Windows jit-experimental had only the "expected failures" -- waiting on Linux. Also, the failures noted in #47930 are fixed.

AndyAyersMS (Member, Author):

Windows jit-experimental had only the "expected failures"

Ditto for Linux.

@AndyAyersMS AndyAyersMS merged commit 95f8f00 into dotnet:master Feb 9, 2021
@AndyAyersMS AndyAyersMS deleted the SpanningTreeInstrumentation branch February 9, 2021 01:59
@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2021