Always create loop pre headers #83956

Merged
merged 14 commits into from
Apr 6, 2023

Conversation

BruceForstall
Member

@BruceForstall BruceForstall commented Mar 27, 2023

As part of finding natural loops and creating the loop table, create a loop pre-header for every loop. This
simplifies a lot of downstream phases, as the loop pre-header will be guaranteed to exist, and will already
exist in the dominator tree.
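
To make the idea concrete, here is a minimal sketch of pre-header insertion on a toy flow graph. It is not the JIT's `fgCreateLoopPreHeader`; the `ToyFlowGraph` type, the `CreatePreHeader` function, and the `inLoop` membership vector are hypothetical names invented for this illustration. The sketch assumes loop membership is already known and simply redirects every non-back-edge into the header through a new empty block.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy flow graph (hypothetical): blocks are indices, edges are kept as
// successor and predecessor lists.
struct ToyFlowGraph
{
    std::vector<std::vector<int>> succs;
    std::vector<std::vector<int>> preds;

    int AddBlock()
    {
        succs.emplace_back();
        preds.emplace_back();
        return static_cast<int>(succs.size()) - 1;
    }

    void AddEdge(int from, int to)
    {
        succs[from].push_back(to);
        preds[to].push_back(from);
    }
};

// Insert a new empty block in front of `header`. Every edge into `header`
// from outside the loop is redirected to the new block; back edges from
// inside the loop are left alone. `inLoop[b]` says whether block b is in
// the loop (the newly added block is not).
int CreatePreHeader(ToyFlowGraph& g, int header, const std::vector<bool>& inLoop)
{
    assert(header >= 0 && header < static_cast<int>(g.succs.size()));
    const int preHeader = g.AddBlock();

    std::vector<int> keptPreds;
    for (int pred : g.preds[header])
    {
        const bool isBackEdge = (pred < static_cast<int>(inLoop.size())) && inLoop[pred];
        if (isBackEdge)
        {
            keptPreds.push_back(pred);
            continue;
        }
        // Retarget the entry edge pred -> header to pred -> preHeader.
        std::replace(g.succs[pred].begin(), g.succs[pred].end(), header, preHeader);
        g.preds[preHeader].push_back(pred);
    }

    g.preds[header] = keptPreds;
    g.AddEdge(preHeader, header); // the pre-header's only successor is the header
    return preHeader;
}
```

After the call, the header's only predecessor from outside the loop is the pre-header, which is the property later phases such as hoisting can rely on.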

Introduce code to preserve an empty pre-header block through the optimization phases.

Remove now unnecessary code in hoisting and elsewhere.

Fixes #77033, #62665

@ghost ghost assigned BruceForstall Mar 27, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 27, 2023
@ghost

ghost commented Mar 27, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Author: BruceForstall
Assignees: BruceForstall
Labels:

area-CodeGen-coreclr

Milestone: -

@BruceForstall
Member Author

BruceForstall commented Mar 27, 2023

[This comment is from a draft PR]

Diffs

Overall, the diffs are a size improvement. There is a TP regression of about 0.1% to 0.6%. Perhaps because more optimizations kick in? Or because there are more blocks to process? Even though the overall size diff is an improvement (e.g., when additional redundant block opts kick in more often), there are cases where it regresses, e.g., when more loop cloning occurs.

@BruceForstall
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress

@dotnet dotnet deleted a comment from azure-pipelines bot Mar 29, 2023
@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@BruceForstall
Member Author

There are diffs due to LSRA basing its traversal order on bbNum ordering, and the bbNum order changes with pre-headers created early. There are a few small diffs because the order of nested child loop pre-header blocks processed during hoisting is different than before. Because the pre-headers exist early, some additional downstream optimizations kick in. E.g., I saw additional cases of redundant branch opts.

@BruceForstall BruceForstall marked this pull request as ready for review March 29, 2023 22:41
@BruceForstall BruceForstall changed the title Always create loop pre header Always create loop pre headers Mar 29, 2023
@BruceForstall
Member Author

@AndyAyersMS PTAL
cc @dotnet/jit-contrib

defExec.Reset();
preHeadersList = existingPreHeaders;
defExec.Pop(defExec.Height() - childLoopPreHeaders);
assert(defExec.Height() == childLoopPreHeaders);
Member Author

fwiw, this if appears to be dead code. It's never hit in SPMI anyway. And that makes sense: walking up the immediate dominators from the single loop exit block (or any loop exit block, for that matter, though we don't track non-single-exit blocks) should always reach the loop entry block. Or, in other words, the (single) loop entry should dominate the (single) loop exit.
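
The dominance claim can be illustrated with a small sketch that walks an immediate-dominator chain. This is a toy model, not the JIT's `fgDominate`: the `idom` array and the `Dominates` function are assumptions made for illustration, with the root block recorded as its own immediate dominator.

```cpp
#include <cassert>
#include <vector>

// Returns true if block `a` dominates block `b`, determined by walking up
// b's chain of immediate dominators. idom[n] is the immediate dominator of
// block n; the root block is its own immediate dominator.
bool Dominates(const std::vector<int>& idom, int a, int b)
{
    assert(a >= 0 && a < static_cast<int>(idom.size()));
    assert(b >= 0 && b < static_cast<int>(idom.size()));

    int cur = b;
    while (true)
    {
        if (cur == a)
        {
            return true; // reached `a` on the way up
        }
        if (idom[cur] == cur)
        {
            return false; // hit the root without meeting `a`
        }
        cur = idom[cur];
    }
}
```

With blocks 0 (root), 1 (loop entry), 2 (body), and 3 (exit) and `idom = {0, 0, 1, 2}`, walking up from the exit reaches the entry, so `Dominates(idom, 1, 3)` is true. A block whose dominator chain bypasses the entry would make the walk reach the root instead.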

Member

Is it worth adding assert(false) here, running SPMI for all configurations, and getting rid of it if we don't hit it?

Member Author

I ran an experiment and we actually do hit this case in a very small number of x86 tests. We should probably augment our loop recognition to reject loops in which the "exit" block of the loop is an EH handler. Presumably if the loop has a normal exit as well as an EH exit, on x86 we won't get here because we only get here for "single exit" loops. So it requires an "infinite" loop with the only exit being from the EH handler.

Opened #84222 to track.

@BruceForstall
Member Author

Diffs

There's an overall significant size improvement, and also a non-trivial TP regression of up to ~0.5%. I presume this is due to (1) creating pre-headers where we didn't before, (2) maintaining and processing extra blocks, and (3) the cost of increased downstream optimization phases in the presence of the additional block, e.g., larger bit-vectors, more blocks in dominators, etc.

@kunalspathak
Member

Could you paste some of the before vs. after diffs from the "hoisting from nested loop" examples I had in #68061?

@BruceForstall
Member Author

outerloop failed with 2 known issues: R2R-CG2 test failures in Loader\classloader\TypeInitialization\CctorsWithSideEffects\CctorForWrite\CctorForWrite.cmd (#84007); one config failed baseservices/threading/regressions/2164/foreground-shutdown/foreground-shutdown.cmd (#83658).

@BruceForstall
Member Author

> Could you paste some of the before vs. after diffs from the "hoisting from nested loop" examples I had in #68061?

I'll look into that.

Note that I locally diffed this change at two points: before removing the special hoisting pre-header handling, and after that handling was mostly removed and replaced with different code that adds child loop pre-headers to the blocks to consider. There were very few diffs (on win-x64): mostly a few reorderings, because the blocks are processed in a slightly different order and so code gets hoisted in a different order; and one case where a slightly different set of things got hoisted because we hoisted in a different order and exceeded the hoisting budget before getting to everything.

@AndyAyersMS
Member

> There's an overall significant size improvement, and also a non-trivial TP regression of up to ~0.5%. I presume this is due to (1) creating pre-headers where we didn't before, (2) maintaining and processing extra blocks, and (3) the cost of increased downstream optimization phases in the presence of the additional block, e.g., larger bit-vectors, more blocks in dominators, etc.

I am a bit surprised it costs this much. Maybe look into the TP costs via the more fine-grained profiling via PIN?

Member

@AndyAyersMS AndyAyersMS left a comment

Overall, the code looks good. Left a few comments on comments.

I think you should look more closely at the TP costs. Maybe this is exposing some poorly scaling algorithm somewhere?

src/coreclr/jit/fgopt.cpp (outdated, resolved)
src/coreclr/jit/fgopt.cpp (resolved)
src/coreclr/jit/loopcloning.cpp (resolved)
src/coreclr/jit/loopcloning.cpp (outdated, resolved)
src/coreclr/jit/optimizer.cpp (outdated, resolved)
@BruceForstall
Member Author

The per-function PIN diffs (for the win-x64 benchmarks collection, which has a TP regression of 0.56% with this PR) are interesting: they show all kinds of effects of having more basic blocks (ignore the fgDominate change: I changed the signature, so there's a corresponding improvement at the bottom). fgComputeReachabilitySets and fgComputeDoms are the biggest regressions. fgComputeReachabilitySets in particular iterates over the block list 2+ times, but since it iterates to a fixed point, it could be many more than 2 times. fgComputeDoms also iterates over the blocks multiple times and iterates to a fixed point.
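
As a rough illustration of why a fixed-point pass scales with block count and block order, here is a generic reachability sketch. It is not the JIT's `fgComputeReachabilitySets`; the `ComputeReachability` function, the 64-block limit, and the `uint64_t` bit sets are simplifying assumptions. It returns the number of full passes over the block list so the "2+ times" behavior is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes, for each block b, the set of blocks that can reach b
// (bit i of reach[b] set => block i reaches b). Iterates over the block
// list until no set changes, and returns how many passes that took.
// Toy limit: at most 64 blocks, so one uint64_t holds a set.
int ComputeReachability(const std::vector<std::vector<int>>& preds,
                        std::vector<uint64_t>& reach)
{
    const size_t n = preds.size();
    assert(n <= 64);

    reach.assign(n, 0);
    for (size_t b = 0; b < n; b++)
    {
        reach[b] = uint64_t(1) << b; // every block reaches itself
    }

    int  passes = 0;
    bool changed;
    do
    {
        changed = false;
        passes++;
        for (size_t b = 0; b < n; b++)
        {
            uint64_t updated = reach[b];
            for (int p : preds[b])
            {
                updated |= reach[p]; // whatever reaches a predecessor reaches b
            }
            if (updated != reach[b])
            {
                reach[b] = updated;
                changed  = true;
            }
        }
    } while (changed);

    return passes;
}
```

A straight-line chain visited in flow order converges in two passes (one to propagate, one to confirm). The same chain visited against the flow direction needs a third pass, and every added block makes each pass longer.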

One interesting thing about the fgDfsReversePostorderHelper and related regressions: it uses ArrayStack<DfsBlockEntry> with a default initial capacity, so we see a regression because Realloc is called. Perhaps it needs to size the initial capacity better based on number of blocks in the function.
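
The capacity suggestion might look like the following sketch, in which `std::vector` stands in for the JIT's `ArrayStack` and `ReversePostorder` is a hypothetical function written for this illustration. The assumption is that each block is pushed on the DFS stack at most once (it is marked visited before being pushed), so the block count is a safe upper bound for the stack and no reallocation occurs during the walk.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Iterative DFS that returns the blocks reachable from `entry` in reverse
// postorder. The explicit stack is reserved from the block count up front.
std::vector<int> ReversePostorder(const std::vector<std::vector<int>>& succs, int entry)
{
    const size_t n = succs.size();
    assert(entry >= 0 && static_cast<size_t>(entry) < n);

    std::vector<bool> visited(n, false);
    std::vector<int>  postorder;
    postorder.reserve(n);

    // Each entry is (block, index of the next successor to visit).
    std::vector<std::pair<int, size_t>> stack;
    stack.reserve(n); // a block is pushed at most once, so n entries suffice

    visited[entry] = true;
    stack.emplace_back(entry, 0);

    while (!stack.empty())
    {
        std::pair<int, size_t>& top   = stack.back();
        const int               block = top.first;

        if (top.second < succs[block].size())
        {
            const int succ = succs[block][top.second++];
            if (!visited[succ])
            {
                visited[succ] = true;
                stack.emplace_back(succ, 0);
            }
        }
        else
        {
            postorder.push_back(block); // all successors done
            stack.pop_back();
        }
    }

    return std::vector<int>(postorder.rbegin(), postorder.rend());
}
```

Reserving the full block count trades a little up-front memory for zero growth steps; a smaller estimate would also work if the stack is still allowed to grow.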

Base: 55245533581, Diff: 55555209636, +0.5605%

?fgComputeReachabilitySets@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                                            : 61902267  : +34.68%  : 15.41% : +0.1120%
?fgComputeDoms@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                                                        : 38419209  : +36.04%  : 9.56%  : +0.0695%
?fgDominate@Compiler@@QEAA_NPEBUBasicBlock@@0@Z                                                                                                                                                                                                                                                                                         : 34346995  : NA       : 8.55%  : +0.0622%
?fgDfsReversePostorderHelper@Compiler@@IEAAXPEAUBasicBlock@@AEAPEA_KAEAI2@Z                                                                                                                                                                                                                                                             : 25242649  : +36.18%  : 6.28%  : +0.0457%
?fgUpdateFlowGraph@Compiler@@QEAA_N_N0@Z                                                                                                                                                                                                                                                                                                : 19906161  : +3.73%   : 4.95%  : +0.0360%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z                                                                                                                                                                                                                                                                                        : 17099420  : +4.05%   : 4.26%  : +0.0310%
?fgDfsReversePostorder@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                                                : 13681737  : +35.28%  : 3.41%  : +0.0248%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z                                                                                                                                                                                                                                                                                              : 10881176  : +4.07%   : 2.71%  : +0.0197%
?fgRenumberBlocks@Compiler@@QEAA_NXZ                                                                                                                                                                                                                                                                                                    : 9195714   : +10.85%  : 2.29%  : +0.0166%
DomTreeVisitor<`Compiler::fgNumberDomTree'::`2'::NumberDomTreeVisitor>::WalkTree                                                                                                                                                                                                                                                        : 8870325   : +34.93%  : 2.21%  : +0.0161%
?fgCanCompactBlocks@Compiler@@QEAA_NPEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                                 : 7068583   : +5.04%   : 1.76%  : +0.0128%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z                                                                                                                                                                                                                                                                          : 5800820   : +0.76%   : 1.44%  : +0.0105%
?isEmpty@BasicBlock@@QEBA_NXZ                                                                                                                                                                                                                                                                                                           : 5057038   : +3.61%   : 1.26%  : +0.0092%
?fgBuildDomTree@Compiler@@IEAAPEAUDomTreeNode@@XZ                                                                                                                                                                                                                                                                                       : 4889433   : +22.03%  : 1.22%  : +0.0089%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                      : 4864797   : +1.67%   : 1.21%  : +0.0088%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                                                                                                                                                                                                                                                                            : 4396313   : +0.40%   : 1.09%  : +0.0080%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                      : 4148232   : +1.56%   : 1.03%  : +0.0075%
?BlockPredsWithEH@Compiler@@QEAAPEAUFlowEdge@@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                        : 3686431   : +2.31%   : 0.92%  : +0.0067%
?CheckGrowth@?$JitHashTable@PEAUBasicBlock@@U?$JitPtrKeyFuncs@UBasicBlock@@@@PEAU1@VCompAllocator@@VJitHashTableBehavior@@@@AEAAXXZ                                                                                                                                                                                                     : 3296879   : +14.51%  : 0.82%  : +0.0060%
?fgCompactBlocks@Compiler@@QEAAXPEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                                     : 3281891   : +2.67%   : 0.82%  : +0.0059%
?fgCreateLoopPreHeader@Compiler@@QEAA_NI@Z                                                                                                                                                                                                                                                                                              : 2758659   : +466.47% : 0.69%  : +0.0050%
?doLinearScan@LinearScan@@UEAA?AW4PhaseStatus@@XZ                                                                                                                                                                                                                                                                                       : 2534677   : +1.34%   : 0.63%  : +0.0046%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                : 2486850   : +1.23%   : 0.62%  : +0.0045%
ArrayStack<`Compiler::fgDfsReversePostorderHelper'::`2'::DfsBlockEntry>::Realloc                                                                                                                                                                                                                                                        : 2476960   : +48.38%  : 0.62%  : +0.0045%
?ComputeIteratedDominanceFrontier@SsaBuilder@@AEAAXPEAUBasicBlock@@PEBV?$JitHashTable@PEAUBasicBlock@@U?$JitPtrKeyFuncs@UBasicBlock@@@@V?$vector@PEAUBasicBlock@@V?$allocator@PEAUBasicBlock@@@jitstd@@@jitstd@@VCompAllocator@@VJitHashTableBehavior@@@@PEAV?$vector@PEAUBasicBlock@@V?$allocator@PEAUBasicBlock@@@jitstd@@@jitstd@@@Z : 2268187   : +2.23%   : 0.56%  : +0.0041%
??$ForwardAnalysis@VAssertionPropFlowCallback@@@DataFlow@@QEAAXAEAVAssertionPropFlowCallback@@@Z                                                                                                                                                                                                                                        : 2253689   : +1.70%   : 0.56%  : +0.0041%
?optReachable@Compiler@@QEAA_NQEAUBasicBlock@@00@Z                                                                                                                                                                                                                                                                                      : 2126301   : +2.85%   : 0.53%  : +0.0038%
?ComputeDominanceFrontiers@SsaBuilder@@AEAAXPEAPEAUBasicBlock@@HPEAV?$JitHashTable@PEAUBasicBlock@@U?$JitPtrKeyFuncs@UBasicBlock@@@@V?$vector@PEAUBasicBlock@@V?$allocator@PEAUBasicBlock@@@jitstd@@@jitstd@@VCompAllocator@@VJitHashTableBehavior@@@@@Z                                                                                : 1943378   : +3.45%   : 0.48%  : +0.0035%
?optImpliedByTypeOfAssertions@Compiler@@QEAAXAEAPEA_K@Z                                                                                                                                                                                                                                                                                 : 1851366   : +2.37%   : 0.46%  : +0.0034%
?ComputeImmediateDom@SsaBuilder@@AEAAXPEAPEAUBasicBlock@@H@Z                                                                                                                                                                                                                                                                            : 1840454   : +2.45%   : 0.46%  : +0.0033%
?TopologicalSort@SsaBuilder@@AEAAHPEAPEAUBasicBlock@@H@Z                                                                                                                                                                                                                                                                                : 1786891   : +3.00%   : 0.44%  : +0.0032%
??$fgAddRefPred@$0A@@Compiler@@QEAAPEAUFlowEdge@@PEAUBasicBlock@@0PEAU1@@Z                                                                                                                                                                                                                                                              : 1724117   : +3.93%   : 0.43%  : +0.0031%
?optHoistThisLoop@Compiler@@IEAA_NIPEAULoopHoistContext@1@@Z                                                                                                                                                                                                                                                                            : 1704027   : NA       : 0.42%  : +0.0031%
?fgOptimizeEmptyBlock@Compiler@@QEAA_NPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                                : 1450385   : +23.49%  : 0.36%  : +0.0026%
??$ForwardAnalysis@VCSE_DataFlow@@@DataFlow@@QEAAXAEAVCSE_DataFlow@@@Z                                                                                                                                                                                                                                                                  : 1432333   : +1.49%   : 0.36%  : +0.0026%
?optJumpThreadCheck@Compiler@@QEAA_NQEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                                 : 1404980   : +12.64%  : 0.35%  : +0.0025%
?reorderPredList@BasicBlock@@QEAAXPEAVCompiler@@@Z                                                                                                                                                                                                                                                                                      : 1346707   : +6.43%   : 0.34%  : +0.0024%
??$KindIs@W4BBjumpKinds@@@BasicBlock@@QEBA_NW4BBjumpKinds@@0@Z                                                                                                                                                                                                                                                                          : 1303183   : +3.63%   : 0.32%  : +0.0024%
?fgValueNumberBlock@Compiler@@QEAAXPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                                   : 1250664   : +0.97%   : 0.31%  : +0.0023%
?fgRemoveRefPred@Compiler@@QEAAPEAUFlowEdge@@PEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                        : 1198425   : +2.83%   : 0.30%  : +0.0022%
?BlockRenameVariables@SsaBuilder@@AEAAXPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                               : 1194368   : +0.70%   : 0.30%  : +0.0022%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                    : 1178071   : +1.40%   : 0.29%  : +0.0021%
?AddPhiArgsToSuccessors@SsaBuilder@@AEAAXPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                             : 1135341   : +1.53%   : 0.28%  : +0.0021%
jitstd::`anonymous namespace'::quick_sort<jitstd::vector<FlowEdge *,jitstd::allocator<FlowEdge *> >::iterator,`BasicBlock::reorderPredList'::`2'::FlowEdgeBBNumCmp>                                                                                                                                                                     : 1101272   : +6.28%   : 0.27%  : +0.0020%
?ensure_capacity@?$vector@PEAUBasicBlock@@V?$allocator@PEAUBasicBlock@@@jitstd@@@jitstd@@AEAA_N_K@Z                                                                                                                                                                                                                                     : 1077164   : +3.18%   : 0.27%  : +0.0019%
?fgValueNumber@Compiler@@QEAA?AW4PhaseStatus@@XZ                                                                                                                                                                                                                                                                                        : 1060039   : +1.17%   : 0.26%  : +0.0019%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z                                                                                                                                                                                                                                   : 977026    : +0.66%   : 0.24%  : +0.0018%
?optFindNaturalLoops@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                                                  : 924808    : +8.45%   : 0.23%  : +0.0017%
?bbNewBasicBlock@Compiler@@QEAAPEAUBasicBlock@@W4BBjumpKinds@@@Z                                                                                                                                                                                                                                                                        : 903977    : +0.88%   : 0.22%  : +0.0016%
?fgPerBlockLocalVarLiveness@Compiler@@QEAAXXZ                                                                                                                                                                                                                                                                                           : 884567    : +0.31%   : 0.22%  : +0.0016%
?GetDescriptorForSwitch@Compiler@@QEAA?AUSwitchUniqueSuccSet@1@PEAUBasicBlock@@@Z                                                                                                                                                                                                                                                       : 878679    : +5.21%   : 0.22%  : +0.0016%
?optRedundantBranch@Compiler@@QEAA_NQEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                                  : 846054    : +0.17%   : 0.21%  : +0.0015%
?fgUpdateLoopsAfterCompacting@Compiler@@QEAAXPEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                        : 814495    : +7.11%   : 0.20%  : +0.0015%
?Assign@?$BitSetOps@PEA_K$00PEAUBitVecTraits@@U1@@@SAXPEAUBitVecTraits@@AEAPEA_KPEA_K@Z                                                                                                                                                                                                                                                 : 807181    : +1.80%   : 0.20%  : +0.0015%
?fgCompDominatedByExceptionalEntryBlocks@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                              : 772144    : +25.66%  : 0.19%  : +0.0014%
?Run@LiveVarAnalysis@@AEAAX_N@Z                                                                                                                                                                                                                                                                                                         : 737728    : +0.84%   : 0.18%  : +0.0013%
??$CountBitsInIntegral@_K@BitSetSupport@@SAI_K@Z                                                                                                                                                                                                                                                                                        : 718278    : +14.79%  : 0.18%  : +0.0013%
?optBlockCopyProp@Compiler@@QEAA_NPEAUBasicBlock@@PEAV?$JitHashTable@IU?$JitSmallPrimitiveKeyFuncs@I@@PEAV?$ArrayStack@VCopyPropSsaDef@Compiler@@@@VCompAllocator@@VJitHashTableBehavior@@@@@Z                                                                                                                                          : 715155    : +0.36%   : 0.18%  : +0.0013%
?optCopyProp@Compiler@@QEAA_NPEAUBasicBlock@@PEAUStatement@@PEAUGenTreeLclVarCommon@@IPEAV?$JitHashTable@IU?$JitSmallPrimitiveKeyFuncs@I@@PEAV?$ArrayStack@VCopyPropSsaDef@Compiler@@@@VCompAllocator@@VJitHashTableBehavior@@@@@Z                                                                                                      : 695610    : +0.09%   : 0.17%  : +0.0013%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z                                                                                                                                                                                                                                        : 634075    : +0.69%   : 0.16%  : +0.0011%
?optComputeLoopSideEffectsOfBlock@Compiler@@AEAA_NPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                    : 575091    : +1.20%   : 0.14%  : +0.0010%
?fgInterBlockLocalVarLiveness@Compiler@@QEAAXXZ                                                                                                                                                                                                                                                                                         : 567791    : +0.20%   : 0.14%  : +0.0010%
?InsertPhiFunctions@SsaBuilder@@AEAAXPEAPEAUBasicBlock@@H@Z                                                                                                                                                                                                                                                                             : 558383    : +0.64%   : 0.14%  : +0.0010%
?optValnumCSE_InitDataFlow@Compiler@@IEAAXXZ                                                                                                                                                                                                                                                                                            : 531119    : +1.46%   : 0.13%  : +0.0010%
?fgInitBlockVarSets@Compiler@@QEAAXXZ                                                                                                                                                                                                                                                                                                   : 490339    : +0.69%   : 0.12%  : +0.0009%
?MakeEmpty@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAPEA_KPEAVCompiler@@@Z                                                                                                                                                                                                                                         : 464869    : +3.07%   : 0.12%  : +0.0008%
?RenameVariables@SsaBuilder@@AEAAXXZ                                                                                                                                                                                                                                                                                                    : 454551    : +0.75%   : 0.11%  : +0.0008%
?optInitAssertionDataflowFlags@Compiler@@QEAAPEAPEA_KXZ                                                                                                                                                                                                                                                                                 : 444968    : +1.71%   : 0.11%  : +0.0008%
?optAssertionPropMain@Compiler@@QEAA?AW4PhaseStatus@@XZ                                                                                                                                                                                                                                                                                 : 403540    : +0.17%   : 0.10%  : +0.0007%
memset                                                                                                                                                                                                                                                                                                                                  : -567179   : -0.13%   : 0.14%  : -0.0010%
?compInit@Compiler@@QEAAXPEAVArenaAllocator@@PEAUCORINFO_METHOD_STRUCT_@@PEAVICorJitInfo@@PEAUCORINFO_METHOD_INFO@@PEAUInlineInfo@@@Z                                                                                                                                                                                                   : -572769   : -0.51%   : 0.14%  : -0.0010%
?optUpdateLoopsBeforeRemoveBlock@Compiler@@IEAAXPEAUBasicBlock@@_N@Z                                                                                                                                                                                                                                                                    : -610530   : -21.54%  : 0.15%  : -0.0011%
?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                                                                                                                                                                                                                                                                         : -694841   : -0.10%   : 0.17%  : -0.0013%
?optHoistThisLoop@Compiler@@IEAA_NIPEAULoopHoistContext@1@PEAUBasicBlockList@@@Z                                                                                                                                                                                                                                                        : -1734446  : -100.00% : 0.43%  : -0.0031%
?fgDominate@Compiler@@QEAA_NPEAUBasicBlock@@0@Z                                                                                                                                                                                                                                                                                         : -39379589 : -100.00% : 9.80%  : -0.0713%

@AndyAyersMS
Member

In fgComputeReachabilitySets, if UnionD returned a bool indicating if any bits changed, we could get rid of the newReach, Equals, and Assign calls, and just modify block->bbReach directly, and still know whether to keep iterating. This bool should be fairly cheap to compute in the existing methods, or we could add a new method to isolate those extra costs to cases where we're going to look at the result.
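
A rough illustration of the "union reports whether anything changed" idea follows. It is a sketch under stated assumptions, not the JIT's actual `BitSetOps` API: `BitVec`, `UnionChanged`, `ComputeReach`, and `Reaches` are hypothetical names, and blocks are plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a JIT block set: a fixed-width bit vector.
using BitVec = std::vector<uint64_t>;

// Union `src` into `dst` in place and report whether any bit of `dst` changed.
// The "changed" flag costs one compare per word, so no temporary set is needed.
bool UnionChanged(BitVec& dst, const BitVec& src)
{
    assert(dst.size() == src.size());
    bool changed = false;
    for (size_t i = 0; i < dst.size(); i++)
    {
        const uint64_t merged = dst[i] | src[i];
        changed |= (merged != dst[i]);
        dst[i] = merged;
    }
    return changed;
}

// reach[b] holds every block that can reach b (including b itself).
// preds[b] lists the predecessor indices of block b.
std::vector<BitVec> ComputeReach(const std::vector<std::vector<size_t>>& preds)
{
    const size_t n     = preds.size();
    const size_t words = (n + 63) / 64;
    std::vector<BitVec> reach(n, BitVec(words, 0));
    for (size_t b = 0; b < n; b++)
    {
        reach[b][b / 64] |= (uint64_t{1} << (b % 64));
    }

    bool change;
    do
    {
        change = false;
        for (size_t b = 0; b < n; b++)
        {
            for (size_t p : preds[b])
            {
                // Modify reach[b] directly; keep iterating only if something changed.
                change |= UnionChanged(reach[b], reach[p]);
            }
        }
    } while (change);
    return reach;
}

// True if block `from` can reach block `to`.
bool Reaches(const std::vector<BitVec>& reach, size_t from, size_t to)
{
    return ((reach[to][from / 64] >> (from % 64)) & 1) != 0;
}
```

The fixed-point loop still walks every block on each pass; the only change modeled here is dropping the temporary set and the separate equality check.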

We could also avoid repeatedly walking the full block list in the do loop since we know exactly which blocks might see new reachability bits (successors of any block that just did an update). If we had the appropriate worklist set up (say another blockset indexed by the reverse postorder number) we would organically process blocks in pred->succ order and so possibly greatly accelerate convergence.

At the very least we should be walking the blocks in the do loop in reverse postorder; that might be an easy change with some nice wins (if the bbPostOrderNum -> block mapping is not current, you can build it in the loop above).
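
The worklist variant described above could look roughly like the following. This is a minimal sketch, assuming blocks are indices with block 0 as the entry; `ReversePostOrder` and `ComputeReachWorklist` are hypothetical helpers, and a `std::set` of reverse postorder numbers stands in for the suggested blockset-based worklist.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Reverse postorder of the blocks reachable from entry block 0.
std::vector<size_t> ReversePostOrder(const std::vector<std::vector<size_t>>& succs)
{
    std::vector<size_t> postOrder;
    std::vector<bool>   seen(succs.size(), false);
    std::function<void(size_t)> dfs = [&](size_t b) {
        seen[b] = true;
        for (size_t s : succs[b])
        {
            if (!seen[s])
            {
                dfs(s);
            }
        }
        postOrder.push_back(b);
    };
    if (!succs.empty())
    {
        dfs(0);
    }
    return std::vector<size_t>(postOrder.rbegin(), postOrder.rend());
}

// reach[b][a] == true means block a can reach block b.
// Blocks not reachable from the entry are left with empty rows in this sketch.
std::vector<std::vector<bool>> ComputeReachWorklist(const std::vector<std::vector<size_t>>& succs)
{
    const size_t              n   = succs.size();
    const std::vector<size_t> rpo = ReversePostOrder(succs);
    std::vector<size_t>       rpoNum(n, n); // n marks "not reachable from entry"
    for (size_t i = 0; i < rpo.size(); i++)
    {
        rpoNum[rpo[i]] = i;
    }

    std::vector<std::vector<bool>> reach(n, std::vector<bool>(n, false));
    std::set<size_t>               worklist; // RPO numbers; smallest is processed first
    for (size_t i = 0; i < rpo.size(); i++)
    {
        reach[rpo[i]][rpo[i]] = true;
        worklist.insert(i);
    }

    while (!worklist.empty())
    {
        const size_t b = rpo[*worklist.begin()];
        worklist.erase(worklist.begin());
        for (size_t s : succs[b])
        {
            assert(rpoNum[s] < n); // successors of a reachable block are reachable
            bool changed = false;
            for (size_t a = 0; a < n; a++)
            {
                if (reach[b][a] && !reach[s][a])
                {
                    reach[s][a] = true;
                    changed     = true;
                }
            }
            if (changed)
            {
                worklist.insert(rpoNum[s]); // only blocks that saw new bits are revisited
            }
        }
    }
    return reach;
}
```

Only successors of a block whose set just grew are revisited, and the ordered worklist processes them in pred->succ order where the flow graph allows it.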

Not sure if you want to fold all that in here or do it separately -- should be a win on its own and also minimize some of the extra costs here.

That plus resizing the array stack might cut out half of the worst case TP impact. Maybe.

@BruceForstall
Member Author

I generated COUNT_BASIC_BLOCKS and COUNT_LOOPS stats, added a new block stats histogram for post-loop-recognition, and see the following (win-x64 benchmarks replay):

Blocks after importer (same base & diff):

--------------------------------------------------
Basic block count frequency table:
--------------------------------------------------
     <=          1 ===>  356063 count ( 66% of total)
      2 ..       2 ===>     156 count ( 66% of total)
      3 ..       3 ===>  124928 count ( 90% of total)
      4 ..       5 ===>   11703 count ( 92% of total)
      6 ..      10 ===>   29412 count ( 97% of total)
     11 ..      20 ===>    9327 count ( 99% of total)
     21 ..      50 ===>    1493 count ( 99% of total)
     51 ..     100 ===>     280 count ( 99% of total)
    101 ..    1000 ===>      79 count ( 99% of total)
   1001 ..   10000 ===>       1 count (100% of total)
--------------------------------------------------

(Note: this is for 35418 method contexts, so it appears we're hugely overemphasizing inlinees, and probably double counting.)

Post loop recognition: base
--------------------------------------------------
Basic block count frequency table (post loop recognition):
--------------------------------------------------
     <=          1 ===>   10652 count ( 31% of total)
      2 ..       2 ===>      40 count ( 31% of total)
      3 ..       3 ===>    2515 count ( 38% of total)
      4 ..       5 ===>    2193 count ( 45% of total)
      6 ..      10 ===>    3457 count ( 55% of total)
     11 ..      20 ===>    5832 count ( 72% of total)
     21 ..      50 ===>    8131 count ( 96% of total)
     51 ..     100 ===>     906 count ( 98% of total)
    101 ..    1000 ===>     383 count ( 99% of total)
   1001 ..   10000 ===>       1 count (100% of total)
--------------------------------------------------

Post loop recognition: diff
--------------------------------------------------
Basic block count frequency table (post loop recognition):
--------------------------------------------------
     <=          1 ===>   10652 count ( 31% of total)
      2 ..       2 ===>      40 count ( 31% of total)
      3 ..       3 ===>    2052 count ( 37% of total)
      4 ..       5 ===>    2368 count ( 44% of total)
      6 ..      10 ===>    3615 count ( 54% of total)
     11 ..      20 ===>    3588 count ( 65% of total)
     21 ..      50 ===>   10459 count ( 96% of total)
     51 ..     100 ===>     940 count ( 98% of total)
    101 ..    1000 ===>     395 count ( 99% of total)
   1001 ..   10000 ===>       1 count (100% of total)
--------------------------------------------------

So, a lot of functions get bumped up to higher block count buckets.
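
For reading the tables above: the percentage column appears to be cumulative and truncated (in the first table, 356063 of 533442 total is about 66.7%, shown as 66%). A minimal sketch of such a bucketed histogram follows; `Histogram`, `Record`, and `CumulativePercent` are hypothetical names for illustration and are not taken from the JIT sources.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical bucketed histogram: bucket i counts values <= upperBounds[i]
// (and greater than the previous bound); one extra bucket catches the rest.
struct Histogram
{
    std::vector<unsigned> upperBounds; // inclusive, ascending
    std::vector<unsigned> counts;

    explicit Histogram(std::vector<unsigned> bounds)
        : upperBounds(std::move(bounds)), counts(upperBounds.size() + 1, 0)
    {
    }

    void Record(unsigned value)
    {
        size_t i = 0;
        while ((i < upperBounds.size()) && (value > upperBounds[i]))
        {
            i++;
        }
        counts[i]++;
    }

    // Percent of all recorded values that fall in buckets 0..bucket, truncated.
    unsigned CumulativePercent(size_t bucket) const
    {
        assert(bucket < counts.size());
        unsigned long long total   = 0;
        unsigned long long running = 0;
        for (size_t i = 0; i < counts.size(); i++)
        {
            total += counts[i];
            if (i <= bucket)
            {
                running += counts[i];
            }
        }
        return (total == 0) ? 0 : static_cast<unsigned>(running * 100 / total);
    }
};
```

With cumulative percentages, a shift of methods from the 11..20 bucket to the 21..50 bucket shows up as a lower percentage on the 11..20 row while the 21..50 row stays at the same cumulative value, which matches the base and diff tables.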

Loops: base
---------------------------------------------------
Loop stats
---------------------------------------------------
Total number of methods with loops is 11816
Total number of              loops is 14933
Maximum number of loops per method is    45
# of methods overflowing nat loop table is     0
Total number of 'unnatural' loops is 15481
# of methods overflowing unnat loop limit is     0
Total number of loops with an         iterator is  3063
Total number of loops with a constant iterator is   929
--------------------------------------------------
Loop count frequency table:
--------------------------------------------------
     <=          0 ===>     336 count (  2% of total)
      1 ..       1 ===>   10484 count ( 89% of total)
      2 ..       2 ===>     759 count ( 95% of total)
      3 ..       3 ===>     263 count ( 97% of total)
      4 ..       4 ===>     130 count ( 98% of total)
      5 ..       5 ===>      58 count ( 99% of total)
      6 ..       6 ===>      23 count ( 99% of total)
      7 ..       7 ===>      20 count ( 99% of total)
      8 ..       8 ===>      14 count ( 99% of total)
      9 ..       9 ===>      17 count ( 99% of total)
     10 ..      10 ===>       4 count ( 99% of total)
     11 ..      11 ===>       7 count ( 99% of total)
     12 ..      12 ===>       3 count (100% of total)
      >         12 ===>      34 count (100% of total)
--------------------------------------------------
Loop exit count frequency table:
--------------------------------------------------
     <=          0 ===>       1 count (  0% of total)
      1 ..       1 ===>    3716 count ( 26% of total)
      2 ..       2 ===>    1733 count ( 38% of total)
      3 ..       3 ===>     556 count ( 42% of total)
      4 ..       4 ===>    2836 count ( 62% of total)
      5 ..       5 ===>    5223 count ( 98% of total)
      6 ..       6 ===>     167 count (100% of total)
      >          6 ===>     701 count (104% of total)
--------------------------------------------------

You would expect the diff run to report the same loop stats, but there are two differences:
Total number of 'unnatural' loops is 15480 (-1)
Total number of loops with a constant iterator is   851 (-78)

I'm not sure why the number of constant iterator loops has changed; creating loop preheaders happens after all loop recognition and recording (where the constant iterator determination occurs).

@BruceForstall
Member Author

> Not sure if you want to fold all that in here or do it separately -- should be a win on its own and also minimize some of the extra costs here.

Probably should be done separately; possibly before this change is merged.

// loop pre-header block would be added anyway (by dominating the loop exit block), we don't
// add it here, and let it be added naturally, below.
//
// Note that all pre-headers get added first, which means they get considered for hoisting last. It is
Member

So now, the hoisting order is: "the entry block" followed by the pre-headers of the inner loops. Can you write a comment giving an example of the order in which pre-headers are considered for hoisting in a multi-nested loop?

// preheader 1
for (....) {
  // preheader 2
  for (...) {
    // preheader 3
    for (...) {
    }
  }
}

At preheader 1, what will be the order in which the preheaders are considered? 1, 2, 3 or 3, 2, 1 or something else?

Note that the order does matter for the hoisting profitability heuristics.

Is there a way where we can hoist the block depending on size?

Member Author

@BruceForstall BruceForstall Apr 5, 2023

I added this comment:

    // For example, consider this loop nest:
    // 
    // for (....) { // loop L00
    //    // pre-header 1
    //    for (...) { // loop L01
    //    }
    //    // pre-header 2
    //    for (...) { // loop L02
    //       // pre-header 3
    //       for (...) { // loop L03
    //       }
    //    }
    // }
    //
    // When processing the outer loop L00 (with an assumed single exit), we will push on the defExec stack
    // pre-header 2, pre-header 1, the loop exit block, any IDom tree blocks leading to the entry block,
    // and finally the entry block. (Note that the child loop iteration order of a loop is from "farthest"
    // from the loop "head" to "nearest".) Blocks are considered for hoisting in the opposite order.
    //
    // Note that pre-header 3 is not pushed, since it is not a direct child. It would have been processed
    // when loop L02 was considered for hoisting.
    //
    // The order of pushing pre-header 1 and pre-header 2 is based on the order in the loop table (which is
    // convenient). But note that it is arbitrary because there is no guaranteed execution order amongst
    // the child loops.
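
The push order described in that comment can be modeled with a toy stack. This is only an illustration with block names as strings; `BuildDefExecStack` and `ConsiderationOrder` are hypothetical helpers, not functions from the JIT.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the "defExec" stack for one loop, following the order in the comment:
// child pre-headers first (in child loop iteration order, farthest to nearest),
// then the exit block, then the IDom chain toward the entry, then the entry.
std::vector<std::string> BuildDefExecStack(const std::vector<std::string>& childPreHeaders,
                                           const std::string&              exitBlock,
                                           const std::vector<std::string>& idomChainToEntry,
                                           const std::string&              entryBlock)
{
    std::vector<std::string> stack;
    for (const std::string& preHeader : childPreHeaders)
    {
        stack.push_back(preHeader);
    }
    stack.push_back(exitBlock);
    for (const std::string& block : idomChainToEntry)
    {
        stack.push_back(block);
    }
    stack.push_back(entryBlock);
    return stack;
}

// Blocks are considered for hoisting in the opposite order of the pushes.
std::vector<std::string> ConsiderationOrder(const std::vector<std::string>& stack)
{
    return std::vector<std::string>(stack.rbegin(), stack.rend());
}
```

For loop L00 in the example, pushing "pre-header 2" then "pre-header 1" yields a consideration order of entry block, IDom blocks, exit block, pre-header 1, pre-header 2, so the pre-headers are considered last.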

> Is there a way where we can hoist the block depending on size?

I'm not sure I understand the question. Hoisting of expressions does have various cost metrics applied. What kind of "block size" are you thinking about? Would it affect the order here, or the normal hoisting costing?

Member

> What kind of "block size" are you thinking about?

What I meant was if inside pre-header 1, we hoisted out 2 expressions and inside pre-header 2, we hoisted 4 expressions, should we track that and determine which block should be hoisted first. I am also wondering if we should first hoist the inner-most pre-header because that's the one that gets executed more often than that of outer loops preheader? That way if we hit CSE limit, we at least would have hoisted the hot parts first. Let me know if it is still not clear and we can talk offline.

Member Author

> What I meant was if inside pre-header 1, we hoisted out 2 expressions and inside pre-header 2, we hoisted 4 expressions, should we track that and determine which block should be hoisted first.

In the example above, pre-header 1 and 2 are from sibling loops. How would we decide which block should be considered first? I don't think size makes sense. It would make sense to order based on (PGO or synthesized) block weights.

Any change here is independent of this change, though.

> I am also wondering if we should first hoist the inner-most pre-header because that's the one that gets executed more often than that of outer loops preheader?

We only hoist one level at a time, and from inner to outer. So it's possible that expressions in L03 got hoisted to pre-header 3, then got hoisted to pre-header 2. They should then be considered together with (and equivalently to) the other expressions in pre-header 2, possibly using weighting, as described above.
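
A toy model of "one level at a time, inner to outer" is sketched below. It is an assumption-laden illustration, not the JIT's hoisting code: loops are indices, `parent[L]` is the loop containing L's pre-header (-1 if none), and `Expr` and `HoistInnerToOuter` are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical expression record: the loop it currently lives in (-1 means
// outside all loops) and the set of loops in which it is invariant.
struct Expr
{
    std::string   name;
    int           loop;
    std::set<int> invariantIn;
};

// Visit loops inner-to-outer. An expression that lives in the visited loop and
// is invariant in it moves up exactly one level, to that loop's pre-header,
// which is located in the parent loop.
void HoistInnerToOuter(const std::vector<int>& parent,
                       const std::vector<int>& innerToOuterOrder,
                       std::vector<Expr>&      exprs)
{
    for (int loop : innerToOuterOrder)
    {
        for (Expr& e : exprs)
        {
            if ((e.loop == loop) && (e.invariantIn.count(loop) != 0))
            {
                e.loop = parent[loop];
            }
        }
    }
}
```

With the loop nest from the example (L01 and L02 inside L00, L03 inside L02), an expression in L03 that is invariant in both L03 and L02 first lands in pre-header 3 and is then hoisted again to pre-header 2 when L02 is processed.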

@BruceForstall
Member Author

> Could you paste some of the before vs. after diffs from the "hoisting from nested loop" examples I had in #68061?

@kunalspathak I tried all the examples listed there. There is no codegen difference for any of them between the baseline and this PR. (There is one minor case of a label difference induced by our PerfScore code.)

@BruceForstall
Member Author

@AndyAyersMS @kunalspathak I've updated the PR to address the feedback, especially for comments. I added code to handle rebuilding the loop table when a pre-header block was previously added, and still recognize a constant initializer.

If the tests pass and I get a sign-off, it's ready to merge.

@BruceForstall
Member Author

Current diffs

No change from before. win-arm64 timed out (infra problem?)

@BruceForstall
Member Author

@AndyAyersMS @kunalspathak ping

@kunalspathak
Member

Is there any reason why changes in da05026 need to go in this PR?

Member

@kunalspathak kunalspathak left a comment

Changes LGTM, but I'm worried about the number of methods that regressed. On average, across configurations, approx. 20% of the methods with diffs regressed in code size. Part of the reason is block renumbering.

Were you able to run asmdiff with PerfScore on local machine to see if you notice any improvements? Hopefully micro benchmarks would catch anything important.

@BruceForstall
Member Author

BruceForstall commented Apr 6, 2023

> Is there any reason why changes in da05026 need to go in this PR?

Actually, those already got merged. Having that commit here was an accident: I cherry-picked it to use the change but didn't rebase it away (to avoid messing up comment threads). Presumably it ends up being a nop? Anyway, I've rebased now.

As part of finding natural loops and creating the loop
table, create a loop pre-header for every loop. This
simplifies a lot of downstream phases, as the loop
pre-header will be guaranteed to exist, and will already
exist in the dominator tree.

Introduce code to preserve an empty pre-header block through
the optimization phases.

Remove now unnecessary code in hoisting and elsewhere.

Fixes dotnet#77033, dotnet#62665

Disallow creating pre-header after SSA is built

When the loop table is built, it looks around for various loop patterns,
including looking for a guaranteed-executed, pre-loop constant initializer.
This is used in loop cloning and loop unrolling. It needs to look
"a little harder" in the case where we created loop pre-headers and then
rebuilt the loop table (currently, only due to loop unrolling of loops that
contain nested loops). The new code only allows for empty pre-headers. This
works since in our current phase ordering, no hoisting happens by the time
the loop table is rebuilt.

(Actually, it's currently not necessary to do this at all, since the constant
initializer info is only used by cloning and loop unrolling, both of which
have finished by the time the loop table is rebuilt. However, we might someday
choose to rebuild the loop table after cloning and before unrolling, at which
point it would be necessary.)

@BruceForstall
Member Author

> Were you able to run asmdiff with PerfScore on local machine to see if you notice any improvements? Hopefully micro benchmarks would catch anything important.

As with the CodeSize diffs, the PerfScore diffs show lots of differences: some improvements and some regressions. On balance, there appear to be more improvements than regressions. The regressions seem to be due mostly to block weight changes (which look better in the diff).

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

Loop canonicalization should always create loop pre-headers
3 participants