From 955a6812c249fe9e16196f16d921742beba02b4e Mon Sep 17 00:00:00 2001
From: Bruce Forstall
Date: Mon, 13 Dec 2021 23:02:51 -0800
Subject: [PATCH] Generalize loop pre-header creation and loop hoisting (#62560)

* Generalize loop pre-header creation and loop hoisting

A loop pre-header is a block that flows directly (and only) to the loop entry block. The loop pre-header is the only non-loop predecessor of the entry block. Loop-invariant code can be hoisted to the loop pre-header, where it is guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:
- it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
- it isn't created if the current loop head and the loop entry block are in different EH try regions.

Additionally, partially due to those restrictions, loop hoisting has restrictions:
- it requires a "do-while" (top entry) loop
- it requires the existing `head` block to dominate the loop entry block
- it requires the existing `head` block to be in the same EH region as the entry block
- it won't hoist if the `entry` block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header was a little weaker: an entry predecessor could be a non-loop block and also not the pre-header, if that predecessor was dominated by the entry block. This is more complicated to reason about, so I changed pre-header creation to force non-loop predecessors of the entry block to branch to the pre-header instead. This case occurs only rarely, when we have what looks like an outer loop back edge but natural loop recognition doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after loop recognition.
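The strengthened pre-header invariant described above (every predecessor of the loop entry is either inside the loop or is the pre-header itself) can be sketched on a toy flow-graph model. This is illustrative only; the types and names below are not the JIT's:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model of a flow-graph block: just a number and predecessor edges.
struct Block
{
    int                 num;
    std::vector<Block*> preds;
};

// Sketch of the invariant: the loop entry has no non-loop predecessor
// other than the pre-header. A predecessor that is outside the loop and
// is not the pre-header violates it.
bool IsValidPreHeader(const Block* preHeader, const Block* entry, const std::set<const Block*>& loopBlocks)
{
    for (const Block* pred : entry->preds)
    {
        if ((loopBlocks.count(pred) == 0) && (pred != preHeader))
        {
            return false; // a non-loop predecessor bypasses the pre-header
        }
    }
    return true;
}
```

Under the old, weaker definition such a bypassing predecessor was tolerated if it was dominated by the entry; under the new scheme it is retargeted to branch to the pre-header, so the check above holds.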
This is currently disabled because loop cloning doesn't respect the special status and invariants of a pre-header, and so inserts all the cloning conditions and copied blocks after the pre-header, triggering new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been added. This revealed a number of problems with loop unrolling leaving things in a bad state for downstream phases. Loop unrolling has been updated to fix this; in particular, the loop table is rebuilt if we detect that we unrolled a loop that contains nested loops, since there will now be multiple copies of those nested loops. This is the first case where we might rebuild the loop table, so it lays the groundwork for potentially rebuilding the loop table in other cases, such as after loop cloning, where we don't add the "slow path" loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:
- `optSetBlockWeights` is elevated to a "phase" that runs prior to loop recognition.
- LoopFlags is simplified:
  - LPFLG_DO_WHILE is removed; call `lpIsTopEntry` instead
  - LPFLG_ONE_EXIT is removed; check `lpExitCnt == 1` instead
  - LPFLG_HOISTABLE is removed (there are no restrictions anymore)
  - LPFLG_CONST is removed; check `(lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)` instead (only used in one place)
- bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
- Added a `lpInitBlock` field to the loop table. For constant and variable initialization loops, code assumed that these expressions existed in the `head` block. This isn't true anymore if we insert a pre-header block. So, capture the block where these expressions actually exist when we determine that they do exist, and explicitly use this block pointer where needed.
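The replacement for the removed LPFLG_CONST is a two-flag test, and it needs outer parentheses because `&` binds more loosely than `==` in C++. A minimal sketch of the idiom (flag values here are placeholders, not the JIT's actual bit assignments):

```cpp
#include <cassert>

// Illustrative subset of the loop flags discussed above; the numeric
// values are placeholders for this sketch.
enum LoopFlags : unsigned short
{
    LPFLG_CONST_INIT  = 0x0010, // iterator is initialized to a constant
    LPFLG_CONST_LIMIT = 0x0100, // iterator is compared with a constant
};

// "Fully constant" now means both flags are set. Without the outer
// parentheses, `lpFlags & both == both` would parse as
// `lpFlags & (both == both)`, which is not the intended check.
bool HasConstInitAndLimit(unsigned short lpFlags)
{
    const unsigned short both = LPFLG_CONST_INIT | LPFLG_CONST_LIMIT;
    return (lpFlags & both) == both;
}
```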
- Added `fgComputeReturnBlocks()` to extract this code out of `fgComputeReachability` into its own function
- Added `optFindAndScaleGeneralLoopBlocks()` to extract this code out of loop recognition into its own function
- Added `optResetLoopInfo()` to reset the loop table and the block annotations related to loops
- Added `fgDebugCheckBBNumIncreasing()` to allow asserting that the bbNum order of blocks is increasing. This should be used in phases that depend on this order to do bbNum comparisons.
- Added a lot more loop table validation in `fgDebugCheckLoopTable()`

* Inline fgBuildBlockNumMap to allow using _alloca

* Fix BBJ_SWITCH output

1. Change `dspSuccs()` to not call code that will call `GetDescriptorForSwitch()`.
2. Change `GetDescriptorForSwitch()` to use the correct max block number while inlining. We probably don't, and shouldn't, call GetDescriptorForSwitch while inlining, especially after (1), but this change doesn't hurt.

* Remove incorrect assertion

There was an assertion, when considering an existing `head` block as a potential pre-header, that the `entry` block have the `head` block as a predecessor. However, we exit the predecessor loop early if we find a non-head, non-loop edge. This could happen before we encounter the `head` block, making the assert incorrect. We don't want to run the entire loop just for the purpose of the assert (at least not here), so just remove the assert.

* Formatting

* Use `_alloca` instead of `alloca` name

* Convert fgBBNumMax usage

Change:
```
compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax
```
to:
```
impInlineRoot()->fgBBNumMax
```

* Code review feedback

1. Added loop epoch concept. Currently set but never checked.
2. Added disabled assert about return blocks always being moved out of line of the loop body.
3. Fixed a bug where `totalIter` was checked after it had been decremented to zero as a loop variable.
4. Added more comments on incremental loop block `bbNatLoopNum` setting.
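The property `fgDebugCheckBBNumIncreasing()` asserts is strictly increasing bbNum along the bbNext chain, not consecutive numbering (the latter only holds right after renumbering). A toy sketch of that distinction, with illustrative names rather than the JIT's types:

```cpp
#include <cassert>

// Toy singly-linked block list: number plus next pointer.
struct Blk
{
    unsigned bbNum;
    Blk*     bbNext;
};

// Sketch of the check: each block's number is strictly less than its
// successor's. Gaps (e.g. 10 -> 20) are fine; only a decrease fails.
bool BBNumIncreasing(const Blk* first)
{
    for (const Blk* b = first; b != nullptr; b = b->bbNext)
    {
        if ((b->bbNext != nullptr) && !(b->bbNum < b->bbNext->bbNum))
        {
            return false;
        }
    }
    return true;
}
```

A phase that compares block numbers (say, to detect backward branches) can rely on this weaker ordering without paying for a full fgRenumberBlocks().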
* Add EH condition when converting head to pre-header When considering converting an existing loop `head` block to a pre-header, verify that the block has the same EH try region that we would create for a new pre-header. If not, we go ahead and create the new pre-header. --- src/coreclr/jit/block.cpp | 47 +- src/coreclr/jit/block.h | 12 +- src/coreclr/jit/clrjit.natvis | 1 + src/coreclr/jit/compiler.cpp | 16 +- src/coreclr/jit/compiler.h | 129 ++-- src/coreclr/jit/compphases.h | 7 +- src/coreclr/jit/error.cpp | 2 +- src/coreclr/jit/fgbasic.cpp | 5 +- src/coreclr/jit/fgdiagnostic.cpp | 244 +++++++- src/coreclr/jit/fgflow.cpp | 3 +- src/coreclr/jit/fgopt.cpp | 93 ++- src/coreclr/jit/flowgraph.cpp | 18 +- src/coreclr/jit/gentree.cpp | 2 +- src/coreclr/jit/hashbv.cpp | 4 +- src/coreclr/jit/importer.cpp | 2 +- src/coreclr/jit/jiteh.cpp | 30 +- src/coreclr/jit/loopcloning.cpp | 10 +- src/coreclr/jit/morph.cpp | 2 +- src/coreclr/jit/optimizer.cpp | 978 ++++++++++++++++++++----------- src/coreclr/jit/ssabuilder.cpp | 2 +- 20 files changed, 1147 insertions(+), 460 deletions(-) diff --git a/src/coreclr/jit/block.cpp b/src/coreclr/jit/block.cpp index 4472d573b0c92..9073d3dc50fe6 100644 --- a/src/coreclr/jit/block.cpp +++ b/src/coreclr/jit/block.cpp @@ -575,18 +575,49 @@ unsigned BasicBlock::dspCheapPreds() return count; } -/***************************************************************************** - * - * Display the basic block successors. - */ - +//------------------------------------------------------------------------ +// dspSuccs: Display the basic block successors. +// +// Arguments: +// compiler - compiler instance; passed to NumSucc(Compiler*) -- see that function for implications. 
+// void BasicBlock::dspSuccs(Compiler* compiler) { bool first = true; - for (BasicBlock* const succ : Succs(compiler)) + + // If this is a switch, we don't want to call `Succs(Compiler*)` because it will eventually call + // `GetSwitchDescMap()`, and that will have the side-effect of allocating the unique switch descriptor map + // and/or compute this switch block's unique succ set if it is not present. Debug output functions should + // never have an effect on codegen. We also don't want to assume the unique succ set is accurate, so we + // compute it ourselves here. + if (bbJumpKind == BBJ_SWITCH) + { + // Create a set with all the successors. Don't use BlockSet, so we don't need to worry + // about the BlockSet epoch. + unsigned bbNumMax = compiler->impInlineRoot()->fgBBNumMax; + BitVecTraits bitVecTraits(bbNumMax + 1, compiler); + BitVec uniqueSuccBlocks(BitVecOps::MakeEmpty(&bitVecTraits)); + for (BasicBlock* const bTarget : SwitchTargets()) + { + BitVecOps::AddElemD(&bitVecTraits, uniqueSuccBlocks, bTarget->bbNum); + } + BitVecOps::Iter iter(&bitVecTraits, uniqueSuccBlocks); + unsigned bbNum = 0; + while (iter.NextElem(&bbNum)) + { + // Note that we will output switch successors in increasing numerical bbNum order, which is + // not related to their order in the bbJumpSwt->bbsDstTab table. + printf("%s" FMT_BB, first ? "" : ",", bbNum); + first = false; + } + } + else { - printf("%s" FMT_BB, first ? "" : ",", succ->bbNum); - first = false; + for (BasicBlock* const succ : Succs(compiler)) + { + printf("%s" FMT_BB, first ? "" : ",", succ->bbNum); + first = false; + } } } diff --git a/src/coreclr/jit/block.h b/src/coreclr/jit/block.h index 78c76a805ab3c..a7c49a03328f0 100644 --- a/src/coreclr/jit/block.h +++ b/src/coreclr/jit/block.h @@ -559,7 +559,7 @@ enum BasicBlockFlags : unsigned __int64 // Flags that relate blocks to loop structure. 
- BBF_LOOP_FLAGS = BBF_LOOP_PREHEADER | BBF_LOOP_HEAD | BBF_LOOP_CALL0 | BBF_LOOP_CALL1, + BBF_LOOP_FLAGS = BBF_LOOP_PREHEADER | BBF_LOOP_HEAD | BBF_LOOP_CALL0 | BBF_LOOP_CALL1 | BBF_LOOP_ALIGN, // Flags to update when two blocks are compacted @@ -1001,6 +1001,16 @@ struct BasicBlock : private LIR::Range bbHndIndex = from->bbHndIndex; } + void copyTryIndex(const BasicBlock* from) + { + bbTryIndex = from->bbTryIndex; + } + + void copyHndIndex(const BasicBlock* from) + { + bbHndIndex = from->bbHndIndex; + } + static bool sameTryRegion(const BasicBlock* blk1, const BasicBlock* blk2) { return blk1->bbTryIndex == blk2->bbTryIndex; diff --git a/src/coreclr/jit/clrjit.natvis b/src/coreclr/jit/clrjit.natvis index aa5fb650c28de..9debf0999048a 100644 --- a/src/coreclr/jit/clrjit.natvis +++ b/src/coreclr/jit/clrjit.natvis @@ -22,6 +22,7 @@ Documentation for VS debugger format specifiers: https://docs.microsoft.com/en-u BB{bbNum,d}->BB{bbJumpDest->bbNum,d}; {bbJumpKind,en} + BB{bbNum,d}; {bbJumpKind,en}; {bbJumpSwt->bbsCount} cases BB{bbNum,d}; {bbJumpKind,en} diff --git a/src/coreclr/jit/compiler.cpp b/src/coreclr/jit/compiler.cpp index fcb6f12e3d7fe..3e9059eb156af 100644 --- a/src/coreclr/jit/compiler.cpp +++ b/src/coreclr/jit/compiler.cpp @@ -4898,10 +4898,14 @@ void Compiler::compCompile(void** methodCodePtr, uint32_t* methodCodeSize, JitFl // DoPhase(this, PHASE_COMPUTE_REACHABILITY, &Compiler::fgComputeReachability); + // Scale block weights and mark run rarely blocks. + // + DoPhase(this, PHASE_SET_BLOCK_WEIGHTS, &Compiler::optSetBlockWeights); + // Discover and classify natural loops (e.g. mark iterative loops as such). Also marks loop blocks // and sets bbWeight to the loop nesting levels. // - DoPhase(this, PHASE_FIND_LOOPS, &Compiler::optFindLoops); + DoPhase(this, PHASE_FIND_LOOPS, &Compiler::optFindLoopsPhase); // Clone loops with optimization opportunities, and choose one based on dynamic condition evaluation. 
// @@ -5444,20 +5448,16 @@ void Compiler::ResetOptAnnotations() // The intent of this method is to update loop structure annotations, and those // they depend on; these annotations may have become stale during optimization, // and need to be up-to-date before running another iteration of optimizations. - +// void Compiler::RecomputeLoopInfo() { assert(opts.optRepeat); assert(JitConfig.JitOptRepeatCount() > 0); // Recompute reachability sets, dominators, and loops. - optLoopCount = 0; + optResetLoopInfo(); fgDomsComputed = false; - for (BasicBlock* const block : Blocks()) - { - block->bbFlags &= ~BBF_LOOP_FLAGS; - block->bbNatLoopNum = BasicBlock::NOT_IN_LOOP; - } fgComputeReachability(); + optSetBlockWeights(); // Rebuild the loop tree annotations themselves optFindLoops(); } diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h index a7a7a55b9f46f..ea088cccdd529 100644 --- a/src/coreclr/jit/compiler.h +++ b/src/coreclr/jit/compiler.h @@ -2468,15 +2468,15 @@ enum LoopFlags : unsigned short { LPFLG_EMPTY = 0, - LPFLG_DO_WHILE = 0x0001, // it's a do-while loop (i.e ENTRY is at the TOP) - LPFLG_ONE_EXIT = 0x0002, // the loop has only one exit - LPFLG_ITER = 0x0004, // loop of form: for (i = icon or lclVar; test_condition(); i++) - LPFLG_HOISTABLE = 0x0008, // the loop is in a form that is suitable for hoisting expressions + // LPFLG_UNUSED = 0x0001, + // LPFLG_UNUSED = 0x0002, + LPFLG_ITER = 0x0004, // loop of form: for (i = icon or lclVar; test_condition(); i++) + // LPFLG_UNUSED = 0x0008, - LPFLG_CONST = 0x0010, // loop of form: for (i=icon;i= postOrder(B) making the computation O(1). void fgNumberDomTree(DomTreeNode* domTree); - // When the flow graph changes, we need to update the block numbers, predecessor lists, reachability sets, and - // dominators. 
- void fgUpdateChangedFlowGraph(const bool computePreds = true, const bool computeDoms = true); + // When the flow graph changes, we need to update the block numbers, predecessor lists, reachability sets, + // dominators, and possibly loops. + void fgUpdateChangedFlowGraph(const bool computePreds = true, + const bool computeDoms = true, + const bool computeReturnBlocks = false, + const bool computeLoops = false); public: // Compute the predecessors of the blocks in the control flow graph. @@ -5816,6 +5822,8 @@ class Compiler BasicBlock* fgFirstBlockOfHandler(BasicBlock* block); + bool fgIsFirstBlockOfFilterOrHandler(BasicBlock* block); + flowList* fgGetPredForBlock(BasicBlock* block, BasicBlock* blockPred); flowList* fgGetPredForBlock(BasicBlock* block, BasicBlock* blockPred, flowList*** ptrToPred); @@ -5985,6 +5993,7 @@ class Compiler #endif // DUMP_FLOWGRAPHS #ifdef DEBUG + void fgDispDoms(); void fgDispReach(); void fgDispBBLiveness(BasicBlock* block); @@ -5999,6 +6008,8 @@ class Compiler static fgWalkPreFn fgStress64RsltMulCB; void fgStress64RsltMul(); void fgDebugCheckUpdate(); + + void fgDebugCheckBBNumIncreasing(); void fgDebugCheckBBlist(bool checkBBNum = false, bool checkBBRefs = true); void fgDebugCheckBlockLinks(); void fgDebugCheckLinks(bool morphTrees = false); @@ -6014,7 +6025,8 @@ class Compiler void fgDebugCheckProfileData(); bool fgDebugCheckIncomingProfileData(BasicBlock* block); bool fgDebugCheckOutgoingProfileData(BasicBlock* block); -#endif + +#endif // DEBUG static bool fgProfileWeightsEqual(weight_t weight1, weight_t weight2); static bool fgProfileWeightsConsistent(weight_t weight1, weight_t weight2); @@ -6600,8 +6612,7 @@ class Compiler void optHoistLoopCode(); // To represent sets of VN's that have already been hoisted in outer loops. 
- typedef JitHashTable, bool> VNToBoolMap; - typedef VNToBoolMap VNSet; + typedef JitHashTable, bool> VNSet; struct LoopHoistContext { @@ -6612,9 +6623,10 @@ class Compiler public: // Value numbers of expressions that have been hoisted in parent loops in the loop nest. VNSet m_hoistedInParentLoops; + // Value numbers of expressions that have been hoisted in the current (or most recent) loop in the nest. // Previous decisions on loop-invariance of value numbers in the current loop. - VNToBoolMap m_curLoopVnInvariantCache; + VNSet m_curLoopVnInvariantCache; VNSet* GetHoistedInCurLoop(Compiler* comp) { @@ -6657,7 +6669,7 @@ class Compiler void optHoistLoopBlocks(unsigned loopNum, ArrayStack* blocks, LoopHoistContext* hoistContext); // Return true if the tree looks profitable to hoist out of loop 'lnum'. - bool optIsProfitableToHoistableTree(GenTree* tree, unsigned lnum); + bool optIsProfitableToHoistTree(GenTree* tree, unsigned lnum); // Performs the hoisting 'tree' into the PreHeader for loop 'lnum' void optHoistCandidate(GenTree* tree, BasicBlock* treeBb, unsigned lnum, LoopHoistContext* hoistCtxt); @@ -6665,7 +6677,7 @@ class Compiler // Returns true iff the ValueNum "vn" represents a value that is loop-invariant in "lnum". // Constants and init values are always loop invariant. // VNPhi's connect VN's to the SSA definition, so we can know if the SSA def occurs in the loop. - bool optVNIsLoopInvariant(ValueNum vn, unsigned lnum, VNToBoolMap* recordedVNs); + bool optVNIsLoopInvariant(ValueNum vn, unsigned lnum, VNSet* recordedVNs); // If "blk" is the entry block of a natural loop, returns true and sets "*pLnum" to the index of the loop // in the loop table. @@ -6705,7 +6717,10 @@ class Compiler public: PhaseStatus optInvertLoops(); // Invert loops so they're entered at top and tested at bottom. 
PhaseStatus optOptimizeLayout(); // Optimize the BasicBlock layout of the method - PhaseStatus optFindLoops(); // Finds loops and records them in the loop table + PhaseStatus optSetBlockWeights(); + PhaseStatus optFindLoopsPhase(); // Finds loops and records them in the loop table + + void optFindLoops(); PhaseStatus optCloneLoops(); void optCloneLoop(unsigned loopInd, LoopCloneContext* context); @@ -6763,7 +6778,6 @@ class Compiler bool lpLoopHasMemoryHavoc[MemoryKindCount]; // The loop contains an operation that we assume has arbitrary // memory side effects. If this is set, the fields below // may not be accurate (since they become irrelevant.) - bool lpContainsCall; // True if executing the loop body *may* execute a call VARSET_TP lpVarInOut; // The set of variables that are IN or OUT during the execution of this loop VARSET_TP lpVarUseDef; // The set of variables that are USE or DEF during the execution of this loop @@ -6809,6 +6823,10 @@ class Compiler var_types lpIterOperType() const; // For overflow instructions + // Set to the block where we found the initialization for LPFLG_CONST_INIT or LPFLG_VAR_INIT loops. + // Initially, this will be 'head', but 'head' might change if we insert a loop pre-header block. + BasicBlock* lpInitBlock; + union { int lpConstInit; // initial constant value of iterator // : Valid if LPFLG_CONST_INIT @@ -6839,6 +6857,21 @@ class Compiler // : Valid if LPFLG_ARRLEN_LIMIT bool lpArrLenLimit(Compiler* comp, ArrIndex* index) const; + // Returns "true" iff this is a "top entry" loop. + bool lpIsTopEntry() const + { + if (lpHead->bbNext == lpEntry) + { + assert(lpHead->bbFallsThrough()); + assert(lpTop == lpEntry); + return true; + } + else + { + return false; + } + } + // Returns "true" iff "*this" contains the blk. 
bool lpContains(BasicBlock* blk) const { @@ -6891,6 +6924,24 @@ class Compiler (lpHead->bbNum < lpTop->bbNum || lpHead->bbNum > lpBottom->bbNum); } +#ifdef DEBUG + void lpValidatePreHeader() + { + // If this is called, we expect there to be a pre-header. + assert(lpFlags & LPFLG_HAS_PREHEAD); + + // The pre-header must unconditionally enter the loop. + assert(lpHead->GetUniqueSucc() == lpEntry); + + // The loop block must be marked as a pre-header. + assert(lpHead->bbFlags & BBF_LOOP_PREHEADER); + + // The loop entry must have a single non-loop predecessor, which is the pre-header. + // We can't assume here that the bbNum are properly ordered, so we can't do a simple lpContained() + // check. So, we defer this check, which will be done by `fgDebugCheckLoopTable()`. + } +#endif // DEBUG + // LoopBlocks: convenience method for enabling range-based `for` iteration over all the // blocks in a loop, e.g.: // for (BasicBlock* const block : loop->LoopBlocks()) ... @@ -6913,6 +6964,17 @@ class Compiler unsigned char optLoopCount; // number of tracked loops unsigned char loopAlignCandidates; // number of loops identified for alignment + // Every time we rebuild the loop table, we increase the global "loop epoch". Any loop indices or + // loop table pointers from the previous epoch are invalid. + // TODO: validate this in some way? + unsigned optCurLoopEpoch; + + void NewLoopEpoch() + { + ++optCurLoopEpoch; + JITDUMP("New loop epoch %d\n", optCurLoopEpoch); + } + #ifdef DEBUG unsigned char loopsAligned; // number of loops actually aligned #endif // DEBUG @@ -6940,11 +7002,12 @@ class Compiler void optCheckPreds(); #endif + void optResetLoopInfo(); + void optFindAndScaleGeneralLoopBlocks(); + // Determine if there are any potential loops, and set BBF_LOOP_HEAD on potential loop heads. 
void optMarkLoopHeads(); - void optSetBlockWeights(); - void optScaleLoopBlocks(BasicBlock* begBlk, BasicBlock* endBlk); void optUnmarkLoopBlocks(BasicBlock* begBlk, BasicBlock* endBlk); @@ -6955,7 +7018,7 @@ class Compiler unsigned optIsLoopIncrTree(GenTree* incr); bool optCheckIterInLoopTest(unsigned loopInd, GenTree* test, BasicBlock* from, BasicBlock* to, unsigned iterVar); bool optComputeIterInfo(GenTree* incr, BasicBlock* from, BasicBlock* to, unsigned* pIterVar); - bool optPopulateInitInfo(unsigned loopInd, GenTree* init, unsigned iterVar); + bool optPopulateInitInfo(unsigned loopInd, BasicBlock* initBlock, GenTree* init, unsigned iterVar); bool optExtractInitTestIncr( BasicBlock* head, BasicBlock* bottom, BasicBlock* exit, GenTree** ppInit, GenTree** ppTest, GenTree** ppIncr); @@ -7006,15 +7069,13 @@ class Compiler // The depth of the loop described by "lnum" (an index into the loop table.) (0 == top level) unsigned optLoopDepth(unsigned lnum) { - unsigned par = optLoopTable[lnum].lpParent; - if (par == BasicBlock::NOT_IN_LOOP) - { - return 0; - } - else + assert(lnum < optLoopCount); + unsigned depth = 0; + while ((lnum = optLoopTable[lnum].lpParent) != BasicBlock::NOT_IN_LOOP) { - return 1 + optLoopDepth(par); + ++depth; } + return depth; } // Struct used in optInvertWhileLoop to count interesting constructs to boost the profitability score. 
diff --git a/src/coreclr/jit/compphases.h b/src/coreclr/jit/compphases.h index b8805fca75352..7ecdd1edd2a1d 100644 --- a/src/coreclr/jit/compphases.h +++ b/src/coreclr/jit/compphases.h @@ -31,7 +31,7 @@ CompPhaseNameMacro(PHASE_POST_IMPORT, "Post-import", CompPhaseNameMacro(PHASE_IBCPREP, "Profile instrumentation prep", "IBCPREP", false, -1, false) CompPhaseNameMacro(PHASE_IBCINSTR, "Profile instrumentation", "IBCINSTR", false, -1, false) CompPhaseNameMacro(PHASE_INCPROFILE, "Profile incorporation", "INCPROF", false, -1, false) -CompPhaseNameMacro(PHASE_MORPH_INIT, "Morph - Init", "MOR-INIT" ,false, -1, false) +CompPhaseNameMacro(PHASE_MORPH_INIT, "Morph - Init", "MOR-INIT", false, -1, false) CompPhaseNameMacro(PHASE_MORPH_INLINE, "Morph - Inlining", "MOR-INL", false, -1, true) CompPhaseNameMacro(PHASE_MORPH_ADD_INTERNAL, "Morph - Add internal blocks", "MOR-ADD", false, -1, true) CompPhaseNameMacro(PHASE_ALLOCATE_OBJECTS, "Allocate Objects", "ALLOC-OBJ", false, -1, false) @@ -48,7 +48,7 @@ CompPhaseNameMacro(PHASE_PROMOTE_STRUCTS, "Morph - Promote Structs", CompPhaseNameMacro(PHASE_MORPH_GLOBAL, "Morph - Global", "MOR-GLOB", false, -1, false) CompPhaseNameMacro(PHASE_MORPH_END, "Morph - Finish", "MOR-END", false, -1, true) CompPhaseNameMacro(PHASE_GS_COOKIE, "GS Cookie", "GS-COOK", false, -1, false) -CompPhaseNameMacro(PHASE_COMPUTE_EDGE_WEIGHTS, "Compute edge weights (1, false)", "EDG-WGT", false, -1, false) +CompPhaseNameMacro(PHASE_COMPUTE_EDGE_WEIGHTS, "Compute edge weights (1, false)","EDG-WGT", false, -1, false) #if defined(FEATURE_EH_FUNCLETS) CompPhaseNameMacro(PHASE_CREATE_FUNCLETS, "Create EH funclets", "EH-FUNC", false, -1, false) #endif // FEATURE_EH_FUNCLETS @@ -56,6 +56,7 @@ CompPhaseNameMacro(PHASE_MERGE_THROWS, "Merge throw blocks", CompPhaseNameMacro(PHASE_INVERT_LOOPS, "Invert loops", "LOOP-INV", false, -1, false) CompPhaseNameMacro(PHASE_OPTIMIZE_LAYOUT, "Optimize layout", "LAYOUT", false, -1, false) CompPhaseNameMacro(PHASE_COMPUTE_REACHABILITY, 
"Compute blocks reachability", "BL_REACH", false, -1, false) +CompPhaseNameMacro(PHASE_SET_BLOCK_WEIGHTS, "Set block weights", "BL-WEIGHTS", false, -1, false) CompPhaseNameMacro(PHASE_ZERO_INITS, "Redundant zero Inits", "ZERO-INIT", false, -1, false) CompPhaseNameMacro(PHASE_FIND_LOOPS, "Find loops", "LOOP-FND", false, -1, false) CompPhaseNameMacro(PHASE_CLONE_LOOPS, "Clone loops", "LP-CLONE", false, -1, false) @@ -82,7 +83,7 @@ CompPhaseNameMacro(PHASE_OPTIMIZE_BRANCHES, "Redundant branch opts", CompPhaseNameMacro(PHASE_ASSERTION_PROP_MAIN, "Assertion prop", "AST-PROP", false, -1, false) #endif CompPhaseNameMacro(PHASE_OPT_UPDATE_FLOW_GRAPH, "Update flow graph opt pass", "UPD-FG-O", false, -1, false) -CompPhaseNameMacro(PHASE_COMPUTE_EDGE_WEIGHTS2, "Compute edge weights (2, false)", "EDG-WGT2", false, -1, false) +CompPhaseNameMacro(PHASE_COMPUTE_EDGE_WEIGHTS2, "Compute edge weights (2, false)","EDG-WGT2", false, -1, false) CompPhaseNameMacro(PHASE_INSERT_GC_POLLS, "Insert GC Polls", "GC-POLLS", false, -1, true) CompPhaseNameMacro(PHASE_DETERMINE_FIRST_COLD_BLOCK, "Determine first cold block", "COLD-BLK", false, -1, true) CompPhaseNameMacro(PHASE_RATIONALIZE, "Rationalize IR", "RAT", false, -1, false) diff --git a/src/coreclr/jit/error.cpp b/src/coreclr/jit/error.cpp index 36aef2c5a0e4b..7906d4c6f4a8a 100644 --- a/src/coreclr/jit/error.cpp +++ b/src/coreclr/jit/error.cpp @@ -279,7 +279,7 @@ extern "C" void __cdecl assertAbort(const char* why, const char* file, unsigned const char* msg = why; LogEnv* env = JitTls::GetLogEnv(); const int BUFF_SIZE = 8192; - char* buff = (char*)alloca(BUFF_SIZE); + char* buff = (char*)_alloca(BUFF_SIZE); const char* phaseName = "unknown phase"; if (env->compiler) { diff --git a/src/coreclr/jit/fgbasic.cpp b/src/coreclr/jit/fgbasic.cpp index fb868e4566e32..73695932cb543 100644 --- a/src/coreclr/jit/fgbasic.cpp +++ b/src/coreclr/jit/fgbasic.cpp @@ -35,7 +35,8 @@ void Compiler::fgInit() fgCalledCount = BB_ZERO_WEIGHT; /* We haven't yet 
computed the dominator sets */ - fgDomsComputed = false; + fgDomsComputed = false; + fgReturnBlocksComputed = false; #ifdef DEBUG fgReachabilitySetsValid = false; @@ -4701,7 +4702,7 @@ void Compiler::fgRemoveBlock(BasicBlock* block, bool unreachable) { loopAlignCandidates++; succBlock->bbFlags |= BBF_LOOP_ALIGN; - JITDUMP("Propagating LOOP_ALIGN flag from " FMT_BB " to " FMT_BB " for loop# %d.\n", block->bbNum, + JITDUMP("Propagating LOOP_ALIGN flag from " FMT_BB " to " FMT_BB " for " FMT_LP "\n ", block->bbNum, succBlock->bbNum, block->bbNatLoopNum); } diff --git a/src/coreclr/jit/fgdiagnostic.cpp b/src/coreclr/jit/fgdiagnostic.cpp index ce76dc7989619..318f7f9df2dbc 100644 --- a/src/coreclr/jit/fgdiagnostic.cpp +++ b/src/coreclr/jit/fgdiagnostic.cpp @@ -585,7 +585,7 @@ FILE* Compiler::fgOpenFlowGraphFile(bool* wbDontClose, Phases phase, PhasePositi { wCharCount += wcslen(pathname) + 1; } - filename = (LPCWSTR)alloca(wCharCount * sizeof(WCHAR)); + filename = (LPCWSTR)_alloca(wCharCount * sizeof(WCHAR)); if (pathname != nullptr) { @@ -651,7 +651,7 @@ FILE* Compiler::fgOpenFlowGraphFile(bool* wbDontClose, Phases phase, PhasePositi { wCharCount += wcslen(pathname) + 1; } - filename = (LPCWSTR)alloca(wCharCount * sizeof(WCHAR)); + filename = (LPCWSTR)_alloca(wCharCount * sizeof(WCHAR)); if (pathname != nullptr) { swprintf_s((LPWSTR)filename, wCharCount, W("%s\\%s.%s"), pathname, origFilename, type); @@ -839,7 +839,7 @@ bool Compiler::fgDumpFlowGraph(Phases phase, PhasePosition pos) // that size, even though it means allocating a block map possibly much bigger than what's required for just // the inlinee blocks. - unsigned blkMapSize = 1 + (compIsForInlining() ? 
impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax); + unsigned blkMapSize = 1 + impInlineRoot()->fgBBNumMax; unsigned blockOrdinal = 1; unsigned* blkMap = new (this, CMK_DebugOnly) unsigned[blkMapSize]; memset(blkMap, 0, sizeof(unsigned) * blkMapSize); @@ -1794,8 +1794,8 @@ void Compiler::fgDispDoms() void Compiler::fgTableDispBasicBlock(BasicBlock* block, int ibcColWidth /* = 0 */) { - const unsigned __int64 flags = block->bbFlags; - unsigned bbNumMax = compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax; + const unsigned __int64 flags = block->bbFlags; + unsigned bbNumMax = impInlineRoot()->fgBBNumMax; int maxBlockNumWidth = CountDigits(bbNumMax); maxBlockNumWidth = max(maxBlockNumWidth, 2); int blockNumWidth = CountDigits(block->bbNum); @@ -2178,7 +2178,7 @@ void Compiler::fgDispBasicBlocks(BasicBlock* firstBlock, BasicBlock* lastBlock, ibcColWidth = max(ibcColWidth, 3) + 1; // + 1 for the leading space } - unsigned bbNumMax = compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax; + unsigned bbNumMax = impInlineRoot()->fgBBNumMax; int maxBlockNumWidth = CountDigits(bbNumMax); maxBlockNumWidth = max(maxBlockNumWidth, 2); int padWidth = maxBlockNumWidth - 2; // Account for functions with a large number of blocks. @@ -2630,6 +2630,21 @@ bool BBPredsChecker::CheckEHFinallyRet(BasicBlock* blockPred, BasicBlock* block) return false; } +//------------------------------------------------------------------------------ +// fgDebugCheckBBNumIncreasing: Check that the block list bbNum are in increasing order in the bbNext +// traversal. Given a block B1 and its bbNext successor B2, this means `B1->bbNum < B2->bbNum`, but not +// that `B1->bbNum + 1 == B2->bbNum` (which is true after renumbering). This can be used as a precondition +// to a phase that expects this ordering to compare block numbers (say, to look for backwards branches) +// and doesn't want to call fgRenumberBlocks(), to avoid that potential expense. 
+// +void Compiler::fgDebugCheckBBNumIncreasing() +{ + for (BasicBlock* const block : Blocks()) + { + assert(block->bbNext == nullptr || (block->bbNum < block->bbNext->bbNum)); + } +} + // This variable is used to generate "traversal labels": one-time constants with which // we label basic blocks that are members of the basic block list, in order to have a // fast, high-probability test for membership in that list. Type is "volatile" because @@ -3762,6 +3777,8 @@ void Compiler::fgDebugCheckNodesUniqueness() //------------------------------------------------------------------------------ // fgDebugCheckLoopTable: checks that the loop table is valid. // - If the method has natural loops, the loop table is not null +// - Loop `top` must come before `bottom`. +// - Loop `entry` must be between `top` and `bottom`. // - All basic blocks with loop numbers set have a corresponding loop in the table // - All basic blocks without a loop number are not in a loop // - All parents of the loop with the block contain that block @@ -3773,6 +3790,200 @@ void Compiler::fgDebugCheckLoopTable() assert(optLoopTable != nullptr); } + // Build a mapping from existing block list number (bbNum) to the block number it would be after the + // blocks are renumbered. This allows making asserts about the relative ordering of blocks using block number + // without actually renumbering the blocks, which would affect non-DEBUG code paths. Note that there may be + // `blockNumMap[bbNum] == 0` if the `bbNum` block was deleted and blocks haven't been renumbered since + // the deletion. 
+ + unsigned bbNumMax = impInlineRoot()->fgBBNumMax; + + // blockNumMap[old block number] => new block number + size_t blockNumBytes = (bbNumMax + 1) * sizeof(unsigned); + unsigned* blockNumMap = (unsigned*)_alloca(blockNumBytes); + memset(blockNumMap, 0, blockNumBytes); + + unsigned newBBnum = 1; + for (BasicBlock* const block : Blocks()) + { + if ((block->bbFlags & BBF_REMOVED) == 0) + { + assert(1 <= block->bbNum && block->bbNum <= bbNumMax); + assert(blockNumMap[block->bbNum] == 0); // If this fails, we have two blocks with the same block number. + blockNumMap[block->bbNum] = newBBnum++; + } + } + + struct MappedChecks + { + static bool lpWellFormed(const unsigned* blockNumMap, const LoopDsc* loop) + { + return (blockNumMap[loop->lpTop->bbNum] <= blockNumMap[loop->lpEntry->bbNum]) && + (blockNumMap[loop->lpEntry->bbNum] <= blockNumMap[loop->lpBottom->bbNum]) && + ((blockNumMap[loop->lpHead->bbNum] < blockNumMap[loop->lpTop->bbNum]) || + (blockNumMap[loop->lpHead->bbNum] > blockNumMap[loop->lpBottom->bbNum])); + } + + static bool lpContains(const unsigned* blockNumMap, const LoopDsc* loop, const BasicBlock* blk) + { + return (blockNumMap[loop->lpTop->bbNum] <= blockNumMap[blk->bbNum]) && + (blockNumMap[blk->bbNum] <= blockNumMap[loop->lpBottom->bbNum]); + } + static bool lpContains(const unsigned* blockNumMap, + const LoopDsc* loop, + const BasicBlock* top, + const BasicBlock* bottom) + { + return (blockNumMap[loop->lpTop->bbNum] <= blockNumMap[top->bbNum]) && + (blockNumMap[bottom->bbNum] < blockNumMap[loop->lpBottom->bbNum]); + } + static bool lpContains(const unsigned* blockNumMap, const LoopDsc* loop, const LoopDsc& lp2) + { + return lpContains(blockNumMap, loop, lp2.lpTop, lp2.lpBottom); + } + + static bool lpContainedBy(const unsigned* blockNumMap, + const LoopDsc* loop, + const BasicBlock* top, + const BasicBlock* bottom) + { + return (blockNumMap[top->bbNum] <= blockNumMap[loop->lpTop->bbNum]) && + (blockNumMap[loop->lpBottom->bbNum] < 
blockNumMap[bottom->bbNum]); + } + static bool lpContainedBy(const unsigned* blockNumMap, const LoopDsc* loop, const LoopDsc& lp2) + { + return lpContainedBy(blockNumMap, loop, lp2.lpTop, lp2.lpBottom); + } + + static bool lpDisjoint(const unsigned* blockNumMap, + const LoopDsc* loop, + const BasicBlock* top, + const BasicBlock* bottom) + { + return (blockNumMap[bottom->bbNum] < blockNumMap[loop->lpTop->bbNum]) || + (blockNumMap[loop->lpBottom->bbNum] < blockNumMap[top->bbNum]); + } + static bool lpDisjoint(const unsigned* blockNumMap, const LoopDsc* loop, const LoopDsc& lp2) + { + return lpDisjoint(blockNumMap, loop, lp2.lpTop, lp2.lpBottom); + } + }; + + // Check the loop table itself. + + int preHeaderCount = 0; + + for (unsigned i = 0; i < optLoopCount; i++) + { + const LoopDsc& loop = optLoopTable[i]; + + // Ignore removed loops + if (loop.lpFlags & LPFLG_REMOVED) + { + continue; + } + + assert(loop.lpHead != nullptr); + assert(loop.lpTop != nullptr); + assert(loop.lpEntry != nullptr); + assert(loop.lpBottom != nullptr); + + assert(MappedChecks::lpWellFormed(blockNumMap, &loop)); + + if (loop.lpExitCnt == 1) + { + assert(loop.lpExit != nullptr); + assert(MappedChecks::lpContains(blockNumMap, &loop, loop.lpExit)); + } + else + { + assert(loop.lpExit == nullptr); + } + + if (loop.lpParent != BasicBlock::NOT_IN_LOOP) + { + assert(loop.lpParent < optLoopCount); + assert(loop.lpParent < i); // outer loops come before inner loops in the table + const LoopDsc& parentLoop = optLoopTable[loop.lpParent]; + assert((parentLoop.lpFlags & LPFLG_REMOVED) == 0); // don't allow removed parent loop? + assert(MappedChecks::lpContainedBy(blockNumMap, &loop, optLoopTable[loop.lpParent])); + } + + if (loop.lpChild != BasicBlock::NOT_IN_LOOP) + { + // Verify all child loops are contained in the parent loop. 
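Viewed abstractly, the `lpContains`/`lpContainedBy`/`lpDisjoint` predicates above are interval tests over the mapped block numbers. A minimal sketch with a hypothetical `Interval` type, mirroring the asymmetric `<=`/`<` comparisons used in the real checks:

```cpp
#include <cassert>

// Simplified interval view of a loop: [top, bottom] in renumbered block order.
struct Interval
{
    unsigned top;
    unsigned bottom;
};

// Mirrors MappedChecks::lpContains: the inner range must start at or after the
// outer `top`, and end strictly before the outer `bottom`.
bool contains(const Interval& outer, const Interval& inner)
{
    return (outer.top <= inner.top) && (inner.bottom < outer.bottom);
}

// Mirrors MappedChecks::lpDisjoint: one range ends entirely before the other begins.
bool disjoint(const Interval& a, const Interval& b)
{
    return (b.bottom < a.top) || (a.bottom < b.top);
}
```

The strict `<` on the bottom comparison means a properly nested loop must end strictly before its enclosing loop's `bottom` block.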
+ for (unsigned child = loop.lpChild; // + child != BasicBlock::NOT_IN_LOOP; // + child = optLoopTable[child].lpSibling) + { + assert(child < optLoopCount); + assert(i < child); // outer loops come before inner loops in the table + const LoopDsc& childLoop = optLoopTable[child]; + if ((childLoop.lpFlags & LPFLG_REMOVED) == 0) // removed child loop might still be in table + { + assert(MappedChecks::lpContains(blockNumMap, &loop, childLoop)); + } + } + + // Verify all child loops are disjoint. + for (unsigned child = loop.lpChild; // + child != BasicBlock::NOT_IN_LOOP; // + child = optLoopTable[child].lpSibling) + { + const LoopDsc& childLoop = optLoopTable[child]; + if (childLoop.lpFlags & LPFLG_REMOVED) + { + continue; + } + for (unsigned child2 = optLoopTable[child].lpSibling; // + child2 != BasicBlock::NOT_IN_LOOP; // + child2 = optLoopTable[child2].lpSibling) + { + const LoopDsc& child2Loop = optLoopTable[child2]; + if (child2Loop.lpFlags & LPFLG_REMOVED) + { + continue; + } + assert(MappedChecks::lpDisjoint(blockNumMap, &childLoop, child2Loop)); + } + } + } + + // If the loop has a pre-header, ensure the pre-header form is correct. + if ((loop.lpFlags & LPFLG_HAS_PREHEAD) != 0) + { + ++preHeaderCount; + + BasicBlock* h = loop.lpHead; + assert(h->bbFlags & BBF_LOOP_PREHEADER); + + // The pre-header can only be BBJ_ALWAYS or BBJ_NONE and must enter the loop. + BasicBlock* e = loop.lpEntry; + if (h->bbJumpKind == BBJ_ALWAYS) + { + assert(h->bbJumpDest == e); + } + else + { + assert(h->bbJumpKind == BBJ_NONE); + assert(h->bbNext == e); + assert(loop.lpTop == e); + assert(loop.lpIsTopEntry()); + } + + // The entry block has a single non-loop predecessor, and it is the pre-header. + for (BasicBlock* const predBlock : e->PredBlocks()) + { + if (predBlock != h) + { + assert(MappedChecks::lpContains(blockNumMap, &loop, predBlock)); + } + } + } + } + + // Check basic blocks for loop annotations. 
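The pre-header shape being asserted above can be captured in a small model (the `JumpKind` enum and `Block` struct are simplified stand-ins for the JIT's types): a pre-header either falls through to the loop entry (`BBJ_NONE`) or jumps straight to it (`BBJ_ALWAYS`).

```cpp
#include <cassert>

enum JumpKind
{
    BBJ_NONE,   // falls through to the next block
    BBJ_ALWAYS, // unconditional jump to jumpDest
    BBJ_COND    // conditional: may fall through or jump
};

struct Block
{
    JumpKind kind;
    Block*   next;     // fall-through successor
    Block*   jumpDest; // branch target (BBJ_ALWAYS / BBJ_COND)
};

// Mirrors the pre-header shape check: the pre-header's only way into the
// loop must be the entry block, via fall-through or an unconditional jump.
bool isValidPreHeaderFor(const Block* preHeader, const Block* entry)
{
    if (preHeader->kind == BBJ_ALWAYS)
    {
        return preHeader->jumpDest == entry;
    }
    return (preHeader->kind == BBJ_NONE) && (preHeader->next == entry);
}
```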
+ for (BasicBlock* const block : Blocks()) { if (optLoopCount == 0) @@ -3792,7 +4003,7 @@ void Compiler::fgDebugCheckLoopTable() continue; } // Does this loop contain our block? - if (optLoopTable[i].lpContains(block)) + if (MappedChecks::lpContains(blockNumMap, &optLoopTable[i], block)) { loopNum = i; break; @@ -3804,6 +4015,11 @@ void Compiler::fgDebugCheckLoopTable() { // ...it must be the one pointed to by bbNatLoopNum. assert(block->bbNatLoopNum == loopNum); + + // TODO: We might want the following assert, but there are cases where we don't move all + // return blocks out of the loop. + // Return blocks are not allowed inside a loop; they should have been moved elsewhere. + // assert(block->bbJumpKind != BBJ_RETURN); } else { @@ -3814,11 +4030,21 @@ void Compiler::fgDebugCheckLoopTable() // All loops that contain the innermost loop with this block must also contain this block. while (loopNum != BasicBlock::NOT_IN_LOOP) { - assert(optLoopTable[loopNum].lpContains(block)); - + assert(MappedChecks::lpContains(blockNumMap, &optLoopTable[loopNum], block)); loopNum = optLoopTable[loopNum].lpParent; } + + if (block->bbFlags & BBF_LOOP_PREHEADER) + { + // Note that the bbNatLoopNum will not point to the loop where this is a pre-header, since bbNatLoopNum + // is only set on the blocks from `top` to `bottom`, and `head` is outside that. + --preHeaderCount; + } } + + // Verify that the number of loops marked as having pre-headers is the same as the number of blocks + // with the pre-header flag set. 
+ assert(preHeaderCount == 0); } /*****************************************************************************/ diff --git a/src/coreclr/jit/fgflow.cpp b/src/coreclr/jit/fgflow.cpp index 29f24b76b77a2..1a22263438a78 100644 --- a/src/coreclr/jit/fgflow.cpp +++ b/src/coreclr/jit/fgflow.cpp @@ -921,7 +921,8 @@ Compiler::SwitchUniqueSuccSet Compiler::GetDescriptorForSwitch(BasicBlock* switc // can create a new epoch, thus invalidating all existing BlockSet objects, such as // reachability information stored in the blocks. To avoid that, we just use a local BitVec. - BitVecTraits blockVecTraits(fgBBNumMax + 1, this); + unsigned bbNumMax = impInlineRoot()->fgBBNumMax; + BitVecTraits blockVecTraits(bbNumMax + 1, this); BitVec uniqueSuccBlocks(BitVecOps::MakeEmpty(&blockVecTraits)); for (BasicBlock* const targ : switchBlk->SwitchTargets()) { diff --git a/src/coreclr/jit/fgopt.cpp b/src/coreclr/jit/fgopt.cpp index d8171d6868dc1..361ff92ee8099 100644 --- a/src/coreclr/jit/fgopt.cpp +++ b/src/coreclr/jit/fgopt.cpp @@ -61,15 +61,15 @@ bool Compiler::fgDominate(BasicBlock* b1, BasicBlock* b2) if (b1->bbNum > fgDomBBcount) { - // if b1 is a loop preheader and Succ is its only successor, then all predecessors of - // Succ either are b1 itself or are dominated by Succ. Under these conditions, b1 - // dominates b2 if and only if Succ dominates b2 (or if b2 == b1, but we already tested - // for this case) + // If b1 is a loop preheader (that was created after the dominators were calculated), + // then it has a single successor that is the loop entry, and it is the only non-loop + // predecessor of the loop entry. Thus, b1 dominates the loop entry and also dominates + // what the loop entry dominates. 
if (b1->bbFlags & BBF_LOOP_PREHEADER) { - noway_assert(b1->bbFlags & BBF_INTERNAL); - noway_assert(b1->bbJumpKind == BBJ_NONE); - return fgDominate(b1->bbNext, b2); + BasicBlock* loopEntry = b1->GetUniqueSucc(); + assert(loopEntry != nullptr); + return fgDominate(loopEntry, b2); } // unknown dominators; err on the safe side and return false @@ -170,15 +170,28 @@ bool Compiler::fgReachable(BasicBlock* b1, BasicBlock* b2) // fgUpdateChangedFlowGraph: Update changed flow graph information. // // If the flow graph has changed, we need to recompute various information if we want to use it again. +// This does similar work to `fgComputeReachability`, but the caller can pick and choose what needs +// to be recomputed if they know certain things do NOT need to be recomputed. // // Arguments: -// computeDoms -- `true` if we should recompute dominators +// computePreds -- `true` if we should recompute predecessors +// computeDoms -- `true` if we should recompute dominators +// computeReturnBlocks -- `true` if we should recompute the list of return blocks +// computeLoops -- `true` if we should recompute the loop table // -void Compiler::fgUpdateChangedFlowGraph(const bool computePreds, const bool computeDoms) +void Compiler::fgUpdateChangedFlowGraph(const bool computePreds, + const bool computeDoms, + const bool computeReturnBlocks, + const bool computeLoops) { // We need to clear this so we don't hit an assert calling fgRenumberBlocks(). fgDomsComputed = false; + if (computeReturnBlocks) + { + fgComputeReturnBlocks(); + } + JITDUMP("\nRenumbering the basic blocks for fgUpdateChangeFlowGraph\n"); fgRenumberBlocks(); @@ -192,6 +205,14 @@ void Compiler::fgUpdateChangedFlowGraph(const bool computePreds, const bool comp { fgComputeDoms(); } + if (computeLoops) + { + // Reset the loop info annotations and find the loops again. + // Note: this is similar to `RecomputeLoopInfo`. 
+ optResetLoopInfo(); + optSetBlockWeights(); + optFindLoops(); + } } //------------------------------------------------------------------------ @@ -280,6 +301,47 @@ void Compiler::fgComputeReachabilitySets() #endif // DEBUG } +//------------------------------------------------------------------------ +// fgComputeReturnBlocks: Compute the set of BBJ_RETURN blocks. +// +// Initialize `fgReturnBlocks` to a list of the BBJ_RETURN blocks in the function. +// +void Compiler::fgComputeReturnBlocks() +{ + fgReturnBlocks = nullptr; + + for (BasicBlock* const block : Blocks()) + { + // If this is a BBJ_RETURN block, add it to our list of all BBJ_RETURN blocks. This list is only + // used to find return blocks. + if (block->bbJumpKind == BBJ_RETURN) + { + fgReturnBlocks = new (this, CMK_Reachability) BasicBlockList(block, fgReturnBlocks); + } + } + + fgReturnBlocksComputed = true; + +#ifdef DEBUG + if (verbose) + { + printf("Return blocks:"); + if (fgReturnBlocks == nullptr) + { + printf(" NONE"); + } + else + { + for (const BasicBlockList* bl = fgReturnBlocks; bl != nullptr; bl = bl->next) + { + printf(" " FMT_BB, bl->block->bbNum); + } + } + printf("\n"); + } +#endif // DEBUG +} + //------------------------------------------------------------------------ // fgComputeEnterBlocksSet: Compute the entry blocks set. // @@ -501,18 +563,7 @@ void Compiler::fgComputeReachability() fgDebugCheckBBlist(); #endif // DEBUG - /* Create a list of all BBJ_RETURN blocks. The head of the list is 'fgReturnBlocks'. */ - fgReturnBlocks = nullptr; - - for (BasicBlock* const block : Blocks()) - { - // If this is a BBJ_RETURN block, add it to our list of all BBJ_RETURN blocks. This list is only - // used to find return blocks. - if (block->bbJumpKind == BBJ_RETURN) - { - fgReturnBlocks = new (this, CMK_Reachability) BasicBlockList(block, fgReturnBlocks); - } - } + fgComputeReturnBlocks(); // Compute reachability and then delete blocks determined to be unreachable. 
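`fgComputeReturnBlocks` above builds its list by prepending each `BBJ_RETURN` block as it is found. A self-contained sketch of that pattern (simplified types, not the JIT's `BasicBlockList`); note that prepending leaves the list in reverse block order, which is fine since the list is only used for membership-style walks:

```cpp
#include <cassert>
#include <vector>

enum JumpKind
{
    BBJ_NONE,
    BBJ_RETURN
};

struct Block
{
    JumpKind kind;
    unsigned num;
};

struct BlockListNode
{
    const Block*   block;
    BlockListNode* next;
};

// Collect all BBJ_RETURN blocks into a singly-linked list by prepending,
// mirroring how fgComputeReturnBlocks populates fgReturnBlocks.
BlockListNode* computeReturnBlocks(const std::vector<Block>& blocks)
{
    BlockListNode* head = nullptr;
    for (const Block& b : blocks)
    {
        if (b.kind == BBJ_RETURN)
        {
            head = new BlockListNode{&b, head};
        }
    }
    return head;
}
```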
If we delete blocks, we
// need to loop, as that might have caused more blocks to become unreachable. This can happen in the
diff --git a/src/coreclr/jit/flowgraph.cpp b/src/coreclr/jit/flowgraph.cpp
index 63e2c53c97df5..1dca62569681b 100644
--- a/src/coreclr/jit/flowgraph.cpp
+++ b/src/coreclr/jit/flowgraph.cpp
@@ -1394,14 +1394,13 @@ void Compiler::fgLoopCallTest(BasicBlock* srcBB, BasicBlock* dstBB)
     }
 }

-/*****************************************************************************
- *
- *  Mark which loops are guaranteed to execute a call.
- */
-
+//------------------------------------------------------------------------
+// fgLoopCallMark: Mark which loops are guaranteed to execute a call by setting the
+// block BBF_LOOP_CALL0 and BBF_LOOP_CALL1 flags, as appropriate.
+//
 void Compiler::fgLoopCallMark()
 {
-    /* If we've already marked all the block, bail */
+    // If we've already marked all the blocks, bail.

     if (fgLoopCallMarked)
     {
@@ -1410,7 +1409,12 @@ void Compiler::fgLoopCallMark()

     fgLoopCallMarked = true;

-    /* Walk the blocks, looking for backward edges */
+#ifdef DEBUG
+    // This code depends on properly ordered bbNum, so check that.
+    fgDebugCheckBBNumIncreasing();
+#endif // DEBUG
+
+    // Walk the blocks, looking for backward edges.

     for (BasicBlock* const block : Blocks())
     {
diff --git a/src/coreclr/jit/gentree.cpp b/src/coreclr/jit/gentree.cpp
index af0a2a1e47fe5..4b14dc38522be 100644
--- a/src/coreclr/jit/gentree.cpp
+++ b/src/coreclr/jit/gentree.cpp
@@ -9603,7 +9603,7 @@ void Compiler::gtDispNode(GenTree* tree, IndentStack* indentStack, __in __in_z _
         if (tree->IsValue())
         {
             const size_t bufLength = msgLength - 1;
-            msg = reinterpret_cast<char*>(alloca(bufLength * sizeof(char)));
+            msg = reinterpret_cast<char*>(_alloca(bufLength * sizeof(char)));
             sprintf_s(const_cast<char*>(msg), bufLength, "t%d = %s", tree->gtTreeID, hasOperands ?
"" : " "); } } diff --git a/src/coreclr/jit/hashbv.cpp b/src/coreclr/jit/hashbv.cpp index 5f996a96c2223..34798eaa645eb 100644 --- a/src/coreclr/jit/hashbv.cpp +++ b/src/coreclr/jit/hashbv.cpp @@ -534,7 +534,7 @@ void hashBv::Resize(int newSize) hashBvNode** newNodes = this->getNewVector(newSize); - hashBvNode*** insertionPoints = (hashBvNode***)alloca(sizeof(hashBvNode*) * newSize); + hashBvNode*** insertionPoints = (hashBvNode***)_alloca(sizeof(hashBvNode*) * newSize); memset(insertionPoints, 0, sizeof(hashBvNode*) * newSize); for (int i = 0; i < newSize; i++) @@ -1277,7 +1277,7 @@ bool hashBv::MultiTraverseLHSBigger(hashBv* other) // this is larger hashBvNode*** cursors; int expansionFactor = hts / ots; - cursors = (hashBvNode***)alloca(expansionFactor * sizeof(void*)); + cursors = (hashBvNode***)_alloca(expansionFactor * sizeof(void*)); for (int h = 0; h < other->hashtable_size(); h++) { diff --git a/src/coreclr/jit/importer.cpp b/src/coreclr/jit/importer.cpp index 90c6161fb2d86..3519ac6f7b650 100644 --- a/src/coreclr/jit/importer.cpp +++ b/src/coreclr/jit/importer.cpp @@ -11723,7 +11723,7 @@ GenTree* Compiler::impCastClassOrIsInstToTree(GenTree* op1, if (!(cond)) \ { \ const int cchAssertImpBuf = 600; \ - char* assertImpBuf = (char*)alloca(cchAssertImpBuf); \ + char* assertImpBuf = (char*)_alloca(cchAssertImpBuf); \ _snprintf_s(assertImpBuf, cchAssertImpBuf, cchAssertImpBuf - 1, \ "%s : Possibly bad IL with CEE_%s at offset %04Xh (op1=%s op2=%s stkDepth=%d)", #cond, \ impCurOpcName, impCurOpcOffs, op1 ? varTypeName(op1->TypeGet()) : "NULL", \ diff --git a/src/coreclr/jit/jiteh.cpp b/src/coreclr/jit/jiteh.cpp index 10e7652a6caa0..5cf2d7bfb8e68 100644 --- a/src/coreclr/jit/jiteh.cpp +++ b/src/coreclr/jit/jiteh.cpp @@ -3112,7 +3112,7 @@ void Compiler::fgVerifyHandlerTab() // between debug and non-debug code paths. So, create a renumbered block mapping: map the // existing block number to a renumbered block number that is ordered by block list order. 
- unsigned bbNumMax = compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax; + unsigned bbNumMax = impInlineRoot()->fgBBNumMax; // blockNumMap[old block number] => new block number size_t blockNumBytes = (bbNumMax + 1) * sizeof(unsigned); @@ -4617,3 +4617,31 @@ bool Compiler::fgCheckEHCanInsertAfterBlock(BasicBlock* blk, unsigned regionInde return insertOK; } + +//------------------------------------------------------------------------ +// fgIsFirstBlockOfFilterOrHandler: return true if the given block is the first block of an EH handler +// or filter. +// +// Arguments: +// block - the BasicBlock in question +// +// Return Value: +// As described above. +// +bool Compiler::fgIsFirstBlockOfFilterOrHandler(BasicBlock* block) +{ + if (!block->hasHndIndex()) + { + return false; + } + EHblkDsc* ehDsc = ehGetDsc(block->getHndIndex()); + if (ehDsc->ebdHndBeg == block) + { + return true; + } + if (ehDsc->HasFilter() && (ehDsc->ebdFilter == block)) + { + return true; + } + return false; +} diff --git a/src/coreclr/jit/loopcloning.cpp b/src/coreclr/jit/loopcloning.cpp index 09f42b6e6f665..82847b9981466 100644 --- a/src/coreclr/jit/loopcloning.cpp +++ b/src/coreclr/jit/loopcloning.cpp @@ -1592,17 +1592,16 @@ bool Compiler::optIsLoopClonable(unsigned loopInd) return false; } - BasicBlock* head = loop.lpHead; - BasicBlock* end = loop.lpBottom; - BasicBlock* beg = head->bbNext; + BasicBlock* top = loop.lpTop; + BasicBlock* bottom = loop.lpBottom; - if (end->bbJumpKind != BBJ_COND) + if (bottom->bbJumpKind != BBJ_COND) { JITDUMP("Loop cloning: rejecting loop " FMT_LP ". Couldn't find termination test.\n", loopInd); return false; } - if (end->bbJumpDest != beg) + if (bottom->bbJumpDest != top) { JITDUMP("Loop cloning: rejecting loop " FMT_LP ". 
Branch at loop 'bottom' not looping to 'top'.\n", loopInd); return false; @@ -2622,6 +2621,7 @@ PhaseStatus Compiler::optCloneLoops() { JITDUMP("Recompute reachability and dominators after loop cloning\n"); constexpr bool computePreds = false; + // TODO: recompute the loop table, to include the slow loop path in the table? fgUpdateChangedFlowGraph(computePreds); } diff --git a/src/coreclr/jit/morph.cpp b/src/coreclr/jit/morph.cpp index 9e51c24734995..f3aceac0dd14c 100644 --- a/src/coreclr/jit/morph.cpp +++ b/src/coreclr/jit/morph.cpp @@ -15302,7 +15302,7 @@ bool Compiler::fgFoldConditional(BasicBlock* block) /* This was a bogus loop (condition always false) * Remove the loop from the table */ - optLoopTable[loopNum].lpFlags |= LPFLG_REMOVED; + optMarkLoopRemoved(loopNum); optLoopTable[loopNum].lpTop->unmarkLoopAlign(this DEBUG_ARG("Bogus loop")); diff --git a/src/coreclr/jit/optimizer.cpp b/src/coreclr/jit/optimizer.cpp index 43e9616b2192f..5fc26491fc616 100644 --- a/src/coreclr/jit/optimizer.cpp +++ b/src/coreclr/jit/optimizer.cpp @@ -25,8 +25,9 @@ void Compiler::optInit() loopAlignCandidates = 0; /* Initialize the # of tracked loops to 0 */ - optLoopCount = 0; - optLoopTable = nullptr; + optLoopCount = 0; + optLoopTable = nullptr; + optCurLoopEpoch = 0; #ifdef DEBUG loopsAligned = 0; @@ -56,10 +57,11 @@ DataFlow::DataFlow(Compiler* pCompiler) : m_pCompiler(pCompiler) // Notes: // Depends on dominators, and fgReturnBlocks being set. 
// -void Compiler::optSetBlockWeights() +PhaseStatus Compiler::optSetBlockWeights() { noway_assert(opts.OptimizationEnabled()); assert(fgDomsComputed); + assert(fgReturnBlocksComputed); #ifdef DEBUG bool changed = false; @@ -137,6 +139,8 @@ void Compiler::optSetBlockWeights() /* Check that the flowgraph data (bbNum, bbRefs, bbPreds) is up-to-date */ fgDebugCheckBBlist(); #endif + + return PhaseStatus::MODIFIED_EVERYTHING; } //------------------------------------------------------------------------ @@ -459,8 +463,9 @@ void Compiler::optUpdateLoopsBeforeRemoveBlock(BasicBlock* block, bool skipUnmar if (loop.lpExit == block) { reportBefore(); + assert(loop.lpExitCnt == 1); + --loop.lpExitCnt; loop.lpExit = nullptr; - loop.lpFlags &= ~LPFLG_ONE_EXIT; } /* If this points to the actual entry in the loop @@ -657,6 +662,14 @@ void Compiler::optPrintLoopInfo(const LoopDsc* loop, bool printVerbose /* = fals printf(" from V%02u", loop->lpVarInit); } + if (loop->lpFlags & (LPFLG_CONST_INIT | LPFLG_VAR_INIT)) + { + if (loop->lpInitBlock != loop->lpHead) + { + printf(" (in " FMT_BB ")", loop->lpInitBlock->bbNum); + } + } + // If a simple test condition print operator and the limits */ printf(" %s", GenTree::OpName(loop->lpTestOper())); @@ -692,7 +705,7 @@ void Compiler::optPrintLoopInfo(const LoopDsc* loop, bool printVerbose /* = fals // Print the flags - if (loop->lpContainsCall) + if (loop->lpFlags & LPFLG_CONTAINS_CALL) { printf(" call"); } @@ -752,8 +765,10 @@ void Compiler::optPrintLoopTable() // optPopulateInitInfo: Populate loop init info in the loop table. // // Arguments: -// init - the tree that is supposed to initialize the loop iterator. -// iterVar - loop iteration variable. +// loopInd - loop index +// initBlock - block in which the initialization lives. +// init - the tree that is supposed to initialize the loop iterator. +// iterVar - loop iteration variable. // // Return Value: // "false" if the loop table could not be populated with the loop iterVar init info. 
@@ -762,7 +777,7 @@ void Compiler::optPrintLoopTable()
 // The 'init' tree is checked if its lhs is a local and rhs is either
 // a const or a local.
 //
-bool Compiler::optPopulateInitInfo(unsigned loopInd, GenTree* init, unsigned iterVar)
+bool Compiler::optPopulateInitInfo(unsigned loopInd, BasicBlock* initBlock, GenTree* init, unsigned iterVar)
 {
     // Operator should be =
     if (init->gtOper != GT_ASG)
@@ -784,11 +799,13 @@ bool Compiler::optPopulateInitInfo(unsigned loopInd, GenTree* init, unsigned ite
     {
         optLoopTable[loopInd].lpFlags |= LPFLG_CONST_INIT;
         optLoopTable[loopInd].lpConstInit = (int)rhs->AsIntCon()->gtIconVal;
+        optLoopTable[loopInd].lpInitBlock = initBlock;
     }
     else if (rhs->gtOper == GT_LCL_VAR)
     {
         optLoopTable[loopInd].lpFlags |= LPFLG_VAR_INIT;
-        optLoopTable[loopInd].lpVarInit = rhs->AsLclVarCommon()->GetLclNum();
+        optLoopTable[loopInd].lpVarInit   = rhs->AsLclVarCommon()->GetLclNum();
+        optLoopTable[loopInd].lpInitBlock = initBlock;
     }
     else
     {
@@ -1167,6 +1184,11 @@ bool Compiler::optExtractInitTestIncr(
 bool Compiler::optRecordLoop(
     BasicBlock* head, BasicBlock* top, BasicBlock* entry, BasicBlock* bottom, BasicBlock* exit, unsigned char exitCnt)
 {
+    if (exitCnt == 1)
+    {
+        noway_assert(exit != nullptr);
+    }
+
     // Record this loop in the table, if there's room.
     assert(optLoopCount <= BasicBlock::MAX_LOOP_NUM);
@@ -1189,6 +1211,8 @@ bool Compiler::optRecordLoop(
     {
         assert(loopInd == 0);
         optLoopTable = getAllocator(CMK_LoopOpt).allocate<LoopDsc>(BasicBlock::MAX_LOOP_NUM);
+
+        NewLoopEpoch();
     }
     else
     {
@@ -1243,23 +1267,9 @@ bool Compiler::optRecordLoop(
     {
         optLoopTable[loopInd].lpLoopHasMemoryHavoc[memoryKind] = false;
     }
-    optLoopTable[loopInd].lpContainsCall = false;
     optLoopTable[loopInd].lpFieldsModified = nullptr;
     optLoopTable[loopInd].lpArrayElemTypesModified = nullptr;

-    // If DO-WHILE loop mark it as such.
-    if (head->bbNext == entry)
-    {
-        optLoopTable[loopInd].lpFlags |= LPFLG_DO_WHILE;
-    }
-
-    // If single exit loop mark it as such.
- if (exitCnt == 1) - { - noway_assert(exit); - optLoopTable[loopInd].lpFlags |= LPFLG_ONE_EXIT; - } - // // Try to find loops that have an iterator (i.e. for-like loops) "for (init; test; incr){ ... }" // We have the following restrictions: @@ -1295,7 +1305,7 @@ bool Compiler::optRecordLoop( } } - if (!optPopulateInitInfo(loopInd, init, iterVar)) + if (!optPopulateInitInfo(loopInd, head, init, iterVar)) { goto DONE_LOOP; } @@ -1325,15 +1335,14 @@ bool Compiler::optRecordLoop( simpleTestLoopCount++; #endif +#if COUNT_LOOPS // Check if a constant iteration loop. if ((optLoopTable[loopInd].lpFlags & LPFLG_CONST_INIT) && (optLoopTable[loopInd].lpFlags & LPFLG_CONST_LIMIT)) { // This is a constant loop. - optLoopTable[loopInd].lpFlags |= LPFLG_CONST; -#if COUNT_LOOPS constIterLoopCount++; -#endif } +#endif #ifdef DEBUG if (verbose && 0) @@ -2566,6 +2575,23 @@ void Compiler::optFindNaturalLoops() fgUpdateChangedFlowGraph(computePreds); } + if (false /* pre-header stress */) + { + // Stress mode: aggressively create loop pre-header for every loop. + for (unsigned loopInd = 0; loopInd < optLoopCount; loopInd++) + { + fgCreateLoopPreHeader(loopInd); + } + + if (fgModified) + { + // The predecessors were maintained in fgCreateLoopPreHeader; don't rebuild them. + constexpr bool computePreds = false; + constexpr bool computeDoms = true; + fgUpdateChangedFlowGraph(computePreds, computeDoms); + } + } + #ifdef DEBUG if (verbose && (optLoopCount > 0)) { @@ -3546,8 +3572,14 @@ bool Compiler::optComputeLoopRep(int constInit, // Loops handled are fully unrolled; there is no partial unrolling. // // Limitations: only the following loop types are handled: -// 1. "while" loops -// 2. constant bound loops +// 1. "while" loops (top entry) +// 2. constant initializer, constant bound +// 3. The entire loop must be in the same EH region. +// 4. The loop iteration variable can't be address exposed. +// 5. The loop iteration variable can't be a promoted struct field. +// 6. 
We must be able to calculate the total constant iteration count. +// 7. On x86, there is a limit to the number of return blocks. So if there are return blocks in the loop that +// would be unrolled, the unrolled code can't exceed that limit. // // Cost heuristics: // 1. there are cost metrics for maximum number of allowed iterations, and maximum unroll size @@ -3557,6 +3589,8 @@ bool Compiler::optComputeLoopRep(int constInit, // In stress modes, these heuristic limits are expanded, and loops aren't required to have the // Vector.Length limit. // +// Loops are processed from innermost to outermost order, to attempt to unroll the most nested loops first. +// // Returns: // suitable phase status // @@ -3588,7 +3622,8 @@ PhaseStatus Compiler::optUnrollLoops() /* Look for loop unrolling candidates */ - bool change = false; + bool change = false; + bool anyNestedLoopsUnrolled = false; INDEBUG(int unrollCount = 0); // count of loops unrolled INDEBUG(int unrollFailures = 0); // count of loops attempted to be unrolled, but failed @@ -3629,12 +3664,13 @@ PhaseStatus Compiler::optUnrollLoops() // cannot wrap due to the loop termination condition. PREFAST_ASSUME(lnum != 0U - 1); - BasicBlock* block; + LoopDsc& loop = optLoopTable[lnum]; BasicBlock* head; + BasicBlock* top; BasicBlock* bottom; + BasicBlock* initBlock; - bool dupCond; - int lval; + bool dupCond; // Does the 'head' block contain a duplicate loop condition (zero trip test)? 
int lbeg; // initial value for iterator int llim; // limit value for iterator unsigned lvar; // iterator lclVar # @@ -3647,15 +3683,12 @@ PhaseStatus Compiler::optUnrollLoops() unsigned loopRetCount; // number of BBJ_RETURN blocks in loop unsigned totalIter; // total number of iterations in the constant loop - const unsigned loopFlags = optLoopTable[lnum].lpFlags; + const unsigned loopFlags = loop.lpFlags; // Check for required flags: - // LPFLG_DO_WHILE - required because this transform only handles loops of this form - // LPFLG_CONST - required because this transform only handles full unrolls - const unsigned requiredFlags = LPFLG_DO_WHILE | LPFLG_CONST; - - // Ignore the loop if we don't have a do-while that has a constant number of iterations. - + // LPFLG_CONST_INIT - required because this transform only handles full unrolls + // LPFLG_CONST_LIMIT - required because this transform only handles full unrolls + const unsigned requiredFlags = LPFLG_CONST_INIT | LPFLG_CONST_LIMIT; if ((loopFlags & requiredFlags) != requiredFlags) { // Don't print to the JitDump about this common case. @@ -3663,17 +3696,25 @@ PhaseStatus Compiler::optUnrollLoops() } // Ignore if removed or marked as not unrollable. - if (loopFlags & (LPFLG_DONT_UNROLL | LPFLG_REMOVED)) { // Don't print to the JitDump about this common case. continue; } - head = optLoopTable[lnum].lpHead; - noway_assert(head); - bottom = optLoopTable[lnum].lpBottom; - noway_assert(bottom); + // This transform only handles loops of this form + if (!loop.lpIsTopEntry()) + { + JITDUMP("Failed to unroll loop " FMT_LP ": not top entry\n", lnum); + continue; + } + + head = loop.lpHead; + noway_assert(head != nullptr); + top = loop.lpTop; + noway_assert(top != nullptr); + bottom = loop.lpBottom; + noway_assert(bottom != nullptr); // Get the loop data: // - initial constant @@ -3683,16 +3724,17 @@ PhaseStatus Compiler::optUnrollLoops() // - increment operation type (i.e. ADD, SUB, etc...) // - loop test type (i.e. 
GT_GE, GT_LT, etc...)

-        lbeg     = optLoopTable[lnum].lpConstInit;
-        llim     = optLoopTable[lnum].lpConstLimit();
-        testOper = optLoopTable[lnum].lpTestOper();
+        initBlock = loop.lpInitBlock;
+        lbeg      = loop.lpConstInit;
+        llim      = loop.lpConstLimit();
+        testOper  = loop.lpTestOper();

-        lvar     = optLoopTable[lnum].lpIterVar();
-        iterInc  = optLoopTable[lnum].lpIterConst();
-        iterOper = optLoopTable[lnum].lpIterOper();
+        lvar     = loop.lpIterVar();
+        iterInc  = loop.lpIterConst();
+        iterOper = loop.lpIterOper();

-        iterOperType = optLoopTable[lnum].lpIterOperType();
-        unsTest      = (optLoopTable[lnum].lpTestTree->gtFlags & GTF_UNSIGNED) != 0;
+        iterOperType = loop.lpIterOperType();
+        unsTest      = (loop.lpTestTree->gtFlags & GTF_UNSIGNED) != 0;

         if (lvaTable[lvar].IsAddressExposed())
         {
@@ -3708,7 +3750,7 @@ PhaseStatus Compiler::optUnrollLoops()
         }

         // Locate/initialize the increment/test statements.
-        Statement* initStmt = head->lastStmt();
+        Statement* initStmt = initBlock->lastStmt();
         noway_assert((initStmt != nullptr) && (initStmt->GetNextStmt() == nullptr));

         Statement* testStmt = bottom->lastStmt();
@@ -3759,6 +3801,7 @@ PhaseStatus Compiler::optUnrollLoops()
         else if (totalIter <= 1)
         {
             // No limit for single iteration loops
+            // If there is no iteration (totalIter == 0), we will remove the loop body entirely.
             unrollLimitSz = INT_MAX;
         }
         else if (!(loopFlags & LPFLG_SIMD_LIMIT))
@@ -3807,12 +3850,12 @@ PhaseStatus Compiler::optUnrollLoops()
         {
             ClrSafeInt<unsigned> loopCostSz; // Cost is size of one iteration
+            auto tryIndex = loop.lpTop->bbTryIndex;

-            block = head->bbNext;
-            auto tryIndex = block->bbTryIndex;
-
+            // Besides calculating the loop cost, also ensure that all loop blocks are within the same EH
+            // region, and count the number of BBJ_RETURN blocks in the loop.
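The `totalIter` value used by the heuristics here comes from `optComputeLoopRep`. As a rough, hypothetical illustration of trip-count computation for the simplest supported pattern, `for (i = init; i < limit; i += inc)` (the real code handles many more operators, types, and overflow cases), a sketch might look like:

```cpp
#include <cassert>

// Simplified trip-count calculation in the spirit of optComputeLoopRep, for
// the pattern `for (int i = init; i < limit; i += inc)`. Returns false when
// the count can't be computed safely (this sketch only handles inc > 0).
bool computeTripCount(int init, int limit, int inc, unsigned* totalIter)
{
    if (inc <= 0)
    {
        return false; // only simple increasing loops in this sketch
    }
    if (init >= limit)
    {
        *totalIter = 0; // zero-trip loop: the body never executes
        return true;
    }
    *totalIter = (unsigned)((limit - init + inc - 1) / inc); // ceiling division
    return true;
}
```

A zero trip count is exactly the `totalIter == 0` special case called out above, where unrolling degenerates into deleting the loop body.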
loopRetCount = 0; - for (;; block = block->bbNext) + for (BasicBlock* const block : loop.LoopBlocks()) { if (block->bbTryIndex != tryIndex) { @@ -3831,15 +3874,10 @@ PhaseStatus Compiler::optUnrollLoops() gtSetStmtInfo(stmt); loopCostSz += stmt->GetCostSz(); } - - if (block == bottom) - { - break; - } } #ifdef JIT32_GCENCODER - if (fgReturnCount + loopRetCount * (totalIter - 1) > SET_EPILOGCNT_MAX) + if ((totalIter > 0) && (fgReturnCount + loopRetCount * (totalIter - 1) > SET_EPILOGCNT_MAX)) { // Jit32 GC encoder can't report more than SET_EPILOGCNT_MAX epilogs. JITDUMP("Failed to unroll loop " FMT_LP ": GC encoder max epilog constraint\n", lnum); @@ -3870,32 +3908,40 @@ PhaseStatus Compiler::optUnrollLoops() if (verbose) { printf("\nUnrolling loop "); - optPrintLoopInfo(&optLoopTable[lnum]); + optPrintLoopInfo(&loop); printf(" over V%02u from %u to %u unrollCostSz = %d\n\n", lvar, lbeg, llim, unrollCostSz); } #endif } #if FEATURE_LOOP_ALIGN - for (block = head->bbNext;; block = block->bbNext) + for (BasicBlock* const block : loop.LoopBlocks()) { block->unmarkLoopAlign(this DEBUG_ARG("Unrolled loop")); - - if (block == bottom) - { - break; - } } #endif // Create the unrolled loop statement list. { - BlockToBlockMap blockMap(getAllocator(CMK_LoopOpt)); - BasicBlock* insertAfter = bottom; - - for (lval = lbeg; totalIter; totalIter--) - { - for (block = head->bbNext;; block = block->bbNext) + // When unrolling a loop, that loop disappears (and will be removed from the loop table). Each unrolled + // block will be set to exist within the parent loop, if any. However, if we unroll a loop that has + // nested loops, we will create multiple copies of the nested loops. This requires adding new loop table + // entries to represent the new loops. Instead of trying to do this incrementally, in the case where + // nested loops exist (in any unrolled loop) we rebuild the entire loop table after unrolling. 
+ + BlockToBlockMap blockMap(getAllocator(CMK_LoopOpt)); + BasicBlock* insertAfter = bottom; + BasicBlock::loopNumber newLoopNum = loop.lpParent; + bool anyNestedLoopsUnrolledThisLoop = false; + int lval; + unsigned iterToUnroll = totalIter; // The number of iterations left to unroll + + for (lval = lbeg; iterToUnroll > 0; iterToUnroll--) + { + // Note: we can't use the loop.LoopBlocks() iterator, as it captures loop.lpBottom->bbNext at the + // beginning of iteration, and we insert blocks before that. So we need to evaluate lpBottom->bbNext + // every iteration. + for (BasicBlock* block = loop.lpTop; block != loop.lpBottom->bbNext; block = block->bbNext) { BasicBlock* newBlock = insertAfter = fgNewBBafter(block->bbJumpKind, insertAfter, /*extendRegion*/ true); @@ -3903,17 +3949,29 @@ PhaseStatus Compiler::optUnrollLoops() if (!BasicBlock::CloneBlockState(this, newBlock, block, lvar, lval)) { - // cloneExpr doesn't handle everything + // CloneBlockState (specifically, gtCloneExpr) doesn't handle everything. If it fails + // to clone a block in the loop, splice out and forget all the blocks we cloned so far: + // put the loop blocks back to how they were before we started cloning blocks, + // and abort unrolling the loop. BasicBlock* oldBottomNext = insertAfter->bbNext; bottom->bbNext = oldBottomNext; oldBottomNext->bbPrev = bottom; - optLoopTable[lnum].lpFlags |= LPFLG_DONT_UNROLL; + loop.lpFlags |= LPFLG_DONT_UNROLL; // Mark it so we don't try to unroll it again. INDEBUG(++unrollFailures); JITDUMP("Failed to unroll loop " FMT_LP ": block cloning failed on " FMT_BB "\n", lnum, block->bbNum); goto DONE_LOOP; } + // All blocks in the unrolled loop will now be marked with the parent loop number. Note that + // if the loop being unrolled contains nested (child) loops, we will notice this below (when + // we set anyNestedLoopsUnrolledThisLoop), and that will cause us to rebuild the entire loop + // table and all loop annotations on blocks. 
However, if the loop contains no nested loops, + // setting the block `bbNatLoopNum` here is sufficient to incrementally update the block's + // loop info. + + newBlock->bbNatLoopNum = newLoopNum; + // Block weight should no longer have the loop multiplier // // Note this is not quite right, as we may not have upscaled by this amount @@ -3942,14 +4000,13 @@ PhaseStatus Compiler::optUnrollLoops() testCopyStmt->SetRootNode(sideEffList); } newBlock->bbJumpKind = BBJ_NONE; - - // Exit this loop; we've walked all the blocks. - break; } } - // Now redirect any branches within the newly-cloned iteration - for (block = head->bbNext; block != bottom; block = block->bbNext) + // Now redirect any branches within the newly-cloned iteration. + // Don't include `bottom` in the iteration, since we've already changed the + // newBlock->bbJumpKind, above. + for (BasicBlock* block = loop.lpTop; block != loop.lpBottom; block = block->bbNext) { BasicBlock* newBlock = blockMap[block]; optCopyBlkDest(block, newBlock); @@ -3979,39 +4036,38 @@ PhaseStatus Compiler::optUnrollLoops() } } + // If we get here, we successfully cloned all the blocks in the unrolled loop. + // Gut the old loop body - for (block = head->bbNext;; block = block->bbNext) + for (BasicBlock* const block : loop.LoopBlocks()) { + // Check if the old loop body had any nested loops that got cloned. Note that we need to do this + // here, and not in the loop above, to handle the special case where totalIter is zero, and the + // above loop doesn't execute. 
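The nested-loop detection the comment describes reduces to a membership scan over `bbNatLoopNum`; here is a minimal sketch under simplified assumptions (plain loop-number values stand in for the JIT's `BasicBlock` fields): any body block whose innermost-loop number differs from the loop being unrolled must belong to a nested child loop.

```cpp
#include <cassert>
#include <vector>

// Sketch (hypothetical, not the JIT's BasicBlock): each body block records the
// loop number of the innermost natural loop containing it. If any block of the
// loop being unrolled carries a different number, that block belongs to a
// nested loop, and unrolling duplicated it, so the loop table must be rebuilt.
bool bodyContainsNestedLoop(const std::vector<unsigned>& blockLoopNums, unsigned lnum)
{
    for (unsigned n : blockLoopNums)
    {
        if (n != lnum)
        {
            return true; // block belongs to a child loop
        }
    }
    return false;
}
```

Performing this scan while gutting the old body (rather than while cloning) also covers the `totalIter == 0` case, where the cloning loop never runs.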
+ if (block->bbNatLoopNum != lnum) + { + anyNestedLoopsUnrolledThisLoop = true; + } + block->bbStmtList = nullptr; block->bbJumpKind = BBJ_NONE; block->bbFlags &= ~BBF_LOOP_HEAD; - if (block->bbJumpDest != nullptr) - { - block->bbJumpDest = nullptr; - } + block->bbJumpDest = nullptr; + block->bbNatLoopNum = newLoopNum; + } - if (block == bottom) - { - break; - } + if (anyNestedLoopsUnrolledThisLoop) + { + anyNestedLoopsUnrolled = true; } // If the HEAD is a BBJ_COND drop the condition (and make HEAD a BBJ_NONE block). if (head->bbJumpKind == BBJ_COND) { - Statement* preHeaderStmt = head->firstStmt(); - noway_assert(preHeaderStmt != nullptr); - - testStmt = preHeaderStmt->GetPrevStmt(); - noway_assert((testStmt != nullptr) && (testStmt->GetNextStmt() == nullptr)); + testStmt = head->lastStmt(); noway_assert(testStmt->GetRootNode()->gtOper == GT_JTRUE); - - initStmt = testStmt->GetPrevStmt(); - noway_assert((initStmt != nullptr) && (initStmt->GetNextStmt() == testStmt)); - - initStmt->SetNextStmt(nullptr); - preHeaderStmt->SetPrevStmt(initStmt); + fgRemoveStmt(head, testStmt); head->bbJumpKind = BBJ_NONE; } else @@ -4027,26 +4083,33 @@ PhaseStatus Compiler::optUnrollLoops() gtDispTree(initStmt->GetRootNode()); printf("\n"); - fgDumpTrees(head->bbNext, insertAfter); - } -#endif + fgDumpTrees(top, insertAfter); - // Remember that something has changed. - - change = true; - - // Make sure to update loop table. - - // Mark the loop as removed. Make head and bottom nullptr to make it likelier for downstream - // phases that don't properly check the LPFLG_REMOVED flag to hit an assert or an access violation. + if (anyNestedLoopsUnrolledThisLoop) + { + printf("Unrolled loop " FMT_LP " contains nested loops\n", lnum); + } + } +#endif // DEBUG - optLoopTable[lnum].lpFlags |= LPFLG_REMOVED; - optLoopTable[lnum].lpHead = optLoopTable[lnum].lpBottom = nullptr; + // Update loop table. 
+ optMarkLoopRemoved(lnum); - // Note if we created new BBJ_RETURNs - fgReturnCount += loopRetCount * (totalIter - 1); + // Note if we created new BBJ_RETURNs (or removed some). + if (totalIter > 0) + { + fgReturnCount += loopRetCount * (totalIter - 1); + } + else + { + assert(totalIter == 0); + assert(fgReturnCount >= loopRetCount); + fgReturnCount -= loopRetCount; + } + // Remember that something has changed. INDEBUG(++unrollCount); + change = true; } DONE_LOOP:; @@ -4063,11 +4126,21 @@ PhaseStatus Compiler::optUnrollLoops() printf(", %d failures due to block cloning", unrollFailures); } printf("\n"); + if (anyNestedLoopsUnrolled) + { + printf("At least one unrolled loop contains nested loops; recomputing loop table\n"); + } } #endif // DEBUG - constexpr bool computePreds = true; - fgUpdateChangedFlowGraph(computePreds); + // If we unrolled any nested loops, we rebuild the loop table (including recomputing the + // return blocks list). + + constexpr bool computePreds = true; + constexpr bool computeDoms = true; + const bool computeReturnBlocks = anyNestedLoopsUnrolled; + const bool computeLoops = anyNestedLoopsUnrolled; + fgUpdateChangedFlowGraph(computePreds, computeDoms, computeReturnBlocks, computeLoops); DBEXEC(verbose, fgDispBasicBlocks()); } @@ -4075,6 +4148,7 @@ PhaseStatus Compiler::optUnrollLoops() { #ifdef DEBUG assert(unrollCount == 0); + assert(!anyNestedLoopsUnrolled); if (unrollFailures > 0) { @@ -4085,7 +4159,8 @@ PhaseStatus Compiler::optUnrollLoops() #ifdef DEBUG fgDebugCheckBBlist(true); -#endif + fgDebugCheckLoopTable(); +#endif // DEBUG return PhaseStatus::MODIFIED_EVERYTHING; } @@ -4752,15 +4827,16 @@ PhaseStatus Compiler::optOptimizeLayout() // void Compiler::optMarkLoopHeads() { - assert(!fgCheapPredsValid); - assert(fgReachabilitySetsValid); - #ifdef DEBUG if (verbose) { printf("*************** In optMarkLoopHeads()\n"); } + assert(!fgCheapPredsValid); + assert(fgReachabilitySetsValid); + fgDebugCheckBBNumIncreasing(); + int 
loopHeadsMarked = 0; #endif @@ -4798,117 +4874,166 @@ void Compiler::optMarkLoopHeads() } //----------------------------------------------------------------------------- -// optFindLoops: find loops in the function. +// optResetLoopInfo: reset all loop info in preparation for rebuilding the loop table, or preventing +// future phases from accessing loop-related data. // -// The JIT recognizes two types of loops in a function: natural loops and "general" (or "unnatural") loops. -// Natural loops are those which get added to the loop table. Most downstream optimizations require -// using natural loops. See `optFindNaturalLoops` for a definition of the criteria for recognizing a natural loop. -// A general loop is defined as a lexical (program order) range of blocks where a later block branches to an -// earlier block (that is, there is a back edge in the flow graph), and the later block is reachable from the earlier -// block. General loops are used for weighting flow graph blocks (when there is no block profile data), as well as -// for determining if we require fully interruptible GC information. -// -// Notes: -// Also (re)sets all non-IBC block weights, and marks loops potentially needing alignment padding. -// -PhaseStatus Compiler::optFindLoops() +void Compiler::optResetLoopInfo() { - noway_assert(opts.OptimizationEnabled()); - assert(fgDomsComputed); - #ifdef DEBUG if (verbose) { - printf("*************** In optFindLoops()\n"); + printf("*************** In optResetLoopInfo()\n"); } #endif - optMarkLoopHeads(); + optLoopCount = 0; // This will force the table to be rebuilt + loopAlignCandidates = 0; - optSetBlockWeights(); + // This will cause users to crash if they use the table when it is considered empty. + // TODO: the loop table is always allocated as the same (maximum) size, so this is wasteful. + // We could zero it out (possibly only in DEBUG) to be paranoid, but there's no reason to + // force it to be re-allocated. 
+ optLoopTable = nullptr; + + for (BasicBlock* const block : Blocks()) + { + // If the block weight didn't come from profile data, reset it so it can be calculated again. + if (!block->hasProfileWeight()) + { + block->bbWeight = BB_UNITY_WEIGHT; + block->bbFlags &= ~BBF_RUN_RARELY; + } - /* Were there any loops in the flow graph? */ + block->bbFlags &= ~BBF_LOOP_FLAGS; + block->bbNatLoopNum = BasicBlock::NOT_IN_LOOP; + } +} - if (fgHasLoops) +//----------------------------------------------------------------------------- +// optFindAndScaleGeneralLoopBlocks: scale block weights based on loop nesting depth. +// Note that this uses a very general notion of "loop": any block targeted by a reachable +// back-edge is considered a loop. +// +void Compiler::optFindAndScaleGeneralLoopBlocks() +{ +#ifdef DEBUG + if (verbose) { - optFindNaturalLoops(); + printf("*************** In optFindAndScaleGeneralLoopBlocks()\n"); + } +#endif - // Now find the general loops and scale block weights. + // This code depends on block number ordering. + INDEBUG(fgDebugCheckBBNumIncreasing()); - unsigned generalLoopCount = 0; + unsigned generalLoopCount = 0; - // We will use the following terminology: - // top - the first basic block in the loop (i.e. the head of the backward edge) - // bottom - the last block in the loop (i.e. the block from which we jump to the top) - // lastBottom - used when we have multiple back edges to the same top + // We will use the following terminology: + // top - the first basic block in the loop (i.e. the head of the backward edge) + // bottom - the last block in the loop (i.e. the block from which we jump to the top) + // lastBottom - used when we have multiple back edges to the same top - for (BasicBlock* const top : Blocks()) + for (BasicBlock* const top : Blocks()) + { + // Only consider `top` blocks already determined to be potential loop heads. + if (!top->isLoopHead()) { - // Only consider `top` blocks already determined to be potential loop heads. 
- if (!top->isLoopHead()) + continue; + } + + BasicBlock* foundBottom = nullptr; + + for (BasicBlock* const bottom : top->PredBlocks()) + { + // Is this a loop candidate? - We look for "back edges" + + // Is this a backward edge? (from BOTTOM to TOP) + if (top->bbNum > bottom->bbNum) { continue; } - BasicBlock* foundBottom = nullptr; - - for (BasicBlock* const bottom : top->PredBlocks()) + // We only consider back-edges that are BBJ_COND or BBJ_ALWAYS for loops. + if ((bottom->bbJumpKind != BBJ_COND) && (bottom->bbJumpKind != BBJ_ALWAYS)) { - // Is this a loop candidate? - We look for "back edges" - - // Is this a backward edge? (from BOTTOM to TOP) - if (top->bbNum > bottom->bbNum) - { - continue; - } - - // We only consider back-edges that are BBJ_COND or BBJ_ALWAYS for loops. - if ((bottom->bbJumpKind != BBJ_COND) && (bottom->bbJumpKind != BBJ_ALWAYS)) - { - continue; - } + continue; + } - /* the top block must be able to reach the bottom block */ - if (!fgReachable(top, bottom)) - { - continue; - } + /* the top block must be able to reach the bottom block */ + if (!fgReachable(top, bottom)) + { + continue; + } - /* Found a new loop, record the longest backedge in foundBottom */ + /* Found a new loop, record the longest backedge in foundBottom */ - if ((foundBottom == nullptr) || (bottom->bbNum > foundBottom->bbNum)) - { - foundBottom = bottom; - } + if ((foundBottom == nullptr) || (bottom->bbNum > foundBottom->bbNum)) + { + foundBottom = bottom; } + } - if (foundBottom) - { - generalLoopCount++; + if (foundBottom) + { + generalLoopCount++; - /* Mark all blocks between 'top' and 'bottom' */ + /* Mark all blocks between 'top' and 'bottom' */ - optScaleLoopBlocks(top, foundBottom); - } + optScaleLoopBlocks(top, foundBottom); + } - // We track at most 255 loops - if (generalLoopCount == 255) - { + // We track at most 255 loops + if (generalLoopCount == 255) + { #if COUNT_LOOPS - totalUnnatLoopOverflows++; + totalUnnatLoopOverflows++; #endif - break; - } + break; } 
+ } - JITDUMP("\nFound a total of %d general loops.\n", generalLoopCount); + JITDUMP("\nFound a total of %d general loops.\n", generalLoopCount); #if COUNT_LOOPS - totalUnnatLoopCount += generalLoopCount; + totalUnnatLoopCount += generalLoopCount; +#endif +} + +//----------------------------------------------------------------------------- +// optFindLoops: find loops in the function. +// +// The JIT recognizes two types of loops in a function: natural loops and "general" (or "unnatural") loops. +// Natural loops are those which get added to the loop table. Most downstream optimizations require +// using natural loops. See `optFindNaturalLoops` for a definition of the criteria for recognizing a natural loop. +// A general loop is defined as a lexical (program order) range of blocks where a later block branches to an +// earlier block (that is, there is a back edge in the flow graph), and the later block is reachable from the earlier +// block. General loops are used for weighting flow graph blocks (when there is no block profile data), as well as +// for determining if we require fully interruptible GC information. +// +// Notes: +// Also (re)sets all non-IBC block weights, and marks loops potentially needing alignment padding. +// +void Compiler::optFindLoops() +{ +#ifdef DEBUG + if (verbose) + { + printf("*************** In optFindLoops()\n"); + } #endif - // Check if any of the loops need alignment - optIdentifyLoopsForAlignment(); + noway_assert(opts.OptimizationEnabled()); + assert(fgDomsComputed); + + optMarkLoopHeads(); + + // Were there any potential loops in the flow graph? 
+ + if (fgHasLoops) + { + optFindNaturalLoops(); + optFindAndScaleGeneralLoopBlocks(); + optIdentifyLoopsForAlignment(); // Check if any of the loops need alignment } #ifdef DEBUG @@ -4916,6 +5041,14 @@ PhaseStatus Compiler::optFindLoops() #endif optLoopsMarked = true; +} + +//----------------------------------------------------------------------------- +// optFindLoopsPhase: The wrapper function for the "find loops" phase. +// +PhaseStatus Compiler::optFindLoopsPhase() +{ + optFindLoops(); return PhaseStatus::MODIFIED_EVERYTHING; } @@ -5640,6 +5773,8 @@ int Compiler::optIsSetAssgLoop(unsigned lnum, ALLVARSET_VALARG_TP vars, varRefKi void Compiler::optPerformHoistExpr(GenTree* origExpr, BasicBlock* exprBb, unsigned lnum) { + assert(exprBb != nullptr); + #ifdef DEBUG if (verbose) { @@ -5653,11 +5788,6 @@ void Compiler::optPerformHoistExpr(GenTree* origExpr, BasicBlock* exprBb, unsign } #endif - assert(exprBb != nullptr); - - // This loop has to be in a form that is approved for hoisting. - assert(optLoopTable[lnum].lpFlags & LPFLG_HOISTABLE); - // Create a copy of the expression and mark it for CSE's. GenTree* hoistExpr = gtCloneExpr(origExpr, GTF_MAKE_CSE); @@ -5681,10 +5811,9 @@ void Compiler::optPerformHoistExpr(GenTree* origExpr, BasicBlock* exprBb, unsign /* Put the statement in the preheader */ - fgCreateLoopPreHeader(lnum); + INDEBUG(optLoopTable[lnum].lpValidatePreHeader()); BasicBlock* preHead = optLoopTable[lnum].lpHead; - assert(preHead->bbJumpKind == BBJ_NONE); // fgMorphTree requires that compCurBB be the block that contains // (or in this case, will contain) the expression. @@ -5975,56 +6104,16 @@ void Compiler::optHoistThisLoop(unsigned lnum, LoopHoistContext* hoistCtxt) return; } - /* Get the head and tail of the loop */ - - BasicBlock* head = pLoopDsc->lpHead; - BasicBlock* tail = pLoopDsc->lpBottom; - BasicBlock* lbeg = pLoopDsc->lpEntry; - - // We must have a do-while loop - if ((pLoopDsc->lpFlags & LPFLG_DO_WHILE) == 0) - { - JITDUMP(" ... 
not hoisting " FMT_LP ": not top entry\n", lnum); - return; - } - - // The loop-head must dominate the loop-entry. - // TODO-CQ: Couldn't we make this true if it's not? - if (!fgDominate(head, lbeg)) - { - JITDUMP(" ... not hoisting " FMT_LP ": head " FMT_BB " does not dominate beg " FMT_BB "\n", lnum, head->bbNum, - lbeg->bbNum); - return; - } - - // if lbeg is the start of a new try block then we won't be able to hoist - if (!BasicBlock::sameTryRegion(head, lbeg)) - { - JITDUMP(" ... not hoisting in " FMT_LP ", eh region constraint\n", lnum); - return; - } - - // We don't bother hoisting when inside of a catch block - if ((lbeg->bbCatchTyp != BBCT_NONE) && (lbeg->bbCatchTyp != BBCT_FINALLY)) - { - JITDUMP(" ... not hoisting in " FMT_LP ", within catch\n", lnum); - return; - } - - pLoopDsc->lpFlags |= LPFLG_HOISTABLE; - - unsigned begn = lbeg->bbNum; - unsigned endn = tail->bbNum; - // Ensure the per-loop sets/tables are empty. hoistCtxt->m_curLoopVnInvariantCache.RemoveAll(); #ifdef DEBUG if (verbose) { - printf("optHoistThisLoop for loop " FMT_LP " <" FMT_BB ".." FMT_BB ">:\n", lnum, begn, endn); - printf(" Loop body %s a call\n", pLoopDsc->lpContainsCall ? "contains" : "does not contain"); - printf(" Loop has %s\n", (pLoopDsc->lpFlags & LPFLG_ONE_EXIT) ? "single exit" : "multiple exits"); + printf("optHoistThisLoop for loop " FMT_LP " <" FMT_BB ".." FMT_BB ">:\n", lnum, pLoopDsc->lpTop->bbNum, + pLoopDsc->lpBottom->bbNum); + printf(" Loop body %s a call\n", (pLoopDsc->lpFlags & LPFLG_CONTAINS_CALL) ? "contains" : "does not contain"); + printf(" Loop has %s\n", (pLoopDsc->lpExitCnt == 1) ? "single exit" : "multiple exits"); } #endif @@ -6122,7 +6211,7 @@ void Compiler::optHoistThisLoop(unsigned lnum, LoopHoistContext* hoistCtxt) // hoisting inner to outer. 
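For the single-exit case referred to here, the blocks worth considering are those that dominate the exit (they run on every completing iteration), and they can be collected by walking the immediate-dominator chain from the exit up to the entry. A sketch with hypothetical block names (string keys stand in for `BasicBlock*` and `bbIDom`):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch (hypothetical names, not JIT code): for a single-exit loop, the
// blocks guaranteed to execute on every completing iteration are exactly
// those dominating the exit. Walking the immediate-dominator chain from the
// exit block up to the loop entry yields them, innermost-first.
std::vector<std::string> blocksDominatingExit(
    const std::map<std::string, std::string>& idom, // block -> immediate dominator
    const std::string& entry,
    const std::string& exit)
{
    std::vector<std::string> result;
    std::string cur = exit;
    while (true)
    {
        result.push_back(cur);
        if (cur == entry)
        {
            break;
        }
        cur = idom.at(cur); // step to the immediate dominator
    }
    return result;
}
```

Processing the collected list from the entry downward then matches the "hoisting inner to outer" ordering the surrounding code relies on.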
// ArrayStack defExec(getAllocatorLoopHoist()); - if (pLoopDsc->lpFlags & LPFLG_ONE_EXIT) + if (pLoopDsc->lpExitCnt == 1) { assert(pLoopDsc->lpExit != nullptr); JITDUMP(" Only considering hoisting in blocks that dominate exit block " FMT_BB "\n", pLoopDsc->lpExit->bbNum); @@ -6154,11 +6243,11 @@ void Compiler::optHoistThisLoop(unsigned lnum, LoopHoistContext* hoistCtxt) optHoistLoopBlocks(lnum, &defExec, hoistCtxt); } -bool Compiler::optIsProfitableToHoistableTree(GenTree* tree, unsigned lnum) +bool Compiler::optIsProfitableToHoistTree(GenTree* tree, unsigned lnum) { LoopDsc* pLoopDsc = &optLoopTable[lnum]; - bool loopContainsCall = pLoopDsc->lpContainsCall; + bool loopContainsCall = (pLoopDsc->lpFlags & LPFLG_CONTAINS_CALL) != 0; int availRegCount; int hoistedExprCount; @@ -6931,30 +7020,41 @@ void Compiler::optHoistLoopBlocks(unsigned loopNum, ArrayStack* blo void Compiler::optHoistCandidate(GenTree* tree, BasicBlock* treeBb, unsigned lnum, LoopHoistContext* hoistCtxt) { assert(lnum != BasicBlock::NOT_IN_LOOP); - assert((optLoopTable[lnum].lpFlags & LPFLG_HOISTABLE) != 0); // It must pass the hoistable profitablity tests for this loop level - if (!optIsProfitableToHoistableTree(tree, lnum)) + if (!optIsProfitableToHoistTree(tree, lnum)) { JITDUMP(" ... not profitable to hoist\n"); return; } - bool b; - if (hoistCtxt->m_hoistedInParentLoops.Lookup(tree->gtVNPair.GetLiberal(), &b)) + if (hoistCtxt->m_hoistedInParentLoops.Lookup(tree->gtVNPair.GetLiberal())) { JITDUMP(" ... already hoisted same VN in parent\n"); // already hoisted in a parent loop, so don't hoist this expression. return; } - if (hoistCtxt->GetHoistedInCurLoop(this)->Lookup(tree->gtVNPair.GetLiberal(), &b)) + if (hoistCtxt->GetHoistedInCurLoop(this)->Lookup(tree->gtVNPair.GetLiberal())) { JITDUMP(" ... already hoisted same VN in current\n"); // already hoisted this expression in the current loop, so don't hoist this expression. return; } + // Create a loop pre-header in which to put the hoisted code. 
+ fgCreateLoopPreHeader(lnum); + + // If the block we're hoisting from and the pre-header are in different EH regions, don't hoist. + // TODO: we could probably hoist things that won't raise exceptions, such as constants. + if (!BasicBlock::sameTryRegion(optLoopTable[lnum].lpHead, treeBb)) + { + JITDUMP(" ... not hoisting in " FMT_LP ", eh region constraint (pre-header try index %d, candidate " FMT_BB + " try index %d)\n", + lnum, optLoopTable[lnum].lpHead->bbTryIndex, treeBb->bbNum, treeBb->bbTryIndex); + return; + } + // Expression can be hoisted optPerformHoistExpr(tree, treeBb, lnum); @@ -6979,7 +7079,7 @@ void Compiler::optHoistCandidate(GenTree* tree, BasicBlock* treeBb, unsigned lnu hoistCtxt->GetHoistedInCurLoop(this)->Set(tree->gtVNPair.GetLiberal(), true); } -bool Compiler::optVNIsLoopInvariant(ValueNum vn, unsigned lnum, VNToBoolMap* loopVnInvariantCache) +bool Compiler::optVNIsLoopInvariant(ValueNum vn, unsigned lnum, VNSet* loopVnInvariantCache) { // If it is not a VN, is not loop-invariant. if (vn == ValueNumStore::NoVN) @@ -7067,53 +7167,140 @@ bool Compiler::optVNIsLoopInvariant(ValueNum vn, unsigned lnum, VNToBoolMap* loo return res; } -/***************************************************************************** - * - * Creates a pre-header block for the given loop - a preheader is a BBJ_NONE - * header. The pre-header will replace the current lpHead in the loop table. - * The loop has to be a do-while loop. Thus, all blocks dominated by lpHead - * will also be dominated by the loop-top, lpHead->bbNext. - * - */ - +//------------------------------------------------------------------------------ +// fgCreateLoopPreHeader: Creates a pre-header block for the given loop. +// A pre-header is a block outside the loop that falls through or branches to the loop +// entry block. It is the only non-loop predecessor block to the entry block (thus, it +// dominates the entry block). The pre-header replaces the current lpHead in the loop table.
+// The pre-header will be placed immediately before the loop top block, which is the first +// block of the loop in program order. +// +// Once a loop has a pre-header, calling this function will immediately return without +// creating another. +// +// If there already exists a block that meets the pre-header requirements, that block is marked +// as a pre-header, and no flow graph modification is made. +// +// Note that the pre-header block can be in a different EH region from blocks in the loop, including the +// entry block. Code doing hoisting is required to check the EH legality of hoisting to the pre-header +// before doing so. +// +// Since the flow graph has changed, if needed, fgUpdateChangedFlowGraph() should be called after this +// to update the block numbers, reachability, and dominators. The loop table does not need to be rebuilt. +// The new pre-header block does have a copy of the previous 'head' reachability set, but the pre-header +// itself doesn't exist in any reachability/dominator sets. `fgDominate` has code to specifically +// handle queries about the pre-header dominating other blocks, even without re-computing dominators. +// The preds lists have been maintained. +// +// Currently, if you create a pre-header but don't put any code in it, any subsequent fgUpdateFlowGraph() +// pass might choose to compact the empty pre-header with a predecessor block. That is, a pre-header +// block might disappear if not used. +// +// The code does not depend on the order of the BasicBlock bbNum. +// +// Arguments: +// lnum - loop index +// void Compiler::fgCreateLoopPreHeader(unsigned lnum) { - LoopDsc* pLoopDsc = &optLoopTable[lnum]; - - /* This loop has to be a "do-while" loop */ - - assert(pLoopDsc->lpFlags & LPFLG_DO_WHILE); - - /* Have we already created a loop-preheader block? 
*/ - - if (pLoopDsc->lpFlags & LPFLG_HAS_PREHEAD) +#ifdef DEBUG + if (verbose) { - return; + printf("*************** In fgCreateLoopPreHeader for " FMT_LP "\n", lnum); } +#endif // DEBUG + + LoopDsc& loop = optLoopTable[lnum]; - BasicBlock* head = pLoopDsc->lpHead; - BasicBlock* top = pLoopDsc->lpTop; - BasicBlock* entry = pLoopDsc->lpEntry; + // Have we already created a loop-preheader block? - // if 'entry' and 'head' are in different try regions then we won't be able to hoist - if (!BasicBlock::sameTryRegion(head, entry)) + if (loop.lpFlags & LPFLG_HAS_PREHEAD) { + JITDUMP(" pre-header already exists\n"); + INDEBUG(loop.lpValidatePreHeader()); return; } + BasicBlock* head = loop.lpHead; + BasicBlock* top = loop.lpTop; + BasicBlock* entry = loop.lpEntry; + // Ensure that lpHead always dominates lpEntry noway_assert(fgDominate(head, entry)); - /* Get hold of the first block of the loop body */ + // If `head` is already a valid pre-header, then mark it so. + if (head->GetUniqueSucc() == entry) + { + // The loop entry must have a single non-loop predecessor, which is the pre-header. + bool loopHasProperEntryBlockPreds = true; + for (BasicBlock* const predBlock : entry->PredBlocks()) + { + if (head == predBlock) + { + continue; + } + const bool intraLoopPred = optLoopContains(lnum, predBlock->bbNatLoopNum); + if (!intraLoopPred) + { + loopHasProperEntryBlockPreds = false; + break; + } + } + if (loopHasProperEntryBlockPreds) + { + // Does this existing region have the same EH region index that we will use when we create the pre-header? + // If not, we want to create a new pre-header with the expected region. + bool headHasCorrectEHRegion = false; + if ((top->bbFlags & BBF_TRY_BEG) != 0) + { + assert(top->hasTryIndex()); + unsigned newTryIndex = ehTrueEnclosingTryIndexIL(top->getTryIndex()); + unsigned compareTryIndex = head->hasTryIndex() ? 
head->getTryIndex() : EHblkDsc::NO_ENCLOSING_INDEX; + headHasCorrectEHRegion = newTryIndex == compareTryIndex; + } + else + { + headHasCorrectEHRegion = BasicBlock::sameTryRegion(head, top); + } - assert(top == entry); + if (headHasCorrectEHRegion) + { + JITDUMP(" converting existing header " FMT_BB " into pre-header\n", head->bbNum); + loop.lpFlags |= LPFLG_HAS_PREHEAD; + assert((head->bbFlags & BBF_LOOP_PREHEADER) == 0); // It isn't already a loop pre-header + head->bbFlags |= BBF_LOOP_PREHEADER; + INDEBUG(loop.lpValidatePreHeader()); + INDEBUG(fgDebugCheckLoopTable()); + return; + } + else + { + JITDUMP(" existing head " FMT_BB " doesn't have correct EH region\n", head->bbNum); + } + } + else + { + JITDUMP(" existing head " FMT_BB " isn't unique non-loop predecessor of loop entry\n", head->bbNum); + } + } + else + { + JITDUMP(" existing head " FMT_BB " doesn't have unique successor branching to loop entry\n", head->bbNum); + } - /* Allocate a new basic block */ + // Allocate a new basic block for the pre-header. - BasicBlock* preHead = bbNewBasicBlock(BBJ_NONE); + const bool isTopEntryLoop = loop.lpIsTopEntry(); + + BasicBlock* preHead = bbNewBasicBlock(isTopEntryLoop ? 
BBJ_NONE : BBJ_ALWAYS); preHead->bbFlags |= BBF_INTERNAL | BBF_LOOP_PREHEADER; + if (!isTopEntryLoop) + { + preHead->bbJumpDest = entry; + } + // Must set IL code offset preHead->bbCodeOffs = top->bbCodeOffs; @@ -7133,25 +7320,42 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) #ifdef DEBUG if (verbose) { - printf("\nCreated PreHeader (" FMT_BB ") for loop " FMT_LP " (" FMT_BB " - " FMT_BB "), with weight = %s\n", - preHead->bbNum, lnum, top->bbNum, pLoopDsc->lpBottom->bbNum, refCntWtd2str(preHead->getBBWeight(this))); + printf("\nCreated PreHeader (" FMT_BB ") for loop " FMT_LP " (" FMT_BB " - " FMT_BB, preHead->bbNum, lnum, + top->bbNum, loop.lpBottom->bbNum); + if (!isTopEntryLoop) + { + printf(", entry " FMT_BB, entry->bbNum); + } + printf("), with weight = %s\n", refCntWtd2str(preHead->getBBWeight(this))); } #endif // The preheader block is part of the containing loop (if any). - preHead->bbNatLoopNum = pLoopDsc->lpParent; + preHead->bbNatLoopNum = loop.lpParent; if (fgIsUsingProfileWeights() && (head->bbJumpKind == BBJ_COND)) { - if ((head->bbWeight == BB_ZERO_WEIGHT) || (head->bbNext->bbWeight == BB_ZERO_WEIGHT)) + if ((head->bbWeight == BB_ZERO_WEIGHT) || (entry->bbWeight == BB_ZERO_WEIGHT)) { preHead->bbWeight = BB_ZERO_WEIGHT; preHead->bbFlags |= BBF_RUN_RARELY; } else { + // Allow for either the fall-through or branch to target 'entry'. 
+ BasicBlock* skipLoopBlock; + if (head->bbNext == entry) + { + skipLoopBlock = head->bbJumpDest; + } + else + { + skipLoopBlock = head->bbNext; + } + assert(skipLoopBlock != entry); + bool allValidProfileWeights = - (head->hasProfileWeight() && head->bbJumpDest->hasProfileWeight() && head->bbNext->hasProfileWeight()); + (head->hasProfileWeight() && skipLoopBlock->hasProfileWeight() && entry->hasProfileWeight()); if (allValidProfileWeights) { @@ -7161,13 +7365,13 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) if (useEdgeWeights) { - flowList* edgeToNext = fgGetPredForBlock(head->bbNext, head); - flowList* edgeToJump = fgGetPredForBlock(head->bbJumpDest, head); - noway_assert(edgeToNext != nullptr); - noway_assert(edgeToJump != nullptr); + const flowList* edgeToEntry = fgGetPredForBlock(entry, head); + const flowList* edgeToSkipLoop = fgGetPredForBlock(skipLoopBlock, head); + noway_assert(edgeToEntry != nullptr); + noway_assert(edgeToSkipLoop != nullptr); - loopEnteredCount = (edgeToNext->edgeWeightMin() + edgeToNext->edgeWeightMax()) / 2.0; - loopSkippedCount = (edgeToJump->edgeWeightMin() + edgeToJump->edgeWeightMax()) / 2.0; + loopEnteredCount = (edgeToEntry->edgeWeightMin() + edgeToEntry->edgeWeightMax()) / 2.0; + loopSkippedCount = (edgeToSkipLoop->edgeWeightMin() + edgeToSkipLoop->edgeWeightMax()) / 2.0; // Watch out for cases where edge weights were not properly maintained // so that it appears no profile flow enters the loop. @@ -7177,8 +7381,8 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) if (!useEdgeWeights) { - loopEnteredCount = head->bbNext->bbWeight; - loopSkippedCount = head->bbJumpDest->bbWeight; + loopEnteredCount = entry->bbWeight; + loopSkippedCount = skipLoopBlock->bbWeight; } weight_t loopTakenRatio = loopEnteredCount / (loopEnteredCount + loopSkippedCount); @@ -7227,41 +7431,95 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) } } - // The handler can't begin at the top of the loop. 
If it did, it would be incorrect - // to set the handler index on the pre header without updating the exception table. - noway_assert(!top->hasHndIndex() || fgFirstBlockOfHandler(top) != top); + // In which EH region should the pre-header live? + // + // The pre-header block is added immediately before `top`. + // + // The `top` block cannot be the first block of a filter or handler: `top` must have a back-edge from a + // BBJ_COND or BBJ_ALWAYS within the loop, and a filter or handler cannot be branched to like that. + // + // The `top` block can be the first block of a `try` region, and you can fall into or branch to the + // first block of a `try` region. (For top-entry loops, `top` will both be the target of a back-edge + // and a fall-through from the previous block.) + // + // If the `top` block is NOT the first block of a `try` region, the pre-header can simply extend the + // `top` block region. + // + // If the `top` block IS the first block of a `try`, we find its parent region and use that. For mutual-protect + // regions, we need to find the actual parent, as the block stores the most "nested" mutual region. For + // non-mutual-protect regions, due to EH canonicalization, we are guaranteed that no other EH regions begin + // on the same block, so looking to just the parent is sufficient. Note that we can't just extend the EH + // region of `top` to the pre-header, because `top` will still be the target of backward branches from + // within the loop. If those backward branches come from outside the `try` (say, only the top half of the loop + // is a `try` region), then we can't branch to a non-first `try` region block (you always must enter the `try` + // in the first block). + // + // Note that hoisting any code out of a try region, for example, to a pre-header block in a different + // EH region, needs to ensure that no exceptions will be thrown.
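The region choice laid out in the comment block above can be condensed into a small decision function (a simplified sketch with hypothetical helpers; the real code also copies the handler region and uses `EHblkDsc::NO_ENCLOSING_INDEX`):

```cpp
#include <cassert>

// Stand-in for EHblkDsc::NO_ENCLOSING_INDEX (hypothetical value).
const unsigned NO_ENCLOSING_INDEX = 0xFFFFu;

// The try region a new pre-header would live in: if `top` begins a try, the
// pre-header must sit in the enclosing try (a branch may only enter a try at
// its first block); otherwise the pre-header simply shares `top`'s region.
unsigned preHeaderTryIndex(bool topBeginsTry, unsigned topTryIndex, unsigned enclosingTryIndex)
{
    return topBeginsTry ? enclosingTryIndex : topTryIndex;
}

// An existing `head` block can be reused as the pre-header only if it already
// lives in the region a new pre-header would get.
bool headHasCorrectEHRegion(unsigned headTryIndex, bool topBeginsTry, unsigned topTryIndex, unsigned enclosingTryIndex)
{
    return headTryIndex == preHeaderTryIndex(topBeginsTry, topTryIndex, enclosingTryIndex);
}
```

The same predicate underlies both paths in the patch: deciding whether the existing `head` qualifies as a pre-header, and picking the region for a freshly allocated one.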
+ + assert(!fgIsFirstBlockOfFilterOrHandler(top)); + + if ((top->bbFlags & BBF_TRY_BEG) != 0) + { + // `top` is the beginning of a try block. Figure out the EH region to use. + assert(top->hasTryIndex()); + unsigned short newTryIndex = (unsigned short)ehTrueEnclosingTryIndexIL(top->getTryIndex()); + if (newTryIndex == EHblkDsc::NO_ENCLOSING_INDEX) + { + // No EH try index. + preHead->clearTryIndex(); + } + else + { + preHead->setTryIndex(newTryIndex); + } - // Update the EH table to make the hoisted block part of the loop's EH block. - fgExtendEHRegionBefore(top); + // What handler region to use? Use the same handler region as `top`. + preHead->copyHndIndex(top); + } + else + { + // `top` is not the beginning of a try block. Just extend the EH region to the pre-header. + // We don't need to call `fgExtendEHRegionBefore()` because all the special handling that function + // does is to account for `top` being the first block of a `try` or handler region, which we know + // is not true. + + preHead->copyEHRegion(top); + } // TODO-CQ: set dominators for this block, to allow loop optimizations requiring them // (e.g: hoisting expression in a loop with the same 'head' as this one) - /* Update the loop entry */ + // Update the loop table - pLoopDsc->lpHead = preHead; - pLoopDsc->lpFlags |= LPFLG_HAS_PREHEAD; + loop.lpHead = preHead; + loop.lpFlags |= LPFLG_HAS_PREHEAD; - /* The new block becomes the 'head' of the loop - update bbRefs and bbPreds - All predecessors of 'beg', (which is the entry in the loop) - now have to jump to 'preHead', unless they are dominated by 'head' */ + // The new block becomes the 'head' of the loop - update bbRefs and bbPreds. + // All non-loop predecessors of 'entry' now jump to 'preHead'.
- preHead->bbRefs = 0;
- flowList* const edgeToPreHeader = fgAddRefPred(preHead, head);
- edgeToPreHeader->setEdgeWeights(preHead->bbWeight, preHead->bbWeight, preHead);
+ preHead->bbRefs = 0;
 bool checkNestedLoops = false;
- for (BasicBlock* const predBlock : top->PredBlocks())
+ for (BasicBlock* const predBlock : entry->PredBlocks())
 {
- if (fgDominate(top, predBlock))
- {
- // note: if 'top' dominates predBlock, 'head' dominates predBlock too
- // (we know that 'head' dominates 'top'), but using 'top' instead of
- // 'head' in the test allows us to not enter here if 'predBlock == head'
+ // Is the predBlock in the loop?
+ //
+ // We want to use:
+ // const bool intraLoopPred = loop.lpContains(predBlock);
+ // but we can't depend on the bbNum ordering.
+ //
+ // Previously, this code wouldn't redirect predecessors dominated by the entry. However, that can
+ // lead to a case where a non-loop predecessor is dominated by the loop entry, and that predecessor
+ // continues to branch to the entry, not the new pre-header. This is normally ok for hoisting
+ // because it will introduce an SSA PHI def within the loop, which will inhibit hoisting. However,
+ // it complicates the definition of what a pre-header is.
- if (predBlock != pLoopDsc->lpBottom)
+ const bool intraLoopPred = optLoopContains(lnum, predBlock->bbNatLoopNum);
+ if (intraLoopPred)
+ {
+ if (predBlock != loop.lpBottom)
 {
- noway_assert(predBlock != head);
 checkNestedLoops = true;
 }
 continue;
@@ -7270,33 +7528,33 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum)
 switch (predBlock->bbJumpKind)
 {
 case BBJ_NONE:
+ // This 'entry' predecessor is not in the loop, so it must be falling through to 'entry',
+ // and we must have a top-entry loop.
+ noway_assert((entry == top) && (predBlock == head) && (predBlock->bbNext == preHead)); + fgRemoveRefPred(entry, predBlock); + fgAddRefPred(preHead, predBlock); break; case BBJ_COND: - if (predBlock == head) + if (predBlock->bbJumpDest == entry) { - noway_assert(predBlock->bbJumpDest != top); - break; + predBlock->bbJumpDest = preHead; + noway_assert(predBlock->bbNext != preHead); } - FALLTHROUGH; + else + { + noway_assert((entry == top) && (predBlock == head) && (predBlock->bbNext == preHead)); + } + fgRemoveRefPred(entry, predBlock); + fgAddRefPred(preHead, predBlock); + break; case BBJ_ALWAYS: case BBJ_EHCATCHRET: - noway_assert(predBlock->bbJumpDest == top); + noway_assert(predBlock->bbJumpDest == entry); predBlock->bbJumpDest = preHead; - - if (predBlock == head) - { - // This is essentially the same case of predBlock being a BBJ_NONE. We may not be - // able to make this a BBJ_NONE if it's an internal block (for example, a leave). - // Just break, pred will be removed after switch. 
- } - else - { - fgRemoveRefPred(top, predBlock); - fgAddRefPred(preHead, predBlock); - } + fgRemoveRefPred(entry, predBlock); + fgAddRefPred(preHead, predBlock); break; case BBJ_SWITCH: @@ -7308,11 +7566,11 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) do { assert(*jumpTab); - if ((*jumpTab) == top) + if ((*jumpTab) == entry) { (*jumpTab) = preHead; - fgRemoveRefPred(top, predBlock); + fgRemoveRefPred(entry, predBlock); fgAddRefPred(preHead, predBlock); } } while (++jumpTab, --jumpCnt); @@ -7324,13 +7582,16 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) } } - noway_assert(!fgGetPredForBlock(top, preHead)); - fgRemoveRefPred(top, head); - flowList* const edgeFromPreHeader = fgAddRefPred(top, preHead); - edgeFromPreHeader->setEdgeWeights(preHead->bbWeight, preHead->bbWeight, top); + flowList* const edgeToPreHeader = fgGetPredForBlock(preHead, head); + noway_assert(edgeToPreHeader != nullptr); + edgeToPreHeader->setEdgeWeights(preHead->bbWeight, preHead->bbWeight, preHead); + + noway_assert(fgGetPredForBlock(entry, preHead) == nullptr); + flowList* const edgeFromPreHeader = fgAddRefPred(entry, preHead); + edgeFromPreHeader->setEdgeWeights(preHead->bbWeight, preHead->bbWeight, entry); /* - If we found at least one back-edge in the flowgraph pointing to the top/entry of the loop + If we found at least one back-edge in the flowgraph pointing to the entry of the loop (other than the back-edge of the loop we are considering) then we likely have nested do-while loops with the same entry block and inserting the preheader block changes the head of all the nested loops. 
Now we will update this piece of information in the loop table, and @@ -7343,8 +7604,12 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) { if (optLoopTable[l].lpHead == head) { - noway_assert(l != lnum); // pLoopDsc->lpHead was already changed from 'head' to 'preHead' + // loop.lpHead was already changed from 'head' to 'preHead' + noway_assert(l != lnum); + + // If it shares head, it must be a top-entry loop that shares top. noway_assert(optLoopTable[l].lpEntry == top); + optUpdateLoopHead(l, optLoopTable[l].lpHead, preHead); optLoopTable[l].lpFlags |= LPFLG_HAS_PREHEAD; #ifdef DEBUG @@ -7358,7 +7623,14 @@ void Compiler::fgCreateLoopPreHeader(unsigned lnum) } } + // We added a new block and altered the preds list; make sure the flow graph has been marked as being modified. + assert(fgModified); + #ifdef DEBUG + fgDebugCheckBBlist(); + fgVerifyHandlerTab(); + fgDebugCheckLoopTable(); + if (verbose) { JITDUMP("*************** After fgCreateLoopPreHeader for " FMT_LP "\n", lnum); @@ -7393,7 +7665,7 @@ void Compiler::optComputeLoopSideEffects() { VarSetOps::AssignNoCopy(this, optLoopTable[lnum].lpVarInOut, VarSetOps::MakeEmpty(this)); VarSetOps::AssignNoCopy(this, optLoopTable[lnum].lpVarUseDef, VarSetOps::MakeEmpty(this)); - optLoopTable[lnum].lpContainsCall = false; + optLoopTable[lnum].lpFlags &= ~LPFLG_CONTAINS_CALL; } for (lnum = 0; lnum < optLoopCount; lnum++) @@ -7506,10 +7778,10 @@ bool Compiler::optComputeLoopSideEffectsOfBlock(BasicBlock* blk) AddContainsCallAllContainingLoops(mostNestedLoop); } - // If we just set lpContainsCall or it was previously set - if (optLoopTable[mostNestedLoop].lpContainsCall) + // If we just set LPFLG_CONTAINS_CALL or it was previously set + if (optLoopTable[mostNestedLoop].lpFlags & LPFLG_CONTAINS_CALL) { - // We can early exit after both memoryHavoc and lpContainsCall are both set to true. + // We can early exit after both memoryHavoc and LPFLG_CONTAINS_CALL are both set to true. 
break; } @@ -7774,8 +8046,8 @@ void Compiler::AddContainsCallAllContainingLoops(unsigned lnum) assert(0 <= lnum && lnum < optLoopCount); while (lnum != BasicBlock::NOT_IN_LOOP) { - optLoopTable[lnum].lpContainsCall = true; - lnum = optLoopTable[lnum].lpParent; + optLoopTable[lnum].lpFlags |= LPFLG_CONTAINS_CALL; + lnum = optLoopTable[lnum].lpParent; } } diff --git a/src/coreclr/jit/ssabuilder.cpp b/src/coreclr/jit/ssabuilder.cpp index b9c506e29400a..4646769263d37 100644 --- a/src/coreclr/jit/ssabuilder.cpp +++ b/src/coreclr/jit/ssabuilder.cpp @@ -1515,7 +1515,7 @@ void SsaBuilder::Build() } else { - postOrder = (BasicBlock**)alloca(blockCount * sizeof(BasicBlock*)); + postOrder = (BasicBlock**)_alloca(blockCount * sizeof(BasicBlock*)); } m_visitedTraits = BitVecTraits(blockCount, m_pCompiler);
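[Editor's note] The `AddContainsCallAllContainingLoops` change in the hunk above (replacing the `lpContainsCall` bool with the `LPFLG_CONTAINS_CALL` flag) propagates the flag up the `lpParent` chain. A miniature standalone model of that walk, with illustrative types and flag values that are not the JIT's real ones, looks like this:

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature of the AddContainsCallAllContainingLoops shape shown
// in the diff: mark a loop and every enclosing loop as containing a call by
// walking the lpParent chain until we leave the loop nest.
constexpr unsigned NOT_IN_LOOP         = ~0u;
constexpr unsigned LPFLG_CONTAINS_CALL = 0x1000; // illustrative bit value

struct LoopDsc
{
    unsigned lpParent; // index of the enclosing loop, or NOT_IN_LOOP
    unsigned lpFlags;  // bit flags, e.g. LPFLG_CONTAINS_CALL
};

void addContainsCallAllContainingLoops(std::vector<LoopDsc>& loopTable, unsigned lnum)
{
    while (lnum != NOT_IN_LOOP)
    {
        loopTable[lnum].lpFlags |= LPFLG_CONTAINS_CALL;
        lnum = loopTable[lnum].lpParent;
    }
}
```

Marking the innermost loop of a nest flags every ancestor loop as well, while unrelated loops are untouched; this is why `optComputeLoopSideEffectsOfBlock` can stop scanning once both `memoryHavoc` and `LPFLG_CONTAINS_CALL` are known for the most-nested loop.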