Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize loop pre-header creation and loop hoisting #62560

Merged

Conversation

BruceForstall
Copy link
Member

A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:

  • it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
  • it isn't created if the current loop head and the loop entry block are in
    different EH try regions

Additionally, partially due those restrictions, loop hoisting has restrictions:

  • it requires a "do-while" (top entry) loop
  • it requires the existing head block to dominate the loop entry block
  • it requires the existing head block to be in the same EH region as the entry block
  • it won't hoist if the entry block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:

  • optSetBlockWeights is elevated to a "phase" that runs prior to loop recognition.
  • LoopFlags is simplified:
    • LPFLG_DO_WHILE is removed; call lpIsTopEntry instead
    • LPFLG_ONE_EXIT is removed; check lpExitCnt == 1 instead
    • LPFLG_HOISTABLE is removed (there are no restrictions anymore)
    • LPFLG_CONST is removed: check lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) instead (only used in one place
    • bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
  • Added a lpInitBlock field to the loop table. For constant and variable
    initialization loops, code assumed that these expressions existed in the
    head block. This isn't true anymore if we insert a pre-header block.
    So, capture the block where these actually exist when we determine that
    they do exist, and explicitly use this block pointer where needed.
  • Added fgComputeReturnBlocks() to extract this code out of fgComputeReachability into a function
  • Added optFindAndScaleGeneralLoopBlocks() to extract this out of loop recognition to its own function.
  • Added optResetLoopInfo() to reset the loop table and block annotations related to loops
  • Added fgDebugCheckBBNumIncreasing() to allow asserting that the bbNum
    order of blocks is increasing. This should be used in phases that depend
    on this order to do bbNum comparisons.
  • Add a lot more loop table validation in fgDebugCheckLoopTable()

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 9, 2021
@ghost
Copy link

ghost commented Dec 9, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:

  • it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
  • it isn't created if the current loop head and the loop entry block are in
    different EH try regions

Additionally, partially due those restrictions, loop hoisting has restrictions:

  • it requires a "do-while" (top entry) loop
  • it requires the existing head block to dominate the loop entry block
  • it requires the existing head block to be in the same EH region as the entry block
  • it won't hoist if the entry block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:

  • optSetBlockWeights is elevated to a "phase" that runs prior to loop recognition.
  • LoopFlags is simplified:
    • LPFLG_DO_WHILE is removed; call lpIsTopEntry instead
    • LPFLG_ONE_EXIT is removed; check lpExitCnt == 1 instead
    • LPFLG_HOISTABLE is removed (there are no restrictions anymore)
    • LPFLG_CONST is removed: check lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) instead (only used in one place
    • bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
  • Added a lpInitBlock field to the loop table. For constant and variable
    initialization loops, code assumed that these expressions existed in the
    head block. This isn't true anymore if we insert a pre-header block.
    So, capture the block where these actually exist when we determine that
    they do exist, and explicitly use this block pointer where needed.
  • Added fgComputeReturnBlocks() to extract this code out of fgComputeReachability into a function
  • Added optFindAndScaleGeneralLoopBlocks() to extract this out of loop recognition to its own function.
  • Added optResetLoopInfo() to reset the loop table and block annotations related to loops
  • Added fgDebugCheckBBNumIncreasing() to allow asserting that the bbNum
    order of blocks is increasing. This should be used in phases that depend
    on this order to do bbNum comparisons.
  • Add a lot more loop table validation in fgDebugCheckLoopTable()
Author: BruceForstall
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:
- it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
- it isn't created if the current loop head and the loop entry block are in
different EH try regions

Additionally, partially due those restrictions, loop hoisting has restrictions:
- it requires a "do-while" (top entry) loop
- it requires the existing `head` block to dominate the loop entry block
- it requires the existing `head` block to be in the same EH region as the entry block
- it won't hoist if the `entry` block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:
- `optSetBlockWeights` is elevated to a "phase" that runs prior to loop recognition.
- LoopFlags is simplified:
	- LPFLG_DO_WHILE is removed; call `lpIsTopEntry` instead
	- LPFLG_ONE_EXIT is removed; check `lpExitCnt == 1` instead
	- LPFLG_HOISTABLE is removed (there are no restrictions anymore)
	- LPFLG_CONST is removed: check `lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)` instead (only used in one place
	- bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
- Added a `lpInitBlock` field to the loop table. For constant and variable
initialization loops, code assumed that these expressions existed in the
`head` block. This isn't true anymore if we insert a pre-header block.
So, capture the block where these actually exist when we determine that
they do exist, and explicitly use this block pointer where needed.
- Added `fgComputeReturnBlocks()` to extract this code out of `fgComputeReachability` into a function
- Added `optFindAndScaleGeneralLoopBlocks()` to extract this out of loop recognition to its own function.
- Added `optResetLoopInfo()` to reset the loop table and block annotations related to loops
- Added `fgDebugCheckBBNumIncreasing()` to allow asserting that the bbNum
order of blocks is increasing. This should be used in phases that depend
on this order to do bbNum comparisons.
- Add a lot more loop table validation in `fgDebugCheckLoopTable()`
@BruceForstall BruceForstall force-pushed the GeneralizeCreatePreHeaderSquashMain branch from 5488052 to 622e701 Compare December 9, 2021 02:06
@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@dotnet dotnet deleted a comment from azure-pipelines bot Dec 9, 2021
@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@BruceForstall BruceForstall marked this pull request as ready for review December 9, 2021 02:07
@BruceForstall
Copy link
Member Author

@AndyAyersMS @dotnet/jit-contrib PTAL

@BruceForstall
Copy link
Member Author

I'll report asmdiffs (and re-kick-off the asmdiffs job) after the spmi collection gets rebuilt; currently, they won't be too useful.

@BruceForstall
Copy link
Member Author

BruceForstall commented Dec 9, 2021

A few fixes to make:

  • A couple jitstress asserts:
    Assertion failed 'loopEntryHasHeadPred' in 'Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberMethodSymbol:CheckModifiers(Microsoft.CodeAnalysis.Location,Microsoft.CodeAnalysis.DiagnosticBag):this' during 'Hoist loop code' (IL size 912)
  • a GCC compilation failure.

1. Change `dspSuccs()` to not call code that will call `GetDescriptorForSwitch()`
2. Change `GetDescriptorForSwitch()` to use the correct max block number while
inlining. We probably don't or shouldn't call GetDescriptorForSwitch while inlining,
especially after (1), but this change doesn't hurt.
src/coreclr/jit/fgdiagnostic.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/loopcloning.cpp Show resolved Hide resolved
// 4. The loop iteration variable can't be address exposed.
// 5. The loop iteration variable can't be a promoted struct field.
// 6. We must be able to calculate the total constant iteration count.
// 7. On x86, there is a limit to the number of return blocks. So if there are return blocks in the loop that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully by now we've moved any such blocks out of the loop range. Can you look into whether this is still a true constraint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, we have moved all BBJ_RETURN out of the loop body. I can't see any case where we haven't, but I don't know how to prove that currently. I could follow-up by adding an assert to the new loop checker that there are no BBJ_RETURN blocks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it looks like there are cases where we don't move BBJ_RETURN out of the loop body. I don't know the precise conditions. SPMI found coreclr test Program:M49() and System.Security.Cryptography.Pkcs.Rfc3161TimestampToken:TryGetCertIds() (which has lots of nested loops + return + EH).

This test creates a case that fails:

        for (int i = 0; i < Vector<int>.Count; i++)
        {
            a += 1;
            if (a > 12000) return a;
            try {
                a += 2;
            } finally {
                a += 3;
            }
        }

The block table is:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..006)-> BB06 ( cond )                     i
BB02 [0001]  2       BB01,BB05             0.50    [006..012)-> BB04 ( cond )                     i bwd bwd-target
BB03 [0002]  1       BB02                  0.50    [012..014)        (return)                     i
BB04 [0004]  1  0    BB02                  0.50    [014..01B)                 T0      try { }     keep i try bwd
BB05 [0011]  1       BB04                  0.50    [01B..02C)-> BB02 ( cond )                     keep i bwd cfb cfe
BB06 [0008]  2       BB01,BB05             0.50    [02C..02E)        (return)                     i
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB07 [0005]  1     0                       0       [01B..020)        (finret)    H0 F fault { }   keep i rare flet bwd
-----------------------------------------------------------------------------------------------------------------------------------------

and the loop is from BB02..BB05 with an exit at BB03.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, presuming we ever would unroll with an embedded return, we check to avoid doing so on x86 if it would exceed the max allowed epilog limit, but not elsewhere. Elsewhere, we still limit return blocks to 4 in fgAddInternal, but it's not a hard limit, so presumably we could exceed it in unrolling.

INDEBUG(++unrollFailures);
JITDUMP("Failed to unroll loop " FMT_LP ": block cloning failed on " FMT_BB "\n", lnum,
block->bbNum);
goto DONE_LOOP;
}

// All blocks in the unrolled loop and all children loops will now be marked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

children here confuses me -- are we unrolling a loop that has children...?

Seems like maybe we unroll a loop that has a cloned loop inside where we've unrolled the clone fast path? -- so both the unrolled fast path and the (loop but not marked as loop slow path) get reparented? Still, seems like a lot of code expansion, especially if we create N copies of the slow path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will unroll a loop that has nested child loops. If there are child loops, I later will force a re-generation of the loop table, so this code to apply new loop number to the unrolled loop block will only be permanent if there are no child loops. The idea here is that if there are no child loops, it is "easy" to set the correct loop number, and there is no need to pay the expense of rebuilding the loop table.

Now, it turns out that we (currently) very rarely unroll loops, but one case where we always unroll is a loop statically determined to execute exactly once. That still goes through all the mechanism, and there might be nested loops within that.

I could expand on the comment here if you thing that would help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more expansive comment would be helpful.

Is it just the "single iteration" case where we might have nested loops?

I guess unrolling there makes sense. Seems like for that case -- if we're not already doing so -- we should change the loop exit test to always exit, instead of cloning and then deleting the original loop.

I suppose if unrolling were smart enough to make N-1 copies this would just fall out, and we would not need to zap the originals either -- just repair the loop exit branch as above.

Again, maybe this already happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have nested loops in any case, not just the single iteration case. Consider:

public static int test_01()
{
    int a = 0;
    for (int i = 0; i < 2; i++)
    {
        a += 9;
        for (int j = 0; j < 10000; j++)
        {
            a += 1;
        }
    }
    return a;
}

With our current heuristics, we won't unroll this. With COMPlus_JitStressModeNames=STRESS_UNROLL_LOOPS we will unroll the outer loop but not the inner. Replace 2 by 0 (zero) and we'll remove the outer loop but also need to remove the inner.

It turns out you can "game" our heuristics as well, e.g.:

public static int test_02()
{
    int a = 0;
    for (int i = 0; i < Vector<int>.Count; i++)
    {
        a += 9;
        for (int j = 0; j < 10000; j++)
        {
            a += 1;
        }
    }
    return a;
}

We will always unroll a constant loop with Vector<T>.Count constant limit, here 8, but we won't unroll the inner loop.

The current code always creates N copies of the original code and then eliminates the original loop. It does seem like it could make N-1 copies instead.

src/coreclr/jit/optimizer.cpp Show resolved Hide resolved
src/coreclr/jit/optimizer.cpp Show resolved Hide resolved

// If the block we're hoisting from and the pre-header are in different EH regions, don't hoist.
// TODO: we could probably hoist things that won't raise exceptions, such as constants.
if (!BasicBlock::sameTryRegion(optLoopTable[lnum].lpHead, treeBb))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it the preheader should always at least be in the same try region as the loop top/entry? So this is only filtering out hoists from a try that's within a loop?

src/coreclr/jit/optimizer.cpp Show resolved Hide resolved
//
// Currently, if you create a pre-header but don't put any code in it, any subsequent fgUpdateFlowGraph()
// pass might choose to compact the empty pre-header with a predecessor block. That is, a pre-header
// block might disappear if not used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should reconsider this compaction -- those edges tend to be critical edges and we're probably better off having blocks there, at least for most of the phases.

Eg it might help prevent LSRA from putting its resolution move blocks in strange places.

We'd probably need a late flow opts phase that then does the cond->(empty) uncond opt.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be hard to prevent compaction or cond->(empty)uncond opt. when a pre-header is involved, but as you point out, we might need a late flow opts. phase that we currently don't have to clean these up. So put it on the list for future work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We should consider deferring these simple flow graph updates until some (new) late phase.

There was an assertion when considering an existing `head` block as
a potential pre-header, that the `entry` block have the `head`
block as a predecessor. However, we early exit if we find a non-head,
non-loop edge. This could happen before we encounter the `head` block,
making the assert incorrect. We don't want to run the entire loop just
for the purpose of the assert (at least not here), so just remove the
assert.
Change:
```
compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax
```
to:
```
impInlineRoot()->fgBBNumMax
```
@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

1. Added loop epoch concept. Currently set but never checked.
2. Added disabled assert about return blocks always being moved
out of line of loop body.
3. Fixed bug checking `totalIter` after it was decremented to zero
as a loop variable
4. Added more comments on incremental loop block `bbNatLoopNum`
setting.
@BruceForstall
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

When considering converting an existing loop `head` block to a pre-header,
verify that the block has the same EH try region that we would create
for a new pre-header. If not, we go ahead and create the new pre-header.
@BruceForstall
Copy link
Member Author

@AndyAyersMS Is there any unaddressed question or concern left here?

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No other concerns.

@BruceForstall BruceForstall merged commit 955a681 into dotnet:main Dec 14, 2021
@BruceForstall BruceForstall deleted the GeneralizeCreatePreHeaderSquashMain branch December 14, 2021 07:03
@ghost ghost locked as resolved and limited conversation to collaborators Jan 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants