-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve loop cloning, with debugging improvements #55299
Improve loop cloning, with debugging improvements #55299
Conversation
@AndyAyersMS @kunalspathak @dotnet/jit-contrib PTAL |
/azp run runtime-coreclr jitstress |
Azure Pipelines successfully started running 1 pipeline(s). |
Allows inner loop of 3-nested loops (e.g., Array2 benchmark) to be cloned.
to avoid unnecessary bounds checks. Revert max cloning condition blocks to 3; allowing more doesn't seem to improve performance (probably too many conditions before a not-sufficiently-executed loop, at least for the Array2 benchmark)
1. "#if 0" the guts of the CSE dataflow; that's not useful to most people. 2. Add readable CSE number output to the CSE dataflow set output 3. Add FMT_CSE to commonize CSE number output. 4. Add PHASE_OPTIMIZE_VALNUM_CSES to the pre-phase output "allow list" and stop doing its own blocks/trees output. 5. Remove unused optCSECandidateTotal 6. Add functions `getCSEAvailBit` and `getCSEAvailCrossCallBit` to avoid hand-coding these bit calculations in multiple places, for the CSE dataflow set bits.
When generating loop cloning conditions, mark array index expressions as non-faulting, as we have already null- and range-checked the array before generating an index expression. I also added similary code to mark array length expressions as non-faulting, for the same reason. However, that leads to CQ losses because of downstream CSE effects.
This outputs the alignment boundaries without requiring outputting the actual addresses. It makes it easier to diff changes.
Create function for printing bound emitter labels. Also, add debug code to associate a BasicBlock with an insGroup, and output the block number and ID with the emitter label in JitDump, so it's easier to find where a group of generated instructions came from.
For instructions or instruction sequences which match the Intel jcc erratum criteria, note that in the alignment boundary dump. Also, a few fixes: 1. Move the alignment boundary dumping from `emitIssue1Instr` to `emitEndCodeGen` to avoid the possibility of reading the next instruction in a group when there is no next instruction. 2. Create `IsJccInstruction` and `IsJmpInstruction` functions for use by the jcc criteria detection, and fix that detection to fix a few omissions/errors. 3. Change the jcc criteria detection to be hard-coded to 32 byte boundaries instead of assuming `compJitAlignLoopBoundary` is 32. An example: ``` cmp r11d, dword ptr [rax+8] ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ............................... jae G_M42486_IG103 ``` In this case, the `cmp` doesn't cross the boundary, it is adjacent (the zero indicates the number of bytes of the instruction which cross the boundary), followed by the `jae` which starts after the boundary. Indicating the jcc erratum criteria can help point out potential performance issues due to unlucky alignment of these instructions in asm diffs.
XArch sometimes prepends a "v" to the instructions names from the instruction table. Add a function `genInsDisplayName` to create the full instruction name that should be displayed, and use that in most places an instruction name will be displayed, such as in the alignment messages, and normal disassembly. Use this instead of the raw `genInsName`. This could be extended to handle arm32 appending an "s", but I didn't want to touch arm32 with this change.
25ab640
to
35de73f
Compare
Rebased to pick up formatting fix. All test failures were standard infra noise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good...Added some minor feedback.
if (id->idCodeSize() == 0) | ||
{ | ||
// We're not going to generate any instruction, so it doesn't count for PerfScore. | ||
result.insThroughput = PERFSCORE_THROUGHPUT_ZERO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be PERFSCORE_THROUGHPUT_2X * id->idCodeSize()
? PERFSCORE_THROUGHPUT_2X
because NOP
are cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also count the NOP compensation
we add as part of alignment?
runtime/src/coreclr/jit/emitxarch.cpp
Lines 14355 to 14364 in ede3733
if (emitComp->opts.disAsm) | |
{ | |
emitDispInsAddr(dst); | |
printf("\t\t ;; NOP compensation instructions of %d bytes.\n", diff); | |
} | |
#endif | |
BYTE* dstRW = dst + writeableOffset; | |
dstRW = emitOutputNOP(dstRW, diff); | |
dst = dstRW - writeableOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both of these ideas make sense as follow-up work, although maybe use PERFSCORE_THROUGHPUT_4X * id->idCodeSize()
to make them even cheaper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok as a follow-up work, not sure which one PERFSCORE_THROUGHPUT_2X
, PERFSCORE_THROUGHPUT_4X
, etc. to pick. @tannergooding - any thoughts?
// one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, | ||
// direct/indirect call, and return. | ||
|
||
size_t jccAlignBoundary = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, JitAlignLoopForJcc
alignment is done when we align the loop using non-adaptive approach. At that time, to determine how much padding is needed to align JCC, we base our calculation on the compJitAlignLoopBoundary
boundary.
runtime/src/coreclr/jit/emit.cpp
Lines 5144 to 5149 in fdbca22
// Mitigate JCC erratum by making sure the jmp doesn't fall on the boundary | |
if (emitComp->opts.compJitAlignLoopForJcc) | |
{ | |
// TODO: See if extra padding we might end up adding to mitigate JCC erratum is worth doing? | |
currentOffset++; | |
} |
As such, either change this to compJitAlignLoopBoundary
or add a note that once we start aligning JCC during adaptive loop alignment, make sure that JCC is always aligned using 32-byte boundary because that's what the reference manual says.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple questions about this comment.
- We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?
- I don't see how this code for
compJitAlignLoopForJcc
works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementingcurrentOffset
affect whether the jcc erratum condition is avoided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?
In general, if we think that it degrades performance (as you observed), then yes, we should at least do it for instructions - Here is the graph from the findings made in #35730:
Axis:
- X: Ratio of after/before. Stated another way, the ratio is (withmcu/withoutmcu). Ratios less than 1 mean the benchmark performed better with the JCC microcode update applied. Ratios greater than 1 mean the benchmark performed worse with the JCC microcode update applied.
- Y: Count of benchmarks in the bucket.
In general, I see more Microbenchmarks degraded (be it by small amount) than improved.
My proposal would be to at least do JCC erratum for jumps that participate in the loop (like backedge).
I don't see how this code for compJitAlignLoopForJcc works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementing currentOffset affect whether the jcc erratum condition is avoided?
Yes, it was intentionally made for DEBUG only because it didn't fully solve the JCC erratum nor did it give big benefits. The way it works is - for non-adaptive alignment, if the loop already starts from an offset such that it will still fit in minimum no. of blocks required, then we skip aligning it. Before determining this, if JitAlignLoopForJcc=1
, we just increase the offset from which the loop starts, so that the last backedge is pushed further down. The assumption is, if there was JCC erratum, we would hope that our condition if (currentOffset > extraBytesNotInLoop)
would do the right thing and not align the last backedge at the boundary. The more I think now - it won't work properly in most of the cases, and we need some more tracking to make sure that the backedges don't fall on the boundary. On the contrary, it could also happen that previously the backedge was not on boundary and after aligning the loop, it falls on the boundary, worsening the performance, and that too needs to be handle correctly.
{ | ||
printf("\n"); | ||
} | ||
if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins)) | |
#ifdef TARGET_XARCH | |
const int TEMP_BUFFER_LEN = 40; | |
static unsigned curBuf = 0; | |
static char buf[4][TEMP_BUFFER_LEN]; | |
const char* retbuf; | |
if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins)) | |
{ | |
sprintf_s(buf[curBuf], TEMP_BUFFER_LEN, "v%s", insName); | |
retbuf = buf[curBuf]; | |
curBuf = (curBuf + 1) % 4; | |
return retbuf; | |
} | |
#endif | |
return insName; | |
We don't need the else
and #else
.
@@ -183,4 +190,48 @@ The .NET Foundation licenses this file to you under the MIT license. | |||
</Expand> | |||
</Type> | |||
|
|||
<Type Name="jitstd::vector<*>"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see that this file is getting longer :)
@@ -1620,6 +1620,9 @@ void Compiler::fgCompactBlocks(BasicBlock* block, BasicBlock* bNext) | |||
} | |||
} | |||
bNext->bbPreds = nullptr; | |||
|
|||
// `block` can no longer be a loop pre-header (if it was before). | |||
block->bbFlags &= ~BBF_LOOP_PREHEADER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we maintain this invariant? Do we have asserts at other places to make sure that a preheader loop block doesn't have more than one incoming edge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bit is not widely used (just assertion prop and fgDominate for "new" blocks). Adding such an assert might make sense. I just noticed this because it didn't make sense in a JitDump I was looking at for a block to be marked as a pre-header. fwiw, this change alone causes no asm diffs.
@@ -38,7 +38,6 @@ void Compiler::optInit() | |||
optNativeCallCount = 0; | |||
optAssertionCount = 0; | |||
optAssertionDep = nullptr; | |||
optCSECandidateTotal = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding it in JitTimeLogCsv?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not. This gets back to the idea for generally improving JIT stats/metrics: #52877
src/coreclr/jit/gentree.h
Outdated
@@ -494,6 +494,7 @@ enum GenTreeFlags : unsigned int | |||
|
|||
GTF_INX_RNGCHK = 0x80000000, // GT_INDEX/GT_INDEX_ADDR -- the array reference should be range-checked. | |||
GTF_INX_STRING_LAYOUT = 0x40000000, // GT_INDEX -- this uses the special string array layout | |||
GTF_INX_NONFAULTING = 0x20000000, // GT_INDEX -- the INDEX does not throw an exception (morph to GTF_IND_NONFAULTING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name can be easily confused with GTF_IND_NONFAULTING
...In fact, even I got confused while reviewing. Can you please pick a different name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually exactly the same as GTF_IND_NONFAULTING
except for use on GT_INDEX nodes, not GT_IND nodes. Maybe GTF_INX_NOFAULT
just to be different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good.
// REVIEW: due to the definition of `condBlocks`, above, the effective max is 3 blocks, meaning | ||
// `maxRank` of 1. Question: should the heuristic allow more blocks to be created in some situations? | ||
// REVIEW: make this based on a COMPlus configuration? | ||
if (condBlocks > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier, the comment mentioned as 3 blocks
but we were allowing 4 blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value being compared was 2n + 1
, so was always odd. Comparing against 3 and 4 are equivalent. I started this work trying to change it to 5, to allow more cloning (especially for the Array2 benchmark), but saw perf regressions. Now that I have fixed some issues, and have more experience looking into those kind of regressions, especially the jcc erratum and alignment related regressions, I want to go back and try setting this to 5 again, and see if the regressions still exist and can be mitigated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me.
It would be ideal to separate out the diff causing changes from the dumping/refactoring changes and have a no-diff PR for the former latter. I realize one often works on both in tandem and pulling them apart later can be painful. So no need to do that here. When things are intermixed like this it makes it harder to be confident that all the changes that lead to diffs were looked at carefully.
Yeah, I thought about that. In fact, I'd prefer to submit PRs for each separate debugging change (CSE, alignment, etc.). But it's so painful and time consuming nowadays to get clean CI runs (and code reviews), that I "gave up" and decided not to. |
1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared to existing GTF_IND_NONFAULTING. 2. Minor cleanup in getInsDisplayName.
#ifdef DEBUG | ||
ig->lastGeneratedBlock = nullptr; | ||
// Explicitly call the constructor, since IGs don't actually have a constructor. | ||
ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCC leg was failing when PR got merged. it looks to be expecting type rather than ctor (function):
ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt)); | |
ig->igBlocks.jitstd::list<BasicBlock*>(emitComp->getAllocator(CMK_LoopOpt)); |
When loop cloning was creating cloning conditions, it was creating unnecessary bounds checks in some multi-dimensional array index cases. When creating a set of cloning conditions, first a null check is done, then an array length check is done, etc. Thus, the array length expression itself won't fault because we've already done a null check. And a subsequent array index expression won't fault (or need a bounds check) because we've already checked the array length (i.e., we've done a manual bounds check). So, stop creating the unnecessary bounds checks, and mark the appropriate instructions as non-faulting by clearing the GTF_EXCEPT bit.
Note that I did not turn on the code to clear GTF_EXCEPT for array length checks because it leads to negative downstream effects in CSE. Namely, there end up being array length expressions that are identical except for the exception bit. When CSE sees this, it gives up on creating a CSE, which leads to regressions in some cases where we don't CSE the array length expression.
Also, for multi-dimension jagged arrays, when optimizing the fast path, we were not removing as many bounds checks as we could. In particular, we weren't removing outer bounds checks, only inner ones. Add code to handle all the bounds checks.
There are some runtime improvements (measured via BenchmarkDotNet on the JIT microbenchmarks), but also some regressions, due, as far as I can tell, to the Intel jcc erratum performance impact. In particular, benchmark ludcmp shows up to a 9% regression due to a
jae
instruction in the hot loop now crossing a 32-byte boundary due to code changes earlier in the function affecting instruction alignment. The hot loop itself is exactly the same (module register allocation differences). As there is nothing that can be done (without mitigating the jcc erratum) -- it's "bad luck".In addition to those functional changes, there are a number of debugging-related improvements:
LcOptInfo
, (c) convert theLoopCloneContext
raw arrays tojitstd::vector
for easier debugging, as clrjit.natvis can be taught to understand them.getCSEAvailBit
andgetCSEAvailCrossCallBit
functions to avoid multiple hard-codings of these expresions, (b) stop printing all the details of the CSE dataflow to JitDump; just print the result, (c) addoptPrintCSEDataFlowSet
function to print the CSE dataflow set in symbolic form, not just the raw bits, (d) addedFMT_CSE
string to use for formatting CSE candidates, (e) addedoptOptimizeCSEs
to the phase structure for JitDump output, (f) remove unusedoptCSECandidateTotal
(remnant of Valnum + lexical CSE)emitIssue1Instr
toemitEndCodeGen
, to avoid the possibility of reading an instruction beyond the basic block. Also, improved the Intel jcc erratum criteria calculations, (b) Changealign
instructions of zero size to have a zero PerfScore throughput number (since they don't generate code), (c) AddCOMPlus_JitDasmWithAlignmentBoundaries
to force disasm output to display alignment boundaries.emitLabelString
function for constructing a string to display for a bound emitter label. CreatedemitPrintLabel
to directly print the label, (b) AddgenInsDisplayName
function to create a string for use when outputting an instruction. For xarch, this prepends the "v" for SIMD instructions, as necessary. This is preferable to calling the rawgenInsName
function, (c) For each insGroup, created a debug-only list of basic blocks that contributed code to that insGroup. Display this set of blocks in the JitDump disasm output, with block ID. This is useful for looking at an IG, and finding the blocks in a .dot flow graph visualization that contributed to it, (d) remove unusedinstDisp
jitstd::vector
,JitExpandArray<T>
,JitExpandArrayStack<T>
,LcOptInfo
.benchmarks.run.windows.x64.checked.mch:
Detail diffs
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
Detail diffs
libraries.crossgen2.windows.x64.checked.mch:
Detail diffs
Detail diffs