Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Improve loop cloning, with debugging improvements (#55299)
When loop cloning was creating cloning conditions, it was creating unnecessary bounds checks in some multi-dimensional array index cases. When creating a set of cloning conditions, first a null check is done, then an array length check is done, etc. Thus, the array length expression itself won't fault because we've already done a null check. And a subsequent array index expression won't fault (or need a bounds check) because we've already checked the array length (i.e., we've done a manual bounds check). So, stop creating the unnecessary bounds checks, and mark the appropriate instructions as non-faulting by clearing the GTF_EXCEPT bit. Note that I did not turn on the code to clear GTF_EXCEPT for array length checks because it leads to negative downstream effects in CSE. Namely, there end up being array length expressions that are identical except for the exception bit. When CSE sees this, it gives up on creating a CSE, which leads to regressions in some cases where we don't CSE the array length expression. Also, for multi-dimension jagged arrays, when optimizing the fast path, we were not removing as many bounds checks as we could. In particular, we weren't removing outer bounds checks, only inner ones. Add code to handle all the bounds checks. There are some runtime improvements (measured via BenchmarkDotNet on the JIT microbenchmarks), but also some regressions, due, as far as I can tell, to the Intel jcc erratum performance impact. In particular, benchmark ludcmp shows up to a 9% regression due to a `jae` instruction in the hot loop now crossing a 32-byte boundary due to code changes earlier in the function affecting instruction alignment. The hot loop itself is exactly the same (module register allocation differences). As there is nothing that can be done (without mitigating the jcc erratum) -- it's "bad luck". In addition to those functional changes, there are a number of debugging-related improvements: 1. Loop cloning: (a) Improved dumping of cloning conditions and other things, (b) remove an unnecessary member to `LcOptInfo`, (c) convert the `LoopCloneContext` raw arrays to `jitstd::vector` for easier debugging, as clrjit.natvis can be taught to understand them. 2. CSE improvements: (a) Add `getCSEAvailBit` and `getCSEAvailCrossCallBit` functions to avoid multiple hard-codings of these expresions, (b) stop printing all the details of the CSE dataflow to JitDump; just print the result, (c) add `optPrintCSEDataFlowSet` function to print the CSE dataflow set in symbolic form, not just the raw bits, (d) added `FMT_CSE` string to use for formatting CSE candidates, (e) added `optOptimizeCSEs` to the phase structure for JitDump output, (f) remove unused `optCSECandidateTotal` (remnant of Valnum + lexical CSE) 3. Alignment: (a) Moved printing of alignment boundaries from `emitIssue1Instr` to `emitEndCodeGen`, to avoid the possibility of reading an instruction beyond the basic block. Also, improved the Intel jcc erratum criteria calculations, (b) Change `align` instructions of zero size to have a zero PerfScore throughput number (since they don't generate code), (c) Add `COMPlus_JitDasmWithAlignmentBoundaries` to force disasm output to display alignment boundaries. 4. Codegen / Emitter: (a) Added `emitLabelString` function for constructing a string to display for a bound emitter label. Created `emitPrintLabel` to directly print the label, (b) Add `genInsDisplayName` function to create a string for use when outputting an instruction. For xarch, this prepends the "v" for SIMD instructions, as necessary. This is preferable to calling the raw `genInsName` function, (c) For each insGroup, created a debug-only list of basic blocks that contributed code to that insGroup. Display this set of blocks in the JitDump disasm output, with block ID. This is useful for looking at an IG, and finding the blocks in a .dot flow graph visualization that contributed to it, (d) remove unused `instDisp` 5. Clrjit.natvis: (a) add support for `jitstd::vector`, `JitExpandArray<T>`, `JitExpandArrayStack<T>`, `LcOptInfo`. 6. Misc: (a) When compacting an empty loop preheader block with a subsequent block, clear the preheader flag. ## benchmarks.run.windows.x64.checked.mch: ``` Summary of Code Size diffs: (Lower is better) Total bytes of base: 25504 Total bytes of diff: 25092 Total bytes of delta: -412 (-1.62% of base) Total relative delta: -0.31 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (bytes): -92 : 14861.dasm (-2.57% of base) -88 : 2430.dasm (-0.77% of base) -68 : 12182.dasm (-3.82% of base) -48 : 24678.dasm (-1.61% of base) -31 : 21598.dasm (-5.13% of base) -26 : 21601.dasm (-4.57% of base) -21 : 25069.dasm (-7.14% of base) -16 : 14859.dasm (-1.38% of base) -11 : 14862.dasm (-1.35% of base) -6 : 21600.dasm (-1.83% of base) -5 : 25065.dasm (-0.58% of base) 11 total files with Code Size differences (11 improved, 0 regressed), 1 unchanged. Top method improvements (bytes): -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this Top method improvements (percentages): -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this 11 total methods with Code Size differences (11 improved, 0 regressed), 1 unchanged. ``` </details> -------------------------------------------------------------------------------- ``` Summary of Perf Score diffs: (Lower is better) Total PerfScoreUnits of base: 38374.96 Total PerfScoreUnits of diff: 37914.07000000001 Total PerfScoreUnits of delta: -460.89 (-1.20% of base) Total relative delta: -0.12 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (PerfScoreUnits): -220.67 : 24678.dasm (-1.74% of base) -99.27 : 14861.dasm (-2.09% of base) -66.30 : 21598.dasm (-1.41% of base) -18.73 : 2430.dasm (-0.28% of base) -18.40 : 21601.dasm (-1.37% of base) -9.73 : 25065.dasm (-0.56% of base) -9.05 : 14859.dasm (-0.77% of base) -5.51 : 21600.dasm (-0.77% of base) -4.15 : 12182.dasm (-0.17% of base) -3.92 : 14860.dasm (-0.32% of base) -3.46 : 25069.dasm (-2.31% of base) -1.70 : 14862.dasm (-0.20% of base) 12 total files with Perf Score differences (12 improved, 0 regressed), 0 unchanged. Top method improvements (PerfScoreUnits): -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) Top method improvements (percentages): -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) 12 total methods with Perf Score differences (12 improved, 0 regressed), 0 unchanged. ``` </details> -------------------------------------------------------------------------------- ## coreclr_tests.pmi.windows.x64.checked.mch: ``` Summary of Code Size diffs: (Lower is better) Total bytes of base: 25430 Total bytes of diff: 24994 Total bytes of delta: -436 (-1.71% of base) Total relative delta: -0.42 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (bytes): -92 : 194668.dasm (-2.57% of base) -68 : 194589.dasm (-3.82% of base) -48 : 248565.dasm (-1.61% of base) -32 : 249053.dasm (-3.58% of base) -31 : 251012.dasm (-5.13% of base) -26 : 251011.dasm (-4.57% of base) -19 : 248561.dasm (-6.76% of base) -16 : 194667.dasm (-1.38% of base) -15 : 252241.dasm (-0.72% of base) -12 : 252242.dasm (-0.81% of base) -11 : 194669.dasm (-1.35% of base) -9 : 246308.dasm (-1.06% of base) -9 : 246307.dasm (-1.06% of base) -9 : 246245.dasm (-1.06% of base) -9 : 246246.dasm (-1.06% of base) -6 : 228622.dasm (-0.77% of base) -6 : 251010.dasm (-1.83% of base) -5 : 248557.dasm (-0.61% of base) -4 : 249054.dasm (-0.50% of base) -4 : 249052.dasm (-0.47% of base) 22 total files with Code Size differences (22 improved, 0 regressed), 1 unchanged. Top method improvements (bytes): -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2() -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[]) -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3() -4 (-0.47% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1() Top method improvements (percentages): -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2() -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[]) -3 (-1.11% of base) : 249057.dasm - SimpleArray_01.Test:Test2() -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[]) -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3() 22 total methods with Code Size differences (22 improved, 0 regressed), 1 unchanged. ``` </details> -------------------------------------------------------------------------------- ``` Summary of Perf Score diffs: (Lower is better) Total PerfScoreUnits of base: 161610.68999999997 Total PerfScoreUnits of diff: 160290.10999999996 Total PerfScoreUnits of delta: -1320.58 (-0.82% of base) Total relative delta: -0.20 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (PerfScoreUnits): -639.25 : 252241.dasm (-0.97% of base) -220.67 : 248565.dasm (-1.74% of base) -132.59 : 252242.dasm (-0.26% of base) -99.27 : 194668.dasm (-2.09% of base) -66.30 : 251012.dasm (-1.41% of base) -62.20 : 249053.dasm (-2.74% of base) -18.40 : 251011.dasm (-1.37% of base) -9.33 : 248557.dasm (-0.54% of base) -9.05 : 194667.dasm (-0.77% of base) -8.32 : 249054.dasm (-0.42% of base) -5.85 : 246308.dasm (-0.52% of base) -5.85 : 246307.dasm (-0.52% of base) -5.85 : 246245.dasm (-0.52% of base) -5.85 : 246246.dasm (-0.52% of base) -5.51 : 251010.dasm (-0.77% of base) -4.36 : 249052.dasm (-0.22% of base) -4.16 : 253363.dasm (-0.21% of base) -4.15 : 194589.dasm (-0.17% of base) -3.92 : 194666.dasm (-0.32% of base) -3.41 : 248561.dasm (-2.29% of base) 23 total files with Perf Score differences (23 improved, 0 regressed), 0 unchanged. Top method improvements (PerfScoreUnits): -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2() -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3() -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1() -4.16 (-0.21% of base) : 253363.dasm - MatrixMul.Test:MatrixMul() -4.15 (-0.17% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][]) -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) Top method improvements (percentages): -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2() -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int) -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][]) -2.70 (-1.71% of base) : 249057.dasm - SimpleArray_01.Test:Test2() -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[]) -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][]) -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3() -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int -1.89 (-0.22% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[]) -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1() 23 total methods with Perf Score differences (23 improved, 0 regressed), 0 unchanged. ``` </details> -------------------------------------------------------------------------------- ## libraries.crossgen2.windows.x64.checked.mch: ``` Summary of Code Size diffs: (Lower is better) Total bytes of base: 10828 Total bytes of diff: 10809 Total bytes of delta: -19 (-0.18% of base) Total relative delta: -0.00 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (bytes): -19 : 72504.dasm (-0.18% of base) 1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged. Top method improvements (bytes): -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this Top method improvements (percentages): -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this 1 total methods with Code Size differences (1 improved, 0 regressed), 0 unchanged. ``` </details> -------------------------------------------------------------------------------- ``` Summary of Perf Score diffs: (Lower is better) Total PerfScoreUnits of base: 6597.12 Total PerfScoreUnits of diff: 6586.31 Total PerfScoreUnits of delta: -10.81 (-0.16% of base) Total relative delta: -0.00 diff is an improvement. relative diff is an improvement. ``` <details> <summary>Detail diffs</summary> ``` Top file improvements (PerfScoreUnits): -10.81 : 72504.dasm (-0.16% of base) 1 total files with Perf Score differences (1 improved, 0 regressed), 0 unchanged. Top method improvements (PerfScoreUnits): -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this Top method improvements (percentages): -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this 1 total methods with Perf Score differences (1 improved, 0 regressed), 0 unchanged. ``` </details> -------------------------------------------------------------------------------- * Increase loop cloning max allowed condition blocks Allows inner loop of 3-nested loops (e.g., Array2 benchmark) to be cloned. * Clear GTF_INX_RNGCHK bit on loop cloning created index nodes to avoid unnecessary bounds checks. Revert max cloning condition blocks to 3; allowing more doesn't seem to improve performance (probably too many conditions before a not-sufficiently-executed loop, at least for the Array2 benchmark) * Remove outer index bounds checks * Convert loop cloning data structures to `vector` for better debugging * Improve CSE dump output 1. "#if 0" the guts of the CSE dataflow; that's not useful to most people. 2. Add readable CSE number output to the CSE dataflow set output 3. Add FMT_CSE to commonize CSE number output. 4. Add PHASE_OPTIMIZE_VALNUM_CSES to the pre-phase output "allow list" and stop doing its own blocks/trees output. 5. Remove unused optCSECandidateTotal 6. Add functions `getCSEAvailBit` and `getCSEAvailCrossCallBit` to avoid hand-coding these bit calculations in multiple places, for the CSE dataflow set bits. * Mark cloned array indexes as non-faulting When generating loop cloning conditions, mark array index expressions as non-faulting, as we have already null- and range-checked the array before generating an index expression. I also added similary code to mark array length expressions as non-faulting, for the same reason. However, that leads to CQ losses because of downstream CSE effects. * Don't count zero-sized align instructions in PerfScore * Add COMPlus_JitDasmWithAlignmentBoundaries This outputs the alignment boundaries without requiring outputting the actual addresses. It makes it easier to diff changes. * Improve bounds check output * Improve emitter label printing Create function for printing bound emitter labels. Also, add debug code to associate a BasicBlock with an insGroup, and output the block number and ID with the emitter label in JitDump, so it's easier to find where a group of generated instructions came from. * Formatting * Clear BBF_LOOP_PREHEADER bit when compacting empty pre-header block * Keep track of all basic blocks that contribute code to an insGroup * Update display of Intel jcc erratum branches in dump For instructions or instruction sequences which match the Intel jcc erratum criteria, note that in the alignment boundary dump. Also, a few fixes: 1. Move the alignment boundary dumping from `emitIssue1Instr` to `emitEndCodeGen` to avoid the possibility of reading the next instruction in a group when there is no next instruction. 2. Create `IsJccInstruction` and `IsJmpInstruction` functions for use by the jcc criteria detection, and fix that detection to fix a few omissions/errors. 3. Change the jcc criteria detection to be hard-coded to 32 byte boundaries instead of assuming `compJitAlignLoopBoundary` is 32. An example: ``` cmp r11d, dword ptr [rax+8] ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ............................... jae G_M42486_IG103 ``` In this case, the `cmp` doesn't cross the boundary, it is adjacent (the zero indicates the number of bytes of the instruction which cross the boundary), followed by the `jae` which starts after the boundary. Indicating the jcc erratum criteria can help point out potential performance issues due to unlucky alignment of these instructions in asm diffs. * Display full instruction name in alignment and other messages XArch sometimes prepends a "v" to the instructions names from the instruction table. Add a function `genInsDisplayName` to create the full instruction name that should be displayed, and use that in most places an instruction name will be displayed, such as in the alignment messages, and normal disassembly. Use this instead of the raw `genInsName`. This could be extended to handle arm32 appending an "s", but I didn't want to touch arm32 with this change. * Fix build * Code review feedback 1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared to existing GTF_IND_NONFAULTING. 2. Minor cleanup in getInsDisplayName. * Formatting
- Loading branch information