JIT: More CSE heuristics adjustments #98257
Conversation
Based on analysis of cases where the machine learning is struggling, add some more observations and tweak some of the existing ones:

* where we use `log` for dynamic compression, bias results so they are always non-negative (see the sketch below)
* only consider integral vars for the pressure estimate
* note if a CSE has a call
* note weighted tree costs
* note weighted local occurrences (approximate pressure relief)
* note spread of occurrences (as a fraction of BBs)
* note if the CSE is something that can be contained (a guess)
* note if the CSE is cheap (cost 2 or 3) and is something that can be contained
* note if the CSE might be "live across" a call in LSRA block ordering

The block spread and LSRA live-across observations use RPO artifacts that may no longer be up to date. It is not clear this matters, as LSRA does not use RPO for block ordering.

Contributes to #92915.
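As a rough illustration of the first bullet, here is a minimal sketch of the non-negative `log` biasing (not the PR's code; the floor value is an assumption, and only the names `deMinimis`/`deMinimusAdj` mirror the diff quoted below):

```cpp
#include <algorithm>
#include <cmath>

// Pick a small floor deMinimis and add back -log(deMinimis), so the result
// is >= 0 for any x >= 0: biasedLog(x) = log(max(deMinimis, x) / deMinimis).
double biasedLog(double x)
{
    const double deMinimis    = 1e-3; // assumed floor, for illustration only
    const double deMinimusAdj = -std::log(deMinimis);
    return deMinimusAdj + std::log(std::max(deMinimis, x));
}
```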
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBo PTAL. A slightly earlier version of this was able to get a geomean perf score improvement of 1.0035 on asp.net, which is the best I've seen so far. My guesstimate is that the best possible improvement is around 1.01.

The training here took about 2375 rounds × 50 methods/round × 25 runs/method × 2 SPMI invocations per run, roughly 5,937,500 single-method invocations of SPMI.
Diff results for #98257

Throughput diffs:

* Throughput diffs for linux/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)
* Throughput diffs for windows/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)

Details here
```cpp
// Weighted local occurrences, log-compressed and biased to be non-negative.
features[19] = deMinimusAdj + log(max(deMinimis, cse->numLocalOccurrences * cse->csdUseWtCnt));
// Spread of the CSE's occurrences as a fraction of the method's basic blocks.
features[20] = booleanScale * ((double)(blockSpread) / numBBs);
```
```cpp
// Guess whether the tree's operation could be contained (subsumed) by its parent.
const bool isContainable = cse->csdTree->OperIs(GT_ADD, GT_NOT, GT_MUL, GT_LSH);
```
Shouldn't such nodes be marked with `GTF_ADDRMODE_NO_CSE` by this point? Assuming you mean they could be contained as part of addressing modes.
This stage of CSE will never see anything marked `GTF_DONT_CSE`.

This is a more general guess as to whether the operation can be subsumed by its parent. In particular, I saw an `AND(NOT ...)` turn into `andn`, so CSE of the `NOT` is not always an improvement.
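To make the `andn` point concrete, here is a hedged example (not from the PR; assuming x86-64 with BMI1 enabled, where compilers typically fold the complement into a single instruction):

```cpp
// If the ~mask below is CSE'd into a temp reused elsewhere, the backend must
// materialize the NOT separately and emit NOT + AND, instead of containing
// the whole expression in one `andn` instruction. The CSE can thus be a loss.
unsigned AndNot(unsigned a, unsigned mask)
{
    return a & ~mask; // candidate for a single `andn` with BMI1
}
```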
```cpp
// Walk the blocks between the candidate's first and last occurrence (in the
// RPO-derived postorder numbering) looking for blocks that contain a call.
for (BasicBlock* block = minPostorderBlock;
     block != nullptr && block != maxPostorderBlock && count < blockSpread; block = block->Next(), count++)
{
    if (block->HasFlag(BBF_HAS_CALL))
```
Isn't this flag unreliable in the optimized tier? I understand that it's OK for it to be imprecise; I just wonder if it's too unreliable.
CSE was already relying on this flag for its live-across-call analysis. I haven't checked whether it is reliable (it is set in `fgMorphCall`, so it should be pretty good).
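For context, a minimal sketch of the shape of such a flag-based guess (a hypothetical helper, not the PR's actual code, which uses the loop quoted above):

```cpp
// Treat the CSE as potentially live across a call if any block between its
// first and last occurrence (in block order) has BBF_HAS_CALL set.
bool CseMightBeLiveAcrossCall(BasicBlock* first, BasicBlock* last)
{
    for (BasicBlock* block = first; (block != nullptr) && (block != last); block = block->Next())
    {
        if (block->HasFlag(BBF_HAS_CALL))
        {
            return true;
        }
    }
    return false;
}
```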
Failure looks like #97049