JIT: More CSE heuristics adjustments #98257

Merged 1 commit on Feb 10, 2024

AndyAyersMS
Member

Based on analysis of cases where the machine learning is struggling, add some more observations and tweak some of the existing ones:

  • where we use log for dynamic compression, bias results so they are always non-negative
  • only consider integral vars for pressure estimate
  • note if a CSE has a call
  • note weighted tree costs
  • note weighted local occurrences (approx pressure relief)
  • note spread of occurrences (as fraction of BBs)
  • note if CSE is something that can be contained (guess)
  • note if CSE is cheap (cost 2 or 3) and is something that can be contained
  • note if CSE might be "live across" a call in LSRA block ordering
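
The log-compression bias in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the JIT's code: the value chosen for `deMinimis`, the definition of `deMinimusAdj` as `-log(deMinimis)`, and the helper name `CompressNonNegative` are all assumptions, since the PR text does not show how those constants are defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Assumed values: the PR excerpt uses these names but does not show their definitions.
const double deMinimis    = 1e-3;
const double deMinimusAdj = -std::log(deMinimis);

// Hypothetical helper: log-compress a non-negative weighted count so that the
// result stays non-negative (up to floating-point rounding).
double CompressNonNegative(double value)
{
    assert(value >= 0.0);
    // Clamp so log() never sees zero, then shift so the clamped minimum maps to about 0.
    return deMinimusAdj + std::log(std::max(deMinimis, value));
}
```

Under these assumptions an input of 0 maps to roughly 0, and larger inputs grow logarithmically, so the feature never swings far negative for tiny counts.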

The block-spread and LSRA live-across observations use RPO artifacts that may no longer be up to date. It is not clear this matters, as LSRA does not use RPO for block ordering.
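
One way to picture the "spread of occurrences" observation is the sketch below. It assumes the spread is the difference between the largest and smallest postorder numbers of the blocks that contain the CSE, divided by the method's block count. The helper name `BlockSpreadFraction` and its inputs are hypothetical, and the scaling factor the PR applies (`booleanScale`) is left out because its value is not given here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: postorderNums holds the postorder number of each block
// that contains an occurrence of the CSE; numBBs is the method's block count.
double BlockSpreadFraction(const std::vector<unsigned>& postorderNums, unsigned numBBs)
{
    assert(!postorderNums.empty() && numBBs > 0);
    auto minMax = std::minmax_element(postorderNums.begin(), postorderNums.end());
    // Assumption: spread is measured as max postorder number minus min postorder number.
    const unsigned blockSpread = *minMax.second - *minMax.first;
    return static_cast<double>(blockSpread) / numBBs;
}
```

If the postorder numbers are stale, as the note above warns, the fraction is only an approximation of how widely the occurrences are scattered.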

Contributes to #92915.

@ghost ghost assigned AndyAyersMS Feb 10, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 10, 2024
@ghost

ghost commented Feb 10, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Member Author

@EgorBo PTAL
cc @dotnet/jit-contrib

A slightly earlier version of this was able to get a geomean perf score improvement of 1.0035 on asp.net, which is the best I've seen so far. My guesstimate is that the best possible improvement is around 1.01.

Indx       2375
Meth      27303     114753      81697     128368      63268     117275      47353      76178     114310      78590      65589      49072      29759     128341      37138      32964     128476      79343      62843      32737      63807      81266      57893     127845     125320      67972      48097     126028        898      50331      50605      68601      21984     124514     117877      58076      15850     125808      58549     129284     123834      58715      45169     125470      21885      49207      58341       9282      96046      96669
Base      54.58      30.88      93.18       9.75      71.25     130.58      34.75     152.26      20.88    6371.11     270.59      11.09    3641.02      59.33   34685.60      53.14      14.16     184.19     151.75      42.65     706.75      58.03      15.25     274.55     183.64      43.33      27.28      23.47     550.70      21.85     526.50      76.42    1321.75      47.36     129.69      34.96     187.43      18.25     147.64    9382.59      65.79     177.95      47.24     120.12      90.46     607.37      56.32      62.32      57.21      58.46
Best      51.41      30.88      91.95       9.75      69.50     130.58      33.25     151.87      20.75    6359.98     270.59      11.09    3641.02      59.33   34685.60      53.14      14.16     184.07     144.75      42.65     706.75      58.03      15.25     274.55     181.53      43.33      27.28      23.47     547.45      21.85     524.50      75.51    1282.15      45.86     129.69      34.96     186.43      18.25     147.64    9382.59      65.79     177.95      47.24     120.12      90.46     606.70      56.32      59.29      57.21      58.46
Grdy      54.58      30.88      91.95       9.75      69.50     130.58      35.00     152.26      20.75    6371.11     270.59      11.09    3641.02      59.33   34685.60      53.14      14.16     184.19     151.75      42.65     706.75      58.03      15.25     274.55     182.78      43.33      27.28      23.47     549.20      21.85     524.50      75.51    1311.85      47.36     129.69      34.96     186.55      18.25     147.64    9382.59      65.79     177.95      47.24     120.12      90.46     606.70      56.32      61.83      57.21      58.46
Best/base: 1.0071
vs Base    1.0016 Better 11 Same 38 Worse 1
vs Best    0.9946 Better 0 Same 38 Worse 12

Params     0.2464, 0.2704, 0.0649,-0.4737,-0.0927, 0.1187, 0.2871, 0.1694,-0.7787, 0.0000, 0.5030,-0.8862, 0.0000, 0.3484,-0.0014,-0.0781,-0.3194, 0.4156, 0.3742, 0.4030, 0.1251, 0.2982, 0.0000, 0.0000, 0.0000

Collecting greedy policy data via SPMI... done (26589 ms)
Greedy/Base: 35552 methods, 8573 better, 24393 same, 2585 worse,  1.0035 geomean
Best:  121352 @  1.2935
Worst: 123952 @  0.5479
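
For readers unfamiliar with the "geomean" figures above, here is a minimal sketch of how such a number could be computed. It assumes that each ratio is base perf score divided by candidate perf score (so values above 1 are improvements, since lower perf scores are better) and that the per-method ratios are combined with a geometric mean. The function name `GeomeanImprovement` is hypothetical; the actual tooling may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: geometric mean of per-method base/candidate perf score ratios.
double GeomeanImprovement(const std::vector<double>& base, const std::vector<double>& candidate)
{
    assert(base.size() == candidate.size() && !base.empty());
    double logSum = 0.0;
    for (std::size_t i = 0; i < base.size(); i++)
    {
        // Assumption: lower perf score is better, so base / candidate > 1 is an improvement.
        logSum += std::log(base[i] / candidate[i]);
    }
    // Averaging in log space and exponentiating yields the geometric mean.
    return std::exp(logSum / static_cast<double>(base.size()));
}
```

Summing logarithms rather than multiplying ratios directly avoids overflow or underflow when tens of thousands of methods are involved.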

The training here took about 2375 rounds * 50 methods/round * 25 runs/method * 2 SPMI invocations per run, or roughly 5.9 million single-method invocations of SPMI.
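
As a quick check on the training cost, multiplying the four quoted factors can be written out as below. The function name `TrainingInvocations` is hypothetical; the inputs are the figures from the comment.

```cpp
#include <cstdint>

// Hypothetical helper: total single-method SPMI invocations for a training run.
constexpr std::uint64_t TrainingInvocations(std::uint64_t rounds,
                                            std::uint64_t methodsPerRound,
                                            std::uint64_t runsPerMethod,
                                            std::uint64_t spmiPerRun)
{
    return rounds * methodsPerRound * runsPerMethod * spmiPerRun;
}

// 2375 rounds * 50 methods * 25 runs * 2 invocations = 5,937,500.
static_assert(TrainingInvocations(2375, 50, 25, 2) == 5937500, "product of the quoted factors");
```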

@ryujit-bot

Diff results for #98257

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Details here


features[19] = deMinimusAdj + log(max(deMinimis, cse->numLocalOccurrences * cse->csdUseWtCnt));
features[20] = booleanScale * ((double)(blockSpread) / numBBs);

const bool isContainable = cse->csdTree->OperIs(GT_ADD, GT_NOT, GT_MUL, GT_LSH);
Member

Shouldn't such nodes be marked with GTF_ADDRMODE_NO_CSE by this point? Assuming you mean they could be contained as part of addressing modes.

Member Author

This stage of CSE will never see anything marked GTF_DONT_CSE.

This is a more general guess as to whether the operation can be subsumed by its parent. In particular, I saw an AND(NOT ...) turn into andn, so CSE of the NOT is not always an improvement.
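
The "containable" and "cheap containable" observations can be illustrated with the sketch below. It uses a stand-in `Oper` enum rather than the JIT's real node types, and the function names are hypothetical. The operator set and the cost range (2 or 3) are taken from the PR description and the excerpt above.

```cpp
// Stand-in for the JIT's operator kinds; not the real genTreeOps enum.
enum class Oper { Add, Not, Mul, Lsh, And, Call, Other };

// Guess whether a parent node might absorb this operation (for example, AND(NOT x) becoming andn).
constexpr bool IsContainableGuess(Oper oper)
{
    return oper == Oper::Add || oper == Oper::Not || oper == Oper::Mul || oper == Oper::Lsh;
}

// A cheap (cost 2 or 3) containable candidate is one where CSE may not pay off.
constexpr bool IsCheapContainable(Oper oper, unsigned cost)
{
    return IsContainableGuess(oper) && (cost == 2 || cost == 3);
}
```

Both predicates are deliberately coarse: they feed the heuristic as hints rather than deciding on their own whether a CSE is performed.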

for (BasicBlock *block = minPostorderBlock;
block != nullptr && block != maxPostorderBlock && count < blockSpread; block = block->Next(), count++)
{
if (block->HasFlag(BBF_HAS_CALL))
Member

Isn't this flag unreliable in the optimized tier? I understand that it's OK for it to be imprecise; I just wonder if it's too unreliable.

Member Author

CSE was already relying on this flag for its live across call analysis. I haven't checked whether it is reliable (it is set in fgMorphCall, so it should be pretty good).
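
The scan under discussion can be sketched in isolation as follows. This is a simplified model, not the JIT's code: `Block` is a stand-in for `BasicBlock`, `hasCall` stands in for the `BBF_HAS_CALL` flag, and returning `true` on the first call block is an assumption, since the excerpt does not show the body of the `if`.

```cpp
#include <cassert>

// Stand-in for BasicBlock: hasCall models BBF_HAS_CALL, next models Next().
struct Block
{
    bool   hasCall;
    Block* next;
};

// Hypothetical helper: walk from first toward last, visiting at most blockSpread
// blocks, and report whether any visited block is flagged as containing a call.
bool MayBeLiveAcrossCall(Block* first, Block* last, unsigned blockSpread)
{
    assert(first != nullptr);
    unsigned count = 0;
    for (Block* block = first; block != nullptr && block != last && count < blockSpread;
         block = block->next, count++)
    {
        if (block->hasCall)
        {
            return true; // Assumed action; the PR excerpt elides what happens here.
        }
    }
    return false;
}
```

Because the result depends only on a per-block flag, an imprecise flag makes the observation noisy rather than incorrect, which matches the reasoning in the reply above.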

@AndyAyersMS
Member Author

Failure looks like #97049

@AndyAyersMS AndyAyersMS merged commit 78bd7de into dotnet:main Feb 10, 2024
127 of 129 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 12, 2024