JIT: More CSE heuristics adjustments #98257
Conversation
Based on analysis of cases where the machine learning is struggling, add some more observations and tweak some of the existing ones:

* where we use `log` for dynamic compression, bias results so they are always non-negative (see the sketch below)
* only consider integral vars for the pressure estimate
* note if a CSE has a call
* note weighted tree costs
* note weighted local occurrences (approximate pressure relief)
* note spread of occurrences (as a fraction of BBs)
* note if the CSE is something that can be contained (a guess)
* note if the CSE is cheap (cost 2 or 3) and is something that can be contained
* note if the CSE might be "live across" a call in LSRA block ordering

The block spread and LSRA live-across observations use RPO artifacts that may no longer be up to date. It is not clear this matters, as LSRA does not use RPO for block ordering.

Contributes to #92915.
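As a rough illustration of the first bullet, here is a minimal sketch of the non-negative `log` biasing (not the PR's code; the floor value is an assumption, and only the names `deMinimis`/`deMinimusAdj` mirror the diff quoted below):

```cpp
#include <algorithm>
#include <cmath>

// Pick a small floor deMinimis and add back -log(deMinimis), so the result
// is >= 0 for any x >= 0: biasedLog(x) = log(max(deMinimis, x) / deMinimis).
double biasedLog(double x)
{
    const double deMinimis    = 1e-3; // assumed floor, for illustration only
    const double deMinimusAdj = -std::log(deMinimis);
    return deMinimusAdj + std::log(std::max(deMinimis, x));
}
```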
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBo PTAL. A slightly earlier version of this was able to get a geomean perf score improvement of 1.0035 on asp.net, which is the best I've seen so far. My guesstimate is that the best possible improvement is around 1.01.

The training here took about 2375 rounds × 50 methods/round × 25 runs/method × 2 SPMI invocations per run, roughly 5,937,500 single-method invocations of SPMI.
Diff results for #98257

Throughput diffs:

* Throughput diffs for linux/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)
* Throughput diffs for windows/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)

Details here
```cpp
// Weighted local occurrences, log-compressed and biased to be non-negative.
features[19] = deMinimusAdj + log(max(deMinimis, cse->numLocalOccurrences * cse->csdUseWtCnt));
// Spread of the CSE's occurrences as a fraction of the method's basic blocks.
features[20] = booleanScale * ((double)(blockSpread) / numBBs);
```
```cpp
// Guess whether the tree's operation could be contained (subsumed) by its parent.
const bool isContainable = cse->csdTree->OperIs(GT_ADD, GT_NOT, GT_MUL, GT_LSH);
```
Shouldn't such nodes be marked with `GTF_ADDRMODE_NO_CSE` by this point? Assuming you mean they could be contained as part of addressing modes.
This stage of CSE will never see anything marked `GTF_DONT_CSE`.

This is a more general guess as to whether the operation can be subsumed by its parent. In particular, I saw an `AND(NOT ...)` turn into `andn`, so CSE of the `NOT` is not always an improvement.
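To make the `andn` point concrete, here is a hedged example (not from the PR; assuming x86-64 with BMI1 enabled, where compilers typically fold the complement into a single instruction):

```cpp
// If the ~mask below is CSE'd into a temp reused elsewhere, the backend must
// materialize the NOT separately and emit NOT + AND, instead of containing
// the whole expression in one `andn` instruction. The CSE can thus be a loss.
unsigned AndNot(unsigned a, unsigned mask)
{
    return a & ~mask; // candidate for a single `andn` with BMI1
}
```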
```cpp
// Walk the blocks between the candidate's first and last occurrence (in the
// RPO-derived postorder numbering) looking for blocks that contain a call.
for (BasicBlock* block = minPostorderBlock;
     block != nullptr && block != maxPostorderBlock && count < blockSpread; block = block->Next(), count++)
{
    if (block->HasFlag(BBF_HAS_CALL))
```
Isn't this flag unreliable in the optimized tier? I understand that it's OK for it to be imprecise; I just wonder if it's too unreliable.
CSE was already relying on this flag for its live-across-call analysis. I haven't checked whether it is reliable (it is set in `fgMorphCall`, so it should be pretty good).
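For context, a minimal sketch of the shape of such a flag-based guess (a hypothetical helper, not the PR's actual code, which uses the loop quoted above):

```cpp
// Treat the CSE as potentially live across a call if any block between its
// first and last occurrence (in block order) has BBF_HAS_CALL set.
bool CseMightBeLiveAcrossCall(BasicBlock* first, BasicBlock* last)
{
    for (BasicBlock* block = first; (block != nullptr) && (block != last); block = block->Next())
    {
        if (block->HasFlag(BBF_HAS_CALL))
        {
            return true;
        }
    }
    return false;
}
```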
Failure looks like #97049