
JIT: Add a pass of liveness for the new locals inside physical promotion #86043

Merged — jakobbotsch merged 18 commits into dotnet:main from the physical-promotion-liveness branch on May 19, 2023

Conversation

@jakobbotsch (Member) commented on May 10, 2023

This liveness information is used when decomposing stores to skip fields that are dying, and to skip the remainder if it is dying. It also allows forward substitution to kick in for the inserted replacement locals.

I'm not sure yet that this is how I want to go about liveness here; we might instead combine this with the existing early liveness pass, or use the early liveness information to figure out where to look when calculating more precise liveness. But this at least gives us an implementation to experiment with. The throughput impact seems very small, so doing something smarter seems unnecessary.
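As a rough illustration of what the pass enables (not the JIT's actual code — FieldReplacement, isLiveOut, and DecomposeStore are invented stand-ins), a store decomposition can simply skip fields whose replacement locals are dying, and the remainder when it is dying:

#include <cstdio>
#include <vector>

// Stand-in for the JIT's per-field replacement bookkeeping: each promoted
// field has an offset and a liveness verdict computed by the new pass.
struct FieldReplacement
{
    unsigned offset;
    bool     isLiveOut;
};

// Decompose a struct store into per-field stores, omitting fields (and the
// remainder) that liveness proved dead at this point.
static void DecomposeStore(const std::vector<FieldReplacement>& fields, bool remainderIsLiveOut)
{
    for (const FieldReplacement& field : fields)
    {
        if (!field.isLiveOut)
        {
            std::printf("  field at +%u is dying; store skipped\n", field.offset);
            continue;
        }
        std::printf("  store to field at +%u\n", field.offset);
    }

    if (remainderIsLiveOut)
    {
        std::printf("  store to remainder\n");
    }
    else
    {
        std::printf("  remainder is dying; store skipped\n");
    }
}

int main()
{
    // Two live fields, one dying field, and a dying remainder.
    std::vector<FieldReplacement> fields{{0, true}, {8, false}, {16, true}};
    DecomposeStore(fields, /* remainderIsLiveOut */ false);
}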

@ghost ghost assigned jakobbotsch May 10, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 10, 2023
@ghost commented on May 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch (Member, Author): /azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental

@azure-pipelines: Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch (Member, Author): /azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental

@azure-pipelines: Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch (Member, Author): /azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental

@azure-pipelines: Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch (Member, Author): /azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-jit-experimental

@azure-pipelines: Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch (Member, Author) commented

The jitstress failures are a couple of timeouts in event source tests in a scenario without physical promotion enabled, so they are almost certainly unrelated.

The runtime-jit-experimental failures were recently fixed, I believe, so I will rerun that one, but I will push a comment cleanup PR first.

No failures in libraries-jitstress.

@jakobbotsch (Member, Author): /azp run runtime-jit-experimental

@azure-pipelines: Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch (Member, Author) commented

The failure looks like #86233.

cc @dotnet/jit-contrib PTAL @AndyAyersMS

Diffs.

Diffs without old promotion.

The throughput impact is very small so I don't think there will be a reason to try to be much smarter here. Well, unless it turns out that's because of a bug (but I have looked at a few examples and things look reasonable).

After this is merged I think most of the basics are in place, to the point where I can get started on readjusting the heuristics.

@jakobbotsch jakobbotsch marked this pull request as ready for review May 15, 2023 13:38
Comment on lines +86 to +87
// TODO: We need a scalability limit on these, we cannot always track
// the remainder and all fields.
@jakobbotsch (Member, Author):
Will address this in a separate follow-up.

@AndyAyersMS (Member):

This is similar to the "coarsening" discussion we had re alias analysis the other day; you should be able to have models where you group together fields, or fields plus the remainder, and get approximate liveness for them.

Would be interesting to know if you see cases where there are large structs with "hot fields" so that it is worth tracking a subset but coarsening everything else, and if that diverges somehow from just promoting the hot fields and leaving the rest.

@jakobbotsch (Member, Author):

> Would be interesting to know if you see cases where there are large structs with "hot fields" so that it is worth tracking a subset but coarsening everything else, and if that diverges somehow from just promoting the hot fields and leaving the rest.

That's a good point... the heuristics should likely include something to avoid creating locals that are just going to end up untracked.
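One possible shape of such a scalability limit, sketched with invented names (MaxPreciseSlots, LivenessSlot — nothing here exists in the JIT): give the first few fields precise liveness bits and fold the rest, plus the remainder, into one shared coarse bit, which over-approximates liveness but remains sound:

#include <cstdint>

// Illustrative only: fields 0..MaxPreciseSlots-1 each get their own liveness
// bit; all remaining fields and the remainder share one coarse bit. A use of
// anything in the coarse group keeps the whole group alive.
constexpr unsigned MaxPreciseSlots = 8;

static unsigned LivenessSlot(unsigned fieldIndex)
{
    return (fieldIndex < MaxPreciseSlots) ? fieldIndex : MaxPreciseSlots;
}

static void MarkUse(uint64_t& liveSet, unsigned fieldIndex)
{
    liveSet |= (1ull << LivenessSlot(fieldIndex));
}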

Comment on lines +516 to +519
#ifdef DEBUG
unsigned char lvTrackedWithoutIndex : 1; // Tracked but has no lvVarIndex (i.e. only valid GTF_VAR_DEATH flags, used
// by physical promotion)
#endif
@jakobbotsch (Member, Author):

This is just used to print "(last use)" in the dump for the new locals introduced by physical promotion.

Comment on lines +145 to +196
//------------------------------------------------------------------------
// OverlappingReplacements:
//   Find the replacements that overlap the range [offset, offset + size).
//   Returns true if any does; *firstReplacement then points at the first
//   overlapping entry, and *endReplacement (if requested) one past the last.
bool AggregateInfo::OverlappingReplacements(unsigned      offset,
                                            unsigned      size,
                                            Replacement** firstReplacement,
                                            Replacement** endReplacement)
{
    size_t firstIndex = Promotion::BinarySearch<Replacement, &Replacement::Offset>(Replacements, offset);
    if ((ssize_t)firstIndex < 0)
    {
        firstIndex = ~firstIndex;
        if (firstIndex > 0)
        {
            Replacement& lastRepBefore = Replacements[firstIndex - 1];
            if ((lastRepBefore.Offset + genTypeSize(lastRepBefore.AccessType)) > offset)
            {
                // Overlap with last entry starting before offs.
                firstIndex--;
            }
            else if (firstIndex >= Replacements.size())
            {
                // Starts after last replacement ends.
                return false;
            }
        }

        const Replacement& first = Replacements[firstIndex];
        if (first.Offset >= (offset + size))
        {
            // First candidate starts after this ends.
            return false;
        }
    }

    assert((firstIndex < Replacements.size()) && Replacements[firstIndex].Overlaps(offset, size));
    *firstReplacement = &Replacements[firstIndex];

    if (endReplacement != nullptr)
    {
        size_t lastIndex = Promotion::BinarySearch<Replacement, &Replacement::Offset>(Replacements, offset + size);
        if ((ssize_t)lastIndex < 0)
        {
            lastIndex = ~lastIndex;
        }

        // Since we verified above that there is an overlapping replacement
        // we know that lastIndex exists and is the next one that does not
        // overlap.
        assert(lastIndex > 0);
        *endReplacement = Replacements.data() + lastIndex;
    }

    return true;
}
@jakobbotsch (Member, Author):
I introduced a new AggregateInfo and moved this method into it.
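A standalone model of the contract this method implements, assuming only what the quoted code shows: replacements kept sorted by offset, and a query yielding the half-open range of entries overlapping [offset, offset + size). It uses a linear scan purely for brevity where the JIT binary-searches:

#include <algorithm>
#include <cstdio>
#include <vector>

struct Rep
{
    unsigned Offset;
    unsigned Size;
};

// Returns true and sets [*first, *end) to the entries of the sorted vector
// that overlap [offset, offset + size).
static bool OverlappingRange(std::vector<Rep>& reps, unsigned offset, unsigned size, Rep** first, Rep** end)
{
    // First entry whose end extends past 'offset'...
    auto lo = std::find_if(reps.begin(), reps.end(),
                           [&](const Rep& r) { return (r.Offset + r.Size) > offset; });
    // ...and first entry starting at or after the end of the queried range.
    auto hi = std::find_if(lo, reps.end(), [&](const Rep& r) { return r.Offset >= (offset + size); });
    if (lo == hi)
    {
        return false;
    }

    *first = &*lo;
    *end   = reps.data() + (hi - reps.begin());
    return true;
}

int main()
{
    std::vector<Rep> reps{{0, 4}, {8, 4}, {16, 8}};
    Rep* first;
    Rep* end;
    if (OverlappingRange(reps, 6, 12, &first, &end))
    {
        for (Rep* r = first; r < end; r++)
        {
            std::printf("overlap at +%u, size %u\n", r->Offset, r->Size);
        }
    }
}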

Comment on lines +700 to +713
bool StructSegments::Segment::IntersectsOrAdjacent(const Segment& other) const
{
if (End < other.Start)
{
return false;
}

if (other.End < Start)
{
return false;
}

return true;
}
@jakobbotsch (Member, Author):

These StructSegments methods are now being used as part of replacement picking, so I've moved them into the main promotion.cpp.
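For illustration, a standalone model of the predicate and the interval coalescing it enables; the Seg type mirrors the quoted half-open semantics (touching counts as adjacent), while the Add helper is invented here and simplified (it does not handle cascading merges):

#include <algorithm>
#include <cstdio>
#include <vector>

struct Seg
{
    unsigned Start;
    unsigned End; // half-open: [Start, End)

    bool IntersectsOrAdjacent(const Seg& other) const
    {
        // Mirrors the quoted predicate: false only if one interval ends
        // strictly before the other begins.
        return (End >= other.Start) && (other.End >= Start);
    }
};

// Invented helper: insert a segment, merging it with the first existing
// segment it intersects or touches.
static void Add(std::vector<Seg>& segs, Seg seg)
{
    for (Seg& s : segs)
    {
        if (s.IntersectsOrAdjacent(seg))
        {
            s.Start = std::min(s.Start, seg.Start);
            s.End   = std::max(s.End, seg.End);
            return;
        }
    }
    segs.push_back(seg);
}

int main()
{
    std::vector<Seg> segs;
    Add(segs, {0, 8});
    Add(segs, {8, 16}); // adjacent, so it coalesces into [0, 16)
    std::printf("[%u, %u)\n", segs[0].Start, segs[0].End);
}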

@jakobbotsch (Member, Author) commented

Ping @AndyAyersMS, but no rush if you're busy with PGO things.

@AndyAyersMS (Member) left a review:

Looks good. Left some comments but those can be handled in follow-ups.

// For LCL_ADDR this is a retbuf and we expect it to be a def. We
// don't know the exact size here so we cannot mark anything as
// being fully defined, thus we can just return.
assert(isDef);
@AndyAyersMS (Member):

I wonder about this part -- for managed callees it seems like we could access their return type and figure out the extent of the writes they do.

In principle the callee must either write that entire extent or write nothing (nothing only if it throws rather than returns; then, if we're in a try, the previous definition can reach handlers/filters). But maybe we skip writing to padding in some cases.

@jakobbotsch (Member, Author):

Yes, the problem is finding the call node that contains the retbuf, since the linked locals list does not contain it. In principle it shouldn't be hard to find it quickly by utilizing GTF_CALL, so we could be a bit more precise about this.
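A sketch of that idea with simplified stand-ins for GenTree and GTF_CALL: since the flag is set on every node whose subtree contains a call, a search can prune unflagged subtrees and reach the call quickly:

// Simplified stand-ins; only the flag-guided pruning is the point here.
struct Node
{
    bool  hasCallFlag; // models GTF_CALL: set if this node or any descendant is a call
    bool  isCall;
    Node* op1;
    Node* op2;
};

static Node* FindCall(Node* n)
{
    if ((n == nullptr) || !n->hasCallFlag)
    {
        return nullptr; // no call anywhere in this subtree
    }

    if (n->isCall)
    {
        return n;
    }

    // Descend only into flagged subtrees.
    Node* found = FindCall(n->op1);
    return (found != nullptr) ? found : FindCall(n->op2);
}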


for (BasicBlock* block = m_compiler->fgLastBB; block != nullptr; block = block->bbPrev)
{
m_hasPossibleBackEdge |= block->bbNext && (block->bbNext->bbNum <= block->bbNum);
@AndyAyersMS (Member):
We know for sure we have proper bbNum ordering?

In cases with loops this will converge faster if you walk the blocks in post-order here, to maximize the propagation per iteration.

@AndyAyersMS (Member):

Looks like the same goes for regular liveness so perhaps we can fix that too -- maybe do both as a follow-up?

@jakobbotsch (Member, Author) commented on May 19, 2023:

> We know for sure we have proper bbNum ordering?

Even if we don't, it will just result in m_hasPossibleBackEdge being set to true, which might cause one more iteration of the fixpoint computation than necessary.

When I did early liveness I considered computing strongly connected components and doing the localized fixpoint computation within each SCC, which should be the optimal thing. However, when I measured the actual throughput cost of the fixpoint computation I felt it wasn't worth more complexity, from #79346 (comment):

Base: 232865278205, Diff: 236405285056, +1.5202%

?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z : 616293246 : +30.93% : 11.44% : +0.2647%
?fgMarkUseDef@Compiler@@AEAAXPEAUGenTreeLclVarCommon@@@Z : 426414825 : +39.60% : 7.91% : +0.1831%
?fgInterBlockLocalVarLiveness@Compiler@@QEAAXXZ : 353468679 : +43.94% : 6.56% : +0.1518%
?fgPerBlockLocalVarLiveness@Compiler@@QEAAXXZ : 258048984 : +26.39% : 4.79% : +0.1108%
?lvaSortByRefCount@Compiler@@QEAAXXZ : 222829814 : +39.99% : 4.14% : +0.0957%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z : 142747231 : +11.06% : 2.65% : +0.0613%
?PostOrderVisit@ForwardSubVisitor@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 136636902 : NA : 2.54% : +0.0587%
?WalkTree@?$GenTreeVisitor@VLocalSequencer@@@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 134894223 : NA : 2.50% : +0.0579%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z : 129181805 : +13.59% : 2.40% : +0.0555%
?fgComputeLifeTrackedLocalUse@Compiler@@QEAAXAEAPEA_KAEAVLclVarDsc@@PEAUGenTreeLclVarCommon@@@Z : 127997140 : +37.14% : 2.38% : +0.0550%
?fgForwardSubStatement@Compiler@@AEAA_NPEAUStatement@@@Z : 119968922 : +70.17% : 2.23% : +0.0515%
?fgMarkAddressExposedLocals@Compiler@@AEAA?AW4PhaseStatus@@XZ : 119662088 : +72.72% : 2.22% : +0.0514%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : 116051428 : +13.69% : 2.15% : +0.0498%
?WalkTree@?$GenTreeVisitor@V?$GenericTreeWalker@$00$0A@$0A@$00@@@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z : 111845378 : +56.69% : 2.08% : +0.0480%
?fgComputeLifeTrackedLocalDef@Compiler@@QEAA_NAEAPEA_KAEBQEA_KAEAVLclVarDsc@@PEAUGenTreeLclVarCommon@@@Z : 108064056 : +34.46% : 2.01% : +0.0464%
memset : 104530112 : +5.76% : 1.94% : +0.0449%

PerBlockAnalysis is the method called for each basic block in each iteration of the fixpoint computation, so the fixpoint computation added around 0.26% in TP cost when early liveness was introduced.

It would still be an interesting experiment, and RPO traversal is also simpler than computing full SCCs.
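For reference, a standalone sketch of the fixpoint computation under discussion: classic backward liveness over uint64_t bit-vectors (so at most 64 tracked locals here), visiting blocks in post-order so that successors are processed first and each pass propagates as far as possible, as suggested above. The JIT's LiveVarAnalysis is considerably more involved:

#include <cstdint>
#include <vector>

struct Block
{
    std::vector<int> succs;       // successor indices in the CFG
    uint64_t         use     = 0; // locals read before any write in this block
    uint64_t         def     = 0; // locals written before any read
    uint64_t         liveIn  = 0;
    uint64_t         liveOut = 0;
};

static void ComputeLiveness(std::vector<Block>& blocks, const std::vector<int>& postOrder)
{
    bool changed = true;
    while (changed) // back edges (loops) may require extra passes to converge
    {
        changed = false;
        for (int i : postOrder) // successors first, for a backward problem
        {
            Block&   b   = blocks[i];
            uint64_t out = 0;
            for (int s : b.succs)
            {
                out |= blocks[s].liveIn;
            }

            uint64_t in = b.use | (out & ~b.def);
            if ((in != b.liveIn) || (out != b.liveOut))
            {
                b.liveIn  = in;
                b.liveOut = out;
                changed   = true;
            }
        }
    }
}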

@jakobbotsch jakobbotsch merged commit e62cb64 into dotnet:main May 19, 2023
@jakobbotsch jakobbotsch deleted the physical-promotion-liveness branch May 19, 2023 20:07
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2023