Optimize CheckSample #76520

Closed
EgorBo opened this issue Oct 3, 2022 · 12 comments · Fixed by #81932
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone
8.0.0

Comments

@EgorBo
Member

EgorBo commented Oct 3, 2022

CheckSample is invoked for each class probe in codegen (when TieredPGO is enabled) and seems to be quite hot, e.g.:
[profiler screenshot showing CheckSample among the hottest frames]

Even a simple __rdtsc here seems to speed up the start time of an app I'm testing locally.
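
To make that concrete, something along these lines is roughly what I mean (just a sketch, not the actual CheckSample implementation; CheckSampleFast and the 1/64 sampling rate are purely illustrative):

```cpp
// Minimal sketch only: gate class-probe sampling on the low bits of the CPU's
// timestamp counter instead of a more expensive check. Names and the sampling
// rate are illustrative, not CoreCLR's actual CheckSample code.
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>       // __rdtsc on MSVC
#else
#include <x86intrin.h>    // __rdtsc on GCC/Clang
#endif

// Record roughly 1 out of every 64 probe hits (hypothetical rate).
inline bool CheckSampleFast()
{
    constexpr uint64_t kSampleMask = 0x3F;
    return (__rdtsc() & kSampleMask) == 0;
}
```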

cc @AndyAyersMS

category:performance
theme:optimization
skill-level:beginner
cost:small
impact:small

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 3, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 3, 2022
@ghost

ghost commented Oct 3, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo EgorBo added this to the 8.0.0 milestone Oct 3, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Oct 3, 2022
@AndyAyersMS
Member

Seems odd that it would be that costly.

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

Yes, it might just be a side effect of that other problem I shared on Discord: MemberInfoCache.AddMethod is super slow when TieredPGO is enabled, and that method stays in Tier0 (it is promoted neither to Tier1 nor via OSR):

[flamegraph screenshots: PGO enabled vs. disabled]

Top: PGO enabled; bottom: PGO disabled; purple: AddMethod.

The flamegraphs cover the first 2 seconds of the AvaloniaILSpy launch.

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

That's extremely weird: MemberInfoCache`1[System.__Canon]:AddMethod is never promoted to Tier1 despite being hot 🤷‍♂️. It doesn't look PGO-related per se; it's just that with PGO it's way slower due to the class probes inside its loops (~255 iterations, so not enough for OSR?)

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

@AndyAyersMS Ok here is the problem:

Method MemberInfoCache:AddMethod is invoked 16 times, then there is a pause (the app is doing other things), then it's invoked again, this time ~10000 times, and it is never promoted to Tier1 during those 2 seconds.

If I change DOTNET_TC_CallCountThreshold from 30 to, e.g., 10, it's promoted just fine.
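
For reference, the threshold mechanism can be pictured roughly like this (a conceptual sketch only, not the runtime's actual call-counting code; MethodCallCounter is a made-up type). Each Tier0 method carries a remaining-call counter seeded from the threshold, and hitting zero queues the method for Tier1, so lowering the threshold just makes that happen after fewer calls:

```cpp
// Conceptual sketch only, not CoreCLR's real call-counting implementation.
#include <atomic>

struct MethodCallCounter   // hypothetical type for illustration
{
    std::atomic<int> remaining;   // seeded from the call-count threshold

    explicit MethodCallCounter(int threshold) : remaining(threshold) {}

    // Returns true on the call that should enqueue the method for Tier1 jitting.
    bool OnCall()
    {
        return remaining.fetch_sub(1, std::memory_order_relaxed) == 1;
    }
};
```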

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

Ugh... it seems the number of methods pending promotion to Tier1 keeps growing, up to ~6000 (in the m_methodsPendingCountingForTier1 array), and AddMethod is somewhere near the end of that queue

@AndyAyersMS
Member

(~255 iterations, so not enough for OSR?)

OSR needs ~50,000 iterations. Note with PGO enabled the OSR version will still have class probes.

Ugh... it seems the number of methods pending promotion to Tier1 keeps growing, up to ~6000 (in the m_methodsPendingCountingForTier1 array), and AddMethod is somewhere near the end of that queue

Right, this is what I was mentioning the other day: likely a good number of those 6000 are not all that important or interesting to rejit, but we have a straight FIFO rejit queue, so no re-prioritizing can happen.

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

Right, this is what I was mentioning the other day: likely a good number of those 6000 are not all that important or interesting to rejit, but we have a straight FIFO rejit queue, so no re-prioritizing can happen.

I wonder if at least changing it to LIFO would help (whatever was requested most recently is what we need right now). Also, I think we should check whether we can set affinity for the tiered-compilation thread so it runs on E-cores / at low priority with a lower delay (e.g. 100ms -> 20ms).

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

Another quick idea: having a separate "high priority" queue for methods with loops

@EgorBo
Member Author

EgorBo commented Oct 3, 2022

So, I think it's an interesting case: a single method, not promoted to Tier1 in time, causes a serious start-time regression for the AvaloniaILSpy app.

It's MemberInfoCache<__Canon>.AddMethod in this case (see the flamegraph above). The method has loops, but they usually operate on small arrays, so there aren't enough iterations for OSR. The method itself is hot: at some point during startup it is invoked 10000 times! Due to PGO instrumentation and the small loops, the method bottlenecks in its class probes (hence the initial reason I filed this issue). The app has a lot of code to JIT, so AddMethod gets lost somewhere in the "pending promotion to Tier1" queue, which stays large during startup; if I measured it correctly it peaks at around 6000 methods.

cc @davidwrighton @noahfalk

Potential ideas:

  1. Like Andy suggested: a thread-safe PriorityQueue where we never stop call counting and use the call-counting cells' values as priorities, so the hottest methods are promoted first (a rough sketch follows this list)
  2. Get rid of absolutely unnecessary Tier1 compilations for small methods like auto-properties (see "Tiered JIT: redundant compilations" #76402); in the case of this app that's ~5000 Tier1 compilations
  3. Introduce a separate "high-priority" queue for methods with loops
  4. Investigate lowering the TC_CallCountingDelayMs value and running promotions on E-cores / low-priority threads via affinity so they don't affect the main app
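
A rough sketch of what idea 1 could look like, using observed call counts as priorities (PendingMethod, HotterFirst, and PromotionQueue are illustrative names, not existing runtime types):

```cpp
// Illustrative sketch only: a max-heap keyed by call count, so the hottest
// pending method is always promoted first instead of draining a plain FIFO.
#include <cstdint>
#include <queue>
#include <vector>

struct PendingMethod
{
    void*    methodDesc;   // stand-in for the runtime's MethodDesc*
    uint32_t callCount;    // observed call count, used as the priority
};

struct HotterFirst
{
    bool operator()(const PendingMethod& a, const PendingMethod& b) const
    {
        return a.callCount < b.callCount;   // hottest method ends up on top
    }
};

using PromotionQueue =
    std::priority_queue<PendingMethod, std::vector<PendingMethod>, HotterFirst>;
```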

@AndyAyersMS
Member

Also, did you look at where we add class probes and why? We must be putting one or more in each loop. If more than one, then perhaps some are redundant?

@noahfalk
Member

noahfalk commented Oct 7, 2022

I like the notion of a priority queue, or perhaps some rough priority buckets (i.e. call count 30-100, 100-1000, 1000+). Another approach we heard about from the JavaScript team but never tested ourselves is bounding the promotion queue at a small size (tens or hundreds of entries), where pushing a new item evicts the oldest item if needed to prevent overflow. The idea is that frequently called methods can be expected to re-enter the queue rapidly if they are evicted. This should also bias us towards promoting methods that were hot recently rather than methods that were hot in the past but no longer are.
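
The bounded-queue idea could look roughly like this (BoundedPromotionQueue is an illustrative name, and the sketch leaves out the synchronization a real implementation would need):

```cpp
// Illustrative sketch only: cap the promotion queue at a small size and evict
// the oldest entry on overflow, on the theory that a genuinely hot method will
// quickly re-enter the queue. Not thread-safe; for exposition only.
#include <cstddef>
#include <deque>

template <typename T>
class BoundedPromotionQueue
{
public:
    explicit BoundedPromotionQueue(size_t capacity) : m_capacity(capacity) {}

    void Push(const T& item)
    {
        if (m_queue.size() == m_capacity)
            m_queue.pop_front();   // evict the oldest pending method
        m_queue.push_back(item);
    }

    bool TryPop(T& item)
    {
        if (m_queue.empty())
            return false;
        item = m_queue.front();
        m_queue.pop_front();
        return true;
    }

private:
    size_t        m_capacity;
    std::deque<T> m_queue;
};
```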

In the past one reason we didn't do call counting for very long is that the overhead of counting was pretty high. I think @kouvel's work with call counting stubs resolved that issue and it would now be reasonable to leave call counters installed indefinitely, but we probably would want to test that out.

A last thought is that we could use stack sampling as an alternative prioritization measure. We tried this years ago using EESuspend as the sampling mechanism and the overhead was too high, but with some investment we should be able to do stack sampling more efficiently. Stack sampling also didn't identify nearly as many methods to promote. From a prioritization perspective that is useful, but when we only promoted the methods identified by sampling we didn't get the same level of steady-state perf wins as when we promoted based on call counting. We might need to combine multiple approaches to get the best results. I think a stack sampling approach takes more dev effort than many of the others, but it would also be independently useful for improving our diagnostic profiling story, so we might nab two birds with one stone.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 12, 2023