Optimize CheckSample #76520

EgorBo · 2022-10-03T01:12:38Z

CheckSample is invoked for each class probe in codegen (when TieredPGO is enabled) and seems to be quite hot, e.g.:

Even a simple __rdtsc here seems to speed up start time for an app I'm testing locally.

cc @AndyAyersMS

category:performance
theme:optimization
skill-level:beginner
cost:small
impact:small

The text was updated successfully, but these errors were encountered:

ghost · 2022-10-03T01:12:44Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

CheckSample is invoked for each class probe in codegen (when TieredPGO is enabled) and seems to be quite hot, e.g.:

Even a simple __rdtsc here seems to speed up start time for an app I'm testing locally.

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2022-10-03T01:19:49Z

Seems odd that it would be that costly.

EgorBo · 2022-10-03T01:23:16Z

Yes, it might be just a side-effect from that other problem I shared in the Discord, the fact that MemberInfoCache.AddMethod is super slow when TieredPGO is enabled. And that method stays in Tier0 (is not promoted neither to Tier1 nor OSR):

Top - PGO is enabled, Bottom - disabled. Purple - AddMethod

The flamegraphs represent first 2 seconds of AvaloniaILSpy launch

EgorBo · 2022-10-03T01:33:29Z

That's extremely weird, MemberInfoCache1[System.__Canon]:AddMethod` is never promoted to tier1 despite being hot 🤷‍♂️ Doesn't look PGO related it's just that with PGO it's way slower due to class-probes inside the loops (~ 255 iterations so not enough for OSR?)

EgorBo · 2022-10-03T02:10:02Z

@AndyAyersMS Ok here is the problem:

Method MemberInfoCache:AddMethod is invoked 16 times, then there is some pause (App is doing different things) then it's invoked again ~10000 times this time and never promoted to tier1 during those 2 seconds.

If I change DOTNET_TC_CallCountThreshold from 30 to e.g. 10 -- it's promoted just fine.

EgorBo · 2022-10-03T02:23:28Z

Ugh... it seems like the number of methods pending promotion to Tier1 is constantly growing up to 6000 (in m_methodsPendingCountingForTier1 array) and AddMethod is somewhere in the end of that queue

AndyAyersMS · 2022-10-03T04:11:32Z

(~ 255 iterations so not enough for OSR?)

OSR needs ~50,000 iterations. Note with PGO enabled the OSR version will still have class probes.

Ugh... it seems like the number of methods pending promotion to Tier1 is constantly growing up to 6000 (in m_methodsPendingCountingForTier1 array) and AddMethod is somewhere in the end of that queue

Right, this is what I was mentioning the other day, likely a good number of those 6000 are not all that important/interesting to rejit, but we have a straight FIFO rejit queue so no re-prioritizing can happen.

EgorBo · 2022-10-03T05:43:53Z

Right, this is what I was mentioning the other day, likely a good number of those 6000 are not all that important/interesting to rejit, but we have a straight FIFO rejit queue so no re-prioritizing can happen.

I wonder if at least changing it to FILO will help (whatever we requested just now we need right now). Also, I think we should check if we can set affinity for the tiered-compilation-thread to run on E-cores/low priority with lower delay (e.g. 100ms->20ms)

EgorBo · 2022-10-03T05:57:29Z

Another quick idea: having a separate "high priority" queue for methods with loops

EgorBo · 2022-10-03T08:18:15Z

So, I think it's an interesting case: A single method, not promoted to Tier1 on time, causes a serious start-time regression for AvaloniaILSpy app.

It's MemberInfoCache<_Canon>.AddMethod in this case (see flamegraph above)- this method has loops but usually it operates on small arrays so not enough for OSR. The method itself is hot, at some point during start this method is invoked 10000 times! and due to PGO instrumentation and small loops, that method bottle-necks itself in class-probes (hence, the initial reason I filed this issue for). The app has a lot of code to jit so AddMethod is lost somewhere in the "pending promotion to tier1" queue that is constantly big during start, if I measured it correctly it's

cc @davidwrighton @noahfalk

Potential ideas:

Like Andy suggested: a thread-safe PriortyQueue where we never stop call-counting and use call-counting cells' values as priorities to promote hottest methods first
Get rid of absolutely unnecessary tier1 compilations for small methods like auto-properties, see Tiered JIT: redundant compilations #76402 in case of this app it's ~5000 tier1 compilations
Introduce a separate "high-priority" queue for methods with loops
Investigate into lowering TC_CallCountingDelayMs value, run promotions on E-cores/low-priority threads via affinity so it won't affect main app

AndyAyersMS · 2022-10-03T16:06:14Z

Also, did you look at where we add class probes and why? We must be putting one or more in each loop. If more than one, then perhaps some are redundant?

noahfalk · 2022-10-07T09:51:58Z

I like the notion of a priority queue or perhaps some rough priority buckets (ie call count 30-100, 100-1000, 1000+)? Another approach we heard about from the javascript team but never tested ourselves is bounding the promotion queue at a small size (10s or 100s) and pushing new items on the queue evicts the oldest item if needed to prevent overflow. The idea is that frequently called methods can be expected to re-enter the queue rapidly if they are evicted. This also should bias us towards promoting methods that were hot recently rather than methods that were hot in the past but no longer are.

In the past one reason we didn't do call counting for very long is that the overhead of counting was pretty high. I think @kouvel's work with call counting stubs resolved that issue and it would now be reasonable to leave call counters installed indefinitely, but we probably would want to test that out.

A last thought is that we could use stack sampling as an alternative prioritization measure. We tried this years ago using EESuspend as the sampling mechanism and the overhead was too high, but with some investment we should be able to do stack sampling more efficiently. Stack sampling also didn't identify nearly as many methods to promote. From a prioritization perspective that is useful, but if we only promoted the methods identified by sampling we didn't get same level of steady state perf wins vs. when we promoted based on call counting. We might need to combine multiple approaches to get the best results. I think a stack sampling approach takes more dev effort than many of the others, but it would also be independently useful for improving our diagnostic profiling story so we might nab two birds with one stone.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 3, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Oct 3, 2022

EgorBo added this to the 8.0.0 milestone Oct 3, 2022

ghost removed the untriaged New issue has not been triaged by the area owner label Oct 3, 2022

EgorBo mentioned this issue Oct 12, 2022

Dynamic PGO startup improvements in NET 8 #76969

Closed

23 tasks

EgorBo mentioned this issue Feb 10, 2023

Remove CPU cache contention for instrumented code #81932

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2023

EgorBo closed this as completed in #81932 Feb 10, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2023

ghost locked as resolved and limited conversation to collaborators Mar 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize CheckSample #76520

Optimize CheckSample #76520

EgorBo commented Oct 3, 2022 •

edited by BruceForstall

Loading

ghost commented Oct 3, 2022

AndyAyersMS commented Oct 3, 2022

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022

AndyAyersMS commented Oct 3, 2022

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022

EgorBo commented Oct 3, 2022 •

edited

Loading

AndyAyersMS commented Oct 3, 2022

noahfalk commented Oct 7, 2022

Optimize CheckSample #76520

Optimize CheckSample #76520

Comments

EgorBo commented Oct 3, 2022 • edited by BruceForstall Loading

ghost commented Oct 3, 2022

AndyAyersMS commented Oct 3, 2022

EgorBo commented Oct 3, 2022 • edited Loading

EgorBo commented Oct 3, 2022

EgorBo commented Oct 3, 2022 • edited Loading

EgorBo commented Oct 3, 2022

AndyAyersMS commented Oct 3, 2022

EgorBo commented Oct 3, 2022 • edited Loading

EgorBo commented Oct 3, 2022

EgorBo commented Oct 3, 2022 • edited Loading

AndyAyersMS commented Oct 3, 2022

noahfalk commented Oct 7, 2022

EgorBo commented Oct 3, 2022 •

edited by BruceForstall

Loading

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022 •

edited

Loading

EgorBo commented Oct 3, 2022 •

edited

Loading