Optimize CheckSample #76520
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Seems odd that it would be that costly.
Yes, it might be just a side effect of that other problem I shared in the Discord, the fact that Top - PGO is enabled, Bottom - disabled. Purple - The flamegraphs represent the first 2 seconds of the AvaloniaILSpy launch.
That's extremely weird,
@AndyAyersMS Ok here is the problem: Method If I change
Ugh... it seems like the number of methods pending promotion to Tier1 is constantly growing up to 6000 (in
OSR needs ~50,000 iterations. Note with PGO enabled the OSR version will still have class probes.
Right, this is what I was mentioning the other day: likely a good number of those 6000 are not all that important/interesting to rejit, but we have a straight FIFO rejit queue, so no re-prioritizing can happen.
I wonder if at least changing it to FILO will help (whatever we requested just now, we need right now). Also, I think we should check whether we can set affinity for the tiered compilation thread so that it runs on E-cores/at low priority with a lower delay (e.g. 100ms->20ms).
Another quick idea: having a separate "high priority" queue for methods with loops |
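A minimal sketch of the "separate high-priority queue" idea, not the runtime's actual tiering code. `PendingMethod`, `PromotionQueues`, and the `hasLoops` flag are hypothetical names introduced only for illustration; the sketch assumes the background compilation thread always drains the loop-method queue first and falls back to the regular FIFO queue otherwise.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a method waiting to be rejitted at Tier1.
struct PendingMethod {
    std::uint32_t id;   // illustrative identifier, not a real MethodDesc
    bool hasLoops;      // assumed to be known when the method is queued
};

// Two FIFO queues: methods with loops are always dequeued before the rest.
class PromotionQueues {
public:
    void Enqueue(const PendingMethod& m) {
        (m.hasLoops ? high_ : normal_).push_back(m);
    }

    // Writes the next method to promote into 'out'; returns false if both queues are empty.
    bool TryDequeue(PendingMethod& out) {
        std::deque<PendingMethod>& q = !high_.empty() ? high_ : normal_;
        if (q.empty()) {
            return false;
        }
        out = q.front();
        q.pop_front();
        return true;
    }

private:
    std::deque<PendingMethod> high_;    // methods with loops
    std::deque<PendingMethod> normal_;  // everything else
};
```

With this arrangement a loop-heavy method that was queued late still gets compiled ahead of thousands of older, loop-free entries, at the cost of possibly starving the regular queue while loop methods keep arriving.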
So, I think it's an interesting case: a single method, not promoted to Tier1 in time, causes a serious start-time regression for the AvaloniaILSpy app. It's MemberInfoCache<_Canon>.AddMethod in this case (see flamegraph above). This method has loops, but it usually operates on small arrays, so the loops do not run long enough for OSR. The method itself is hot: at some point during start it is invoked 10000 times, and due to PGO instrumentation and the small loops it bottlenecks itself in class probes (hence the initial reason I filed this issue). The app has a lot of code to jit so Potential ideas:
Also, did you look at where we add class probes and why? We must be putting one or more in each loop. If more than one, then perhaps some are redundant?
I like the notion of a priority queue or perhaps some rough priority buckets (i.e. call count 30-100, 100-1000, 1000+)?

Another approach we heard about from the JavaScript team but never tested ourselves is bounding the promotion queue at a small size (10s or 100s), so that pushing a new item on the queue evicts the oldest item if needed to prevent overflow. The idea is that frequently called methods can be expected to re-enter the queue rapidly if they are evicted. This also should bias us towards promoting methods that were hot recently rather than methods that were hot in the past but no longer are.

In the past, one reason we didn't do call counting for very long is that the overhead of counting was pretty high. I think @kouvel's work with call counting stubs resolved that issue and it would now be reasonable to leave call counters installed indefinitely, but we probably would want to test that out.

A last thought is that we could use stack sampling as an alternative prioritization measure. We tried this years ago using EESuspend as the sampling mechanism and the overhead was too high, but with some investment we should be able to do stack sampling more efficiently. Stack sampling also didn't identify nearly as many methods to promote. From a prioritization perspective that is useful, but if we only promoted the methods identified by sampling we didn't get the same level of steady-state perf wins as when we promoted based on call counting. We might need to combine multiple approaches to get the best results. I think a stack sampling approach takes more dev effort than many of the others, but it would also be independently useful for improving our diagnostic profiling story, so we might nab two birds with one stone.
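The bounded-queue and bucket ideas above can be sketched roughly as follows. This is an illustration under stated assumptions, not how the runtime works: `BoundedPromotionQueue` and `PriorityBucket` are hypothetical names, the capacity is assumed to be at least 1, the consumer is assumed to take the oldest surviving entry, and the bucket boundaries (below 100, below 1000, 1000 and up) are one possible reading of the ranges mentioned in the comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Promotion queue with a fixed capacity: when full, pushing evicts the oldest entry.
// An evicted method that is still hot is expected to be pushed again soon.
class BoundedPromotionQueue {
public:
    // Assumption: capacity >= 1.
    explicit BoundedPromotionQueue(std::size_t capacity) : capacity_(capacity) {}

    // Adds a method id; returns true if an older entry had to be evicted.
    bool Push(std::uint32_t methodId) {
        bool evicted = false;
        if (items_.size() == capacity_) {
            items_.pop_front();  // drop the oldest pending request
            evicted = true;
        }
        items_.push_back(methodId);
        return evicted;
    }

    // Takes the oldest surviving entry; returns false if the queue is empty.
    bool TryPop(std::uint32_t& out) {
        if (items_.empty()) {
            return false;
        }
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t Size() const { return items_.size(); }

private:
    std::size_t capacity_;
    std::deque<std::uint32_t> items_;
};

// Rough priority bucket derived from a call count (boundaries are an assumption).
inline int PriorityBucket(std::uint32_t callCount) {
    if (callCount < 100) return 0;   // lowest priority
    if (callCount < 1000) return 1;  // medium priority
    return 2;                        // highest priority
}
```

A small capacity keeps the backlog from growing into the thousands; the trade-off is that a method evicted once pays for extra unoptimized calls before it re-enters the queue.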
CheckSample is invoked for each class probe in codegen (when TieredPGO is enabled) and seems to be quite hot, e.g.:
Even a simple __rdtsc here seems to speed up start time for an app I'm testing locally.
cc @AndyAyersMS
category:performance
theme:optimization
skill-level:beginner
cost:small
impact:small