Investigate microbenchmarks that regress with PGO enabled #84264
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak. Issue details: Look into cases where microbenchmarks are slower with PGO enabled. #74873 (comment) has an initial list to check.
|
@AndyAyersMS I think that runtime/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs (lines 277 to 322 at e6226e6)
|
PS: we can probably propose a new API |
Yes, we are certain that these methods will be hot and want the best possible codegen right away. The cast helpers are also initialized in a special way - runtime/src/coreclr/vm/ecall.cpp (line 99 at 4cbe6f9)
I do not remember all the details, but there were reasons for that. |
The ones you listed have the AggressiveOptimization attribute, am I right? In that case we won't instrument them, but there are other (less important) members in CastHelpers which don't have AggressiveOptimization and so are eligible for instrumentation |
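For illustration, a minimal sketch (hypothetical helper names, not the actual CastHelpers code) of the distinction being discussed: a method marked with AggressiveOptimization skips tiered compilation entirely and therefore never gets Tier0 instrumentation, while an unmarked sibling is still eligible:

using System;
using System.Runtime.CompilerServices;

internal static class CastHelpersSketch
{
    // Compiled fully optimized on first use; never instrumented for dynamic PGO.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    internal static bool IsAssignableToFast(object obj, Type target)
        => obj is not null && target.IsAssignableFrom(obj.GetType());

    // No attribute: participates in tiering, so its Tier0 body can carry
    // instrumentation probes and feed dynamic PGO.
    internal static bool IsAssignableToTiered(object obj, Type target)
        => obj is not null && target.IsAssignableFrom(obj.GetType());
}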
All the cast helpers are initialized in that method. It would only be interesting if you somehow do not want |
The issue caused by #80481 is with benchmark methods whose body is dominated by a long-running loop. For such methods we never see any Tier0 + instr method return, so with sparse profiles enabled, Tier1 profile reconstruction thinks the entry block is run rarely, and this messes up optimizations; e.g. in this case we fail to hoist/CSE the cast out of the loop.
The fix I'm proposing is to run synthesis repair in such cases. It addresses the above case, looking now to see what else it does. |
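To make the shape concrete, here is a hypothetical benchmark of the kind being described (not the actual regressing test): the Tier0-instrumented compile spends its whole life inside the loop and escapes to OSR rather than returning, so sparse, return-probe-based profiles see a zero entry count:

using BenchmarkDotNet.Attributes;

public class CastInLoopSketch
{
    private readonly object _value = "hello";

    [Benchmark]
    public int CastInLoop()
    {
        int total = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            // Ideally the cast is hoisted/CSE'd out of the loop; with a bogus
            // zero entry-block weight that optimization is skipped.
            string s = (string)_value;
            total += s.Length;
        }
        return total;
    }
}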
Also noticed #84319 while reviewing the dumps for |
In particular, run synthesis in repair mode for cases where there are profile counts within the method but zero counts in `fgFirstBB`. Recall that sparse profiling effectively probes return blocks to determine the method entry count. So the zero-entry but not zero-everywhere case can happen if we have a method with a very long running loop plus sparse profiling plus OSR -- we will only get profile counts from the instrumented Tier0 method, and it will never return (instead it will always escape to an OSR version which will eventually return, but that version won't be instrumented). I originally was a bit more ambitious and ran repair for a broader set of reconstruction issues, but that led to a large number of diffs, in part because repair doesn't cope well with irreducible loops. Leaving the entry count zero can have a fairly disastrous impact on the quality of optimizations done in the method. Addresses quite a few of the worst-performing benchmarks in #84264.
After #84312: updated table of worst-case results (just 1 day of data, so possibly noisy); now showing the "fast time". Investigated cases are in bold, with notes linked -- skipping tests in italics that look similar to ones already analyzed.
|
System.Text.Json.Tests.Perf_Get.GetByte: Default [link], PGO [link]
PGO is unable to inline
I also noticed
|
System.Collections.Sort.Array(Size: 512): Default [link], PGO [link]. Both fluctuate, but PGO has worse behavior. By default this test doesn't iterate enough and is mainly measuring R2R codegen. Sadly, both PerfView and my tool are unable to resolve the R2R method; not sure why.
If I bump iteration count up to 200 there is a more consistent small regression with PGO
And it looks like it is related to the codegen in
Perf vs BDN iteration number for PGO:
By default the lab does 15 iterations, and IIRC we report the median, so you can see it might toggle back and forth between fast and slow results like we are seeing above, depending exactly on when the switchover happens. Seems like PGO behavior changed as a result of #80769. The issue seems to be that
The default case is using static PGO; from what I can tell all the counts are proper, so this is just a divergence in profile data leading to a change in code layout. There doesn't seem to be an easy way to tweak this algorithm, so I'm going to leave the analysis here and move on. |
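A small sketch of the median-toggling effect mentioned above, with made-up timings: with 15 iterations and a median-based report, the reported number flips between the two modes depending on which iteration the switchover lands on:

using System.Linq;

static class MedianToggleSketch
{
    // Hypothetical timings: 2.0 ("slow") until the switchover iteration, 1.0 ("fast") after.
    static double[] Measurements(int switchIteration) =>
        Enumerable.Range(0, 15)
                  .Select(i => i < switchIteration ? 2.0 : 1.0)
                  .ToArray();

    static double Median(double[] xs)
    {
        double[] sorted = xs.OrderBy(x => x).ToArray();
        return sorted[sorted.Length / 2];   // 15 samples -> middle element
    }

    // Median(Measurements(5))  == 1.0  -> the "fast" mode is reported
    // Median(Measurements(10)) == 2.0  -> the "slow" mode is reported
}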
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen): Default [link], PGO [link]. Locally:
Suspect this is another case where we run out of budget during inlining. The root-level method that contains all the benchmark bits is and the respective ETL profiles are
These are tricky to match up because of drastic inlining differences, but if we just look at top level call sites in the root method, we see
So with PGO we run out of budget for a couple of top-level methods that feature prominently in the profile. As before, let me try upping the budget to see what that does... I am not suggesting this as a fix, but it might tempt us to look at some kind of priority-first inlining process where we try to make sure we're maximizing the effectiveness of the budget. Looks like a budget value of 18 is sufficient to regain perf. That's quite a jump from where we are now. The non-PGO version is 2811 bytes; the PGO version with budget=18 is 4664. |
System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512): Default [link], PGO [link]. There are several jumps in the PGO history. The first one, on Feb 9, was likely a result of #81557; not clear yet why this helped non-PGO but hurt PGO. The other jumps are harder to pin down; the one on March 26 is seemingly because of #83484, which we have seen implicated in other changes, but it is hard to see a connection. Locally I see PGO as frequently faster or at par.
Profiling shows all time is in these two methods
Going to defer further investigation on this one. |
Benchstone.BenchI.MulMatrix.Test: Default [link], PGO [link]. Default:
PGO:
All the time in this benchmark is in one method:
With PGO we actually jit this method 7 times (#84517 tracks fixing some of this). Only the last one matters for benchmark perf; all the others are no longer active when BDN is measuring (though likely some are active during the warmup/pilot stages, and might perturb BDN's choice of iteration strategy). However, measuring with
The dynamic profile only reflects the first loop or two of the method, because those loops are hot enough that we escape to OSR before they finish iterating. Thus at Tier1 we see an incomplete profile where many blocks are zero that we know must be executed. This is likely the root cause of the regression from #70941; before then we would instrument the OSR method and so have a more representative picture of profile data.
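A rough sketch (hypothetical code, not the actual MulMatrix source) of why the Tier0 profile is incomplete: the early loops are hot enough to trigger OSR, execution escapes to an uninstrumented OSR compile, and everything after that point records zero counts:

static class OsrProfileSketch
{
    static void Work(int[,] a, int[,] b, int[,] c, int n)
    {
        // Counts are recorded here until the instrumented Tier0 code escapes to OSR...
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i, j] = 0;

        // ...so these later loops show zero counts in the profile, even though
        // they must execute whenever the method runs.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i, j] += a[i, k] * b[k, j];
    }
}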
This looks like a case that should have been handled by #84312 -- because the instrumented methods never return, profile incorporation thinks the entry block weight is zero, and schedules repairs; these run but fail to populate the remainder of the method with interesting counts.
The problem seems to be that we still use the edge likelihoods laid down by the profile incorporation, and so think the loops that never exit actually never exit, despite seeing that they have high counts. Some ideas on how to resolve this:
If we go back and adjust those likelihoods, we may also run into complications from cases like:
while (true)
{
    if (p) throw new Exception();
    ...
}
where the loop is postdominated by a throw; synthesis repair will want to make the throw have nonzero weight. Ultimately, we will have to permit throws to have nonzero weights. Testing some enhancements to profile repair that look promising on this case and don't seem too disruptive overall.
Update 4/16: official charts are slow to update, but I can pull the data myself now. The PGO regression is largely mitigated:
And now the "official chart": |
LinqBenchmarks.Count00ForX: Default [link] -- hard to tell from this, but default is at about 122 with error bars extending +/- 10. PGO [link] -- note the y-axis scale differs from above; PGO is pretty consistently at 163. At first blush I can't repro this locally. My machine is somewhat faster than the lab, so I assume the default case is running slowly for me locally.
However, looking at the distribution of measurements, it looks like the default case is just getting unlucky and hitting its slow variant more often for me, and there's a set of runs that are consistently faster than PGO. So I will keep digging. It looks like both modes end up measuring OSR code, but default switches to Tier1 and PGO never makes it there, in part because we redo Tier0/OSR over again for the main method (#84319, fix is close).
If you crank up the iteration count, then PGO looks quite a bit better. Here we are with 100 iterations:
Looking at the benchmark sources, this is a test where we manually set the iteration count; as a result each invocation of the benchmark method takes a long time, and so BDN decides not to call it very often per iteration (just twice).
So reaching the 30-call threshold requires some number of iterations (looks like we do 10 or so warming up). Once we fix #84517 I expect this benchmark to improve some, but we might also consider removing the internal iterations and letting BDN pick the counts for us; see the sketch below. |
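A sketch of the two shapes (hypothetical benchmark, not the actual LinqBenchmarks source): with a manual internal loop, each invocation is long, so BDN only invokes the method a couple of times per iteration and the 30-call tiering threshold is reached very late; dropping the internal loop lets BDN pick the invocation count:

using System.Linq;
using BenchmarkDotNet.Attributes;

public class Count00Sketch
{
    // Long per-invocation time -> BDN calls this only a couple of times per iteration.
    [Benchmark]
    public int CountWithInternalIterations()
    {
        int count = 0;
        for (int iter = 0; iter < 1_000_000; iter++)   // manual internal iterations
            count = Enumerable.Range(0, 100).Count(x => x >= 0);
        return count;
    }

    // Short per-invocation time -> BDN chooses a high invocation count, and the
    // method crosses the tiering call-count threshold during warmup.
    [Benchmark]
    public int CountPerInvocation()
        => Enumerable.Range(0, 100).Count(x => x >= 0);
}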
Benchstone.MDBenchI.MDLogicArray.Test: Default [link], PGO [link]. The recent regression is likely from #84312. This test also suffers from repeated tiering up (#84517).
However, it's not clear that is the root cause of the perf issues. Profiles show
(here So with PGO that final OSR compile evidently inlines Suspect this is the same issue as in
So let's try with the fix from #84817:
Things look better but still not at par. Will move on to another benchmark; may come back to this one. Lab charts are finally getting updated; here's the latest: |
Benchstone.BenchI.CSieve.Test: Default [link], PGO [link]
Performance seems to be at parity now |
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj: Default [link], PGO [link]
PGO seems comparable |
System.Perf_Convert.FromBase64String: Default [link], PGO [link]
Can repro locally:
Looks like, with PGO, this may be a case of the issue noted in #84319, where when we invert loops we don't get the profile updates done properly. Here we have a hot loop:
Without PGO we end up compacting this (somehow), but with PGO we end up leaving one of the cold blocks inline:
;; default
align [12 bytes for IG03]
G_M650_IG03: ;; offset=0020H
movzx r9, word ptr [rcx]
add rcx, 2
cmp r9d, 32
jbe SHORT G_M650_IG12
;; size=14 bbWeight=348.23 PerfScore 1218.79
G_M650_IG04: ;; offset=002EH
cmp r9d, 61
je SHORT G_M650_IG11
;; size=6 bbWeight=347.83 PerfScore 434.79
G_M650_IG05: ;; offset=0034H
cmp rcx, rax
jb SHORT G_M650_IG03
...
G_M650_IG11: ;; offset=0069H **COLD**
dec edx
inc r8d
jmp SHORT G_M650_IG05
;; pgo
align [8 bytes for IG06]
G_M36625_IG06: ;; offset=0060H
movzx rax, word ptr [rdx]
add rdx, 2
cmp eax, 32
jbe G_M36625_IG15
cmp eax, 61
jne SHORT G_M36625_IG08
;; size=21 bbWeight=35.07 PerfScore 166.59
G_M36625_IG07: ;; offset=0075H **COLD**
dec ecx
inc ebp
;; size=4 bbWeight=0.98 PerfScore 0.49
G_M36625_IG08: ;; offset=0079H
cmp rdx, rbx
jb SHORT G_M36625_IG06
When we invert, the exit block weight ends up being too low:
Verified that the prospective fix from #84319 works:
Lab sees the fix now too: |
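For context on the #84319 issue above, a rough illustration of loop inversion (the C# below is just the conceptual shape, not the JIT's internal flowgraph): the top-of-loop test is duplicated in front of a bottom-tested loop, and the block weights on that duplicated test and on the loop exit have to be updated as part of the transformation:

using System;

static class LoopInversionSketch
{
    // Top-tested loop, as written in source.
    static void TopTested(Func<bool> cond, Action body)
    {
        while (cond()) body();
    }

    // After inversion: a zero-trip test guards a bottom-tested loop. The
    // profile weights of the guard and of the exit block must be recomputed;
    // #84319 is about those updates not being done properly.
    static void Inverted(Func<bool> cond, Action body)
    {
        if (cond())
        {
            do { body(); } while (cond());
        }
    }
}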
If we have a partial profile then the current count reconstruction will adjust the exit likelihood of some loop exit when it hits a capped loop. But for multiple exit loops we might wish to see some profile flow out of all exits, not just one. In `ludcmp` we choose to send all the profile weights down an early return path, leaving the bulk of the method with zero counts. Instead of trying increasingly elaborate repair schemes, we will now use blend mode for these sorts of problems; this gives a more balanced count redistribution. I also updated blend to use the same logic as repair if a block has zero weights, since presumably whatever likelihood was assigned there during reconstruction is not well supported. Fixes the `ludcmp` regression with PGO over no PGO, noted in #84264 (comment)
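A sketch of the multiple-exit-loop shape the commit message describes (hypothetical code, not the actual ludcmp source): with a capped partial profile, the old repair could route all the observed flow out the early-return exit, leaving everything after the loop with zero counts; blending redistributes the counts more evenly across both exits:

static class MultiExitLoopSketch
{
    static int Solve(double[][] a, int n)
    {
        for (int i = 0; i < n; i++)
        {
            // Early exit: reconstruction may decide all profile flow leaves here.
            if (a[i][i] == 0.0)
                return -1;

            // ... elimination work ...
        }

        // Normal exit: under the old repair scheme this path, and everything
        // after the loop, could be left with zero counts.
        return 0;
    }
}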
System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4): Default [link], PGO [link]. The Feb 2023 regression seems to be #82412.
Looks like slightly different PGO data for an
and the prologs:
;; default
push rdi
push rsi
sub rsp, 72
;; pgo
push rdi
push rsi
push rbp
push rbx
sub rsp, 72 |
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB): Default [link], PGO [link]. The May 2022 (small) regression may have been #68804. Locally I can repro, but annoyingly when I profile, PGO is faster.
Note the codegen for
My only guess is that perhaps we now have some PGO data. On both Linux x64 and Ampere arm64, PGO is comparable or a bit faster. |
System.Collections.IndexerSet<Int32>.Dictionary(Size: 512): Default [link], PGO [link]. The Jan 2023 regression looks like #80481. Locally I have PGO consistently faster:
Both default and PGO have since improved and are showing comparable results in the lab. The improvement was one or more of:
The PGO improvement was likely from #85805 |
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!"): Note that in a Windows batch script you may need to use
to get this properly passed to BenchmarkDotNet. Default [link]: the November 2022 improvement was #78593. PGO [link]: the November 2022 regression was #78593. So, oddly enough, these are reacting oppositely to some changes. I see PGO as faster or at least comparable locally.
|
The above covers the initial screen of the 50 worst ratios (as of 3 weeks ago). I am going to add in a few other cases based on a refreshed and slightly broadened query. Broadened: the results above excluded any test that had less than a 2ns reported time. A lot of HW intrinsic/vector tests come in under this and report slower with PGO. So let's look at a few.
System.Numerics.Tests.Perf_Vector3.DotBenchmark: Default [link], PGO [link]. Seems comparable; the PGO result is just a hair greater than zero, but default is zero, so the ratio is infinite. |
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark: Default [link], PGO [link]. The regression appears to be from #83036, which seems odd. This one is still a mystery. Default and PGO seem to have identical code for the benchmark methods, but with release bits PGO reports consistently worse performance. If I drop a checked JIT into the release PGO build, it speeds up and matches the default perf. Suspecting something like the following -- the workload action unroll for release PGO ends up always calling the precode somehow, and this never gets updated/backpatched? Looks like the issue was that the JIT was not fully initializing some of the SIMD12 vector constants used by this code, causing the high float to pick up random values that might be denorms, which cause huge stalls. Fix is #85362 |
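A minimal sketch of the benchmark shape (hypothetical, not the actual Perf_Vector3 code): Vector3 is a 12-byte SIMD12 value that lives in a 16-byte register, so a constant like the normal below has a hidden fourth lane that the JIT must zero; if it is left uninitialized it can hold a denormal and stall the math:

using System.Numerics;

static class ReflectSketch
{
    static Vector3 Reflect(Vector3 incident)
    {
        // 12-byte constant; the unused upper lane of the backing 16-byte
        // register must be zeroed by the JIT (#85362 fixed a case where it wasn't).
        Vector3 normal = new Vector3(0f, 1f, 0f);
        return Vector3.Reflect(incident, normal);
    }
}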
System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000): Default [link] -- the January 2023 improvement is #79346. PGO [link] -- the October 2022 regression is #70941. This repros locally:
But when I enable profiling
Seems like the "fast" run of this test executes R2R code and never tiers up: (also looks like something is broken with R2R rundown as neither PerfView nor my tool can resolve which R2R method this is). Think the issue is in BDN itself: dotnet/BenchmarkDotNet#2296 So, why are we looking at R2R code here? (Potentially if I can catch a slow run in the profiler, it may show hits at Tier0, OSR, or Tier1...will keep trying). Caught a slow run, and despite what I wrote above the "fast" runs look like they are indeed at Tier1.
The hot method is evidently R2R, and so we go to Tier1-instr with PGO, but it doesn't get called enough to ever finish tiering up. If I rev up the iterations...
As you can see, we need to get to iteration 33 to switch to Tier1 with PGO, but by default we only run 15 iterations. And once we switch we get basically the same perf as without PGO. BDN could probably do a better job here of warning you that if there are fewer than, say, 30 invocations (perhaps 60 to be safe) of your benchmark method, you may not see the results you're expecting, and you either need to drop the per-invocation time or up the invocation or iteration counts. |
Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty: Default [link] -- the April 2023 improvement was possibly #84582. PGO [link] -- the October 2022 improvement was #70941. Comparable for me locally:
|
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0): Default [link] -- the recent improvement was from #85277. PGO [link] -- the regression was from something in this range: 887c043...f92b9ef (74 commits). From Azure Data Explorer, the regression was after 2023-04-25T00:29:35Z and before 2023-04-25T03:46:39Z.
Tue Apr 25 03:46:39 2023 f74138f Jeremy Koritzinsky Enable ComWrappers for NativeAOT on non-Windows (#85000)
So it looks like the change that led to this was #85277.
Looks like all the time is spent in the benchmark method. Hopefully this will be easy to track down. Seems like what is happening is that with PGO we see a lot of cold blocks, and so we don't inline some methods that we inline without PGO. One of those takes a struct by address that is also used in hot code, and so we don't fully promote, we dependently promote. And then we make some poor choices in codegen that lead to what is likely a partial stall. Sampling shows a lot of time piling up in this little stretch of code:
G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0
mov eax, dword ptr [rsp+48H]
or al, byte ptr [rsp+49H]
jne G_M000_IG14
For some reason we do a wide load after a narrow store; my recollection is that the HW does not like this at all. In the non-PGO case this code doesn't exist because we know all the values are zero. Looks like we widen this load in
PGO has now improved, thanks to #86491 fixing the store-forwarding stall. Default "regressed" because we enabled PGO by default, then improved with #86491. |
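A sketch of the store-to-load-forwarding hazard in the assembly above, expressed in C# with a hypothetical explicit-layout struct (not the actual Utf8Formatter code): a narrow store immediately followed by a wider overlapping load cannot be forwarded from the store buffer, so the load stalls:

using System.Runtime.InteropServices;

static class StoreForwardSketch
{
    [StructLayout(LayoutKind.Explicit)]
    struct Overlapped
    {
        [FieldOffset(0)] public ushort Low;    // 2-byte field
        [FieldOffset(0)] public uint   Whole;  // 4-byte field overlapping Low
    }

    static bool NarrowStoreWideLoad()
    {
        Overlapped o = default;
        o.Low = 0;               // narrow (16-bit) store, like `mov word ptr [rsp+48H], 0`
        return o.Whole != 0;     // wide (32-bit) load of the same bytes: forwarding stall
    }
}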
Closing this; remaining regressions will be analyzed in #87194. |
Issues uncovered:
- `-p ETW` is not enabling the CLR rundown provider (dotnet/BenchmarkDotNet#2296)
Fixes: