Dynamic PGO Microbenchmark Regressions #87194
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264, which tracked regressions from PGO before it was enabled. The report below is collated from the following autofiling reports.

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or get reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example row 4's results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already). Only the 54 entries with recent scores > 1.4 are included; this leaves off approximately 220 more rows with scores between 1.0 and 1.4. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle. Each arch/os result is a hyperlink to the performance data graph for that benchmark. Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.

cc @dotnet/jit-contrib
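For clarity, here is a minimal sketch of how such a geomean score can be computed. The tool itself is @EgorBo's, so the method shape and names below are assumptions, not its actual code.

```csharp
using System;
using System.Collections.Generic;

static class ScoreSketch
{
    // Hypothetical: geomean of per-architecture (PGO time / no-PGO time) ratios.
    // An architecture that did not regress (or was not reported) contributes 1.0,
    // matching the assumption described above.
    public static double GeomeanScore(IReadOnlyList<double?> ratiosPerArch)
    {
        double logSum = 0;
        foreach (double? ratio in ratiosPerArch)
            logSum += Math.Log(ratio ?? 1.0);
        return Math.Exp(logSum / ratiosPerArch.Count);
    }
}
```

For example, ratios of 1.6 on linux-x64 and 1.2 on arm64, with win-x64 unreported, would score (1.6 x 1.2 x 1.0)^(1/3), roughly 1.24.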
System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)
System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)

These benchmarks were analyzed before PGO was enabled: #84264 (comment). BDN's strategy doesn't run the benchmark enough, because each iteration is long-running, and so (since the key benchmark methods are R2R'd) the test ends up measuring tier1-instr code.
System.Text.Json.Tests.Perf_Get.GetInt16
System.Text.Json.Tests.Perf_Get.GetSByte
System.Text.Json.Tests.Perf_Get.GetByte

Likely the analysis from #84264 (comment) is still relevant and explains the regressions in the related tests as well: we run out of inlining budget, in part because the benchmark method is small and there are quite a few large aggressive-inline methods, and so we're unable to do some key inlines. Tracking issue for this is #85531. Some of these improved with #86551.
System.Memory.Span<Int32>.EndsWith(Size: 4)
System.Memory.Span.SequenceEqual(Size: 4)

These two appear to be arm64-specific.

So the issue appears to be in … Seems like PGO and non-PGO codegen is the same. The method is R2R'd and with PGO we do a tier1-instr compile, but do not add any probes. Explanation: this method is an intrinsic and not on the whitelist. Fixing that (hack) gives:

and more broadly:

But that does not explain why there is a PGO regression. And as you can see, the Size=4 results are not very stable. Similar diffs for …
System.Memory.Span<Int32>.SequenceCompareToDifferent(Size: 512)

Also seems to be arm64-specific.
Benchstone.BenchF.InvMt.Test

This one is more substantially regressed on amd64 HW... This doesn't repro on my local Zen3 box:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22000.2176/21H2/SunValley)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms

Perf lab is running a Ryzen 7 3700 PRO.
System.Tests.Perf_Random.NextSingle
Profiling
Looks like with PGO we inline InternalSample, and this hurts perf. Why? Root cause is lack of if-conversion in InternalSample, once PGO has inlined it into Random+Net5CompatSeedImpl.NextSingle. (with DOTNET_JitDoIfConversion=0)

@jakobbotsch here's an example where not if-converting in loops is painful. PGO undoes the improvements that came in with #81267. ARM64 does not seem to be affected for some reason. #79101 may show a similar problem.
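For context, a minimal sketch of the kind of branch that if-conversion rewrites into a branchless conditional move (cmov/csel). This is an illustration of the pattern, not the actual Random.InternalSample code.

```csharp
using System;

static class IfConversionSketch
{
    // A short, data-dependent branch like this is a candidate for if-conversion:
    // instead of a conditional jump, the JIT can emit a conditional move, which
    // avoids branch mispredictions when the condition is hard to predict.
    // The regression above occurs when this shape ends up inside a loop after
    // inlining and the if-conversion is not performed there.
    public static int ClampSample(int sample, int maxValue)
    {
        if (sample == maxValue)
            sample = maxValue - 1;
        return sample;
    }
}
```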
System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSingleSegment
Same set of inlines, similar optimizations. Main delta is code layout, in both First and in ChkCastClassSpecial. Not clear why this causes such a big perf diff as the profile data should be accurate. The issue is that by default we don't profile casts, and without profile data, we assume casts will fall back to the helper 25% of the time. This leads the jit to move all the cast calls to the end of the method, and so each call site must jump to the call and then jump back into the regular flow:

;; base
;; size=41 bbWeight=0.50 PerfScore 7.12
G_M11072_IG110: ;; offset=0C55H
mov rdx, r8
call [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
;; size=9 bbWeight=0.12 PerfScore 0.41
G_M11072_IG111: ;; offset=0C5EH
mov rdx, gword ptr [rax+18H]

;; diff
jne SHORT G_M11072_IG52
;; size=45 bbWeight=0.50 PerfScore 7.12
G_M11072_IG49: ;; offset=0863H
mov rdx, gword ptr [rax+18H]
...
G_M11072_IG52: ;; offset=08B8H
mov rdx, r8
call [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
jmp SHORT G_M11072_IG49

BBWeight of IG52 in the diff comes from QMARK expansion; we assume a 25% chance? With …
@EgorBo where did we end up in the evaluation of cast profiling?
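To make the cast-layout pattern above concrete, here is a hedged sketch of the general shape (hypothetical code, not the actual ReadOnlySequence<T>.First internals): a cast on a hot path that in practice never hits the slow helper.

```csharp
using System;

static class CastLayoutSketch
{
    // Each cast either succeeds via a quick type check or falls back to a cast
    // helper such as CORINFO_HELP_CHKCASTCLASS_SPECIAL. Without cast profiling
    // the JIT assumes the fallback runs ~25% of the time and moves the helper
    // calls to the end of the method, so every cast site jumps out to the call
    // and back, as in the disassembly above.
    public static int SumLengths(object[] items)
    {
        int total = 0;
        for (int i = 0; i < items.Length; i++)
        {
            // In this sketch every element really is a string, so the helper
            // fallback never executes at run time.
            string s = (string)items[i];
            total += s.Length;
        }
        return total;
    }
}
```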
Burgers.Test1

This one is pretty stable except on linux x64:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms

Can't repro this one locally. I think I have the same HW as the lab (i7-8700) but not sure. Also does not repro on my old Sandy Bridge:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 20.04
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms
I think it should be good to enable it for complex types of casts, but not in .NET 8.0 (afair, we don't profile all types of casts + never checked the actual codegen size overhead from it).
System.MathBenchmarks.Single.Min

Seems like it is bimodal and just happened to flip around the time we enabled Dynamic PGO. Similar for Max. Recent regressions are #88482.
MicroBenchmarks.Serializers.Json_FromStream<MyEventsListerViewModel>.DataContractJsonSerializer_

This is not a pure regression but rather an increase in variance (a la #87324), with lower lows and higher highs -- but note this is only the case for windows x64 intel (and arm64); Linux is relatively stable and faster with PGO, and windows on amd64 seems ok as well.
System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers)

Regression is linux-x64 only; win-x64 benefits from PGO, arm64 is more or less unchanged. I can repro the regression on WSL2, however there aren't great profiling tools available there.

Nominal profile (from win-x64) is:

Using BDN's …

Actually, looking at the profile data, it seems wrong. A bit of digging turned up a bug in the 32-bit counter helper on linux; see #89340. Fixing that doesn't alter the codegen, it just makes the loop seem a bit less hot (but still very hot).
System.Collections.Sort<IntStruct>.List(Size: 512)

(see also #84264 (comment)) Did not regress on windows-x64, but did everywhere else.

System.Collections.Sort<IntStruct>.Array(Size: 512)

Ditto. These two are quite likely the same issue. Last I looked into sorting, it was layout related; let's look again.
System.Collections.CtorFromCollection<Int32>.ConcurrentBag(Size: 512)

Improved on x64, regressed on arm64. Ah, likely the same issue with barriers as noted below: #87194 (comment).
System.Tests.Perf_HashCode.Combine_1

amd64 only. Does not repro on my local Zen3 machine:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms

In fact the whole suite looks pretty good, save perhaps …

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms
System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)

Win-x64 only. This one repros:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms
Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don't inline it, and so V05 ends up getting address exposed and has more expensive copy semantics. @EgorBo example where not doing an inline in a cold block impacts codegen in a hot block.
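Here is a hedged sketch of that pattern (hypothetical types and names, standing in for the V05 struct local): a struct local handed to a call on a rarely-taken path. If that call is marked rare and not inlined, taking the local's address there pessimizes its copies on the hot path as well.

```csharp
using System;

static class ColdInlineSketch
{
    // Hypothetical struct standing in for the V05 local mentioned above.
    private readonly struct Position
    {
        public readonly object Segment;
        public readonly int Index;
        public Position(object segment, int index) { Segment = segment; Index = index; }
    }

    private static void ThrowOutOfRange(in Position pos) =>
        throw new ArgumentOutOfRangeException(nameof(pos), pos.Index, "index out of range");

    public static object HotPath(object segment, int index, int length)
    {
        Position pos = new Position(segment, index);

        // Rarely-true guard: if PGO marks this block as rare and the call below
        // is not inlined, passing `in pos` takes the local's address. Once a
        // V05-like local is address exposed, its copies get more expensive
        // everywhere, including the hot path.
        if ((uint)index > (uint)length)
            ThrowOutOfRange(in pos);

        return pos.Segment;
    }
}
```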
System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings)

x64 linux only. Likely the same as #87194 (comment).
Span.Sorting.QuickSortSpan(Size: 512)

Windows intel x64 only.
At first look, it seems like BDN is not iterating this enough... we are measuring Tier0 code.
The per-iteration times tell a similar story:
Where for diff (pgo) we will eagerly instrument, so the tier0 code will be slower. But even so, the optimized code is slower... @adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms); see the sketch after the disassembly below. Think this might be caused by JCC errata. Main optimizations are identical, but code layout differs.

;;
cmp dword ptr [rbx+4*r8], edx
jge SHORT G_M24415_IG04
;; size=17 bbWeight=5.09 PerfScore 27.98
G_M24415_IG07: ;; offset=0041H

and note how in diff that …
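As a follow-up to the iteration-count suggestion above, here is a hedged BenchmarkDotNet sketch of what that could look like; the class name, values, and sort body are illustrative, not the actual dotnet/performance benchmark.

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class QuickSortSpanSketch
{
    private int[] _values = Array.Empty<int>();
    private int[] _scratch = Array.Empty<int>();

    [Params(512)]
    public int Size { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _values = new int[Size];
        _scratch = new int[Size];
        for (int i = 0; i < Size; i++) _values[i] = rng.Next();
    }

    // 10_000 is the illustrative value from the comment above: at 1_000 invocations
    // each measured interval is only ~10 ms, short enough that tier-0 / instrumented
    // code can dominate what gets measured.
    [Benchmark]
    [InvocationCount(10_000)]
    public void QuickSortSpan()
    {
        _values.CopyTo(_scratch, 0);
        _scratch.AsSpan().Sort();
    }
}
```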
System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple)

Also win-x64 only. Same underlying issue as #87194 (comment).
System.Collections.CtorFromCollection<String>.ConcurrentBag(Size: 512)

arm64 only. Repros on my volterra:

BenchmarkDotNet v0.13.7-nightly.20230724.45, Windows 11 (10.0.22621.2070/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true  IterationTime=250.0000 ms

So the issue is evidently in … In the base (no-pgo) jit we form a CSE and this lets us use …

In base there are just two sampling hot spots, both tied to …

In diff there are more hot spots, and all hottest samples are near the …

@EgorBo example we were chatting about.
I took a look at the benchmark source code and it's safe to do it (the benchmark ID won't change, as the …

While there are still a few benchmarks where the analysis is unclear, they are isolated to specific OS/ABI combinations. So I'm going to close this out.

Please keep in mind that both bubble sort and …

Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

Also note that bubble sort runs for a very long time, and so likely BDN + lab customization is not reliably measuring the tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.

Feel free to add this kind of thing to #87324.

I will do that going forward, didn't know about that issue :)
This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264, which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or get reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already). Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores of 1.3 or lower. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~

[edit: had to regenerate the table once already, as the scoring logic was off]

[edit: have x64 win intel data now, new table. Note current results have shifted, so the table is somewhat different...]
cc @dotnet/jit-contrib
[Regression score table: per-benchmark Score / Recent Score / Orig Score values and per-arch links; the benchmark-name and link columns were lost in extraction, leaving only the numeric scores.]