
Investigate microbenchmarks that regress with PGO enabled #84264

Closed
AndyAyersMS opened this issue Apr 3, 2023 · 49 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI PGO tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 3, 2023
@AndyAyersMS AndyAyersMS self-assigned this Apr 3, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 3, 2023
@ghost

ghost commented Apr 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Look into cases where microbenchmarks are slower with PGO enabled.

#74873 (comment) has an initial list to check.


@AndyAyersMS AndyAyersMS added tenet-performance-benchmarks Issue from performance benchmark PGO and removed untriaged New issue has not been triaged by the area owner labels Apr 3, 2023
@AndyAyersMS
Member Author

AndyAyersMS commented Apr 3, 2023

PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo

Default [link]

PGO [link]

Improvements in both on 12/4 were from #79060.
Regression in PGO on 1/13 was from #80481.

Looks like quite a few tests were impacted by #80481. So I'll just focus on that for now.

@EgorBo
Member

EgorBo commented Apr 3, 2023

@AndyAyersMS I think that CastHelpers should not be instrumented, since they were already carefully optimized for the general case by @VSadov, e.g.:

private static object? IsInstanceOfClass(void* toTypeHnd, object? obj)
{
    if (obj == null || RuntimeHelpers.GetMethodTable(obj) == toTypeHnd)
        return obj;

    MethodTable* mt = RuntimeHelpers.GetMethodTable(obj)->ParentMethodTable;
    for (;;)
    {
        if (mt == toTypeHnd)
            goto done;
        if (mt == null)
            break;
        mt = mt->ParentMethodTable;

        if (mt == toTypeHnd)
            goto done;
        if (mt == null)
            break;
        mt = mt->ParentMethodTable;

        if (mt == toTypeHnd)
            goto done;
        if (mt == null)
            break;
        mt = mt->ParentMethodTable;

        if (mt == toTypeHnd)
            goto done;
        if (mt == null)
            break;
        mt = mt->ParentMethodTable;
    }

    // this helper is not supposed to be used with type-equivalent "to" type.
    Debug.Assert(!((MethodTable*)toTypeHnd)->HasTypeEquivalence);
    obj = null;

done:
    return obj;
}
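The loop in the helper is hand-unrolled 4x so the hot walk up the parent chain takes fewer backward branches. A minimal sketch of the same walk in Python (illustrative only; `MT` and `parent` are stand-ins for `MethodTable` and `ParentMethodTable`, and the null/non-object fast paths are simplified):

```python
class MT:
    """Toy stand-in for the runtime's MethodTable."""
    def __init__(self, parent=None):
        self.parent = parent

def is_instance_of_class(to_type, obj_mt):
    # fast path: exact type match (mirrors the first check in the C# code)
    if obj_mt is to_type:
        return True
    mt = obj_mt.parent
    while True:
        # the C# original unrolls this inner sequence 4x per loop iteration
        for _ in range(4):
            if mt is to_type:
                return True
            if mt is None:
                return False
            mt = mt.parent

animal = MT()
dog = MT(parent=animal)
assert is_instance_of_class(animal, dog)        # Dog derives from Animal
assert not is_instance_of_class(dog, animal)    # but not the other way round
```

The unrolling trades code size for fewer loop-back branches on a path that is expected to be extremely hot, which is exactly why profile instrumentation overhead there is unwelcome.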

@EgorBo
Member

EgorBo commented Apr 3, 2023

PS: we can probably propose a new API [MethodImpl(MethodImplOptions.NoInstrumenting)]

@VSadov
Member

VSadov commented Apr 3, 2023

Yes, we are certain that these methods will be hot and want the best possible codegen right away.

The cast helpers are also initialized in a special way:

void ECall::PopulateManagedCastHelpers()

I do not remember all the details, but there were reasons for that. Maybe it is possible to disable instrumentation for things initialized that way. (I do not have preferences, just want to mention that there could be another way to detect and special-case these.)

@EgorBo
Member

EgorBo commented Apr 3, 2023

> Yes, we are certain that these methods will be hot and want the best possible codegen right away.
>
> The cast helpers are also initialized in a special way:
>
> void ECall::PopulateManagedCastHelpers()
>
> I do not remember all the details, but there were reasons for that. Maybe it is possible to disable instrumentation for things initialized that way. (I do not have preferences, just want to mention that there could be another way to detect and special-case these.)

The ones you listed have the AggressiveOptimization attribute, am I right? In that case we won't instrument them, but there are other (less important) members in CastHelpers that don't have AggressiveOptimization and so are eligible for instrumentation.

@VSadov
Member

VSadov commented Apr 3, 2023

All the cast helpers are initialized in that method.
If you have other uses and reasons to introduce MethodImplOptions.NoInstrumenting, then this part is unimportant.

It would only be interesting if you somehow do not want MethodImplOptions.NoInstrumenting just for a few special methods and need another way to identify them.

@AndyAyersMS
Member Author

The issue caused by #80481 is with benchmark methods like PerfLabTests.CastingPerf:CheckArrayIsInterfaceNo that trigger OSR by looping a lot internally.

For such methods we never see the instrumented Tier0 method return, so with sparse profiles enabled, Tier1 profile reconstruction thinks the entry block runs rarely, and this messes up optimizations. E.g., in this case we fail to hoist/CSE the cast out of the loop.

Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0          0    [000..006)-> BB03 (always)                     rare IBC 
BB02 [0001]  1       BB03                 5630k 5630382    [006..018)                                     bwd bwd-target IBC 
BB03 [0002]  2       BB01,BB02            5630k 5630382    [018..020)-> BB02 ( cond )                     bwd bwd-src IBC 
BB04 [0003]  1       BB03                  0          0    [020..022)        (return)                     rare IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The fix I'm proposing is to run synthesis repair in such cases. It addresses the above case, looking now to see what else it does.
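The failure mode can be modeled with a toy sketch (my own illustration, not JIT code; `reconstruct_entry_count` is a hypothetical helper). Sparse instrumentation infers the method entry count from probes on return blocks, so a Tier0 body that always escapes to OSR before returning reports an entry count of zero even though its interior loop blocks are clearly hot:

```python
def reconstruct_entry_count(return_probe_counts, interior_counts):
    """Entry count as a sparse reconstruction would see it:
    flow conservation implies entry ~= sum of observed returns."""
    entry = sum(return_probe_counts)
    # zero entry but nonzero interior counts is the broken case that
    # synthesis "repair" is meant to patch up
    needs_repair = entry == 0 and any(c > 0 for c in interior_counts)
    return entry, needs_repair

# Long-running loop: millions of iterations recorded at Tier0, but the
# instrumented method never returned (the OSR version did, uninstrumented).
entry, needs_repair = reconstruct_entry_count(
    return_probe_counts=[0],
    interior_counts=[5_630_382, 5_630_382])
assert entry == 0 and needs_repair   # -> run synthesis in repair mode
```

This matches the dump above, where BB01 (entry) and BB04 (return) are marked `rare` with count 0 while the loop blocks BB02/BB03 carry ~5.6M.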

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Apr 4, 2023
In particular, run synthesis in repair mode for cases where there are profile
counts within the method but zero counts in `fgFirstBB`.

Recall that sparse profiling effectively probes return blocks to determine the
method entry count.

So the zero-entry but not zero-everywhere case can happen if we have a method
with a very long running loop plus sparse profiling plus OSR -- we will only
get profile counts from the instrumented Tier0 method, and it will never return
(instead it will always escape to an OSR version which will eventually return,
but that version won't be instrumented).

I originally was a bit more ambitious and ran repair for a broader set of
reconstruction issues, but that led to a large number of diffs, in part because
repair doesn't cope well with irreducible loops.

Leaving the entry count zero can have fairly disastrous impact on the quality
of optimizations done in the method.

Addresses quite a few of the worst-performing benchmarks in dotnet#84264.
@AndyAyersMS
Member Author

Also noticed #84319 while reviewing the dumps for CheckArrayIsInterfaceNo.

AndyAyersMS added a commit that referenced this issue Apr 4, 2023
@AndyAyersMS
Member Author

AndyAyersMS commented Apr 6, 2023

After #84312

Updated table of worst-case results (just one day of data, so possibly noisy; e.g., PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo is bimodal but hasn't shown fast results for PGO yet).

[now showing the "fast time"]


Investigated cases in bold, notes linked -- skipping tests in italics that look similar to ones already analyzed.

TestName Pgo/NoPgo Note
System.Text.Json.Tests.Perf_Get.GetByte [notes] 0.378284924 budget #85531
System.Text.Json.Tests.Perf_Get.GetUInt16 0.39703163 dup
System.Text.Json.Tests.Perf_Get.GetInt16 0.415271098 dup
System.Text.Json.Tests.Perf_Get.GetSByte 0.432559076 dup
System.Text.Json.Tests.Perf_Get.GetUInt32 0.485751995 dup
System.Text.Json.Tests.Perf_Get.GetInt32 0.496368517 dup
PerfLabTests.CastingPerf.CheckArrayIsInterfaceYes 0.500523372 dup
PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo [notes] improvements 0.500571243 fixed #84312
System.Collections.IterateFor<String>.ImmutableArray(Size: 512) [notes] 0.521820448 suspect noise
Exceptions.Handling.MultipleNestedTryCatch_FirstCatches(kind: Hardware) [notes] 0.587834042 noise
Exceptions.Handling.ThrowAndCatchWhenFinally(kind: Software) 0.629162936 dup
System.Collections.Sort<Int32>.Array(Size: 512) [notes] 0.691174178 layout
Exceptions.Handling.ThrowAndCatch(kind: ReflectionSoftware) 0.712891743 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen) [notes] 0.714508317 budget #85531
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToUtf8Bytes(Mode: SourceGen) 0.716100799 dup
System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512) [notes] 0.716735601 deferred
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToString(Mode: SourceGen) 0.72480144 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToStream(Mode: SourceGen) 0.728309251 dup
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeObjectProperty(Mode: SourceGen) 0.732178172 dup
Exceptions.Handling.ThrowAndCatch_ManyCatchBlocks(kind: ReflectionSoftware) 0.734050863 dup
Benchstone.BenchI.MulMatrix.Test [notes] 0.746493091 repair #84741
LinqBenchmarks.Count00ForX [notes] 0.75390536 #84517
Benchstone.MDBenchI.MDLogicArray.Test [notes] 0.755000226 repair
Span.Sorting.QuickSortSpan(Size: 512) [notes] 0.755771373 layout
System.IO.Tests.Perf_Path.GetDirectoryName [notes] 0.762018237 unclear
System.Collections.CtorFromCollection<Int32>.SortedSet(Size: 512) [notes] 0.784138758 noise
MicroBenchmarks.Serializers.Xml_ToStream<MyEventsListerViewModel>.XmlSerializer_ [notes] 0.785627854 noise
System.Text.Json.Tests.Perf_Segment.ReadMultiSegmentSequence(segmentSize: 4096, TestCase: Json4KB) [notes] 0.788119346 budget #85531
System.Text.Json.Tests.Perf_Get.GetUInt64 0.791121821 dup
System.Memory.Span<Byte>.SequenceEqual(Size: 33) [notes] 0.791880016 intrinsic #85130
System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, IgnoreNonSpace)) [notes] 0.798755609 noise
Benchstone.MDBenchF.MDInvMt.Test [notes] 0.80080933 Fixed #84741
PerfLabTests.CastingPerf.ObjObjIsFoo [notes] 0.806031322 noise
System.Memory.Span<Byte>.StartsWith(Size: 512) [notes] 0.806253739 noise
Exceptions.Handling.TryAndFinallyDeep(kind: Software) 0.81026551 dup
System.Collections.ContainsFalse<String>.ImmutableArray(Size: 512) [notes] 0.812478704 noise
System.Memory.Span<Byte>.SequenceEqual(Size: 512) 0.821172947 dup
Benchstone.BenchF.InvMt.Test [notes] 0.822482883 fixed #84741
ByteMark.BenchLUDecomp [notes] 0.822880993 blend #85171
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj [notes] 0.82354731 fixed
System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4) [notes] 0.834542277 LSRA block order
PerfLabTests.LowLevelPerf.GenericGenericMethod [notes] 0.836391889 noise
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB) [notes] 0.836471477 unclear
System.Perf_Convert.FromBase64String [notes] 0.839094252 maintenance #85265
System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 400000, unique: 1, cacheSize: 15) [notes] 0.841496573 noise
Benchstone.BenchI.CSieve.Test [link] 0.842229476 Fixed #84741
System.Collections.IndexerSet<Int32>.Dictionary(Size: 512) [notes] 0.842651326 noise ?
System.Text.Json.Serialization.Tests.WriteJson<ImmutableSortedDictionary<String, String>>.SerializeToWriter(Mode: SourceGen) 0.84525902 dup
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!") [notes] 0.846315396 noise
System.Text.Json.Tests.Perf_Get.GetInt64 0.847824295 dup

@AndyAyersMS
Member Author

System.Text.Json.Tests.Perf_Get.GetByte

Default [link]

51.97%   2.616E+07   Tier-1   [System.Private.CoreLib]Utf8Parser.TryParseByteD(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&)
33.04%   1.663E+07   Tier-1   [MicroBenchmarks]Perf_Get.GetByte()
09.30%   4.68E+06    Tier-1   [System.Private.CoreLib]Utf8Parser.TryParse(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&,wchar)
02.22%   1.12E+06    Tier-1   [System.Text.Json]Utf8JsonReader..ctor(value class System.ReadOnlySpan`1<unsigned int8>,bool,value class System.Text.Json.JsonReaderState)

PGO [link]

45.71%   2.297E+07   Tier-1   [System.Text.Json]Utf8JsonReader.TryGetByte(unsigned int8&)
31.92%   1.604E+07   Tier-1   [MicroBenchmarks]Perf_Get.GetByte()
19.36%   9.73E+06    Tier-1   [System.Private.CoreLib]Utf8Parser.TryParseByteD(value class System.ReadOnlySpan`1<unsigned int8>,unsigned int8&,int32&)
00.84%   4.2E+05     Tier-1   [System.Text.Json]Utf8JsonReader..ctor(value class System.ReadOnlySpan`1<unsigned int8>,bool,value class System.Text.Json.JsonReaderState)
00.66%   3.3E+05     Tier-1   [System.Text.Json]Utf8JsonReader.TryGetNumber(value class System.ReadOnlySpan`1<unsigned int8>,int32&)

PGO is unable to inline TryGetByte into Perf_Get.GetByte because the inliner runs out of budget doing a lot of earlier inlines that are less profitable. If I increase the budget to 12x (currently it's 10x), then PGO and non-PGO performance match.

    [INL00 IL=0003 TR=000894 060003D1] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonReader:TryGetByte(byref):bool:this

I also noticed Perf_Get.GetByte was jitted twice at instrumented Tier0 (#84517):

 827: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Instrumented Tier0, IL size=62, code size=332, hash=0xb3d5b836]
 907: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Instrumented Tier0, IL size=62, code size=299, hash=0xb3d5b836]
 939: JIT compiled System.Text.Json.Tests.Perf_Get:GetByte():ubyte:this [Tier1 with Dynamic PGO, IL size=62, code size=1802, hash=0xb3d5b836]
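The budget behavior can be sketched with a toy model (my own illustration; `try_inlines` is a hypothetical helper, the real policy is the JIT's ExtendedDefaultPolicy, and the costs below are made up). The budget scales with the root method's initial size estimate, and once cumulative inlining cost exceeds it, later discretionary inlines fail with "inline exceeds budget" even when they are the most profitable ones:

```python
def try_inlines(initial_time, candidates, multiplier=10):
    """Walk candidates in order, inlining until the budget is exhausted."""
    budget = initial_time * multiplier
    time, results = initial_time, []
    for name, cost in candidates:
        if time + cost > budget:
            results.append((name, "FAILED: inline exceeds budget"))
        else:
            time += cost
            results.append((name, "INLINED"))
    return results

# With the 10x multiplier, earlier inlines consume the budget and a hot
# late call site (like TryGetByte above) misses out:
outcome = try_inlines(315, [("early_inlines", 2700), ("TryGetByte", 600)])
assert outcome[1] == ("TryGetByte", "FAILED: inline exceeds budget")

# Raising the multiplier (e.g. to 12x, as in the experiment above)
# leaves room for the later inline:
outcome12 = try_inlines(315, [("early_inlines", 2700), ("TryGetByte", 600)],
                        multiplier=12)
assert outcome12[1] == ("TryGetByte", "INLINED")
```

The `initialTime=315` value mirrors the budget dump shown later in this thread; the candidate costs are purely illustrative.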

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 10, 2023

System.Collections.IterateFor.ImmutableArray(Size: 512)

Default [link]

PGO [link]


This one seems sporadically bimodal. Good results show PGO and default are at par.

My local runs have PGO faster than default. Will defer investigation for a while to see if PGO runs ever go back to the "fast" level.


Latest runs show PGO perf has indeed come back down

@AndyAyersMS
Member Author

Exceptions.Handling.MultipleNestedTryCatch_FirstCatches(kind: Hardware)

Default [link]

PGO [link]


Also seems bimodal, and there's no real difference between non-PGO and PGO.

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 11, 2023

System.Collections.Sort.Array(Size: 512)

Default [link]

PGO [link]


Both fluctuate, but PGO has worse behavior.

By default this test doesn't iterate enough and is mainly measuring R2R codegen. Sadly, both PerfView and my tool are unable to resolve the R2R method; I'm not sure why.

51.77%   3.07E+06    native   System.Private.CoreLib.dll
27.66%   1.64E+06    Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].PickPivotAndPartition(value class System.Span`1<!0>)
14.84%   8.8E+05     Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].InsertionSort(value class System.Span`1<!0>)
02.53%   1.5E+05     Tier-1   [System.Private.CoreLib]System.Collections.Generic.GenericArraySortHelper`1[System.Int32].IntroSort(value class System.Span`1<!0>,int32)

If I bump the iteration count up to 200, there is a more consistent small regression with PGO:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
Array Job-YGWDQU Default 512 4.594 us 0.1309 us 0.5542 us 4.455 us 4.120 us 8.563 us 1.00 0.00 - NA
Array Job-AEEVAG PGO 512 5.128 us 0.1241 us 0.5255 us 5.201 us 2.996 us 7.711 us 1.13 0.16 - NA

And it looks like it is related to the codegen in PickPivotAndPartition.

Perf vs. BDN iteration number for PGO: by default the lab does 15 iterations, and IIRC we report the median, so results might toggle back and forth between fast and slow like we are seeing above, depending on exactly when the switchover happens.
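A small illustration of that median toggling (my own sketch; the fast/slow values and the 15-iteration count are just modeling the description above):

```python
import statistics

def lab_result(iters_before_switch, fast=4.4, slow=8.5, total=15):
    """Median over a run that switches from the slow mode to the fast
    mode partway through its iterations."""
    samples = [slow] * iters_before_switch + [fast] * (total - iters_before_switch)
    return statistics.median(samples)

# If the switchover happens early, a majority of iterations are fast
# and the reported median is the fast time; if it happens late, the
# median snaps to the slow time -- hence the bimodal charts.
assert lab_result(5) == 4.4    # switch early -> reported as fast
assert lab_result(10) == 8.5   # switch late  -> reported as slow
```

So a per-run change in exactly when tiering/OSR settles can flip the reported number between two stable-looking levels with nothing in between.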

Seems like PGO behavior changed as a result of #80769.

The issue seems to be that fgReorderBlocks sees slightly different profile weights and leaves a colder block, BB10, intermixed with hotter ones:

;; default (before)

BB10 [0092]  1       BB08                  0.31   33    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.31   33    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.31   33    [029..02A)-> BB05 (always)                     i IBC 
BB20 [0014]  2       BB11,BB18             1     108    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  1     108    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1     108    [125..137)        (return)                     i IBC 

;; default (after)

BB20 [0014]  2       BB11,BB18             1     108    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  1     108    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1     108    [125..137)        (return)                     i IBC 
BB10 [0092]  1       BB08                  0.31   33    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.31   33    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.31   33    [029..02A)-> BB05 (always)                     i IBC 

;; pgo (before)

BB10 [0092]  1       BB08                  0.33   19513    [037..038)-> BB11 (always)                     i IBC 
BB07 [0056]  1       BB05                  0.33   19513    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.33   19513    [029..02A)-> BB05 (always)                     i IBC 
BB20 [0014]  2       BB11,BB18             1      59808    [113..11D)-> BB22 ( cond )                     i IBC 
BB21 [0015]  1       BB20                  0.98   58720    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1      59808    [125..137)        (return)                     i IBC 

;; pgo (after)

BB20 [0014]  2       BB11,BB18             1      59808    [113..11D)-> BB22 ( cond )                     i IBC 
BB23 [0193]  1       BB20                  0.98            [???..???)-> BB21 (always)                     internal 
BB10 [0092]  1       BB08                  0.33   19513    [037..038)-> BB11 (always)                     i IBC 
BB21 [0015]  1       BB23                  0.98   58720    [11D..125)                                     i IBC 
BB22 [0016]  2       BB20,BB21             1      59808    [125..137)        (return)                     i IBC 
BB07 [0056]  1       BB05                  0.33   19513    [030..031)-> BB08 (always)                     i IBC 
BB04 [0020]  1       BB02                  0.33   19513    [029..02A)-> BB05 (always)                     i IBC 

The default case is using static PGO; from what I can tell all the counts are proper, so this is just a divergence in profile data leading to a change in code layout.

There doesn't seem to be an easy way to tweak this algorithm, so I'm going to leave the analysis here and move on.

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 12, 2023

System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen)

Default [link]

PGO [link]

Locally

Method Job Toolchain Mode Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
SerializeToWriter Job-GYSUOG Default SourceGen 9.702 us 0.4368 us 0.5030 us 9.678 us 9.033 us 11.18 us 1.00 0.00 - NA
SerializeToWriter Job-HDVYVV PGO SourceGen 13.188 us 0.4572 us 0.5265 us 13.011 us 12.641 us 14.62 us 1.36 0.07 - NA

Suspect this is another case where we run out of budget during inlining. The root level method that contains all the benchmark bits is MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String])

and the respecive ETL profiles are

;; default

02.56%   1.31E+06    ?        Unknown
20.78%   1.065E+07   Tier-1   [MicroBenchmarks]SystemTextJsonSourceGeneratedContext.ImmutableDictionaryStringStringSerializeHandler(class System.Text.Json.Utf8JsonWriter,class System.Collections.Immutable.ImmutableDictionary`2<class System.String,class System.String>)
12.69%   6.5E+06     Tier-1   [System.Text.Json]Utf8JsonWriter.WriteStringMinimized(value class System.ReadOnlySpan`1<wchar>,value class System.ReadOnlySpan`1<wchar>)
12.57%   6.44E+06    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncodeSsse3(wchar*,unsigned int)
08.39%   4.3E+06     Tier-1   [System.Private.CoreLib]Utf8.FromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&,bool,bool)
07.59%   3.89E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii(wchar*,unsigned int8*,unsigned int)
05.35%   2.74E+06    Tier-1   [System.Private.CoreLib]Utf8Utility.TranscodeToUtf8(wchar*,int32,unsigned int8*,int32,wchar*&,unsigned int8*&)
05.21%   2.67E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].PushLeft(class System.Collections.Immutable.SortedInt32KeyNode`1<!0>)
04.37%   2.24E+06    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncode(value class System.ReadOnlySpan`1<wchar>)
04.25%   2.18E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+HashBucket+Enumerator[System.__Canon,System.__Canon].MoveNext()
03.40%   1.74E+06    Tier-1   [System.Text.Encodings.Web]DefaultJavaScriptEncoder.FindFirstCharacterToEncode(wchar*,int32)
03.28%   1.68E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii_Intrinsified(wchar*,unsigned int8*,unsigned int)
03.10%   1.59E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].Use(!!0&)
02.05%   1.05E+06    Tier-1   [System.Collections]System.Collections.Generic.Stack`1[System.Collections.Immutable.RefAsValueType`1[System.__Canon]].Pop()
02.03%   1.04E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].IsOwned(!!0&)

;; pgo

02.87%   2.07E+06    ?        Unknown
16.01%   1.154E+07   Tier-1   [System.Text.Json]Utf8JsonWriter.WriteString(class System.String,class System.String)
15.82%   1.14E+07    Tier-1   [System.Text.Encodings.Web]OptimizedInboxTextEncoder.GetIndexOfFirstCharToEncode(value class System.ReadOnlySpan`1<wchar>)
08.81%   6.35E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].MoveNext()
07.15%   5.15E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon].MoveNext()
05.70%   4.11E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].get_Value()
05.61%   4.04E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii(wchar*,unsigned int8*,unsigned int)
05.54%   3.99E+06    Tier-1   [System.Private.CoreLib]Utf8.FromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&,bool,bool)
05.07%   3.65E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].IsOwned(!!0&)
04.88%   3.52E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.ImmutableDictionary`2+HashBucket+Enumerator[System.__Canon,System.__Canon].MoveNext()
04.44%   3.2E+06     native   coreclr.dll
03.84%   2.77E+06    Tier-1   [System.Private.CoreLib]Utf8Utility.TranscodeToUtf8(wchar*,int32,unsigned int8*,int32,wchar*&,unsigned int8*&)
03.64%   2.62E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SecurePooledObject`1[System.__Canon].Use(!!0&)
03.03%   2.18E+06    Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].get_Current()
02.44%   1.76E+06    Tier-1   [System.Private.CoreLib]Ascii.NarrowUtf16ToAscii_Intrinsified(wchar*,unsigned int8*,unsigned int)
02.01%   1.45E+06    Tier-1   [MicroBenchmarks]SystemTextJsonSourceGeneratedContext.ImmutableDictionaryStringStringSerializeHandler(class System.Text.Json.Utf8JsonWriter,class System.Collections.Immutable.ImmutableDictionary`2<class System.String,class System.String>)
00.97%   7E+05       Tier-1   [System.Collections]System.Collections.Generic.Stack`1[System.Collections.Immutable.RefAsValueType`1[System.__Canon]].Pop()
00.93%   6.7E+05     Tier-1   [System.Collections.Immutable]System.Collections.Immutable.SortedInt32KeyNode`1+Enumerator[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.__Canon,System.__Canon]].System.Collections.Immutable.ISecurePooledObjectUser.get_PoolUserId()

These are tricky to match up because of drastic inlining differences, but if we just look at the top-level call sites in the root method, we see:

;; default

Inlines into 06001DF8 [via ExtendedDefaultPolicy] MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String]):
  [INL00 IL=0004 TR=000045 0600069F] [FAILED: call site: unprofitable inline] System.Text.Json.Utf8JsonWriter:WriteNullValue():this
  [INL01 IL=0011 TR=000005 06000564] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteStartObject():this
  [INL02 IL=0017 TR=000007 0600027A] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Immutable.ImmutableDictionary`2[System.__Canon,System.__Canon]:GetEnumerator():System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:this
  [INL23 IL=0027 TR=000019 060002F7] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:get_Current():System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:this
  [INL33 IL=0036 TR=000026 06007DD0] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Key():System.__Canon:this
  [INL34 IL=0043 TR=000030 06007DD1] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Value():System.__Canon:this
  [INL35 IL=0048 TR=000036 06000635] [INLINED: call site: profitable inline] System.Text.Json.Utf8JsonWriter:WriteString(System.String,System.String):this
  [INL81 IL=0055 TR=000012 060002F9] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:MoveNext():bool:this
  [INL101 IL=0072 TR=000041 060002FB] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:Dispose():this
  [INL109 IL=0079 TR=000038 0600057A] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteEndObject():this

Budget: initialTime=315, finalTime=3147, initialBudget=3150, currentBudget=3150
Budget: discretionary inline caused a force inline
Budget: initialSize=2069, finalSize=18734

;; pgo

Inlines into 06001DF8 [via ExtendedDefaultPolicy] MicroBenchmarks.Serializers.SystemTextJsonSourceGeneratedContext:ImmutableDictionaryStringStringSerializeHandler(System.Text.Json.Utf8JsonWriter,System.Collections.Immutable.ImmutableDictionary`2[System.String,System.String]):
  [INL00 IL=0004 TR=000045 0600069F] [FAILED: call site: unprofitable inline] System.Text.Json.Utf8JsonWriter:WriteNullValue():this
  [INL01 IL=0011 TR=000005 06000564] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteStartObject():this
  [INL56 IL=0017 TR=000007 0600027A] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Immutable.ImmutableDictionary`2[System.__Canon,System.__Canon]:GetEnumerator():System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:this
  [INL82 IL=0027 TR=000019 060002F7] [INLINED: call site: profitable inline] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:get_Current():System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:this
  [INL87 IL=0036 TR=000026 06007DD0] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Key():System.__Canon:this
  [INL88 IL=0043 TR=000030 06007DD1] [INLINED: callee: below ALWAYS_INLINE size] System.Collections.Generic.KeyValuePair`2[System.__Canon,System.__Canon]:get_Value():System.__Canon:this
  [INL00 IL=0048 TR=000036 06000635] [FAILED: call site: inline exceeds budget] System.Text.Json.Utf8JsonWriter:WriteString(System.String,System.String):this
  [INL00 IL=0055 TR=000012 060002F9] [FAILED: call site: inline exceeds budget] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:MoveNext():bool:this
  [INL00 IL=0072 TR=000041 060002FB] [FAILED: call site: inline exceeds budget] System.Collections.Immutable.ImmutableDictionary`2+Enumerator[System.__Canon,System.__Canon]:Dispose():this
  [INL89 IL=0079 TR=000038 0600057A] [INLINED: callee: below ALWAYS_INLINE size] System.Text.Json.Utf8JsonWriter:WriteEndObject():this

Budget: initialTime=315, finalTime=3137, initialBudget=3150, currentBudget=3150
Budget: discretionary inline caused a force inline
Budget: initialSize=2069, finalSize=16857

So with PGO we run out of budget for a couple of top-level methods that feature prominently in the profile.

As before, let me try raising the budget to see what that does... I am not suggesting this as a fix, but it might tempt us to look at some kind of priority-first inlining process, where we make sure we're maximizing the effectiveness of the budget.

Looks like a budget value of 18 is sufficient to regain perf. That's quite a jump from where we are now: the non-PGO version is 2811 bytes, while the PGO version with budget=18 is 4664.
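
A rough model of the budget mechanism, for reference (a sketch under assumptions: the multiplier of 10 is inferred from initialBudget=3150 = 10 × initialTime=315 in the dumps above; "time" here is the inliner's abstract cost metric, not wall-clock time):

```python
def over_budget(current_time, candidate_time, initial_time, multiplier=10):
    # The budget scales with the root method's initial estimated "time";
    # a candidate fails with "inline exceeds budget" once accepting it
    # would push the accumulated time past that budget.
    budget = multiplier * initial_time
    return current_time + candidate_time > budget

# Numbers from the PGO dump above: finalTime=3137 against a budget of 3150,
# so almost any further sizable inline is rejected.
assert over_budget(3137, 200, 315)
# A multiplier of 18 (budget 5670) leaves room for the remaining candidates.
assert not over_budget(3137, 200, 315, multiplier=18)
```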

@AndyAyersMS

AndyAyersMS commented Apr 12, 2023

System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512)

Default [link]

newplot (8)

PGO [link]

newplot (9)


The ContainsKey tests are notoriously noisy, but it looks like there were some regressions in PGO that now make it run slower on average.

The first one, on Feb 9, was likely a result of #81557; not clear yet why this helped non-PGO but hurt PGO.

The other jumps are harder to pin down; the one on March 26 is seemingly because of #83484, which we have seen implicated in other changes, though it is hard to see a connection.

Locally I see PGO as frequently faster or at par.

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConcurrentDictionary | Job-WSQNSB | \base-rel\corerun.exe | 512 | 2.478 us | 0.1069 us | 0.1231 us | 2.451 us | 2.312 us | 2.703 us | 1.00 | 0.00 | - | NA |
| ConcurrentDictionary | Job-KIYTHA | \diff-rel\corerun.exe | 512 | 2.223 us | 0.0630 us | 0.0725 us | 2.218 us | 2.123 us | 2.383 us | 0.90 | 0.04 | - | NA |

Profiling shows all time is in these two methods

87.65%   5.464E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.Int32].TryGetValue(!0,!1&)
11.41%   7.11E+06    Tier-1   [MicroBenchmarks]System.Collections.ContainsKeyTrue`2[System.Int32,System.Int32].ConcurrentDictionary()

Going to defer further investigation on this one.


Latest runs show PGO behaving better:
newplot - 2023-04-19T120759 506

@AndyAyersMS

AndyAyersMS commented Apr 12, 2023

Benchstone.BenchI.MulMatrix.Test

Default [link]

newplot (10)

PGO [link]

newplot (11)


Default:

PGO:


All the time in this benchmark is in one method:

;; default

99.16%   4.365E+07   Tier-1   [MicroBenchmarks]MulMatrix.Inner(int32[][],int32[][],int32[][])

;; pgo

99.41%   7.545E+07   Tier-1   [MicroBenchmarks]MulMatrix.Inner(int32[][],int32[][],int32[][])

With PGO we actually jit this method 7 times (#84517 tracks fixing some of this). Only the last one matters for benchmark perf; all the others are no longer active when BDN is measuring (though likely some are active during the warmup/pilot stages, and might perturb BDN's choice of iteration strategy. However, measuring with --apples --iterationCount 20 still shows a regression, so any BDN strategy variance is a second-order effect here).

 832: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Instrumented Tier0, IL size=435, code size=2943, hash=0x482bb336]
 833: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x60 with Dynamic PGO, IL size=435, code size=1323, hash=0x482bb336]
 834: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x2a with Dynamic PGO, IL size=435, code size=1529, hash=0x482bb336]
 835: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Instrumented Tier0, IL size=435, code size=2943, hash=0x482bb336]
 836: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x60 with Dynamic PGO, IL size=435, code size=1323, hash=0x482bb336]
 837: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1-OSR @0x2a with Dynamic PGO, IL size=435, code size=1547, hash=0x482bb336]
 838: JIT compiled Benchstone.BenchI.MulMatrix:Inner(int[][],int[][],int[][]) [Tier1 with Dynamic PGO, IL size=435, code size=1194, hash=0x482bb336]

The dynamic profile only reflects the first loop or two of the method, because those loops are hot enough that we escape to OSR before they finish iterating. Thus at Tier1 we see an incomplete profile in which many blocks that we know must execute have zero counts. This is likely the root cause of the regression from #70941; before that change we would instrument the OSR method and so had a more representative picture of profile data.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           100        100    [000..004)-> BB06 (always)                     IBC 
BB02 [0001]  1       BB06                  100k  100000    [004..008)-> BB04 (always)                     bwd bwd-target IBC 
BB03 [0002]  1       BB04                 8035k 8034927    [008..02A)                                     bwd bwd-target IBC 
BB04 [0003]  2       BB02,BB03            8135k 8134927    [02A..02F)-> BB03 ( cond )                     bwd bwd-src IBC 
BB05 [0004]  1       BB04                  100k  100000    [02F..033)                                     bwd IBC 
BB06 [0005]  2       BB01,BB05             100k  100000    [033..038)-> BB02 ( cond )                     bwd bwd-src IBC 
BB07 [0006]  1       BB06                  0          0    [038..03C)-> BB15 (always)                     rare IBC 
BB08 [0007]  1       BB15                  0          0    [03C..040)-> BB13 (always)                     rare bwd bwd-target IBC 
BB09 [0008]  1       BB13                  0          0    [040..044)-> BB11 (always)                     rare bwd bwd-target IBC 
BB10 [0009]  1       BB11                  0          0    [044..060)                                     rare bwd bwd-target IBC 
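
A minimal model of how this profile shape arises (the threshold value is illustrative; the real patchpoint counters and thresholds live in the runtime): instrumented Tier0 code stops contributing counts as soon as a hot loop escapes to OSR, so every block after that loop records zero.

```python
def run_instrumented(loop_iters, osr_threshold=0x10000):
    # loop_iters: intended iteration counts of successive loops in the method.
    # Counts are only collected while running the instrumented Tier0 body;
    # once a loop's backedge counter hits the patchpoint threshold we escape
    # to (uninstrumented) OSR code and the remaining loops record zero.
    counts = []
    for iters in loop_iters:
        counts.append(min(iters, osr_threshold))
        if iters >= osr_threshold:
            counts.extend([0] * (len(loop_iters) - len(counts)))
            break
    return counts

# Three hot loops: only the first shows up in the profile, matching the
# zero-weight BB07..BB10 rows above.
assert run_instrumented([8_000_000, 8_000_000, 8_000_000]) == [0x10000, 0, 0]
# Loops that finish below the threshold are profiled normally.
assert run_instrumented([100, 100]) == [100, 100]
```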

This looks like a case that should have been handled by #84312 -- because the instrumented methods never return, profile incorporation thinks the entry block weight is zero and schedules repairs; these run but fail to populate the remainder of the method with interesting counts.

Solver: entry block weight is zero
BB01 has zero weight , setting outgoing likelihoods heuristically
...
Incorporated count data had inconsistencies; repairing profile...
Enter blocks: BB01 
...
ccp backedge BB32 (0.5) -> BB33 likelihood 1
For loop at BB33 cyclic weight is 0.5 cyclic probability is 2
...
ccp backedge BB05 (1) -> BB06 likelihood 1
For loop at BB06 cyclic weight is 0.999 cyclic probability is 1000 [capped]
...
Computing block weights
cbw: BB01 :: 100
cbw (loop head): BB06 :: 100000
cbw: BB02 :: 100000
cbw (loop head): BB04 :: 8134927
cbw: BB03 :: 8034927
cbw: BB05 :: 100000
cbw: BB07 :: 0
cbw (loop head): BB15 :: 0
cbw: BB08 :: 0
cbw (loop head): BB13 :: 0
cbw: BB09 :: 0
...

The problem seems to be that we still use the edge likelihoods laid down by profile incorporation, and so we conclude that loops whose exits were never observed truly never exit, despite seeing that they have high $C_p$ values (which we cap). The capping decision does not get fed back into an edge likelihood adjustment.

Some ideas on how to resolve this:

  • for blocks with zero counts, allow repair to reset edge likelihoods (or run profile incorporation after synthesis, so it can use the synthesized likelihoods when it has zero counts and its own likelihoods otherwise). In the example above all the downstream loops (say, the one at BB33) have zero weight, but likelihood analysis thinks that if they are reached they will each iterate twice, since profile incorporation just sets zero-weight blocks' outgoing likelihoods to 50/50.
  • for capped loops, figure out how to adjust loop exit block likelihoods to permit some flow to exit. I recall hitting this problem in the past but not what (if anything) I did about it; likely nothing, as we did not have odd profiles like the ones we're seeing here. It is not clear in general how to adjust individual block likelihoods so that one obtains a desired $C_p$ value. The simplest approach is to hope that the loop has just one exit and make the fix there; but if the loop has multiple exits things are more complicated. Presumably when computing $C_p$ we also have enough data to record the relative likelihood that each exit block is reached per iteration; if one of those is always reached then we can presumably adjust there, but if all exits are conditionally reached (or, even messier, if the exit is from a nested loop) then the computation is not straightforward. And if these various exits all target different blocks, then the choice of how to distribute exit likelihoods across them can have widespread impact.
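
The $C_p$ arithmetic being discussed can be sketched as follows (illustrative only; mirrors the dump lines above where cyclic weight 0.5 gives $C_p = 2$ and cyclic weight 0.999 hits the cap of 1000):

```python
CAP = 1000.0

def cyclic_probability(cyclic_weight):
    # C_p = 1 / (1 - cyclic_weight). A cyclic weight at (or numerically near)
    # 1 means the profile claims the loop never exits, so we cap C_p.
    if cyclic_weight >= 1.0:
        return CAP
    return min(1.0 / (1.0 - cyclic_weight), CAP)

def implied_exit_likelihood(cp):
    # The feedback step proposed above: once C_p is capped, adjust exit edges
    # so that at least 1/C_p of the per-iteration flow leaves the loop,
    # instead of leaving the "never exits" likelihoods in place.
    return 1.0 / cp

assert cyclic_probability(0.5) == 2.0
assert abs(cyclic_probability(0.999) - 1000.0) < 1e-6
assert implied_exit_likelihood(CAP) == 0.001
```

For a single-exit loop this feedback is a direct fix; distributing that 1/$C_p$ across multiple conditional exits is the hard part described above.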

If we go back and adjust we may also run into complications from cases like

while (true)
{
      if (p) throw new Exception();
      ...
}

where the loop is postdominated by a throw; synthesis repair will want to make the throw have nonzero weight. Ultimately, we will have to permit throws to have nonzero weights.


Testing some enhancements to profile repair that look promising on this case, and don't seem too disruptive overall.


Update 4/16: official charts are slow to update but I can pull the data myself now. The PGO regression is largely mitigated:

image

And now the "official chart"

newplot (27)

@AndyAyersMS

AndyAyersMS commented Apr 13, 2023

LinqBenchmarks.Count00ForX

Default [link]

Hard to tell from this but default is at about 122 with error bars extending +/- 10.
newplot (12)

PGO [link] -- note the y-axis scale differs from above.

PGO is pretty consistently at 163.

newplot (13)

At first blush I can't repro this locally. My machine is somewhat faster than the lab, so I assume the default case is just running slowly for me locally.

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count00ForX | Job-SGUCJG | Default | 131.5 ms | 26.31 ms | 30.30 ms | 153.0 ms | 94.10 ms | 162.7 ms | 1.00 | 0.00 | 200 B | 1.00 |
| Count00ForX | Job-KTJAFK | PGO | 122.8 ms | 1.85 ms | 1.54 ms | 122.8 ms | 119.94 ms | 126.3 ms | 0.83 | 0.13 | 200 B | 1.00 |

However looking at the distribution of measurements

;; default
-------------------- Histogram --------------------
[ 79.447 ms ; 111.213 ms) | @@@@@@@@
[111.213 ms ; 142.932 ms) | @
[142.932 ms ; 177.358 ms) | @@@@@@@@@@@
---------------------------------------------------

;; pgo
-------------------- Histogram --------------------
[119.082 ms ; 123.442 ms) | @@@@@@@@@
[123.442 ms ; 127.124 ms) | @@@@
---------------------------------------------------

it looks like the default case is just getting unlucky and hitting its slow variant more often for me, and there's a set of runs that are consistently faster than PGO. So I will keep digging.


It looks like both modes end up measuring OSR code, but default eventually switches to Tier1 and PGO never makes it there, in part because we redo the Tier0/OSR work all over again for the main method (#84319, fix is close)

;; default

65.49%   3.249E+07   OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()
33.72%   1.673E+07   Tier-1   [MicroBenchmarks]LinqBenchmarks.Count00ForX()

009 4980.124 -- 5274.105 : 293.981
010 5275.485 -- 5576.441 : 300.955
011 5577.827 -- 5767.517 : 189.690
012 5768.957 -- 5953.575 : 184.618

;; pgo

78.87%   2.952E+07   OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()
19.77%   7.4E+06     OSR      [MicroBenchmarks]LinqBenchmarks.Count00ForX()

(all iterations same time)

if you crank up the iteration count then PGO looks quite a bit better. Here we are with 100 iterations:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count00ForX | Job-AVMULQ | Default | 102.08 ms | 7.204 ms | 21.24 ms | 93.39 ms | 88.49 ms | 161.3 ms | 1.00 | 0.00 | 200 B | 1.00 |
| Count00ForX | Job-DJMRVT | PGO | 71.47 ms | 11.541 ms | 34.03 ms | 50.62 ms | 45.31 ms | 130.5 ms | 0.69 | 0.29 | 200 B | 1.00 |

Looking at the benchmark sources

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/Linq/Linq.cs#L291-L308

this is a test where we manually set the iteration count; as a result, each invocation of the benchmark method takes a long time, so BDN decides not to call it very often per iteration (just twice)

// AfterActualRun
WorkloadResult   1: 2 op, 246812800.00 ns, 123.4064 ms/op
WorkloadResult   2: 2 op, 245786100.00 ns, 122.8931 ms/op

So reaching the 30-call threshold requires some number of iterations (looks like we do 10 or so warming up).

Once we fix #84517 I expect this benchmark to improve some, but we might also consider removing the internal iterations and letting BDN pick the counts for us.
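
The arithmetic behind the 30-call threshold point (assuming the default tiered-compilation call-count threshold of 30, and the 2 ops/iteration that BDN chose above):

```python
import math

def iterations_to_reach_calls(ops_per_iteration, call_threshold=30):
    # With ~123 ms per op, BDN invokes the workload only twice per iteration,
    # so the method needs many measured iterations before the call counter
    # can trigger promotion to the next tier.
    return math.ceil(call_threshold / ops_per_iteration)

assert iterations_to_reach_calls(2) == 15
# A fast benchmark invoked thousands of times per iteration tiers up at once.
assert iterations_to_reach_calls(4096) == 1
```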

@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Apr 13, 2023
@AndyAyersMS

AndyAyersMS commented Apr 14, 2023

Benchstone.MDBenchI.MDLogicArray.Test

Default [link]

newplot (14)

PGO [link]

newplot (15)

Recent regression is likely from: #84312


This test also suffers from repeated tiering up (#84517)

;; pgo
 830: JIT compiled Benchstone.MDBenchI.MDLogicArray:Test():bool:this [Instrumented Tier0, IL size=63, code size=258, hash=0x5774c8bd]
 831: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Instrumented Tier0, IL size=139, code size=557, hash=0x420d3983]
 834: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x39 with Dynamic PGO, IL size=139, code size=234, hash=0x420d3983]
 835: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x21 with Dynamic PGO, IL size=139, code size=326, hash=0x420d3983]
 836: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Instrumented Tier0, IL size=139, code size=557, hash=0x420d3983]
 837: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x39 with Dynamic PGO, IL size=139, code size=234, hash=0x420d3983]
 838: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1-OSR @0x21 with Dynamic PGO, IL size=139, code size=326, hash=0x420d3983]
 839: JIT compiled Benchstone.MDBenchI.MDLogicArray:Inner(byref):bool [Tier1 with Dynamic PGO, IL size=139, code size=236, hash=0x420d3983]
 906: JIT compiled Benchstone.MDBenchI.MDLogicArray:Test():bool:this [Tier1-OSR @0x35 with Dynamic PGO, IL size=63, code size=353, hash=0x5774c8bd]

However, it is not clear that this is the root cause of the perf issue. Profiles show

;; default

99.08%   4.95E+07    Tier-1   [MicroBenchmarks]MDLogicArray.Inner(value class Workarea&)

;; pgo

61.53%   3.985E+07   OSR      [MicroBenchmarks]MDLogicArray.Test()
37.67%   2.44E+07    Tier-1   [MicroBenchmarks]MDLogicArray.Inner(value class Workarea&)

(here Test calls Inner in a loop)

So with PGO that final OSR compile evidently inlines Inner into Test and the CQ of the computations in Inner suffers.

Suspect this is the same issue as in MulMatrix [above], where profile data is lacking because it only comes from partially executed Tier0 methods:

Inline @[000031] Finishing PHASE Profile incorporation
Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight         IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0007]  1                           10046.     10046    [000..00B)-> BB06 (always)                     IBC 
BB02 [0008]  1       BB06                10046k  10046000    [00B..00F)-> BB04 (always)                     bwd bwd-target IBC 
BB03 [0009]  1       BB04                504375k 504374837    [00F..021)                                     bwd bwd-target IBC 
BB04 [0010]  2       BB02,BB03           514421k 514420837    [021..026)-> BB03 ( cond )                     bwd bwd-src IBC 
BB05 [0011]  1       BB04                10046k  10046000    [026..02A)                                     bwd IBC 
BB06 [0012]  2       BB01,BB05           10046k  10046000    [02A..02F)-> BB02 ( cond )                     bwd bwd-src IBC 
BB07 [0013]  1       BB06                  0            0    [02F..033)-> BB14 (always)                     rare IBC 
BB08 [0014]  1       BB14                  0            0    [033..037)-> BB12 (always)                     rare bwd bwd-target IBC 
BB09 [0015]  1       BB12                  0            0    [037..039)                                     rare bwd bwd-target IBC 
BB10 [0016]  2       BB09,BB10             0            0    [039..06C)-> BB10 ( cond )                     rare bwd bwd-target bwd-src IBC 
BB11 [0017]  1       BB10                  0            0    [06C..070)                                     rare bwd IBC 

So let's try with the fix from #84817:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MDLogicArray | Job-INDUVT | Default | 320.6 ms | 7.76 ms | 8.93 ms | 320.9 ms | 308.0 ms | 337.1 ms | 1.00 | 0.00 | 10.59 KB | 1.00 |
| MDLogicArray | Job-BHAQKC | PGO | 406.5 ms | 4.25 ms | 3.98 ms | 404.8 ms | 400.7 ms | 413.9 ms | 1.29 | 0.03 | 10.59 KB | 1.00 |
| MDLogicArray | Job-DVPVHR | With #84817 | 366.4 ms | 5.87 ms | 6.03 ms | 365.1 ms | 358.7 ms | 379.0 ms | 1.14 | 0.03 | 10.59 KB | 1.00 |

Things look better but still not at par.

Will move on to another benchmark; may come back to this one.


Lab charts are finally getting updated, here's the latest

newplot - 2023-04-19T120337 911

@AndyAyersMS

Benchstone.BenchI.CSieve.Test

Default [link]

newplot (40)

PGO [link]

newplot (41)


Performance seems to be at parity now

@AndyAyersMS

System.Memory.Span.StartsWith(Size: 512)

Default [link]

newplot (42)

PGO [link]

newplot (43)


PGO looks to be at parity now or perhaps a bit faster.

@AndyAyersMS

System.Collections.ContainsFalse.ImmutableArray(Size: 512)

Default [link]

newplot (44)

PGO [link]

newplot (45)


PGO performance seems comparable?

@AndyAyersMS

PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj

Default [link]

newplot (46)

PGO [link]

newplot (48)


PGO seems comparable

@AndyAyersMS

PerfLabTests.LowLevelPerf.GenericGenericMethod

Default [link]

newplot (49)

PGO [link]

newplot (50)


Looks to be tri-modal with PGO and Default showing similar levels.

@AndyAyersMS

AndyAyersMS commented Apr 21, 2023

System.Perf_Convert.FromBase64String

Default [link]

newplot (51)

PGO [link]

newplot (52)


Can repro locally:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FromBase64String | Job-ZPWSJG | Default | 64.07 ns | 1.903 ns | 2.191 ns | 63.74 ns | 61.38 ns | 69.18 ns | 1.00 | 0.00 | 0.0087 | 56 B | 1.00 |
| FromBase64String | Job-YVHGZZ | PGO | 68.80 ns | 0.851 ns | 0.796 ns | 68.93 ns | 67.31 ns | 70.35 ns | 1.08 | 0.03 | 0.0088 | 56 B | 1.00 |
;; default

02.88%   1.2E+06     ?        Unknown
37.87%   1.579E+07   Tier-1   [System.Private.CoreLib]Convert.TryDecodeFromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&)
23.84%   9.94E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64_ComputeResultLength(wchar*,int32)
18.04%   7.52E+06    native   coreclr.dll
07.75%   3.23E+06    Tier-1   [System.Private.CoreLib]Convert.TryFromBase64Chars(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&)
04.41%   1.84E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64CharPtr(wchar*,int32)
02.30%   9.6E+05     Tier-1   [System.Private.CoreLib]Convert.FromBase64String(class System.String)

;; pgo

02.36%   1.08E+06    ?        Unknown
38.08%   1.742E+07   Tier-1   [System.Private.CoreLib]Convert.FromBase64CharPtr(wchar*,int32)
32.29%   1.477E+07   Tier-1   [System.Private.CoreLib]Convert.TryDecodeFromUtf16(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&,int32&)
15.96%   7.3E+06     native   coreclr.dll
06.62%   3.03E+06    Tier-1   [System.Private.CoreLib]Convert.TryFromBase64Chars(value class System.ReadOnlySpan`1<wchar>,value class System.Span`1<unsigned int8>,int32&)
02.45%   1.12E+06    Tier-1   [System.Private.CoreLib]Convert.FromBase64String(class System.String)

Looks like with PGO, FromBase64CharPtr has now absorbed FromBase64_ComputeResultLength.


Seems like this may be a case of the issue noted in #84319, where loop inversion doesn't get the profile updates done properly. Here we have a hot loop BB02-BB06 with some cold blocks:

BB01 [0010]  1                           48888.   48888    [000..00D)-> BB06 (always)                     IBC 
BB02 [0011]  1       BB06                 1734k 1734078    [00D..01A)-> BB04 ( cond )                     bwd bwd-target IBC 
BB03 [0012]  1       BB02                  0          0    [01A..020)-> BB06 (always)                     rare bwd IBC 
BB04 [0013]  1       BB02                 1734k 1734078    [020..025)-> BB06 ( cond )                     bwd IBC 
BB05 [0014]  1       BB04                48545.   48545    [025..02D)                                     bwd IBC 
BB06 [0015]  4       BB01,BB03,BB04,BB05  1783k 1782966    [02D..031)-> BB02 ( cond )                     bwd bwd-src IBC 

Without PGO we end up compacting this (somehow), but with PGO we end up leaving one of the cold blocks inline

;; default 

       align    [12 bytes for IG03]

G_M650_IG03:  ;; offset=0020H
       movzx    r9, word  ptr [rcx]
       add      rcx, 2
       cmp      r9d, 32
       jbe      SHORT G_M650_IG12
						;; size=14 bbWeight=348.23 PerfScore 1218.79
G_M650_IG04:  ;; offset=002EH
       cmp      r9d, 61
       je       SHORT G_M650_IG11
						;; size=6 bbWeight=347.83 PerfScore 434.79
G_M650_IG05:  ;; offset=0034H
       cmp      rcx, rax
       jb       SHORT G_M650_IG03

...

G_M650_IG11:  ;; offset=0069H  **COLD**
       dec      edx
       inc      r8d
       jmp      SHORT G_M650_IG05


;; pgo

       align    [8 bytes for IG06]

G_M36625_IG06:  ;; offset=0060H
       movzx    rax, word  ptr [rdx]
       add      rdx, 2
       cmp      eax, 32
       jbe      G_M36625_IG15
       cmp      eax, 61
       jne      SHORT G_M36625_IG08
						;; size=21 bbWeight=35.07 PerfScore 166.59
G_M36625_IG07:  ;; offset=0075H **COLD**
       dec      ecx
       inc      ebp
						;; size=4 bbWeight=0.98 PerfScore 0.49
G_M36625_IG08:  ;; offset=0079H
       cmp      rdx, rbx
       jb       SHORT G_M36625_IG06

When we invert, the exit block weight ends up being too low:

BB09 [0011]  2       BB13,BB29            35.47 1734078    [029..02A)-> BB11 ( cond )                     i bwd bwd-target IBC 
BB10 [0012]  1       BB09                  0          0    [029..02A)-> BB13 (always)                     i rare bwd IBC 
BB11 [0013]  1       BB09                 35.47 1734078    [029..02A)-> BB13 ( cond )                     i bwd IBC 
BB12 [0014]  1       BB11                  0.99   48545    [029..02A)                                     i bwd IBC 
BB13 [0015]  3       BB10,BB11,BB12        1      48888    [029..02A)-> BB09 ( cond )                     i bwd bwd-src IBC 
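
One way to see the inconsistency: sum the predecessor edge weights into each block and compare against the block's own weight. A minimal checker sketch (hypothetical helper, not JIT code; the IBC numbers are taken from the table above, and the edge weights are inferred from block weights, with BB11 → BB13 getting what doesn't fall through to BB12):

```python
def find_inconsistent_blocks(weights, edges, tolerance=0.1):
    # weights: block -> profile count; edges: (pred, succ) -> edge count.
    # Flag blocks whose incoming flow disagrees with their own weight by
    # more than `tolerance` (relative).
    bad = []
    for block, w in weights.items():
        incoming = sum(ew for (_, dst), ew in edges.items() if dst == block)
        if incoming == 0 and w == 0:
            continue
        if abs(incoming - w) > tolerance * max(incoming, w):
            bad.append(block)
    return bad

# BB13's count (48888) is dwarfed by the ~1.7M arriving from BB11 and BB12,
# so it gets flagged; BB12 is consistent with its incoming flow.
weights = {"BB12": 48_545, "BB13": 48_888}
edges = {("BB11", "BB12"): 48_545,
         ("BB11", "BB13"): 1_685_533,
         ("BB12", "BB13"): 48_545}
assert find_inconsistent_blocks(weights, edges) == ["BB13"]
```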

Verified that the prospective fix from #84319 works:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FromBase64String | Job-XFXANI | FixLoopInversion | 63.05 ns | 0.716 ns | 0.598 ns | 62.83 ns | 62.35 ns | 64.43 ns | 1.00 | 0.00 | 0.0088 | 56 B | 1.00 |
| FromBase64String | Job-ITZXRC | PGO | 70.10 ns | 1.172 ns | 1.096 ns | 70.16 ns | 68.79 ns | 72.12 ns | 1.11 | 0.02 | 0.0088 | 56 B | 1.00 |

Lab sees the fix now too:

newplot - 2023-05-18T180257 652

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Apr 21, 2023
If we have a partial profile then the current count reconstruction will
adjust the exit likelihood of some loop exit when it hits a capped loop.
But for multiple exit loops we might wish to see some profile flow out
of all exits, not just one.

In `ludcmp` we choose to send all the profile weights down an early return
path, leaving the bulk of the method with zero counts.

Instead of trying increasingly elaborate repair schemes, we will now use
blend mode for these sorts of problems; this gives a more balanced count
redistribution.

I also updated blend to use the same logic as repair if a block has zero
weights, since presumably whatever likelihood was assigned there during
reconstruction is not well supported.

Fixes the `ludcmp` regression with PGO over no PGO, noted in
dotnet#84264 (comment)
AndyAyersMS added a commit that referenced this issue Apr 22, 2023
@AndyAyersMS

AndyAyersMS commented Apr 24, 2023

System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4)

Default [link]

newplot (53)

PGO [link]

newplot (54)

Feb 2023 regression seems to be #82412


| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SequenceCompareToDifferent | Job-HHXDGF | \base-rel\corerun.exe | 4 | 4.590 ns | 0.1019 ns | 0.1001 ns | 4.543 ns | 4.478 ns | 4.791 ns | 1.00 | 0.00 | - | NA |
| SequenceCompareToDifferent | Job-UIGKEB | \diff-rel\corerun.exe | 4 | 4.817 ns | 0.1166 ns | 0.1033 ns | 4.785 ns | 4.700 ns | 5.083 ns | 1.05 | 0.03 | - | NA |
;; default

03.21%   1.2E+06     ?        Unknown
36.40%   1.36E+07    Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceCompareToDifferent()
35.95%   1.343E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceCompareTo(unsigned int8&,int32,unsigned int8&,int32)
17.91%   6.69E+06    Tier-1   [System.Private.CoreLib]MemoryExtensions.SequenceCompareTo(value class System.Span`1<!!0>,value class System.ReadOnlySpan`1<!!0>)

;; pgo

38.51%   1.483E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Byte].SequenceCompareToDifferent()
35.63%   1.372E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceCompareTo(unsigned int8&,int32,unsigned int8&,int32)
16.26%   6.26E+06    Tier-1   [System.Private.CoreLib]MemoryExtensions.SequenceCompareTo(value class System.Span`1<!!0>,value class System.ReadOnlySpan`1<!!0>)

Looks like slightly different PGO data (static vs dynamic) for System.ReadOnlySpan`1[ubyte]:.ctor(ubyte[]):this leads to an LSRA block-order difference, which in turn leads LSRA to use two extra callee-saves in the PGO case.

;; default (static pgo)

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0008]  1                           100    100    [000..003)-> BB03 ( cond )                     IBC 
BB02 [0009]  1       BB01                  0.58   1    [003..00B)        (return)                     IBC 
BB03 [0010]  1       BB01                 99.42  99    [00B..021)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

Start LSRA Block Sequence: 
Current block: BB01
	Succ block: BB03, Criteria: weight, Worklist: [BB03 ]
	Succ block: BB02, Criteria: weight, Worklist: [BB03 BB02 ]
Current block: BB03
	Succ block: BB04, Criteria: weight, Worklist: [BB04 BB02 ]
Current block: BB04
	Succ block: BB06, Criteria: bbNum, Worklist: [BB06 BB02 ]
	Succ block: BB05, Criteria: bbNum, Worklist: [BB05 BB06 BB02 ]
Current block: BB05
	Succ block: BB07, Criteria: bbNum, Worklist: [BB06 BB07 BB02 ]
Current block: BB06
Current block: BB07
Current block: BB02
Final LSRA Block Sequence:
BB01 (  1   )
BB03 (  0.50)
BB04 (  1   )
BB05 (  0.00)
BB06 (  0.50)
BB07 (  1   )
BB02 (  0   )

;; pgo

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0008]  1                           100    100    [000..003)-> BB03 ( cond )                     IBC 
BB02 [0009]  1       BB01                  0      0    [003..00B)        (return)                     rare IBC 
BB03 [0010]  1       BB01                100    100    [00B..021)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

Start LSRA Block Sequence: 
Current block: BB01
	Succ block: BB03, Criteria: weight, Worklist: [BB03 ]
	Succ block: BB02, Criteria: weight, Worklist: [BB03 BB02 ]
Current block: BB03
	Succ block: BB04, Criteria: weight, Worklist: [BB04 BB02 ]
Current block: BB04
	Succ block: BB06, Criteria: bbNum, Worklist: [BB06 BB02 ]
	Succ block: BB05, Criteria: weight, Worklist: [BB06 BB02 BB05 ]
Current block: BB06
	Succ block: BB07, Criteria: bbNum, Worklist: [BB07 BB02 BB05 ]
Current block: BB07
Current block: BB02
Current block: BB05
Final LSRA Block Sequence:
BB01 (  1   )
BB03 (  0.50)
BB04 (  1   )
BB06 (  0.50)
BB07 (  1   )
BB02 (  0   )
BB05 (  0   )

and the prologs

;; default
       push     rdi
       push     rsi
       sub      rsp, 72

;; pgo
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 72
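
The block-sequencing behavior visible in the LSRA traces can be modeled with a worklist in which discovered successors are ordered by descending weight, with block number as the tie-breaker (an illustrative sketch that reproduces the dynamic-PGO order above; the real heuristic mixes "weight" and "bbNum" criteria, as the default trace shows):

```python
def lsra_block_sequence(start, succs, weight):
    # Visit blocks by repeatedly taking the highest-weight block from a
    # worklist of discovered successors; zero-weight blocks (BB02 and BB05
    # in the PGO trace) sink to the end of the sequence.
    seq, worklist, seen = [], [], {start}
    block = start
    while block is not None:
        seq.append(block)
        for s in succs.get(block, []):
            if s not in seen:
                seen.add(s)
                worklist.append(s)
        worklist.sort(key=lambda b: (-weight[b], b))
        block = worklist.pop(0) if worklist else None
    return seq

# Flowgraph and weights abbreviated from the PGO trace above.
succs = {"BB01": ["BB03", "BB02"], "BB03": ["BB04"],
         "BB04": ["BB06", "BB05"], "BB06": ["BB07"]}
weight = {"BB01": 1, "BB02": 0, "BB03": 0.5, "BB04": 1,
          "BB05": 0, "BB06": 0.5, "BB07": 1}
assert lsra_block_sequence("BB01", succs, weight) == \
    ["BB01", "BB03", "BB04", "BB06", "BB07", "BB02", "BB05"]
```

With BB05 sequenced last, allocation decisions change enough that LSRA ends up using two extra callee-saves, as the prologs show.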

@AndyAyersMS

AndyAyersMS commented Apr 24, 2023

System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB)

Default [link]

newplot (55)

PGO [link]

newplot (56)

May 2022 (small) regression may have been #68804
February 2023 regression may be #77357 or #81936
April 2023 regression possibly #84427


Locally I can repro, but annoyingly, when I profile, PGO is faster.

| Method | Job | Toolchain | IsDataIndented | TestRandomAccess | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parse | Job-FHNYPZ | Default | False | True | Json400KB | 4.592 ms | 0.0531 ms | 0.0444 ms | 4.599 ms | 4.515 ms | 4.679 ms | 1.00 | 0.00 | 92.5926 | 597.51 KB | 1.00 |
| Parse | Job-LHXDIX | PGO | False | True | Json400KB | 5.507 ms | 0.9467 ms | 1.0522 ms | 4.760 ms | 4.490 ms | 6.996 ms | 1.29 | 0.22 | - | 597.59 KB | 1.00 |
;; default

65.72%   2.552E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
07.37%   2.86E+06    FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,int32,value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonElement&)
03.37%   1.31E+06    native   coreclr.dll
01.85%   7.2E+05     Tier-1   [System.Text.Json]JsonDocument.Parse(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonReaderOptions,value class MetadataDb&,value class StackRowStack&)
01.75%   6.8E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ReadSingleSegment()
01.75%   6.8E+05     Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
01.39%   5.4E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeString()

;; pgo

66.14%   2.531E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
07.66%   2.93E+06    FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,int32,value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonElement&)
03.29%   1.26E+06    native   coreclr.dll
02.59%   9.9E+05     Tier-1   [System.Text.Json]JsonDocument.Parse(value class System.ReadOnlySpan`1<unsigned int8>,value class System.Text.Json.JsonReaderOptions,value class MetadataDb&,value class StackRowStack&)
01.86%   7.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeNextToken(unsigned int8)
01.44%   5.5E+05     Tier-1   [System.Private.CoreLib]Ascii.WidenAsciiToUtf16(unsigned int8*,wchar*,unsigned int)
01.31%   5E+05       Tier-1   [System.Text.Json]Utf8JsonReader.ReadSingleSegment()
01.23%   4.7E+05     Tier-1   [System.Private.CoreLib]IndexOfAnyAsciiSearcher.IndexOfAnyVectorized(unsigned int8&,int32,value class System.Runtime.Intrinsics.Vector128`1<unsigned int8>)
01.12%   4.3E+05     FullOpt  [System.Text.Json]JsonDocument.TryGetNamedPropertyValue(int32,value class System.ReadOnlySpan`1<wchar>,value class System.Text.Json.JsonElement&)
01.07%   4.1E+05     Tier-1   [System.Text.Json]Utf8JsonReader.ConsumeValue(unsigned int8)
01.07%   4.1E+05     native   clrjit.dll

Note the TryGetNamedPropertyValue methods bypass tiering. Looks like they have both stackalloc and a loop, so they aren't eligible for OSR.


Codegen for GetArrayIndexElement is identical with and without PGO, the next-hottest method bypasses tiering, and I consistently get faster results with PGO when running BDN with -p ETW as admin, so I am going to move on.

My only guess is that perhaps GetArrayIndexElement is running into the JCC erratum. I don't have PGO perf history for other machines, so I can't easily cross-check (say, against arm64).


We now have some PGO data on both Linux x64 and Ampere Arm64; for both of those, PGO is comparable or a bit faster.

image

@AndyAyersMS

System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 400000, unique: 1, cacheSize: 15)

Default [link]

newplot (57)

PGO [link]

newplot (58)


PGO seems consistently faster, so likely this was just a noisy comparison.

AndyAyersMS commented Apr 24, 2023

System.Collections.IndexerSet<Int32>.Dictionary(Size: 512)

Default [link]

newplot (59)

PGO [link]

newplot (60)

Jan 2023 regression looks like #80481


Locally I have PGO consistently faster:

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| Dictionary | Job-OHYCZK | Default | 512 | 3.506 us | 0.1136 us | 0.1216 us | 3.491 us | 3.328 us | 3.792 us | 1.00 | 0.00 | - | NA |
| Dictionary | Job-NZIXNJ | PGO | 512 | 3.219 us | 0.0467 us | 0.0414 us | 3.214 us | 3.122 us | 3.287 us | 0.92 | 0.03 | - | NA |

Both default and PGO have since improved and are showing comparable results in the lab:

Default
newplot - 2023-05-18T181432 642

Improvement was one or more of:

PGO

newplot - 2023-05-18T181617 568

Improvement was likely from #85805

AndyAyersMS commented Apr 24, 2023

System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!")

Note in a windows batch script you may need to use

--filter "System.Tests.Perf_Char.Char_ToLowerInvariant(input: ""Hello World!"")"

to get this properly passed to BenchmarkDotNet.

Default [link]

newplot (61)

November 2022 improvement was #78593
April 2023 regression is possibly #83624

PGO [link]

newplot (62)

November 2022 regression was #78593
April 2023 regression is possibly #83624


So oddly enough these are reacting oppositely to some changes.

I see PGO as faster or at least comparable locally.

| Method | Job | Toolchain | input | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|--------|-----|-----------|-------|------|-------|--------|--------|-----|-----|-------|-----------|-------------|
| Char_ToLowerInvariant | Job-ZSHNYR | Default | Hello World! | 13.34 ns | 0.143 ns | 0.127 ns | 13.36 ns | 13.15 ns | 13.52 ns | 1.00 | - | NA |
| Char_ToLowerInvariant | Job-QOYHZR | PGO | Hello World! | 11.80 ns | 0.137 ns | 0.128 ns | 11.76 ns | 11.61 ns | 11.99 ns | 0.88 | - | NA |

@AndyAyersMS

The above covers the initial screen of the 50 worst ratios (as of 3 weeks ago). I am going to add in a few other cases based on a refreshed and slightly broadened query.

Broadened: the results above excluded any test with a reported time under 2 ns. A lot of HW intrinsic/vector tests come in under this and report slower with PGO, so let's look at a few.

System.Numerics.Tests.Perf_Vector3.DotBenchmark

Default [link]

newplot (63)

PGO [link]

newplot (64)

Seems comparable; the PGO result is just a hair greater than zero, but the default is zero, so the ratio is infinite.

AndyAyersMS commented Apr 24, 2023

System.Numerics.Tests.Perf_Vector3.ReflectBenchmark

Default [link]

newplot (65)

PGO [link]

newplot (66)

Regression appears to be from #83036 which seems odd.


This one is still a mystery. Default and PGO seem to have identical code for the benchmark methods, but with release bits PGO reports consistently worse performance. If I drop a checked JIT into the release PGO build, it speeds up and matches the default perf.

Suspecting something like the following -- the workload action unroll for release PGO ends up always calling the precode somehow, and this never gets updated/backpatched?


Looks like the issue was that the JIT was not fully initializing some of the SIMD12 vector constants used by this code, causing the high float to pick up random values that might be denormals, which cause huge stalls.


Fix is #85362

newplot - 2023-04-27T014436 260

AndyAyersMS commented Apr 25, 2023

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)

Default [link]

newplot - 2023-04-25T081437 916

January 2023 improvement is #79346
February 2023 improvement is possibly #79381

PGO [link]

newplot - 2023-04-25T081327 907

October 2022 regression is #70941
January 2023 improvement looks like #80481
February 2023 improvement possibly #79381
April 2023 regression is from #84427


This repros locally

| Method | Job | Toolchain | NumberOfBytes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|---------------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| Base64DecodeInPlace | Job-BGTHBT | Default | 200000000 | 123.6 ms | 1.15 ms | 0.89 ms | 123.5 ms | 122.3 ms | 125.8 ms | 1.00 | 0.00 | 400 B | 1.00 |
| Base64DecodeInPlace | Job-DGNZZW | PGO | 200000000 | 261.5 ms | 47.18 ms | 50.48 ms | 278.5 ms | 123.0 ms | 287.5 ms | 2.04 | 0.49 | 400 B | 1.00 |

But when I enable profiling -p ETW it vanishes

| Method | Job | Toolchain | NumberOfBytes | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|---------------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| Base64DecodeInPlace | Job-ODJYCK | Default | 200000000 | 125.4 ms | 6.63 ms | 6.20 ms | 123.7 ms | 121.3 ms | 141.2 ms | 1.00 | 0.00 | 400 B | 1.00 |
| Base64DecodeInPlace | Job-NQEVUA | PGO | 200000000 | 124.7 ms | 4.63 ms | 4.33 ms | 123.4 ms | 122.0 ms | 139.8 ms | 1.00 | 0.06 | 400 B | 1.00 |

Seems like the "fast" run of this test executes R2R code and never tiers up:

image

(It also looks like something is broken with R2R rundown, as neither PerfView nor my tool can resolve which R2R method this is.)

Think the issue is in BDN itself: dotnet/BenchmarkDotNet#2296

So, why are we looking at R2R code here? (Potentially if I can catch a slow run in the profiler, it may show hits at Tier0, OSR, or Tier1...will keep trying).


Caught a slow run, and despite what I wrote above, the "fast" runs look like they are indeed at Tier1.

;; default

86.93%   1.63E+07    Tier-1   [System.Private.CoreLib]Base64.DecodeFromUtf8InPlace(value class System.Span`1<unsigned int8>,int32&)
13.07%   2.45E+06    native   System.Private.CoreLib.dll

;; pgo

48.42%   2.669E+07   native   coreclr.dll
47.01%   2.591E+07   Tier-1 instr  [System.Private.CoreLib]Base64.DecodeFromUtf8InPlace(value class System.Span`1<unsigned int8>,int32&)
04.44%   2.45E+06    native   System.Private.CoreLib.dll
00.09%   5E+04       native   ntoskrnl.exe

The hot method is evidently R2R, so with PGO we go to Tier-1 instr, but it doesn't get called enough to ever finish tiering up. If I rev up the iteration count...

WorkloadResult  29: 1 op, 283866900.00 ns, 283.8669 ms/op
WorkloadResult  30: 1 op, 291856400.00 ns, 291.8564 ms/op
WorkloadResult  31: 1 op, 286046600.00 ns, 286.0466 ms/op
WorkloadResult  32: 1 op, 287347400.00 ns, 287.3474 ms/op
WorkloadResult  33: 1 op, 121137000.00 ns, 121.1370 ms/op
WorkloadResult  34: 1 op, 121175600.00 ns, 121.1756 ms/op
WorkloadResult  35: 1 op, 122286200.00 ns, 122.2862 ms/op
WorkloadResult  36: 1 op, 123082000.00 ns, 123.0820 ms/op
WorkloadResult  37: 1 op, 122119700.00 ns, 122.1197 ms/op

As you can see, we need to get to iteration 33 to switch to Tier1 with PGO, but by default we only run 15 iterations. And once we switch, we get basically the same perf as without PGO.

BDN could probably do a better job here of warning you that if there are fewer than, say, 30 invocations (perhaps 60 to be safe) of your benchmark method, you may not see the results you're expecting, and you either need to drop the per-invocation time or raise the invocation or iteration counts.

AndyAyersMS commented Apr 25, 2023

Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty

Default [link]

newplot - 2023-04-25T082718 128

April 2023 improvement possibly #84582

PGO [link]

newplot - 2023-04-25T082745 458

October 2022 improvement was #70941
February 2023 improvement was #81267
April 2023 regression was possibly #84582


Comparable for me locally

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| AddChainedConfigurationEmpty | Job-SNZWKD | \base-rel\corerun.exe | 13.41 ms | 0.263 ms | 0.270 ms | 13.30 ms | 13.18 ms | 14.01 ms | 1.00 | 0.00 | 143.24 KB | 1.00 |
| AddChainedConfigurationEmpty | Job-AMBTRB | \diff-rel\corerun.exe | 13.87 ms | 1.199 ms | 1.380 ms | 13.10 ms | 12.58 ms | 16.12 ms | 1.05 | 0.10 | 143.32 KB | 1.00 |

AndyAyersMS commented May 5, 2023

System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0)

Default [link]

newplot - 2023-05-05T075647 419

Recent improvement was from #85277

PGO [link]

newplot - 2023-05-05T075552 000

Regression was from something in this range: 887c043...f92b9ef (74 commits)


From Azure Data Explorer, regression was after 2023-04-25T00:29:35Z and before 2023-04-25T03:46:39Z

image

Tue Apr 25 03:46:39 2023 f74138f Jeremy Koritzinsky Enable ComWrappers for NativeAOT on non-Windows (#85000)
Tue Apr 25 03:33:48 2023 ad59672 Jeremy Koritzinsky Build Mono LLVM legs on the CBL-Mariner images, but run the AOTing steps on CentOS Stream 8 with binutils (#85273)
Tue Apr 25 02:51:18 2023 516aa8c Stephen Toub Add Parallel.ForAsync (#84804)
Tue Apr 25 02:50:59 2023 759fabe Stephen Toub Fix perf regressions in Utf8Formatter for integers (#85277)
Tue Apr 25 01:14:10 2023 01e2455 Michał Petryka Add CollectionsMarshal.SetCount(list, count) (#82146)
Tue Apr 25 00:49:02 2023 41f6e79 Aleksey Kliger (λgeek) [native-library] If a dllimport is specified with an absolute path, look for it first (#85255)
Tue Apr 25 00:29:35 2023 620b0db Stephen Toub Add parsing error information to composite format string parsing exceptions (#85106)

So it looks like the change that led to this was #85277.


| Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|-------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| FormatterUInt64 | Job-WWTDFM | PGO | 0 | 6.214 ns | 0.0098 ns | 0.0082 ns | 6.213 ns | 6.203 ns | 6.231 ns | 2.91 | 0.03 | - | NA |
| FormatterUInt64 | Job-QWMANF | Default | 0 | 2.137 ns | 0.0243 ns | 0.0227 ns | 2.148 ns | 2.082 ns | 2.162 ns | 1.00 | 0.00 | - | NA |

;; default

05.74%   2.18E+06    ?        Unknown
81.72%   3.103E+07   Tier-1   [MicroBenchmarks]Utf8FormatterTests.FormatterUInt64(unsigned int64)
11.72%   4.45E+06    Tier-1   [9c1446ac-187e-4ddf-b6bd-964603a8e2dd]Runnable_0.WorkloadActionUnroll(int64)

;; pgo

02.50%   1.97E+06    ?        Unknown
93.66%   7.369E+07   Tier-1   [MicroBenchmarks]Utf8FormatterTests.FormatterUInt64(unsigned int64)
03.23%   2.54E+06    Tier-1   [ea5814fd-a81b-4c82-bc90-c88fd94286d8]Runnable_0.WorkloadActionUnroll(int64)

Looks like all the time is spent in the benchmark method. Hopefully this will be easy to track down.


Seems like what is happening is that with PGO we see a lot of cold blocks, and so we don't inline some methods that we inline without PGO. One of those takes a struct by address that is also used in hot code, and so we don't fully promote, we dependently promote. And then we make some poor choices in codegen that leads to what is likely a partial stall. Sampling shows a lot of time piling up in this little stretch of code:

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       mov      eax, dword ptr [rsp+48H]
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

For some reason we do a wide load after a narrow store; my recollection is that the HW does not like this at all. In the non-PGO version this code doesn't exist, because we know all the values are zero.


Looks like we widen this load in fgMorphCastedBitwiseOp. If I fix that to not widen normalize-on-load locals, we recover most (but not all) of the perf. I suspect the remainder is in the higher prolog/epilog costs we see with PGO.

| Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|-----|-----------|-------|------|-------|--------|--------|-----|-----|-------|---------|-----------|-------------|
| FormatterUInt64 | Job-FINTBR | PGO/Fixed | 0 | 2.516 ns | 0.0596 ns | 0.0558 ns | 2.506 ns | 2.455 ns | 2.631 ns | 1.16 | 0.04 | - | NA |
| FormatterUInt64 | Job-EZPSRN | Default | 0 | 2.159 ns | 0.0497 ns | 0.0441 ns | 2.149 ns | 2.097 ns | 2.254 ns | 1.00 | 0.00 | - | NA |

PGO has now improved, thanks to #86491 fixing the store-forward stall.

newplot - 2023-05-21T075034 050

Default "regressed" because we enabled PGO by default, then improved with #86491.

newplot - 2023-05-21T075348 361

@AndyAyersMS

Closing this; remaining regressions will be analyzed in #87194.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 7, 2023