Investigate microbenchmarks that regress with PGO enabled #84264
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak. Issue details: Look into cases where microbenchmarks are slower with PGO enabled. #74873 (comment) has an initial list to check.
|
@AndyAyersMS I think that runtime/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs (lines 277 to 322 at e6226e6)
|
PS: we can probably propose a new API |
Yes, we are certain that these methods will be hot and want the best possible codegen right away. The cast helpers are also initialized in a special way - runtime/src/coreclr/vm/ecall.cpp (line 99 at 4cbe6f9)
I do not remember all the details, but there were reasons for that. |
The ones you listed have the AggressiveOptimization attribute, am I right? In that case we won't instrument them, but there are other (less important) members in CastHelpers which don't have AggressiveOptimization and so are eligible for instrumentation |
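For illustration, a minimal sketch (hypothetical helper names, not the actual CastHelpers code) of the distinction being discussed: a method marked with AggressiveOptimization skips tiered compilation entirely and therefore never gets Tier0 instrumentation, while an unmarked sibling is still eligible:

using System;
using System.Runtime.CompilerServices;

internal static class CastHelpersSketch
{
    // Compiled fully optimized on first use; never instrumented for dynamic PGO.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    internal static bool IsAssignableToFast(object obj, Type target)
        => obj is not null && target.IsAssignableFrom(obj.GetType());

    // No attribute: participates in tiering, so its Tier0 body can carry
    // instrumentation probes and feed dynamic PGO.
    internal static bool IsAssignableToTiered(object obj, Type target)
        => obj is not null && target.IsAssignableFrom(obj.GetType());
}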
All the cast helpers are initialized in that method. It would only be interesting if you somehow do not want |
The issue caused by #80481 is with benchmark methods whose body is dominated by a long-running loop. For such methods we never see any Tier0 + instr method return, so with sparse profiles enabled, Tier1 profile reconstruction thinks the entry block is run rarely, and this messes up optimizations; e.g. in this case we fail to hoist/CSE the cast out of the loop.
The fix I'm proposing is to run synthesis repair in such cases. It addresses the above case, looking now to see what else it does. |
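To make the shape concrete, here is a hypothetical benchmark of the kind being described (not the actual regressing test): the Tier0-instrumented compile spends its whole life inside the loop and escapes to OSR rather than returning, so sparse, return-probe-based profiles see a zero entry count:

using BenchmarkDotNet.Attributes;

public class CastInLoopSketch
{
    private readonly object _value = "hello";

    [Benchmark]
    public int CastInLoop()
    {
        int total = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            // Ideally the cast is hoisted/CSE'd out of the loop; with a bogus
            // zero entry-block weight that optimization is skipped.
            string s = (string)_value;
            total += s.Length;
        }
        return total;
    }
}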
Also noticed #84319 while reviewing the dumps for |
In particular, run synthesis in repair mode for cases where there are profile counts within the method but zero counts in `fgFirstBB`. Recall that sparse profiling effectively probes return blocks to determine the method entry count. So the zero-entry but not zero-everywhere case can happen if we have a method with a very long running loop plus sparse profiling plus OSR -- we will only get profile counts from the instrumented Tier0 method, and it will never return (instead it will always escape to an OSR version which will eventually return, but that version won't be instrumented). I originally was a bit more ambitious and ran repair for a broader set of reconstruction issues, but that led to a large number of diffs, in part because repair doesn't cope well with irreducible loops. Leaving the entry count zero can have a fairly disastrous impact on the quality of optimizations done in the method. Addresses quite a few of the worst-performing benchmarks in #84264.
After #84312: updated table of worst-case results (just 1 day of data, so possibly noisy); now showing the "fast time". Investigated cases are in bold, with notes linked -- skipping tests in italics that look similar to ones already analyzed.
|
System.Text.Json.Tests.Perf_Get.GetByte: Default [link], PGO [link]
PGO is unable to inline
I also noticed
|
System.Collections.Sort.Array(Size: 512): Default [link], PGO [link]. Both fluctuate, but PGO has worse behavior. By default this test doesn't iterate enough and is mainly measuring R2R codegen. Sadly, both PerfView and my tool are unable to resolve the R2R method; not sure why.
If I bump iteration count up to 200 there is a more consistent small regression with PGO
And it looks like it is related to the codegen in
Perf vs BDN iteration number for PGO:
By default the lab does 15 iterations, and IIRC we report the median, so you can see it might toggle back and forth between fast and slow results like we are seeing above, depending exactly on when the switchover happens. Seems like PGO behavior changed as a result of #80769. The issue seems to be that
The default case is using static PGO; from what I can tell all the counts are proper, so this is just a divergence in profile data leading to a change in code layout. There doesn't seem to be an easy way to tweak this algorithm, so I'm going to leave the analysis here and move on. |
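A small sketch of the median-toggling effect mentioned above, with made-up timings: with 15 iterations and a median-based report, the reported number flips between the two modes depending on which iteration the switchover lands on:

using System.Linq;

static class MedianToggleSketch
{
    // Hypothetical timings: 2.0 ("slow") until the switchover iteration, 1.0 ("fast") after.
    static double[] Measurements(int switchIteration) =>
        Enumerable.Range(0, 15)
                  .Select(i => i < switchIteration ? 2.0 : 1.0)
                  .ToArray();

    static double Median(double[] xs)
    {
        double[] sorted = xs.OrderBy(x => x).ToArray();
        return sorted[sorted.Length / 2];   // 15 samples -> middle element
    }

    // Median(Measurements(5))  == 1.0  -> the "fast" mode is reported
    // Median(Measurements(10)) == 2.0  -> the "slow" mode is reported
}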
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen): Default [link], PGO [link]. Locally:
Suspect this is another case where we run out of budget during inlining. The root-level method that contains all the benchmark bits is and the respective ETL profiles are
These are tricky to match up because of drastic inlining differences, but if we just look at top level call sites in the root method, we see
So with PGO we run out of budget for a couple of top-level methods that feature prominently in the profile. As before, let me try upping the budget to see what that does... I am not suggesting this as a fix, but it might tempt us to look at some kind of priority-first inlining process where we try to make sure we're maximizing the effectiveness of the budget. Looks like a budget value of 18 is sufficient to regain perf. That's quite a jump from where we are now. The non-PGO version is 2811 bytes; the PGO version with budget=18 is 4664. |
System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512): Default [link], PGO [link]. There are several jumps in the PGO history. The first one, on Feb 9, was likely a result of #81557; not clear yet why this helped non-PGO but hurt PGO. The other jumps are harder to pin down; the one on March 26 is seemingly because of #83484, which we have seen implicated in other changes, but it is hard to see a connection. Locally I see PGO as frequently faster or at par.
Profiling shows all time is in these two methods
Going to defer further investigation on this one. |
Benchstone.BenchI.MulMatrix.Test: Default [link], PGO [link]. Default:
PGO:
All the time in this benchmark is in one method:
With PGO we actually jit this method 7 times (#84517 tracks fixing some of this). Only the last one matters for benchmark perf; all the others are no longer active when BDN is measuring (though likely some are active during the warmup/pilot stages, and might perturb BDN's choice of iteration strategy). However, measuring with
The dynamic profile only reflects the first loop or two of the method, because those loops are hot enough that we escape to OSR before they finish iterating. Thus at Tier1 we see an incomplete profile where many blocks are zero that we know must be executed. This is likely the root cause of the regression from #70941; before then we would instrument the OSR method and so have a more representative picture of profile data.
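A rough sketch (hypothetical code, not the actual MulMatrix source) of why the Tier0 profile is incomplete: the early loops are hot enough to trigger OSR, execution escapes to an uninstrumented OSR compile, and everything after that point records zero counts:

static class OsrProfileSketch
{
    static void Work(int[,] a, int[,] b, int[,] c, int n)
    {
        // Counts are recorded here until the instrumented Tier0 code escapes to OSR...
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i, j] = 0;

        // ...so these later loops show zero counts in the profile, even though
        // they must execute whenever the method runs.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i, j] += a[i, k] * b[k, j];
    }
}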
This looks like a case that should have been handled by #84312 -- because the instrumented methods never return, profile incorporation thinks the entry block weight is zero, and schedules repairs; these run but fail to populate the remainder of the method with interesting counts.
The problem seems to be that we still use the edge likelihoods laid down by the profile incorporation, and so think the loops that never exit actually never exit, despite seeing that they have high counts. Some ideas on how to resolve this:
If we go back and adjust those likelihoods, we may also run into complications from cases like:
while (true)
{
    if (p) throw new Exception();
    ...
}
where the loop is postdominated by a throw; synthesis repair will want to make the throw have nonzero weight. Ultimately, we will have to permit throws to have nonzero weights. Testing some enhancements to profile repair that look promising on this case and don't seem too disruptive overall.
Update 4/16: official charts are slow to update, but I can pull the data myself now. The PGO regression is largely mitigated:
And now the "official chart": |
LinqBenchmarks.Count00ForX: Default [link] -- hard to tell from this, but default is at about 122 with error bars extending +/- 10. PGO [link] -- note the y-axis scale differs from above; PGO is pretty consistently at 163. At first blush I can't repro this locally. My machine is somewhat faster than the lab, so I assume the default case is running slowly for me locally.
However, looking at the distribution of measurements, it looks like the default case is just getting unlucky and hitting its slow variant more often for me, and there's a set of runs that are consistently faster than PGO. So I will keep digging. It looks like both modes end up measuring OSR code, but default switches to Tier1 and PGO never makes it there, in part because we redo Tier0/OSR over again for the main method (#84319, fix is close).
If you crank up the iteration count, then PGO looks quite a bit better. Here we are with 100 iterations:
Looking at the benchmark sources, this is a test where we manually set the iteration count; as a result each invocation of the benchmark method takes a long time, and so BDN decides not to call it very often per iteration (just twice).
So reaching the 30-call threshold requires some number of iterations (looks like we do 10 or so warming up). Once we fix #84517 I expect this benchmark to improve some, but we might also consider removing the internal iterations and letting BDN pick the counts for us; see the sketch below. |
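A sketch of the two shapes (hypothetical benchmark, not the actual LinqBenchmarks source): with a manual internal loop, each invocation is long, so BDN only invokes the method a couple of times per iteration and the 30-call tiering threshold is reached very late; dropping the internal loop lets BDN pick the invocation count:

using System.Linq;
using BenchmarkDotNet.Attributes;

public class Count00Sketch
{
    // Long per-invocation time -> BDN calls this only a couple of times per iteration.
    [Benchmark]
    public int CountWithInternalIterations()
    {
        int count = 0;
        for (int iter = 0; iter < 1_000_000; iter++)   // manual internal iterations
            count = Enumerable.Range(0, 100).Count(x => x >= 0);
        return count;
    }

    // Short per-invocation time -> BDN chooses a high invocation count, and the
    // method crosses the tiering call-count threshold during warmup.
    [Benchmark]
    public int CountPerInvocation()
        => Enumerable.Range(0, 100).Count(x => x >= 0);
}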
Benchstone.MDBenchI.MDLogicArray.Test: Default [link], PGO [link]. The recent regression is likely from #84312. This test also suffers from repeated tiering up (#84517).
However, it's not clear that is the root cause of the perf issues. Profiles show
(here So with PGO that final OSR compile evidently inlines Suspect this is the same issue as in
So let's try with the fix from #84817:
Things look better but still not at par. Will move on to another benchmark; may come back to this one. Lab charts are finally getting updated; here's the latest: |
Benchstone.BenchI.CSieve.Test: Default [link], PGO [link]
Performance seems to be at parity now |
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj: Default [link], PGO [link]
PGO seems comparable |
System.Perf_Convert.FromBase64String: Default [link], PGO [link]
Can repro locally:
Looks like, with PGO, this may be a case of the issue noted in #84319, where when we invert loops we don't get the profile updates done properly. Here we have a hot loop:
Without PGO we end up compacting this (somehow), but with PGO we end up leaving one of the cold blocks inline:
;; default
align [12 bytes for IG03]
G_M650_IG03: ;; offset=0020H
movzx r9, word ptr [rcx]
add rcx, 2
cmp r9d, 32
jbe SHORT G_M650_IG12
;; size=14 bbWeight=348.23 PerfScore 1218.79
G_M650_IG04: ;; offset=002EH
cmp r9d, 61
je SHORT G_M650_IG11
;; size=6 bbWeight=347.83 PerfScore 434.79
G_M650_IG05: ;; offset=0034H
cmp rcx, rax
jb SHORT G_M650_IG03
...
G_M650_IG11: ;; offset=0069H **COLD**
dec edx
inc r8d
jmp SHORT G_M650_IG05
;; pgo
align [8 bytes for IG06]
G_M36625_IG06: ;; offset=0060H
movzx rax, word ptr [rdx]
add rdx, 2
cmp eax, 32
jbe G_M36625_IG15
cmp eax, 61
jne SHORT G_M36625_IG08
;; size=21 bbWeight=35.07 PerfScore 166.59
G_M36625_IG07: ;; offset=0075H **COLD**
dec ecx
inc ebp
;; size=4 bbWeight=0.98 PerfScore 0.49
G_M36625_IG08: ;; offset=0079H
cmp rdx, rbx
jb SHORT G_M36625_IG06
When we invert, the exit block weight ends up being too low:
Verified that the prospective fix from #84319 works:
Lab sees the fix now too: |
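For context on the #84319 issue above, a rough illustration of loop inversion (the C# below is just the conceptual shape, not the JIT's internal flowgraph): the top-of-loop test is duplicated in front of a bottom-tested loop, and the block weights on that duplicated test and on the loop exit have to be updated as part of the transformation:

using System;

static class LoopInversionSketch
{
    // Top-tested loop, as written in source.
    static void TopTested(Func<bool> cond, Action body)
    {
        while (cond()) body();
    }

    // After inversion: a zero-trip test guards a bottom-tested loop. The
    // profile weights of the guard and of the exit block must be recomputed;
    // #84319 is about those updates not being done properly.
    static void Inverted(Func<bool> cond, Action body)
    {
        if (cond())
        {
            do { body(); } while (cond());
        }
    }
}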
If we have a partial profile then the current count reconstruction will adjust the exit likelihood of some loop exit when it hits a capped loop. But for multiple exit loops we might wish to see some profile flow out of all exits, not just one. In `ludcmp` we choose to send all the profile weights down an early return path, leaving the bulk of the method with zero counts. Instead of trying increasingly elaborate repair schemes, we will now use blend mode for these sorts of problems; this gives a more balanced count redistribution. I also updated blend to use the same logic as repair if a block has zero weights, since presumably whatever likelihood was assigned there during reconstruction is not well supported. Fixes the `ludcmp` regression with PGO over no PGO, noted in #84264 (comment)
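A sketch of the multiple-exit-loop shape the commit message describes (hypothetical code, not the actual ludcmp source): with a capped partial profile, the old repair could route all the observed flow out the early-return exit, leaving everything after the loop with zero counts; blending redistributes the counts more evenly across both exits:

static class MultiExitLoopSketch
{
    static int Solve(double[][] a, int n)
    {
        for (int i = 0; i < n; i++)
        {
            // Early exit: reconstruction may decide all profile flow leaves here.
            if (a[i][i] == 0.0)
                return -1;

            // ... elimination work ...
        }

        // Normal exit: under the old repair scheme this path, and everything
        // after the loop, could be left with zero counts.
        return 0;
    }
}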
System.Memory.Span<Byte>.SequenceCompareToDifferent(Size: 4): Default [link], PGO [link]. The Feb 2023 regression seems to be #82412.
Looks like slightly different PGO data for an
and the prologs:
;; default
push rdi
push rsi
sub rsp, 72
;; pgo
push rdi
push rsi
push rbp
push rbx
sub rsp, 72 |
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB): Default [link], PGO [link]. The May 2022 (small) regression may have been #68804. Locally I can repro, but annoyingly when I profile, PGO is faster.
Note the codegen for
My only guess is that perhaps we now have some PGO data. On both Linux x64 and Ampere arm64, PGO is comparable or a bit faster. |
System.Collections.IndexerSet<Int32>.Dictionary(Size: 512): Default [link], PGO [link]. The Jan 2023 regression looks like #80481. Locally I have PGO consistently faster:
Both default and PGO have since improved and are showing comparable results in the lab. The improvement was one or more of:
The PGO improvement was likely from #85805 |
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!"): Note that in a Windows batch script you may need to use
to get this properly passed to BenchmarkDotNet. Default [link]: the November 2022 improvement was #78593. PGO [link]: the November 2022 regression was #78593. So, oddly enough, these are reacting oppositely to some changes. I see PGO as faster or at least comparable locally.
|
The above covers the initial screen of the 50 worst ratios (as of 3 weeks ago). I am going to add in a few other cases based on a refreshed and slightly broadened query. Broadened: the results above excluded any test that had less than a 2ns reported time. A lot of HW intrinsic/vector tests come in under this and report slower with PGO. So let's look at a few.
System.Numerics.Tests.Perf_Vector3.DotBenchmark: Default [link], PGO [link]. Seems comparable; the PGO result is just a hair greater than zero, but default is zero, so the ratio is infinite. |
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark: Default [link], PGO [link]. The regression appears to be from #83036, which seems odd. This one is still a mystery. Default and PGO seem to have identical code for the benchmark methods, but with release bits PGO reports consistently worse performance. If I drop a checked JIT into the release PGO build, it speeds up and matches the default perf. Suspecting something like the following -- the workload action unroll for release PGO ends up always calling the precode somehow, and this never gets updated/backpatched? Looks like the issue was that the JIT was not fully initializing some of the SIMD12 vector constants used by this code, causing the high float to pick up random values that might be denorms, which cause huge stalls. Fix is #85362 |
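A minimal sketch of the benchmark shape (hypothetical, not the actual Perf_Vector3 code): Vector3 is a 12-byte SIMD12 value that lives in a 16-byte register, so a constant like the normal below has a hidden fourth lane that the JIT must zero; if it is left uninitialized it can hold a denormal and stall the math:

using System.Numerics;

static class ReflectSketch
{
    static Vector3 Reflect(Vector3 incident)
    {
        // 12-byte constant; the unused upper lane of the backing 16-byte
        // register must be zeroed by the JIT (#85362 fixed a case where it wasn't).
        Vector3 normal = new Vector3(0f, 1f, 0f);
        return Vector3.Reflect(incident, normal);
    }
}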
System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000): Default [link] -- the January 2023 improvement is #79346. PGO [link] -- the October 2022 regression is #70941. This repros locally:
But when I enable profiling
Seems like the "fast" run of this test executes R2R code and never tiers up: (also looks like something is broken with R2R rundown as neither PerfView nor my tool can resolve which R2R method this is). Think the issue is in BDN itself: dotnet/BenchmarkDotNet#2296 So, why are we looking at R2R code here? (Potentially if I can catch a slow run in the profiler, it may show hits at Tier0, OSR, or Tier1...will keep trying). Caught a slow run, and despite what I wrote above the "fast" runs look like they are indeed at Tier1.
The hot method is evidently R2R, and so we go to Tier1-instr with PGO, but it doesn't get called enough to ever finish tiering up. If I rev up the iterations...
As you can see, we need to get to iteration 33 to switch to Tier1 with PGO, but by default we only run 15 iterations. And once we switch we get basically the same perf as without PGO. BDN could probably do a better job here of warning you that if there are fewer than, say, 30 invocations (perhaps 60 to be safe) of your benchmark method, you may not see the results you're expecting, and you either need to drop the per-invocation time or up the invocation or iteration counts. |
Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty: Default [link] -- the April 2023 improvement was possibly #84582. PGO [link] -- the October 2022 improvement was #70941. Comparable for me locally:
|
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0): Default [link] -- the recent improvement was from #85277. PGO [link] -- the regression was from something in this range: 887c043...f92b9ef (74 commits). From Azure Data Explorer, the regression was after 2023-04-25T00:29:35Z and before 2023-04-25T03:46:39Z.
Tue Apr 25 03:46:39 2023 f74138f Jeremy Koritzinsky Enable ComWrappers for NativeAOT on non-Windows (#85000)
So it looks like the change that led to this was #85277.
Looks like all the time is spent in the benchmark method. Hopefully this will be easy to track down. Seems like what is happening is that with PGO we see a lot of cold blocks, and so we don't inline some methods that we inline without PGO. One of those takes a struct by address that is also used in hot code, and so we don't fully promote, we dependently promote. And then we make some poor choices in codegen that lead to what is likely a partial stall. Sampling shows a lot of time piling up in this little stretch of code:
G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0
mov eax, dword ptr [rsp+48H]
or al, byte ptr [rsp+49H]
jne G_M000_IG14
For some reason we do a wide load after a narrow store; my recollection is that the HW does not like this at all. In the non-PGO case this code doesn't exist because we know all the values are zero. Looks like we widen this load in
PGO has now improved, thanks to #86491 fixing the store-forwarding stall. Default "regressed" because we enabled PGO by default, then improved with #86491. |
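A sketch of the store-to-load-forwarding hazard in the assembly above, expressed in C# with a hypothetical explicit-layout struct (not the actual Utf8Formatter code): a narrow store immediately followed by a wider overlapping load cannot be forwarded from the store buffer, so the load stalls:

using System.Runtime.InteropServices;

static class StoreForwardSketch
{
    [StructLayout(LayoutKind.Explicit)]
    struct Overlapped
    {
        [FieldOffset(0)] public ushort Low;    // 2-byte field
        [FieldOffset(0)] public uint   Whole;  // 4-byte field overlapping Low
    }

    static bool NarrowStoreWideLoad()
    {
        Overlapped o = default;
        o.Low = 0;               // narrow (16-bit) store, like `mov word ptr [rsp+48H], 0`
        return o.Whole != 0;     // wide (32-bit) load of the same bytes: forwarding stall
    }
}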
Closing this; remaining regressions will be analyzed in #87194. |
Issues uncovered:
- `-p ETW` is not enabling the CLR rundown provider (dotnet/BenchmarkDotNet#2296)
Fixes: