JIT: widening stack memory loads can cause store-forward stalls #85957

AndyAyersMS · 2023-05-09T01:55:40Z

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment))

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0    // narrow store (struct init)
       mov      eax, dword ptr [rsp+48H]  // wide load
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:

fgMorphTree BB06, STMT00014 (before)
               [000065] --C--------                         *  JTRUE     void  
               [000064] --C--------                         \--*  EQ        int   
               [000161] -----------                            +--*  EQ        int   
               [000159] -----------                            |  +--*  OR        int   
               [000155] -----------                            |  |  +--*  LCL_VAR   ubyte (AX) V58 tmp54        
               [000158] -----------                            |  |  \--*  LCL_VAR   ubyte (AX) V59 tmp55        
               [000160] -----------                            |  \--*  CNS_INT   int    0
               [000063] -----------                            \--*  CNS_INT   int    0

fgMorphTree BB06, STMT00014 (after)
               [000065] ----G+-----                         *  JTRUE     void  
               [000161] J---G+-N---                         \--*  NE        int   
               [000624] ----G+-----                            +--*  CAST      int <- ubyte <- int
               [000159] ----G------                            |  \--*  OR        int   
               [000155] ----G+-----                            |     +--*  LCL_VAR   int   (AX) V58 tmp54        
               [000158] ----G+-----                            |     \--*  LCL_VAR   int   (AX) V59 tmp55        
               [000160] -----+-----                            \--*  CNS_INT   int    0

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       movzx    rax, byte  ptr [rsp+48H]   // narrow load, extended
       movzx    rcx, byte  ptr [rsp+49H]
       or       eax, ecx
       jne      G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that lead to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.

cc @dotnet/jit-contrib

The text was updated successfully, but these errors were encountered:

ghost · 2023-05-09T01:55:48Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment))

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0    // narrow store (struct init)
       mov      eax, dword ptr [rsp+48H]  // wide load
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:

fgMorphTree BB06, STMT00014 (before)
               [000065] --C--------                         *  JTRUE     void  
               [000064] --C--------                         \--*  EQ        int   
               [000161] -----------                            +--*  EQ        int   
               [000159] -----------                            |  +--*  OR        int   
               [000155] -----------                            |  |  +--*  LCL_VAR   ubyte (AX) V58 tmp54        
               [000158] -----------                            |  |  \--*  LCL_VAR   ubyte (AX) V59 tmp55        
               [000160] -----------                            |  \--*  CNS_INT   int    0
               [000063] -----------                            \--*  CNS_INT   int    0

fgMorphTree BB06, STMT00014 (after)
               [000065] ----G+-----                         *  JTRUE     void  
               [000161] J---G+-N---                         \--*  NE        int   
               [000624] ----G+-----                            +--*  CAST      int <- ubyte <- int
               [000159] ----G------                            |  \--*  OR        int   
               [000155] ----G+-----                            |     +--*  LCL_VAR   int   (AX) V58 tmp54        
               [000158] ----G+-----                            |     \--*  LCL_VAR   int   (AX) V59 tmp55        
               [000160] -----+-----                            \--*  CNS_INT   int    0

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       movzx    rax, byte  ptr [rsp+48H]   // narrow load, extended
       movzx    rcx, byte  ptr [rsp+49H]
       or       eax, ecx
       jne      G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that lead to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.

cc @dotnet/jit-contrib

Author:	AndyAyersMS
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS · 2023-05-15T15:39:47Z

@dotnet/jit-contrib would like to see this get fixed for 8.0, as it causes a big regression when PGO is enabled.

AndyAyersMS · 2023-05-18T22:25:59Z

I'm probably not the best person to fix this, so I will probably reassign.

EgorBo · 2023-05-18T22:43:16Z

Presumably, it's not the only problem with that benchmark, I also see exceeds budget problem

Inlines into Program:FormatterUInt64(ulong):bool:this:
  [INLINED: below ALWAYS_INLINE size] System.Span`1[ubyte]:op_Implicit(ubyte[]):System.Span`1[ubyte]
    [INLINED: aggressive inline attribute] System.Span`1[ubyte]:.ctor(ubyte[]):this
  [INLINED: aggressive inline attribute] System.Buffers.Text.Utf8Formatter:TryFormat(ulong,System.Span`1[ubyte],byref,System.Buffers.StandardFormat):bool
    [INLINED: profitable inline] System.Buffers.StandardFormat:get_IsDefault():bool:this
    [FAILED: too many il bytes] System.Number:TryUInt64ToDecStr[ubyte](ulong,System.Span`1[ubyte],byref):bool
    [INLINED: below ALWAYS_INLINE size] System.Buffers.StandardFormat:get_Symbol():ushort:this
    [INLINED: profitable inline] System.Buffers.StandardFormat:get_PrecisionOrZero():ubyte:this
    [INLINED: profitable inline] System.Number:TryUInt64ToDecStr[ubyte](ulong,int,System.Span`1[ubyte],byref):bool
      [INLINED: aggressive inline attribute] System.Buffers.Text.FormattingHelpers:CountDigits(ulong):int
        [INLINED: aggressive inline attribute] System.ReadOnlySpan`1[ubyte]:.ctor(ulong,int):this
          [INLINED: below ALWAYS_INLINE size] System.Runtime.CompilerServices.RuntimeHelpers:IsReferenceOrContainsReferences[ubyte]():bool
          [FAILED: unprofitable inline] System.ThrowHelper:ThrowInvalidTypeWithPointersNotSupported(System.Type)
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ubyte](System.ReadOnlySpan`1[ubyte]):byref
        [INLINED: aggressive inline attribute] System.Numerics.BitOperations:Log2(ulong):int
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ulong](System.ReadOnlySpan`1[ulong]):byref
      [INLINED: below ALWAYS_INLINE size] System.Math:Max(int,int):int
      [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ubyte](System.Span`1[ubyte]):byref
      [INLINED: aggressive inline attribute] System.Number:UInt64ToDecChars[ubyte](ulong,ulong):ulong
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: aggressive inline attribute] System.Math:DivRem(ulong,ulong):System.ValueTuple`2[ulong,ulong]
          [INLINED: below ALWAYS_INLINE size] System.ValueTuple`2[ulong,ulong]:.ctor(ulong,ulong):this
        [INLINED: aggressive inline attribute] System.Number:WriteTwoDigits[ubyte](uint,ulong)
          [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
            [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
              [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
          [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
            [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
              [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
          [INLINED: aggressive inline attribute] System.Runtime.CompilerServices.Unsafe:CopyBlockUnaligned(byref,byref,uint)
        [INLINED: aggressive inline attribute] System.Number:WriteTwoDigits[ubyte](uint,ulong)
          [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
          [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
          [FAILED: inline exceeds budget] System.Runtime.CompilerServices.Unsafe:CopyBlockUnaligned(byref,byref,uint)
        [FAILED: inline exceeds budget] System.Byte:System.IUtfChar<System.Byte>.CastFrom(ulong):ubyte
      [FAILED: inline exceeds budget] System.Number:UInt64ToDecChars[ubyte](ulong,ulong,int):ulong
      [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_Symbol():ushort:this
    [FAILED: inline exceeds budget] System.Number:GetHexBase(ushort):ushort
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_PrecisionOrZero():ubyte:this
    [FAILED: too many il bytes] System.Number:TryInt64ToHexStr[ubyte](long,ushort,int,System.Span`1[ubyte],byref):bool
    [FAILED: inline exceeds budget] System.Buffers.Text.FormattingHelpers:TryFormat[ulong](ulong,System.Span`1[ubyte],byref,System.Buffers.StandardFormat):bool
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_HasPrecision():bool:this
    [FAILED: inline exceeds budget] System.Buffers.Text.Utf8Formatter:ThrowGWithPrecisionNotSupported()
    [FAILED: inline exceeds budget] System.ThrowHelper:ThrowFormatException_BadFormatSpecifier()

(but it's possible because I use Checked + Asserts)

Ah, I only see this when Dynamic PGO is disabled 🤔

markples · 2023-05-18T23:10:32Z

@TIHan How does this relate to the widen-on-load, casting, etc., work that you've done? I'm thinking both specifically the codegen for this example but also in general does the general strategy lend itself to stalls like this?

The following is from code inspection so should be verified that the source hasn't changed, etc.:

This seems a bit suspicious. It appears (again, verify this, I'm assuming that the two byte reads come from this code but maybe there are different byte reads in the source) that the various TryFormats call the TryFormat with the hexMask, which starts with if (format.IsDefault), for which the relevant code is:

public readonly struct StandardFormat
        private readonly byte _format;
        private readonly byte _precision;
        public bool IsDefault => (_format | _precision) == 0;

From the snippet (with just a 2 byte initialization), it doesn't seem valid to be reading 4 bytes. However, maybe surrounding code is zeroing the other bytes and technically with a comparison against zero various changes could be made.

(In fact, for a comparison with zero, there is no need to or the values together - they could be read together and only that compared against zero. Perhaps this transformation is half of that (reading the whole thing but not discarding the second read)? But also the widening causes the stall so it would be more sensitive to the context, or widen the initialization, or ...)

AndyAyersMS · 2023-05-18T23:27:53Z

Presumably, it's not the only problem with that benchmark,

a point fix that disables this (widening) for normalize on load locals recovers the missing perf.

@EgorBo in this case I happen to know this narrow store/wide load is the culprit. What PGO does here is block independent promotion of this two-byte struct (because we now don't inline some cold calls), so the struct remains on the stack.

AndyAyersMS · 2023-05-21T14:55:10Z

#86491 fixed the cases I know of, but there might be more...

AndyAyersMS · 2023-06-09T01:40:00Z

Since I don't know of other cases where this currently causes problems, will reset to future.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 9, 2023

ghost added the untriaged New issue has not been triaged by the area owner label May 9, 2023

AndyAyersMS mentioned this issue May 9, 2023

Investigate microbenchmarks that regress with PGO enabled #84264

Closed

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 15, 2023

AndyAyersMS added this to the 8.0.0 milestone May 15, 2023

JulieLeeMSFT assigned AndyAyersMS May 18, 2023

EgorBo mentioned this issue May 19, 2023

Remove fgMorphCastedBitwiseOp #86491

Merged

AndyAyersMS modified the milestones: 8.0.0, Future Jun 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: widening stack memory loads can cause store-forward stalls #85957

JIT: widening stack memory loads can cause store-forward stalls #85957

AndyAyersMS commented May 9, 2023

ghost commented May 9, 2023

AndyAyersMS commented May 15, 2023

AndyAyersMS commented May 18, 2023

EgorBo commented May 18, 2023 •

edited

Loading

markples commented May 18, 2023

AndyAyersMS commented May 18, 2023

AndyAyersMS commented May 21, 2023

AndyAyersMS commented Jun 9, 2023

JIT: widening stack memory loads can cause store-forward stalls #85957

JIT: widening stack memory loads can cause store-forward stalls #85957

Comments

AndyAyersMS commented May 9, 2023

ghost commented May 9, 2023

AndyAyersMS commented May 15, 2023

AndyAyersMS commented May 18, 2023

EgorBo commented May 18, 2023 • edited Loading

markples commented May 18, 2023

AndyAyersMS commented May 18, 2023

AndyAyersMS commented May 21, 2023

AndyAyersMS commented Jun 9, 2023

EgorBo commented May 18, 2023 •

edited

Loading