Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: widening stack memory loads can cause store-forward stalls #85957

Open
AndyAyersMS opened this issue May 9, 2023 · 8 comments
Open

JIT: widening stack memory loads can cause store-forward stalls #85957

AndyAyersMS opened this issue May 9, 2023 · 8 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment))

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0    // narrow store (struct init)
       mov      eax, dword ptr [rsp+48H]  // wide load
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:

fgMorphTree BB06, STMT00014 (before)
               [000065] --C--------                         *  JTRUE     void  
               [000064] --C--------                         \--*  EQ        int   
               [000161] -----------                            +--*  EQ        int   
               [000159] -----------                            |  +--*  OR        int   
               [000155] -----------                            |  |  +--*  LCL_VAR   ubyte (AX) V58 tmp54        
               [000158] -----------                            |  |  \--*  LCL_VAR   ubyte (AX) V59 tmp55        
               [000160] -----------                            |  \--*  CNS_INT   int    0
               [000063] -----------                            \--*  CNS_INT   int    0

fgMorphTree BB06, STMT00014 (after)
               [000065] ----G+-----                         *  JTRUE     void  
               [000161] J---G+-N---                         \--*  NE        int   
               [000624] ----G+-----                            +--*  CAST      int <- ubyte <- int
               [000159] ----G------                            |  \--*  OR        int   
               [000155] ----G+-----                            |     +--*  LCL_VAR   int   (AX) V58 tmp54        
               [000158] ----G+-----                            |     \--*  LCL_VAR   int   (AX) V59 tmp55        
               [000160] -----+-----                            \--*  CNS_INT   int    0

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       movzx    rax, byte  ptr [rsp+48H]   // narrow load, extended
       movzx    rcx, byte  ptr [rsp+49H]
       or       eax, ecx
       jne      G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that lead to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.

cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 9, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 9, 2023
@ghost
Copy link

ghost commented May 9, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment))

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0    // narrow store (struct init)
       mov      eax, dword ptr [rsp+48H]  // wide load
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:

fgMorphTree BB06, STMT00014 (before)
               [000065] --C--------                         *  JTRUE     void  
               [000064] --C--------                         \--*  EQ        int   
               [000161] -----------                            +--*  EQ        int   
               [000159] -----------                            |  +--*  OR        int   
               [000155] -----------                            |  |  +--*  LCL_VAR   ubyte (AX) V58 tmp54        
               [000158] -----------                            |  |  \--*  LCL_VAR   ubyte (AX) V59 tmp55        
               [000160] -----------                            |  \--*  CNS_INT   int    0
               [000063] -----------                            \--*  CNS_INT   int    0

fgMorphTree BB06, STMT00014 (after)
               [000065] ----G+-----                         *  JTRUE     void  
               [000161] J---G+-N---                         \--*  NE        int   
               [000624] ----G+-----                            +--*  CAST      int <- ubyte <- int
               [000159] ----G------                            |  \--*  OR        int   
               [000155] ----G+-----                            |     +--*  LCL_VAR   int   (AX) V58 tmp54        
               [000158] ----G+-----                            |     \--*  LCL_VAR   int   (AX) V59 tmp55        
               [000160] -----+-----                            \--*  CNS_INT   int    0

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       movzx    rax, byte  ptr [rsp+48H]   // narrow load, extended
       movzx    rcx, byte  ptr [rsp+49H]
       or       eax, ecx
       jne      G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that lead to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.

cc @dotnet/jit-contrib

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 15, 2023
@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib would like to see this get fixed for 8.0, as it causes a big regression when PGO is enabled.

@AndyAyersMS AndyAyersMS added this to the 8.0.0 milestone May 15, 2023
@AndyAyersMS
Copy link
Member Author

I'm probably not the best person to fix this, so I will probably reassign.

@EgorBo
Copy link
Member

EgorBo commented May 18, 2023

Presumably, it's not the only problem with that benchmark, I also see exceeds budget problem

Inlines into Program:FormatterUInt64(ulong):bool:this:
  [INLINED: below ALWAYS_INLINE size] System.Span`1[ubyte]:op_Implicit(ubyte[]):System.Span`1[ubyte]
    [INLINED: aggressive inline attribute] System.Span`1[ubyte]:.ctor(ubyte[]):this
  [INLINED: aggressive inline attribute] System.Buffers.Text.Utf8Formatter:TryFormat(ulong,System.Span`1[ubyte],byref,System.Buffers.StandardFormat):bool
    [INLINED: profitable inline] System.Buffers.StandardFormat:get_IsDefault():bool:this
    [FAILED: too many il bytes] System.Number:TryUInt64ToDecStr[ubyte](ulong,System.Span`1[ubyte],byref):bool
    [INLINED: below ALWAYS_INLINE size] System.Buffers.StandardFormat:get_Symbol():ushort:this
    [INLINED: profitable inline] System.Buffers.StandardFormat:get_PrecisionOrZero():ubyte:this
    [INLINED: profitable inline] System.Number:TryUInt64ToDecStr[ubyte](ulong,int,System.Span`1[ubyte],byref):bool
      [INLINED: aggressive inline attribute] System.Buffers.Text.FormattingHelpers:CountDigits(ulong):int
        [INLINED: aggressive inline attribute] System.ReadOnlySpan`1[ubyte]:.ctor(ulong,int):this
          [INLINED: below ALWAYS_INLINE size] System.Runtime.CompilerServices.RuntimeHelpers:IsReferenceOrContainsReferences[ubyte]():bool
          [FAILED: unprofitable inline] System.ThrowHelper:ThrowInvalidTypeWithPointersNotSupported(System.Type)
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ubyte](System.ReadOnlySpan`1[ubyte]):byref
        [INLINED: aggressive inline attribute] System.Numerics.BitOperations:Log2(ulong):int
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ulong](System.ReadOnlySpan`1[ulong]):byref
      [INLINED: below ALWAYS_INLINE size] System.Math:Max(int,int):int
      [INLINED: below ALWAYS_INLINE size] System.Runtime.InteropServices.MemoryMarshal:GetReference[ubyte](System.Span`1[ubyte]):byref
      [INLINED: aggressive inline attribute] System.Number:UInt64ToDecChars[ubyte](ulong,ulong):ulong
        [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
          [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
            [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
        [INLINED: aggressive inline attribute] System.Math:DivRem(ulong,ulong):System.ValueTuple`2[ulong,ulong]
          [INLINED: below ALWAYS_INLINE size] System.ValueTuple`2[ulong,ulong]:.ctor(ulong,ulong):this
        [INLINED: aggressive inline attribute] System.Number:WriteTwoDigits[ubyte](uint,ulong)
          [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
            [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
              [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
          [INLINED: profitable inline] System.Diagnostics.Debug:Assert(bool)
            [INLINED: below ALWAYS_INLINE size] System.Diagnostics.Debug:Assert(bool,System.String,System.String)
              [FAILED: noinline per IL/cached result] System.Diagnostics.Debug:Fail(System.String,System.String)
          [INLINED: aggressive inline attribute] System.Runtime.CompilerServices.Unsafe:CopyBlockUnaligned(byref,byref,uint)
        [INLINED: aggressive inline attribute] System.Number:WriteTwoDigits[ubyte](uint,ulong)
          [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
          [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
          [FAILED: inline exceeds budget] System.Runtime.CompilerServices.Unsafe:CopyBlockUnaligned(byref,byref,uint)
        [FAILED: inline exceeds budget] System.Byte:System.IUtfChar<System.Byte>.CastFrom(ulong):ubyte
      [FAILED: inline exceeds budget] System.Number:UInt64ToDecChars[ubyte](ulong,ulong,int):ulong
      [FAILED: inline exceeds budget] System.Diagnostics.Debug:Assert(bool)
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_Symbol():ushort:this
    [FAILED: inline exceeds budget] System.Number:GetHexBase(ushort):ushort
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_PrecisionOrZero():ubyte:this
    [FAILED: too many il bytes] System.Number:TryInt64ToHexStr[ubyte](long,ushort,int,System.Span`1[ubyte],byref):bool
    [FAILED: inline exceeds budget] System.Buffers.Text.FormattingHelpers:TryFormat[ulong](ulong,System.Span`1[ubyte],byref,System.Buffers.StandardFormat):bool
    [FAILED: inline exceeds budget] System.Buffers.StandardFormat:get_HasPrecision():bool:this
    [FAILED: inline exceeds budget] System.Buffers.Text.Utf8Formatter:ThrowGWithPrecisionNotSupported()
    [FAILED: inline exceeds budget] System.ThrowHelper:ThrowFormatException_BadFormatSpecifier()

(but it's possible because I use Checked + Asserts)

Ah, I only see this when Dynamic PGO is disabled 🤔

@markples
Copy link
Member

@TIHan How does this relate to the widen-on-load, casting, etc., work that you've done? I'm thinking both specifically the codegen for this example but also in general does the general strategy lend itself to stalls like this?

The following is from code inspection so should be verified that the source hasn't changed, etc.:

This seems a bit suspicious. It appears (again, verify this, I'm assuming that the two byte reads come from this code but maybe there are different byte reads in the source) that the various TryFormats call the TryFormat with the hexMask, which starts with if (format.IsDefault), for which the relevant code is:

public readonly struct StandardFormat
        private readonly byte _format;
        private readonly byte _precision;
        public bool IsDefault => (_format | _precision) == 0;

From the snippet (with just a 2 byte initialization), it doesn't seem valid to be reading 4 bytes. However, maybe surrounding code is zeroing the other bytes and technically with a comparison against zero various changes could be made.

(In fact, for a comparison with zero, there is no need to or the values together - they could be read together and only that compared against zero. Perhaps this transformation is half of that (reading the whole thing but not discarding the second read)? But also the widening causes the stall so it would be more sensitive to the context, or widen the initialization, or ...)

@AndyAyersMS
Copy link
Member Author

Presumably, it's not the only problem with that benchmark,

a point fix that disables this (widening) for normalize on load locals recovers the missing perf.

@EgorBo in this case I happen to know this narrow store/wide load is the culprit. What PGO does here is block independent promotion of this two-byte struct (because we now don't inline some cold calls), so the struct remains on the stack.

@AndyAyersMS
Copy link
Member Author

#86491 fixed the cases I know of, but there might be more...

newplot - 2023-05-21T075034 050

@AndyAyersMS
Copy link
Member Author

Since I don't know of other cases where this currently causes problems, will reset to future.

@AndyAyersMS AndyAyersMS modified the milestones: 8.0.0, Future Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

3 participants