JIT: widening stack memory loads can cause store-forward stalls #85957
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment)):

G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0 // narrow store (struct init)
mov eax, dword ptr [rsp+48H] // wide load
or al, byte ptr [rsp+49H]
jne G_M000_IG14

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp.

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0
movzx rax, byte ptr [rsp+48H] // narrow load, extended
movzx rcx, byte ptr [rsp+49H]
or eax, ecx
jne G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that led to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.

cc @dotnet/jit-contrib
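For readers who want to see why the two code shapes are interchangeable as far as the branch is concerned, here is a minimal C sketch that models both. It is an illustration only, not JIT code: `StackSlot`, `store_u16`, `nonzero_wide`, `nonzero_narrow`, `shapes_agree` and `check_all_values` are hypothetical names, the layout assumes a little-endian target such as x64, and a C compiler is free to emit different instructions, so this does not reproduce the stall itself. It only shows that the wide load discards the two bytes beyond the struct, because the `or` and the branch look at the low byte alone.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 4-byte region read at [rsp+48H];
   only the first two bytes belong to the struct. */
typedef struct {
    uint8_t bytes[4];
} StackSlot;

/* Narrow store: models `mov word ptr [rsp+48H], imm16`. */
static void store_u16(StackSlot *slot, uint16_t value) {
    memcpy(slot->bytes, &value, sizeof value);
}

/* Wide shape: one 4-byte load, then OR the low byte with the byte at +1.
   Assumes little-endian, so the low byte of the load is bytes[0]. */
static int nonzero_wide(const StackSlot *slot) {
    uint32_t wide;
    memcpy(&wide, slot->bytes, sizeof wide);              /* mov eax, dword ptr [..] */
    uint8_t al = (uint8_t)((uint8_t)wide | slot->bytes[1]); /* or al, byte ptr [..+1] */
    return al != 0;                                       /* jne */
}

/* Narrow shape: two zero-extended byte loads OR-ed together. */
static int nonzero_narrow(const StackSlot *slot) {
    uint32_t lo = slot->bytes[0];   /* movzx rax, byte ptr [..]   */
    uint32_t hi = slot->bytes[1];   /* movzx rcx, byte ptr [..+1] */
    return (lo | hi) != 0;          /* or eax, ecx; jne */
}

/* Returns 1 when both shapes agree for a struct holding `value`,
   with `garbage` in the two bytes that lie beyond the struct. */
static int shapes_agree(uint16_t value, uint8_t garbage) {
    StackSlot slot = { { 0, 0, garbage, garbage } };
    store_u16(&slot, value);
    return nonzero_wide(&slot) == nonzero_narrow(&slot);
}

/* Exhaustively checks every 16-bit value against two garbage patterns. */
static void check_all_values(void) {
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        assert(shapes_agree((uint16_t)v, 0x00));
        assert(shapes_agree((uint16_t)v, 0xFF));
    }
}
```

Calling `check_all_values()` from a test harness exercises all 65,536 struct values. The difference between the shapes is therefore purely one of performance: in the wide shape the 4-byte load is larger than the 2-byte store that precedes it, which is the size mismatch this issue describes.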
@dotnet/jit-contrib would like to see this get fixed for 8.0, as it causes a big regression when PGO is enabled.
I'm probably not the best person to fix this, so I will probably reassign.
Presumably it's not the only problem with that benchmark; I also see
(but it's possible because I use Checked + Asserts).
Ah, I only see this when Dynamic PGO is disabled 🤔
@TIHan How does this relate to the widen-on-load, casting, etc., work that you've done? I'm thinking both about the codegen for this specific example and, more generally, about whether the overall strategy lends itself to stalls like this.
The following is from code inspection, so it should be verified that the source hasn't changed, etc.:
This seems a bit suspicious. It appears (again, verify this; I'm assuming that the two byte reads come from this code, but maybe there are different byte reads in the source) that the various
From the snippet (with just a 2-byte initialization), it doesn't seem valid to be reading 4 bytes. However, maybe surrounding code is zeroing the other bytes, and technically, with a comparison against zero, various changes could be made. (In fact, for a comparison with zero, there is no need to
@EgorBo in this case I happen to know this narrow store/wide load is the culprit. What PGO does here is block independent promotion of this two-byte struct (because we now don't inline some cold calls), so the struct remains on the stack.
#86491 fixed the cases I know of, but there might be more...
Since I don't know of other cases where this currently causes problems, will reset to future.
The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.
One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment)).
This transformation comes about in morph, as part of fgMorphCastedBitwiseOp.
In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.
This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.
I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize on load locals, and that led to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.
cc @dotnet/jit-contrib