Update where and when vzeroupper is emitted #98261

Merged
merged 8 commits into from
Feb 13, 2024

tannergooding
Member

@tannergooding tannergooding commented Feb 10, 2024

This resolves #82132 and resolves #11496 and resolves #96211 and resolves #95954

The transition diagrams are shown below. The Intel optimization manual guidance in section 3.11.5.3, "Fixing Instruction Slowdowns", states:

Insert a VZEROUPPER to tell the hardware that the state of the higher registers is clean
between the VEX and the legacy SSE instructions. Often the best way to do this is to insert a
VZEROUPPER before returning from any function that uses VEX (that does not produce a VEX
register) and before any call to an unknown function.

Given the diagrams and this statement, we can come to two conclusions:

  1. We were emitting vzeroupper in cases it wasn't needed, such as prologues of methods
  2. We weren't emitting vzeroupper in cases it was needed, such as before p/invoke transitions

Essentially, any method compiled by the JIT during the lifetime of the program is known to be VEX-aware. Managed-to-managed calls between such methods are therefore safe regardless of whether UpperState=Dirty or UpperState=Clean: they do not need vzeroupper and incur no transition penalty.

Likewise, going from unmanaged to managed is also safe, because we are going from UpperState=Clean or UpperState=Dirty to UpperState=Dirty (assuming we aren't on a pre-Skylake microarchitecture where native code itself placed us in UpperState=PreservedNonInit), and thus no transition penalty exists.

The only case we really care about is managed to unmanaged (such as a P/Invoke), because in that scenario we cannot know whether the unmanaged code is VEX-aware. Thus, we need to emit vzeroupper before such calls (as the optimization manual guidance states) to ensure we aren't executing legacy-encoded instructions while UpperState=Dirty or UpperState=PreservedNonInit.

This consideration largely applies only to P/Invokes to user functions and does not apply to most JIT helpers. It additionally applies to calls from a managed method that was jitted during the execution of the program to a managed method that was compiled for R2R, which may target the legacy encoding.
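The reasoning above can be summarised as a small decision function. This is a minimal C++ sketch for illustration only: `CalleeKind` and `needsCleanUpperStateBeforeCall` are hypothetical names, not taken from the JIT sources, and the sketch simplifies by treating every JIT helper as safe, whereas the description only says "most". It assumes the caller is a method that was jitted at run time.

```cpp
// Hypothetical classification of the target of a call made from a method
// that was jitted at run time. These names are illustrative, not JIT types.
enum class CalleeKind
{
    JittedManaged, // compiled by the JIT at run time; known to be VEX-aware
    R2RManaged,    // precompiled for R2R; may target the legacy encoding
    JitHelper,     // runtime helper; simplification: always treated as safe
    Unmanaged      // user function reached through a P/Invoke; encoding unknown
};

// Returns true when the upper register state should be made clean
// (via vzeroupper) before control reaches the callee.
bool needsCleanUpperStateBeforeCall(CalleeKind callee)
{
    switch (callee)
    {
        case CalleeKind::Unmanaged:
        case CalleeKind::R2RManaged:
            // The callee may execute legacy-encoded SSE instructions.
            return true;
        case CalleeKind::JittedManaged:
        case CalleeKind::JitHelper:
            // The callee is assumed to be VEX-aware, so no penalty arises.
            return false;
    }
    return false; // not reached for valid enum values
}
```

The function only answers whether a clean upper state is required at the call boundary; where the vzeroupper is actually placed within the method is a separate decision.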

Older micro-architectures:
image

Skylake and newer micro-architectures:
image

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 10, 2024
@ghost ghost assigned tannergooding Feb 10, 2024
@ghost

ghost commented Feb 10, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@ryujit-bot

Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,620,764 contexts (360,162 MinOpts, 1,260,602 FullOpts).

MISSED contexts: 3,086 (0.19%)

Overall (-790,628 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 11,931,107 -19,269
benchmarks.run_pgo.linux.x64.checked.mch 57,210,208 -50,798
benchmarks.run_tiered.linux.x64.checked.mch 18,554,064 -38,326
coreclr_tests.run.linux.x64.checked.mch 247,128,973 -392,798
libraries.pmi.linux.x64.checked.mch 60,382,766 -116,383
libraries_tests.run.linux.x64.Release.mch 31,730,047 -30,736
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 130,006,281 -130,040
realworld.run.linux.x64.checked.mch 13,217,922 -11,925
smoke_tests.nativeaot.linux.x64.checked.mch 4,173,941 -353
MinOpts (-306,427 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 169,702 -975
benchmarks.run_pgo.linux.x64.checked.mch 17,746,512 -34,802
benchmarks.run_tiered.linux.x64.checked.mch 15,055,746 -34,283
coreclr_tests.run.linux.x64.checked.mch 140,366,881 -204,420
libraries.pmi.linux.x64.checked.mch 112,857 -42
libraries_tests.run.linux.x64.Release.mch 15,927,817 -20,049
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 10,583,855 -11,853
realworld.run.linux.x64.checked.mch 388,536 -3
FullOpts (-484,201 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 11,761,405 -18,294
benchmarks.run_pgo.linux.x64.checked.mch 39,463,696 -15,996
benchmarks.run_tiered.linux.x64.checked.mch 3,498,318 -4,043
coreclr_tests.run.linux.x64.checked.mch 106,762,092 -188,378
libraries.pmi.linux.x64.checked.mch 60,269,909 -116,341
libraries_tests.run.linux.x64.Release.mch 15,802,230 -10,687
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 119,422,426 -118,187
realworld.run.linux.x64.checked.mch 12,829,386 -11,922
smoke_tests.nativeaot.linux.x64.checked.mch 4,172,992 -353

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,999,231 contexts (587,594 MinOpts, 1,411,637 FullOpts).

MISSED contexts: 3,657 (0.18%)

Overall (-1,094,187 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 46,755,443 -66,423
benchmarks.run.windows.x64.checked.mch 11,726,687 -12,375
benchmarks.run_pgo.windows.x64.checked.mch 34,354,002 -66,859
benchmarks.run_tiered.windows.x64.checked.mch 19,448,991 -38,617
coreclr_tests.run.windows.x64.checked.mch 296,147,801 -497,185
libraries.pmi.windows.x64.checked.mch 67,659,390 -161,764
libraries_tests.run.windows.x64.Release.mch 42,430,197 -62,987
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 142,635,439 -167,708
realworld.run.windows.x64.checked.mch 14,768,267 -19,303
smoke_tests.nativeaot.windows.x64.checked.mch 5,049,682 -966
MinOpts (-483,299 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 18,488,740 -27,292
benchmarks.run.windows.x64.checked.mch 595 -3
benchmarks.run_pgo.windows.x64.checked.mch 18,836,696 -39,778
benchmarks.run_tiered.windows.x64.checked.mch 15,367,889 -34,929
coreclr_tests.run.windows.x64.checked.mch 185,774,390 -300,707
libraries.pmi.windows.x64.checked.mch 113,521 -42
libraries_tests.run.windows.x64.Release.mch 31,641,880 -54,124
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 10,782,870 -26,421
realworld.run.windows.x64.checked.mch 386,609 -3
FullOpts (-610,888 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 28,266,703 -39,131
benchmarks.run.windows.x64.checked.mch 11,726,092 -12,372
benchmarks.run_pgo.windows.x64.checked.mch 15,517,306 -27,081
benchmarks.run_tiered.windows.x64.checked.mch 4,081,102 -3,688
coreclr_tests.run.windows.x64.checked.mch 110,373,411 -196,478
libraries.pmi.windows.x64.checked.mch 67,545,869 -161,722
libraries_tests.run.windows.x64.Release.mch 10,788,317 -8,863
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 131,852,569 -141,287
realworld.run.windows.x64.checked.mch 14,381,658 -19,300
smoke_tests.nativeaot.windows.x64.checked.mch 5,048,735 -966


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,618,717 contexts (327,626 MinOpts, 1,291,091 FullOpts).

MISSED contexts: 11,022 (0.68%)

Overall (-504,237 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x86.checked.mch 11,115,500 -3,943
benchmarks.run_pgo.windows.x86.checked.mch 31,815,296 +27,317
benchmarks.run_tiered.windows.x86.checked.mch 13,989,178 -6,113
coreclr_tests.run.windows.x86.checked.mch 215,108,646 -367,256
libraries.pmi.windows.x86.checked.mch 50,246,165 -92,043
libraries_tests.run.windows.x86.Release.mch 14,793,337 -5,742
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 107,842,128 -47,463
realworld.run.windows.x86.checked.mch 11,479,674 -8,994
MinOpts (-200,802 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run_pgo.windows.x86.checked.mch 6,121,948 -3,295
benchmarks.run_tiered.windows.x86.checked.mch 6,854,637 -4,273
coreclr_tests.run.windows.x86.checked.mch 122,261,024 -189,664
libraries.pmi.windows.x86.checked.mch 95,233 -3
libraries_tests.run.windows.x86.Release.mch 5,490,195 -3,351
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 8,952,773 -213
realworld.run.windows.x86.checked.mch 295,714 -3
FullOpts (-303,435 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x86.checked.mch 11,115,022 -3,943
benchmarks.run_pgo.windows.x86.checked.mch 25,693,348 +30,612
benchmarks.run_tiered.windows.x86.checked.mch 7,134,541 -1,840
coreclr_tests.run.windows.x86.checked.mch 92,847,622 -177,592
libraries.pmi.windows.x86.checked.mch 50,150,932 -92,040
libraries_tests.run.windows.x86.Release.mch 9,303,142 -2,391
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 98,889,355 -47,250
realworld.run.windows.x86.checked.mch 11,183,960 -8,991


Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.03%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.03%
coreclr_tests.run.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.03%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch +0.01%
MinOpts (-0.00% to +0.08%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.05%
benchmarks.run_tiered.linux.x64.checked.mch +0.05%
coreclr_tests.run.linux.x64.checked.mch +0.03%
libraries.pmi.linux.x64.checked.mch +0.05%
libraries_tests.run.linux.x64.Release.mch +0.06%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.04%
realworld.run.linux.x64.checked.mch +0.08%
FullOpts (+0.00% to +0.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.01%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
coreclr_tests.run.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
realworld.run.linux.x64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (-0.00% to +0.03%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.01%
benchmarks.run.windows.x64.checked.mch +0.01%
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
coreclr_tests.run.windows.x64.checked.mch +0.02%
libraries.pmi.windows.x64.checked.mch +0.01%
libraries_tests.run.windows.x64.Release.mch +0.03%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.01%
realworld.run.windows.x64.checked.mch +0.01%
MinOpts (-0.01% to +0.07%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.05%
benchmarks.run.windows.x64.checked.mch +0.01%
benchmarks.run_pgo.windows.x64.checked.mch +0.04%
benchmarks.run_tiered.windows.x64.checked.mch +0.03%
coreclr_tests.run.windows.x64.checked.mch +0.03%
libraries.crossgen2.windows.x64.checked.mch -0.01%
libraries.pmi.windows.x64.checked.mch +0.05%
libraries_tests.run.windows.x64.Release.mch +0.05%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.03%
realworld.run.windows.x64.checked.mch +0.07%
smoke_tests.nativeaot.windows.x64.checked.mch -0.01%
FullOpts (-0.00% to +0.02%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.01%
benchmarks.run.windows.x64.checked.mch +0.01%
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
benchmarks.run_tiered.windows.x64.checked.mch +0.01%
coreclr_tests.run.windows.x64.checked.mch +0.02%
libraries.pmi.windows.x64.checked.mch +0.01%
libraries_tests.run.windows.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.01%
realworld.run.windows.x64.checked.mch +0.01%


Throughput diffs for windows/x86 ran on windows/x86

Overall (+0.00% to +0.04%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.02%
benchmarks.run_pgo.windows.x86.checked.mch +0.02%
benchmarks.run_tiered.windows.x86.checked.mch +0.03%
coreclr_tests.run.windows.x86.checked.mch +0.03%
libraries.pmi.windows.x86.checked.mch +0.01%
libraries_tests.run.windows.x86.Release.mch +0.04%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.03%
realworld.run.windows.x86.checked.mch +0.02%
MinOpts (+0.00% to +0.16%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.09%
benchmarks.run_pgo.windows.x86.checked.mch +0.07%
benchmarks.run_tiered.windows.x86.checked.mch +0.07%
coreclr_tests.run.windows.x86.checked.mch +0.06%
libraries.pmi.windows.x86.checked.mch +0.11%
libraries_tests.run.windows.x86.Release.mch +0.11%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.10%
realworld.run.windows.x86.checked.mch +0.16%
FullOpts (+0.00% to +0.03%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.02%
benchmarks.run_pgo.windows.x86.checked.mch +0.02%
benchmarks.run_tiered.windows.x86.checked.mch +0.02%
coreclr_tests.run.windows.x86.checked.mch +0.02%
libraries.pmi.windows.x86.checked.mch +0.01%
libraries_tests.run.windows.x86.Release.mch +0.03%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.03%
realworld.run.windows.x86.checked.mch +0.02%


Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.01% to +0.00%)
Collection PDIFF
benchmarks.run_tiered.linux.x64.checked.mch -0.01%
MinOpts (-0.05% to +0.02%)
Collection PDIFF
benchmarks.run_tiered.linux.x64.checked.mch -0.01%
coreclr_tests.run.linux.x64.checked.mch -0.01%
libraries.crossgen2.linux.x64.checked.mch -0.02%
libraries.pmi.linux.x64.checked.mch +0.01%
benchmarks.run.linux.x64.checked.mch -0.05%
realworld.run.linux.x64.checked.mch +0.02%
smoke_tests.nativeaot.linux.x64.checked.mch -0.01%
benchmarks.run_pgo.linux.x64.checked.mch -0.01%


@tannergooding
Member Author

tannergooding commented Feb 12, 2024

Diffs look better now. Still overwhelmingly an improvement, but now with fewer examples of regressions.

The regressions are places where we called a P/Invoke but didn't use any 256-bit or higher AVX in the method itself; the vzeroupper added there is what fixes the perf issues called out in the original post. In such scenarios, we "hoist" the vzeroupper into the prologue to avoid emitting it before every P/Invoke call.

The improvements are primarily places where we used floating-point/SIMD in the method. Previously we would always emit a vzeroupper in the prologue for these methods. However, this was unnecessary since the JIT always emits VEX-aware instructions and thus there is no penalty regardless of whether UpperState=Clean or UpperState=Dirty.

We continue emitting vzeroupper in the epilogue of methods that use any 256-bit or higher AVX, as is best practice according to the architecture manual. While this isn't strictly necessary for the JIT, since any managed caller will likely be VEX-aware itself, it does ensure that if we return to an R2R method or a native method, they won't incur any penalty. It likewise ensures that if we return to a method that needs to call a P/Invoke where the vzeroupper was hoisted, the "right stuff" happens.

We likewise continue emitting vzeroupper before any P/Invokes for a method that uses 256-bit or higher AVX in the method itself. We could do more flow analysis to hoist some of these as well, but it's likely not worth the complexity.
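The placement rules described in this comment can be sketched as a per-method decision. This is a rough C++ model under stated assumptions, not the JIT's implementation: `MethodFacts`, `VzeroupperPlacement`, and `placeVzeroupper` are hypothetical names introduced here, and the model ignores the R2R case and any JIT helper special-casing.

```cpp
// Hypothetical summary of what a method does; not the JIT's data structures.
struct MethodFacts
{
    bool usesWideAvx;  // uses any 256-bit or higher AVX instruction
    bool callsPInvoke; // contains at least one P/Invoke call site
};

// Where vzeroupper would be emitted for the method in this model.
struct VzeroupperPlacement
{
    bool inPrologue;
    bool inEpilogue;
    bool beforeEachPInvoke;
};

VzeroupperPlacement placeVzeroupper(const MethodFacts& method)
{
    VzeroupperPlacement placement{}; // all fields start as false

    if (method.usesWideAvx)
    {
        // The method can dirty the upper state itself, so clean it on
        // return and before every P/Invoke it makes.
        placement.inEpilogue        = true;
        placement.beforeEachPInvoke = method.callsPInvoke;
    }
    else if (method.callsPInvoke)
    {
        // The method never dirties the upper state, so a single hoisted
        // vzeroupper in the prologue covers all of its P/Invokes.
        placement.inPrologue = true;
    }
    // Otherwise no vzeroupper is emitted: managed callees are VEX-aware.

    return placement;
}
```

For example, `placeVzeroupper(MethodFacts{false, true})` selects only the prologue, which corresponds to the size regressions mentioned above, while a method with neither property gets no vzeroupper at all, which corresponds to the improvements.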

@tannergooding
Member Author

Reduced the TP impact a bit and limited it to only x64. This should be ready for review, @dotnet/jit-contrib

@ryujit-bot

Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,730,987 contexts (430,855 MinOpts, 1,300,132 FullOpts).

Overall (-823,764 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 12,567,522 -20,337
benchmarks.run_pgo.linux.x64.checked.mch 69,885,519 -66,168
benchmarks.run_tiered.linux.x64.checked.mch 23,156,301 -41,331
coreclr_tests.run.linux.x64.checked.mch 246,265,337 -408,042
libraries.pmi.linux.x64.checked.mch 60,776,347 -110,913
libraries_tests.run.linux.x64.Release.mch 32,207,624 -28,116
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 140,919,691 -136,986
realworld.run.linux.x64.checked.mch 13,946,490 -11,076
smoke_tests.nativeaot.linux.x64.checked.mch 4,232,799 -795
MinOpts (-338,370 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 199,298 -903
benchmarks.run_pgo.linux.x64.checked.mch 27,322,199 -48,459
benchmarks.run_tiered.linux.x64.checked.mch 18,767,019 -37,281
coreclr_tests.run.linux.x64.checked.mch 139,079,884 -217,170
libraries.pmi.linux.x64.checked.mch 112,857 -27
libraries_tests.run.linux.x64.Release.mch 20,750,846 -22,671
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 10,584,167 -11,856
realworld.run.linux.x64.checked.mch 388,157 -3
FullOpts (-485,394 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.linux.x64.checked.mch 12,368,224 -19,434
benchmarks.run_pgo.linux.x64.checked.mch 42,563,320 -17,709
benchmarks.run_tiered.linux.x64.checked.mch 4,389,282 -4,050
coreclr_tests.run.linux.x64.checked.mch 107,185,453 -190,872
libraries.pmi.linux.x64.checked.mch 60,663,490 -110,886
libraries_tests.run.linux.x64.Release.mch 11,456,778 -5,445
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch 130,335,524 -125,130
realworld.run.linux.x64.checked.mch 13,558,333 -11,073
smoke_tests.nativeaot.linux.x64.checked.mch 4,231,850 -795

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,837,795 contexts (509,217 MinOpts, 1,328,578 FullOpts).

MISSED contexts: 133 (0.01%)

Overall (-964,068 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 46,760,847 -64,902
benchmarks.run.windows.x64.checked.mch 8,752,020 -9,990
benchmarks.run_pgo.windows.x64.checked.mch 26,046,814 -43,281
benchmarks.run_tiered.windows.x64.checked.mch 12,793,606 -24,594
coreclr_tests.run.windows.x64.checked.mch 286,363,008 -479,850
libraries.pmi.windows.x64.checked.mch 62,025,027 -121,785
libraries_tests.run.windows.x64.Release.mch 35,353,949 -42,561
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 136,924,584 -157,551
realworld.run.windows.x64.checked.mch 14,214,707 -18,276
smoke_tests.nativeaot.windows.x64.checked.mch 5,089,751 -1,278
MinOpts (-422,088 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 18,490,815 -26,304
benchmarks.run.windows.x64.checked.mch 363 -3
benchmarks.run_pgo.windows.x64.checked.mch 11,756,366 -24,291
benchmarks.run_tiered.windows.x64.checked.mch 9,132,019 -20,877
coreclr_tests.run.windows.x64.checked.mch 179,104,349 -289,155
libraries.pmi.windows.x64.checked.mch 113,521 -27
libraries_tests.run.windows.x64.Release.mch 26,016,097 -35,004
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 10,511,309 -26,424
realworld.run.windows.x64.checked.mch 386,612 -3
FullOpts (-541,980 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 28,270,032 -38,598
benchmarks.run.windows.x64.checked.mch 8,751,657 -9,987
benchmarks.run_pgo.windows.x64.checked.mch 14,290,448 -18,990
benchmarks.run_tiered.windows.x64.checked.mch 3,661,587 -3,717
coreclr_tests.run.windows.x64.checked.mch 107,258,659 -190,695
libraries.pmi.windows.x64.checked.mch 61,911,506 -121,758
libraries_tests.run.windows.x64.Release.mch 9,337,852 -7,557
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 126,413,275 -131,127
realworld.run.windows.x64.checked.mch 13,828,095 -18,273
smoke_tests.nativeaot.windows.x64.checked.mch 5,088,804 -1,278


Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,485,481 contexts (265,979 MinOpts, 1,219,502 FullOpts).

Overall (-493,551 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x86.checked.mch 7,144,882 -4,548
benchmarks.run_pgo.windows.x86.checked.mch 31,085,893 +28,212
benchmarks.run_tiered.windows.x86.checked.mch 9,486,951 -5,766
coreclr_tests.run.windows.x86.checked.mch 207,102,883 -364,980
libraries.pmi.windows.x86.checked.mch 49,622,010 -82,632
libraries_tests.run.windows.x86.Release.mch 8,693,631 -1,512
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 104,061,821 -52,320
realworld.run.windows.x86.checked.mch 11,356,453 -10,005
MinOpts (-193,488 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run_pgo.windows.x86.checked.mch 3,953,221 -3,201
benchmarks.run_tiered.windows.x86.checked.mch 4,279,358 -3,624
coreclr_tests.run.windows.x86.checked.mch 117,693,910 -186,045
libraries.pmi.windows.x86.checked.mch 95,233 -3
libraries_tests.run.windows.x86.Release.mch 1,591,385 -450
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 8,675,049 -162
realworld.run.windows.x86.checked.mch 295,717 -3
FullOpts (-300,063 bytes)
Collection Base size (bytes) Diff size (bytes)
benchmarks.run.windows.x86.checked.mch 7,144,601 -4,548
benchmarks.run_pgo.windows.x86.checked.mch 27,132,672 +31,413
benchmarks.run_tiered.windows.x86.checked.mch 5,207,593 -2,142
coreclr_tests.run.windows.x86.checked.mch 89,408,973 -178,935
libraries.pmi.windows.x86.checked.mch 49,526,777 -82,629
libraries_tests.run.windows.x86.Release.mch 7,102,246 -1,062
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch 95,386,772 -52,158
realworld.run.windows.x86.checked.mch 11,060,736 -10,002


Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.00% to +0.02%)
Collection PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
benchmarks.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.01%
MinOpts (-0.01% to +0.07%)
Collection PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch +0.07%
libraries.crossgen2.linux.x64.checked.mch +0.03%
realworld.run.linux.x64.checked.mch +0.05%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
libraries.pmi.linux.x64.checked.mch +0.07%
libraries_tests.run.linux.x64.Release.mch +0.04%
coreclr_tests.run.linux.x64.checked.mch -0.01%
benchmarks.run_tiered.linux.x64.checked.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.02%
FullOpts (+0.00% to +0.02%)
Collection PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch +0.01%
benchmarks.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.01%
libraries.pmi.linux.x64.checked.mch +0.01%
libraries_tests.run.linux.x64.Release.mch +0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.01%


Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.01% to +0.04%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.03%
coreclr_tests.run.linux.x64.checked.mch +0.01%
libraries.crossgen2.linux.x64.checked.mch +0.04%
libraries.pmi.linux.x64.checked.mch +0.03%
libraries_tests.run.linux.x64.Release.mch +0.03%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.03%
realworld.run.linux.x64.checked.mch +0.02%
smoke_tests.nativeaot.linux.x64.checked.mch +0.03%
MinOpts (-0.00% to +0.10%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.03%
benchmarks.run_pgo.linux.x64.checked.mch +0.04%
benchmarks.run_tiered.linux.x64.checked.mch +0.03%
libraries.crossgen2.linux.x64.checked.mch +0.06%
libraries.pmi.linux.x64.checked.mch +0.08%
libraries_tests.run.linux.x64.Release.mch +0.05%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.03%
realworld.run.linux.x64.checked.mch +0.07%
smoke_tests.nativeaot.linux.x64.checked.mch +0.10%
FullOpts (+0.02% to +0.04%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.02%
benchmarks.run_pgo.linux.x64.checked.mch +0.02%
benchmarks.run_tiered.linux.x64.checked.mch +0.02%
coreclr_tests.run.linux.x64.checked.mch +0.02%
libraries.crossgen2.linux.x64.checked.mch +0.04%
libraries.pmi.linux.x64.checked.mch +0.03%
libraries_tests.run.linux.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.03%
realworld.run.linux.x64.checked.mch +0.02%
smoke_tests.nativeaot.linux.x64.checked.mch +0.03%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.01% to +0.03%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.02%
benchmarks.run.windows.x64.checked.mch +0.02%
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
coreclr_tests.run.windows.x64.checked.mch +0.01%
libraries.crossgen2.windows.x64.checked.mch +0.03%
libraries.pmi.windows.x64.checked.mch +0.02%
libraries_tests.run.windows.x64.Release.mch +0.03%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.03%
realworld.run.windows.x64.checked.mch +0.02%
smoke_tests.nativeaot.windows.x64.checked.mch +0.03%
MinOpts (-0.03% to +0.09%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.04%
benchmarks.run.windows.x64.checked.mch -0.03%
benchmarks.run_pgo.windows.x64.checked.mch +0.02%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
libraries.crossgen2.windows.x64.checked.mch +0.05%
libraries.pmi.windows.x64.checked.mch +0.08%
libraries_tests.run.windows.x64.Release.mch +0.04%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.01%
realworld.run.windows.x64.checked.mch +0.06%
smoke_tests.nativeaot.windows.x64.checked.mch +0.09%
FullOpts (+0.01% to +0.03%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.02%
benchmarks.run.windows.x64.checked.mch +0.02%
benchmarks.run_pgo.windows.x64.checked.mch +0.01%
benchmarks.run_tiered.windows.x64.checked.mch +0.02%
coreclr_tests.run.windows.x64.checked.mch +0.02%
libraries.crossgen2.windows.x64.checked.mch +0.03%
libraries.pmi.windows.x64.checked.mch +0.02%
libraries_tests.run.windows.x64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.03%
realworld.run.windows.x64.checked.mch +0.02%
smoke_tests.nativeaot.windows.x64.checked.mch +0.03%


Throughput diffs for windows/x86 ran on windows/x86

Overall (-0.02% to +0.02%)
Collection PDIFF
benchmarks.run_pgo.windows.x86.checked.mch +0.01%
coreclr_tests.run.windows.x86.checked.mch -0.02%
libraries.crossgen2.windows.x86.checked.mch +0.02%
libraries.pmi.windows.x86.checked.mch -0.01%
libraries_tests.run.windows.x86.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.01%
realworld.run.windows.x86.checked.mch -0.01%
MinOpts (-0.03% to +0.07%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.05%
coreclr_tests.run.windows.x86.checked.mch -0.03%
libraries.crossgen2.windows.x86.checked.mch +0.04%
libraries.pmi.windows.x86.checked.mch +0.05%
libraries_tests.run.windows.x86.Release.mch +0.07%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.05%
realworld.run.windows.x86.checked.mch +0.07%
FullOpts (-0.02% to +0.02%)
Collection PDIFF
benchmarks.run_pgo.windows.x86.checked.mch +0.01%
coreclr_tests.run.windows.x86.checked.mch -0.02%
libraries.crossgen2.windows.x86.checked.mch +0.02%
libraries.pmi.windows.x86.checked.mch -0.01%
libraries_tests.run.windows.x86.Release.mch +0.01%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch +0.01%
realworld.run.windows.x86.checked.mch -0.01%


@kunalspathak
Member

Not sure why windows-arm64 TP is affected:

image

Member

@kunalspathak kunalspathak left a comment

LGTM

@tannergooding
Member Author

Not sure why windows-arm64 TP is affected:

The tool is only measuring how many instructions were executed. This naturally fluctuates based on several factors and so it's always possible (although rare) that the tool reports additional TP changes for an architecture that wasn't touched.

In this case, the changes have all been made in xarch specific files or definitions (either *xarch.h, *xarch.cpp, or under #ifdef TARGET_XARCH), so there's nothing that could have actually changed for Arm64.
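To make the point about instruction-count noise concrete, here is a small C++ sketch of how a percentage difference of this kind could be computed and compared against a noise threshold. It assumes that PDIFF is the relative change in executed instructions between the base and diff JIT; the function names and the threshold parameter are hypothetical and not part of the actual tooling.

```cpp
#include <cmath>

// Relative change in executed instructions, in percent.
// Assumption: this mirrors what the PDIFF column reports.
double percentDiff(long long baseInstructions, long long diffInstructions)
{
    return 100.0 * static_cast<double>(diffInstructions - baseInstructions)
                 / static_cast<double>(baseInstructions);
}

// Returns true when a reported change is no larger than the run-to-run
// fluctuation one is willing to attribute to noise.
bool withinNoise(double pdiffPercent, double noiseThresholdPercent)
{
    return std::fabs(pdiffPercent) <= noiseThresholdPercent;
}
```

With made-up counts of 1,000,000 base instructions and 1,000,100 diff instructions, `percentDiff` gives roughly +0.01%, the same magnitude as the Arm64 entry being asked about.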

@tannergooding tannergooding merged commit 6d877c5 into dotnet:main Feb 13, 2024
137 of 139 checks passed
@tannergooding tannergooding deleted the vzeroupper branch February 13, 2024 16:09
@jakobbotsch
Member

I've also noticed that lately the arm64 variance in the TP jobs has been higher than previously. I should take a look at where that variance is coming from. The variance used to be significantly less than 0.01%.

@jnyrup
Contributor

jnyrup commented Feb 14, 2024

#82132 (comment)

This is a longstanding perf issue, but not a regression nor a correctness issue. Moving to .NET 9

With the two reported regressions for .NET 8 fixed by this PR is there a hope of meeting the bar for having this PR backported to .NET 8?

@tannergooding
Member Author

@jnyrup, my expectation is "no", but it would ultimately be up to @JulieLeeMSFT on whether or not we take it for a servicing bar check.

This is a general issue going back to .NET Framework, so it's not technically a regression. There were two new customer-reported scenarios in which it shows up in .NET 8, but they are just variations on the same general issue and show up primarily due to the context of broader code (user code + library code + user optimizations happen to trigger it for this scenario).

The fix here is relatively straightforward, but it's also not isolated and impacts a lot of code across the BCL. Because of this, it's possible that there are scenarios not covered or a particular microarchitecture this doesn't fix, so it's not easy to label it as "low risk". Given a couple of months' time, it might be easier to label this as "low risk", certainly after we get the first set of benchmark numbers in our weekly perf triage next Tuesday.

And then finally, there are some "workarounds" devs can do to "fix" this by utilizing knowledge of when the JIT emits vzeroupper. Most notably you can "force" the JIT to emit a vzeroupper before a P/Invoke by simply ensuring some V256 usage exists before the P/Invoke call. One example of this is the following, where you'd simply use _ = GetZero(); before the P/Invoke. This will force a call which emits vzeroupper and then never mutates the upper bits, ensuring you're in a "clean" state so that the penalty doesn't exist.

[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector128<float> GetZero() => Vector128<float>.Zero;

@github-actions github-actions bot locked and limited conversation to collaborators Mar 16, 2024