[Perf] Linux/arm64: 7 Regressions on 3/18/2024 9:28:21 PM #100090

Closed
performanceautofiler bot opened this issue Mar 21, 2024 · 6 comments
Labels: arch-arm64; area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI); os-linux (Linux OS, any supported distro); Priority:2 (work that is important, but not critical for the release); runtime-coreclr (specific to the CoreCLR runtime)
Milestone: 9.0.0

Comments

@performanceautofiler

performanceautofiler bot commented Mar 21, 2024

Run Information

Name Value
Architecture arm64
OS ubuntu 22.04
Queue AmpereUbuntu
Baseline 21f23f91a0c29d91f69013d561e82bd0960c95e1
Compare d8141743fecaeff4711112ca08a3226ac5d68011
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in System.Collections.TryGetValueFalse<Int32, Int32>

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
SortedDictionary(Size: 512) | 28.06 μs | 31.32 μs | 1.12 | 0.21 | False | | |

graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Collections.TryGetValueFalse<Int32, Int32>*'

System.Collections.TryGetValueFalse<Int32, Int32>.SortedDictionary(Size: 512)

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository


Regressions in System.Numerics.Tests.Perf_Matrix3x2

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
MultiplyByMatrixOperatorBenchmark | 6.29 ns | 8.20 ns | 1.30 | 0.04 | False | | |

graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_Matrix3x2*'

System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark

Regressions in System.Collections.ContainsKeyFalse<Int32, Int32>

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
SortedDictionary(Size: 512) | 28.34 μs | 31.06 μs | 1.10 | 0.25 | False | | |
IDictionary(Size: 512) | 4.04 μs | 5.08 μs | 1.26 | 0.05 | False | | |

graph
graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Collections.ContainsKeyFalse<Int32, Int32>*'

System.Collections.ContainsKeyFalse<Int32, Int32>.SortedDictionary(Size: 512)

System.Collections.ContainsKeyFalse<Int32, Int32>.IDictionary(Size: 512)

Regressions in System.Collections.IterateForEach<Int32>

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
FrozenSet(Size: 512) | 261.07 ns | 347.33 ns | 1.33 | 0.01 | True | | |
ImmutableStack(Size: 512) | 817.75 ns | 869.22 ns | 1.06 | 0.01 | True | | |

graph
graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Collections.IterateForEach<Int32>*'

System.Collections.IterateForEach<Int32>.FrozenSet(Size: 512)

System.Collections.IterateForEach<Int32>.ImmutableStack(Size: 512)

Regressions in System.Numerics.Tests.Perf_Vector3

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
LengthBenchmark | 0.00 ns | 1.34 ns | | 0.85 | False | | |
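
The Test/Base column in this report appears to be the test time divided by the baseline time, rounded to two decimals. The sketch below recomputes it from the reported values as a sanity check. The helper name `ratio_to_baseline` is hypothetical, pairing each row with a benchmark name by section order is an assumption, and treating a zero baseline as "no ratio" is only a guess at why the LengthBenchmark row has no Test/Base value; none of this is the autofiler's actual code.

```python
from typing import Optional

# (benchmark, baseline, test) taken from the report; each row uses a single unit.
# Row-to-name pairing follows the order of the benchmark sections (assumption).
REPORTED = [
    ("TryGetValueFalse<Int32, Int32>.SortedDictionary", 28.06, 31.32),   # μs
    ("Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark", 6.29, 8.20),    # ns
    ("ContainsKeyFalse<Int32, Int32>.SortedDictionary", 28.34, 31.06),   # μs
    ("ContainsKeyFalse<Int32, Int32>.IDictionary", 4.04, 5.08),          # μs
    ("IterateForEach<Int32>.FrozenSet", 261.07, 347.33),                 # ns
    ("IterateForEach<Int32>.ImmutableStack", 817.75, 869.22),            # ns
    ("Perf_Vector3.LengthBenchmark", 0.00, 1.34),                        # ns
]


def ratio_to_baseline(baseline: float, test: float) -> Optional[float]:
    """Return test/baseline rounded to two decimals, or None for a zero baseline."""
    if baseline == 0:
        return None  # assumption: the ratio is undefined, so nothing is reported
    return round(test / baseline, 2)


ratios = {name: ratio_to_baseline(base, test) for name, base, test in REPORTED}

for name, ratio in ratios.items():
    print(f"{name}: {'n/a' if ratio is None else format(ratio, '.2f')}")
```

The recomputed ratios agree with the six reported Test/Base values, and the seventh row yields no ratio.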

graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_Vector3*'

System.Numerics.Tests.Perf_Vector3.LengthBenchmark

@performanceautofiler bot added the arch-arm64, os-linux, runtime-coreclr, and untriaged labels Mar 21, 2024
@LoopedBard3 removed the untriaged label Mar 21, 2024
@LoopedBard3 transferred this issue from dotnet/perf-autofiling-issues Mar 21, 2024
@dotnet-issue-labeler bot added the needs-area-label label Mar 21, 2024
@dotnet-policy-service bot added the untriaged label Mar 21, 2024
@LoopedBard3
Member

Likely: #99783

@jeffschwMSFT added the area-CodeGen-coreclr label Mar 22, 2024
@JulieLeeMSFT added this to the 9.0.0 milestone Mar 22, 2024
@dotnet-policy-service bot removed the untriaged label Mar 22, 2024
@vcsjones removed the needs-area-label label Mar 25, 2024
@LoopedBard3
Member

LoopedBard3 commented Mar 26, 2024

@amanasifkhalid added the Priority:2 label May 3, 2024
@amanasifkhalid
Member

Notes | Recent Score | Orig Score | Per-platform recent / orig (columns: Ubuntu 2022.04 arm64, Ubuntu 2022.04 x64, Windows 2010.0.18362 x64, Windows 2010.0.22621 amd64) | Benchmark
| 1.87 | 1.87 | 1.87 / 1.87 | System.Collections.TryGetValueTrue(Int32, Int32).ConcurrentDictionary(Size: 512)
| 1.39 | 1.67 | 1.39 / 1.67 | System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 67 bytes, DiffMiddleByte)
| 1.36 | 1.47 | 1.36 / 1.47 | Benchstone.MDBenchI.MDArray2.Test
| 1.19 | 1.19 | 1.19 / 1.19 | System.Collections.CreateAddAndClear(String).Span(Size: 512)
| 1.12 | 1.29 | 1.12 / 1.29 | Benchstone.BenchI.Array2.Test
| 1.12 | 1.15 | 1.12 / 1.15 | System.Numerics.Tests.Perf_BigInteger.ToStringX(numberString: 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
| 1.11 | 1.30 | 1.11 / 1.30 | System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark
| 1.11 | 1.07 | 1.11 / 1.07 | System.Collections.ContainsKeyTrue(String, String).IDictionary(Size: 512)
| 1.09 | 1.08 | 1.09 / 1.08 | System.Collections.IterateFor(String).IList(Size: 512)
| 1.07 | 1.07 | 1.07 / 1.07 | System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_CascadingErrors
| 1.07 | 1.07 | 1.07 / 1.07 | System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_AllBitsDifferent
| 1.07 | 1.07 | 1.07 / 1.07 | System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_Equal
| 1.07 | 1.07 | 1.07 / 1.07 | System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_VersusZero
| 1.07 | 1.07 | 1.07 / 1.07 | System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_FirstBitDifferent
| 1.07 | 1.07 | 1.07 / 1.07 | System.Collections.TryGetValueTrue(Int32, Int32).ImmutableDictionary(Size: 512)
| 1.07 | 1.07 | 1.07 / 1.07 | System.Collections.ContainsKeyFalse(String, String).ImmutableDictionary(Size: 512)
| 1.03 | 1.15 | 1.03 / 1.15 | System.Collections.ContainsKeyFalse(String, String).Dictionary(Size: 512)
| 1.01 | 1.26 | 1.01 / 1.26 | System.Collections.ContainsKeyFalse(Int32, Int32).IDictionary(Size: 512)
| 1.01 | 1.21 | 1.01 / 1.21 | System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 259 bytes, DiffLastByte)
| 1.01 | 1.52 | 1.01 / 1.52 | System.Memory.Span(Byte).LastIndexOfAnyValues(Size: 33)
| 1.00 | 1.27 | 1.00 / 1.33; 1.01 / 1.22 | System.Collections.IterateForEach(Int32).FrozenSet(Size: 512)
| 1.00 | 1.06 | 1.00 / 1.06 | System.Collections.IterateForEach(Int32).ImmutableStack(Size: 512)
| 0.99 | 1.16 | 0.99 / 1.16 | System.Collections.TryGetValueFalse(String, String).IDictionary(Size: 512)
| 0.99 | 1.12 | 0.99 / 1.12 | System.Collections.TryGetValueFalse(String, String).Dictionary(Size: 512)
| 0.91 | 1.25 | 0.91 / 1.25 | System.Memory.Span(Int32).IndexOfValue(Size: 512)
| 0.90 | 1.21 | 0.90 / 1.21 | System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 259 bytes, Same)
| 0.89 | 1.78 | 0.89 / 1.78 | Struct.FilteredSpanEnumerator.Sum
| 0.82 | 1.12 | 0.82 / 1.12 | System.Collections.TryGetValueFalse(Int32, Int32).SortedDictionary(Size: 512)
| 0.81 | 1.25 | 0.81 / 1.25 | System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 259 bytes, DiffMiddleByte)
| 0.79 | 1.10 | 0.79 / 1.10 | System.Collections.ContainsKeyFalse(Int32, Int32).SortedDictionary(Size: 512)
| 0.67 | 1.27 | 0.67 / 1.27 | Struct.GSeq.FilterSkipMapSum
| 0.66 | 1.39 | 0.66 / 1.39 | System.Collections.IterateForEach(Int32).Dictionary(Size: 512)
| 0.55 | 1.16 | 0.55 / 1.16 | System.Collections.IterateForEach(Int32).HashSet(Size: 512)

Almost all of the Linux arm64 regressions resolved themselves. I'll look at newer data for the remaining ones.

@amanasifkhalid
Member

I took a closer look at each one with a recent score >=1.1:

System.Collections.TryGetValueTrue(Int32, Int32).ConcurrentDictionary(Size: 512):
image
Purple is Tiger Windows 10, Blue is Tiger Windows 11.

System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 67 bytes, DiffMiddleByte):
image
(The regressions table seemed to get confused by the different BigInteger.Equals benchmarks. This one has improved upon its baseline.)

image
This one hasn't moved much; probably worth looking further into.

image
This looks like it was fixed by the RPO-based block layout (perf issue), and then subsequent iterations regressed it (perf issue).

image
This one was fixed pretty quickly, and then churned by all the layout work.

image
This was similarly churned by block layout.

image
This regression was short-lived, and the new block layout seemed to improve upon it. That regression in mid-June looks like it could be #103972, though I don't see it listed there...

image
This one is quite noisy, though it looks like it improved around the time of the new block layout, and it's recently gone back up again.
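
For reference, the ">=1.1" cut used in this comment can be sketched as a filter over the scores table from the earlier comment. This is only an illustration: the `ROWS` list copies the first ten rows of that table by hand, and the names `ROWS`, `THRESHOLD`, and `still_regressed` are hypothetical rather than part of any perf tooling.

```python
# (recent score, original score, benchmark): the first ten rows of the scores
# table; the remaining rows all have a recent score of 1.07 or lower.
ROWS = [
    (1.87, 1.87, "System.Collections.TryGetValueTrue(Int32, Int32).ConcurrentDictionary(Size: 512)"),
    (1.39, 1.67, "System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 67 bytes, DiffMiddleByte)"),
    (1.36, 1.47, "Benchstone.MDBenchI.MDArray2.Test"),
    (1.19, 1.19, "System.Collections.CreateAddAndClear(String).Span(Size: 512)"),
    (1.12, 1.29, "Benchstone.BenchI.Array2.Test"),
    # The argument list of this benchmark is truncated in the table, so it is omitted.
    (1.12, 1.15, "System.Numerics.Tests.Perf_BigInteger.ToStringX"),
    (1.11, 1.30, "System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark"),
    (1.11, 1.07, "System.Collections.ContainsKeyTrue(String, String).IDictionary(Size: 512)"),
    (1.09, 1.08, "System.Collections.IterateFor(String).IList(Size: 512)"),
    (1.07, 1.07, "System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_CascadingErrors"),
]

THRESHOLD = 1.1

# Keep only benchmarks whose most recent score is still at or above the threshold.
still_regressed = [name for recent, _orig, name in ROWS if recent >= THRESHOLD]

for name in still_regressed:
    print(name)
```

With these rows the filter keeps eight benchmarks, the same number as the graphs discussed in this comment.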

@amanasifkhalid
Member

amanasifkhalid commented Jul 25, 2024

For Benchstone.MDBenchI.MDArray2.Test, after tiering up, the old layout is the following:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0.91       91 [000..019)-> BB03(1)                 (always)                     i IBC hascall
BB03 [0014]  2       BB01,BB10             9.09      909 [018..019)-> BB05(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB05 [0015]  2       BB03,BB08            90.91     9091 [018..019)-> BB07(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB07 [0016]  2       BB05,BB07            1000.   100000 [018..019)-> BB07(0.909),BB08(0.0909)  ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB08 [0062]  1       BB07                100.00    10000 [018..019)-> BB05(0.909),BB10(0.0909)  ( cond )                     i IBC bwd
BB10 [0063]  1       BB08                 10.00     1000 [018..019)-> BB03(0.909),BB12(0.0909)  ( cond )                     i IBC bwd
BB12 [0064]  1       BB10                  1.00      100 [018..06D)-> BB28(0.00595),BB15(0.994)   ( cond )                     i IBC loophead bwd
BB15 [0001]  2       BB12,BB26           153.66    15366 [020..024)-> BB17(1)                 (always)                     i IBC bwd bwd-target
BB17 [0002]  2       BB15,BB24            1741.   174141 [024..029)-> BB19(1)                 (always)                     i IBC loophead bwd bwd-target
BB19 [0003]  2       BB17,BB22           17492.  1749182 [029..02E)-> BB21(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB21 [0004]  2       BB19,BB21             194k 19378019 [02E..050)-> BB21(0.91),BB22(0.0903) ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB22 [0059]  1       BB21                19227.  1922741 [050..05C)-> BB19(0.909),BB24(0.0906)  ( cond )                     i IBC bwd
BB24 [0060]  1       BB22                 1915.   191483 [05C..065)-> BB17(0.92),BB26(0.0802) ( cond )                     i IBC bwd
BB26 [0061]  1       BB24                167.07    16707 [065..06A)-> BB15(0.00595),BB28(0.994)   ( cond )                     i IBC bwd
BB28 [0012]  2       BB12,BB26             0.91       91 [06D..06E)-> BB30(1)                 (always)                     i IBC
BB30 [0025]  2       BB28,BB38             9.09      909 [06D..06E)-> BB32(1)                 (always)                     i IBC loophead bwd bwd-target
BB32 [0026]  2       BB30,BB36            90.91     9091 [06D..06E)-> BB34(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB34 [0027]  2       BB32,BB35            1000.   100000 [06D..06E)-> BB42(0),BB35(1)         ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB35 [0029]  1       BB34                 1000.   100000 [06D..06E)-> BB34(0.909),BB36(0.0909)  ( cond )                     i IBC bwd
BB36 [0056]  1       BB35                100.00    10000 [06D..06E)-> BB32(0.909),BB38(0.0909)  ( cond )                     i IBC bwd
BB38 [0057]  1       BB36                 10.00     1000 [06D..06E)-> BB30(0.909),BB40(0.0909)  ( cond )                     i IBC bwd
BB40 [0058]  1       BB38                  1.00      100 [06D..06E)-> BB43(1)                 (always)                     i IBC
BB43 [0036]  2       BB40,BB42             1.00      100 [06D..077)                           (return)                     i IBC
BB44 [0065]  0                             0             [???..???)                           (throw )                     i rare keep internal
BB42 [0028]  1       BB34                  0           0 [06D..06E)-> BB43(1)                 (always)                     i IBC rare
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And the new layout is this:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0.91       91 [000..019)-> BB03(1)                 (always)                     i IBC hascall
BB03 [0014]  2       BB01,BB10             9.09      909 [018..019)-> BB05(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB05 [0015]  2       BB03,BB08            90.91     9091 [018..019)-> BB07(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB07 [0016]  2       BB05,BB07            1000.   100000 [018..019)-> BB07(0.909),BB08(0.0909)  ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB08 [0062]  1       BB07                100.00    10000 [018..019)-> BB05(0.909),BB10(0.0909)  ( cond )                     i IBC bwd
BB10 [0063]  1       BB08                 10.00     1000 [018..019)-> BB03(0.909),BB12(0.0909)  ( cond )                     i IBC bwd
BB26 [0061]  1       BB24                167.07    16707 [065..06A)-> BB12(1)                 (always)                     i IBC bwd
BB12 [0064]  2       BB10,BB26           168.07    16807 [018..06D)-> BB15(0.994),BB28(0.00595)   ( cond )                     i IBC loophead bwd
BB15 [0001]  1       BB12                153.66    15366 [020..024)-> BB17(1)                 (always)                     i IBC bwd bwd-target
BB17 [0002]  2       BB15,BB24            1741.   174141 [024..029)-> BB19(1)                 (always)                     i IBC loophead bwd bwd-target
BB19 [0003]  2       BB17,BB22           17492.  1749182 [029..02E)-> BB21(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB21 [0004]  2       BB19,BB21             194k 19378019 [02E..050)-> BB21(0.91),BB22(0.0903) ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB22 [0059]  1       BB21                19227.  1922741 [050..05C)-> BB19(0.909),BB24(0.0906)  ( cond )                     i IBC bwd
BB24 [0060]  1       BB22                 1915.   191483 [05C..065)-> BB17(0.92),BB26(0.0802) ( cond )                     i IBC bwd
BB28 [0012]  1       BB12                  0.91       91 [06D..06E)-> BB30(1)                 (always)                     i IBC
BB30 [0025]  2       BB28,BB38             9.09      909 [06D..06E)-> BB32(1)                 (always)                     i IBC loophead bwd bwd-target
BB32 [0026]  2       BB30,BB36            90.91     9091 [06D..06E)-> BB34(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB34 [0027]  2       BB32,BB35            1000.   100000 [06D..06E)-> BB35(1),BB42(0)         ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB35 [0029]  1       BB34                 1000.   100000 [06D..06E)-> BB34(0.909),BB36(0.0909)  ( cond )                     i IBC bwd
BB36 [0056]  1       BB35                100.00    10000 [06D..06E)-> BB32(0.909),BB38(0.0909)  ( cond )                     i IBC bwd
BB38 [0057]  1       BB36                 10.00     1000 [06D..06E)-> BB30(0.909),BB40(0.0909)  ( cond )                     i IBC bwd
BB40 [0058]  1       BB38                  1.00      100 [06D..06E)-> BB43(1)                 (always)                     i IBC
BB43 [0036]  2       BB40,BB42             1.00      100 [06D..077)                           (return)                     i IBC
BB42 [0028]  1       BB34                  0           0 [06D..06E)-> BB43(1)                 (always)                     i IBC rare
BB44 [0065]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The key difference is that BB26 is no longer converted from an unconditional jump into a conditional jump. In the old layout implementation, if we decided not to move a BBJ_ALWAYS block, we'd instead run Compiler::fgOptimizeBranch, which would try to clone the condition of the block's BBJ_COND successor and append it to the block; this yielded a better layout in this case. With the new layout, we never run Compiler::fgOptimizeBranch. It might be a good idea to add a check to fgUpdateFlowGraph so that, if we don't compact a BBJ_ALWAYS into its BBJ_COND successor, we instead try cloning the condition. It's probably not worth introducing churn into .NET 9 at this point, but we can definitely pursue this in .NET 10.
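
To make the transformation concrete, here is a toy sketch of condition cloning on a simplified flow graph. It is not the JIT's implementation: the `Block` class and the `clone_successor_condition` function are hypothetical, only the block names and successor lists are borrowed from the dumps above, and the real Compiler::fgOptimizeBranch presumably applies legality and profitability checks that are left out here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Block:
    name: str
    kind: str                                   # "always", "cond", or "return"
    succs: List[str] = field(default_factory=list)
    cond: Optional[str] = None                  # symbolic test, only for "cond"


def clone_successor_condition(blocks: Dict[str, Block], name: str) -> bool:
    """Turn an unconditional jump into a conditional one by copying the
    test of its conditional successor. Returns True if the block changed."""
    block = blocks[name]
    if block.kind != "always":
        return False
    succ = blocks[block.succs[0]]
    if succ.kind != "cond":
        return False
    # Duplicate the successor's test so that `block` branches straight to the
    # successor's targets instead of jumping to the successor first.
    block.kind = "cond"
    block.cond = succ.cond
    block.succs = list(succ.succs)
    return True


# Shape taken from the new layout: BB26 jumps to BB12, which tests and branches.
blocks = {
    "BB26": Block("BB26", "always", ["BB12"]),
    "BB12": Block("BB12", "cond", ["BB15", "BB28"], cond="loop test (placeholder)"),
    "BB15": Block("BB15", "always", ["BB17"]),
    "BB28": Block("BB28", "always", ["BB30"]),
}

changed = clone_successor_condition(blocks, "BB26")
print(changed, blocks["BB26"].kind, blocks["BB26"].succs)
```

After the call, BB26 is a conditional block whose successors are BB15 and BB28, which is the shape BB26 has in the old layout's dump.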

@amanasifkhalid
Member

Since the largest regressions have either resolved themselves, or improved and then regressed due to block layout churn, I don't think there's anything actionable here.
