Add fast path for linq count with predicate #102884

Merged (2 commits) on Jul 22, 2024

neon-sunset
Contributor

neon-sunset commented May 30, 2024

Fixes #102696

A simple change to take the fast path when the source is span-able. It also moves the non-span loop into a local function, making it easier for the JIT to inline the method together with the lambda, when available, at each call site. This avoids devirtualizing only a single lambda, which is what happens when the loop is not inlined.
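
A minimal sketch of the shape described above, not the code from this PR. `TryGetSpan` here is a hypothetical stand-in written for this example (the helper used inside System.Linq is internal, and its type checks are stricter than the pattern matches below), and `CountWithPredicate` and `EnumerableCount` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the internal TryGetSpan helper:
// only arrays and List<T> are treated as span-able in this sketch.
static bool TryGetSpan<T>(IEnumerable<T> items, out ReadOnlySpan<T> span)
{
    if (items is T[] array) { span = array; return true; }
    if (items is List<T> list) { span = CollectionsMarshal.AsSpan(list); return true; }
    span = default;
    return false;
}

static int CountWithPredicate<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(predicate);

    // Fast path: no enumerator allocation, no try/finally around the loop.
    if (TryGetSpan(source, out ReadOnlySpan<T> span))
    {
        int count = 0;
        foreach (T item in span)
        {
            if (predicate(item)) count++;
        }
        return count;
    }

    return EnumerableCount(source, predicate);

    // Slow path kept in a local function so the outer method stays small.
    static int EnumerableCount(IEnumerable<T> items, Func<T, bool> match)
    {
        int count = 0;
        foreach (T item in items)
        {
            if (match(item)) checked { count++; }
        }
        return count;
    }
}

// Array input takes the span path; the Select result takes the enumerable path.
Console.WriteLine(CountWithPredicate(new[] { 0, 1, 2, 3, 4 }, i => i % 2 == 0));
Console.WriteLine(CountWithPredicate(Enumerable.Range(0, 10).Select(i => i), i => i % 2 == 0));
```

Whether the JIT actually inlines either path depends on the runtime version and the call site, so the disassembly from the benchmark runs below is the thing to check rather than this sketch.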

dotnet-policy-service bot added the community-contribution label May 30, 2024
@neon-sunset
Contributor Author

@EgorBot -arm64 -intel

using BenchmarkDotNet.Attributes;
using System.Buffers;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 2)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 1000)]
    public int Length;

    int[]? _array;
    IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);
}

@EgorBot

EgorBot commented May 30, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-STKMUK : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JINEQQ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Toolchain | Length | Mean | Error | Ratio | Gen0 | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Array | Main | 1 | 12.149 ns | 0.1380 ns | 1.00 | 0.0013 | 664 B | 32 B | 1.00 |
| Array | PR | 1 | 1.948 ns | 0.0140 ns | 0.16 | - | 431 B | - | 0.00 |
| Select | Main | 1 | 16.124 ns | 0.3533 ns | 1.00 | 0.0019 | 887 B | 48 B | 1.00 |
| Select | PR | 1 | 18.350 ns | 0.3369 ns | 1.16 | 0.0019 | 1,040 B | 48 B | 1.00 |
| Array | Main | 5 | 20.080 ns | 0.2632 ns | 1.00 | 0.0013 | 657 B | 32 B | 1.00 |
| Array | PR | 5 | 4.556 ns | 0.0256 ns | 0.23 | - | 431 B | - | 0.00 |
| Select | Main | 5 | 21.463 ns | 0.1104 ns | 1.00 | 0.0019 | 861 B | 48 B | 1.00 |
| Select | PR | 5 | 22.344 ns | 0.2173 ns | 1.04 | 0.0019 | 1,023 B | 48 B | 1.00 |
| Array | Main | 10 | 29.461 ns | 0.0246 ns | 1.00 | 0.0013 | 663 B | 32 B | 1.00 |
| Array | PR | 10 | 8.410 ns | 0.0038 ns | 0.29 | - | 431 B | - | 0.00 |
| Select | Main | 10 | 30.997 ns | 0.6440 ns | 1.00 | 0.0019 | 861 B | 48 B | 1.00 |
| Select | PR | 10 | 30.950 ns | 0.3587 ns | 0.99 | 0.0019 | 1,023 B | 48 B | 1.00 |
| Array | Main | 100 | 198.109 ns | 0.0530 ns | 1.00 | 0.0012 | 663 B | 32 B | 1.00 |
| Array | PR | 100 | 72.787 ns | 0.0360 ns | 0.37 | - | 431 B | - | 0.00 |
| Select | Main | 100 | 202.276 ns | 0.3104 ns | 1.00 | 0.0019 | 863 B | 48 B | 1.00 |
| Select | PR | 100 | 203.941 ns | 0.1311 ns | 1.01 | 0.0019 | 1,025 B | 48 B | 1.00 |
| Array | Main | 1000 | 1,839.498 ns | 7.0074 ns | 1.00 | - | 694 B | 32 B | 1.00 |
| Array | PR | 1000 | 724.855 ns | 0.3306 ns | 0.39 | - | 425 B | - | 0.00 |
| Select | Main | 1000 | 2,070.519 ns | 1.1994 ns | 1.00 | - | 885 B | 48 B | 1.00 |
| Select | PR | 1000 | 2,066.841 ns | 24.4825 ns | 1.00 | - | 1,043 B | 48 B | 1.00 |

BDN_Artifacts.zip

@EgorBot

EgorBot commented May 30, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-WOALJO : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-UBOTCO : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Length | Mean | Error | Ratio | Gen0 | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Array | Main | 1 | 17.292 ns | 0.2280 ns | 1.00 | 0.0019 | 168 B | 32 B | 1.00 |
| Array | PR | 1 | 3.298 ns | 0.0029 ns | 0.19 | - | 600 B | - | 0.00 |
| Select | Main | 1 | 23.278 ns | 0.3798 ns | 1.00 | 0.0029 | 1,100 B | 48 B | 1.00 |
| Select | PR | 1 | 22.260 ns | 0.3077 ns | 0.96 | 0.0029 | 1,324 B | 48 B | 1.00 |
| Array | Main | 5 | 25.772 ns | 0.1569 ns | 1.00 | 0.0019 | 896 B | 32 B | 1.00 |
| Array | PR | 5 | 8.686 ns | 0.0009 ns | 0.34 | - | 600 B | - | 0.00 |
| Select | Main | 5 | 35.232 ns | 0.0942 ns | 1.00 | 0.0029 | 1,084 B | 48 B | 1.00 |
| Select | PR | 5 | 29.867 ns | 0.2031 ns | 0.85 | 0.0029 | 1,308 B | 48 B | 1.00 |
| Array | Main | 10 | 35.749 ns | 0.1809 ns | 1.00 | 0.0019 | 908 B | 32 B | 1.00 |
| Array | PR | 10 | 17.797 ns | 0.0018 ns | 0.50 | - | 600 B | - | 0.00 |
| Select | Main | 10 | 37.491 ns | 0.5069 ns | 1.00 | 0.0029 | 1,096 B | 48 B | 1.00 |
| Select | PR | 10 | 40.025 ns | 0.2962 ns | 1.07 | 0.0029 | 1,320 B | 48 B | 1.00 |
| Array | Main | 100 | 210.000 ns | 0.5227 ns | 1.00 | 0.0019 | 896 B | 32 B | 1.00 |
| Array | PR | 100 | 170.124 ns | 0.0339 ns | 0.81 | - | 600 B | - | 0.00 |
| Select | Main | 100 | 225.710 ns | 0.5734 ns | 1.00 | 0.0029 | 1,092 B | 48 B | 1.00 |
| Select | PR | 100 | 230.128 ns | 0.5316 ns | 1.02 | 0.0029 | 1,320 B | 48 B | 1.00 |
| Array | Main | 1000 | 1,892.696 ns | 1.6808 ns | 1.00 | 0.0019 | 932 B | 32 B | 1.00 |
| Array | PR | 1000 | 1,674.242 ns | 0.9036 ns | 0.88 | - | 600 B | - | 0.00 |
| Select | Main | 1000 | 2,030.906 ns | 0.4242 ns | 1.00 | - | 1,116 B | 48 B | 1.00 |
| Select | PR | 1000 | 2,025.169 ns | 0.4080 ns | 1.00 | - | 1,320 B | 48 B | 1.00 |

BDN_Artifacts.zip

@neon-sunset
Contributor Author

@EgorBot -arm64 -intel

using BenchmarkDotNet.Attributes;
using System.Buffers;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 4)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 10_000)]
    public int Length;

    int[]? _array;
    IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);
}

@jkotas
Member

jkotas commented May 30, 2024

> It also moves the non-span loop into a local function, making it easier for the JIT to inline the method together with the lambda, when available, at each call site. This avoids devirtualizing only a single lambda, which is what happens when the loop is not inlined.

This split adds code bloat. We do not have an artificial split like this in other places that use TryGetSpan. Is it really worth it here?

@EgorBot
Copy link

EgorBot commented May 30, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-JWMELF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-MGQLWD : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Toolchain | Length | Mean | Error | Ratio | Gen0 | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Array | Main | 1 | 11.677 ns | 0.1474 ns | 1.00 | 0.0013 | 664 B | 32 B | 1.00 |
| Array | PR | 1 | 1.920 ns | 0.0052 ns | 0.16 | - | 431 B | - | 0.00 |
| Select | Main | 1 | 14.475 ns | 0.0686 ns | 1.00 | 0.0019 | 887 B | 48 B | 1.00 |
| Select | PR | 1 | 15.424 ns | 0.2414 ns | 1.07 | 0.0019 | 983 B | 48 B | 1.00 |
| Array | Main | 5 | 21.138 ns | 0.1114 ns | 1.00 | 0.0013 | 657 B | 32 B | 1.00 |
| Array | PR | 5 | 4.525 ns | 0.0321 ns | 0.21 | - | 431 B | - | 0.00 |
| Select | Main | 5 | 20.799 ns | 0.2591 ns | 1.00 | 0.0019 | 861 B | 48 B | 1.00 |
| Select | PR | 5 | 20.616 ns | 0.1313 ns | 0.99 | 0.0019 | 969 B | 48 B | 1.00 |
| Array | Main | 10 | 29.085 ns | 0.2374 ns | 1.00 | 0.0013 | 663 B | 32 B | 1.00 |
| Array | PR | 10 | 8.423 ns | 0.0025 ns | 0.29 | - | 431 B | - | 0.00 |
| Select | Main | 10 | 30.464 ns | 0.3519 ns | 1.00 | 0.0019 | 861 B | 48 B | 1.00 |
| Select | PR | 10 | 30.241 ns | 0.3273 ns | 0.99 | 0.0019 | 969 B | 48 B | 1.00 |
| Array | Main | 100 | 198.322 ns | 0.0202 ns | 1.00 | 0.0012 | 663 B | 32 B | 1.00 |
| Array | PR | 100 | 72.722 ns | 0.0198 ns | 0.37 | - | 431 B | - | 0.00 |
| Select | Main | 100 | 205.284 ns | 0.1638 ns | 1.00 | 0.0019 | 863 B | 48 B | 1.00 |
| Select | PR | 100 | 203.963 ns | 0.0657 ns | 0.99 | 0.0019 | 971 B | 48 B | 1.00 |
| Array | Main | 10000 | 18,279.597 ns | 39.9907 ns | 1.00 | - | 664 B | 32 B | 1.00 |
| Array | PR | 10000 | 7,184.813 ns | 6.0751 ns | 0.39 | - | 431 B | - | 0.00 |
| Select | Main | 10000 | 20,183.147 ns | 8.2054 ns | 1.00 | - | 838 B | 48 B | 1.00 |
| Select | PR | 10000 | 20,240.985 ns | 8.9905 ns | 1.00 | - | 938 B | 48 B | 1.00 |

BDN_Artifacts.zip

@EgorBot

EgorBot commented May 30, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-VCHWCP : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-CMALQX : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Length | Mean | Error | Ratio | Gen0 | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Array | Main | 1 | 17.151 ns | 0.3226 ns | 1.00 | 0.0019 | 844 B | 32 B | 1.00 |
| Array | PR | 1 | 3.298 ns | 0.0025 ns | 0.19 | - | 600 B | - | 0.00 |
| Select | Main | 1 | 23.034 ns | 0.3441 ns | 1.00 | 0.0029 | 1,100 B | 48 B | 1.00 |
| Select | PR | 1 | 21.308 ns | 0.2683 ns | 0.93 | 0.0029 | 1,240 B | 48 B | 1.00 |
| Array | Main | 5 | 26.064 ns | 0.1618 ns | 1.00 | 0.0019 | 896 B | 32 B | 1.00 |
| Array | PR | 5 | 8.783 ns | 0.0006 ns | 0.34 | - | 600 B | - | 0.00 |
| Select | Main | 5 | 36.077 ns | 0.4738 ns | 1.00 | 0.0029 | 1,084 B | 48 B | 1.00 |
| Select | PR | 5 | 29.687 ns | 0.1459 ns | 0.82 | 0.0029 | 1,244 B | 48 B | 1.00 |
| Array | Main | 10 | 35.418 ns | 0.0351 ns | 1.00 | 0.0019 | 908 B | 32 B | 1.00 |
| Array | PR | 10 | 17.792 ns | 0.0015 ns | 0.50 | - | 600 B | - | 0.00 |
| Select | Main | 10 | 37.256 ns | 0.1162 ns | 1.00 | 0.0029 | 1,096 B | 48 B | 1.00 |
| Select | PR | 10 | 46.241 ns | 0.0927 ns | 1.24 | 0.0029 | 1,252 B | 48 B | 1.00 |
| Array | Main | 100 | 209.662 ns | 0.1772 ns | 1.00 | 0.0019 | 908 B | 32 B | 1.00 |
| Array | PR | 100 | 169.890 ns | 0.0096 ns | 0.81 | - | 600 B | - | 0.00 |
| Select | Main | 100 | 225.857 ns | 0.2314 ns | 1.00 | 0.0029 | 1,092 B | 48 B | 1.00 |
| Select | PR | 100 | 226.185 ns | 0.4154 ns | 1.00 | 0.0029 | 1,252 B | 48 B | 1.00 |
| Array | Main | 10000 | 18,709.212 ns | 1.7315 ns | 1.00 | - | 888 B | 32 B | 1.00 |
| Array | PR | 10000 | 16,697.292 ns | 0.9395 ns | 0.89 | - | 600 B | - | 0.00 |
| Select | Main | 10000 | 20,035.024 ns | 1.7563 ns | 1.00 | - | 1,060 B | 48 B | 1.00 |
| Select | PR | 10000 | 20,036.847 ns | 3.4040 ns | 1.00 | - | 1,220 B | 48 B | 1.00 |

BDN_Artifacts.zip

@neon-sunset
Contributor Author

neon-sunset commented May 30, 2024

> This split adds code bloat. We do not have an artificial split like this in other places that use TryGetSpan. Is it really worth it here?

The goal is to let the JIT inline the .Count call (which it does as of this PR) and the loop with a specific delegate instance (which it also does), and ideally also optimize away the type handle comparisons in .TryGetSpan (which it currently doesn't, at least for the instance fields in the benchmark). This is not possible without hoisting the enumerable count into a local function, because foreach over an enumerable brings in exception handling (EH), which blocks inlining. When inlining is blocked, only a single delegate is devirtualized in the span path given the current dynamic PGO restrictions, which also negatively affects other delegate types, if I understand it correctly.

This also allows the JIT to optimize away the precondition checks when possible (almost noise next to the enumerator allocation, but hey).

The second variant, which moves the span loop into a local function, shaves about 100 B of code size off the enumerable path. For the span path, this PR is a code size win (for a specific instantiation, not in total size).
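
For readers wondering where the EH comes from: a foreach over an `IEnumerable<T>` is compiled into roughly the try/finally shape below, because the enumerator has to be disposed. This is an illustrative sketch with a hypothetical name (`CountLowered`), not code from the PR, and the exact lowering the compiler emits differs in small details.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Roughly what the C# compiler produces for
//     foreach (int item in source) { if (predicate(item)) count++; }
// when source is typed as IEnumerable<int>.
static int CountLowered(IEnumerable<int> source, Func<int, bool> predicate)
{
    int count = 0;
    IEnumerator<int> e = source.GetEnumerator(); // typically a heap allocation
    try
    {
        while (e.MoveNext())
        {
            if (predicate(e.Current)) checked { count++; }
        }
    }
    finally
    {
        e.Dispose(); // this try/finally is the EH region mentioned above
    }
    return count;
}

Console.WriteLine(CountLowered(Enumerable.Range(0, 10).Select(i => i), i => i % 2 == 0));
```

A foreach over a span has no enumerator to dispose, so the span loop carries no such region. Keeping the enumerable loop in its own local function confines the try/finally to that function.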

@neon-sunset
Contributor Author

@EgorBot -arm64 -intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 4)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 10_000)]
    public int Length;

    int[]? _array;
    IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int ArrayTwoDelegates()
    {
        var array = _array!;
        return array.Count(i => i % 2 == 0) +
               array.Count(i => i % 8 == 0);
    }

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);

    [Benchmark]
    public int SelectTwoDelegates()
    {
        var select = _select!;
        return select.Count(i => i % 2 == 0) +
               select.Count(i => i % 8 == 0);
    }
}

@EgorBot

EgorBot commented May 30, 2024

❌ Benchmark failed on Intel
Benchmark run failed: // Validating benchmarks:
// ***** BenchmarkRunner: Start   *****
// ***** Found 40 benchmark(s) in total *****
// ***** Building 2 exe(s) in Parallel: Start   *****
// ***** Done, took 00:00:51 (51.14 sec)   *****
// Found 40 benchmarks:
//   Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
//   Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
//   Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
//   Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
//   Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
//   Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
//   Perf_Count.Select: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
//   Perf_Count.SelectTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
//   Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=5]
//   Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=5]
//   Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=5]
//   Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=5]
//   Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=5]
//   Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=5]
//   Perf_Count.Select: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=5]
//   Perf_Count.SelectTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=5]
//   Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10]
//   Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10]
//   Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10]
//   Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10]
//   Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10]
//   Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10]
//   Perf_Count.Select: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10]
//   Perf_Count.SelectTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10]
//   Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=100]
//   Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=100]
//   Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=100]
//   Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=100]
//   Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=100]
//   Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=100]
//   Perf_Count.Select: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=100]
//   Perf_Count.SelectTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=100]
//   Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10000]
//   Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10000]
//   Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10000]
//   Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=10000]
//   Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10000]
//   Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10000]
//   Perf_Count.Select: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10000]
//   Perf_Count.SelectTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=10000]

// **************************
// Benchmark: Perf_Count.Array: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/492d316d-8c7c-432d-b28b-9e91dbf4b1c2/corerun ade80f89-6895-4c73-aecf-e48027fe5e22.dll --anonymousPipes 110 117 --benchmarkName "Perf_Count.Array(Length: 1)" --job Toolchain=/core_root_base/corerun --benchmarkId 0 in /home/egorbot/benchapp/bin/Release/net9.0/ade80f89-6895-4c73-aecf-e48027fe5e22/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 303360.00 ns, 303.3600 us/op
WorkloadJitting  1: 1 op, 690445.00 ns, 690.4450 us/op

OverheadJitting  2: 16 op, 447428.00 ns, 27.9642 us/op
WorkloadJitting  2: 16 op, 480721.00 ns, 30.0451 us/op

WorkloadPilot    1: 16 op, 4385.00 ns, 274.0625 ns/op
WorkloadPilot    2: 32 op, 6585.00 ns, 205.7812 ns/op
WorkloadPilot    3: 64 op, 9151.00 ns, 142.9844 ns/op
WorkloadPilot    4: 128 op, 21857.00 ns, 170.7578 ns/op
WorkloadPilot    5: 256 op, 35513.00 ns, 138.7227 ns/op
WorkloadPilot    6: 512 op, 62539.00 ns, 122.1465 ns/op
WorkloadPilot    7: 1024 op, 127393.00 ns, 124.4072 ns/op
WorkloadPilot    8: 2048 op, 253987.00 ns, 124.0171 ns/op
WorkloadPilot    9: 4096 op, 507708.00 ns, 123.9521 ns/op
WorkloadPilot   10: 8192 op, 1254911.00 ns, 153.1874 ns/op
WorkloadPilot   11: 16384 op, 2228735.00 ns, 136.0312 ns/op
WorkloadPilot   12: 32768 op, 4237795.00 ns, 129.3272 ns/op
WorkloadPilot   13: 65536 op, 8240470.00 ns, 125.7396 ns/op
WorkloadPilot   14: 131072 op, 16289908.00 ns, 124.2821 ns/op
WorkloadPilot   15: 262144 op, 32083751.00 ns, 122.3898 ns/op
WorkloadPilot   16: 524288 op, 63842403.00 ns, 121.7697 ns/op
WorkloadPilot   17: 1048576 op, 28772244.00 ns, 27.4394 ns/op
WorkloadPilot   18: 2097152 op, 32144652.00 ns, 15.3278 ns/op
WorkloadPilot   19: 4194304 op, 57163014.00 ns, 13.6287 ns/op
WorkloadPilot   20: 8388608 op, 112229246.00 ns, 13.3788 ns/op
WorkloadPilot   21: 16777216 op, 226622354.00 ns, 13.5077 ns/op
WorkloadPilot   22: 33554432 op, 446807181.00 ns, 13.3159 ns/op
WorkloadPilot   23: 67108864 op, 896611231.00 ns, 13.3605 ns/op

OverheadWarmup   1: 67108864 op, 144597856.00 ns, 2.1547 ns/op
OverheadWarmup   2: 67108864 op, 120418403.00 ns, 1.7944 ns/op
OverheadWarmup   3: 67108864 op, 105901831.00 ns, 1.5781 ns/op
OverheadWarmup   4: 67108864 op, 105898505.00 ns, 1.5780 ns/op
OverheadWarmup   5: 67108864 op, 105912335.00 ns, 1.5782 ns/op
OverheadWarmup   6: 67108864 op, 105927667.00 ns, 1.5784 ns/op
OverheadWarmup   7: 67108864 op, 105929642.00 ns, 1.5785 ns/op
OverheadWarmup   8: 67108864 op, 105888564.00 ns, 1.5779 ns/op
OverheadWarmup   9: 67108864 op, 105950413.00 ns, 1.5788 ns/op
OverheadWarmup  10: 67108864 op, 105881811.00 ns, 1.5778 ns/op

OverheadActual   1: 67108864 op, 105895027.00 ns, 1.5780 ns/op
OverheadActual   2: 67108864 op, 105882914.00 ns, 1.5778 ns/op
OverheadActual   3: 67108864 op, 105877194.00 ns, 1.5777 ns/op
OverheadActual   4: 67108864 op, 105872376.00 ns, 1.5776 ns/op
OverheadActual   5: 67108864 op, 105912665.00 ns, 1.5782 ns/op
OverheadActual   6: 67108864 op, 105904263.00 ns, 1.5781 ns/op
OverheadActual   7: 67108864 op, 105913346.00 ns, 1.5782 ns/op
OverheadActual   8: 67108864 op, 105947695.00 ns, 1.5787 ns/op
OverheadActual   9: 67108864 op, 105874303.00 ns, 1.5777 ns/op
OverheadActual  10: 67108864 op, 105896900.00 ns, 1.5780 ns/op
OverheadActual  11: 67108864 op, 105915053.00 ns, 1.5783 ns/op
OverheadActual  12: 67108864 op, 105898329.00 ns, 1.5780 ns/op
OverheadActual  13: 67108864 op, 105882876.00 ns, 1.5778 ns/op
OverheadActual  14: 67108864 op, 105915107.00 ns, 1.5783 ns/op
OverheadActual  15: 67108864 op, 105889470.00 ns, 1.5779 ns/op

WorkloadWarmup   1: 67108864 op, 901300499.00 ns, 13.4304 ns/op
WorkloadWarmup   2: 67108864 op, 970947203.00 ns, 14.4682 ns/op
WorkloadWarmup   3: 67108864 op, 964747674.00 ns, 14.3759 ns/op
WorkloadWarmup   4: 67108864 op, 979687146.00 ns, 14.5985 ns/op
WorkloadWarmup   5: 67108864 op, 982691513.00 ns, 14.6432 ns/op
WorkloadWarmup   6: 67108864 op, 969455967.00 ns, 14.4460 ns/op

// BeforeActualRun
WorkloadActual   1: 67108864 op, 900163918.00 ns, 13.4135 ns/op
WorkloadActual   2: 67108864 op, 910697597.00 ns, 13.5705 ns/op
WorkloadActual   3: 67108864 op, 921953773.00 ns, 13.7382 ns/op
WorkloadActual   4: 67108864 op, 928259380.00 ns, 13.8321 ns/op
WorkloadActual   5: 67108864 op, 921332665.00 ns, 13.7289 ns/op
WorkloadActual   6: 67108864 op, 918330542.00 ns, 13.6842 ns/op
WorkloadActual   7: 67108864 op, 929512644.00 ns, 13.8508 ns/op
WorkloadActual   8: 67108864 op, 910954376.00 ns, 13.5743 ns/op
WorkloadActual   9: 67108864 op, 918479702.00 ns, 13.6864 ns/op
WorkloadActual  10: 67108864 op, 922672562.00 ns, 13.7489 ns/op
WorkloadActual  11: 67108864 op, 898998156.00 ns, 13.3961 ns/op
WorkloadActual  12: 67108864 op, 899949975.00 ns, 13.4103 ns/op
WorkloadActual  13: 67108864 op, 925889041.00 ns, 13.7968 ns/op
WorkloadActual  14: 67108864 op, 912260491.00 ns, 13.5937 ns/op
WorkloadActual  15: 67108864 op, 913238496.00 ns, 13.6083 ns/op

// AfterActualRun
WorkloadResult   1: 67108864 op, 794267018.00 ns, 11.8355 ns/op
WorkloadResult   2: 67108864 op, 804800697.00 ns, 11.9925 ns/op
WorkloadResult   3: 67108864 op, 816056873.00 ns, 12.1602 ns/op
WorkloadResult   4: 67108864 op, 822362480.00 ns, 12.2542 ns/op
WorkloadResult   5: 67108864 op, 815435765.00 ns, 12.1509 ns/op
WorkloadResult   6: 67108864 op, 812433642.00 ns, 12.1062 ns/op
WorkloadResult   7: 67108864 op, 823615744.00 ns, 12.2728 ns/op
WorkloadResult   8: 67108864 op, 805057476.00 ns, 11.9963 ns/op
WorkloadResult   9: 67108864 op, 812582802.00 ns, 12.1084 ns/op
WorkloadResult  10: 67108864 op, 816775662.00 ns, 12.1709 ns/op
WorkloadResult  11: 67108864 op, 793101256.00 ns, 11.8181 ns/op
WorkloadResult  12: 67108864 op, 794053075.00 ns, 11.8323 ns/op
WorkloadResult  13: 67108864 op, 819992141.00 ns, 12.2188 ns/op
WorkloadResult  14: 67108864 op, 806363591.00 ns, 12.0158 ns/op
WorkloadResult  15: 67108864 op, 807341596.00 ns, 12.0303 ns/op
// GC:  85 0 0 2147484384 67108864
// Threading:  0 0 67108864

// AfterAll
// Benchmark Process 35645 has exited with code 0.

Mean = 12.064 ns, StdErr = 0.039 ns (0.32%), N = 15, StdDev = 0.150 ns
Min = 11.818 ns, Q1 = 11.994 ns, Median = 12.106 ns, Q3 = 12.166 ns, Max = 12.273 ns
IQR = 0.171 ns, LowerFence = 11.738 ns, UpperFence = 12.422 ns
ConfidenceInterval = [11.904 ns; 12.225 ns] (CI 99.9%), Margin = 0.160 ns (1.33% of Mean)
Skewness = -0.35, Kurtosis = 1.76, MValue = 2

// ** Remained 39 (97.5 %) benchmark(s) to run. Estimated finish 2024-05-30 18:26 (0h 16m from now) **
// **************************
// Benchmark: Perf_Count.ArrayTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/492d316d-8c7c-432d-b28b-9e91dbf4b1c2/corerun ade80f89-6895-4c73-aecf-e48027fe5e22.dll --anonymousPipes 118 119 --benchmarkName "Perf_Count.ArrayTwoDelegates(Length: 1)" --job Toolchain=/core_root_base/corerun --benchmarkId 1 in /home/egorbot/benchapp/bin/Release/net9.0/ade80f89-6895-4c73-aecf-e48027fe5e22/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 309513.00 ns, 309.5130 us/op
WorkloadJitting  1: 1 op, 826850.00 ns, 826.8500 us/op

OverheadJitting  2: 16 op, 425470.00 ns, 26.5919 us/op
WorkloadJitting  2: 16 op, 451191.00 ns, 28.1994 us/op

WorkloadPilot    1: 16 op, 6681.00 ns, 417.5625 ns/op
WorkloadPilot    2: 32 op, 10142.00 ns, 316.9375 ns/op
WorkloadPilot    3: 64 op, 24310.00 ns, 379.8438 ns/op
WorkloadPilot    4: 128 op, 37672.00 ns, 294.3125 ns/op
WorkloadPilot    5: 256 op, 70723.00 ns, 276.2617 ns/op
WorkloadPilot    6: 512 op, 124544.00 ns, 243.2500 ns/op
WorkloadPilot    7: 1024 op, 268271.00 ns, 261.9834 ns/op
WorkloadPilot    8: 2048 op, 512563.00 ns, 250.2749 ns/op
WorkloadPilot    9: 4096 op, 1277670.00 ns, 311.9312 ns/op
WorkloadPilot   10: 8192 op, 2234720.00 ns, 272.7930 ns/op
WorkloadPilot   11: 16384 op, 4279369.00 ns, 261.1920 ns/op
WorkloadPilot   12: 32768 op, 8281356.00 ns, 252.7269 ns/op
WorkloadPilot   13: 65536 op, 16322141.00 ns, 249.0561 ns/op
WorkloadPilot   14: 131072 op, 32162429.00 ns, 245.3799 ns/op
WorkloadPilot   15: 262144 op, 63957895.00 ns, 243.9800 ns/op
WorkloadPilot   16: 524288 op, 33055319.00 ns, 63.0480 ns/op
WorkloadPilot   17: 1048576 op, 31056964.00 ns, 29.6182 ns/op
WorkloadPilot   18: 2097152 op, 63933958.00 ns, 30.4861 ns/op
WorkloadPilot   19: 4194304 op, 120589221.00 ns, 28.7507 ns/op
WorkloadPilot   20: 8388608 op, 240858368.00 ns, 28.7126 ns/op
WorkloadPilot   21: 16777216 op, 491372127.00 ns, 29.2881 ns/op
WorkloadPilot   22: 33554432 op, 979252122.00 ns, 29.1840 ns/op

OverheadWarmup   1: 33554432 op, 72290488.00 ns, 2.1544 ns/op
OverheadWarmup   2: 33554432 op, 72255106.00 ns, 2.1534 ns/op
OverheadWarmup   3: 33554432 op, 67109037.00 ns, 2.0000 ns/op
OverheadWarmup   4: 33554432 op, 52993437.00 ns, 1.5793 ns/op
OverheadWarmup   5: 33554432 op, 52970100.00 ns, 1.5786 ns/op
OverheadWarmup   6: 33554432 op, 52961061.00 ns, 1.5784 ns/op
OverheadWarmup   7: 33554432 op, 52984239.00 ns, 1.5791 ns/op
OverheadWarmup   8: 33554432 op, 52979020.00 ns, 1.5789 ns/op
OverheadWarmup   9: 33554432 op, 53135599.00 ns, 1.5836 ns/op
OverheadWarmup  10: 33554432 op, 52963169.00 ns, 1.5784 ns/op

OverheadActual   1: 33554432 op, 52982308.00 ns, 1.5790 ns/op
OverheadActual   2: 33554432 op, 52978218.00 ns, 1.5789 ns/op
OverheadActual   3: 33554432 op, 52958898.00 ns, 1.5783 ns/op
OverheadActual   4: 33554432 op, 52940893.00 ns, 1.5778 ns/op
OverheadActual   5: 33554432 op, 52980539.00 ns, 1.5789 ns/op
OverheadActual   6: 33554432 op, 52973863.00 ns, 1.5787 ns/op
OverheadActual   7: 33554432 op, 53052151.00 ns, 1.5811 ns/op
OverheadActual   8: 33554432 op, 53023298.00 ns, 1.5802 ns/op
OverheadActual   9: 33554432 op, 52956259.00 ns, 1.5782 ns/op
OverheadActual  10: 33554432 op, 52994737.00 ns, 1.5794 ns/op
OverheadActual  11: 33554432 op, 52969111.00 ns, 1.5786 ns/op
OverheadActual  12: 33554432 op, 53089997.00 ns, 1.5822 ns/op
OverheadActual  13: 33554432 op, 52972732.00 ns, 1.5787 ns/op
OverheadActual  14: 33554432 op, 52991424.00 ns, 1.5793 ns/op
OverheadActual  15: 33554432 op, 52975028.00 ns, 1.5788 ns/op

WorkloadWarmup   1: 33554432 op, 1005398406.00 ns, 29.9632 ns/op
WorkloadWarmup   2: 33554432 op, 980769741.00 ns, 29.2292 ns/op
WorkloadWarmup   3: 33554432 op, 985903716.00 ns, 29.3822 ns/op
WorkloadWarmup   4: 33554432 op, 977570440.00 ns, 29.1339 ns/op
WorkloadWarmup   5: 33554432 op, 986408165.00 ns, 29.3973 ns/op
WorkloadWarmup   6: 33554432 op, 979761609.00 ns, 29.1992 ns/op

// BeforeActualRun
WorkloadActual   1: 33554432 op, 969511668.00 ns, 28.8937 ns/op
WorkloadActual   2: 33554432 op, 974344694.00 ns, 29.0377 ns/op
WorkloadActual   3: 33554432 op, 982115303.00 ns, 29.2693 ns/op
WorkloadActual   4: 33554432 op, 971658098.00 ns, 28.9577 ns/op
WorkloadActual   5: 33554432 op, 977681654.00 ns, 29.1372 ns/op
WorkloadActual   6: 33554432 op, 981778946.00 ns, 29.2593 ns/op
WorkloadActual   7: 33554432 op, 972534231.00 ns, 28.9838 ns/op
WorkloadActual   8: 33554432 op, 982967122.00 ns, 29.2947 ns/op
WorkloadActual   9: 33554432 op, 1000444318.00 ns, 29.8156 ns/op
WorkloadActual  10: 33554432 op, 998971036.00 ns, 29.7717 ns/op
WorkloadActual  11: 33554432 op, 1008013374.00 ns, 30.0411 ns/op
WorkloadActual  12: 33554432 op, 995055763.00 ns, 29.6550 ns/op
WorkloadActual  13: 33554432 op, 1005414190.00 ns, 29.9637 ns/op
WorkloadActual  14: 33554432 op, 1004725804.00 ns, 29.9432 ns/op
WorkloadActual  15: 33554432 op, 998842178.00 ns, 29.7678 ns/op

// AfterActualRun
WorkloadResult   1: 33554432 op, 916533450.00 ns, 27.3148 ns/op
WorkloadResult   2: 33554432 op, 921366476.00 ns, 27.4589 ns/op
WorkloadResult   3: 33554432 op, 929137085.00 ns, 27.6904 ns/op
WorkloadResult   4: 33554432 op, 918679880.00 ns, 27.3788 ns/op
WorkloadResult   5: 33554432 op, 924703436.00 ns, 27.5583 ns/op
WorkloadResult   6: 33554432 op, 928800728.00 ns, 27.6804 ns/op
WorkloadResult   7: 33554432 op, 919556013.00 ns, 27.4049 ns/op
WorkloadResult   8: 33554432 op, 929988904.00 ns, 27.7158 ns/op
WorkloadResult   9: 33554432 op, 947466100.00 ns, 28.2367 ns/op
WorkloadResult  10: 33554432 op, 945992818.00 ns, 28.1928 ns/op
WorkloadResult  11: 33554432 op, 955035156.00 ns, 28.4623 ns/op
WorkloadResult  12: 33554432 op, 942077545.00 ns, 28.0761 ns/op
WorkloadResult  13: 33554432 op, 952435972.00 ns, 28.3848 ns/op
WorkloadResult  14: 33554432 op, 951747586.00 ns, 28.3643 ns/op
WorkloadResult  15: 33554432 op, 945863960.00 ns, 28.1889 ns/op
// GC:  85 0 0 2147484096 33554432
// Threading:  0 0 33554432

// AfterAll
// Benchmark Process 35673 has exited with code 0.

Mean = 27.874 ns, StdErr = 0.106 ns (0.38%), N = 15, StdDev = 0.411 ns
Min = 27.315 ns, Q1 = 27.509 ns, Median = 27.716 ns, Q3 = 28.215 ns, Max = 28.462 ns
IQR = 0.706 ns, LowerFence = 26.449 ns, UpperFence = 29.274 ns
ConfidenceInterval = [27.434 ns; 28.313 ns] (CI 99.9%), Margin = 0.440 ns (1.58% of Mean)
Skewness = 0.05, Kurtosis = 1.25, MValue = 2

// ** Remained 38 (95.0 %) benchmark(s) to run. Estimated finish 2024-05-30 18:26 (0h 16m from now) **
// **************************
// Benchmark: Perf_Count.Select: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/492d316d-8c7c-432d-b28b-9e91dbf4b1c2/corerun ade80f89-6895-4c73-aecf-e48027fe5e22.dll --anonymousPipes 118 119 --benchmarkName "Perf_Count.Select(Length: 1)" --job Toolchain=/core_root_base/corerun --benchmarkId 2 in /home/egorbot/benchapp/bin/Release/net9.0/ade80f89-6895-4c73-aecf-e48027fe5e22/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 300426.00 ns, 300.4260 us/op
WorkloadJitting  1: 1 op, 931999.00 ns, 931.9990 us/op

OverheadJitting  2: 16 op, 418422.00 ns, 26.1514 us/op
WorkloadJitting  2: 16 op, 571512.00 ns, 35.7195 us/op

WorkloadPilot    1: 16 op, 4886.00 ns, 305.3750 ns/op
WorkloadPilot    2: 32 op, 6966.00 ns, 217.6875 ns/op
WorkloadPilot    3: 64 op, 10214.00 ns, 159.5938 ns/op
WorkloadPilot    4: 128 op, 26411.00 ns, 206.3359 ns/op
WorkloadPilot    5: 256 op, 44261.00 ns, 172.8945 ns/op
WorkloadPilot    6: 512 op, 82894.00 ns, 161.9023 ns/op
WorkloadPilot    7: 1024 op, 153210.00 ns, 149.6191 ns/op
WorkloadPilot    8: 2048 op, 307545.00 ns, 150.1685 ns/op
WorkloadPilot    9: 4096 op, 611697.00 ns, 149.3401 ns/op
WorkloadPilot   10: 8192 op, 1400813.00 ns, 170.9977 ns/op
WorkloadPilot   11: 16384 op, 2647349.00 ns, 161.5814 ns/op
WorkloadPilot   12: 32768 op, 5024754.00 ns, 153.3433 ns/op
WorkloadPilot   13: 65536 op, 9928975.00 ns, 151.5041 ns/op
WorkloadPilot   14: 131072 op, 19375838.00 ns, 147.8259 ns/op
WorkloadPilot   15: 262144 op, 38638550.00 ns, 147.3944 ns/op
WorkloadPilot   16: 524288 op, 72514042.00 ns, 138.3096 ns/op
WorkloadPilot   17: 1048576 op, 20969921.00 ns, 19.9985 ns/op
WorkloadPilot   18: 2097152 op, 44075943.00 ns, 21.0170 ns/op
WorkloadPilot   19: 4194304 op, 80560469.00 ns, 19.2071 ns/op
WorkloadPilot   20: 8388608 op, 161544066.00 ns, 19.2576 ns/op
WorkloadPilot   21: 16777216 op, 315976761.00 ns, 18.8337 ns/op
WorkloadPilot   22: 33554432 op, 646601106.00 ns, 19.2702 ns/op

OverheadWarmup   1: 33554432 op, 71759824.00 ns, 2.1386 ns/op
OverheadWarmup   2: 33554432 op, 71716621.00 ns, 2.1373 ns/op
OverheadWarmup   3: 33554432 op, 66895179.00 ns, 1.9936 ns/op
OverheadWarmup   4: 33554432 op, 52365466.00 ns, 1.5606 ns/op
OverheadWarmup   5: 33554432 op, 52370317.00 ns, 1.5608 ns/op
OverheadWarmup   6: 33554432 op, 52379160.00 ns, 1.5610 ns/op
OverheadWarmup   7: 33554432 op, 52356973.00 ns, 1.5604 ns/op
OverheadWarmup   8: 33554432 op, 52355118.00 ns, 1.5603 ns/op
OverheadWarmup   9: 33554432 op, 52420528.00 ns, 1.5623 ns/op
OverheadWarmup  10: 33554432 op, 52371332.00 ns, 1.5608 ns/op

OverheadActual   1: 33554432 op, 52366461.00 ns, 1.5606 ns/op
OverheadActual   2: 33554432 op, 52356338.00 ns, 1.5603 ns/op
OverheadActual   3: 33554432 op, 52373274.00 ns, 1.5608 ns/op
OverheadActual   4: 33554432 op, 52346735.00 ns, 1.5601 ns/op
OverheadActual   5: 33554432 op, 52366926.00 ns, 1.5607 ns/op
OverheadActual   6: 33554432 op, 52367973.00 ns, 1.5607 ns/op
OverheadActual   7: 33554432 op, 52376661.00 ns, 1.5609 ns/op
OverheadActual   8: 33554432 op, 52386279.00 ns, 1.5612 ns/op
OverheadActual   9: 33554432 op, 52380720.00 ns, 1.5611 ns/op
OverheadActual  10: 33554432 op, 52378436.00 ns, 1.5610 ns/op
OverheadActual  11: 33554432 op, 52371543.00 ns, 1.5608 ns/op
OverheadActual  12: 33554432 op, 52403267.00 ns, 1.5617 ns/op
OverheadActual  13: 33554432 op, 52380234.00 ns, 1.5611 ns/op
OverheadActual  14: 33554432 op, 52371414.00 ns, 1.5608 ns/op
OverheadActual  15: 33554432 op, 52353139.00 ns, 1.5602 ns/op

WorkloadWarmup   1: 33554432 op, 650671662.00 ns, 19.3915 ns/op
WorkloadWarmup   2: 33554432 op, 634891210.00 ns, 18.9212 ns/op
WorkloadWarmup   3: 33554432 op, 617898105.00 ns, 18.4148 ns/op
WorkloadWarmup   4: 33554432 op, 621489097.00 ns, 18.5218 ns/op
WorkloadWarmup   5: 33554432 op, 621256299.00 ns, 18.5149 ns/op
WorkloadWarmup   6: 33554432 op, 628445899.00 ns, 18.7291 ns/op
WorkloadWarmup   7: 33554432 op, 633387191.00 ns, 18.8764 ns/op
WorkloadWarmup   8: 33554432 op, 620154440.00 ns, 18.4820 ns/op

// BeforeActualRun
WorkloadActual   1: 33554432 op, 553173047.00 ns, 16.4858 ns/op
WorkloadActual   2: 33554432 op, 564013631.00 ns, 16.8089 ns/op
WorkloadActual   3: 33554432 op, 551142941.00 ns, 16.4253 ns/op
WorkloadActual   4: 33554432 op, 549584588.00 ns, 16.3789 ns/op
WorkloadActual   5: 33554432 op, 556840527.00 ns, 16.5951 ns/op
WorkloadActual   6: 33554432 op, 564786992.00 ns, 16.8320 ns/op
WorkloadActual   7: 33554432 op, 548881378.00 ns, 16.3579 ns/op
WorkloadActual   8: 33554432 op, 550954929.00 ns, 16.4197 ns/op
WorkloadActual   9: 33554432 op, 558265714.00 ns, 16.6376 ns/op
WorkloadActual  10: 33554432 op, 549761193.00 ns, 16.3842 ns/op
WorkloadActual  11: 33554432 op, 552816450.00 ns, 16.4752 ns/op
WorkloadActual  12: 33554432 op, 551457565.00 ns, 16.4347 ns/op
WorkloadActual  13: 33554432 op, 557489684.00 ns, 16.6145 ns/op
WorkloadActual  14: 33554432 op, 561083736.00 ns, 16.7216 ns/op
WorkloadActual  15: 33554432 op, 554831087.00 ns, 16.5353 ns/op

// AfterActualRun
WorkloadResult   1: 33554432 op, 500801504.00 ns, 14.9250 ns/op
WorkloadResult   2: 33554432 op, 511642088.00 ns, 15.2481 ns/op
WorkloadResult   3: 33554432 op, 498771398.00 ns, 14.8645 ns/op
WorkloadResult   4: 33554432 op, 497213045.00 ns, 14.8181 ns/op
WorkloadResult   5: 33554432 op, 504468984.00 ns, 15.0343 ns/op
WorkloadResult   6: 33554432 op, 512415449.00 ns, 15.2712 ns/op
WorkloadResult   7: 33554432 op, 496509835.00 ns, 14.7971 ns/op
WorkloadResult   8: 33554432 op, 498583386.00 ns, 14.8589 ns/op
WorkloadResult   9: 33554432 op, 505894171.00 ns, 15.0768 ns/op
WorkloadResult  10: 33554432 op, 497389650.00 ns, 14.8234 ns/op
WorkloadResult  11: 33554432 op, 500444907.00 ns, 14.9144 ns/op
WorkloadResult  12: 33554432 op, 499086022.00 ns, 14.8739 ns/op
WorkloadResult  13: 33554432 op, 505118141.00 ns, 15.0537 ns/op
WorkloadResult  14: 33554432 op, 508712193.00 ns, 15.1608 ns/op
WorkloadResult  15: 33554432 op, 502459544.00 ns, 14.9745 ns/op
// GC:  64 0 0 1610613472 33554432
// Threading:  0 0 33554432

// AfterAll
// Benchmark Process 35695 has exited with code 0.

Mean = 14.980 ns, StdErr = 0.040 ns (0.27%), N = 15, StdDev = 0.155 ns
Min = 14.797 ns, Q1 = 14.862 ns, Median = 14.925 ns, Q3 = 15.065 ns, Max = 15.271 ns
IQR = 0.204 ns, LowerFence = 14.556 ns, UpperFence = 15.371 ns
ConfidenceInterval = [14.814 ns; 15.146 ns] (CI 99.9%), Margin = 0.166 ns (1.11% of Mean)
Skewness = 0.58, Kurtosis = 1.89, MValue = 2

// ** Remained 37 (92.5 %) benchmark(s) to run. Estimated finish 2024-05-30 18:24 (0h 13m from now) **
// **************************
// Benchmark: Perf_Count.SelectTwoDelegates: Job-CACWVH(Toolchain=/core_root_base/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/492d316d-8c7c-432d-b28b-9e91dbf4b1c2/corerun ade80f89-6895-4c73-aecf-e48027fe5e22.dll --anonymousPipes 118 119 --benchmarkName "Perf_Count.SelectTwoDelegates(Length: 1)" --job Toolchain=/core_root_base/corerun --benchmarkId 3 in /home/egorbot/benchapp/bin/Release/net9.0/ade80f89-6895-4c73-aecf-e48027fe5e22/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 308815.00 ns, 308.8150 us/op
WorkloadJitting  1: 1 op, 1054371.00 ns, 1.0544 ms/op

OverheadJitting  2: 16 op, 459600.00 ns, 28.7250 us/op
WorkloadJitting  2: 16 op, 444851.00 ns, 27.8032 us/op

WorkloadPilot    1: 16 op, 7552.00 ns, 472.0000 ns/op
WorkloadPilot    2: 32 op, 11416.00 ns, 356.7500 ns/op
WorkloadPilot    3: 64 op, 25740.00 ns, 402.1875 ns/op
WorkloadPilot    4: 128 op, 50634.00 ns, 395.5781 ns/op
WorkloadPilot    5: 256 op, 86262.00 ns, 336.9609 ns/op
WorkloadPilot    6: 512 op, 155335.00 ns, 303.3887 ns/op
WorkloadPilot    7: 1024 op, 327031.00 ns, 319.3662 ns/op
WorkloadPilot    8: 2048 op, 642617.00 ns, 313.7778 ns/op
WorkloadPilot    9: 4096 op, 1456461.00 ns, 355.5813 ns/op
WorkloadPilot   10: 8192 op, 2734809.00 ns, 333.8390 ns/op
WorkloadPilot   11: 16384 op, 5308585.00 ns, 324.0103 ns/op
WorkloadPilot   12: 32768 op, 10383936.00 ns, 316.8926 ns/op
WorkloadPilot   13: 65536 op, 20256909.00 ns, 309.0959 ns/op
WorkloadPilot   14: 131072 op, 40524559.00 ns, 309.1778 ns/op
WorkloadPilot   15: 262144 op, 69542327.00 ns, 265.2829 ns/op
WorkloadPilot   16: 524288 op, 22869588.00 ns, 43.6203 ns/op
WorkloadPilot   17: 1048576 op, 37431745.00 ns, 35.6977 ns/op
WorkloadPilot   18: 2097152 op, 75502686.00 ns, 36.0025 ns/op
WorkloadPilot   19: 4194304 op, 151981085.00 ns, 36.2351 ns/op
WorkloadPilot   20: 8388608 op, 303937462.00 ns, 36.2322 ns/op
WorkloadPilot   21: 16777216 op, 595485201.00 ns, 35.4937 ns/op

OverheadWarmup   1: 16777216 op, 36162177.00 ns, 2.1554 ns/op
OverheadWarmup   2: 16777216 op, 36168188.00 ns, 2.1558 ns/op
OverheadWarmup   3: 16777216 op, 36155664.00 ns, 2.1550 ns/op
OverheadWarmup   4: 16777216 op, 36162591.00 ns, 2.1555 ns/op
OverheadWarmup   5: 16777216 op, 36149605.00 ns, 2.1547 ns/op

OverheadActual   1: 16777216 op, 36167076.00 ns, 2.1557 ns/op
OverheadActual   2: 16777216 op, 36149785.00 ns, 2.1547 ns/op
OverheadActual   3: 16777216 op, 36153159.00 ns, 2.1549 ns/op
OverheadActual   4: 16777216 op, 36129157.00 ns, 2.1535 ns/op
OverheadActual   5: 16777216 op, 36170450.00 ns, 2.1559 ns/op
OverheadActual   6: 16777216 op, 36147591.00 ns, 2.1546 ns/op
OverheadActual   7: 16777216 op, 36141703.00 ns, 2.1542 ns/op
OverheadActual   8: 16777216 op, 36146586.00 ns, 2.1545 ns/op
OverheadActual   9: 16777216 op, 32895860.00 ns, 1.9607 ns/op
OverheadActual  10: 16777216 op, 31335037.00 ns, 1.8677 ns/op
OverheadActual  11: 16777216 op, 31307246.00 ns, 1.8661 ns/op
OverheadActual  12: 16777216 op, 31297330.00 ns, 1.8655 ns/op
OverheadActual  13: 16777216 op, 31288973.00 ns, 1.8650 ns/op
OverheadActual  14: 16777216 op, 31296664.00 ns, 1.8654 ns/op
OverheadActual  15: 16777216 op, 31309622.00 ns, 1.8662 ns/op
OverheadActual  16: 16777216 op, 28453007.00 ns, 1.6959 ns/op
OverheadActual  17: 16777216 op, 26514626.00 ns, 1.5804 ns/op
OverheadActual  18: 16777216 op, 26504768.00 ns, 1.5798 ns/op
OverheadActual  19: 16777216 op, 26506734.00 ns, 1.5799 ns/op
OverheadActual  20: 16777216 op, 26489936.00 ns, 1.5789 ns/op

WorkloadWarmup   1: 16777216 op, 606852011.00 ns, 36.1712 ns/op
WorkloadWarmup   2: 16777216 op, 629030871.00 ns, 37.4932 ns/op
WorkloadWarmup   3: 16777216 op, 624629235.00 ns, 37.2308 ns/op
WorkloadWarmup   4: 16777216 op, 633789871.00 ns, 37.7768 ns/op
WorkloadWarmup   5: 16777216 op, 609305265.00 ns, 36.3174 ns/op
WorkloadWarmup   6: 16777216 op, 608351730.00 ns, 36.2606 ns/op

// BeforeActualRun
WorkloadActual   1: 16777216 op, 595418820.00 ns, 35.4897 ns/op
WorkloadActual   2: 16777216 op, 597470382.00 ns, 35.6120 ns/op
WorkloadActual   3: 16777216 op, 599813318.00 ns, 35.7517 ns/op
WorkloadActual   4: 16777216 op, 601035739.00 ns, 35.8245 ns/op
WorkloadActual   5: 16777216 op, 602201314.00 ns, 35.8940 ns/op
WorkloadActual   6: 16777216 op, 601911140.00 ns, 35.8767 ns/op
WorkloadActual   7: 16777216 op, 607796031.00 ns, 36.2275 ns/op
WorkloadActual   8: 16777216 op, 611498685.00 ns, 36.4482 ns/op
WorkloadActual   9: 16777216 op, 608227572.00 ns, 36.2532 ns/op
WorkloadActual  10: 16777216 op, 618413497.00 ns, 36.8603 ns/op
WorkloadActual  11: 16777216 op, 602218503.00 ns, 35.8950 ns/op
WorkloadActual  12: 16777216 op, 602877554.00 ns, 35.9343 ns/op
WorkloadActual  13: 16777216 op, 598192471.00 ns, 35.6550 ns/op
WorkloadActual  14: 16777216 op, 618126489.00 ns, 36.8432 ns/op
WorkloadActual  15: 16777216 op, 603189068.00 ns, 35.9529 ns/op

// AfterActualRun
WorkloadResult   1: 16777216 op, 564096490.50 ns, 33.6228 ns/op
WorkloadResult   2: 16777216 op, 566148052.50 ns, 33.7451 ns/op
WorkloadResult   3: 16777216 op, 568490988.50 ns, 33.8847 ns/op
WorkloadResult   4: 16777216 op, 569713409.50 ns, 33.9576 ns/op
WorkloadResult   5: 16777216 op, 570878984.50 ns, 34.0270 ns/op
WorkloadResult   6: 16777216 op, 570588810.50 ns, 34.0097 ns/op
WorkloadResult   7: 16777216 op, 576473701.50 ns, 34.3605 ns/op
WorkloadResult   8: 16777216 op, 580176355.50 ns, 34.5812 ns/op
WorkloadResult   9: 16777216 op, 576905242.50 ns, 34.3862 ns/op
WorkloadResult  10: 16777216 op, 587091167.50 ns, 34.9934 ns/op
WorkloadResult  11: 16777216 op, 570896173.50 ns, 34.0281 ns/op
WorkloadResult  12: 16777216 op, 571555224.50 ns, 34.0673 ns/op
WorkloadResult  13: 16777216 op, 566870141.50 ns, 33.7881 ns/op
WorkloadResult  14: 16777216 op, 586804159.50 ns, 34.9763 ns/op
WorkloadResult  15: 16777216 op, 571866738.50 ns, 34.0859 ns/op
// GC:  64 0 0 1610613136 16777216
// Threading:  0 0 16777216

// AfterAll
// Benchmark Process 35710 has exited with code 0.

Mean = 34.168 ns, StdErr = 0.107 ns (0.31%), N = 15, StdDev = 0.415 ns
Min = 33.623 ns, Q1 = 33.921 ns, Median = 34.028 ns, Q3 = 34.373 ns, Max = 34.993 ns
IQR = 0.452 ns, LowerFence = 33.243 ns, UpperFence = 35.052 ns
ConfidenceInterval = [33.723 ns; 34.612 ns] (CI 99.9%), Margin = 0.444 ns (1.30% of Mean)
Skewness = 0.79, Kurtosis = 2.41, MValue = 2

// ** Remained 36 (90.0 %) benchmark(s) to run. Estimated finish 2024-05-30 18:23 (0h 12m from now) **
// **************************
// Benchmark: Perf_Count.Array: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/e4c10281-f8c7-4396-8dc4-c472474b7c66/corerun 86e96c8f-d9ac-4f39-9d51-266a853f12b2.dll --anonymousPipes 118 119 --benchmarkName "Perf_Count.Array(Length: 1)" --job Toolchain=/core_root_diff/corerun --benchmarkId 0 in /home/egorbot/benchapp/bin/Release/net9.0/86e96c8f-d9ac-4f39-9d51-266a853f12b2/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 308813.00 ns, 308.8130 us/op
WorkloadJitting  1: 1 op, 739453.00 ns, 739.4530 us/op

OverheadJitting  2: 16 op, 441604.00 ns, 27.6002 us/op
WorkloadJitting  2: 16 op, 433479.00 ns, 27.0924 us/op

WorkloadPilot    1: 16 op, 2742.00 ns, 171.3750 ns/op
WorkloadPilot    2: 32 op, 4084.00 ns, 127.6250 ns/op
WorkloadPilot    3: 64 op, 6406.00 ns, 100.0938 ns/op
WorkloadPilot    4: 128 op, 11628.00 ns, 90.8438 ns/op
WorkloadPilot    5: 256 op, 22027.00 ns, 86.0430 ns/op
WorkloadPilot    6: 512 op, 36333.00 ns, 70.9629 ns/op
WorkloadPilot    7: 1024 op, 62989.00 ns, 61.5127 ns/op
WorkloadPilot    8: 2048 op, 118040.00 ns, 57.6367 ns/op
WorkloadPilot    9: 4096 op, 264526.00 ns, 64.5815 ns/op
WorkloadPilot   10: 8192 op, 609469.00 ns, 74.3981 ns/op
WorkloadPilot   11: 16384 op, 998224.00 ns, 60.9268 ns/op
WorkloadPilot   12: 32768 op, 1806469.00 ns, 55.1291 ns/op
WorkloadPilot   13: 65536 op, 3398457.00 ns, 51.8563 ns/op
WorkloadPilot   14: 131072 op, 6516312.00 ns, 49.7155 ns/op
WorkloadPilot   15: 262144 op, 12868922.00 ns, 49.0910 ns/op
WorkloadPilot   16: 524288 op, 25532305.00 ns, 48.6990 ns/op
WorkloadPilot   17: 1048576 op, 50799324.00 ns, 48.4460 ns/op
WorkloadPilot   18: 2097152 op, 50806741.00 ns, 24.2265 ns/op
WorkloadPilot   19: 4194304 op, 14331004.00 ns, 3.4168 ns/op
WorkloadPilot   20: 8388608 op, 29123889.00 ns, 3.4718 ns/op
WorkloadPilot   21: 16777216 op, 58987390.00 ns, 3.5159 ns/op
WorkloadPilot   22: 33554432 op, 117566904.00 ns, 3.5038 ns/op
WorkloadPilot   23: 67108864 op, 232610430.00 ns, 3.4662 ns/op
WorkloadPilot   24: 134217728 op, 467386714.00 ns, 3.4823 ns/op
WorkloadPilot   25: 268435456 op, 931222798.00 ns, 3.4691 ns/op

OverheadWarmup   1: 268435456 op, 472372118.00 ns, 1.7597 ns/op
OverheadWarmup   2: 268435456 op, 418829205.00 ns, 1.5603 ns/op
OverheadWarmup   3: 268435456 op, 418692070.00 ns, 1.5597 ns/op
OverheadWarmup   4: 268435456 op, 418723795.00 ns, 1.5599 ns/op
OverheadWarmup   5: 268435456 op, 418706063.00 ns, 1.5598 ns/op
OverheadWarmup   6: 268435456 op, 418739477.00 ns, 1.5599 ns/op
OverheadWarmup   7: 268435456 op, 418729743.00 ns, 1.5599 ns/op

OverheadActual   1: 268435456 op, 418777185.00 ns, 1.5601 ns/op
OverheadActual   2: 268435456 op, 418748417.00 ns, 1.5600 ns/op
OverheadActual   3: 268435456 op, 418698806.00 ns, 1.5598 ns/op
OverheadActual   4: 268435456 op, 418664028.00 ns, 1.5596 ns/op
OverheadActual   5: 268435456 op, 418743750.00 ns, 1.5599 ns/op
OverheadActual   6: 268435456 op, 418693083.00 ns, 1.5598 ns/op
OverheadActual   7: 268435456 op, 418789646.00 ns, 1.5601 ns/op
OverheadActual   8: 268435456 op, 418795026.00 ns, 1.5601 ns/op
OverheadActual   9: 268435456 op, 418708958.00 ns, 1.5598 ns/op
OverheadActual  10: 268435456 op, 418709491.00 ns, 1.5598 ns/op
OverheadActual  11: 268435456 op, 418754280.00 ns, 1.5600 ns/op
OverheadActual  12: 268435456 op, 418772480.00 ns, 1.5600 ns/op
OverheadActual  13: 268435456 op, 418711461.00 ns, 1.5598 ns/op
OverheadActual  14: 268435456 op, 418726911.00 ns, 1.5599 ns/op
OverheadActual  15: 268435456 op, 418740221.00 ns, 1.5599 ns/op

WorkloadWarmup   1: 268435456 op, 934654432.00 ns, 3.4819 ns/op
WorkloadWarmup   2: 268435456 op, 935232707.00 ns, 3.4840 ns/op
WorkloadWarmup   3: 268435456 op, 929118660.00 ns, 3.4612 ns/op
WorkloadWarmup   4: 268435456 op, 928424576.00 ns, 3.4587 ns/op
WorkloadWarmup   5: 268435456 op, 930637434.00 ns, 3.4669 ns/op
WorkloadWarmup   6: 268435456 op, 936577471.00 ns, 3.4890 ns/op
WorkloadWarmup   7: 268435456 op, 932589862.00 ns, 3.4742 ns/op

// BeforeActualRun
WorkloadActual   1: 268435456 op, 926521160.00 ns, 3.4516 ns/op
WorkloadActual   2: 268435456 op, 935285764.00 ns, 3.4842 ns/op
WorkloadActual   3: 268435456 op, 930324092.00 ns, 3.4657 ns/op
WorkloadActual   4: 268435456 op, 924860188.00 ns, 3.4454 ns/op
WorkloadActual   5: 268435456 op, 935703608.00 ns, 3.4858 ns/op
WorkloadActual   6: 268435456 op, 933981831.00 ns, 3.4794 ns/op
WorkloadActual   7: 268435456 op, 929629606.00 ns, 3.4631 ns/op
WorkloadActual   8: 268435456 op, 931221189.00 ns, 3.4691 ns/op
WorkloadActual   9: 268435456 op, 934699411.00 ns, 3.4820 ns/op
WorkloadActual  10: 268435456 op, 935194244.00 ns, 3.4839 ns/op
WorkloadActual  11: 268435456 op, 937475452.00 ns, 3.4924 ns/op
WorkloadActual  12: 268435456 op, 931884673.00 ns, 3.4715 ns/op
WorkloadActual  13: 268435456 op, 932575484.00 ns, 3.4741 ns/op
WorkloadActual  14: 268435456 op, 933159183.00 ns, 3.4763 ns/op
WorkloadActual  15: 268435456 op, 935375926.00 ns, 3.4845 ns/op

// AfterActualRun
WorkloadResult   1: 268435456 op, 507780939.00 ns, 1.8916 ns/op
WorkloadResult   2: 268435456 op, 516545543.00 ns, 1.9243 ns/op
WorkloadResult   3: 268435456 op, 511583871.00 ns, 1.9058 ns/op
WorkloadResult   4: 268435456 op, 506119967.00 ns, 1.8854 ns/op
WorkloadResult   5: 268435456 op, 516963387.00 ns, 1.9258 ns/op
WorkloadResult   6: 268435456 op, 515241610.00 ns, 1.9194 ns/op
WorkloadResult   7: 268435456 op, 510889385.00 ns, 1.9032 ns/op
WorkloadResult   8: 268435456 op, 512480968.00 ns, 1.9091 ns/op
WorkloadResult   9: 268435456 op, 515959190.00 ns, 1.9221 ns/op
WorkloadResult  10: 268435456 op, 516454023.00 ns, 1.9239 ns/op
WorkloadResult  11: 268435456 op, 518735231.00 ns, 1.9324 ns/op
WorkloadResult  12: 268435456 op, 513144452.00 ns, 1.9116 ns/op
WorkloadResult  13: 268435456 op, 513835263.00 ns, 1.9142 ns/op
WorkloadResult  14: 268435456 op, 514418962.00 ns, 1.9164 ns/op
WorkloadResult  15: 268435456 op, 516635705.00 ns, 1.9246 ns/op
// GC:  0 0 0 736 268435456
// Threading:  0 0 268435456

// AfterAll
// Benchmark Process 35725 has exited with code 0.

Mean = 1.914 ns, StdErr = 0.003 ns (0.18%), N = 15, StdDev = 0.013 ns
Min = 1.885 ns, Q1 = 1.907 ns, Median = 1.916 ns, Q3 = 1.924 ns, Max = 1.932 ns
IQR = 0.017 ns, LowerFence = 1.883 ns, UpperFence = 1.949 ns
ConfidenceInterval = [1.900 ns; 1.928 ns] (CI 99.9%), Margin = 0.014 ns (0.74% of Mean)
Skewness = -0.72, Kurtosis = 2.46, MValue = 2

// ** Remained 35 (87.5 %) benchmark(s) to run. Estimated finish 2024-05-30 18:25 (0h 13m from now) **
// **************************
// Benchmark: Perf_Count.ArrayTwoDelegates: Job-RMPIRF(Toolchain=/core_root_diff/corerun) [Length=1]
// *** Execute ***
// Launch: 1 / 1
// Execute: /home/egorbot/e4c10281-f8c7-4396-8dc4-c472474b7c66/corerun 86e96c8f-d9ac-4f39-9d51-266a853f12b2.dll --anonymousPipes 118 119 --benchmarkName "Perf_Count.ArrayTwoDelegates(Length: 1)" --job Toolchain=/core_root_diff/corerun --benchmarkId 1 in /home/egorbot/benchapp/bin/Release/net9.0/86e96c8f-d9ac-4f39-9d51-266a853f12b2/bin/Release/net9.0/publish
// BeforeAnythingElse

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.13.12
// Runtime=.NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: DefaultJob

OverheadJitting  1: 1 op, 300250.00 ns, 300.2500 us/op
WorkloadJitting  1: 1 op, 803681.00 ns, 803.6810 us/op

OverheadJitting  2: 16 op, 419461.00 ns, 26.2163 us/op
WorkloadJitting  2: 16 op, 408026.00 ns, 25.5016 us/op

WorkloadPilot    1: 16 op, 3166.00 ns, 197.8750 ns/op
WorkloadPilot    2: 32 op, 4818.00 ns, 150.5625 ns/op
WorkloadPilot    3: 64 op, 7769.00 ns, 121.3906 ns/op
WorkloadPilot    4: 128 op, 14629.00 ns, 114.2891 ns/op
WorkloadPilot    5: 256 op, 28231.00 ns, 110.2773 ns/op
WorkloadPilot    6: 512 op, 55343.00 ns, 108.0918 ns/op
WorkloadPilot    7: 1024 op, 109585.00 ns, 107.0166 ns/op
WorkloadPilot    8: 2048 op, 217875.00 ns, 106.3843 ns/op
WorkloadPilot    9: 4096 op, 590140.00 ns, 144.0771 ns/op
WorkloadPilot   10: 8192 op, 1018842.00 ns, 124.3704 ns/op
WorkloadPilot   11: 16384 op, 1835384.00 ns, 112.0229 ns/op
WorkloadPilot   12: 32768 op, 3462928.00 ns, 105.6802 ns/op
WorkloadPilot   13: 65536 op, 6759878.00 ns, 103.1476 ns/op
WorkloadPilot   14: 131072 op, 13259531.00 ns, 101.1622 ns/op
WorkloadPilot   15: 262144 op, 26257306.00 ns, 100.1637 ns/op
WorkloadPilot   16: 524288 op, 52264764.00 ns, 99.6871 ns/op
WorkloadPilot   17: 1048576 op, 52329419.00 ns, 49.9052 ns/op
WorkloadPilot   18: 2097152 op, 13716127.00 ns, 6.5404 ns/op
WorkloadPilot   19: 4194304 op, 27279804.00 ns, 6.5040 ns/op
WorkloadPilot   20: 8388608 op, 55620977.00 ns, 6.6305 ns/op
WorkloadPilot   21: 16777216 op, 110596294.00 ns, 6.5921 ns/op
WorkloadPilot   22: 33554432 op, 219740254.00 ns, 6.5488 ns/op
WorkloadPilot   23: 67108864 op, 440231514.00 ns, 6.5600 ns/op
WorkloadPilot   24: 134217728 op, 878251311.00 ns, 6.5435 ns/op

OverheadWarmup   1: 134217728 op, 264892403.00 ns, 1.9736 ns/op
OverheadWarmup   2: 134217728 op, 211762309.00 ns, 1.5778 ns/op
OverheadWarmup   3: 134217728 op, 211742675.00 ns, 1.5776 ns/op
OverheadWarmup   4: 134217728 op, 211794792.00 ns, 1.5780 ns/op
OverheadWarmup   5: 134217728 op, 211758937.00 ns, 1.5777 ns/op
OverheadWarmup   6: 134217728 op, 211722790.00 ns, 1.5775 ns/op
OverheadWarmup   7: 134217728 op, 211758872.00 ns, 1.5777 ns/op
OverheadWarmup   8: 134217728 op, 211796378.00 ns, 1.5780 ns/op
OverheadWarmup   9: 134217728 op, 211771570.00 ns, 1.5778 ns/op

OverheadActual   1: 134217728 op, 211736280.00 ns, 1.5776 ns/op
OverheadActual   2: 134217728 op, 211748910.00 ns, 1.5777 ns/op
OverheadActual   3: 134217728 op, 211744068.00 ns, 1.5776 ns/op
OverheadActual   4: 134217728 op, 211749153.00 ns, 1.5777 ns/op
OverheadActual   5: 134217728 op, 211759517.00 ns, 1.5777 ns/op
OverheadActual   6: 134217728 op, 211767955.00 ns, 1.5778 ns/op
OverheadActual   7: 134217728 op, 211778179.00 ns, 1.5779 ns/op
OverheadActual   8: 134217728 op, 211737912.00 ns, 1.5776 ns/op
OverheadActual   9: 134217728 op, 211768882.00 ns, 1.5778 ns/op
OverheadActual  10: 134217728 op, 211749400.00 ns, 1.5777 ns/op
OverheadActual  11: 134217728 op, 211750253.00 ns, 1.5777 ns/op
OverheadActual  12: 134217728 op, 211794254.00 ns, 1.5780 ns/op
OverheadActual  13: 134217728 op, 211769344.00 ns, 1.5778 ns/op
OverheadActual  14: 134217728 op, 211858569.00 ns, 1.5785 ns/op
OverheadActual  15: 134217728 op, 211772765.00 ns, 1.5778 ns/op

WorkloadWarmup   1: 134217728 op, 875727031.00 ns, 6.5247 ns/op
WorkloadWarmup   2: 134217728 op, 876755874.00 ns, 6.5323 ns/op
WorkloadWarmup   3: 134217728 op, 880510560.00 ns, 6.5603 ns/op
WorkloadWarmup   4: 134217728 op, 880146493.00 ns, 6.5576 ns/op
WorkloadWarmup   5: 134217728 op, 880824778.00 ns, 6.5627 ns/op
WorkloadWarmup   6: 134217728 op, 879419601.00 ns, 6.5522 ns/op

// BeforeActualRun
WorkloadActual   1: 134217728 op, 877352826.00 ns, 6.5368 ns/op
WorkloadActual   2: 134217728 op, 879274592.00 ns, 6.5511 ns/op
WorkloadActual   3: 134217728 op, 879275278.00 ns, 6.5511 ns/op
WorkloadActual   4: 134217728 op, 881703416.00 ns, 6.5692 ns/op
WorkloadActual   5: 134217728 op, 885236101.00 ns, 6.5955 ns/op
WorkloadActual   6: 134217728 op, 878578683.00 ns, 6.5459 ns/op
WorkloadActual   7: 134217728 op, 875944284.00 ns, 6.5263 ns/op
WorkloadActual   8: 134217728 op, 880405690.00 ns, 6.5595 ns/op
WorkloadActual   9: 134217728 op, 880549642.00 ns, 6.5606 ns/op
WorkloadActual  10: 134217728 op, 879794387.00 ns, 6.5550 ns/op
WorkloadActual  11: 134217728 op, 882493690.00 ns, 6.5751 ns/op
WorkloadActual  12: 134217728 op, 880975913.00 ns, 6.5638 ns/op
WorkloadActual  13: 134217728 op, 884025684.00 ns, 6.5865 ns/op
WorkloadActual  14: 134217728 op, 882262955.00 ns, 6.5734 ns/op
WorkloadActual  15: 134217728 op, 882774506.00 ns, 6.5772 ns/op

// AfterActualRun
WorkloadResult   1: 134217728 op, 665593309.00 ns, 4.9591 ns/op
WorkloadResult   2: 134217728 op, 667515075.00 ns, 4.9734 ns/op
WorkloadResult   3: 134217728 op, 667515761.00 ns, 4.9734 ns/op
WorkloadResult   4: 134217728 op, 669943899.00 ns, 4.9915 ns/op
WorkloadResult   5: 134217728 op, 673476584.00 ns, 5.0178 ns/op
WorkloadResult   6: 134217728 op, 666819166.00 ns, 4.9682 ns/op
WorkloadResult   7: 134217728 op, 664184767.00 ns, 4.9486 ns/op
WorkloadResult   8: 134217728 op, 668646173.00 ns, 4.9818 ns/op
WorkloadResult   9: 134217728 op, 668790125.00 ns, 4.9829 ns/op
WorkloadResult  10: 134217728 op, 668034870.00 ns, 4.9772 ns/op
WorkloadResult  11: 134217728 op, 670734173.00 ns, 4.9974 ns/op
WorkloadResult  12: 134217728 op, 669216396.00 ns, 4.9861 ns/op
WorkloadResult  13: 134217728 op, 672266167.00 ns, 5.0088 ns/op
WorkloadResult  14: 134217728 op, 670503438.00 ns, 4.9956 ns/op
WorkloadResult  15: 134217728 op, 671014989.00 ns, 4.9995 ns/op
// GC:  0 0 0 448 134217728
// Threading:  0 0 134217728

// AfterAll

@EgorBot

EgorBot commented May 30, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-KPIUTA : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-ZTLZPC : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Length | Mean | Error | Ratio | Code Size | Gen0 | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Array | Main | 1 | 17.962 ns | 0.0340 ns | 1.00 | 848 B | 0.0019 | 32 B | 1.00 |
| Array | PR | 1 | 3.143 ns | 0.0014 ns | 0.17 | 588 B | - | - | 0.00 |
| ArrayTwoDelegates | Main | 1 | 39.478 ns | 0.0575 ns | 1.00 | 1,076 B | 0.0038 | 64 B | 1.00 |
| ArrayTwoDelegates | PR | 1 | 7.068 ns | 0.0092 ns | 0.18 | 932 B | - | - | 0.00 |
| Select | Main | 1 | 23.539 ns | 0.4964 ns | 1.00 | 1,100 B | 0.0029 | 48 B | 1.00 |
| Select | PR | 1 | 22.678 ns | 0.0692 ns | 0.95 | 1,256 B | 0.0029 | 48 B | 1.00 |
| SelectTwoDelegates | Main | 1 | 44.897 ns | 0.5776 ns | 1.00 | 1,232 B | 0.0057 | 96 B | 1.00 |
| SelectTwoDelegates | PR | 1 | 62.163 ns | 0.0701 ns | 1.39 | 1,520 B | 0.0057 | 96 B | 1.00 |
| Array | Main | 5 | 26.916 ns | 0.0580 ns | 1.00 | 896 B | 0.0019 | 32 B | 1.00 |
| Array | PR | 5 | 7.377 ns | 0.0020 ns | 0.27 | 588 B | - | - | 0.00 |
| ArrayTwoDelegates | Main | 5 | 73.327 ns | 0.1173 ns | 1.00 | 1,068 B | 0.0038 | 64 B | 1.00 |
| ArrayTwoDelegates | PR | 5 | 25.919 ns | 0.0028 ns | 0.35 | 932 B | - | - | 0.00 |
| Select | Main | 5 | 29.020 ns | 0.0944 ns | 1.00 | 1,084 B | 0.0029 | 48 B | 1.00 |
| Select | PR | 5 | 30.440 ns | 0.0985 ns | 1.05 | 1,240 B | 0.0029 | 48 B | 1.00 |
| SelectTwoDelegates | Main | 5 | 68.948 ns | 0.1624 ns | 1.00 | 1,248 B | 0.0057 | 96 B | 1.00 |
| SelectTwoDelegates | PR | 5 | 99.838 ns | 0.5658 ns | 1.45 | 1,528 B | 0.0057 | 96 B | 1.00 |
| Array | Main | 10 | 36.828 ns | 0.0172 ns | 1.00 | 908 B | 0.0019 | 32 B | 1.00 |
| Array | PR | 10 | 16.058 ns | 0.0043 ns | 0.44 | 588 B | - | - | 0.00 |
| ArrayTwoDelegates | Main | 10 | 110.429 ns | 0.3215 ns | 1.00 | 1,072 B | 0.0038 | 64 B | 1.00 |
| ArrayTwoDelegates | PR | 10 | 47.870 ns | 0.0065 ns | 0.43 | 932 B | - | - | 0.00 |
| Select | Main | 10 | 46.274 ns | 0.2265 ns | 1.00 | 1,096 B | 0.0029 | 48 B | 1.00 |
| Select | PR | 10 | 40.824 ns | 0.0680 ns | 0.88 | 1,252 B | 0.0029 | 48 B | 1.00 |
| SelectTwoDelegates | Main | 10 | 95.956 ns | 0.1711 ns | 1.00 | 1,248 B | 0.0057 | 96 B | 1.00 |
| SelectTwoDelegates | PR | 10 | 114.314 ns | 0.2522 ns | 1.19 | 1,524 B | 0.0057 | 96 B | 1.00 |
| Array | Main | 100 | 210.623 ns | 0.0742 ns | 1.00 | 908 B | 0.0019 | 32 B | 1.00 |
| Array | PR | 100 | 153.473 ns | 0.0233 ns | 0.73 | 588 B | - | - | 0.00 |
| ArrayTwoDelegates | Main | 100 | 667.137 ns | 0.1970 ns | 1.00 | 1,064 B | 0.0038 | 64 B | 1.00 |
| ArrayTwoDelegates | PR | 100 | 554.366 ns | 0.1675 ns | 0.83 | 916 B | - | - | 0.00 |
| Select | Main | 100 | 225.669 ns | 0.1646 ns | 1.00 | 1,092 B | 0.0029 | 48 B | 1.00 |
| Select | PR | 100 | 234.810 ns | 0.3160 ns | 1.04 | 1,252 B | 0.0029 | 48 B | 1.00 |
| SelectTwoDelegates | Main | 100 | 637.834 ns | 0.7520 ns | 1.00 | 1,252 B | 0.0057 | 96 B | 1.00 |
| SelectTwoDelegates | PR | 100 | 656.683 ns | 2.4740 ns | 1.03 | 1,516 B | 0.0057 | 96 B | 1.00 |
| Array | Main | 10000 | 18,717.879 ns | 2.2861 ns | 1.00 | 888 B | - | 32 B | 1.00 |
| Array | PR | 10000 | 15,036.834 ns | 3.2973 ns | 0.80 | 588 B | - | - | 0.00 |
| ArrayTwoDelegates | Main | 10000 | 57,513.356 ns | 8.0178 ns | 1.00 | 1,008 B | - | 64 B | 1.00 |
| ArrayTwoDelegates | PR | 10000 | 44,270.098 ns | 10.6963 ns | 0.77 | 924 B | - | - | 0.00 |
| Select | Main | 10000 | 20,068.905 ns | 7.8026 ns | 1.00 | 1,060 B | - | 48 B | 1.00 |
| Select | PR | 10000 | 20,069.714 ns | 3.4131 ns | 1.00 | 1,220 B | - | 48 B | 1.00 |
| SelectTwoDelegates | Main | 10000 | 57,815.233 ns | 35.4719 ns | 1.00 | 1,172 B | - | 96 B | 1.00 |
| SelectTwoDelegates | PR | 10000 | 57,309.569 ns | 67.3670 ns | 0.99 | 1,456 B | - | 96 B | 1.00 |

BDN_Artifacts.zip

@EgorBo
Member

EgorBo commented May 30, 2024

❌ Benchmark failed on Intel

Sadly, it seems to be some general unrelated BDN issue - it just dies sometimes (I think Stephen hit it too once). Looks like it's DisassemblyDiagnoser specific.

@neon-sunset
Contributor Author

neon-sunset commented May 30, 2024

@jkotas After reviewing the ARM64 results and their disassembly, it appears that my assumption about per-callsite delegate inlining was wrong. I think the JIT gathers the profile in the instrumented Count/CountSpan compilations before they get inlined, so only a single delegate ends up inlined in both loop bodies (with the winner possibly picked non-deterministically?):

M00_L02: ;; first loop start
       ldr       w1, [x23, w22, uxtw #2]
       ldr       x0, [x21, #0x18]
       add       x2, x26, #0x18
       cmp       x0, x2
       b.ne      M00_L03
       tst       w1, #7 ;; i % 8 is 0 instead of i % 2 is 0
       b.ne      M00_L05
       b         M00_L04
M00_L03:
       ldr       x0, [x21, #8]
       ldr       x2, [x21, #0x18]
       blr       x2
       cbz       w0, M00_L05
M00_L04:
       add       w25, w25, #1
M00_L05:
       add       w22, w22, #1
       cmp       w22, w24
       b.lt      M00_L02
M00_L06:
       ldr       x22, [x20, #8]
       cbz       x22, M00_L18
M00_L07:
       cbz       x22, M00_L24
       movz      w23, #0x1
       ldr       x1, [x19]
       movz      x0, #0x61d0
       movk      x0, #0x8d05, lsl #16
       movk      x0, #0xe8f5, lsl #32
       cmp       x1, x0
       b.ne      M00_L19
       add       x20, x19, #0x10
       ldr       w21, [x19, #8]
M00_L08:
       cbz       w23, M00_L23
       mov       w24, wzr
       mov       w19, wzr
       cmp       w21, #0
       b.le      M00_L13
       movz      x26, #0xb360
       movk      x26, #0x8d61, lsl #16
       movk      x26, #0xe8f5, lsl #32
M00_L09:
       ldr       w1, [x20, w19, uxtw #2]
       ldr       x0, [x22, #0x18]
       add       x2, x26, #0x18
       cmp       x0, x2
       b.ne      M00_L10
       tst       w1, #7 ;; same i % 8 is 0, as expected
       b.ne      M00_L12
       b         M00_L11
M00_L10:
       ldr       x0, [x22, #8]
       ldr       x2, [x22, #0x18]
       blr       x2
       cbz       w0, M00_L12
M00_L11:
       add       w24, w24, #1
M00_L12:
       add       w19, w19, #1
       cmp       w19, w21
       b.lt      M00_L09

Given the above, please let me know what you think is the appropriate course of action:

  • Keep current variant
  • Outline span loop (mitigates DPGO impact on enumerable in bi/polymorphic scenarios where span was favoured, but prevents method inlining which has both pros and cons, more cons if JIT learns to optimize away TryGetSpan)
  • Remove outlining for both span and enumerable, enumerable will be impacted in bi/polymorphic scenarios but it will be more consistent with other LINQ code
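
For readers less used to reading the disassembly, the guarded call in both loops above can be approximated in C#. This is only a sketch of the shape of the generated code, assuming the profile recorded the `i % 8 == 0` lambda. `GuardedCallSketch`, `ProfiledPredicate`, and `CountGuarded` are hypothetical names, and the real JIT compares the delegate's method pointer rather than object references.

```csharp
using System;

public static class GuardedCallSketch
{
    // Hypothetical stand-in for the one delegate the profile happened to record.
    public static readonly Func<int, bool> ProfiledPredicate = i => i % 8 == 0;

    public static int CountGuarded(ReadOnlySpan<int> span, Func<int, bool> predicate)
    {
        int count = 0;
        foreach (int value in span)
        {
            bool matches;
            if (ReferenceEquals(predicate, ProfiledPredicate))
            {
                // Guard hit: inlined body of the guessed delegate (the `tst w1, #7`).
                matches = (value & 7) == 0;
            }
            else
            {
                // Guard miss: ordinary indirect delegate call (the `blr x2`).
                matches = predicate(value);
            }

            if (matches)
            {
                count++;
            }
        }

        return count;
    }
}
```

A callsite passing `i => i % 2 == 0` always takes the guard-miss branch in this sketch, which mirrors why only one of the two delegates benefits in the listing above.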

@jkotas
Member

jkotas commented May 30, 2024

My preference is to keep all Linq code consistent.

@neon-sunset
Contributor Author

neon-sunset commented May 30, 2024

> My preference is to keep all Linq code consistent.

Personally, I think all top-level LINQ methods that have TryGetSpan + EH from IEnumerable path in them are making a suboptimal implementation choice but that's not for me to decide.

Either way, thanks!

@neon-sunset
Contributor Author

neon-sunset commented May 30, 2024

@jkotas do you mind if I update this PR to include a similar optimization for the .Any and .All predicate overloads? Both are very popular for scanning arrays/lists. I know Stephen Toub has concerns regarding regressions, particularly when including the IList path instead, but it seems the impact of just the span path is more limited, with better (if less frequent) wins across the board.

@stephentoub
Member

As noted in #102696 (comment), main...stephentoub:runtime:fasterlinqenumeration highlights that it's not just Count/Any/All, but many more APIs that would need to be updated to be made consistent. I've not submitted a PR for that yet though because of the regressions that it brings in addition to the improvements. Most of those were from special-casing IList as well, but even for the checks for array/list that the span checks incur, it adds some overhead that will show up if these APIs are used on non-arrays/lists frequently.

@neon-sunset
Contributor Author

neon-sunset commented May 31, 2024

Fair enough, but on the other hand this change (particularly the variant that moves the inlining-blocking EH to a local function) replaces a pattern that is difficult for the JIT to optimize with one that is easier to optimize and has a higher likelihood of improving (I find it surprising that it still does not optimize away arr.GetType() != typeof(int[]), given it's a direct scalar comparison).

It pays about two branches of setup to iterate directly on the span. That is cheaper than is IList<T> -> .TryGetSpan, and I feel that small changes of this kind are easier to reason about and introduce than larger rewrites.
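
To make the shape under discussion concrete, here is a minimal sketch of a predicate `Count` with a span fast path and the enumerator loop moved into a local function. It is not the merged implementation: `CountSketch`, `CountWithFastPath`, and this `TryGetSpan` are hypothetical stand-ins (the real helper is internal to System.Linq), and the exact-type checks for `T[]` and `List<T>` are an assumption about what the "two branches" of setup look like.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class CountSketch
{
    // Hypothetical stand-in for the internal helper: two exact-type checks.
    private static bool TryGetSpan<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
    {
        if (source.GetType() == typeof(T[]))
        {
            span = (T[])(object)source;
            return true;
        }

        if (source.GetType() == typeof(List<T>))
        {
            span = CollectionsMarshal.AsSpan((List<T>)(object)source);
            return true;
        }

        span = default;
        return false;
    }

    public static int CountWithFastPath<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        ArgumentNullException.ThrowIfNull(source);
        ArgumentNullException.ThrowIfNull(predicate);

        if (TryGetSpan(source, out ReadOnlySpan<T> span))
        {
            // Fast path: plain indexed loop, no enumerator allocation.
            int count = 0;
            foreach (T item in span)
            {
                if (predicate(item))
                {
                    count++;
                }
            }

            return count;
        }

        return CountEnumerable(source, predicate);

        // The enumerator loop, with its implicit try/finally, is kept out of
        // the outer method body so the outer method stays small.
        static int CountEnumerable(IEnumerable<T> items, Func<T, bool> test)
        {
            int count = 0;
            foreach (T item in items)
            {
                if (test(item))
                {
                    checked { count++; }
                }
            }

            return count;
        }
    }
}
```

Arrays and lists take the span loop; anything else, such as the result of `Select`, falls through to the local function and behaves as before.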

@neon-sunset
Contributor Author

Please let me know what the next steps are for this change, if any. Thanks!

@stephentoub
Member

I added in use of TryGetSpan to the other relevant sinks.

Member

@stephentoub stephentoub left a comment

Thanks.

Labels
area-System.Linq community-contribution Indicates that the PR has been added by a community member
Development

Successfully merging this pull request may close these issues.

LINQ - use TryGetSpan in Enumerable.Count with predicate
5 participants