Weird performance numbers for Generics with class constraint and .net9 regression #110831

Open
Qibbi opened this issue Dec 19, 2024 · 22 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue untriaged New issue has not been triaged by the area owner

Comments

@Qibbi

Qibbi commented Dec 19, 2024

Description

I wrote some benchmarks (using BenchmarkDotNet) to see how method calls behave across combinations of interfaces and structs/classes.
In doing so I found a weird performance penalty in .NET 8 when using the class generic constraint: on my machine the benchmark takes about 400 ms with the constraint, compared to about 300 ms without it. I should note that there occasionally seem to be outliers where the version without the constraint also takes 400 ms. In .NET 9 it reliably takes about 400 ms with or without the constraint.

My setup is simple, in the hope of ruling out other causes.

Interface/Struct/Class:

public interface ITest
{
    int Count { get; }
    void Run();
}

public struct Test : ITest
{
    public int Count { get; private set; }

    public void Run()
    {
        Count++;
    }
}

public class TestClass : ITest
{
    public int Count { get; private set; }

    public void Run()
    {
        Count++;
    }
}

(I could have used an int Count; field instead of the property, but the property gets jitted away anyway and doesn't affect the numbers.)

Then I set up several members in a Benchmark class:

private readonly ITest Test1 = new Test();
private readonly T Test2 = new(); // where T: ITest, new()
private readonly U Test3; // where U : struct, ITest
private readonly V Test4; // where V: unmanaged, ITest
private readonly Test Test5;
private readonly ITest Test6 = new TestClass();
private readonly W Test7 = new(); // where W: ITest, new()
private readonly X Test8 = new(); // where X: class, ITest, new()
private readonly TestClass Test9 = new();

My benchmark methods all look like this:

    [Benchmark]
    public void RunStructViaInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test1.Run();
        }
    }

Configuration/Data

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method                                 | Mean     | Error   | StdDev  |
|--------------------------------------- |---------:|--------:|--------:|
| RunStructViaInterface                  | 434.8 ms | 6.93 ms | 6.15 ms |
| RunStructWhereTIsOneInterface          | 203.8 ms | 1.78 ms | 1.67 ms |
| RunStructWhereTIsStructOneInterface    | 203.4 ms | 1.73 ms | 1.45 ms |
| RunStructWhereTIsUnmanagedOneInterface | 204.1 ms | 2.74 ms | 2.57 ms |
| RunStructDirect                        | 204.0 ms | 2.51 ms | 2.23 ms |
| RunClassViaInterface                   | 413.9 ms | 7.17 ms | 6.70 ms |
| RunClassWhereTIsOneInterface           | 302.2 ms | 2.00 ms | 3.86 ms |
| RunClassWhereTIsClassOneInterface      | 413.7 ms | 4.82 ms | 4.27 ms |
| RunClassDirect                         | 206.5 ms | 1.20 ms | 1.12 ms |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                                 | Mean     | Error   | StdDev  |
|--------------------------------------- |---------:|--------:|--------:|
| RunStructViaInterface                  | 434.1 ms | 1.57 ms | 1.31 ms |
| RunStructWhereTIsOneInterface          | 204.4 ms | 2.02 ms | 1.89 ms |
| RunStructWhereTIsStructOneInterface    | 203.9 ms | 1.74 ms | 1.63 ms |
| RunStructWhereTIsUnmanagedOneInterface | 203.7 ms | 1.98 ms | 1.75 ms |
| RunStructDirect                        | 203.6 ms | 2.05 ms | 1.92 ms |
| RunClassViaInterface                   | 416.2 ms | 4.26 ms | 3.99 ms |
| RunClassWhereTIsOneInterface           | 417.1 ms | 4.96 ms | 4.39 ms |
| RunClassWhereTIsClassOneInterface      | 416.5 ms | 4.83 ms | 4.51 ms |
| RunClassDirect                         | 207.4 ms | 2.99 ms | 2.80 ms |
@Qibbi Qibbi added the tenet-performance Performance related issue label Dec 19, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 19, 2024
@huoyaoyuan huoyaoyuan added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Dec 19, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Member

Can you please post the full benchmark source code?

@Qibbi
Author

Qibbi commented Dec 19, 2024

Benchmark.cs:

using BenchmarkDotNet.Attributes;

namespace ConsoleApp1;

public class Benchmark<T, U, V, W, X> where T : ITest, new() where U : struct, ITest where V : unmanaged, ITest where W : ITest, new() where X : class, ITest, new()
{
    private readonly ITest Test1 = new Test();
    private readonly T Test2 = new();
    private readonly U Test3;
    private readonly V Test4;
    private readonly Test Test5;
    private readonly ITest Test6 = new TestClass();
    private readonly W Test7 = new();
    private readonly X Test8 = new();
    private readonly TestClass Test9 = new();

    [Benchmark]
    public void RunStructViaInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test1.Run();
        }
    }

    [Benchmark]
    public void RunStructWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test2.Run();
        }
    }

    [Benchmark]
    public void RunStructWhereTIsStructOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test3.Run();
        }
    }

    [Benchmark]
    public void RunStructWhereTIsUnmanagedOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test4.Run();
        }
    }

    [Benchmark]
    public void RunStructDirect()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test5.Run();
        }
    }

    [Benchmark]
    public void RunClassViaInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test6.Run();
        }
    }

    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test7.Run();
        }
    }

    [Benchmark]
    public void RunClassWhereTIsClassOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test8.Run();
        }
    }

    [Benchmark]
    public void RunClassDirect()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
        {
            Test9.Run();
        }
    }
}

Test.cs:

namespace ConsoleApp1;

public interface ITest
{
    int Count { get; }
    void Run();
}

public struct Test : ITest
{
    public int Count { get; private set; }

    public void Run()
    {
        Count++;
    }
}

public class TestClass : ITest
{
    public int Count { get; private set; }

    public void Run()
    {
        Count++;
    }
}

Program.cs:

using BenchmarkDotNet.Running;
using ConsoleApp1;

BenchmarkRunner.Run<Benchmark<Test, Test, Test, TestClass, TestClass>>();

@Qibbi
Author

Qibbi commented Dec 19, 2024

An addendum:
I realized that I wanted raw data on how fast method calls are in various scenarios, and that the JIT might inline things (which it obviously does), so I changed Test.cs to:

using System.Runtime.CompilerServices;

namespace ConsoleApp1;

public interface ITest
{
    int Count { get; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Run();
}

public struct Test : ITest
{
    public int Count { get; private set; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run()
    {
        Count++;
    }
}

public class TestClass : ITest
{
    public int Count { get; private set; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run()
    {
        Count++;
    }
}

(Side note: does the attribute on the interface method actually do anything, and if not, what would be the correct way?)
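
On the side note above: as far as I know, MethodImplAttribute is recorded only on the method it decorates and is not inherited by implementing methods, so the flag on the interface declaration should not decide whether an implementation of Run() gets inlined. A minimal reflection sketch to check which methods actually carry the flag (IMarked, Unmarked, Marked, and NoInliningCheck are hypothetical names, not part of the benchmark):

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public interface IMarked
{
    // Flag placed on the interface declaration only.
    [MethodImpl(MethodImplOptions.NoInlining)]
    void Run();
}

public class Unmarked : IMarked
{
    public int Count { get; private set; }

    // No attribute here, so this method's own metadata has no NoInlining flag.
    public void Run() => Count++;
}

public class Marked : IMarked
{
    public int Count { get; private set; }

    // Attribute on the implementation itself.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run() => Count++;
}

public static class NoInliningCheck
{
    // True when the method's own metadata carries the NoInlining flag.
    public static bool HasNoInlining(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(methodName)
            ?? throw new ArgumentException($"No public method {methodName} on {type}.");
        return (method.MethodImplementationFlags & MethodImplAttributes.NoInlining) != 0;
    }

    public static void Main()
    {
        Console.WriteLine($"Unmarked.Run: {HasNoInlining(typeof(Unmarked), "Run")}");
        Console.WriteLine($"Marked.Run:   {HasNoInlining(typeof(Marked), "Run")}");
    }
}
```

If that holds, putting the attribute on each implementation (as Test.cs already does) is the part that matters.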

However, the same behavior can be observed as before.

I switched machines for this, so here is the unmodified .NET 8 test on the other machine as a baseline:

AMD Ryzen 9 6900HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 AOT AVX2
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2


| Method                                 | Mean     | Error   | StdDev  |
|--------------------------------------- |---------:|--------:|--------:|
| RunStructViaInterface                  | 465.7 ms | 5.42 ms | 5.07 ms |
| RunStructWhereTIsOneInterface          | 218.7 ms | 2.96 ms | 2.62 ms |
| RunStructWhereTIsStructOneInterface    | 215.6 ms | 0.96 ms | 0.85 ms |
| RunStructWhereTIsUnmanagedOneInterface | 217.3 ms | 1.24 ms | 0.97 ms |
| RunStructDirect                        | 214.9 ms | 0.82 ms | 0.72 ms |
| RunClassViaInterface                   | 441.5 ms | 1.26 ms | 1.12 ms |
| RunClassWhereTIsOneInterface           | 338.9 ms | 0.74 ms | 0.66 ms |
| RunClassWhereTIsClassOneInterface      | 443.1 ms | 1.44 ms | 1.35 ms |
| RunClassDirect                         | 219.0 ms | 0.85 ms | 0.67 ms |

And here are the benchmarks with NoInlining:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 6900HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 AOT AVX2
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2


| Method                                 | Mean    | Error    | StdDev   |
|--------------------------------------- |--------:|---------:|---------:|
| RunStructViaInterface                  | 2.207 s | 0.0074 s | 0.0069 s |
| RunStructWhereTIsOneInterface          | 1.111 s | 0.0044 s | 0.0039 s |
| RunStructWhereTIsStructOneInterface    | 1.108 s | 0.0041 s | 0.0038 s |
| RunStructWhereTIsUnmanagedOneInterface | 1.104 s | 0.0031 s | 0.0027 s |
| RunStructDirect                        | 1.104 s | 0.0018 s | 0.0016 s |
| RunClassViaInterface                   | 1.556 s | 0.0047 s | 0.0044 s |
| RunClassWhereTIsOneInterface           | 1.325 s | 0.0065 s | 0.0060 s |
| RunClassWhereTIsClassOneInterface      | 1.543 s | 0.0046 s | 0.0038 s |
| RunClassDirect                         | 1.106 s | 0.0055 s | 0.0051 s |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 6900HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 AOT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2


| Method                                 | Mean    | Error    | StdDev   |
|--------------------------------------- |--------:|---------:|---------:|
| RunStructViaInterface                  | 2.408 s | 0.0101 s | 0.0089 s |
| RunStructWhereTIsOneInterface          | 1.096 s | 0.0018 s | 0.0015 s |
| RunStructWhereTIsStructOneInterface    | 1.098 s | 0.0024 s | 0.0021 s |
| RunStructWhereTIsUnmanagedOneInterface | 1.097 s | 0.0013 s | 0.0012 s |
| RunStructDirect                        | 1.098 s | 0.0022 s | 0.0019 s |
| RunClassViaInterface                   | 1.739 s | 0.0030 s | 0.0026 s |
| RunClassWhereTIsOneInterface           | 1.797 s | 0.0070 s | 0.0062 s |
| RunClassWhereTIsClassOneInterface      | 1.804 s | 0.0056 s | 0.0053 s |
| RunClassDirect                         | 1.356 s | 0.0026 s | 0.0022 s |

@AndyAyersMS
Member

When your benchmarks contain long-running loops there's a chance BDN is not always able to get to the most optimized code. So the results can be somewhat misleading. BDN's approach works best if the benchmark itself is fairly short and BDN is the one controlling the measurement strategy.

What happens if you reduce the iteration count in those loops to something like 5000? If you want BDN to do the per-invocation math for you, also add something like

[Benchmark(OperationsPerInvoke = 5000)]
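
A minimal sketch of what that suggestion could look like as a complete file, assuming the BenchmarkDotNet package is referenced. ICounter, Counter, and ShortLoopBenchmark are hypothetical stand-ins for the types in this issue:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public interface ICounter
{
    void Run();
}

public sealed class Counter : ICounter
{
    public int Count { get; private set; }

    public void Run() => Count++;
}

public class ShortLoopBenchmark<T> where T : ICounter, new()
{
    // Short loop: each benchmark invocation finishes quickly, so BenchmarkDotNet
    // can invoke the method many times and control the measurement itself.
    private const int Iterations = 5000;

    private readonly T _target = new();

    // OperationsPerInvoke tells BenchmarkDotNet that one invocation performs
    // 5000 operations, so the reported mean is per Run() call.
    [Benchmark(OperationsPerInvoke = Iterations)]
    public void RunViaConstrainedGeneric()
    {
        for (int i = 0; i < Iterations; i++)
        {
            _target.Run();
        }
    }
}

public static class Program
{
    public static void Main(string[] args) =>
        BenchmarkRunner.Run<ShortLoopBenchmark<Counter>>(args: args);
}
```

Keeping the loop bound and OperationsPerInvoke tied to the same constant avoids the two drifting apart when the iteration count is changed.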

@Qibbi
Author

Qibbi commented Dec 19, 2024

> [Benchmark(OperationsPerInvoke = 5000)]

If I try that, all values are ~0.0001 ns, or I get "The method duration is indistinguishable from the empty method duration" with a time of 0.0 ns.

By just setting the loop to 5000 iterations I get proportionally smaller values. 5000 is already pretty small, so a lot of the duration is taken up by the loop itself.
Here is an example of 50k iterations:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 AOT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                                 | Mean      | Error    | StdDev   |
|--------------------------------------- |----------:|---------:|---------:|
| RunStructViaInterface                  | 108.85 us | 0.774 us | 0.724 us |
| RunStructWhereTIsOneInterface          |  60.66 us | 0.555 us | 0.519 us |
| RunStructWhereTIsStructOneInterface    |  61.23 us | 0.801 us | 0.749 us |
| RunStructWhereTIsUnmanagedOneInterface |  61.59 us | 0.666 us | 0.623 us |
| RunStructDirect                        |  60.64 us | 0.572 us | 0.535 us |
| RunClassViaInterface                   |  71.03 us | 0.705 us | 0.659 us |
| RunClassWhereTIsOneInterface           |  61.01 us | 0.464 us | 0.411 us |
| RunClassWhereTIsClassOneInterface      |  71.37 us | 0.630 us | 0.590 us |
| RunClassDirect                         |  60.96 us | 0.428 us | 0.401 us |

There is another oddity: I also tested it on my Intel desktop machine and got these times:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4602/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12600K, 1 CPU, 16 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2


| Method                                 | Mean      | Error    | StdDev   |
|--------------------------------------- |----------:|---------:|---------:|
| RunStructViaInterface                  | 130.88 us | 0.535 us | 0.447 us |
| RunStructWhereTIsOneInterface          |  42.87 us | 0.307 us | 0.287 us |
| RunStructWhereTIsStructOneInterface    |  42.94 us | 0.297 us | 0.278 us |
| RunStructWhereTIsUnmanagedOneInterface |  42.87 us | 0.377 us | 0.334 us |
| RunStructDirect                        |  42.75 us | 0.297 us | 0.263 us |
| RunClassViaInterface                   |  65.90 us | 0.181 us | 0.169 us |
| RunClassWhereTIsOneInterface           |  77.00 us | 0.179 us | 0.167 us |
| RunClassWhereTIsClassOneInterface      |  65.99 us | 0.276 us | 0.245 us |
| RunClassDirect                         |  44.66 us | 0.197 us | 0.184 us |

.NET 9, however, again drops the performance to that of the slower of the two:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4602/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12600K, 1 CPU, 16 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2


| Method                                 | Mean      | Error    | StdDev   |
|--------------------------------------- |----------:|---------:|---------:|
| RunStructViaInterface                  | 143.06 us | 0.650 us | 0.576 us |
| RunStructWhereTIsOneInterface          |  43.03 us | 0.333 us | 0.312 us |
| RunStructWhereTIsStructOneInterface    |  43.27 us | 0.255 us | 0.238 us |
| RunStructWhereTIsUnmanagedOneInterface |  43.05 us | 0.120 us | 0.113 us |
| RunStructDirect                        |  43.21 us | 0.184 us | 0.172 us |
| RunClassViaInterface                   |  76.56 us | 0.262 us | 0.245 us |
| RunClassWhereTIsOneInterface           |  76.50 us | 0.285 us | 0.267 us |
| RunClassWhereTIsClassOneInterface      |  76.48 us | 0.299 us | 0.280 us |
| RunClassDirect                         |  44.46 us | 0.234 us | 0.219 us |

@EgorBo
Member

EgorBo commented Dec 19, 2024

@EgorBot -amd --runtimes net8.0 net9.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmark>(args: args);

public class Benchmark
{
    private readonly ITest Test1 = new Test();

    [Benchmark]
    public void RunStructViaInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test1.Run();
    }
}

public interface ITest
{
    int Count { get; }
    void Run();
}

public struct Test : ITest
{
    public int Count { get; private set; }

    public void Run()
    {
        Count++;
    }
}

@EgorBo
Member

EgorBo commented Dec 19, 2024

The bot doesn't reproduce it on any platform: EgorBot/runtime-utils#211 (comment)

@Qibbi
Author

Qibbi commented Dec 19, 2024

@EgorBo That's not the weird part. The weird part is

public class Benchmark<T> where T : ITest, new()
{
    private readonly T Test = new();

    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test.Run();
    }
}

vs

public class Benchmark<T> where T : class, ITest, new()
{
    private readonly T Test = new();

    [Benchmark]
    public void RunClassWhereTIsClassOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test.Run();
    }
}

Adding the class constraint (second case) tanks the performance in .NET 8 on my AMD system (while on Intel it is somehow the other way around).
And in .NET 9 both cases run at the speed of the slower one (on both AMD and Intel).

@EgorBo
Member

EgorBo commented Dec 19, 2024

@Qibbi I am not sure I understand. The title implies that .NET 9 regressed something; can you point me directly to just one [Benchmark] that became slower in 9.0 compared to 8.0?

@Qibbi
Author

Qibbi commented Dec 19, 2024

@EgorBo

.net 8

| RunClassViaInterface                   | 413.9 ms | 7.17 ms | 6.70 ms |
| RunClassWhereTIsOneInterface           | 302.2 ms | 2.00 ms | 3.86 ms |
| RunClassWhereTIsClassOneInterface      | 413.7 ms | 4.82 ms | 4.27 ms |
| RunClassDirect                         | 206.5 ms | 1.20 ms | 1.12 ms |

vs
.net 9

| RunClassViaInterface                   | 416.2 ms | 4.26 ms | 3.99 ms |
| RunClassWhereTIsOneInterface           | 417.1 ms | 4.96 ms | 4.39 ms |
| RunClassWhereTIsClassOneInterface      | 416.5 ms | 4.83 ms | 4.51 ms |
| RunClassDirect                         | 207.4 ms | 2.99 ms | 2.80 ms |

As you can see, RunClassWhereTIsOneInterface is significantly slower in .NET 9.
That benchmark is

public class Benchmark<T> where T : ITest, new()
{
  public readonly T Test = new();
  /* ... */
}

@EgorBo
Member

EgorBo commented Dec 19, 2024

@EgorBot -windows_intel -linux_amd --runtimes net8.0 net9.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmark<Test, Test, Test, TestClass, TestClass>>(args: args);

public class Benchmark<T, U, V, W, X> where T : ITest, new() where U : struct, ITest where V : unmanaged, ITest where W : ITest, new() where X : class, ITest, new()
{
    private readonly ITest Test1 = new Test();
    private readonly U Test3;
    private readonly V Test4;
    private readonly Test Test5;
    private readonly W Test7 = new();

    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test7.Run();
    }
}
public interface ITest
{
    int Count { get; }
    void Run();
}
public struct Test : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}
public class TestClass : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}

@EgorBo
Member

EgorBo commented Dec 19, 2024

@Qibbi still no difference between net8.0 and net9.0 for RunClassWhereTIsOneInterface EgorBot/runtime-utils#212. Have you tried running it individually?

@Qibbi
Author

Qibbi commented Dec 19, 2024

@EgorBo Good point, I hadn't tried that before. Running it individually does result in both runtimes being the same, at ~432 ms on my system.
But I have now also run RunClassWhereTIsClassOneInterface individually, and that one shows up like this:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 AOT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                            | Mean     | Error   | StdDev  |
|---------------------------------- |---------:|--------:|--------:|
| RunClassWhereTIsClassOneInterface | 307.5 ms | 3.67 ms | 3.43 ms |
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 AOT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                            | Mean     | Error   | StdDev  |
|---------------------------------- |---------:|--------:|--------:|
| RunClassWhereTIsClassOneInterface | 431.5 ms | 4.82 ms | 4.51 ms |

There seems to be some weirdness going on still.

@AndyAyersMS
Member

AndyAyersMS commented Dec 20, 2024

I'll just warn again that measuring long-running loops like this might not give you the measurement you want. E.g., from EgorBot's logs:

WorkloadResult   1: 1 op, 388842103.00 ns, 388.8421 ms/op

That "1 op" means BenchmarkDotNet is timing a single invocation of your method per measurement, and repeating that about 30 times. Typically a method needs to be invoked at least 60 times to get to the best code.

We can see this if we profile (here using -p ETW on windows) and look at the kind of code that is being measured by benchmark dot net:

99.56%   1.761E+08   OSR      [r110831]Benchmark`5[Test,Test,Test,System.__Canon,System.__Canon].RunClassWhereTIsOneInterface()
00.16%   2.9E+05     native   clrjit.dll
00.14%   2.5E+05     native   ntoskrnl.exe
00.08%   1.4E+05     native   coreclr.dll

So your test is measuring the OSR (on-stack replacement) version of RunClassWhereTIsOneInterface, which is probably not what you want to measure (unless your actual app looks exactly like your benchmark).

For what it's worth I don't see any regression locally either:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
  Job-GWNGIV : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
  Job-JOJGMJ : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
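
| Method                       | Runtime  | Mean    | Error    | StdDev   | Ratio | RatioSD |
|----------------------------- |--------- |--------:|---------:|---------:|------:|--------:|
| RunClassWhereTIsOneInterface | .NET 8.0 | 1.243 s | 0.0243 s | 0.0238 s |  1.00 |    0.03 |
| RunClassWhereTIsOneInterface | .NET 9.0 | 1.221 s | 0.0191 s | 0.0178 s |  0.98 |    0.02 |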

Here we see each operation takes about 1.24ns (for .NET 8)

and the .NET 9 OSR inner loop is

G_M000_IG04:                ;; offset=0x003F
       mov      rcx, gword ptr [rbx+0x08]
       cmp      qword ptr [rcx], rdi
       jne      SHORT G_M000_IG07
       inc      dword ptr [rcx+0x08]
 
G_M000_IG05:                ;; offset=0x004B
       inc      rsi
       cmp      rsi, 0xC350
       jl       SHORT G_M000_IG04

If you instead measure 50000 operations, BDN runs your benchmark method 8192 times; this gives tiered compilation a chance to fully engage

// AfterActualRun
WorkloadResult   1: 8192 op, 499922100.00 ns, 61.0256 us/op

profile shows you're measuring fully optimized (Tier1) code:

99.69%   1.449E+08   Tier-1   [r110831]Benchmark`5[Test,Test,Test,System.__Canon,System.__Canon].RunClassWhereTIsOneInterface()
00.11%   1.6E+05     native   clrjit.dll
00.10%   1.4E+05     native   ntoskrnl.exe
00.06%   9E+04       native   coreclr.dll

and each operation is still around 1.2ns, though perhaps a tiny bit faster than the above:

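| Method                       | Runtime  | Mean     | Error    | StdDev   | Ratio | RatioSD |
|----------------------------- |--------- |---------:|---------:|---------:|------:|--------:|
| RunClassWhereTIsOneInterface | .NET 8.0 | 60.67 us | 0.931 us | 0.825 us |  1.00 |    0.02 |
| RunClassWhereTIsOneInterface | .NET 9.0 | 58.09 us | 0.249 us | 0.221 us |  0.96 |    0.01 |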
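
The per-operation figures quoted here are just the reported mean divided by the loop's iteration count. A small sketch of that arithmetic, using the .NET 8 means from the two tables in this comment (PerCallMath is a hypothetical helper, not part of BenchmarkDotNet):

```csharp
using System;

public static class PerCallMath
{
    // Converts the mean time of one benchmark invocation into time per loop iteration.
    public static double NanosecondsPerCall(double meanNanoseconds, long loopIterations) =>
        meanNanoseconds / loopIterations;

    public static void Main()
    {
        // 1.243 s mean for the 1_000_000_000-iteration loop (measured as OSR code).
        Console.WriteLine(NanosecondsPerCall(1.243e9, 1_000_000_000));

        // 60.67 us mean for the 50_000-iteration loop (measured as Tier-1 code).
        Console.WriteLine(NanosecondsPerCall(60.67e3, 50_000));
    }
}
```

Both results land near 1.2 ns per call, which is why the two runs are described as roughly equivalent.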

and the .net 9 inner loop is

G_M000_IG03:                ;; offset=0x0019
       mov      rcx, gword ptr [rbx+0x08]
       cmp      qword ptr [rcx], rsi
       jne      SHORT G_M000_IG06
       inc      dword ptr [rcx+0x08]
 
G_M000_IG04:                ;; offset=0x0025
       dec      rdi
       jne      SHORT G_M000_IG03

(note it is slightly more efficient than the OSR version).
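
The compare-and-branch at the top of both inner loops (cmp qword ptr [rcx], ... followed by jne) looks like a guarded devirtualization: check the object's exact type, take an inlined fast path when it matches, and fall back to the interface call otherwise. A rough hand-written C# analogy follows. It is an assumption about what the listing represents, not the JIT's actual transformation, and GuardedCall and the type names are hypothetical:

```csharp
using System;

public interface ICounter
{
    void Run();
}

public sealed class Counter : ICounter
{
    public int Count;

    public void Run() => Count++;
}

public sealed class OtherCounter : ICounter
{
    public int Count;

    public void Run() => Count++;
}

public static class GuardedDevirtualizationSketch
{
    public static void GuardedCall(ICounter target)
    {
        // Guard: compare the exact runtime type (the asm compares the method table pointer).
        if (target.GetType() == typeof(Counter))
        {
            // Fast path: the body of Counter.Run() inlined, like inc dword ptr [rcx+0x08].
            ((Counter)target).Count++;
        }
        else
        {
            // Slow path: ordinary interface dispatch (the jne target).
            target.Run();
        }
    }

    public static void Main()
    {
        var expected = new Counter();
        var unexpected = new OtherCounter();
        for (long idx = 0; idx < 50_000; ++idx)
        {
            GuardedCall(expected);
            GuardedCall(unexpected);
        }
        Console.WriteLine($"{expected.Count} {unexpected.Count}");
    }
}
```

Both paths produce the same result; the guard only decides whether the call can be replaced by the inlined increment.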

@Qibbi
Author

Qibbi commented Dec 21, 2024

@AndyAyersMS So if I understood you correctly, I should change my benchmark to:

public interface ITest
{
    int Count { get; }
    void Run();
}

public class TestClass : ITest
{
    public int Count { get; private set; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run()
    {
        Count++;
    }
}

public class Benchmark<T> where T : ITest, new()
{
    private readonly T Test = new();

    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 50_000; ++idx)
        {
            Test.Run();
        }
    }
}

BenchmarkRunner.Run<Benchmark<TestClass>>();

which I did, but doing so this time gave me a benchmark measurement of:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 AOT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                       | Mean     | Error    | StdDev   |
|----------------------------- |---------:|---------:|---------:|
| RunClassWhereTIsOneInterface | 64.91 us | 0.296 us | 0.263 us |

vs

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 AOT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                       | Mean     | Error    | StdDev   |
|----------------------------- |---------:|---------:|---------:|
| RunClassWhereTIsOneInterface | 75.50 us | 0.361 us | 0.338 us |

I ran both benchmarks twice, and this time I do get 8192 op.

@MichalPetryka
Contributor

Can you add [DisassemblyDiagnoser(maxDepth: 5)] and post the codegens for both runtimes?

@Qibbi
Author

Qibbi commented Dec 21, 2024

@MichalPetryka

.NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; ConsoleApp1.Benchmark`1[[System.__Canon, System.Private.CoreLib]].RunClassWhereTIsOneInterface()
       push      rsi
       push      rbx
       sub       rsp,28
       mov       rbx,rcx
       xor       esi,esi
M00_L00:
       mov       rcx,[rbx+8]
       mov       r11,7FFC662F04D0
       call      qword ptr [r11]
       inc       rsi
       cmp       rsi,0C350
       jl        short M00_L00
       add       rsp,28
       pop       rbx
       pop       rsi
       ret
; Total bytes of code 47

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; ConsoleApp1.Benchmark`1[[System.__Canon, System.Private.CoreLib]].RunClassWhereTIsOneInterface()
       push      rsi
       push      rbx
       sub       rsp,28
       mov       rbx,rcx
       mov       esi,0C350
M00_L00:
       mov       rcx,[rbx+8]
       mov       r11,7FFC65CE0508
       call      qword ptr [r11]
       dec       rsi
       jne       short M00_L00
       add       rsp,28
       pop       rbx
       pop       rsi
       ret
; Total bytes of code 43

So I guess the loop as .NET 8 translates it (count up and compare) simply runs faster on my machines (Zen 3+/Zen 4) than the count-down loop that .NET 9 emits.
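
For reference, the two loop shapes in the listings correspond roughly to the following C# forms. This is only an illustration of the shapes (LoopShapes and Counter are hypothetical names); the JIT is free to rewrite either form, so writing one or the other does not force a particular codegen:

```csharp
using System;

public sealed class Counter
{
    public int Count { get; private set; }

    public void Run() => Count++;
}

public static class LoopShapes
{
    // Shape of the .NET 8 listing: count up, then compare against the limit (inc / cmp / jl).
    public static void CountUp(Counter target, long iterations)
    {
        for (long idx = 0; idx < iterations; ++idx)
        {
            target.Run();
        }
    }

    // Shape of the .NET 9 listing: count down until zero (dec / jne).
    // Assumes iterations >= 0; a negative value would not terminate in reasonable time.
    public static void CountDown(Counter target, long iterations)
    {
        for (long remaining = iterations; remaining != 0; --remaining)
        {
            target.Run();
        }
    }

    public static void Main()
    {
        var up = new Counter();
        var down = new Counter();
        CountUp(up, 50_000);
        CountDown(down, 50_000);
        Console.WriteLine($"{up.Count} {down.Count}");
    }
}
```

Both forms call Run() the same number of times; they differ only in how the loop counter is updated and tested.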

@EgorBo
Member

EgorBo commented Dec 21, 2024

@EgorBot -windows_intel -linux_amd --envvars DOTNET_JitDisasm:RunClassWhereTIsOneInterface

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

public interface ITest
{
    int Count { get; }
    void Run();
}

public class TestClass : ITest
{
    public int Count { get; private set; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run()
    {
        Count++;
    }
}

public class Benchmark<T> where T : ITest, new()
{
    private readonly T Test = new();

    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 50_000; ++idx)
            Test.Run();
    }
}

@EgorBo
Member

EgorBo commented Dec 21, 2024

The bot detected some difference here.

Codegen diff: https://www.diffchecker.com/zR29bxc4/ (perhaps @jakobbotsch can spot the difference, since it's mostly changes around the loop).

@Qibbi
Author

Qibbi commented Dec 21, 2024

@EgorBo is it possible to see the number for Intel CPU as well?

@EgorBo
Member

EgorBo commented Dec 21, 2024

> @EgorBo is it possible to see the number for Intel CPU as well?

Intel is here EgorBot/runtime-utils#215 (comment)
