Weird performance numbers for Generics with class constraint and .net9 regression #110831
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Can you please post the full benchmark source code? |
Benchmark.cs:
Test.cs:
Program.cs:
|
An addendum:
(Side note: does the attribute on the interface actually do anything, and if not, what would be the correct way?) However, the same behavior can be observed as before. I switched machines while doing so, so here is just the 8.0 test without modifications on the other machine as a baseline:
and here are the benchmarks with NoInlining:
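Regarding the side note about the attribute on the interface: a minimal sketch, assuming the attribute in question is [MethodImpl(MethodImplOptions.NoInlining)]. As far as I know, the flag is recorded per method in metadata and is not inherited by implementing types, so it would need to go on the implementing method. The type names IMarked, Unmarked, Marked, and ImplFlags are hypothetical and only serve this illustration.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public interface IMarked
{
    // Hypothetical placement: NoInlining on the interface declaration only.
    [MethodImpl(MethodImplOptions.NoInlining)]
    void Run();
}

public class Unmarked : IMarked
{
    public int Count { get; private set; }

    // No attribute here, so this method does not carry the NoInlining flag.
    public void Run() => Count++;
}

public class Marked : IMarked
{
    public int Count { get; private set; }

    // The attribute sits on the method that actually gets compiled and called.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run() => Count++;
}

public static class ImplFlags
{
    // Reports whether the NoInlining flag is recorded on the type's own Run method.
    public static bool HasNoInlining(Type type) =>
        (type.GetMethod("Run")!.MethodImplementationFlags & MethodImplAttributes.NoInlining) != 0;
}
```

With these definitions, ImplFlags.HasNoInlining(typeof(Unmarked)) is false and ImplFlags.HasNoInlining(typeof(Marked)) is true, which suggests the attribute on the interface alone does not affect the implementation.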
|
When your benchmarks contain long-running loops there's a chance BDN is not always able to get to the most optimized code. So the results can be somewhat misleading. BDN's approach works best if the benchmark itself is fairly short and BDN is the one controlling the measurement strategy. What happens if you reduce the iteration count in those loops to something like 5000? If you want BDN to do the per-invocation math for you, also add something like
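A minimal sketch of one way to do that, assuming the suggestion refers to BenchmarkDotNet's OperationsPerInvoke property on the [Benchmark] attribute. The Iterations constant and the RunViaInterface method name are made up for illustration; ITest and TestClass mirror the types used elsewhere in this thread.

```csharp
using BenchmarkDotNet.Attributes;

public interface ITest
{
    int Count { get; }
    void Run();
}

public class TestClass : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}

public class Benchmark
{
    // Hypothetical constant; 5000 is the iteration count suggested above.
    private const int Iterations = 5000;

    private readonly ITest Test1 = new TestClass();

    // OperationsPerInvoke tells BenchmarkDotNet that one call of this method
    // performs Iterations operations, so reported times are divided by that number.
    [Benchmark(OperationsPerInvoke = Iterations)]
    public void RunViaInterface()
    {
        for (int idx = 0; idx < Iterations; ++idx)
            Test1.Run();
    }
}
```

The class would be run through BenchmarkRunner.Run<Benchmark>(args: args) as in the other snippets in this thread; the reported mean is then a per-call figure rather than a per-loop figure.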
|
If I try to do that all values are ~0.0001ns or By just setting the loop to 5000 iterations I get proportionally smaller values. 5000 is already pretty small so a lot of the duration is taken by the loop itself.
There is another weirdness going on, I also tested it on my Intel desktop machine and got these times:
.net 9 however again lowers the performance to the lower of the two:
|
@EgorBot -amd --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Benchmark>(args: args);
public class Benchmark
{
    private readonly ITest Test1 = new Test();
    [Benchmark]
    public void RunStructViaInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test1.Run();
    }
}
public interface ITest
{
    int Count { get; }
    void Run();
}
public struct Test : ITest
{
    public int Count { get; private set; }
    public void Run()
    {
        Count++;
    }
}
|
the bot doesn't repro it EgorBot/runtime-utils#211 (comment) (on all platforms) |
@EgorBo That's not the part which is weird, it's with
vs
Adding |
@Qibbi I am not sure I understand, the title implies that .net 9.0 regressed something, can you point me directly to just one [Benchmark] that became slower in 9.0 compared to 8.0 ? |
.net 8
vs
as you can see
|
@EgorBot -windows_intel -linux_amd --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Benchmark<Test, Test, Test, TestClass, TestClass>>(args: args);
public class Benchmark<T, U, V, W, X>
    where T : ITest, new()
    where U : struct, ITest
    where V : unmanaged, ITest
    where W : ITest, new()
    where X : class, ITest, new()
{
    private readonly ITest Test1 = new Test();
    private readonly U Test3;
    private readonly V Test4;
    private readonly Test Test5;
    private readonly W Test7 = new();
    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 1_000_000_000; ++idx)
            Test7.Run();
    }
}
public interface ITest
{
    int Count { get; }
    void Run();
}
public struct Test : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}
public class TestClass : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}
|
@Qibbi still no difference between net8.0 and net9.0 for |
@EgorBo good point, I didn't before. Doing so actually does result in both being the same with ~432ms on my system.
There seems to be some weirdness going on still. |
I'll just warn again that measuring long-running loops like this might not give you the measurement you want. E.g., from EgorBot's logs
that "1 op" means BenchmarkDotNet is timing one invocation of your method and repeating it about 30 times. Typically a method needs to be invoked at least 60 times to get to the best code. We can see this if we profile (here using
So your test is measuring the OSR (on-stack replacement) version of
For what it's worth, I don't see any regression locally either:
Here we see each operation takes about 1.24ns (for .NET 8) and the .NET 9 OSR inner loop is
G_M000_IG04: ;; offset=0x003F
       mov rcx, gword ptr [rbx+0x08]
       cmp qword ptr [rcx], rdi
       jne SHORT G_M000_IG07
       inc dword ptr [rcx+0x08]
G_M000_IG05: ;; offset=0x004B
       inc rsi
       cmp rsi, 0xC350
       jl SHORT G_M000_IG04
If you instead measure 50000 operations, BDN runs your benchmark method 8192 times; this gives tiered compilation a chance to fully engage
profile shows you're measuring fully optimized (Tier1) code:
and each operation is still around 1.2ns, though perhaps a tiny bit faster than the above:
and the .NET 9 inner loop is
G_M000_IG03: ;; offset=0x0019
       mov rcx, gword ptr [rbx+0x08]
       cmp qword ptr [rcx], rsi
       jne SHORT G_M000_IG06
       inc dword ptr [rcx+0x08]
G_M000_IG04: ;; offset=0x0025
       dec rdi
       jne SHORT G_M000_IG03
(note it is slightly more efficient than the OSR version). |
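For readers less used to the disassembly: both inner loops above load the object, compare a pointer stored at its start against an expected value, and on a match increment the field directly instead of making an interface call. A rough source-level analogue is sketched below. This is my reading of the listing rather than something stated in the thread, and GuardedCallSketch and RunLoop are hypothetical names.

```csharp
using System;

public interface ITest
{
    int Count { get; }
    void Run();
}

public class TestClass : ITest
{
    public int Count { get; private set; }
    public void Run() => Count++;
}

public static class GuardedCallSketch
{
    // Hypothetical helper: mimics the shape of the JIT-generated loop body in C#.
    public static void RunLoop(ITest test, long iterations)
    {
        for (long idx = 0; idx < iterations; ++idx)
        {
            if (test.GetType() == typeof(TestClass))
            {
                // Fast path: the exact type is known, so the call can be made
                // directly (and inlined down to a single increment by the JIT).
                ((TestClass)test).Run();
            }
            else
            {
                // Fallback: an ordinary interface call for any other implementation.
                test.Run();
            }
        }
    }
}
```

The JIT performs this transformation on its own; writing the check by hand is only meant to make the cmp/jne/inc sequence easier to read.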
@AndyAyersMS So if I understood you correctly, I should change my benchmark to:
which I did, but doing so this time gave me a benchmark measurement of:
vs
Ran both benchmarks twice, and for this I do get 8192 op |
Can you add |
.NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
; ConsoleApp1.Benchmark`1[[System.__Canon, System.Private.CoreLib]].RunClassWhereTIsOneInterface()
push rsi
push rbx
sub rsp,28
mov rbx,rcx
xor esi,esi
M00_L00:
mov rcx,[rbx+8]
mov r11,7FFC662F04D0
call qword ptr [r11]
inc rsi
cmp rsi,0C350
jl short M00_L00
add rsp,28
pop rbx
pop rsi
ret
; Total bytes of code 47
.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
; ConsoleApp1.Benchmark`1[[System.__Canon, System.Private.CoreLib]].RunClassWhereTIsOneInterface()
push rsi
push rbx
sub rsp,28
mov rbx,rcx
mov esi,0C350
M00_L00:
mov rcx,[rbx+8]
mov r11,7FFC65CE0508
call qword ptr [r11]
dec rsi
jne short M00_L00
add rsp,28
pop rbx
pop rsi
ret
; Total bytes of code 43
I guess from this that the difference simply comes down to how the loop is translated into opcodes, and how that performs on my machines (Zen3+/Zen4). |
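The only difference between the two listings above is the loop control: .NET 8 counts up and compares against 0C350 (50000) each iteration (inc/cmp/jl), while .NET 9 starts at 50000 and counts down to zero (dec/jne). In source terms the two shapes look roughly like the sketch below. LoopShapes, CountUp, and CountDown are hypothetical names, and the JIT applies this change by itself; it is only valid because the loop body never reads the index.

```csharp
using System;

public static class LoopShapes
{
    // Counts up and compares against the limit on every iteration
    // (the shape of the .NET 8 loop: inc / cmp / jl).
    public static void CountUp(long iterations, Action body)
    {
        for (long idx = 0; idx < iterations; ++idx)
            body();
    }

    // Counts down to zero, so the decrement itself decides the branch
    // (the shape of the .NET 9 loop: dec / jne).
    public static void CountDown(long iterations, Action body)
    {
        for (long remaining = iterations; remaining > 0; --remaining)
            body();
    }
}
```

Both methods invoke body the same number of times; which shape runs faster on a given CPU is exactly what the measurements in this thread are trying to establish.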
@EgorBot -windows_intel -linux_amd --envvars DOTNET_JitDisasm:RunClassWhereTIsOneInterface
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
public interface ITest
{
    int Count { get; }
    void Run();
}
public class TestClass : ITest
{
    public int Count { get; private set; }
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Run()
    {
        Count++;
    }
}
public class Benchmark<T> where T : ITest, new()
{
    private readonly T Test = new();
    [Benchmark]
    public void RunClassWhereTIsOneInterface()
    {
        for (long idx = 0; idx < 50_000; ++idx)
            Test.Run();
    }
}
|
The bot detected some difference here. Codegen diff: https://www.diffchecker.com/zR29bxc4/ (perhaps @jakobbotsch can notice a difference, since it's mostly changes around the loop). |
@EgorBo is it possible to see the numbers for an Intel CPU as well? |
Intel is here EgorBot/runtime-utils#215 (comment) |
Description
I made some benchmarks (using BenchmarkDotNet) to see how function calls behave with some combinations of interfaces and structs/classes.
With that I found a weird performance penalty in .NET 8 when using the class generic constraint. On my machine my benchmark takes 400ms compared to 300ms when I don't have that constraint. That said, there sometimes seem to be outliers where this also takes 400ms. In .NET 9 it takes 400ms reliably, with or without that constraint.
My setup is simple, so I hoped to minimize any other reasons.
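To put the 300ms and 400ms figures in perspective, here is a small sketch that converts a total time into time per loop iteration. It assumes the loop count of 1_000_000_000 used in the snippets elsewhere in this thread; PerIterationMath and NanosecondsPerIteration are hypothetical names.

```csharp
using System;

public static class PerIterationMath
{
    // Converts a total time in milliseconds into nanoseconds per loop iteration
    // (1 ms = 1,000,000 ns).
    public static double NanosecondsPerIteration(double totalMilliseconds, long iterations) =>
        totalMilliseconds * 1_000_000.0 / iterations;
}
```

Under that assumption, NanosecondsPerIteration(300, 1_000_000_000) comes to about 0.3 and NanosecondsPerIteration(400, 1_000_000_000) to about 0.4, so the gap under discussion is roughly 0.1ns per iteration. At that scale a single extra instruction in the loop overhead can plausibly account for the whole difference.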
Interface/Struct/Class:
(I could have used an int Count; member instead of the property, but that gets jitted away anyway and doesn't affect any numbers.)
Then I set up several members in a Benchmark class:
My benchmark methods all look like this:
Configuration/Data