Performance regression on .NET 9.0 when creating AVX constants inside loops via System.Numerics.Vector #110125

Closed
Chicken-Bones opened this issue Nov 25, 2024 · 4 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

Comments

@Chicken-Bones
Contributor

Description

There is a significant performance regression in the .NET 9 JIT with System.Numerics.Vector when Vector constants are created inline inside a loop. Run the following benchmark to reproduce. According to Disasmo, the issue occurs at FullOpts regardless of whether tiered compilation or PGO is enabled.

using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using System.Numerics;
using System.Runtime.CompilerServices;

[SimpleJob(RuntimeMoniker.Net481)]
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net90)]
public class VectorStackSpill
{
	public IEnumerable<object> Args() // for single argument it's an IEnumerable of objects (object)
	{
		yield return new int[10000];
	}

	[Benchmark]
	[ArgumentsSource(nameof(Args))]
	public void NumericsVectorDecrement(int[] array)
	{
		if ((array.Length & (Vector<int>.Count - 1)) != 0)
			throw new ArgumentOutOfRangeException("Not a multiple of vector length");

		ref var arrStart = ref array[0];
		for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
			ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
			p -= new Vector<int>(1);
		}
	}

	[Benchmark]
	[ArgumentsSource(nameof(Args))]
	public void NumericsVectorDecrementConstantExtracted(int[] array)
	{
		if ((array.Length & (Vector<int>.Count - 1)) != 0)
			throw new ArgumentOutOfRangeException("Not a multiple of vector length");

		var one = new Vector<int>(1);

		ref var arrStart = ref array[0];
		for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
			ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
			p -= one;
		}
	}
}
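
For inspecting the codegen outside BenchmarkDotNet, a minimal console sketch of the inline-constant loop may be convenient. The class and method names here are hypothetical and not part of the original report. `MethodImplOptions.AggressiveOptimization` asks the runtime to skip tiering and compile the method fully optimized straight away, which matches the FullOpts condition described above; `NoInlining` just keeps the method as a separate body so it is easy to find in a disassembly.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class FullOptsRepro
{
    // Hypothetical stand-alone copy of the NumericsVectorDecrement loop body.
    // AggressiveOptimization bypasses tiered compilation for this method.
    [MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)]
    public static void Decrement(int[] array)
    {
        if ((array.Length & (Vector<int>.Count - 1)) != 0)
            throw new ArgumentException("Length must be a multiple of Vector<int>.Count.", nameof(array));
        if (array.Length == 0)
            return;

        ref var arrStart = ref array[0];
        for (var i = 0; i <= array.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
            // The vector constant is created inline, inside the loop.
            p -= new Vector<int>(1);
        }
    }

    public static void Main()
    {
        // Sized from Vector<int>.Count so the sketch works for any hardware vector width.
        var data = new int[Vector<int>.Count * 4];
        Decrement(data);
        Console.WriteLine(Array.TrueForAll(data, x => x == -1));
    }
}
```

This only checks that the loop computes the expected values; it says nothing about timing, which is what the benchmark above measures.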

Regression?

This is a regression from .NET 8.0

Data

| Method                                   | Job                  | Runtime              | array        | Mean       | Error    | StdDev   |
|----------------------------------------- |--------------------- |--------------------- |------------- |-----------:|---------:|---------:|
| NumericsVectorDecrement                  | .NET 8.0             | .NET 8.0             | Int32[10000] |   612.6 ns | 12.21 ns | 15.87 ns |
| NumericsVectorDecrementConstantExtracted | .NET 8.0             | .NET 8.0             | Int32[10000] |   560.8 ns | 11.04 ns | 17.51 ns |
| NumericsVectorDecrement                  | .NET 9.0             | .NET 9.0             | Int32[10000] | 7,866.9 ns | 49.79 ns | 46.58 ns |
| NumericsVectorDecrementConstantExtracted | .NET 9.0             | .NET 9.0             | Int32[10000] |   566.6 ns |  6.10 ns |  5.41 ns |
| NumericsVectorDecrement                  | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] |   692.3 ns | 13.85 ns | 20.30 ns |
| NumericsVectorDecrementConstantExtracted | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] |   533.6 ns |  4.27 ns |  3.57 ns |
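
To put the table in relative terms, the small sketch below divides the mean timings by one another. The numbers are copied from the rows above; the class and method names are hypothetical. On these figures the inline-constant loop on .NET 9 comes out at roughly 12.8 times its .NET 8 mean and roughly 13.9 times the .NET 9 extracted-constant mean.

```csharp
using System;

public static class SlowdownSketch
{
    // Ratio of two mean timings (both in nanoseconds).
    public static double Ratio(double slowerNs, double fasterNs) => slowerNs / fasterNs;

    public static void Main()
    {
        // Means taken from the benchmark table above.
        const double net8Inline = 612.6;
        const double net9Inline = 7866.9;
        const double net9Extracted = 566.6;

        Console.WriteLine($".NET 9 inline vs .NET 8 inline:    {Ratio(net9Inline, net8Inline):F1}x");
        Console.WriteLine($".NET 9 inline vs .NET 9 extracted: {Ratio(net9Inline, net9Extracted):F1}x");
    }
}
```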

Analysis

Looking at the x64 disassembly in Disasmo reveals the issue.

On .NET 8 the constant is loaded into ymm0 from reloc @RWD00 outside the loop:

       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       align    [5 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0030
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vmovups  ymm1, ymmword ptr [r8]
       vpsubd   ymm1, ymm1, ymm0
       vmovups  ymmword ptr [r8], ymm1
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03

On .NET 9 the constant is created on the stack inside the loop:

       align    [0 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0024
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rsp+0x20], ymm0
       vmovups  ymm0, ymmword ptr [r8]
       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x24], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x2C], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x34], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x3C], 1
       vpsubd   ymm0, ymm0, ymmword ptr [rsp+0x20]
       vmovups  ymmword ptr [r8], ymm0
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03

Manually moving the constant outside the loop works around the issue on .NET 9. The constant is still built on the stack, but only once, before the loop:

       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x24], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x2C], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x34], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x3C], 1
...
       align    [2 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0070
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vmovups  ymm0, ymmword ptr [r8]
       vpsubd   ymm0, ymm0, ymmword ptr [rsp+0x20]
       vmovups  ymmword ptr [r8], ymm0
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03
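
The same hoisting workaround can also be written without `Unsafe`, as sketched below. This is an illustration under assumptions, not code from the report: the names are hypothetical, and whether the JIT emits the same loop body for the span-based form has not been verified here. The point is only that the `new Vector<int>(1)` call sits before the loop, as in NumericsVectorDecrementConstantExtracted.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class HoistedConstantSketch
{
    // Hypothetical helper mirroring NumericsVectorDecrementConstantExtracted.
    public static void DecrementAll(int[] array)
    {
        if (array.Length % Vector<int>.Count != 0)
            throw new ArgumentException("Length must be a multiple of Vector<int>.Count.", nameof(array));

        // Build the broadcast constant once, before the loop.
        var one = new Vector<int>(1);

        // Reinterpret the int[] as a span of Vector<int> without copying.
        Span<Vector<int>> vectors = MemoryMarshal.Cast<int, Vector<int>>(array.AsSpan());
        for (int i = 0; i < vectors.Length; i++)
            vectors[i] -= one;
    }

    public static void Main()
    {
        var data = new int[Vector<int>.Count * 4];
        DecrementAll(data);
        Console.WriteLine(Array.TrueForAll(data, x => x == -1));
    }
}
```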
@Chicken-Bones Chicken-Bones added the tenet-performance Performance related issue label Nov 25, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 25, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 25, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@MichalPetryka
Contributor

Duplicate of #108929?

@Chicken-Bones
Contributor Author

Similar, and may be covered by that issue, but vpbroadcastd is not used on .NET 8 in this example. There may be additional optimisations when the scalar value is a constant, and those should also be checked.

@tannergooding
Member

> Duplicate of #108929?

Yes. It was a general issue with the constructor APIs no longer being treated as intrinsic. It has been resolved and is pending backport via #109322.

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Nov 25, 2024