port SpanHelpers.IndexOf(ref char, char, int) to Vector128/256 #73368

Merged
merged 1 commit into from
Aug 5, 2022


adamsitnik
Member

@adamsitnik adamsitnik commented Aug 4, 2022

x64

Update: The performance has not regressed, we can observe gains for both AMD and Intel, but I am afraid that they are caused by code alignment changes.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-JFJYAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-ONLITS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | \PR\corerun.exe | 512 | 9.984 ns | 0.74 |
| IndexOfValue | \baseline\corerun.exe | 512 | 13.563 ns | 1.00 |

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), X64 RyuJIT AVX2
  Job-AKDWNL : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LOZHUC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | \PR\corerun.exe | 512 | 12.30 ns | 0.92 |
| IndexOfValue | \baseline\corerun.exe | 512 | 13.34 ns | 1.00 |

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-KVCIYU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-QAYATX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | \PR\corerun.exe | 512 | 15.58 ns | 0.95 |
| IndexOfValue | \baseline\corerun.exe | 512 | 16.38 ns | 1.00 |

arm64

Perf on par.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
  Job-BEMTXA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MKFBYT : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | /PR/corerun | 512 | 43.13 ns | 1.00 |
| IndexOfValue | /main/corerun | 512 | 43.10 ns | 1.00 |

contributes to #64451

@ghost

ghost commented Aug 4, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

x64

25% improvement for AVX2 and 5% for AVX.

Author: adamsitnik
Assignees: -
Labels:

area-System.Memory, tenet-performance

Milestone: -

@stephentoub
Member

stephentoub commented Aug 4, 2022

25% improvement for AVX2 and 5% for AVX.

I love this, but what's the reasoning for why that's the case? Is it that the unsafe loads are more efficient? Something else?

@@ -575,13 +575,14 @@ public static unsafe int IndexOf(ref char searchSpace, char value, int length)
  lengthToExamine--;
  }

- // We get past SequentialScan only if IsHardwareAccelerated or intrinsic .IsSupported is true. However, we still have the redundant check to allow
+ // We get past SequentialScan only if IsHardwareAccelerated is true. However, we still have the redundant check to allow
  // the JIT to see that the code is unreachable and eliminate it when the platform does not have hardware accelerated.
Member

btw, last time I checked it wasn't needed for JIT - it's smart enough 🙂
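A minimal sketch of the guard pattern discussed in this comment, assuming the JIT treats `IsHardwareAccelerated` as a compile-time constant so that the branches which cannot be taken are dropped. `GuardSketch` and `DescribePath` are hypothetical names used only for illustration; they are not code from this PR.

```csharp
using System.Runtime.Intrinsics;

static class GuardSketch
{
    // Hypothetical helper: reports which path a vectorized routine would take on this machine.
    public static string DescribePath()
    {
        // Widest vector first, then the narrower one, then the scalar fallback.
        if (Vector256.IsHardwareAccelerated)
        {
            return "Vector256";
        }

        if (Vector128.IsHardwareAccelerated)
        {
            return "Vector128";
        }

        return "scalar";
    }
}
```

Whether an extra explicit check is still needed for dead-code elimination is exactly the open point raised above; the sketch only shows the shape of the guard.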

@adamsitnik
Member Author

I love this, but what's the reasoning for why that's the case? Is it that the unsafe loads are more efficient? Something else?

I tried to answer this question using AMD uProf (the machine where I observed such a huge gain is AMD), but the latest version of uProf fails to disassemble the code and provide profile information for me (the previous version did not work either, and this release was announced as fixing that, so I am upset).

So I've switched to my old Intel box, re-ran the benchmarks nine times, and observed only an 8% gain:

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), X64 RyuJIT AVX2
  Job-AKDWNL : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LOZHUC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | \PR\corerun.exe | 512 | 12.30 ns | 0.92 |
| IndexOfValue | \baseline\corerun.exe | 512 | 13.34 ns | 1.00 |
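For reference, the Ratio column is approximately the PR mean divided by the baseline mean, which is where the 8% figure comes from. A small sketch of that arithmetic; `RatioSketch` is a hypothetical helper and not part of BenchmarkDotNet.

```csharp
static class RatioSketch
{
    // Ratio of the PR mean to the baseline mean (values below 1.0 mean the PR is faster).
    public static double Ratio(double prMeanNs, double baselineMeanNs)
        => prMeanNs / baselineMeanNs;

    // The same comparison expressed as a percentage gain over the baseline.
    public static double GainPercent(double prMeanNs, double baselineMeanNs)
        => (1.0 - Ratio(prMeanNs, baselineMeanNs)) * 100.0;
}
```

With the Intel numbers above, `RatioSketch.Ratio(12.30, 13.34)` is roughly 0.92 and `RatioSketch.GainPercent(12.30, 13.34)` is roughly 7.8.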

I took a look at VTune:

[image: VTune screenshot]

It looks like adding ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace); has changed the role of the registers (nr 1 on the pic) and the code alignment, which resulted in removing the artificial code alignment (nr 2 on the pic)? Please take this explanation with a grain of salt, as I am not experienced with assembly analysis.
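To make the reinterpretation mentioned above concrete, here is a minimal sketch of what `Unsafe.As<char, ushort>` does: it produces a `ref ushort` that points at the same memory as the `ref char`, without copying or converting anything. `ReinterpretDemo` and `ReadAsUInt16` are hypothetical names, and the bounds check is an addition for safety that the PR's internal helper does not necessarily have.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class ReinterpretDemo
{
    // Reads the element at `index` through a ushort view of the same memory.
    public static ushort ReadAsUInt16(ReadOnlySpan<char> text, int index)
    {
        if ((uint)index >= (uint)text.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }

        ref char searchSpace = ref MemoryMarshal.GetReference(text);

        // Same address, different static type: no copy and no conversion happens here.
        ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace);

        return Unsafe.Add(ref ushortSearchSpace, index);
    }
}
```

Calling `ReinterpretDemo.ReadAsUInt16("abc", 1)` returns the UTF-16 code unit of `'b'`.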


  // Same method as above
- int matches = Sse2.MoveMask(Sse2.CompareEqual(values, search).AsByte());
- if (matches == 0)
+ Vector128<ushort> compareResult = Vector128.Equals(values, search);
Member

Same question as on the byte PR.

What is the cost of this approach vs doing it every loop for Vector256?

Is it better, particularly for large inputs, to do this there as well?

Member Author

@tannergooding it is slightly worse (4%) compared to the PR:

- uint matches = Vector256.Equals(values, search).AsByte().ExtractMostSignificantBits();
- if (matches == 0)
+ Vector256<ushort> compareResult = Vector256.Equals(values, search);
+ if (compareResult == Vector256<ushort>.Zero)
  {
      // Zero flags set so no matches
      offset += Vector256<ushort>.Count;
      lengthToExamine -= Vector128<ushort>.Count;
      continue;
  }

+ uint matches = compareResult.AsByte().ExtractMostSignificantBits();

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
  Job-ZEKBPR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-XEENUE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-FOODNB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfValue | \question\corerun.exe | 512 | 13.09 ns | 0.99 |
| IndexOfValue | \main\corerun.exe | 512 | 13.18 ns | 1.00 |
| IndexOfValue | \PR\corerun.exe | 512 | 12.48 ns | 0.95 |

[image: vtune]
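For readers following the two variants compared in this thread, here is a self-contained sketch of the "compare first, extract the mask only on a hit" shape using the cross-platform `Vector128` API. It is a simplified illustration, not the code merged in this PR: `IndexOfSketch` is a hypothetical name, only the 128-bit path is shown, and the tail is handled with a plain scalar loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class IndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<char> span, char value)
    {
        // View the chars as ushorts so they can be loaded into Vector128<ushort>.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(span);
        int i = 0;

        if (Vector128.IsHardwareAccelerated && data.Length >= Vector128<ushort>.Count)
        {
            Vector128<ushort> values = Vector128.Create((ushort)value);
            ref ushort start = ref MemoryMarshal.GetReference(data);
            int lastVectorStart = data.Length - Vector128<ushort>.Count;

            for (; i <= lastVectorStart; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> search = Vector128.LoadUnsafe(ref start, (nuint)i);
                Vector128<ushort> compareResult = Vector128.Equals(values, search);
                if (compareResult == Vector128<ushort>.Zero)
                {
                    // No element matched in this block.
                    continue;
                }

                // Each matching ushort sets two byte-level bits, hence the division by sizeof(char).
                uint matches = compareResult.AsByte().ExtractMostSignificantBits();
                return i + BitOperations.TrailingZeroCount(matches) / sizeof(char);
            }
        }

        // Scalar loop for short inputs and for the remaining tail.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                return i;
            }
        }

        return -1;
    }
}
```

The alternative asked about in the review would call `ExtractMostSignificantBits()` on every iteration and test the resulting mask against zero instead of comparing the vector with `Zero`; the benchmark above is the measured difference between the two on one Intel machine.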

@tannergooding
Member

It looks like adding ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace); has changed the role of the registers (nr 1 on the pic) and the code alignment, which resulted in removing the artificial code alignment (nr 2 on the pic)? Please take this explanation with a grain of salt, as I am not experienced with assembly analysis.

That does look to be the case. We are getting an 8-byte aligned address in "after" and an unaligned address in "before". It's actually interesting that we have the nop inserted but still end up with a misaligned address. That seems like a bug. CC. @kunalspathak

@kunalspathak
Member

It's actually interesting that we have the nop inserted but still end up with a misaligned address. That seems like a bug. CC. @kunalspathak

@adamsitnik - Could you share the before/after disassembly (you can copy it from vTune)? But from a preliminary look, it seems that the nops after Block 20 are added to align the loop that starts at Block 23 at 0x7ffe1fef0630. Block 21 is not even a loop.

If you can share the JitDump of before/after, I can find out why we decided not to align the loop that starts at Block 22 in the "after" case.
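As a side note on reading addresses like the one quoted above, a power-of-two alignment check is a single mask operation. The sketch below is only an illustration; `AlignmentSketch` is a hypothetical helper, and which boundary the JIT targets for this loop is not stated in the thread.

```csharp
using System;

static class AlignmentSketch
{
    // True when `address` is a multiple of `boundary`; `boundary` must be a power of two.
    public static bool IsAligned(ulong address, ulong boundary)
    {
        if (boundary == 0 || (boundary & (boundary - 1)) != 0)
        {
            throw new ArgumentException("Boundary must be a power of two.", nameof(boundary));
        }

        // For a power-of-two boundary, the low bits must all be zero.
        return (address & (boundary - 1)) == 0;
    }
}
```

For the address 0x7ffe1fef0630, `AlignmentSketch.IsAligned(0x7ffe1fef0630UL, 16)` is true and `AlignmentSketch.IsAligned(0x7ffe1fef0630UL, 32)` is false.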

@adamsitnik
Member Author

Could you share the before/after disassembly (you can copy it from vTune)?

@kunalspathak Sure! I exported it to CSV and uploaded here.

@adamsitnik adamsitnik merged commit 66c93ca into dotnet:main Aug 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 4, 2022