port SpanHelpers.IndexOf(ref char, char, int) to Vector128/256 #73368

adamsitnik · 2022-08-04T12:43:46Z

x64

Update: Performance has not regressed. We can observe gains on both AMD and Intel, but I am afraid they are caused by code alignment changes.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-JFJYAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-ONLITS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	\PR\corerun.exe	512	9.984 ns	0.74
IndexOfValue	\baseline\corerun.exe	512	13.563 ns	1.00

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), X64 RyuJIT AVX2
  Job-AKDWNL : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LOZHUC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	\PR\corerun.exe	512	12.30 ns	0.92
IndexOfValue	\baseline\corerun.exe	512	13.34 ns	1.00

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-KVCIYU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-QAYATX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	\PR\corerun.exe	512	15.58 ns	0.95
IndexOfValue	\baseline\corerun.exe	512	16.38 ns	1.00

arm64

Perf on par.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
  Job-BEMTXA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MKFBYT : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	/PR/corerun	512	43.13 ns	1.00
IndexOfValue	/main/corerun	512	43.10 ns	1.00

contributes to #64451

stephentoub · 2022-08-04T12:54:36Z

> 25% improvement for AVX2 and 5% for AVX.

I love this, but what's the reasoning for why that's the case? Is it that the unsafe loads are more efficient? Something else?

EgorBo · 2022-08-04T12:54:45Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

@@ -575,13 +575,14 @@ public static unsafe int IndexOf(ref char searchSpace, char value, int length)
                lengthToExamine--;
            }

-            // We get past SequentialScan only if IsHardwareAccelerated or intrinsic .IsSupported is true. However, we still have the redundant check to allow
+            // We get past SequentialScan only if IsHardwareAccelerated is true. However, we still have the redundant check to allow
            // the JIT to see that the code is unreachable and eliminate it when the platform does not have hardware accelerated.


btw, last time I checked it wasn't needed for JIT - it's smart enough 🙂

adamsitnik · 2022-08-04T15:06:23Z

> I love this, but what's the reasoning for why that's the case? Is it that the unsafe loads are more efficient? Something else?

I've tried to find the answer to this question by using AMD uProf (the machine where I observed such a huge gain is AMD), but the latest version of uProf fails to disassemble the code and provide the profile information for me (the previous version did not work either, but this one was announced as a fix, so I am upset).

So I've switched to my old Intel box, re-run the benchmarks nine times, and observed only an 8% gain:

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22403.8
  [Host]     : .NET 7.0.0 (7.0.22.40210), X64 RyuJIT AVX2
  Job-AKDWNL : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LOZHUC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	\PR\corerun.exe	512	12.30 ns	0.92
IndexOfValue	\baseline\corerun.exe	512	13.34 ns	1.00

I took a look at VTune:

It looks like adding `ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace);` has changed the role of the registers (nr 1 on the pic) and the code alignment, which resulted in removing the artificial code alignment (nr 2 on the pic)? Please take this explanation with a grain of salt, as I am not experienced with assembly analysis.
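For readers following along, here is a minimal sketch of the kind of cross-platform loop this port is about. It is not the code from this PR: the class name `IndexOfSketch`, the method shape, and the simple "vector loop plus scalar tail" structure are assumptions made for illustration, and it assumes .NET 7 or later. It shows the `char` to `ushort` reinterpretation mentioned above (needed because `char` is not a supported `Vector128<T>` element type) followed by `Vector128.Equals` and `ExtractMostSignificantBits`.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration only; not the SpanHelpers implementation.
static class IndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<char> span, char value)
    {
        ref char searchSpace = ref MemoryMarshal.GetReference(span);
        // Reinterpret the char reference as ushort so it can be loaded into a vector.
        ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace);

        nuint offset = 0;
        nuint length = (nuint)span.Length;

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<ushort>.Count)
        {
            Vector128<ushort> values = Vector128.Create((ushort)value);
            nuint lastVectorStart = length - (nuint)Vector128<ushort>.Count;

            while (offset <= lastVectorStart)
            {
                Vector128<ushort> search = Vector128.LoadUnsafe(ref ushortSearchSpace, offset);
                uint matches = Vector128.Equals(values, search).AsByte().ExtractMostSignificantBits();
                if (matches != 0)
                {
                    // Each matching char sets two mask bits (one per byte), so halve the bit index.
                    return (int)offset + BitOperations.TrailingZeroCount(matches) / 2;
                }
                offset += (nuint)Vector128<ushort>.Count;
            }
        }

        // Scalar loop for short inputs and for the remaining tail.
        for (; offset < length; offset++)
        {
            if (Unsafe.Add(ref searchSpace, offset) == value)
            {
                return (int)offset;
            }
        }

        return -1;
    }
}
```

The sketch omits the alignment handling and the `Vector256` path of the real method, so it says nothing about the timings reported in this thread.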

tannergooding · 2022-08-04T16:27:31Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs


                            // Same method as above
-                            int matches = Sse2.MoveMask(Sse2.CompareEqual(values, search).AsByte());
-                            if (matches == 0)
+                            Vector128<ushort> compareResult = Vector128.Equals(values, search);


Same question as on the byte PR.

What is the cost of this approach vs doing it every loop for Vector256?

Is it better, particularly for large inputs, to do this there as well?

adamsitnik · 2022-08-05

@tannergooding it is slightly worse compared to the PR (4%):

-                            uint matches = Vector256.Equals(values, search).AsByte().ExtractMostSignificantBits();
-                            if (matches == 0)
+                            Vector256<ushort> compareResult = Vector256.Equals(values, search);
+                            if (compareResult == Vector256<ushort>.Zero)
                             {
                                 // Zero flags set so no matches
                                 offset += Vector256<ushort>.Count;
                                 lengthToExamine -= Vector128<ushort>.Count;
                                 continue;
                             }
+                            uint matches = compareResult.AsByte().ExtractMostSignificantBits();

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
  Job-ZEKBPR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-XEENUE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-FOODNB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9

Method	Toolchain	Size	Mean	Ratio
IndexOfValue	\question\corerun.exe	512	13.09 ns	0.99
IndexOfValue	\main\corerun.exe	512	13.18 ns	1.00
IndexOfValue	\PR\corerun.exe	512	12.48 ns	0.95
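To make the two options in this exchange concrete, the sketch below isolates them as two small functions. The class and method names are hypothetical and the surrounding loop is omitted; it assumes .NET 7 or later. The first variant extracts the bit mask on every iteration and tests it for zero; the second compares the vector result against `Vector256<ushort>.Zero` first and extracts the mask only when there is a match. Both return the index of the first matching element, or -1.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical helpers for illustration; names are not from the PR.
static class MatchCheckVariants
{
    // Variant 1: extract the mask every iteration, then test it.
    public static int ViaMask(Vector256<ushort> values, Vector256<ushort> search)
    {
        uint matches = Vector256.Equals(values, search).AsByte().ExtractMostSignificantBits();
        if (matches == 0)
        {
            return -1;
        }
        // Two mask bits per ushort element, so halve the bit index.
        return BitOperations.TrailingZeroCount(matches) / 2;
    }

    // Variant 2: test the compare result against Zero, extract the mask only on a hit.
    public static int ViaZeroCheck(Vector256<ushort> values, Vector256<ushort> search)
    {
        Vector256<ushort> compareResult = Vector256.Equals(values, search);
        if (compareResult == Vector256<ushort>.Zero)
        {
            return -1;
        }
        uint matches = compareResult.AsByte().ExtractMostSignificantBits();
        return BitOperations.TrailingZeroCount(matches) / 2;
    }
}
```

The two variants are functionally equivalent; which one is faster depends on the generated code and the hardware, which is why the thread settles it with a benchmark rather than by inspection.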

tannergooding · 2022-08-04T16:31:09Z

> It looks like adding `ref ushort ushortSearchSpace = ref Unsafe.As<char, ushort>(ref searchSpace);` has changed the role of the registers (nr 1 on the pic) and the code alignment, which resulted in removing the artificial code alignment (nr 2 on the pic)? Please take this explanation with a grain of salt, as I am not experienced with assembly analysis.

That does look to be the case. We are getting an 8-byte aligned address in the "after" case and an unaligned address in the "before" case. It's actually interesting that we have the nop inserted but still end up with a misaligned address. That seems like a bug. CC. @kunalspathak

kunalspathak · 2022-08-04T17:26:42Z

> It's actually interesting that we have the nop inserted but still end up with a misaligned address. That seems like a bug. CC. @kunalspathak

@adamsitnik - Could you share the before/after disassembly (you can copy it from VTune)? From a preliminary look, it seems that the nops after Block 20 are added to align the loop that starts at Block 23 at 0x7ffe1fef0630. Block 21 is not even a loop.

If you can share the JitDump of before/after, I can find out why we decided not to align the loop that starts at Block 22 in the "after" case.

adamsitnik · 2022-08-05T12:21:32Z

> Could you share the before/after disassembly (you can copy it from vTune)?

@kunalspathak Sure! I exported it to CSV and uploaded here.

port SpanHelpers.IndexOf(ref char, char, int) (commit 3821373)

adamsitnik added the area-System.Memory and tenet-performance labels Aug 4, 2022

adamsitnik requested review from EgorBo, stephentoub and tannergooding August 4, 2022 12:43

ghost assigned adamsitnik Aug 4, 2022

adamsitnik mentioned this pull request Aug 4, 2022 in "Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256" #64451

EgorBo reviewed Aug 4, 2022

filipnavara reviewed Aug 4, 2022

tannergooding reviewed Aug 4, 2022

stephentoub approved these changes Aug 5, 2022

adamsitnik merged commit 66c93ca into dotnet:main Aug 5, 2022

ghost locked as resolved and limited conversation to collaborators Sep 4, 2022
