-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
port SpanHelpers.IndexOf(ref char, char, int) to Vector128/256 #73368
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue Detailsx6425% improvement for AVX2 and 5% for AVX. BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-JFJYAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-ONLITS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-KVCIYU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-QAYATX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
arm64Perf on par. BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
[Host] : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
Job-BEMTXA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-MKFBYT : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
contributes to #64451
|
I love this, but what's the reasoning for why that's the case? Is it that the unsafe loads are more efficient? Something else? |
@@ -575,13 +575,14 @@ public static unsafe int IndexOf(ref char searchSpace, char value, int length) | |||
lengthToExamine--; | |||
} | |||
|
|||
// We get past SequentialScan only if IsHardwareAccelerated or intrinsic .IsSupported is true. However, we still have the redundant check to allow | |||
// We get past SequentialScan only if IsHardwareAccelerated is true. However, we still have the redundant check to allow | |||
// the JIT to see that the code is unreachable and eliminate it when the platform does not have hardware accelerated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, last time I checked it wasn't needed for JIT - it's smart enough 🙂
I've tried to find the answer to this question by using AMD uProf (the machine where I've observed such huge gain is AMD), but latest version of uProf fails to disassemble the code and provide the profile information for me (previous also did not work, but this was announced as a fix so I am upset). So I've switch to my old Intel box and re-run the benchmarks nine times and observed only a 8% gain: BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22403.8
[Host] : .NET 7.0.0 (7.0.22.40210), X64 RyuJIT AVX2
Job-AKDWNL : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-LOZHUC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=9
I took a look at VTune: It looks like adding |
|
||
// Same method as above | ||
int matches = Sse2.MoveMask(Sse2.CompareEqual(values, search).AsByte()); | ||
if (matches == 0) | ||
Vector128<ushort> compareResult = Vector128.Equals(values, search); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as on the byte
PR.
What is the cost of this approach vs doing it every loop for Vector256
Is it better, particularly for large inputs, to do this there as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding is slightly worse compared to the PR (4%):
- uint matches = Vector256.Equals(values, search).AsByte().ExtractMostSignificantBits();
- if (matches == 0)
+ Vector256<ushort> compareResult = Vector256.Equals(values, search);
+ if (compareResult == Vector256<ushort>.Zero)
{
// Zero flags set so no matches
offset += Vector256<ushort>.Count;
lengthToExamine -= Vector128<ushort>.Count;
continue;
}
+ uint matches = compareResult.AsByte().ExtractMostSignificantBits();
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
[Host] : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
Job-ZEKBPR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-XEENUE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-FOODNB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=9
Method | Toolchain | Size | Mean | Ratio |
---|---|---|---|---|
IndexOfValue | \question\corerun.exe | 512 | 13.09 ns | 0.99 |
IndexOfValue | \main\corerun.exe | 512 | 13.18 ns | 1.00 |
IndexOfValue | \PR\corerun.exe | 512 | 12.48 ns | 0.95 |
That does look to be the case. We are getting an |
@adamsitnik - Could you share the before/after disassembly (you can copy it from vTune)? But from preliminary look it seems that If you can share the JitDump of before/after, I can find out to see why we decided to not align the loop that starts at |
@kunalspathak Sure! I exported it to CSV and uploaded here. |
x64
Update: The performance has not regressed, we can observe gains for both AMD and Intel, but I am afraid that they are caused by code alignment changes.
arm64
Perf on par.
contributes to #64451