-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Span.SequenceEqual for small buffers. #32364
Conversation
This is very good :) |
@@ -1312,28 +1312,32 @@ public static unsafe int LastIndexOfAny(ref byte searchSpace, byte value0, byte | |||
[MethodImpl(MethodImplOptions.AggressiveOptimization)] | |||
public static unsafe bool SequenceEqual(ref byte first, ref byte second, nuint length) | |||
{ | |||
if (Unsafe.AreSame(ref first, ref second)) | |||
goto Equal; | |||
|
|||
IntPtr offset = (IntPtr)0; // Use IntPtr for arithmetic to avoid unnecessary 64->32->64 truncations | |||
IntPtr lengthToExamine = (IntPtr)(void*)length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like lengthToExamine is cast to a byte* most places it's used. Should it just be stored as one in the first place? Same for offset. (Looking at diff on my phone so maybe I'm just not seeing the reason.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these should be fixed to use nuint/nint. It will be easier once Roslyn adds native support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picking this up in a follow up PR (for this method)
} | ||
offset += Vector<byte>.Count; | ||
return LoadVector(ref first, lengthToExamine) == LoadVector(ref second, lengthToExamine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do our perf tests look at comparison lengths larger than one vector and between vector lengths, e.g. 257 if the vector size is 256? I'm curious if/how alignment affects this final comparison, which would seem to generally be unaligned in such a case. Not an issue? I ask simply because in other implementations I've seen us go out of our way to try to align such operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do our perf tests look at comparison lengths larger than one vector and between vector lengths, e.g. 257 if the vector size is 256?
No, our perf tests aren't very extensive for some of the APIs. I was told to not bloat the number of test permutations too much. Right now we only test length 512.
We can certainly do one-offs locally though.
Feel free to add more here:
https://github.com/dotnet/performance/blob/1930f660f56f80b0cad5bfc749fe4c46464801f4/src/benchmarks/micro/libraries/System.Memory/Span.cs#L14-L48
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Locally for me, Vector<byte>.Count = 32
.
Here are the results for what's in master atm:
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015914
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
Job-BCFXLD : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 MaxIterationCount=10 MinIterationCount=5
WarmupCount=3
Method | Length | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
SequenceEqual | 32 | 2.813 ns | 0.0692 ns | 0.0180 ns | 2.817 ns | 2.782 ns | 2.826 ns | - | - | - | - |
SequenceEqual | 33 | 3.439 ns | 0.1479 ns | 0.0880 ns | 3.415 ns | 3.341 ns | 3.556 ns | - | - | - | - |
SequenceEqual | 34 | 4.010 ns | 0.0969 ns | 0.0252 ns | 4.001 ns | 3.993 ns | 4.055 ns | - | - | - | - |
SequenceEqual | 63 | 3.401 ns | 0.0959 ns | 0.0502 ns | 3.414 ns | 3.286 ns | 3.443 ns | - | - | - | - |
SequenceEqual | 64 | 3.300 ns | 0.0775 ns | 0.0201 ns | 3.305 ns | 3.278 ns | 3.327 ns | - | - | - | - |
SequenceEqual | 65 | 4.118 ns | 0.1121 ns | 0.0586 ns | 4.122 ns | 3.993 ns | 4.201 ns | - | - | - | - |
SequenceEqual | 256 | 7.424 ns | 0.2525 ns | 0.1503 ns | 7.441 ns | 7.211 ns | 7.705 ns | - | - | - | - |
SequenceEqual | 257 | 7.622 ns | 0.2085 ns | 0.1379 ns | 7.673 ns | 7.433 ns | 7.848 ns | - | - | - | - |
Helps #32363
summary:
better: 6, geomean: 1.168
worse: 1, geomean: 1.216
total diff: 7
cc @jkotas, @benaadams, @GrabYourPitchforks