try to port ASCIIUtility.WidenAsciiToUtf16 to x-plat intrinsics #73055
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-encoding

Issue Details

x64: There was a major regression, but I was able to solve it by enforcing the inlining of Vector128.Widen(Vector128<byte>).

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-IVMYPN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-BJNCCX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

ARM64: There is a major perf regression: 40-50%. I know that it's not caused by the [...]. I am trying to update my Surface Pro X to Win 11, which will allow me to install VS 2022 (ARM64), build dotnet/runtime, and try the new internal MS profiler. But I can't promise anything.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22378.8
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD
  Job-VUTCOY : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-LKXRPH : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

contributes to #64451
cc @tannergooding @GrabYourPitchforks
@@ -3618,6 +3618,7 @@ public static bool TryCopyTo<T>(this Vector128<T> vector, Span<T> destination)
 /// <param name="source">The vector whose elements are to be widened.</param>
 /// <returns>A pair of vectors that contain the widened lower and upper halves of <paramref name="source" />.</returns>
 [CLSCompliant(false)]
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
We should add this to all the Widen APIs and to the same on Vector64.
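For illustration, here is a hypothetical user-level wrapper (not part of the PR; the names InliningSketch and WidenBytes are made up) showing the attribute being discussed. Without AggressiveInlining, a tiny wrapper like this may not be inlined into the caller's hot loop, which is the kind of regression the PR description mentions.

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class InliningSketch
{
    // Hypothetical helper: force-inline the widening call into the caller.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static (Vector128<ushort> Lower, Vector128<ushort> Upper) WidenBytes(Vector128<byte> source)
        => Vector128.Widen(source);
}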
Can the JIT use named intrinsics and treat all methods in Vector###<T> as candidates for aggressive inlining?
@tannergooding I've synced my fork with upstream and verified that it's not needed anymore. So we don't need to backport anything to 7.0
I added [...]. I replaced [...]. I updated the results posted above.
// Can we at least widen the first part of the vector?
if (!containsNonAsciiBytes)
I've removed this code, as it was impossible to satisfy this condition; the jump to this label was performed only when the flag was set to true:
runtime/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs, lines 1823 to 1827 in 85d638b:
if (containsNonAsciiBytes)
{
    // non-ASCII byte somewhere
    goto NonAsciiDataSeenInInnerLoop;
}
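To make the reasoning concrete, here is a minimal, self-contained sketch (illustrative only, not the real ASCIIUtility method): the only goto that targets NonAsciiDataSeenInInnerLoop is guarded by "if (containsNonAsciiBytes)", so a subsequent "if (!containsNonAsciiBytes)" after the label can never be taken.

using System;

internal static class DeadBranchSketch
{
    private static void Demo(bool containsNonAsciiBytes)
    {
        if (containsNonAsciiBytes)
        {
            // non-ASCII byte somewhere
            goto NonAsciiDataSeenInInnerLoop;
        }

        return; // every other path out of the (simplified) loop leaves the flag false

    NonAsciiDataSeenInInnerLoop:
        if (!containsNonAsciiBytes)
        {
            // "Can we at least widen the first part of the vector?" -- unreachable here,
            // because the only jump to this label requires the flag to be true.
            Console.WriteLine("never printed");
        }
    }

    private static void Main()
    {
        Demo(containsNonAsciiBytes: true);
        Demo(containsNonAsciiBytes: false);
    }
}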
// Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment |
Based on the benchmarking that I've done, this was not improving perf in a noticeable way, but it was increasing the code complexity. I've removed it and added a Vector256 code path (less code = less code to duplicate).
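For context, a minimal sketch of the simplified Vector256 loop shape being described (illustrative only; the names WidenSketch, WidenAsciiToUtf16, pAsciiBuffer, pOutputBuffer and elementCount are mine, not the PR's exact code):

using System.Runtime.Intrinsics;

internal static unsafe class WidenSketch
{
    // Widen 32 ASCII bytes to 32 UTF-16 chars per iteration, bailing out to the caller's
    // other handling as soon as a non-ASCII byte is seen.
    internal static nuint WidenAsciiToUtf16(byte* pAsciiBuffer, char* pOutputBuffer, nuint elementCount)
    {
        nuint currentOffset = 0;

        if (Vector256.IsHardwareAccelerated && elementCount >= (nuint)Vector256<byte>.Count)
        {
            ushort* pCurrentWriteAddress = (ushort*)pOutputBuffer;
            nuint finalOffsetWhereCanRunLoop = elementCount - (nuint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.Load(pAsciiBuffer + currentOffset);

                if (asciiVector.ExtractMostSignificantBits() != 0)
                {
                    break; // non-ASCII byte somewhere in this vector
                }

                (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
                lower.Store(pCurrentWriteAddress);
                upper.Store(pCurrentWriteAddress + Vector256<ushort>.Count);

                currentOffset += (nuint)Vector256<byte>.Count;
                pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);
        }

        return currentOffset; // number of ASCII bytes successfully widened
    }
}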
Edit: it turned out to be an R2R bug: #74253.

OK, I have no idea what is going on and I need an extra pair of eyes. The tests are failing with [...]. The callstack is quite long, but it shows that the failure starts from [...].
This method has 3 debug asserts:
runtime/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs, lines 1753 to 1757 in 20a2f23
And the only place where it's called from has exactly the same guards:
runtime/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs, lines 1581 to 1584 in 20a2f23
So the asserts mentioned above should definitely not fail. I was able to reproduce the failure locally. It's gone when I remove those 3 asserts! What am I missing?
HTTP test crash. Worth cracking the dump to check it's not related to this change?
@adamsitnik please don't just comment that assert out. The generated code will not reliably behave correctly; you need to follow the instructions I put in #74253.
@davidwrighton thank you! BTW, based on the perf results I decided to simply inline these two helpers.
Updated perf numbers:

x64 AVX2: There is a 10-30% boost for large inputs, caused by adding the Vector256 code path. Small inputs also work faster, partially because [...].

BenchmarkDotNet=v0.13.1.20220823-develop, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-SJSFRM : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-QAHUTQ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=9  MemoryRandomization=True

ARM64 AdvSIMD: There is a small (4-8%) gain for all test cases.

BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22622.575)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22422.7
  [Host]     : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
  Job-AHYKGY : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-THHHUV : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
LaunchCount=9  MemoryRandomization=True
{
    Vector256<byte> asciiVector = Vector256.Load(pAsciiBuffer + currentOffset);

    if (asciiVector.ExtractMostSignificantBits() != 0)
https://github.com/dotnet/runtime/pull/73055/files?diff=split&w=1#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1378 suggests that Sse41.TestZ is faster than ExtractMostSignificantBits for 128 bits. Does the same not hold for Avx.TestZ for 256 bits?
It does not:
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace VectorBenchmarks
{
    internal class Program
    {
        static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    public unsafe class ContainsNonAscii
    {
        private const int Size = 1024;
        private byte* _bytes;

        [GlobalSetup]
        public void Setup()
        {
            _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
            new Span<byte>(_bytes, Size).Clear();
        }

        [GlobalCleanup]
        public void Free() => NativeMemory.AlignedFree(_bytes);

        [Benchmark]
        public bool ExtractMostSignificantBits()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (asciiVector.ExtractMostSignificantBits() != 0)
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }

        [Benchmark]
        public bool TestZ()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
[Host] : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
Method                     | Mean     | Error    | StdDev   | Code Size
---------------------------|----------|----------|----------|----------
ExtractMostSignificantBits | 11.78 ns | 0.042 ns | 0.040 ns | 57 B
TestZ                      | 14.76 ns | 0.320 ns | 0.416 ns | 68 B
.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
; VectorBenchmarks.ContainsNonAscii.ExtractMostSignificantBits()
vzeroupper
mov rax,[rcx+8]
xor edx,edx
nop dword ptr [rax]
M00_L00:
vmovdqu ymm0,ymmword ptr [rax+rdx]
vpmovmskb ecx,ymm0
test ecx,ecx
jne short M00_L01
add rdx,20
cmp rdx,3E0
jbe short M00_L00
xor eax,eax
vzeroupper
ret
M00_L01:
mov eax,1
vzeroupper
ret
; Total bytes of code 57
.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
; VectorBenchmarks.ContainsNonAscii.TestZ()
vzeroupper
mov rax,[rcx+8]
xor edx,edx
vmovupd ymm0,[7FF9E3D94D60]
nop dword ptr [rax]
nop dword ptr [rax]
M00_L00:
vptest ymm0,ymmword ptr [rax+rdx]
jne short M00_L01
add rdx,20
cmp rdx,3E0
jbe short M00_L00
xor eax,eax
vzeroupper
ret
M00_L01:
mov eax,1
vzeroupper
ret
; Total bytes of code 68
> It does not:

That's not what I see on my machine.
Method                         | Mean
-------------------------------|---------
ExtractMostSignificantBits_128 | 31.77 ns
TestZ_128                      | 25.58 ns
ExtractMostSignificantBits_256 | 15.58 ns
TestZ_256                      | 11.66 ns
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public unsafe partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Size = 1024;
    private byte* _bytes;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Clear();
    }

    [GlobalCleanup]
    public void Free() => NativeMemory.AlignedFree(_bytes);

    [Benchmark]
    public bool ExtractMostSignificantBits_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Sse41.TestZ(asciiVector, Vector128.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool ExtractMostSignificantBits_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }
}
@stephentoub I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions: #67902
> I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions: #67902
That change reverted it from both the 256-bit and 128-bit code paths. This PR uses TestZ for 128-bit. Why is that ok?
I'm questioning the non-symmetrical usage.
There are some helpful comments in these SO links (and related): https://stackoverflow.com/questions/60446759/sse2-test-xmm-bitmask-directly-without-using-pmovmskb
Thanks, but I'm not seeing the answer to my question there. I'll restate:
This PR is adding additional code to prefer using TestZ with Vector128:
https://github.com/dotnet/runtime/pull/73055/files#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1379-R1391
Your #67902 reverted other changes that preferred using TestZ, not just on 256-bit but also on 128-bit vectors.
Does it still make sense for this PR to be adding additional code to use TestZ with Vector128?
(Part of why I'm pushing on this is a goal of avoiding the need to drop down to direct intrinsics as much as possible. I'd hope we can get to a point where the obvious code to write is the best code to write in as many situations as possible.)
Right, I am not saying the current non-symmetrical usage is correct; I'd probably change both to use ExtractMostSignificantBits.
C++ compilers also do different things here, e.g. LLVM folds even direct MoveMask usage to testz: https://godbolt.org/z/MobvxvzGK
What is "better" is going to depend on a few factors.
On x86/x64, ExtractMostSignificantBits
is likely faster because it can always emit exactly as movmsk
and there are some CPUs where TestZ
can be slower, particularly for "small inputs" where the match is sooner. When the match is later, TestZ
typically wins out regardless.
On Arm64, doing the early comparison against == Zero
is likely better because it is a single instruction vs the multi-instruction sequence required to emulate x64's movmsk.
I think the best choice here is to use == Zero
(and therefore TestZ
) as I believe it will, on average, produce the best/most consistent code. The cases where it might be a bit slower will typically be for smaller inputs where we're already returning quickly and the extra couple nanoseconds won't really matter.
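To illustrate the two shapes being compared, here is a minimal sketch (not the exact PR code; the assumption, per the comment above, is that the JIT can lower the "== Zero" form to a test-style instruction on x64 and a simple compare on Arm64):

using System.Runtime.Intrinsics;

internal static class AsciiCheckSketch
{
    // "== Zero" shape: mask off everything but the high bit of each byte and compare
    // the whole vector against zero.
    internal static bool ContainsNonAsciiByte_ZeroCompare(Vector128<byte> vector)
    {
        return (vector & Vector128.Create((byte)0x80)) != Vector128<byte>.Zero;
    }

    // movmsk shape: extract the most significant bit of every byte into a scalar mask.
    internal static bool ContainsNonAsciiByte_MostSignificantBits(Vector128<byte> vector)
    {
        return vector.ExtractMostSignificantBits() != 0;
    }
}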
    currentOffset += (nuint)Vector256<byte>.Count;
    pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
On every iteration of the loop we're bumping currentOffset and then also adding that to pAsciiBuffer. Would it be faster to instead compute the upper bound as an address, just bump the current pointer in the loop, and then after the loop compute the currentOffset if needed based on the ending/starting difference?
It produces better codegen, but the reported time difference is within the range of error (0-0.3 ns gain or loss):
public unsafe class Widen
{
    private const int Size = 1024;
    private byte* _bytes;
    private char* _chars;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Fill((byte)'a');
        _chars = (char*)NativeMemory.AlignedAlloc(new UIntPtr(Size * sizeof(char)), new UIntPtr(32));
    }

    [GlobalCleanup]
    public void Free()
    {
        NativeMemory.AlignedFree(_bytes);
        NativeMemory.AlignedFree(_chars);
    }

    [Benchmark]
    public void Current()
    {
        ref byte searchSpace = ref *_bytes;
        ushort* pCurrentWriteAddress = (ushort*)_chars;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.Load(_bytes + currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.Store(pCurrentWriteAddress);
            upper.Store(pCurrentWriteAddress + Vector256<ushort>.Count);

            currentOffset += (nuint)Vector256<byte>.Count;
            pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);
    }

    [Benchmark]
    public void Suggested()
    {
        ref byte currentSearchSpace = ref *_bytes;
        ref ushort currentWriteAddress = ref Unsafe.As<char, ushort>(ref *_chars);
        ref byte oneVectorAwayFromEnd = ref Unsafe.Add(ref currentSearchSpace, Size - Vector256<byte>.Count);

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref currentSearchSpace);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.StoreUnsafe(ref currentWriteAddress);
            upper.StoreUnsafe(ref currentWriteAddress, (nuint)Vector256<ushort>.Count);

            currentSearchSpace = ref Unsafe.Add(ref currentSearchSpace, Vector256<byte>.Count);
            currentWriteAddress = ref Unsafe.Add(ref currentWriteAddress, Vector256<byte>.Count);
        } while (!Unsafe.IsAddressGreaterThan(ref currentSearchSpace, ref oneVectorAwayFromEnd));
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
[Host] : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
Method    | Mean     | Error    | StdDev   | Code Size
----------|----------|----------|----------|----------
Current   | 44.29 ns | 0.171 ns | 0.152 ns | 81 B
Suggested | 44.54 ns | 0.042 ns | 0.032 ns | 77 B
.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
; VectorBenchmarks.Widen.Current()
vzeroupper
mov eax,[rcx+8]
mov rax,[rcx+10]
xor edx,edx
M00_L00:
mov r8,[rcx+8]
vmovdqu ymm0,ymmword ptr [r8+rdx]
vpmovmskb r8d,ymm0
test r8d,r8d
jne short M00_L01
vmovaps ymm1,ymm0
vpmovzxbw ymm1,xmm1
vextractf128 xmm0,ymm0,1
vpmovzxbw ymm0,xmm0
vmovdqu ymmword ptr [rax],ymm1
vmovdqu ymmword ptr [rax+20],ymm0
add rdx,20
add rax,40
cmp rdx,3E0
jbe short M00_L00
M00_L01:
vzeroupper
ret
; Total bytes of code 81
.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
; VectorBenchmarks.Widen.Suggested()
vzeroupper
mov rax,[rcx+8]
mov rdx,[rcx+10]
lea rcx,[rax+3E0]
M00_L00:
vmovdqu ymm0,ymmword ptr [rax]
vpmovmskb r8d,ymm0
test r8d,r8d
jne short M00_L01
vmovaps ymm1,ymm0
vpmovzxbw ymm1,xmm1
vextractf128 xmm0,ymm0,1
vpmovzxbw ymm0,xmm0
vmovdqu ymmword ptr [rdx],ymm1
vmovdqu ymmword ptr [rdx+20],ymm0
add rax,20
add rdx,40
cmp rax,rcx
jbe short M00_L00
M00_L01:
vzeroupper
ret
; Total bytes of code 77
If you don't mind, I am going to merge it as it is and apply your suggestion in my next PR.
    break;
}

// Vector128.Widen is not used here as it is less performant on ARM64
Do we know why? Naively I'd expect the JIT to be able to produce the same code for both.
I am sorry, but I don't. It's not that I did not try to find out; it's the arm64 tooling that makes it hard for me.
If this is by design, ok. But if it's something we can/should be fixing in the JIT, I want to make sure we're not sweeping such issues under the rug. Ideally the obvious code is also the best performing code.
I'd need a comparison between the two code paths to see where the difference is.
I would expect these to be identical except for a case where the original code was making some assumption (based on knowing the inputs were restricted to a subset of all possible values) and therefore skipping an otherwise "required" step that would be necessary to ensure deterministic results for "any input".
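For reference, a minimal sketch of the WidenLower/WidenUpper shape that the PR description says recovered most of the Arm64 regression (illustrative names; not the exact PR code):

using System.Runtime.Intrinsics;

internal static unsafe class Arm64WidenSketch
{
    // Widen one 16-byte ASCII vector into two 8-element ushort vectors using the
    // lower/upper halves explicitly, instead of the tuple-returning Vector128.Widen.
    internal static void WidenOneVector(byte* pAsciiBuffer, ushort* pCurrentWriteAddress)
    {
        Vector128<byte> asciiVector = Vector128.Load(pAsciiBuffer);

        Vector128<ushort> lower = Vector128.WidenLower(asciiVector);
        Vector128<ushort> upper = Vector128.WidenUpper(asciiVector);

        lower.Store(pCurrentWriteAddress);
        upper.Store(pCurrentWriteAddress + Vector128<ushort>.Count);
    }
}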
currentOffset += (nuint)Vector128<byte>.Count;
pCurrentWriteAddress += (nuint)Vector128<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
}
}
else if (Vector.IsHardwareAccelerated)
Why is the Vector<T> path still needed?
> Why is the Vector<T> path still needed?

Some Mono variants don't support Vector128 for all configs yet.
> Some Mono variants don't support Vector128 for all configs yet

Which support Vector<T> and not Vector128?
Just the presence of these paths is keeping the methods from being R2R'd, it seems.
Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode or AOT) supports only Vector<>.
> Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode or AOT) supports only Vector<>.

Is that getting fixed?
We now have multiple vectorized implementations that don't have a Vector<T> code path. Why is this one special such that it still needs one?
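A small sketch of the dispatch shape under discussion (assumed structure, not the actual method): the Vector<T> branch only matters for configurations, such as Mono without LLVM, where the fixed-size vector types are not accelerated.

using System.Numerics;
using System.Runtime.Intrinsics;

internal static class DispatchSketch
{
    internal static string DescribeBestPath()
    {
        if (Vector256.IsHardwareAccelerated)
        {
            return "Vector256 path";      // e.g. RyuJIT with AVX2
        }
        else if (Vector128.IsHardwareAccelerated)
        {
            return "Vector128 path";      // RyuJIT (SSE2/AdvSIMD), Mono-LLVM
        }
        else if (Vector.IsHardwareAccelerated)
        {
            return "Vector<T> fallback";  // e.g. Mono without LLVM
        }

        return "scalar path";
    }
}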
EDIT: for updated perf numbers please go to #73055 (comment)

x64
Initially, there was a major regression, but I was able to solve it by enforcing the inlining of Vector128.Widen(Vector128<byte>). After porting everything, the new implementation was on par. With some additional optimizations and the addition of a Vector256 code path, it's now on par or up to 20% faster, depending on the input (the more characters are ASCII, the better).

ARM64
Initially, after just mapping the code, there was a major regression of 40-50%. After I replaced Vector128.Widen with Vector128.WidenLower and Vector128.WidenUpper, I was able to lower the regression to 10-16%. I am trying to update my Surface Pro X to Win 11, which will allow me to install VS 2022 (ARM64), build dotnet/runtime, and try the new internal MS profiler. But I can't promise anything.

contributes to #64451
cc @tannergooding @GrabYourPitchforks