
try to port ASCIIUtility.WidenAsciiToUtf16 to x-plat intrinsics #73055

Merged · 16 commits · Sep 9, 2022

Conversation

adamsitnik (Member) commented Jul 29, 2022

EDIT: for updated perf numbers please go to #73055 (comment)

x64

Initially, there was a major regression, but I was able to solve it by enforcing the inlining of Vector128.Widen(Vector128<byte>). After porting everything, the new implementation was on par:
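For reference, a minimal sketch (assumed shape, not the exact runtime source) of what enforcing the inlining amounts to: the tuple-returning Widen helper is marked with MethodImplOptions.AggressiveInlining so the call does not survive as a real call inside the hot widening loop.

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class WidenInliningSketch
{
    // Illustrative shape only: widen 16 bytes into the lower and upper 8 ushorts.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static (Vector128<ushort> Lower, Vector128<ushort> Upper) Widen(Vector128<byte> source)
        => (Vector128.WidenLower(source), Vector128.WidenUpper(source));
}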

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-IVMYPN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-BJNCCX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Type Method Job size encName Input Mean Ratio
Perf_Encoding GetString PR 16 ascii ? 22.09 ns 0.97
Perf_Encoding GetString base 16 ascii ? 22.72 ns 1.00
Perf_Encoding GetString PR 16 utf-8 ? 20.37 ns 0.99
Perf_Encoding GetString base 16 utf-8 ? 20.55 ns 1.00
Perf_Encoding GetString PR 512 ascii ? 70.05 ns 0.93
Perf_Encoding GetString base 512 ascii ? 75.12 ns 1.00
Perf_Encoding GetString PR 512 utf-8 ? 77.35 ns 1.00
Perf_Encoding GetString base 512 utf-8 ? 78.87 ns 1.00
Perf_Utf8Encoding GetString PR ? ? EnglishAllAscii 20,990.29 ns 0.99
Perf_Utf8Encoding GetString base ? ? EnglishAllAscii 21,203.47 ns 1.00
Perf_Utf8Encoding GetString PR ? ? EnglishMostlyAscii 125,617.25 ns 0.99
Perf_Utf8Encoding GetString base ? ? EnglishMostlyAscii 126,595.44 ns 1.00
Perf_Utf8Encoding GetString PR ? ? Chinese 156,988.65 ns 1.00
Perf_Utf8Encoding GetString base ? ? Chinese 156,257.20 ns 1.00
Perf_Utf8Encoding GetString PR ? ? Cyrillic 155,448.25 ns 1.00
Perf_Utf8Encoding GetString base ? ? Cyrillic 155,961.57 ns 1.00
Perf_Utf8Encoding GetString PR ? ? Greek 244,318.02 ns 1.00
Perf_Utf8Encoding GetString base ? ? Greek 244,570.33 ns 1.00

With some additional optimizations and the addition of a Vector256 code path, it's now on par or up to 20% faster, depending on the input (the more ASCII characters, the better).
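A simplified sketch of what such a Vector256 code path looks like (illustrative shape and names only, not the exact code added in this PR): load 32 ASCII bytes, bail out if any byte has its high bit set, then widen to 32 UTF-16 chars.

using System.Runtime.Intrinsics;

internal static unsafe class Vector256WidenSketch
{
    // Caller is assumed to have verified elementCount >= Vector256<byte>.Count.
    public static nuint WidenAsciiToUtf16(byte* pAsciiBuffer, char* pUtf16Buffer, nuint elementCount)
    {
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = elementCount - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.Load(pAsciiBuffer + currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break; // non-ASCII byte somewhere in this vector
            }

            (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            lower.Store((ushort*)pUtf16Buffer + currentOffset);
            upper.Store((ushort*)pUtf16Buffer + currentOffset + (nuint)Vector256<ushort>.Count);

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return currentOffset; // number of elements widened so far
    }
}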

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  
Method=GetString
Type Job size encName Input Mean Ratio
Perf_Encoding PR 16 ascii ? 22.22 ns 0.99
Perf_Encoding main 16 ascii ? 22.56 ns 1.00
Perf_Encoding PR 16 utf-8 ? 20.19 ns 0.94
Perf_Encoding main 16 utf-8 ? 21.42 ns 1.00
Perf_Encoding PR 512 ascii ? 57.58 ns 0.77
Perf_Encoding main 512 ascii ? 74.38 ns 1.00
Perf_Encoding PR 512 utf-8 ? 67.43 ns 0.79
Perf_Encoding main 512 utf-8 ? 84.91 ns 1.00
Perf_Utf8Encoding PR ? ? EnglishAllAscii 20,610.46 ns 0.98
Perf_Utf8Encoding main ? ? EnglishAllAscii 21,046.16 ns 1.00
Perf_Utf8Encoding PR ? ? EnglishMostlyAscii 129,464.76 ns 1.00
Perf_Utf8Encoding main ? ? EnglishMostlyAscii 128,755.26 ns 1.00
Perf_Utf8Encoding PR ? ? Chinese 157,597.33 ns 1.03
Perf_Utf8Encoding main ? ? Chinese 153,599.10 ns 1.00
Perf_Utf8Encoding PR ? ? Cyrillic 151,221.52 ns 0.98
Perf_Utf8Encoding main ? ? Cyrillic 153,741.42 ns 1.00
Perf_Utf8Encoding PR ? ? Greek 245,325.88 ns 1.01
Perf_Utf8Encoding main ? ? Greek 243,378.47 ns 1.00

ARM64

Initially, after just mapping the code, there was a major regression of 40-50%:

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22378.8
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD
  Job-VUTCOY : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-LKXRPH : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Type Method Toolchain size encName Input Mean Ratio
Perf_Encoding GetString /7.0.0/corerun 16 ascii ? 90.41 ns 1.02
Perf_Encoding GetString /main/corerun 16 ascii ? 88.37 ns 1.00
Perf_Encoding GetString /7.0.0/corerun 16 utf-8 ? 81.14 ns 1.00
Perf_Encoding GetString /main/corerun 16 utf-8 ? 81.43 ns 1.00
Perf_Encoding GetString /7.0.0/corerun 512 ascii ? 376.65 ns 1.55
Perf_Encoding GetString /main/corerun 512 ascii ? 242.93 ns 1.00
Perf_Encoding GetString /7.0.0/corerun 512 utf-8 ? 435.39 ns 1.43
Perf_Encoding GetString /main/corerun 512 utf-8 ? 305.42 ns 1.00
Perf_Utf8Encoding GetString /7.0.0/corerun ? ? EnglishAllAscii 142,873.52 ns 1.39
Perf_Utf8Encoding GetString /main/corerun ? ? EnglishAllAscii 102,949.86 ns 1.00
Perf_Utf8Encoding GetString /7.0.0/corerun ? ? EnglishMostlyAscii 267,682.01 ns 1.04
Perf_Utf8Encoding GetString /main/corerun ? ? EnglishMostlyAscii 256,662.77 ns 1.00
Perf_Utf8Encoding GetString /7.0.0/corerun ? ? Chinese 372,408.90 ns 0.99
Perf_Utf8Encoding GetString /main/corerun ? ? Chinese 376,872.54 ns 1.00
Perf_Utf8Encoding GetString /7.0.0/corerun ? ? Cyrillic 277,097.47 ns 1.00
Perf_Utf8Encoding GetString /main/corerun ? ? Cyrillic 275,724.07 ns 1.00
Perf_Utf8Encoding GetString /7.0.0/corerun ? ? Greek 418,494.63 ns 1.01
Perf_Utf8Encoding GetString /main/corerun ? ? Greek 416,134.26 ns 1.00

After I replaced Vector128.Widen with Vector128.WidenLower and Vector128.WidenUpper, I was able to reduce the regression to 10-16%:
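The change amounts to something like the following (a hedged sketch with illustrative names, not the PR's exact code): the tuple-returning Vector128.Widen was producing worse ARM64 code at the time, so the two halves are widened explicitly.

using System.Runtime.Intrinsics;

internal static unsafe class Arm64FriendlyWidenSketch
{
    public static void WidenOneVector(byte* pAscii, ushort* pUtf16)
    {
        Vector128<byte> asciiVector = Vector128.Load(pAscii);

        // Instead of the tuple-returning form:
        // (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(asciiVector);
        Vector128<ushort> lower = Vector128.WidenLower(asciiVector);
        Vector128<ushort> upper = Vector128.WidenUpper(asciiVector);

        lower.Store(pUtf16);
        upper.Store(pUtf16 + Vector128<ushort>.Count);
    }
}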

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22379.1
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD
  
  Method=GetString
Type Job size encName Input Mean Ratio
Perf_Encoding PR 16 ascii ? 91.36 ns 1.02
Perf_Encoding main 16 ascii ? 89.58 ns 1.00
Perf_Encoding PR 16 utf-8 ? 81.10 ns 0.97
Perf_Encoding main 16 utf-8 ? 83.81 ns 1.00
Perf_Encoding PR 512 ascii ? 282.59 ns 1.16
Perf_Encoding main 512 ascii ? 244.48 ns 1.00
Perf_Encoding PR 512 utf-8 ? 337.99 ns 1.10
Perf_Encoding main 512 utf-8 ? 308.56 ns 1.00
Perf_Utf8Encoding PR ? ? EnglishAllAscii 114,550.43 ns 1.13
Perf_Utf8Encoding main ? ? EnglishAllAscii 101,123.74 ns 1.00
Perf_Utf8Encoding PR ? ? EnglishMostlyAscii 270,465.07 ns 1.06
Perf_Utf8Encoding main ? ? EnglishMostlyAscii 254,863.02 ns 1.00
Perf_Utf8Encoding PR ? ? Chinese 370,954.69 ns 0.97
Perf_Utf8Encoding main ? ? Chinese 382,811.25 ns 1.00
Perf_Utf8Encoding PR ? ? Cyrillic 278,273.22 ns 1.01
Perf_Utf8Encoding main ? ? Cyrillic 275,921.61 ns 1.00
Perf_Utf8Encoding PR ? ? Greek 418,066.91 ns 1.01
Perf_Utf8Encoding main ? ? Greek 415,772.69 ns 1.00

I am trying to update my Surface Pro X to Win 11, which will allow me to install VS 2022 (ARM64), build dotnet/runtime, and try the new internal MS profiler, but I can't promise anything.

contributes to #64451

cc @tannergooding @GrabYourPitchforks

ghost commented Jul 29, 2022

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Author: adamsitnik · Labels: area-System.Text.Encoding · Milestone: -

@@ -3618,6 +3618,7 @@ public static bool TryCopyTo<T>(this Vector128<T> vector, Span<T> destination)
/// <param name="source">The vector whose elements are to be widened.</param>
/// <returns>A pair of vectors that contain the widened lower and upper halves of <paramref name="source" />.</returns>
[CLSCompliant(false)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member:

We should add this to all the Widen APIs and to the same on Vector64.

Member:

Can the JIT use the named-intrinsic mechanism and treat all methods in Vector###<T> as candidates for aggressive inlining?

Member Author:

@tannergooding I've synced my fork with upstream and verified that it's not needed anymore, so we don't need to backport anything to 7.0.

adamsitnik (Member Author):

I added a Vector256 code path and got it up to 20% faster on x64.

I replaced Vector128.Widen with Vector128.WidenLower and Vector128.WidenUpper and got the ARM64 regression down to 10-16% (from 40-50%).

I updated the results posted above.


// Can we at least widen the first part of the vector?

if (!containsNonAsciiBytes)
Member Author:

I've removed this code, as the condition could never be satisfied: the jump to this label was performed only when the flag was set to true:

if (containsNonAsciiBytes)
{
    // non-ASCII byte somewhere
    goto NonAsciiDataSeenInInnerLoop;
}


// Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment
Member Author:

Based on the benchmarking I've done, this was not improving perf in a noticeable way but was increasing code complexity. I've removed it and added the Vector256 code path instead (less code = less code to duplicate).

adamsitnik (Member Author) commented Aug 19, 2022

Edit: it turned out to be an R2R bug: #74253

OK, I have no idea what is going on and I need an extra pair of eyes.

The tests are failing with "Encountered infinite recursion while looking up resource 'Word_At' in System.Private.CoreLib.":

The call stack is quite long, but it shows that the failure starts in the WidenAsciiToUtf16_Vector256 method which I've just added:

   at System.Diagnostics.Debug.Fail(System.String, System.String)
   at System.Text.ASCIIUtility.WidenAsciiToUtf16_Vector256(Byte*, Char*, UIntPtr)
   at System.Text.ASCIIUtility.WidenAsciiToUtf16(Byte*, Char*, UIntPtr)

This method has 3 debug asserts:

private static unsafe nuint WidenAsciiToUtf16_Vector256(byte* pAsciiBuffer, char* pUtf16Buffer, nuint elementCount)
{
    Debug.Assert(Vector256.IsHardwareAccelerated);
    Debug.Assert(BitConverter.IsLittleEndian);
    Debug.Assert(elementCount >= 2 * (uint)Vector256<byte>.Count);

And its only call site has exactly the same guards:

if (BitConverter.IsLittleEndian && Vector256.IsHardwareAccelerated && elementCount >= 2 * (uint)Vector256<byte>.Count)
{
    currentOffset = WidenAsciiToUtf16_Vector256(pAsciiBuffer, pUtf16Buffer, elementCount);
}

So the asserts mentioned above should definitely not fail.

I was able to reproduce the failure locally. It's gone when I remove those 3 asserts! What am I missing?

danmoseley (Member):

HTTP test crash. Worth cracking open the dump to check it's not related to this change?

davidwrighton (Member):

@adamsitnik please, don't just comment that assert out. The generated code will not reliably behave correctly; you need to follow the instructions I put in #74253.

adamsitnik (Member Author):

@adamsitnik please, don't just comment that assert out. The generated code will not reliably behave correctly; you need to follow the instructions I put in #74253.

@davidwrighton thank you!

BTW, based on the perf results, I decided to simply inline these two helpers:

Method Toolchain size encName Mean Ratio
GetString \helpers\corerun.exe 8 ascii 18.53 ns 0.88
GetString \inlined\corerun.exe 8 ascii 17.47 ns 0.83
GetString \main\corerun.exe 8 ascii 21.16 ns 1.00
GetString \helpers\corerun.exe 8 utf-8 16.84 ns 0.80
GetString \inlined\corerun.exe 8 utf-8 15.96 ns 0.76
GetString \main\corerun.exe 8 utf-8 20.96 ns 1.00
GetString \helpers\corerun.exe 16 ascii 24.87 ns 0.83
GetString \inlined\corerun.exe 16 ascii 18.32 ns 0.61
GetString \main\corerun.exe 16 ascii 30.11 ns 1.00
GetString \helpers\corerun.exe 16 utf-8 18.20 ns 0.77
GetString \inlined\corerun.exe 16 utf-8 16.27 ns 0.69
GetString \main\corerun.exe 16 utf-8 23.71 ns 1.00
GetString \helpers\corerun.exe 32 ascii 18.50 ns 0.61
GetString \inlined\corerun.exe 32 ascii 17.07 ns 0.56
GetString \main\corerun.exe 32 ascii 30.56 ns 1.00
GetString \helpers\corerun.exe 32 utf-8 20.95 ns 0.84
GetString \inlined\corerun.exe 32 utf-8 18.00 ns 0.72
GetString \main\corerun.exe 32 utf-8 24.88 ns 1.00
GetString \helpers\corerun.exe 64 ascii 21.41 ns 0.60
GetString \inlined\corerun.exe 64 ascii 20.59 ns 0.58
GetString \main\corerun.exe 64 ascii 35.69 ns 1.00
GetString \helpers\corerun.exe 64 utf-8 30.83 ns 1.03
GetString \inlined\corerun.exe 64 utf-8 28.59 ns 0.96
GetString \main\corerun.exe 64 utf-8 29.89 ns 1.00
GetString \helpers\corerun.exe 512 ascii 55.39 ns 0.72
GetString \inlined\corerun.exe 512 ascii 53.79 ns 0.70
GetString \main\corerun.exe 512 ascii 77.35 ns 1.00
GetString \helpers\corerun.exe 512 utf-8 71.63 ns 0.89
GetString \inlined\corerun.exe 512 utf-8 69.81 ns 0.87
GetString \main\corerun.exe 512 utf-8 80.38 ns 1.00

adamsitnik (Member Author):

Updated perf numbers:

For size==16 we can observe a gain on both x64 and ARM64. It's caused by now executing the vectorized code path (previously the buffer needed to be at least double Vector128<byte>.Count, i.e. 32 bytes).

x64 AVX2

There is a 10-30% boost for large inputs. It's caused by adding the Vector256 code path.

Small inputs also work faster, partially because WidenFourAsciiBytesToUtf16AndWriteToBuffer produces better codegen now. I am afraid some of these gains are caused by code alignment changes (the benchmarks were run with memory randomization enabled, so we can exclude memory alignment from the list).

BenchmarkDotNet=v0.13.1.20220823-develop, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-SJSFRM : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-QAHUTQ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9 MemoryRandomization=True
Method Toolchain size encName Mean Ratio
GetString \main\corerun.exe 8 ascii 24.70 ns 1.00
GetString \pr\corerun.exe 8 ascii 16.86 ns 0.69
GetString \main\corerun.exe 8 utf-8 19.65 ns 1.00
GetString \pr\corerun.exe 8 utf-8 16.48 ns 0.84
GetString \main\corerun.exe 16 ascii 27.84 ns 1.00
GetString \pr\corerun.exe 16 ascii 18.36 ns 0.66
GetString \main\corerun.exe 16 utf-8 21.78 ns 1.00
GetString \pr\corerun.exe 16 utf-8 17.26 ns 0.79
GetString \main\corerun.exe 32 ascii 30.05 ns 1.00
GetString \pr\corerun.exe 32 ascii 17.09 ns 0.57
GetString \main\corerun.exe 32 utf-8 23.42 ns 1.00
GetString \pr\corerun.exe 32 utf-8 17.59 ns 0.75
GetString \main\corerun.exe 64 ascii 33.73 ns 1.00
GetString \pr\corerun.exe 64 ascii 20.59 ns 0.61
GetString \main\corerun.exe 64 utf-8 26.96 ns 1.00
GetString \pr\corerun.exe 64 utf-8 27.51 ns 1.02
GetString \main\corerun.exe 512 ascii 75.27 ns 1.00
GetString \pr\corerun.exe 512 ascii 52.61 ns 0.70
GetString \main\corerun.exe 512 utf-8 76.35 ns 1.00
GetString \pr\corerun.exe 512 utf-8 67.73 ns 0.89

ARM64 AdvSIMD

There is a small (4-8%) gain for all test cases.

BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22622.575)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22422.7
  [Host]     : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
  Job-AHYKGY : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-THHHUV : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

LaunchCount=9 MemoryRandomization=True
Method Toolchain size encName Mean Ratio
GetString \main\corerun.exe 8 ascii 29.61 ns 1.00
GetString \pr\corerun.exe 8 ascii 28.39 ns 0.96
GetString \main\corerun.exe 8 utf-8 29.42 ns 1.00
GetString \pr\corerun.exe 8 utf-8 28.39 ns 0.96
GetString \main\corerun.exe 16 ascii 33.19 ns 1.00
GetString \pr\corerun.exe 16 ascii 30.45 ns 0.92
GetString \main\corerun.exe 16 utf-8 33.15 ns 1.00
GetString \pr\corerun.exe 16 utf-8 29.42 ns 0.89
GetString \main\corerun.exe 32 ascii 39.20 ns 1.00
GetString \pr\corerun.exe 32 ascii 34.30 ns 0.88
GetString \main\corerun.exe 32 utf-8 37.09 ns 1.00
GetString \pr\corerun.exe 32 utf-8 33.67 ns 0.91
GetString \main\corerun.exe 64 ascii 47.49 ns 1.00
GetString \pr\corerun.exe 64 ascii 41.39 ns 0.87
GetString \main\corerun.exe 64 utf-8 50.62 ns 1.00
GetString \pr\corerun.exe 64 utf-8 44.74 ns 0.89
GetString \main\corerun.exe 512 ascii 162.25 ns 1.00
GetString \pr\corerun.exe 512 ascii 148.39 ns 0.92
GetString \main\corerun.exe 512 utf-8 199.57 ns 1.00
GetString \pr\corerun.exe 512 utf-8 182.53 ns 0.92

{
Vector256<byte> asciiVector = Vector256.Load(pAsciiBuffer + currentOffset);

if (asciiVector.ExtractMostSignificantBits() != 0)
stephentoub (Member) commented Sep 8, 2022:

https://github.com/dotnet/runtime/pull/73055/files?diff=split&w=1#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1378 suggests that Sse41.TestZ is faster than ExtractMostSignificantBits for 128 bits. Does the same not hold for Avx.TestZ for 256 bits?

Member Author:

It does not:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Reflection.Metadata;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace VectorBenchmarks
{
    internal class Program
    {
        static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    public unsafe class ContainsNonAscii
    {
        private const int Size = 1024;
        private byte* _bytes;
            
        [GlobalSetup]
        public void Setup()
        {
            _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
            new Span<byte>(_bytes, Size).Clear();
        }

        [GlobalCleanup]
        public void Free() => NativeMemory.AlignedFree(_bytes);

        [Benchmark]
        public bool ExtractMostSignificantBits()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (asciiVector.ExtractMostSignificantBits() != 0)
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }

        [Benchmark]
        public bool TestZ()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
  [Host]     : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
Method Mean Error StdDev Code Size
ExtractMostSignificantBits 11.78 ns 0.042 ns 0.040 ns 57 B
TestZ 14.76 ns 0.320 ns 0.416 ns 68 B

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.ContainsNonAscii.ExtractMostSignificantBits()
       vzeroupper
       mov       rax,[rcx+8]
       xor       edx,edx
       nop       dword ptr [rax]
M00_L00:
       vmovdqu   ymm0,ymmword ptr [rax+rdx]
       vpmovmskb ecx,ymm0
       test      ecx,ecx
       jne       short M00_L01
       add       rdx,20
       cmp       rdx,3E0
       jbe       short M00_L00
       xor       eax,eax
       vzeroupper
       ret
M00_L01:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 57

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.ContainsNonAscii.TestZ()
       vzeroupper
       mov       rax,[rcx+8]
       xor       edx,edx
       vmovupd   ymm0,[7FF9E3D94D60]
       nop       dword ptr [rax]
       nop       dword ptr [rax]
M00_L00:
       vptest    ymm0,ymmword ptr [rax+rdx]
       jne       short M00_L01
       add       rdx,20
       cmp       rdx,3E0
       jbe       short M00_L00
       xor       eax,eax
       vzeroupper
       ret
M00_L01:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 68

Member:

It does not:

That's not what I see on my machine.

Method Mean
ExtractMostSignificantBits_128 31.77 ns
TestZ_128 25.58 ns
ExtractMostSignificantBits_256 15.58 ns
TestZ_256 11.66 ns
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public unsafe partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Size = 1024;
    private byte* _bytes;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Clear();
    }

    [GlobalCleanup]
    public void Free() => NativeMemory.AlignedFree(_bytes);

    [Benchmark]
    public bool ExtractMostSignificantBits_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Sse41.TestZ(asciiVector, Vector128.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool ExtractMostSignificantBits_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }
}

EgorBo (Member) commented Sep 9, 2022:

@stephentoub I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions (#67902).

stephentoub (Member) commented Sep 9, 2022:

I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions (#67902).

That change reverted it from both the 256-bit and 128-bit code paths. This PR uses TestZ for 128-bit. Why is that ok?

I'm questioning the non-symmetrical usage.

Member:

There are some helpful comments in this SO link (and related ones): https://stackoverflow.com/questions/60446759/sse2-test-xmm-bitmask-directly-without-using-pmovmskb

stephentoub (Member) commented Sep 9, 2022:

Thanks, but I'm not seeing the answer there to my question.

I'll restate:
This PR is adding additional code to prefer using TestZ with Vector128:
https://github.com/dotnet/runtime/pull/73055/files#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1379-R1391
Your #67902 reverted other changes that preferred using TestZ, not just on 256-bit but also on 128-bit vectors.
Does it still make sense for this PR to be adding additional code to use TestZ with Vector128?

(Part of why I'm pushing on this is the goal of avoiding the need to drop down to direct intrinsics as much as possible. I'd hope we can get to a point where the obvious code to write is the best code to write in as many situations as possible.)

Member:

Right, I am not saying the current non-symmetrical usage is correct; I'd probably change both to use ExtractMostSignificantBits.

C++ compilers also do different things here, e.g. LLVM folds even direct MoveMask usage to testz: https://godbolt.org/z/MobvxvzGK

tannergooding (Member) commented Sep 9, 2022:

What is "better" is going to depend on a few factors.

On x86/x64, ExtractMostSignificantBits is likely faster because it can always be emitted exactly as movmsk, and there are some CPUs where TestZ can be slower, particularly for "small inputs" where the match is sooner. When the match is later, TestZ typically wins out regardless.

On Arm64, doing the early comparison against == Zero is likely better because it is a single instruction vs the multi-instruction sequence required to emulate x64's movmsk.

I think the best choice here is to use == Zero (and therefore TestZ) as I believe it will, on average, produce the best/most consistent code. The cases where it might be a bit slower will typically be for smaller inputs where we're already returning quickly and the extra couple nanoseconds won't really matter.
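For reference, a minimal sketch of the "== Zero" style check being discussed (illustrative only; whether the JIT lowers it to a vptest-style test or a movemask sequence depends on the target):

using System.Runtime.Intrinsics;

internal static class NonAsciiCheckSketch
{
    // True if any byte in the vector has its high bit set (i.e. is non-ASCII).
    public static bool ContainsNonAscii(Vector256<byte> vector)
        => (vector & Vector256.Create((byte)0x80)) != Vector256<byte>.Zero;
}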


currentOffset += (nuint)Vector256<byte>.Count;
pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
Member:

On every iteration of the loop we're bumping currentOffset and then also adding that to pAsciiBuffer. Would it be faster to instead compute the upper bound as an address, just bump the current pointer in the loop, and then after the loop compute the currentOffset if needed based on the ending/starting difference?

Member Author:

It produces better codegen, but the reported time difference is within the range of error (0-0.3 ns gain or loss):

public unsafe class Widen
{
    private const int Size = 1024;
    private byte* _bytes;
    private char* _chars;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Fill((byte)'a');
        _chars = (char*)NativeMemory.AlignedAlloc(new UIntPtr(Size * sizeof(char)), new UIntPtr(32));
    }

    [GlobalCleanup]
    public void Free()
    {
        NativeMemory.AlignedFree(_bytes);
        NativeMemory.AlignedFree(_chars);
    }

    [Benchmark]
    public void Current()
    {
        ref byte searchSpace = ref *_bytes;
        ushort* pCurrentWriteAddress = (ushort*)_chars;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.Load(_bytes + currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.Store(pCurrentWriteAddress);
            upper.Store(pCurrentWriteAddress + Vector256<ushort>.Count);

            currentOffset += (nuint)Vector256<byte>.Count;
            pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);
    }

    [Benchmark]
    public void Suggested()
    {
        ref byte currentSearchSpace = ref *_bytes;
        ref ushort currentWriteAddress = ref Unsafe.As<char, ushort>(ref *_chars);
        ref byte oneVectorAwayFromEnd = ref Unsafe.Add(ref currentSearchSpace, Size - Vector256<byte>.Count);

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref currentSearchSpace);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.StoreUnsafe(ref currentWriteAddress);
            upper.StoreUnsafe(ref currentWriteAddress, (nuint)Vector256<ushort>.Count);

            currentSearchSpace = ref Unsafe.Add(ref currentSearchSpace, Vector256<byte>.Count);
            currentWriteAddress = ref Unsafe.Add(ref currentWriteAddress, Vector256<byte>.Count);
        } while (!Unsafe.IsAddressGreaterThan(ref currentSearchSpace, ref oneVectorAwayFromEnd));
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
  [Host]     : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
Method Mean Error StdDev Code Size
Current 44.29 ns 0.171 ns 0.152 ns 81 B
Suggested 44.54 ns 0.042 ns 0.032 ns 77 B

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.Widen.Current()
       vzeroupper
       mov       eax,[rcx+8]
       mov       rax,[rcx+10]
       xor       edx,edx
M00_L00:
       mov       r8,[rcx+8]
       vmovdqu   ymm0,ymmword ptr [r8+rdx]
       vpmovmskb r8d,ymm0
       test      r8d,r8d
       jne       short M00_L01
       vmovaps   ymm1,ymm0
       vpmovzxbw ymm1,xmm1
       vextractf128 xmm0,ymm0,1
       vpmovzxbw ymm0,xmm0
       vmovdqu   ymmword ptr [rax],ymm1
       vmovdqu   ymmword ptr [rax+20],ymm0
       add       rdx,20
       add       rax,40
       cmp       rdx,3E0
       jbe       short M00_L00
M00_L01:
       vzeroupper
       ret
; Total bytes of code 81

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.Widen.Suggested()
       vzeroupper
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       lea       rcx,[rax+3E0]
M00_L00:
       vmovdqu   ymm0,ymmword ptr [rax]
       vpmovmskb r8d,ymm0
       test      r8d,r8d
       jne       short M00_L01
       vmovaps   ymm1,ymm0
       vpmovzxbw ymm1,xmm1
       vextractf128 xmm0,ymm0,1
       vpmovzxbw ymm0,xmm0
       vmovdqu   ymmword ptr [rdx],ymm1
       vmovdqu   ymmword ptr [rdx+20],ymm0
       add       rax,20
       add       rdx,40
       cmp       rax,rcx
       jbe       short M00_L00
M00_L01:
       vzeroupper
       ret
; Total bytes of code 77

If you don't mind, I am going to merge it as is and apply your suggestion in my next PR.

break;
}

// Vector128.Widen is not used here as it is less performant on ARM64
Member:

Do we know why? Naively I'd expect the JIT to be able to produce the same code for both.

Member Author:

I am sorry, but I don't. It's not that I didn't try to find out; it's the ARM64 tooling that makes it hard for me.

Member:

@tannergooding?

If this is by design, OK. But if it's something we can/should be fixing in the JIT, I want to make sure we're not sweeping such issues under the rug. Ideally the obvious code is also the best-performing code.

Member:

Codegen looks good to me:
[Arm64 codegen screenshot]
(add could be contained but it's unrelated here)

tannergooding (Member) commented Sep 9, 2022:

I'd need a comparison between the two code paths to see where the difference is.

I would expect these to be identical except for a case where the original code was making some assumption (based on knowing the inputs were restricted to a subset of all possible values) and therefore skipping an otherwise "required" step that would be necessary to ensure deterministic results for "any input".


currentOffset += (nuint)Vector128<byte>.Count;
pCurrentWriteAddress += (nuint)Vector128<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
}
}
else if (Vector.IsHardwareAccelerated)
stephentoub (Member) commented Sep 8, 2022:

Why is the Vector<T> path still needed?

Member Author:

Why is the Vector<T> path still needed?

Some Mono variants don't support Vector128 for all configs yet

Member:

Some Mono variants don't support Vector128 for all configs yet

Which ones support Vector<T> and not Vector128?

Just the presence of these paths is keeping the methods from being R2R'd, it seems.

Member:

Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode or AOT) supports only Vector<>.
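For context, a sketch of the kind of tiered dispatch being discussed, with placeholder names and thresholds (not the actual ASCIIUtility code):

using System.Numerics;
using System.Runtime.Intrinsics;

internal static unsafe class DispatchSketch
{
    public static void Widen(byte* pAscii, char* pUtf16, nuint elementCount)
    {
        if (Vector256.IsHardwareAccelerated && elementCount >= 2 * (uint)Vector256<byte>.Count)
        {
            // Vector256 path (e.g. RyuJIT with AVX2)
        }
        else if (Vector128.IsHardwareAccelerated && elementCount >= (uint)Vector128<byte>.Count)
        {
            // Vector128 path (RyuJIT, Mono-LLVM)
        }
        else if (Vector.IsHardwareAccelerated)
        {
            // Vector<T> fallback kept for Mono configurations without Vector128 acceleration
        }
        else
        {
            // scalar fallback
        }
    }
}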

Member:

Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode or AOT) supports only Vector<>.

Is that getting fixed?

We now have multiple vectorized implementations that don't have a Vector<T> code path. Why is this one special such that it still needs one?

ghost locked the conversation as resolved and limited it to collaborators on Oct 15, 2022