-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Vectorize Convert.ToBase64String using SSSE3 #21833
Conversation
The graph doesn't show below 24... is there a regression for small values? (It's pretty common to use base-64 encoding with small values, such as in various HTTP headers.) |
The current |
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
FYI: https://github.com/dotnet/corefx/issues/32365 (will do when I get some time for this) (and https://github.com/gfoidl/Base64) |
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
@gfoidl oh, didn't see your work. I did this just to practice and test Intrinsics API 🙂 |
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Convert.Base64.Avx2.cs
Outdated
Show resolved
Hide resolved
@tannergooding @stephentoub @fiigii @gfoidl I updated the PR and its description (added graphs). Could you please take a look? |
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5); | ||
Vector128<sbyte> compareResult = Sse2.CompareGreaterThan(tt7, indices.AsSByte()); | ||
result = Sse2.Or(result, Sse2.And(compareResult.AsByte(), tt8)); | ||
result = Ssse3.Shuffle(s_base64ShiftLut, result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is s_base64ShiftLut
kept is a register or read from memory everytime?
Hoisting this outside the loop maybe better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s_base64ShiftLut
is a static readonly field (constant) - so I guess it should be kept in a register, will check asm once again for loads in this place
|
||
// Do it for the second part of the vector (rotate it first in order to re-use asciiToStringMaskLo) | ||
result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte(); | ||
result = Ssse3.Shuffle(result, s_base64TwoBytesStringMaskLo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for s_base64TwoBytesStringMaskLo
.
result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte(); | ||
result = Ssse3.Shuffle(result, s_base64TwoBytesStringMaskLo); | ||
|
||
if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move the case with insertLineBreaks
into a separate method, so that the codegen for either case can be optimized.
This may also prevent some spills in the simd-registers (if there are any).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice any noticeable performance regressions after I added this block for any values when insertLineBreaks
is false
@EgorBo, are you still working on this? |
@stephentoub updated the comments. |
@EgorBo I wouldn't call it "intersects", as the other PR is for span-based byte -> byte encoding / decoding, whilst this one is for byte -> string (with line-breaks). So similar, but different targets. If there would be no need for line-breaks, so the base64 encoding in |
Resolved merge conflict so we can get test results. |
@tannergooding if tests pass is this ready to merge? |
I'll give this one more pass after lunch. |
@EgorBo do you think that PR can be finished before the consolidation (in next 2 weeks)? |
# Conflicts: # THIRD-PARTY-NOTICES.TXT # src/System.Private.CoreLib/shared/System/Convert.cs
# Conflicts: # THIRD-PARTY-NOTICES.TXT
@@ -2492,19 +2494,146 @@ public static unsafe bool TryToBase64Chars(ReadOnlySpan<byte> bytes, Span<char> | |||
} | |||
} | |||
|
|||
internal static readonly Vector128<byte> s_base64ShuffleMask = Vector128.Create((byte) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A short comment describing each constant would be useful.
It's also not clear why these are static readonly, but several of the others (such as tt0-tt8
) are not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given https://github.com/dotnet/coreclr/issues/17225 and https://github.com/dotnet/coreclr/issues/26976, it would be more efficient processing and space-wise to use the ROS<byte>
read-only property trick on these, especially since they're only used by code behind a Ssse3.IsSupported
check.
Vector128<byte> indices = Sse2.Or(t1, t3); | ||
|
||
// lookup function "Single pshufb method" (lookup_pshufb_improved) | ||
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this isn't a static local function (since it was a separate function in the original algorithm)? Inlining?
result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte(); | ||
result = Ssse3.Shuffle(result, localTwoBytesStringMaskLo); | ||
|
||
if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the side effect only hit if insertLineBreaks
is true, but required for both the true and false scenarios is non-obvious.
It would be nice to move the charCount += 16
out separately
Is the 36-byte cutover point appropriate for 32-bit as well? There are more than 8 active XMM registers used in the inner loop, so there will likely be some stack shuffling offsetting the SSE gains. |
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
|
Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime. |
This PR improves
Convert.ToBase64String
performance using SSSE3 instructions.It's based on "Base64 encoding with SIMD instructions" article by Wojciech Muła
Benchmark:
Windows 10.0.17134.523, Core i7-8700K 3.7GHz (Coffee Lake):
macOS 10.13.6, Core i7-4980HQ 2.8GHz (Haswell):
SSSE3-based implementation is limited with
input.Length>36
condition in order to avoid regressions for smaller values (the best value for my Skylake, Coffee Lake and Haswell based machines).