This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Vectorize Convert.ToBase64String using SSSE3 #21833

Closed
wants to merge 25 commits into from

Conversation

EgorBo
Member

@EgorBo EgorBo commented Jan 6, 2019

This PR improves the performance of Convert.ToBase64String using SSSE3 instructions.
It is based on the article "Base64 encoding with SIMD instructions" by Wojciech Muła.

Benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;

namespace ConsoleApp143
{
    public class ToBase64StringBenchmarks
    {
        public static IEnumerable<object[]> TestDataForGraph()
        {
            var rand = new Random(314666); // fixed "seed"
            for (int i = 0; i < 100; i++)
            {
                var data = new byte[i];
                for (int j = 0; j < i; j++)
                    data[j] = (byte)rand.Next(0, byte.MaxValue + 1); // upper bound is exclusive
                yield return new object[] { data, i };
            }
        }

        [Benchmark]
        [ArgumentsSource(nameof(TestDataForGraph))]
        public string ToBase64(byte[] testData, int inputSize /* argument for report */) =>
            Convert.ToBase64String(testData, Base64FormattingOptions.InsertLineBreaks);

        static void Main(string[] args) =>
            BenchmarkSwitcher.FromAssembly(typeof(ToBase64StringBenchmarks).Assembly).Run(args);
    }
}

Windows 10.0.17134.523, Core i7-8700K 3.7GHz (Coffee Lake):

image

macOS 10.13.6, Core i7-4980HQ 2.8GHz (Haswell):

image

The SSSE3-based implementation is used only when input.Length > 36, in order to avoid regressions for smaller inputs (36 was the best cutover value on my Skylake, Coffee Lake and Haswell based machines).
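
The cutover described above can be sketched as a single guard. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, and only the value 36 and the Ssse3.IsSupported check come from the discussion.

```csharp
using System.Runtime.Intrinsics.X86;

public static class Base64PathSelector
{
    // Cutover value quoted in the PR description; it was tuned on a few
    // Intel machines and should be treated as machine-dependent.
    private const int VectorizedThreshold = 36;

    // Hypothetical helper: true when the SSSE3 path should handle an input
    // of this length, false when the scalar path should be used instead.
    public static bool UseVectorizedPath(int inputLength) =>
        Ssse3.IsSupported && inputLength > VectorizedThreshold;
}
```

Keeping the threshold in one constant makes it easy to re-run the benchmark with other values on other hardware.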

@stephentoub
Member

> According to my benchmark, the performance gain only shows up once input.Length >= 50.

The graph doesn't show below 24... is there a regression for small values? (It's pretty common to use base-64 encoding with small values, such as in various HTTP headers.)

@tannergooding
Member

> BTW, when I did the port I had to manually reverse all the values in _mm256_setr. Maybe it makes sense to add Vector.CreateReversed in order to simplify such cases?

The current Create methods for Vector64, Vector128, and Vector256 take the values in the same order as the native setr methods (which is e0, e1, ...)
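
The element order can be checked directly. The sketch below is only an illustration with hypothetical class and method names; it relies on Vector128.Create and GetElement, and shows that the first argument lands in element 0, as with the native setr intrinsics.

```csharp
using System.Runtime.Intrinsics;

public static class CreateOrderSketch
{
    // Arguments are passed as e0, e1, ..., e15 (lowest element first),
    // which is the argument order of the native _mm_setr_epi8.
    public static Vector128<byte> Ascending() => Vector128.Create(
        (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

    // Hypothetical helper used to inspect a single lane.
    public static byte ElementAt(int index) => Ascending().GetElement(index);
}
```

Code ported from a native `_mm_set_epi8` call (highest element first) therefore needs its arguments written in reverse, while code ported from `_mm_setr_epi8` can be copied as is.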

@gfoidl
Member

gfoidl commented Jan 7, 2019

FYI: https://github.com/dotnet/corefx/issues/32365 (will do when I get some time for this) (and https://github.com/gfoidl/Base64)

@EgorBo
Member Author

EgorBo commented Jan 7, 2019

@gfoidl oh, didn't see your work. I did this just to practice and test Intrinsics API 🙂

@EgorBo
Member Author

EgorBo commented Feb 10, 2019

@tannergooding @stephentoub @fiigii @gfoidl I updated the PR and its description (added graphs). Could you please take a look?
I tried to keep it simple and small and to avoid any regressions for small values

@EgorBo EgorBo changed the title from "Vectorize Convert.ToBase64String using AVX2" to "Vectorize Convert.ToBase64String using SSSE3" on Feb 10, 2019
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5);
Vector128<sbyte> compareResult = Sse2.CompareGreaterThan(tt7, indices.AsSByte());
result = Sse2.Or(result, Sse2.And(compareResult.AsByte(), tt8));
result = Ssse3.Shuffle(s_base64ShiftLut, result);
Member

Is s_base64ShiftLut kept in a register or read from memory every time?
Hoisting it outside the loop may be better.

Member Author

s_base64ShiftLut is a static readonly field (effectively a constant), so I'd expect it to be kept in a register. I'll check the asm again for loads at this spot.
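
The hoisting suggested in this thread can be sketched as follows. Everything here is an assumption for illustration: the table values, the class name and the Lookup method are hypothetical, and whether the JIT keeps the local in a register still has to be confirmed in the generated asm. A scalar fallback is included so the sketch also runs without SSSE3.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HoistSketch
{
    // Hypothetical 16-entry table; the PR's s_base64ShiftLut holds other values.
    private static readonly Vector128<byte> s_table = Vector128.Create(
        (byte)100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115);

    public static void Lookup(ReadOnlySpan<byte> indices, Span<byte> output)
    {
        if (indices.Length % 16 != 0 || output.Length < indices.Length)
            throw new ArgumentException("Expected whole 16-byte blocks and a large enough output.");

        // Read the static field once, before the loop, instead of on every iteration.
        Vector128<byte> table = s_table;

        if (Ssse3.IsSupported)
        {
            ReadOnlySpan<Vector128<byte>> src = MemoryMarshal.Cast<byte, Vector128<byte>>(indices);
            Span<Vector128<byte>> dst = MemoryMarshal.Cast<byte, Vector128<byte>>(output);
            for (int i = 0; i < src.Length; i++)
                dst[i] = Ssse3.Shuffle(table, src[i]); // pshufb: table lookup per byte
        }
        else
        {
            // Scalar model of pshufb: a set high bit yields 0,
            // otherwise the low four bits select the table entry.
            for (int i = 0; i < indices.Length; i++)
            {
                byte index = indices[i];
                output[i] = (index & 0x80) != 0 ? (byte)0 : table.GetElement(index & 0x0F);
            }
        }
    }
}
```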


// Do it for the second part of the vector (rotate it first in order to re-use asciiToStringMaskLo)
result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte();
result = Ssse3.Shuffle(result, s_base64TwoBytesStringMaskLo);
Member

Same for s_base64TwoBytesStringMaskLo.
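
For readers unfamiliar with the 0x4E control byte in the snippet above, here is a scalar model of how `_MM_SHUFFLE(1,0,3,2)` is packed and what it does to four 32-bit lanes. The class and method names are hypothetical; the packing formula and lane selection follow the documented behavior of the native macro and of pshufd.

```csharp
public static class ShuffleControlSketch
{
    // Same packing as the native _MM_SHUFFLE(z, y, x, w) macro.
    public static byte MmShuffle(int z, int y, int x, int w) =>
        (byte)((z << 6) | (y << 4) | (x << 2) | w);

    // Scalar model of pshufd: result[i] = source[(control >> (2 * i)) & 3].
    public static uint[] Shuffle32(uint[] source, byte control)
    {
        var result = new uint[4];
        for (int i = 0; i < 4; i++)
            result[i] = source[(control >> (2 * i)) & 3];
        return result;
    }
}
```

With control 0x4E the lanes (a, b, c, d) become (c, d, a, b), so the two 64-bit halves of the vector swap places. That is what lets the same low-half mask be reused for the upper half.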

result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte();
result = Ssse3.Shuffle(result, s_base64TwoBytesStringMaskLo);

if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition)
Member

I would move the case with insertLineBreaks into a separate method, so that the codegen for either case can be optimized.

This may also prevent some spills of the SIMD registers (if there are any).
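
A simplified sketch of the split suggested here, assuming the 76-character line length used by Base64FormattingOptions.InsertLineBreaks. It works on an already encoded string instead of the PR's SIMD loop, and all names are hypothetical. The point is only the shape: the flag is tested once, and the line-break bookkeeping lives in its own method.

```csharp
using System;
using System.Text;

public static class LineBreakSplitSketch
{
    private const int LineLength = 76; // characters per line with InsertLineBreaks

    // The flag is checked once here, not inside the per-block loop.
    public static string Format(string encoded, bool insertLineBreaks) =>
        insertLineBreaks ? WithLineBreaks(encoded) : encoded;

    // Only this method carries the line-break logic.
    private static string WithLineBreaks(string encoded)
    {
        var sb = new StringBuilder(encoded.Length + 2 * (encoded.Length / LineLength));
        for (int i = 0; i < encoded.Length; i += LineLength)
        {
            if (i > 0) sb.Append("\r\n"); // break between lines, never after the last one
            sb.Append(encoded, i, Math.Min(LineLength, encoded.Length - i));
        }
        return sb.ToString();
    }
}
```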

Member Author

@EgorBo EgorBo Feb 12, 2019

I didn't see any measurable performance regression for any input size with insertLineBreaks set to false after adding this block.

@stephentoub
Member

@EgorBo, are you still working on this?

@EgorBo
Member Author

EgorBo commented Apr 24, 2019

@stephentoub updated the comments.
I guess this PR intersects with dotnet/corefx#34529 by @gfoidl, who started working on this earlier (my PR focuses only on encoding).

@gfoidl
Member

gfoidl commented Apr 24, 2019

@EgorBo I wouldn't call it "intersects", as the other PR is for span-based byte -> byte encoding / decoding, whilst this one is for byte -> string (with line-breaks). So similar, but different targets.

If there were no need for line breaks, the base64 encoding in Convert could be based on System.Buffers.Text.Base64.
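
As a rough sketch of that idea, the no-line-break case could be expressed on top of System.Buffers.Text.Base64 like this. The class and method names are hypothetical, and this is not how Convert is implemented. It goes through an intermediate byte buffer, so it shows the layering only, not an optimized path.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

public static class ConvertViaBuffersSketch
{
    public static string ToBase64StringNoLineBreaks(ReadOnlySpan<byte> bytes)
    {
        // byte -> byte encoding into a buffer of the maximum encoded size.
        byte[] utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(bytes.Length)];
        OperationStatus status = Base64.EncodeToUtf8(bytes, utf8, out _, out int written);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException($"Unexpected status: {status}");

        // Base64 output is plain ASCII, so widening to a string is lossless.
        return Encoding.ASCII.GetString(utf8, 0, written);
    }
}
```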

@danmoseley
Member

Resolved merge conflict so we can get test results.

@danmoseley
Member

@tannergooding if tests pass is this ready to merge?

@tannergooding
Member

I'll give this one more pass after lunch.

@sandreenko

@EgorBo do you think this PR can be finished before the consolidation (in the next 2 weeks)?

@@ -2492,19 +2494,146 @@ public static unsafe bool TryToBase64Chars(ReadOnlySpan<byte> bytes, Span<char>
}
}

internal static readonly Vector128<byte> s_base64ShuffleMask = Vector128.Create((byte)
Member

A short comment describing each constant would be useful.

It's also not clear why these are static readonly, but several of the others (such as tt0-tt8) are not

Member

Given https://github.com/dotnet/coreclr/issues/17225 and https://github.com/dotnet/coreclr/issues/26976, it would be more efficient, both in processing and in space, to use the ReadOnlySpan<byte> read-only property trick for these, especially since they're only used by code behind an Ssse3.IsSupported check.
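
The trick mentioned here relies on the C# compiler storing a constant `new byte[] { ... }` returned from a ReadOnlySpan<byte> property in the assembly's data section, so no array is allocated and no static constructor is needed. A minimal sketch, with hypothetical names and illustrative mask values that are not taken from the PR:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class RosConstantSketch
{
    // Illustrative values only. The span points at static data in the assembly.
    private static ReadOnlySpan<byte> ShuffleMaskBytes => new byte[]
    {
        1, 0, 2, 1, 4, 3, 5, 4, 7, 6, 8, 7, 10, 9, 11, 10
    };

    // Reads the 16 bytes as one vector; no alignment is assumed.
    public static Vector128<byte> LoadShuffleMask() =>
        Unsafe.ReadUnaligned<Vector128<byte>>(ref MemoryMarshal.GetReference(ShuffleMaskBytes));
}
```

The load would typically be done once before the loop, in the same way as the hoisted locals discussed earlier in the review.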

Vector128<byte> indices = Sse2.Or(t1, t3);

// lookup function "Single pshufb method" (lookup_pshufb_improved)
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5);
Member

@tannergooding tannergooding Nov 4, 2019

Any reason this isn't a static local function (since it was a separate function in the original algorithm)? Inlining?

result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte();
result = Ssse3.Shuffle(result, localTwoBytesStringMaskLo);

if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition)
Member

Having the side effect only hit if insertLineBreaks is true, but required for both the true and false scenarios is non-obvious.

It would be nice to move the charcount += 16 out into a separate statement.
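
The difference is easy to miss because `&&` short-circuits. The toy sketch below uses hypothetical method names and a hypothetical `limit` parameter, and is not the PR's loop. It shows that the fused form never advances the counter when the flag is false, while the separated form always does.

```csharp
public static class SideEffectSketch
{
    // Fused form: the increment runs only when insertLineBreaks is true.
    public static (int Charcount, int Hits) Fused(bool insertLineBreaks, int blocks, int limit)
    {
        int charcount = 0, hits = 0;
        for (int i = 0; i < blocks; i++)
        {
            if (insertLineBreaks && (charcount += 16) >= limit)
                hits++;
        }
        return (charcount, hits);
    }

    // Separated form: the increment is an explicit statement of its own.
    public static (int Charcount, int Hits) Separated(bool insertLineBreaks, int blocks, int limit)
    {
        int charcount = 0, hits = 0;
        for (int i = 0; i < blocks; i++)
        {
            charcount += 16;
            if (insertLineBreaks && charcount >= limit)
                hits++;
        }
        return (charcount, hits);
    }
}
```

Both forms agree when the flag is true. They differ only in what `charcount` holds afterwards when it is false, which is the behavior the review comment asks to make explicit.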

@saucecontrol
Member

> The SSSE3-based implementation is used only when input.Length > 36, in order to avoid regressions for smaller inputs (36 was the best cutover value on my Skylake, Coffee Lake and Haswell based machines).

Is the 36-byte cutover point appropriate for 32-bit as well? There are more than 8 active XMM registers used in the inner loop, so there will likely be some stack shuffling offsetting the SSE gains.

@maryamariyan
Member

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@maryamariyan
Member

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.
