
Add Hardware Accelerated Checksums #1201

Merged
merged 6 commits into from
May 18, 2020

Conversation

JimBobSquarePants
Member

@JimBobSquarePants JimBobSquarePants commented May 16, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my code follows the existing coding patterns and practices demonstrated in the repository. These follow strict StyleCop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Added support for hardware-accelerated Crc32 and Adler32 checksum generation, both used by our PNG codecs. They're fast!
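As background (not part of the PR), the scalar definition of Adler-32 that the SIMD code accelerates is only a few lines; Python's `zlib` module exposes the same checksum, which makes it handy for cross-checking a sketch:

```python
import zlib

def adler32_scalar(data: bytes) -> int:
    # Adler-32 per RFC 1950: s1 = 1 + sum of bytes, s2 = sum of the
    # running s1 values, both modulo 65521 (largest prime below 2^16).
    MOD = 65521
    s1, s2 = 1, 0
    for b in data:
        s1 = (s1 + b) % MOD
        s2 = (s2 + s1) % MOD
    return (s2 << 16) | s1

sample = bytes(range(256)) * 16
assert adler32_scalar(sample) == zlib.adler32(sample)
```

The vectorized version in this PR computes the same two running sums, but 16 bytes at a time.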

Benchmarks.

Adler32

| Method | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| SharpZipLibCalculate | .NET 4.7.2 | 1024 | 793.18 ns | 775.66 ns | 42.516 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 1024 | 384.86 ns | 15.64 ns | 0.857 ns | 0.49 | 0.03 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 1024 | 790.31 ns | 353.34 ns | 19.368 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 1024 | 465.28 ns | 652.41 ns | 35.761 ns | 0.59 | 0.03 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 1024 | 877.25 ns | 97.89 ns | 5.365 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 1024 | 45.60 ns | 13.28 ns | 0.728 ns | 0.05 | 0.00 | - | - | - | - |
| SharpZipLibCalculate | .NET 4.7.2 | 2048 | 1,537.04 ns | 428.44 ns | 23.484 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 2048 | 849.76 ns | 1,066.34 ns | 58.450 ns | 0.55 | 0.04 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 2048 | 1,616.97 ns | 276.70 ns | 15.167 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 2048 | 790.77 ns | 691.71 ns | 37.915 ns | 0.49 | 0.03 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 2048 | 1,735.11 ns | 1,374.22 ns | 75.325 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 2048 | 87.80 ns | 56.84 ns | 3.116 ns | 0.05 | 0.00 | - | - | - | - |
| SharpZipLibCalculate | .NET 4.7.2 | 4096 | 3,054.53 ns | 796.41 ns | 43.654 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 4096 | 1,538.90 ns | 487.02 ns | 26.695 ns | 0.50 | 0.01 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 4096 | 3,223.48 ns | 32.32 ns | 1.771 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 4096 | 1,547.60 ns | 309.72 ns | 16.977 ns | 0.48 | 0.01 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 4096 | 3,672.33 ns | 1,095.81 ns | 60.065 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 4096 | 159.44 ns | 36.31 ns | 1.990 ns | 0.04 | 0.00 | - | - | - | - |

Crc32

| Method | Runtime | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| SharpZipLibCalculate | .NET 4.7.2 | 1024 | 3,067.24 ns | 769.25 ns | 42.165 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 1024 | 2,546.86 ns | 1,106.36 ns | 60.643 ns | 0.83 | 0.02 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 1024 | 3,377.15 ns | 3,903.41 ns | 213.959 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 1024 | 2,524.25 ns | 2,220.97 ns | 121.739 ns | 0.75 | 0.04 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 1024 | 3,980.60 ns | 8,497.37 ns | 465.769 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 1024 | 78.68 ns | 69.82 ns | 3.827 ns | 0.02 | 0.00 | - | - | - | - |
| SharpZipLibCalculate | .NET 4.7.2 | 2048 | 7,934.29 ns | 42,550.13 ns | 2,332.316 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 2048 | 5,437.81 ns | 12,760.51 ns | 699.447 ns | 0.71 | 0.10 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 2048 | 6,008.05 ns | 621.37 ns | 34.059 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 2048 | 4,791.50 ns | 3,894.94 ns | 213.495 ns | 0.80 | 0.04 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 2048 | 5,900.06 ns | 1,344.70 ns | 73.707 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 2048 | 103.12 ns | 15.66 ns | 0.859 ns | 0.02 | 0.00 | - | - | - | - |
| SharpZipLibCalculate | .NET 4.7.2 | 4096 | 12,422.59 ns | 1,308.01 ns | 71.696 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET 4.7.2 | 4096 | 10,524.63 ns | 6,267.56 ns | 343.546 ns | 0.85 | 0.03 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 2.1 | 4096 | 11,888.00 ns | 1,059.25 ns | 58.061 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 2.1 | 4096 | 9,806.24 ns | 241.91 ns | 13.260 ns | 0.82 | 0.00 | - | - | - | - |
| SharpZipLibCalculate | .NET Core 3.1 | 4096 | 12,181.28 ns | 1,974.68 ns | 108.239 ns | 1.00 | 0.00 | - | - | - | - |
| SixLaborsCalculate | .NET Core 3.1 | 4096 | 192.39 ns | 10.27 ns | 0.563 ns | 0.02 | 0.00 | - | - | - | - |

Contributor

@saucecontrol saucecontrol left a comment

Really cool to see this happening! I left a few suggestions for improvements if you want to push it further.

```csharp
const byte S2301 = 0b1011_0001; // A B C D -> B A D C
const byte S1032 = 0b0100_1110; // A B C D -> C D A B

v_s1 = Sse2.Add(v_s1, Sse2.Shuffle(v_s1, S2301));
```
Contributor

Suggested change
```diff
- v_s1 = Sse2.Add(v_s1, Sse2.Shuffle(v_s1, S2301));
```

This was a mistake in the Chromium code. The odd elements of the s1 vector are always 0, so this shuffle/add pair doesn't do anything.

Comment on lines 109 to 111
```csharp
Vector128<int> v_ps = Vector128.CreateScalar(s1 * n).AsInt32();
Vector128<int> v_s2 = Vector128.CreateScalar(s2).AsInt32();
Vector128<int> v_s1 = Vector128<int>.Zero;
```
Contributor

Suggested change
```diff
- Vector128<int> v_ps = Vector128.CreateScalar(s1 * n).AsInt32();
- Vector128<int> v_s2 = Vector128.CreateScalar(s2).AsInt32();
- Vector128<int> v_s1 = Vector128<int>.Zero;
+ Vector128<uint> v_ps = Vector128.CreateScalar(s1 * n);
+ Vector128<uint> v_s2 = Vector128.CreateScalar(s2);
+ Vector128<uint> v_s1 = Vector128<uint>.Zero;
```

The logic depends on these values not overflowing uint.MaxValue when processing NMAX bytes. Best to keep them unsigned all the way through.
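To illustrate the bound the reviewer is referring to (background, not part of the PR): zlib's NMAX = 5552 is chosen as the largest block length whose unreduced sums still fit in an unsigned 32-bit accumulator. A quick check, assuming the standard worst case (all-0xFF input starting from the largest already-reduced state):

```python
def worst_case_s2(n: int) -> int:
    # Unreduced s2 after n bytes of 0xFF, starting from s1 = s2 = BASE - 1:
    # sum of 255 * i for i = 1..n, plus (n + 1) copies of the initial state.
    BASE = 65521
    return 255 * n * (n + 1) // 2 + (n + 1) * (BASE - 1)

NMAX = 5552  # zlib's block size; note it is a multiple of 16
assert worst_case_s2(NMAX) <= 2**32 - 1      # fits in a uint accumulator
assert worst_case_s2(NMAX + 1) > 2**32 - 1   # one more byte could overflow
```

This is why the accumulators must stay unsigned: the headroom calculation assumes the full 32-bit range.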

```csharp
Vector128<short> mad1 = Ssse3.MultiplyAddAdjacent(bytes1, tap1);
v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));

v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsInt32());
```
Contributor

Suggested change
```diff
- v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsInt32());
+ v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsUInt32());
```


```csharp
// Horizontally add the bytes for s1, multiply-adds the
// bytes by [ 32, 31, 30, ... ] for s2.
v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
```
Contributor

Suggested change
```diff
- v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
+ v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsUInt32());
```

```csharp
v_s2 = Sse2.Add(v_s2, Sse2.Shuffle(v_s2, S2301));
v_s2 = Sse2.Add(v_s2, Sse2.Shuffle(v_s2, S1032));

s2 = (uint)v_s2.ToScalar();
```
Contributor

Suggested change
```diff
- s2 = (uint)v_s2.ToScalar();
+ s2 = v_s2.ToScalar();
```

```csharp
// bytes by [ 32, 31, 30, ... ] for s2.
v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
Vector128<short> mad1 = Ssse3.MultiplyAddAdjacent(bytes1, tap1);
v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));
```
Contributor

Suggested change
```diff
- v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));
+ v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones).AsUInt32());
```

Comment on lines 92 to 93
```csharp
var tap1 = Vector128.Create(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);
var tap2 = Vector128.Create(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
```
Contributor

Vector128.Create() with a lot of elements is quite inefficient on netcoreapp3.x. @tannergooding recently fixed it up for net5.0 in dotnet/runtime#35857, but you might want to use the ROS trick to load these from fixed data since 3.1 will be around for a while.

Comment on lines 162 to 167
```csharp
if (length >= 16)
{
    s1 += Unsafe.Add(ref bufferRef, index++);
    s2 += s1;
    s1 += Unsafe.Add(ref bufferRef, index++);
    s2 += s1;
```
Contributor

This was an odd choice by the Chromium dev(s). Since NMAX is an even multiple of 16 but not of 32, you could always safely process a 16-byte straggler as a half iteration of the SIMD loop. No need for this big manually-unrolled block.

For that matter, if the typical data length is greater than, say, 128 bytes the algorithm would extend naturally to AVX2, which would allow 64 bytes per iteration using basically the same code.
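The block-then-reduce structure being discussed can be sketched at the scalar level (background illustration, not the PR's code): accumulate up to NMAX bytes with plain additions, then take the modulo once per block. Note NMAX is a multiple of 16 but not of 32, which is the property behind the "16-byte straggler" suggestion:

```python
import zlib

NMAX = 5552
assert NMAX % 16 == 0 and NMAX % 32 != 0  # the property the comment relies on

def adler32_blocked(data: bytes) -> int:
    # Defer the expensive modulo to once per NMAX-byte block, as zlib
    # (and the vectorized code) does; the inner loop is pure additions.
    MOD = 65521
    s1, s2 = 1, 0
    for start in range(0, len(data), NMAX):
        for b in data[start:start + NMAX]:
            s1 += b
            s2 += s1
        s1 %= MOD
        s2 %= MOD
    return (s2 << 16) | s1

sample = bytes(range(256)) * 64  # spans several NMAX blocks
assert adler32_blocked(sample) == zlib.adler32(sample)
```

In the SIMD version the inner loop consumes 16 (or, with AVX2, 32 or 64) bytes per iteration instead of one.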

Comment on lines 234 to 237
```csharp
s1 += Unsafe.Add(ref bufferRef, index++);
s2 += s1;
s1 += Unsafe.Add(ref bufferRef, index++);
s2 += s1;
```
Contributor

JIT doesn't do great with code like this.

On legacy jit32, it emits this:

    L0031: movzx eax, byte [ebx+eax]
    L0035: add edx, eax
    L0037: add ecx, edx
    L0039: mov eax, esi
    L003b: inc esi
    L003c: movzx eax, byte [ebx+eax]
    L0040: add edx, eax
    L0042: add ecx, edx
    L0044: mov eax, esi
    L0046: inc esi

RyuJIT does a bit better by not actually incrementing index each time, but it emits an extra mov for each block because it's confused by your two variables:

    L0031: movzx r9d, byte ptr [rax+1]
    L0036: add r9d, ecx
    L0039: mov ecx, r9d
    L003c: add r8d, ecx
    L003f: movzx r9d, byte ptr [rax+2]
    L0044: add r9d, ecx
    L0047: mov ecx, r9d
    L004a: add r8d, ecx

Writing it more like the C code:

    byte* pbuff = pbuffer + index;
    if (length >= 16) 
    { 
        s2 += (s1 += pbuff[0]); 
        s2 += (s1 += pbuff[1]); 
        s2 += (s1 += pbuff[2]); 
        ...
    }

gives better codegen in both:

    L0042: movzx r9d, byte ptr [rcx+1]
    L0047: add eax, r9d
    L004a: add r8d, eax
    L004d: movzx r9d, byte ptr [rcx+2]
    L0052: add eax, r9d
    L0055: add r8d, eax

SharpLab for both variants

It's also not clear that unroll by 16 is ideal for modern processors. This may be an outdated optimization from zlib. Might be worth testing unroll by 8 or even 4 to see how they do if you haven't already.

Comment on lines 83 to 86
```csharp
fixed (ulong* k1k2Ptr = &k1k2[0])
fixed (ulong* k3k4Ptr = &k3k4[0])
fixed (ulong* k5k0Ptr = &k5k0[0])
fixed (ulong* polyPtr = &poly[0])
```
Contributor

This is another place where the ROS trick would be of benefit. Instead of 4 separate unmanaged pointers, you could have a single pointer to a block containing all 4 vector values.

Member Author

I've updated the code to use a static readonly array, as I wasn't sure the ROS optimizations apply to anything other than byte?

Contributor

Yep, the compiler won't optimize anything bigger than byte because of endianness concerns. Since this code is processor-specific, you can outsmart the compiler by breaking the ulongs up into bytes (and reversing their byte order, ofc)
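The byte-order reversal mentioned here is just little-endian serialization: written out low byte first, the bytes reload as the original ulong on an x86 target. A small demonstration (the constant shown is illustrative, not taken from the PR):

```python
import struct

# Hypothetical 64-bit folding constant, purely for illustration:
k1 = 0x0000000154442BD4

# Little-endian byte decomposition: these are the bytes you would embed
# in a C# ReadOnlySpan<byte> so the processor reloads the original value.
le_bytes = struct.pack("<Q", k1)
assert le_bytes == bytes([0xD4, 0x2B, 0x44, 0x54, 0x01, 0x00, 0x00, 0x00])

# Round trip: reading the bytes back little-endian recovers the constant.
(reloaded,) = struct.unpack("<Q", le_bytes)
assert reloaded == k1
```

Since the intrinsics code paths only run on x86, assuming little-endian here is safe, which is what lets the ROS trick be applied to what are logically ulong constants.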

Member Author

I'll leave that for now since the massive hashing improvements have actually made less of an impact than I'd hoped. I'm going to profile now and find other soft targets.

Member

@antonfirsov antonfirsov left a comment

LGTM (without going into implementation details).


```csharp
// act
crc.Update(data);
// Longer run, enough to require moving the point in SIMD implementation with
```
Member

@antonfirsov antonfirsov May 18, 2020

Maybe there are some other corner cases worth testing?
E.g. data.Length == 0
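For reference, the expected values for a few such corner cases (shown here with Python's `zlib`, which implements the same checksum definitions, so a test could pin these down as constants):

```python
import zlib

assert zlib.adler32(b"") == 1                  # empty input: s1 = 1, s2 = 0
assert zlib.crc32(b"") == 0                    # empty input
assert zlib.adler32(b"a") == 0x00620062        # single byte: s1 = s2 = 0x62
assert zlib.crc32(b"123456789") == 0xCBF43926  # the standard CRC-32 check value
```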

Member Author

Yup, the current test isn't great. I'll add better reference tests.

@JimBobSquarePants
Member Author

JimBobSquarePants commented May 18, 2020

@saucecontrol Thanks for the review! Some great improvement ideas there. 👍 I've committed changes in one go rather than accepting individual suggestions as I didn't want to keep triggering the build systems.

@JimBobSquarePants JimBobSquarePants merged commit 4002a97 into master May 18, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/fast-hash branch May 18, 2020 21:13