Add Hardware Accelerated Checksums #1201
Conversation
Really cool to see this happening! I left a few suggestions for improvements if you want to push it further.
```csharp
const byte S2301 = 0b1011_0001; // A B C D -> B A D C
const byte S1032 = 0b0100_1110; // A B C D -> C D A B
// …
v_s1 = Sse2.Add(v_s1, Sse2.Shuffle(v_s1, S2301));
```
```csharp
v_s1 = Sse2.Add(v_s1, Sse2.Shuffle(v_s1, S2301));
```

This was a mistake in the Chromium code. The odd elements of the s1 vector are always 0, so this shuffle/add pair doesn't do anything.
Suggested change:

```diff
-Vector128<int> v_ps = Vector128.CreateScalar(s1 * n).AsInt32();
-Vector128<int> v_s2 = Vector128.CreateScalar(s2).AsInt32();
-Vector128<int> v_s1 = Vector128<int>.Zero;
+Vector128<uint> v_ps = Vector128.CreateScalar(s1 * n);
+Vector128<uint> v_s2 = Vector128.CreateScalar(s2);
+Vector128<uint> v_s1 = Vector128<uint>.Zero;
```

The logic depends on these values not overflowing `uint.MaxValue` when processing `NMAX` bytes. Best to keep them unsigned all the way through.
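The bound behind that comment is zlib's derivation of `NMAX`: the worst case is every byte equal to 255 with `s1` starting at `BASE - 1`. A quick sketch of the derivation, assuming the standard Adler-32 modulus of 65521 (the code here is illustrative, not part of the PR):

```csharp
using System;

class Program
{
    static void Main()
    {
        // zlib defines NMAX as the largest n such that summing n bytes of
        // 0xFF cannot overflow a 32-bit unsigned s2 accumulator:
        //   255 * n * (n + 1) / 2 + (n + 1) * (BASE - 1) <= uint.MaxValue
        const ulong Base = 65521; // largest prime below 2^16 (Adler-32 modulus)

        ulong n = 0;
        while (255UL * (n + 1) * (n + 2) / 2 + (n + 2) * (Base - 1) <= uint.MaxValue)
        {
            n++;
        }

        Console.WriteLine(n); // 5552, the NMAX value zlib uses
    }
}
```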
```csharp
Vector128<short> mad1 = Ssse3.MultiplyAddAdjacent(bytes1, tap1);
v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));
// …
v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsInt32());
```

Suggested change:

```diff
-v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsInt32());
+v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes2, zero).AsUInt32());
```
```csharp
// …
// Horizontally add the bytes for s1, multiply-adds the
// bytes by [ 32, 31, 30, ... ] for s2.
v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
```

Suggested change:

```diff
-v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
+v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsUInt32());
```
```csharp
v_s2 = Sse2.Add(v_s2, Sse2.Shuffle(v_s2, S2301));
v_s2 = Sse2.Add(v_s2, Sse2.Shuffle(v_s2, S1032));
// …
s2 = (uint)v_s2.ToScalar();
```

Suggested change:

```diff
-s2 = (uint)v_s2.ToScalar();
+s2 = v_s2.ToScalar();
```
```csharp
// bytes by [ 32, 31, 30, ... ] for s2.
v_s1 = Sse2.Add(v_s1, Sse2.SumAbsoluteDifferences(bytes1, zero).AsInt32());
Vector128<short> mad1 = Ssse3.MultiplyAddAdjacent(bytes1, tap1);
v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));
```

Suggested change:

```diff
-v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones));
+v_s2 = Sse2.Add(v_s2, Sse2.MultiplyAddAdjacent(mad1, ones).AsUInt32());
```
```csharp
var tap1 = Vector128.Create(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);
var tap2 = Vector128.Create(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
```

`Vector128.Create()` with a lot of elements is quite inefficient on netcoreapp3.x. @tannergooding recently fixed it up for net5.0 in dotnet/runtime#35857, but you might want to use the ROS trick to load these from fixed data, since 3.1 will be around for a while.
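For reference, the "ROS trick" being suggested: the C# compiler turns a `static ReadOnlySpan<byte>` property backed by a constant `byte[]` initializer into a direct pointer into the assembly's static data section, with no allocation or per-element initialization. A minimal sketch (the names here are hypothetical, and the real code would want `sbyte` data to match `MultiplyAddAdjacent`):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class AdlerConstants
{
    // Compiles to a pointer into static data; works only for
    // single-byte element types because of endianness concerns.
    public static ReadOnlySpan<byte> Tap1 => new byte[]
    {
        32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17
    };
}

class Program
{
    static void Main()
    {
        // Load the constant straight from static data into a vector.
        Vector128<byte> tap1 = MemoryMarshal.Read<Vector128<byte>>(AdlerConstants.Tap1);
        Console.WriteLine(tap1.GetElement(0));  // 32
        Console.WriteLine(tap1.GetElement(15)); // 17
    }
}
```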
```csharp
if (length >= 16)
{
    s1 += Unsafe.Add(ref bufferRef, index++);
    s2 += s1;
    s1 += Unsafe.Add(ref bufferRef, index++);
    s2 += s1;
    // …
}
```

This was an odd choice by the Chromium dev(s). Since `NMAX` is an even multiple of 16 but not of 32, you could always safely process a 16-byte straggler as a half iteration of the SIMD loop. No need for this big manually-unrolled block.

For that matter, if the typical data length is greater than, say, 128 bytes, the algorithm would extend naturally to AVX2, which would allow 64 bytes per iteration using basically the same code.
```csharp
s1 += Unsafe.Add(ref bufferRef, index++);
s2 += s1;
s1 += Unsafe.Add(ref bufferRef, index++);
s2 += s1;
```

The JIT doesn't do great with code like this. On legacy jit32, it emits this:

```asm
L0031: movzx eax, byte [ebx+eax]
L0035: add edx, eax
L0037: add ecx, edx
L0039: mov eax, esi
L003b: inc esi
L003c: movzx eax, byte [ebx+eax]
L0040: add edx, eax
L0042: add ecx, edx
L0044: mov eax, esi
L0046: inc esi
…
```

RyuJIT does a bit better by not actually incrementing `index` each time, but it emits an extra `mov` for each block because it's confused by your two variables:

```asm
L0031: movzx r9d, byte ptr [rax+1]
L0036: add r9d, ecx
L0039: mov ecx, r9d
L003c: add r8d, ecx
L003f: movzx r9d, byte ptr [rax+2]
L0044: add r9d, ecx
L0047: mov ecx, r9d
L004a: add r8d, ecx
…
```

Writing it more like the C code:

```csharp
byte* pbuff = pbuffer + index;
if (length >= 16)
{
    s2 += (s1 += pbuff[0]);
    s2 += (s1 += pbuff[1]);
    s2 += (s1 += pbuff[2]);
    ...
}
```

gives better codegen in both:

```asm
L0042: movzx r9d, byte ptr [rcx+1]
L0047: add eax, r9d
L004a: add r8d, eax
L004d: movzx r9d, byte ptr [rcx+2]
L0052: add eax, r9d
L0055: add r8d, eax
```

It's also not clear that an unroll by 16 is ideal for modern processors; this may be an outdated optimization from zlib. Might be worth testing unroll by 8 or even 4 to see how they do, if you haven't already.
```csharp
fixed (ulong* k1k2Ptr = &k1k2[0])
fixed (ulong* k3k4Ptr = &k3k4[0])
fixed (ulong* k5k0Ptr = &k5k0[0])
fixed (ulong* polyPtr = &poly[0])
```

This is another place where the ROS trick would be of benefit. Instead of 4 separate unmanaged pointers, you could have a single pointer to a block containing all 4 vector values.
I've updated the code to use a static readonly array, as I wasn't sure the ROS optimizations apply to anything other than byte?
Yep, the compiler won't optimize anything bigger than byte because of endianness concerns. Since this code is processor-specific, you can outsmart the compiler by breaking the ulongs up into bytes (and reversing their byte order, of course).
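As a sketch of that conversion (the constant below is illustrative only, not one of the PR's actual CRC folding constants):

```csharp
using System;

class Program
{
    static void Main()
    {
        // Illustrative value only -- stands in for a real folding constant.
        ulong k = 0x0123456789ABCDEF;

        // A little-endian byte dump of the constant; these are the bytes
        // you would paste into a ReadOnlySpan<byte> property so the
        // compiler can place them directly in static data.
        byte[] bytes = BitConverter.GetBytes(k);
        Console.WriteLine(string.Join(", ", Array.ConvertAll(bytes, b => $"0x{b:X2}")));
        // On a little-endian machine: 0xEF, 0xCD, 0xAB, 0x89, 0x67, 0x45, 0x23, 0x01
    }
}
```

Note the byte order is reversed relative to how the literal is written, which is exactly the reversal the comment above refers to.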
I'll leave that for now since the massive hashing improvements have actually made less of an impact than I'd hoped. I'm going to profile now and find other soft targets.
LGTM (without going into implementation details).
```csharp
// …
// act
crc.Update(data);
// Longer run, enough to require moving the point in SIMD implementation with
```
Maybe there are some other corner cases worth testing? E.g. `data.Length == 0`.
Yup, the current test isn't great. I'll add better reference tests.
@saucecontrol Thanks for the review! Some great improvement ideas there. 👍 I've committed changes in one go rather than accepting individual suggestions, as I didn't want to keep triggering the build systems.
Prerequisites
Description
Added support for hardware accelerated Crc32 and Adler32 checksum generation, both used by our PNG codecs. They're fast!
Benchmarks
Adler32
Crc32