A couple of months back I wrote about Allocation free algorithms, and took a look at an example of calculation the checksum of FIX messages. The end results are utilizing the latest C# 7.3 language features.
For many reasons that implementation seems a reasonable algorithm and runs quite ok and without managed heap allocations. However, it could be made even faster, and in this post, I will take a look how. I will use Vectors which is available in the System.Numerics namespace.
Vector<T>
provides an abstractions over data, to enable data parallelism known as Single Instruction Multiple Data (SIMD). The nice thing about Vector and the CLR itself, that it can leverage the SIMD capability of the processor itself, thus providing access to low-level hardware acceleration in a managed language.
Let me begin with the original algorithm. The IsValid
method receives a Span<T>
of input data to be validated and an integer named checksumTagStartIndex which indicates the position of the checksum tag appearing in the data.
The checksum tag is the last tag of a FIX messages. Its tag is 10 and the value is a 3-digit integer. The value is calculated by summing the byte values from the beginning of the message up to the checksum tag and taking a modulo of 256 operation on the calculated value. This value is then represented as a 3-digit integer encoded with ASCII.
The method first validates that the checksumTagStartIndex given is really the last tag of the input data. Then it uses a simple loop to sum the bytes up to the checksum tag. It calculates the modulo operation, and it converts the representation of the result using only the stack. Finally, it compares the given input checksum with the calculated one.
public bool IsValid(Span<byte> data, int checksumTagStartIndex)
{
if(checksumTagStartIndex < 0 || (checksumTagStartIndex + ChecksumLength) != data.Length)
{
return false;
}
int sum = 0;
for(int i = 0; i < checksumTagStartIndex; i++)
{
sum += data[i];
}
int expectedChecksum = sum % Modulus;
Span<byte> expectedDigits = stackalloc byte[ChecksumValueLength];
_converter.Convert(number: expectedChecksum, into: expectedDigits, count: ChecksumValueLength);
var receivedChecksum = data.Slice(checksumTagStartIndex + 3, ChecksumValueLength);
return receivedChecksum.SequenceEqual(expectedDigits);
}
In the first attempt, I will show a naive vectorization. The goal is to increase the throughput of the for
loop from the previous solution. If we could sum multiple byte at the same time, we could improve the performance significantly. The code looks a little more complicated at first sight:
public bool IsValidSimd(Span<byte> input, int checksumTagStartIndex)
{
if(checksumTagStartIndex < 0 || (checksumTagStartIndex + ChecksumLength) != input.Length)
{
return false;
}
var data = input.Slice(0, checksumTagStartIndex);
int vectorLength = Vector<byte>.Count;
Vector<ushort> sumV1 = Vector<ushort>.Zero;
Vector<ushort> sumV2 = Vector<ushort>.Zero;
while(data.Length >= vectorLength)
{
var vectorizedData = new Vector<byte>(data);
Vector.Widen(vectorizedData, out Vector<ushort> left, out Vector<ushort> right);
sumV1 += left;
sumV2 += right;
data = data.Slice(vectorLength);
}
int sum = 0;
for(int i = 0; i < data.Length; i++)
{
sum += data[i];
}
sum += Vector.Dot(Vector<ushort>.One, sumV1) + Vector.Dot(Vector<ushort>.One, sumV2);
int expectedChecksum = sum % Modulus;
Span<byte> expectedDigits = stackalloc byte[ChecksumValueLength];
_converter.Convert(number: expectedChecksum, into: expectedDigits, count: ChecksumValueLength);
var receivedChecksum = input.Slice(checksumTagStartIndex + 3, ChecksumValueLength);
return receivedChecksum.SequenceEqual(expectedDigits);
}
The beginning and the end of the code looks the same, while the sum calculation has changed. First, we will need to look at how many bytes can the given hardware store in its SIMD register Vector<byte>.Count
. Then we initialize two unsigned shorts vectors of zeros: sumV1 and sumV2. In the while loop we can read vectorLength of bytes from the input data to a temporary vector. Then we Widen these bytes to ushorts: this operation says that we want to handle the bytes as ushorts (like an up cast). As ushort is a 16-bit value, a vector could only handle half as many values as with the 8-bit bytes. So, the output is 2 Vector<ushort>
named left and right. The rest of the while loop adds the values to the cumulative variables and slices the next part of the input message. The second for loop is needing to sum the last slice of data, that is smaller than vectorLength. Finally, to have a single sum value we can use the Dot product of a one Vector and the cumulative values.
Let's compare the two approaches. I use BenchmarkDotNet to benchmark both methods. I use 3 sample fix messages:
Sample0 = "35=8|49=PHLX|20=3|167=CS|54=1|38=15|58=PHLX EQUITY TESTING|59=0|47=C|32=0|31=0|151=15|14=0|6=0|";
Sample1 = "35=8|49=PHLX|56=PERS|52=20071123-05:30:00.000|11=ATOMNOCCC9990900|20=3|150=E|39=E|55=MSFT|167=CS|54=1|38=15|40=2|44=15|58=PHLX EQUITY TESTING|59=0|47=C|32=0|31=0|151=15|14=0|6=0|";
Sample2 = "35=8|49=PHLX|56=PERS|52=20071123-05:30:00.000|11=ATOMNOCCC9990900|20=3|150=E|39=E|55=MSFT|167=CS|54=1|38=15|40=2|44=15|58=PHLX EQUITY TESTING|59=0|47=C|32=0|31=0|151=15|14=0|6=0|35=8|49=PHLX|56=PERS|52=20071123-05:30:00.000|11=ATOMNOCCC9990900|20=3|150=E|39=E|55=MSFT|167=CS|54=1|38=15|40=2|44=15|58=PHLX EQUITY TESTING|59=0|47=C|32=0|31=0|151=15|14=0|6=0|";
and the two methods benchmarked:
[Benchmark(Baseline = true, Description = "Regular")]
public void Validate() => _validator.IsValid(_message, _checksumStart);
[Benchmark(Description = "SIMD")]
public void ValidateSIMD() => _validator.IsValidSimd(_message, _checksumStart);
The results indicate that the for the shortest message the original approach is slighlty faster, but for the largest message, there is already a 33% improvement:
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4), VM=Hyper-V
Intel Xeon CPU E5-2673 v3 2.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET Core SDK=2.2.100
[Host] : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Core : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Job=Core Runtime=Core
Method | Sample | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
Regular | 35=8(...) [95] | 116.3 ns | 2.013 ns | 1.883 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [95] | 133.3 ns | 2.638 ns | 3.037 ns | 1.14 | 0.03 |
Regular | 35=8(...) [178] | 167.5 ns | 3.274 ns | 4.141 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [178] | 140.2 ns | 2.723 ns | 2.674 ns | 0.84 | 0.02 |
Regular | 35=8(...) [356] | 274.4 ns | 4.755 ns | 4.448 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [356] | 182.9 ns | 4.034 ns | 4.317 ns | 0.67 | 0.02 |
The improvement for the naive vectorization is clear for long messages, but can we still improve it? Giving it a second thought the previous vectorized algorithm has a limitation: ushort-s can represent values up to 65,535. What happens if the message is just too long, so a ushort is not big enough to represent the local sum? The variable will overflow. But we get lucky here, because 256*256=65536 and 256 is just the modulus we use later on. This means we don't even need to use ushort-s or int-s, we can simply use bytes if we can accept overflow.
This brings us to the following implementation:
public bool IsValidSimdSlim(Span<byte> input, int checksumTagStartIndex)
{
if(checksumTagStartIndex < 0 || (checksumTagStartIndex + ChecksumLength) != input.Length)
{
return false;
}
var data = input.Slice(0, checksumTagStartIndex);
int vectorLength = Vector<byte>.Count;
Vector<byte> sumV = Vector<byte>.Zero;
while(data.Length >= vectorLength)
{
sumV += new Vector<byte>(data);
data = data.Slice(vectorLength);
}
byte sum = Vector.Dot(Vector<byte>.One, sumV);
for(int i = 0; i < data.Length; i++)
{
sum += data[i];
}
Span<byte> expectedDigits = stackalloc byte[ChecksumValueLength];
_converter.Convert(number: sum, into: expectedDigits, count: ChecksumValueLength);
var receivedChecksum = input.Slice(checksumTagStartIndex + 3, ChecksumValueLength);
return receivedChecksum.SequenceEqual(expectedDigits);
}
The beginning and the end of this method is still quite the same, and the middle is similar too. The difference to the previous solution is that we can use purely Vector<byte>
-s, thus we can parallelize more data at the same time, and we can omit the Widen method call. This method is even simpler and faster.
Adding the new IsValidSimdSlim
to the benchmark:
[Benchmark(Description = "Slim")]
public void ValidateSIMDSlim() => _validator.IsValidSimdSlim(_message, _checksumStart);
The results, indicate further improvement, especially for large messages, where the new approach is nearly twice as fast:
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4), VM=Hyper-V
Intel Xeon CPU E5-2673 v3 2.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET Core SDK=2.2.100
[Host] : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Core : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Job=Core Runtime=Core
Method | Sample | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
Regular | 35=8(...) [95] | 115.3 ns | 2.307 ns | 2.369 ns | 115.8 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [95] | 128.9 ns | 2.596 ns | 3.465 ns | 127.8 ns | 1.12 | 0.04 |
Slim | 35=8(...) [95] | 113.8 ns | 2.355 ns | 2.203 ns | 113.3 ns | 0.99 | 0.03 |
Regular | 35=8(...) [178] | 167.8 ns | 3.191 ns | 2.985 ns | 167.2 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [178] | 151.0 ns | 5.500 ns | 16.218 ns | 144.1 ns | 0.84 | 0.03 |
Slim | 35=8(...) [178] | 135.6 ns | 2.320 ns | 2.057 ns | 136.1 ns | 0.81 | 0.02 |
Regular | 35=8(...) [356] | 280.0 ns | 5.463 ns | 5.366 ns | 279.8 ns | 1.00 | 0.00 |
SIMD | 35=8(...) [356] | 184.2 ns | 3.051 ns | 2.854 ns | 184.2 ns | 0.66 | 0.02 |
Slim | 35=8(...) [356] | 158.5 ns | 3.115 ns | 3.587 ns | 157.6 ns | 0.56 | 0.02 |
Finally, I ran a test on different machine as well, which is my current Surface Pro. The results of the benchmark are attached below. We can see that with this Intel processor the performance gain is even bigger.
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.194 (1809/October2018Update/Redstone5)
Intel Core i5-6300U CPU 2.40GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.2.101
[Host] : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Core : .NET Core 2.2.0 (CoreCLR 4.6.27110.04, CoreFX 4.6.27110.04), 64bit RyuJIT
Job=Core Runtime=Core
Method | Sample | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Regular | 35=8(...) [95] | 112.47 ns | 0.5235 ns | 0.4897 ns | 1.00 |
SIMD | 35=8(...) [95] | 97.69 ns | 0.2650 ns | 0.2349 ns | 0.87 |
Slim | 35=8(...) [95] | 81.47 ns | 0.2168 ns | 0.2028 ns | 0.72 |
Regular | 35=8(...) [178] | 170.35 ns | 0.5023 ns | 0.4698 ns | 1.00 |
SIMD | 35=8(...) [178] | 104.95 ns | 0.2345 ns | 0.2193 ns | 0.62 |
Slim | 35=8(...) [178] | 91.37 ns | 0.1781 ns | 0.1666 ns | 0.54 |
Regular | 35=8(...) [356] | 283.11 ns | 1.0103 ns | 0.9450 ns | 1.00 |
SIMD | 35=8(...) [356] | 137.52 ns | 0.4837 ns | 0.4288 ns | 0.49 |
Slim | 35=8(...) [356] | 116.93 ns | 0.3978 ns | 0.3526 ns | 0.41 |
The full source code is available on GitHub.