
BigInteger performance improvements #41495

Open
Grevor opened this issue Aug 28, 2020 · 5 comments

Grevor commented Aug 28, 2020

The BigInteger class is already fast, but after doing a bit-twiddling kata/review on the source code I realized there were some further improvements to be made. Because of the ongoing effort to spanify BigInteger (#35565), I will leave the proposed code changes and their rationales below, with performance measurements given as deltas from the existing code. I see no reason the same shouldn't apply to the spanified version as well. The code is spanified for later convenience.

All code can be found in the BigIntegerCalculator.* files.

If the proposed changes are accepted (with some further refinements, of course), I am willing to do the implementation on top of the spanification once it is merged. Further benchmarking also needs to be done to ensure this is not a "my machine" thing.

Configuration

.NET 5.0 (from repository, commit 3ac735b)
Windows 10 19041 x64
Intel i7-8700K (Unclocked for the benchmark runs)

Add

Add should check for carry == 0 in the last loop of both the half-trivial case and the "full add", so the loop can exit as soon as the carry is exhausted:

            int i = 0;
            for ( ; carry != 0 & i < left.Length; i++)
            {
                long digit = left[i] + carry;
                bits[i] = unchecked((uint)digit);
                carry = digit >> 32;
            }
            left.Slice(i).CopyTo(bits.Slice(i));
            bits[left.Length] = (uint)carry;

Note the use of a single &. This is to keep the loop at a single branch instruction. The JIT seems to fold this nicely in the x64 disassembly (it's one more instruction). The single & could also be used in AddSelf. In general, the carry will very quickly reach zero, so this should enable big + small additions to mostly become a memmove. For some reason the change also gave better performance in cases I did not expect. The results are repeatable with new baselines as well.
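The early-exit idea can be sketched in C as a stand-alone helper (hypothetical names, not the actual BigIntegerCalculator code): the carry loop runs only while the carry is nonzero, after which the remaining limbs are a straight copy.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the proposed Add loop: propagate the carry only
   while it is nonzero, then bulk-copy the untouched limbs.
   left has n limbs; bits has room for n + 1. */
static void add_carry_then_copy(const uint32_t *left, int n,
                                uint32_t carry, uint32_t *bits)
{
    int i = 0;
    /* Single & keeps this at one branch, as in the C# proposal. */
    for (; (carry != 0) & (i < n); i++)
    {
        uint64_t digit = (uint64_t)left[i] + carry;
        bits[i] = (uint32_t)digit;
        carry = (uint32_t)(digit >> 32);
    }
    /* Carry exhausted: the tail is a plain copy (memmove in .NET). */
    memcpy(bits + i, left + i, (size_t)(n - i) * sizeof(uint32_t));
    bits[n] = carry;
}
```

For a big + small addition the loop typically terminates after a handful of limbs, leaving the bulk of the work to the copy.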

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536,65536 bits)      |      1.14 |          2403.37 |          2102.15 |         |
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 16,16 bits)            |      1.14 |             7.30 |             6.43 |         |

Subtract

The same as for Add. The carry will tend toward zero.

The results are repeatable.
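The same shape works for subtraction; a minimal C sketch (hypothetical helper, using an unsigned borrow bit rather than the signed carry the actual C# code carries) would be:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the early-exit Subtract loop: propagate the
   borrow only while it is nonzero, then bulk-copy the remaining limbs.
   Assumes left >= the subtracted value, as BigIntegerCalculator does. */
static void sub_borrow_then_copy(const uint32_t *left, int n,
                                 uint32_t borrow, uint32_t *bits)
{
    int i = 0;
    for (; (borrow != 0) & (i < n); i++)
    {
        uint64_t digit = (uint64_t)left[i] - borrow;  /* wraps mod 2^64 */
        bits[i] = (uint32_t)digit;
        borrow = (uint32_t)((digit >> 32) & 1);       /* 1 iff left[i] < borrow */
    }
    memcpy(bits + i, left + i, (size_t)(n - i) * sizeof(uint32_t));
}
```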

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 16,16 bits)       |      1.07 |             7.60 |             7.07 |         |
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits) |      1.14 |          2391.64 |          2094.65 |         |

Divide

This is where it gets wild. Changing SubtractDivisor accordingly removes a TON of branch mispredictions. Measurements are needed on 32-bit machines due to the 64-bit operations.

            ulong carry = 0UL;

            for (int i = 0; i < right.Length; i++)
            {
                carry += right[i] * q;
                uint digit = unchecked((uint)carry);
                carry = carry >> 32;
                ulong newDigit = unchecked((ulong)left[i] - digit);
                carry += (newDigit >> 32) & 0x1; // Same as: if (left[i] < digit) ++carry
                left[i] = unchecked((uint)newDigit);
            }

            return (uint)carry;

The results are a bit varied on this one, surprisingly. I do not fully understand why the first case degrades so much. It did have a much lower misprediction rate, so there could be further optimizations to be made, perhaps by checking the value of q and selecting either the current code or the proposed one (for example, single-binary-digit q values give a non-uniform distribution under truncated multiplication with random left and right, i.e. good for the current algorithm). On average I would expect the proposed code to perform better, as the ratio of true to false branches is fairly close to 1 on average (with a slight overweight on branch taken).
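The branchless borrow relies on the identity that, for 32-bit operands widened to 64 bits, bit 32 of the wrapped difference is exactly the borrow bit. A small C check of that identity (hypothetical helper name):

```c
#include <stdint.h>

/* For 32-bit left and digit widened to 64 bits, the subtraction wraps
   mod 2^64 on underflow, so ((left - digit) >> 32) & 1 equals
   (left < digit). This is the conditional increment removed from the
   proposed SubtractDivisor loop. */
static uint64_t branchless_borrow(uint32_t left, uint32_t digit)
{
    uint64_t newDigit = (uint64_t)left - digit;
    return (newDigit >> 32) & 0x1;
}
```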

| Slower                                                                 | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 1024,512 bits) |      1.27 |           524.26 |           665.83 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 1024,512 bits)        |      1.31 |           524.66 |           685.61 |         |

| Faster                                                                    | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 65536,32768 bits) |      2.63 |       4249482.81 |       1617236.56 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 65536,32768 bits) |      2.61 |       4257183.59 |       1633850.63 |         |

Multiply

The last one I have is uncertain. It has to do with loop unrolling in the trivial case of Multiply: peeling the first pass enables the use of a dirty result buffer. The effects can only be seen in ModPow (with the zeroing of the BitBuffers removed). However, there are some potential negative effects due to increased code size (double the instruction cache misses? Hard to say; processors are good at predictively streaming into the caches these days).

                if (right.Length <= 0)
                {
                    result.Clear();
                    return;
                }
                // Implied i = 0 and result = 0
                ulong carry = 0UL;
                for (int j = 0; j < left.Length; j++)
                {
                    ulong digits = carry
                        + (ulong)left[j] * right[0];
                    result[j] = unchecked((uint)digits);
                    carry = digits >> 32;
                }
                result[left.Length] = (uint)carry;

                for (int i = 1; i < right.Length; i++)
                {
                    carry = 0UL;
                    for (int j = 0; j < left.Length; j++)
                    {
                        ulong digits = result[i + j] + carry
                            + (ulong)left[j] * right[i];
                        result[i + j] = unchecked((uint)digits);
                        carry = digits >> 32;
                    }
                    result[i + left.Length] = (uint)carry;
                }
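The peeling can be sketched in C as a self-contained helper (hypothetical names, assuming rightLen >= 1): the i == 0 pass writes the result limbs directly instead of accumulating into them, so whatever garbage the buffer held beforehand never feeds into the computation.

```c
#include <stdint.h>

/* Hypothetical sketch of the peeled schoolbook multiply. result must
   hold leftLen + rightLen limbs and may be dirty (uninitialized). */
static void multiply_peeled(const uint32_t *left, int leftLen,
                            const uint32_t *right, int rightLen,
                            uint32_t *result)
{
    /* First pass (i == 0): plain writes, no reads of the dirty buffer. */
    uint64_t carry = 0;
    for (int j = 0; j < leftLen; j++)
    {
        uint64_t digits = carry + (uint64_t)left[j] * right[0];
        result[j] = (uint32_t)digits;
        carry = digits >> 32;
    }
    result[leftLen] = (uint32_t)carry;

    /* Remaining passes accumulate into limbs written by earlier passes. */
    for (int i = 1; i < rightLen; i++)
    {
        carry = 0;
        for (int j = 0; j < leftLen; j++)
        {
            uint64_t digits = result[i + j] + carry
                + (uint64_t)left[j] * right[i];
            result[i + j] = (uint32_t)digits;
            carry = digits >> 32;
        }
        result[i + leftLen] = (uint32_t)carry;
    }
}
```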

| Faster                                                                       | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 1024,1024,64 bits)   |      1.14 |        130853.59 |        114545.93 |         |
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 16384,16384,64 bits) |      1.09 |       2186651.56 |       1998425.00 |         |
@Grevor Grevor added the tenet-performance Performance related issue label Aug 28, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Aug 28, 2020
@ghost

ghost commented Aug 28, 2020

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding added this to the Future milestone Sep 14, 2020
@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Sep 14, 2020
@tannergooding
Member

I'm generally happy to take perf PRs, provided all tests pass and numbers look good (both local and those on the perf lab hardware).

@speshuric
Contributor

@adamsitnik please take a look. It seems that the issue can be closed with #83457

@adamsitnik
Member

> @adamsitnik please take a look. It seems that the issue can be closed with #83457

It looks like #83457 has included the + and - perf improvements; how about * and /? It would be great to try them and see what perf benefits we can get. Since this seems to be a small change, I can give it a quick try and report back.

@speshuric
Contributor

Yes, it needs to be clarified.
But the Karatsuba/Toom-Cook branch already uses the Add method, which contains this improvement.
