Implement number theoretic transform for large integer multiplication #282

byeongkeunahn · 2023-08-28T14:02:31Z

This commit implements the number theoretic transform (NTT) for large integer multiplication (issue #169).

To simplify the implementation, the Schönhage–Strassen algorithm is not used. Instead, three distinct 64-bit primes were chosen to enable NTTs of up to ~10^18 64-bit integers, which allows multiplication of up to ~5 x 10^17 64-bit integers. Depending on the input length, either two or three primes are used; three primes are needed only when the inputs consist of at least 2^40 64-bit integers. The convolution results modulo each prime are merged using the Chinese Remainder Theorem.
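
The CRT merge for the two-prime case can be sketched as follows. This is a minimal illustration, not the code in this PR: `crt2` and `inv_mod` are hypothetical names, the sketch uses plain `u128` arithmetic and an extended-Euclid inverse for readability, and it assumes both residues are already reduced.

```rust
// P1 and P2 are the first two moduli listed in this PR description.
const P1: u64 = 10_237_243_632_176_332_801;
const P2: u64 = 13_649_658_176_235_110_401;

/// Inverse of `a` modulo `m` via the extended Euclidean algorithm.
/// Hypothetical helper; requires gcd(a, m) == 1.
fn inv_mod(a: u64, m: u64) -> u64 {
    let (mut r0, mut r1) = (m as i128, (a % m) as i128);
    let (mut t0, mut t1) = (0i128, 1i128);
    // Invariant: r_i == t_i * a (mod m).
    while r1 != 0 {
        let q = r0 / r1;
        let r2 = r0 - q * r1;
        r0 = r1;
        r1 = r2;
        let t2 = t0 - q * t1;
        t0 = t1;
        t1 = t2;
    }
    assert_eq!(r0, 1, "inputs must be coprime");
    t0.rem_euclid(m as i128) as u64
}

/// Returns the unique x in [0, P1 * P2) with x % P1 == r1 and x % P2 == r2.
fn crt2(r1: u64, r2: u64) -> u128 {
    debug_assert!(r1 < P1 && r2 < P2);
    // r1 < P1 < P2, so r1 is already reduced modulo P2.
    let diff = if r2 >= r1 { r2 - r1 } else { P2 - (r1 - r2) };
    // t = (r2 - r1) * P1^{-1} mod P2
    let t = ((diff as u128 * inv_mod(P1, P2) as u128) % (P2 as u128)) as u64;
    // x = r1 + P1 * t; fits in u128 because P1 * P2 < 2^128.
    r1 as u128 + P1 as u128 * t as u128
}
```

In a convolution, `r1` and `r2` would be the same output coefficient computed modulo each prime; the merged value is exact as long as the true coefficient is below `P1 * P2`.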
To reduce padding and the number of cycles spent on the NTT, multiple radices are used (radix-2, radix-3, radix-4, radix-5, radix-6). Radix-8 would be desirable but is slower in practice, presumably due to register spilling. Although manual SIMD coding might alleviate this issue, it was not used (i) for maximum portability and (ii) because 64-bit SIMD multiplication is not widely available even on x86/x64 platforms (it needs AVX-512). The prime moduli are chosen to support these radices.
- The NTT length is selected by exhaustively evaluating cost estimates for all allowed lengths within a factor of two.
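
As a rough illustration of the length selection, the sketch below enumerates lengths of the form 2^a * 3^b * 5^c and keeps the smallest one that is at least `n`. The cost estimates used in this PR are not spelled out here, so the sketch makes the simplifying assumption that cost equals padded length; `smallest_smooth_len` is a hypothetical name, and exponent caps imposed by the moduli are ignored.

```rust
/// Smallest length of the form 2^a * 3^b * 5^c that is >= n.
/// Stand-in for a real cost model: "cheapest" here just means "least padding".
fn smallest_smooth_len(n: u64) -> u64 {
    assert!(n >= 1 && n <= 1 << 40, "sketch only handles moderate sizes");
    // A power of two >= n always qualifies, so every useful candidate
    // lies within a factor of two of n.
    let bound = n.next_power_of_two();
    let mut best = bound;
    let mut p5 = 1u64;
    while p5 <= bound {
        let mut p35 = p5; // 3^b * 5^c
        while p35 <= bound {
            // Multiply by 2 until the candidate covers n.
            let mut len = p35;
            while len < n {
                len *= 2;
            }
            if len < best {
                best = len;
            }
            p35 *= 3;
        }
        p5 *= 5;
    }
    best
}
```

For example, an input needing 17 slots would be padded to 18 (2 * 3^2) rather than to 32, which is what a radix-2-only transform would require.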
Single-word (u64) Montgomery reduction is used for fast modular multiplication.
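
A minimal sketch of single-word Montgomery multiplication with R = 2^64 is shown below. The function names (`redc`, `mont_mul`, `to_mont`, `from_mont`) are hypothetical and the code is not taken from this PR; it uses the subtractive form of the reduction because the moduli exceed 2^63, so the usual `T + m * P` sum could overflow `u128`.

```rust
// First modulus from this PR description; any odd modulus below 2^64 works.
const P: u64 = 10_237_243_632_176_332_801;

/// P^{-1} mod 2^64 by Newton iteration: each step doubles the correct bits.
const fn inv_mod_2_64(p: u64) -> u64 {
    let mut x: u64 = 1; // correct modulo 2 because p is odd
    let mut i = 0;
    while i < 6 {
        x = x.wrapping_mul(2u64.wrapping_sub(p.wrapping_mul(x)));
        i += 1;
    }
    x
}
const P_INV: u64 = inv_mod_2_64(P);

/// Montgomery reduction: returns t * 2^{-64} mod P, for t < P * 2^64.
fn redc(t: u128) -> u64 {
    let lo = t as u64;
    let hi = (t >> 64) as u64;
    // m * P has the same low word as t, so t - m * P is divisible by 2^64.
    let m = lo.wrapping_mul(P_INV);
    let mp_hi = ((m as u128 * P as u128) >> 64) as u64;
    if hi >= mp_hi {
        hi - mp_hi
    } else {
        P - (mp_hi - hi)
    }
}

/// Product of two values in Montgomery form (both < P).
fn mont_mul(a: u64, b: u64) -> u64 {
    redc(a as u128 * b as u128)
}

/// Converts a < P into Montgomery form (a * 2^64 mod P).
fn to_mont(a: u64) -> u64 {
    (((a as u128) << 64) % (P as u128)) as u64
}

/// Converts back from Montgomery form.
fn from_mont(a: u64) -> u64 {
    redc(a as u128)
}
```

Values would normally stay in Montgomery form for the whole transform, so the conversion cost is paid once per element rather than once per multiplication.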
32-bit digits are supported by repacking the digits into u64, running the u64 algorithm, and converting the result back to u32. This makes 32-bit builds about 3-5x slower than 64-bit builds, which is still an improvement over the existing algorithms.
Unbalanced multiplication is enabled when the cost estimates are favorable.
Based on experimentation, the following thresholds are chosen:
- For u64 digits, switch to NTT if the shorter integer has at least 512 digits.
- For u32 digits, switch to NTT if the shorter integer has at least 2,048 digits.
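
The dispatch implied by these thresholds can be sketched as below. `DigitWidth`, `ntt_threshold` and `should_use_ntt` are hypothetical names used only for illustration; the sketch also ignores the separate cost-estimate check for unbalanced inputs.

```rust
/// Digit width of the build (hypothetical enum for this sketch).
#[derive(Clone, Copy)]
enum DigitWidth {
    U32,
    U64,
}

/// Minimum length of the shorter operand, in digits, before NTT is used.
fn ntt_threshold(width: DigitWidth) -> usize {
    match width {
        DigitWidth::U64 => 512,
        DigitWidth::U32 => 2048,
    }
}

/// Decides whether to switch to NTT based on the shorter operand.
fn should_use_ntt(a_len: usize, b_len: usize, width: DigitWidth) -> bool {
    a_len.min(b_len) >= ntt_threshold(width)
}
```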
The three primes are as follows:
- P1 = 10_237_243_632_176_332_801, Max NTT length = 2^24 * 3^20 * 5^2 = 1_462_463_376_025_190_400
- P2 = 13_649_658_176_235_110_401, Max NTT length = 2^26 * 3^19 * 5^2 = 1_949_951_168_033_587_200
- P3 = 14_259_017_916_245_606_401, Max NTT length = 2^22 * 3^21 * 5^2 = 1_096_847_532_018_892_800
- P1 and P2 are used for the two-prime NTT, whereas the three-prime NTT uses all three.
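
An NTT of length N modulo a prime P needs a primitive N-th root of unity, which exists exactly when N divides P - 1. The short check below confirms that each listed maximum length divides the corresponding P - 1; `max_ntt_len` and `cofactor` are hypothetical helper names, and the check does not test primality.

```rust
/// (modulus, exponent of 2, exponent of 3, exponent of 5) as listed above.
const MODULI: [(u64, u32, u32, u32); 3] = [
    (10_237_243_632_176_332_801, 24, 20, 2),
    (13_649_658_176_235_110_401, 26, 19, 2),
    (14_259_017_916_245_606_401, 22, 21, 2),
];

/// Maximum NTT length 2^e2 * 3^e3 * 5^e5 (fits in u64 for the values above).
fn max_ntt_len(e2: u32, e3: u32, e5: u32) -> u64 {
    2u64.pow(e2) * 3u64.pow(e3) * 5u64.pow(e5)
}

/// Returns (p - 1) / max_len if the division is exact, otherwise None.
fn cofactor(p: u64, e2: u32, e3: u32, e5: u32) -> Option<u64> {
    let len = max_ntt_len(e2, e3, e5);
    if (p - 1) % len == 0 {
        Some((p - 1) / len)
    } else {
        None
    }
}
```

For the three moduli above the quotients are 7, 7 and 13, so each prime has the form k * 2^a * 3^b * 5^2 + 1 with a small k.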
The following MIT-licensed projects are used as reference:
- http://wwwa.pikara.ne.jp/okojisan/otfft-en/index.html
- https://github.com/Bubbler-4/cg263748

On a Ryzen 7 2700X (64-bit build), a 2.7 Mbit x 2.7 Mbit multiplication takes about 15 ms and a 27 Mbit x 27 Mbit multiplication takes about 170 ms. This seems comparable to GMP 6.2.1.

HKalbasi · 2024-12-13T08:33:02Z

Does the chart show that the current algorithm is faster than GMP? That's impressive.

HKalbasi · 2024-12-13T09:09:01Z

I ran the fib_hex 100m benchmark from https://github.com/tczajka/bigint-benchmark-rs on this PR, and it made num-bigint twice as fast as malachite, slightly faster than GMP, and 12x faster than num-bigint without this PR.

byeongkeunahn added 10 commits August 27, 2023 21:21

Use number theoretic transform for multiplication

b633aea

Use 2 primes to multiply short arrays

98eba1c

Previously 3 primes were used, which was suboptimal in terms of speed. Currently, the threshold for switching from 2 to 3 primes is 2^38.

Speed up unbalanced multiplication (1)

5d8b725

Despite the simple implementation with obvious inefficiencies (e.g., not reusing the NTT of the shorter array), this leads to speed gains in multiple benchmarks, although there is a small regression in others.

Support 32bit BigDigit

5abc879

Fix clippy warnings

5d2bdd5

Fix multiplication overflow on 32bit

09edfac

Speed up unbalanced multiplication (2)

6700b64

Adjust NTT threshold for u32 digits

dc73b87

Add more benchmarks for large integers

c803e43

Update three-prime threshold (44 -> 43)

701bdbc

byeongkeunahn changed the title ~~Implement number theroetic transform for large integer multiplication~~ Implement number theoretic transform for large integer multiplication Aug 28, 2023

byeongkeunahn added 19 commits August 28, 2023 23:56

Update multiplication.rs

a888b8e

Speed up unbalanced multiplication (3)

d9970e0

Add DIF-DIT, optimize CRT, etc.

e441c92

Reduce memory access

ebc5f0d

Optimize add-with-carry

7e6f558

Optimize base case multiplication

4d4c0dc

Update ntt.rs

253031f

Speed up bit repacking

a6ca654

Share the same Vec for all twiddle factors

3b32e93

Pack more bits per one u64 digit if possible

d3b478f

The prime numbers were replaced by larger ones to allow for tighter packing. Also, we compute the maximum number of bits that can be packed into one digit more precisely.

Don't use intermediate buffer for conv_base

6d4acd0

Remove unnecessary operation

de73df6

Fix NTT planning bug

5350ae6

Optimize base case multiplication

e60b678

Replace some addmodopt calls with submod

4475f49

Improve NTT planning

f782e93

Reduce constant multiplication operations

6370aa0

Simplify code

4343d5f

Simplify code

f0b7f96

byeongkeunahn added 27 commits September 19, 2023 16:14

Update ntt.rs

2f7f1dd

Refactor & fix potential carry bug

6871c4d

Update ntt.rs

cac1830

Breaking at this point is the right thing to do since future encounters will all `continue`.

Make ntt.rs shorter

1d480ff

Make ntt.rs shorter

69c4ec6

Make ntt.rs shorter

bad433f

Merge pull request #2 from byeongkeunahn/develop

7dfa583

Make ntt.rs shorter

Make ntt.rs shorter

f207402

Merge pull request #3 from byeongkeunahn/develop-2

a127070

Make ntt.rs shorter

Update ntt.rs

d7d3de1

Make ntt.rs shorter

eecdf92

Make ntt.rs shorter

a2426fa

Merge pull request #4 from byeongkeunahn/develop-3

86bb2dc

Make ntt.rs shorter

Make ntt.rs shorter

76b5add

Make ntt.rs shorter

1e41e16

Improve compile time

a8ee9ff

Merge pull request #5 from byeongkeunahn/develop-4

0d2a14b

Make ntt.rs shorter

Make ntt.rs shorter

5f564de

A very slight optimization

c85db2b

Make ntt.rs shorter

deacbed

Merge pull request #6 from byeongkeunahn/develop-5

1e71e01

Make ntt.rs shorter

Make ntt.rs shorter

8ecb65d

Make ntt.rs shorter

3f8c2c5

Merge pull request #7 from byeongkeunahn/develop-6

4320297

Make ntt.rs shorter

Improve NTT planning

2f4460b

Merge pull request #8 from byeongkeunahn/develop-7

3bb14f7

Improve NTT planning

Fix NTT pack/unpack bug with u32 digits

2291466

HKalbasi mentioned this pull request Dec 13, 2024

Convert to base 10 is significantly slower than similar libraries #315

Open
