This is an entry for Code Golf challenge Compute convolution quickly and accurately.
The algorithm used is Number Theoretic Transform, which is a variation of FFT that does not suffer from
floating-point inaccuracies. The modulo used is P = 71 * 2^57 + 1
which is slightly larger than 10^19
,
and I devised a specialized a * b % P
function here,
which gave ~4x speedup for the entire convolution compared to (a as u128 * b as u128 % P as u128) as u64
.
The main entrypoint for the algorithm is solve_ntt
.
Additionally, three variants of the same algorithm is implemented as
solve_ntt_par
,
solve_ntt_par2
,
and solve_ntt_par3
,
which are some attempts at applying data-independent parallelism using rayon
.
(Trying to parallelize individual butterflying turns out to be harmful, apparently because it causes much more cache misses.
Will it perform better if cache-line-sized chunks are used instead?)
Use cargo bench
to run benchmarks locally.
I don't claim this to be the fastest code that supports up to 10^19
in the result of convolution.
Even the core algorithm itself (Cooley-Tukey) can be substituted for different ones such as Stockham for significant improvement.