
Multi-Scalar-Multiplication / Linear combination #220

Merged: 35 commits merged on Feb 16, 2023

Conversation

@mratsim (Owner) commented Feb 15, 2023

Overview

This implements a fast single-threaded (for now) multi-scalar multiplication (MSM).

As mentioned in this EF presentation from 2020, the ecosystem has found techniques to remove FFTs altogether. In fact, Ethereum's KZG polynomial commitment for EIP-4844 used to require FFTs in 2020 (#151) but not at all in 2023 (https://github.com/ethereum/consensus-specs/blob/59129e4/specs/deneb/polynomial-commitments.md), thanks to constructing the polynomial in a specific way. Now 99% of the time of any ZK system is spent in MSM (also called FLC, for Fast Linear Combination, in the slide below).
[slide from the EF presentation]

The techniques used have been developed with easy porting to multi-core and GPU architectures in mind.
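
To make the rest of the PR easier to follow: the single-threaded MSM here is the bucket (a.k.a. Pippenger) method from the linked issue. Below is a minimal sketch of the windowing/bucketing idea over a toy additive group, where integers modulo a prime stand in for curve points; Constantine's actual implementation works on elliptic-curve points with signed digits and batched affine additions, and none of the names below are Constantine APIs.

```nim
# Bucket (Pippenger) MSM sketch: computes sum_i scalars[i] * points[i] with
# c-bit windows. Integers modulo a toy prime stand in for elliptic-curve
# points: `addP` plays the role of point addition, `dbl` of point doubling.
const M = 2305843009213693951'u64  # 2^61 - 1, toy modulus

proc addP(a, b: uint64): uint64 = (a + b) mod M
proc dbl(a: uint64): uint64 = addP(a, a)

proc msmBucket(scalars, points: openArray[uint64], c: int): uint64 =
  doAssert scalars.len == points.len
  let numWindows = (64 + c - 1) div c
  result = 0
  # Windows are processed from most significant to least significant.
  for w in countdown(numWindows - 1, 0):
    for _ in 0 ..< c:                       # shift the accumulator by c bits
      result = dbl(result)
    # Accumulate each point into the bucket indexed by its window digit.
    var buckets = newSeq[uint64](1 shl c)
    for i in 0 ..< scalars.len:
      let digit = (scalars[i] shr (w * c)) and ((1'u64 shl c) - 1)
      if digit != 0'u64:
        buckets[int(digit)] = addP(buckets[int(digit)], points[i])
    # Reduce the buckets: sum_d d * buckets[d] via a running suffix sum.
    var running = 0'u64
    var windowSum = 0'u64
    for d in countdown((1 shl c) - 1, 1):
      running = addP(running, buckets[d])
      windowSum = addP(windowSum, running)
    result = addP(result, windowSum)

when isMainModule:
  doAssert msmBucket([3'u64, 5, 7], [11'u64, 13, 17], 4) ==
           (3'u64*11 + 5'u64*13 + 7'u64*17) mod M
```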

Content

This PR goes a little all over the place.

  • While trying to find a signed-digit representation that slices the MSM without precomputation, and hence is suitable for GPUs (unlike NAF, which needs a carry), I found interesting signed digits for the pairings' Miller loop (but ended up using NAF anyway) and for randomizing/blinding in batch BLS signature verification. I also implemented wNAF (but it is unused 🤷). A small recoding sketch is included after this list.
  • This then led to a refactor of the Miller loop, and also to accelerating pairings for the EVM with a MillerAccumulator.
  • The PR also introduces the VarTime, Alloca and HeapAlloc effects so that the compiler can bubble up procs that use them (see the effect-tagging sketch after this list).
  • There are failed experiments (one batch affine, some signed-digit representations).
  • The PR introduces vartime field inversion for use in affine sums and MSM (and potentially pairings, but there it's only about 0.5% of the cost).
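
For reference on the recoding discussion in the first bullet, here is a minimal sketch of plain NAF recoding; the carry in the odd branch is exactly what makes it sequential and thus awkward on GPUs. Illustrative only, not Constantine's recoding code.

```nim
# Standard NAF recoding: digits in {-1, 0, 1}, no two adjacent non-zero digits.
# The carry (adding 1 back to n when the digit is -1) makes each digit depend
# on the previous step, so the recoding is inherently sequential.
proc naf(k: uint64): seq[int8] =
  var n = k
  while n > 0'u64:
    if (n and 1'u64) == 1'u64:
      # n is odd: pick d in {-1, 1} so that n - d is divisible by 4.
      let d = int8(2 - int(n and 3'u64))
      result.add d
      if d == 1:
        n -= 1'u64
      else:
        n += 1'u64       # d = -1, so n - d = n + 1: this is the carry
    else:
      result.add 0'i8
    n = n shr 1

when isMainModule:
  # 7 = 8 - 1, so the NAF digits (least significant first) are [-1, 0, 0, 1]
  doAssert naf(7'u64) == @[-1'i8, 0, 0, 1]
```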
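
The VarTime / Alloca / HeapAlloc effects in the third bullet build on Nim's tag-based effect tracking. A minimal sketch of that mechanism follows, with illustrative names (VarTime, invVartime, scalarMulCT are placeholders, not Constantine's actual declarations):

```nim
# Nim tag tracking: a custom effect is declared as an object inheriting from
# RootEffect, procs are annotated with {.tags: [...].}, and callers either
# inherit the tag (it "bubbles up") or forbid it with an empty tag list.
type VarTime = object of RootEffect  ## timing may depend on secret data

proc invVartime(x: int): int {.tags: [VarTime].} =
  ## Stand-in for a variable-time operation such as vartime field inversion.
  x + 1

proc scalarMulCT(x: int): int {.tags: [].} =
  ## Empty tag list: the compiler rejects calls to any tagged proc here,
  ## so a VarTime operation cannot sneak into this code path.
  # invVartime(x)   # uncommenting this is a compile-time error
  x * 2

proc msmVartime(x: int): int =
  ## No explicit tags: the VarTime effect of invVartime is inferred for
  ## this proc as well.
  invVartime(x) + scalarMulCT(x)

when isMainModule:
  echo msmVartime(3)
```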

Benches

  • BLST: `nim c -r -d:danger --passC:-march=native --hints:off --warnings:off --outdir:build benchmarks/bls12381_curve.nim`. Note that it's important to bench BLST with completely random data, as its MSM is not constant-time and the doubling path would be much faster. (This is a measurement issue I noted in Batch additions #207.)

    [benchmark screenshot]

  • Gnark, from 32 to 128 points: `go test -bench=MultiExpG1 -cpu 1 -run=^#`
    [benchmark screenshot]

  • Constantine, from 8 to 128 points: `nimble bench_ec_g1_msm_bls12_381`
    [benchmark screenshot]
    Here we're significantly faster than BLST and Gnark.

  • Gnark, from 256 to 8192 points
    [benchmark screenshot]

  • Constantine, same range
    [benchmark screenshot]
    Over 20% speedup over both.

  • Gnark, from 16384 to 262144 inputs
    [benchmark screenshot]

  • Constantine, same range
    [benchmark screenshot]
    Starting from 131071 points, Gnark takes the lead. The reason is unclear, but we start reaching L1 cache limits and the 64K aliasing-conflict boundary, so some tuning of the number of buckets might help; a rough window-size cost model is sketched after this list. See also https://www.youtube.com/watch?v=Bl5mQA7UL2I on why going beyond c=16 didn't help due to memory-bandwidth limitations.
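
As a rough illustration of the bucket-count trade-off above, here is a back-of-the-envelope cost model for picking the window size c. It deliberately ignores the cache and memory-bandwidth effects discussed in the last bullet, which is exactly why real tuning stops paying off around c = 16; the formula and the numbers it prints are an assumption-laden estimate, not Constantine's actual tuning code.

```nim
# Pippenger cost model: ceil(b/c) windows for b-bit scalars, each window
# costing ~n bucket accumulations plus ~2^(c-1) additions to reduce the
# signed-digit buckets, plus ~b doublings overall.
proc estimatedAdds(n, b, c: int): int =
  let windows = (b + c - 1) div c
  windows * (n + (1 shl (c - 1))) + b

proc bestWindow(n: int, b = 255): int =
  ## Picks the c in 2..20 that minimizes the estimate above.
  result = 2
  for c in 3 .. 20:
    if estimatedAdds(n, b, c) < estimatedAdds(n, b, result):
      result = c

when isMainModule:
  for logN in [7, 13, 18]:
    echo "n = 2^", logN, "  ->  c ≈ ", bestWindow(1 shl logN)
```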

cc @asn-d6 @yelhousni

@mratsim (Owner, Author) commented Feb 16, 2023

Some tuning to take back the perf crown on the 2¹⁶ to 2¹⁸ = 262144 points range.

[benchmark screenshots]

@paulmillr commented

@mratsim thoughts on this? https://eprint.iacr.org/2022/1400

@mratsim (Owner, Author) commented May 19, 2023

I'm aware of this optimization; it was also mentioned in https://zprize.hardcaml.com/msm-point-representation.html.

There are 3 issues:

  • The asymptotic cost of a twisted Edwards addition is 7M, while an affine addition is 6M. Affine addition is significantly harder to use, but asymptotically about 14% faster (1 - 6/7 ≈ 14.3%) once you reach the threshold (~50 points to add); see the batch-inversion sketch after this list.
  • A twisted Edwards representation is not universal: there is none for BLS12-381, for example, which is the curve I'm most interested in, IIRC because the curve has no point of order 2.
  • There is a cost to converting to Edwards coordinates. It might be negligible for a small number of points but becomes noticeable with thousands or millions of points. For example, I tested endomorphism acceleration in MSM, but the extra preprocessing was not worth it once you got into 10000k+ points, even though it divides the number of naive operations by 2.
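
For some context on the first bullet: the way affine addition gets its 6M cost in a batch setting is by sharing one field inversion across the whole batch (Montgomery's batch-inversion trick), and that sharing is why there is a threshold of roughly 50 points before it wins. A toy sketch over a small prime field follows; the modulus and helper names are illustrative, not Constantine's field code.

```nim
const P = 65537  # toy prime modulus; the real fields are 255 to 384 bits

proc mulmod(a, b: int): int = (a * b) mod P

proc powmod(a, e: int): int =
  ## Square-and-multiply exponentiation modulo P.
  var base = a mod P
  var exp = e
  result = 1
  while exp > 0:
    if (exp and 1) == 1:
      result = mulmod(result, base)
    base = mulmod(base, base)
    exp = exp shr 1

proc invmod(a: int): int = powmod(a, P - 2)  # Fermat's little theorem

proc batchInv(xs: openArray[int]): seq[int] =
  ## Montgomery's trick: inverts every (non-zero) element with a single
  ## modular inversion plus about 3 extra multiplications per element.
  let n = xs.len
  result = newSeq[int](n)
  var prefix = newSeq[int](n)
  var acc = 1
  for i in 0 ..< n:                 # prefix[i] = xs[0] * ... * xs[i]
    acc = mulmod(acc, xs[i])
    prefix[i] = acc
  var inv = invmod(acc)             # the single expensive inversion
  for i in countdown(n - 1, 0):
    if i == 0:
      result[i] = inv
    else:
      result[i] = mulmod(inv, prefix[i - 1])
      inv = mulmod(inv, xs[i])

when isMainModule:
  let xs = @[3, 5, 7, 11]
  let invs = batchInv(xs)
  for i in 0 ..< xs.len:
    doAssert mulmod(xs[i], invs[i]) == 1
```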

So I haven't implemented it, and I'm not hopeful.

Cc @gbotrel, @yelhousni

@yelhousni commented

> Cc @gbotrel, @yelhousni

Yup, I agree with @mratsim's comments: we ended up not merging this version into gnark-crypto because the affine version was faster in most of the use cases we were interested in for gnark.
For bullet point 3, the conversion was mainly for the sake of the ZPrize competition (points were given in affine short Weierstrass), but for SNARK applications we can store them already in the right representation (twisted Edwards with a = -1 in the custom coordinate system). These curve points are part of the SNARK setup.

Linked issue: multi-scalar multiplication / multi-exponentiations (a.k.a. Pippenger algorithm)