Parallel Multi-Scalar-Multiplication #226
Benchmarks

Configuration
Curve: BLS12-381 G1

We benchmark against Gnark, which is the fastest multithreaded framework. Single-threaded benches are available in #220 and on zkalc: https://crypto.ethereum.org/blog/zkalc

Constantine vs Gnark:
- 32 inputs: Constantine is 1.50x faster
- 64 inputs: 1.38x faster
- 128 inputs: 1.10x faster
- 256 inputs: 1.007x faster
- 512 inputs: 1.24x faster
- 1024 inputs: 1.14x faster
- 2048 inputs: 1.19x faster
- 4096 inputs: 1.03x faster
- 8192 inputs: 1.16x faster
- 16384 inputs: 1.028x faster
- 32768 inputs: 1.12x faster
- 65536 inputs: 1.13x faster
- 131072 inputs: 1.016x faster
Arkworks
Arkworks is the reference ecosystem for zero-knowledge proofs. Unfortunately, Arkworks does not seem to have a multithreaded MSM implementation.

Bellman
Bellman is the Zcash backend for zero-knowledge proofs. Their MSM bench uses 65536 (1 << 16) inputs: https://github.com/zkcrypto/bellman/blob/e137775/benches/slow.rs#L14-L44. It is multicore-enabled by default: https://github.com/zkcrypto/bellman/blob/e137775/Cargo.toml#L44
Constantine is 4.06x faster than Bellman.

Bellman CE
Bellman CE is a fork of Bellman by Matter Labs, with a focus on BN254_Snarks for the zkSync Ethereum L2.
Constantine is 2.46x faster than Bellman CE.

Barretenberg
Barretenberg is a C++ library by Aztec Protocol, with a focus on BN254_Snarks for the Aztec Ethereum L2.
Performance varies between 1.035x slower and 1.19x faster.

vs BLST
BLST actually has a multithreaded implementation in Rust: https://github.com/supranational/blst/blob/e9dfc5e/bindings/rust/src/pippenger.rs#L56-L88

vs Gnark: beyond 512 points, BLST is significantly slower than Gnark.
vs Constantine: even between 32 and 512 points, Constantine is 1.26x to 1.51x faster.
As mentioned in #220, this is the largest bottleneck in zero-knowledge proofs. It is worth millions in prizes (https://zprize.io), and custom ASICs and GPU libraries are being developed for it.
We introduce the fastest™ CPU implementation on BLS12-381 G1 for small-scale MSMs, and come within 3.2% of the fastest for medium-scale MSMs (starting from 2^18 = 262144 points).
Overview
Multi-scalar-multiplication (MSM) in pseudocode
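As an illustrative stand-in for the pseudocode, here is a minimal, runnable Python sketch of the bucket (Pippenger-style) method. To keep it self-contained, "points" are plain integers under addition (a hypothetical stand-in, not Constantine's curve types); a real implementation performs elliptic curve additions, but the loop structure is the same.

```python
# Hedged sketch of the bucket method for multi-scalar multiplication.
# "Points" are plain integers under addition, so msm(s, P) == sum(s_i * P_i).

def msm(scalars, points, c=4):
    """Compute sum(s_i * P_i) with the bucket method, window size c bits."""
    assert len(scalars) == len(points)
    num_bits = max(scalars).bit_length() if scalars else 0
    num_windows = (num_bits + c - 1) // c
    window_sums = []
    for w in range(num_windows):                 # 0.a: miniMSMs accumulation loop
        buckets = [0] * ((1 << c) - 1)           # buckets 1 .. 2^c - 1
        for s, p in zip(scalars, points):        # 1.a: bucket accumulation loop
            digit = (s >> (w * c)) & ((1 << c) - 1)
            if digit != 0:
                buckets[digit - 1] += p          # EC point addition in reality
        # Bucket reduction: sum(i * bucket_i) via a running-sum trick,
        # using only additions.
        acc, total = 0, 0
        for b in reversed(buckets):
            acc += b
            total += acc
        window_sums.append(total)
    # Final reduction: combine windows, most significant first.
    result = 0
    for ws in reversed(window_sums):
        result = (result << c) + ws              # c doublings + 1 addition in reality
    return result
```

The three loops marked above (mini-MSM chunking of the inputs, the per-window miniMSMs accumulation, and the per-point bucket accumulation) are exactly the parallelization targets discussed next.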
There are 3 main levels of parallelism:

MSM-level parallelism: split the input points into chunks and run each chunk as an independent mini-MSM on its own thread.
Disadvantages: MSM complexity is about O(n/log(n)): the more points we have, the more we save, and with a large number of points we can afford a large bit window `c` for the inner miniMSMs; chunking forfeits part of those savings.
Advantages: this parallelizes even the tricky final reduction (parallelizing it otherwise is possible, but at the cost of extra doublings). Also, increasing `c` has diminishing returns in practice (around ~16). In that case MSM-level parallelism has no disadvantage.

0.a Window-level parallelism: run the miniMSMs accumulation loop on different threads.
Disadvantages: none, this is natural parallelism. However it is limited: assuming 255-bit coefficients for a large MSM, the window size is about 16, so we expose 255/16 = 15.93x parallelism opportunities. If we split the scalars using the endomorphism, it becomes 127/16 = 7.94x parallelism opportunities. With a high core count, or on a GPU, more parallelism is needed.

1.a Bucket-level parallelism: run the bucket accumulation loop on different threads. At first look we might want different threads to handle separate chunks of points, but then threads might race to add them to the same bucket. Instead, we can have threads handle separate buckets and ignore points not relevant to their buckets.
Advantages: the number of buckets is easily in the thousands (starting from 2^13 = 8192 inputs), providing large parallelism opportunities even on GPUs.
Disadvantages: every task needs to scan the whole input range, which becomes extremely large with millions of inputs as large as 255 bits.
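The window-level parallelism described above can be sketched as follows: each c-bit window's mini-MSM touches only its own buckets, so the windows run on separate threads and only the final window combination stays serial. This is a hedged illustration in Python, not Constantine's Nim code; as before, integers stand in for curve points.

```python
# Hedged sketch of window-level parallelism: one task per c-bit window.
from concurrent.futures import ThreadPoolExecutor

def mini_msm(scalars, points, w, c):
    """Mini-MSM for window w: bucket accumulation + bucket reduction."""
    buckets = [0] * ((1 << c) - 1)
    for s, p in zip(scalars, points):
        digit = (s >> (w * c)) & ((1 << c) - 1)
        if digit:
            buckets[digit - 1] += p
    acc = total = 0
    for b in reversed(buckets):
        acc += b
        total += acc
    return total

def parallel_msm(scalars, points, c=4):
    num_bits = max(scalars).bit_length() if scalars else 0
    num_windows = (num_bits + c - 1) // c
    # One task per window: with 255-bit scalars and c = 16 this exposes
    # about 255/16 ~ 16 parallel tasks, matching the count given above.
    with ThreadPoolExecutor() as pool:
        window_sums = list(pool.map(
            lambda w: mini_msm(scalars, points, w, c), range(num_windows)))
    result = 0
    for ws in reversed(window_sums):   # serial final reduction
        result = (result << c) + ws
    return result
```

Note that every task reads the full `scalars`/`points` arrays, which is harmless here but is exactly the scanning cost flagged for bucket-level parallelism at large input sizes.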
Note on other parallelism opportunities:
Constantine implements window-level parallelism, and has a stub for bucket-level parallelism for when window-level parallelism does not expose enough work to occupy all cores. Unfortunately, as currently implemented, bucket-level parallelism actually degrades performance for the problem sizes and on the hardware used (a laptop i9-9980HK CPU, 8 cores / 16 threads), so it is deactivated.
Bug fix
In #223
Unfortunately, an ordering issue prevented a thief from waking up any thread if all other threads were sleeping:
In the event loop, a thief notifies that it is sleepy (a 2-phase commit-to-sleep protocol to avoid deadlocks/sleep-locks), attempts to steal, and, if successful, cancels its sleep:
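To make the 2-phase commit-to-sleep protocol concrete, here is a hedged toy model of it: the thief (1) announces its intent to sleep, (2) attempts one last steal, and (3) either cancels the announcement on success or commits to sleeping. This is an illustrative sketch with hypothetical names, not Constantine's actual threadpool code.

```python
# Hedged toy model of a 2-phase commit-to-sleep protocol (hypothetical names).
import threading

class Worker:
    def __init__(self, steal_fn):
        self.steal_fn = steal_fn          # returns a task or None
        self.sleepy = False               # phase 1: intent to sleep announced
        self.sleeping = threading.Event() # phase 2: actually parked

    def try_park(self):
        """Return a stolen task, or None after committing to sleep."""
        self.sleepy = True                # phase 1: announce intent to sleep
        task = self.steal_fn()            # last-chance steal
        if task is not None:
            self.sleepy = False           # cancel sleep: we found work
            # Ordering pitfall discussed below: a successful thief must still
            # wake another worker here, even while "sleepy but not sleeping".
            return task
        self.sleeping.set()               # phase 2: commit to sleep
        return None

    def wake(self):
        self.sleepy = False
        self.sleeping.clear()
```

The two phases avoid sleep-locks: a worker never parks without one final steal attempt after announcing its intent, so work enqueued concurrently cannot be stranded.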
constantine/constantine/platforms/threadpool/threadpool.nim, lines 563 to 577 in 4dc2610
However, on a successful steal, it tries to wake another thread before cancelling its own sleep:
constantine/constantine/platforms/threadpool/threadpool.nim, lines 521 to 534 in 4dc2610
Due to the #223 optimization (which led to significant improvements), the thief detects itself as sleepy but not sleeping, and never wakes up another thread, limiting parallelism.
In multi-scalar multiplication this was significant, as parallelism was limited to 4 cores before the fix in 8271be5.
Unfortunately, this increases scheduling overhead by 1.8x on "empty tasks" due to syscalls, as measured on the fibonacci benchmark. However, it is beneficial for non-synthetic benchmarks and also for power consumption.