-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: multiexp_serial
skips doubling when all bits are zero
#202
feat: multiexp_serial
skips doubling when all bits are zero
#202
Conversation
* feat: `multiexp_serial` skips doubling when all bits are zero For each `c` bit "segment", we get the bits from each coefficient in `coeffs` ahead of time and determine what the largest segment is. We use this to determine the largest nonzero segment and skip higher segments. Before this, we always looked at all 256 bits and did a doubling regardless. Now, if all the coefficients in the MSM are 64 bits, we will detect this first and only do pippenger up to 64 bits. This only saves doublings because the pippenger buckets for addition were already dynamically created. * feat: handle case where all bits are 0
multiexp_serial
skips doubling when all bits are zero (#10)multiexp_serial
skips doubling when all bits are zero
The summation by parts at the end of each pippenger bucket is doing an addition with identity unnecessarily for unused bits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds really cool!! I like it! Thanks for upstreaming it!
Is there any source/benchmark from where we can see not only the performance increase in 0-bit buckets but also how the performance is when we are in non highly 0bit populated circuits??
.into_iter() | ||
.take(max_nonzero_segment.unwrap() + 1) | ||
.rev() | ||
{ | ||
for _ in 0..c { | ||
*acc = acc.double(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be faster and easier to maintain to add a boolean guard if is_init
here
that defaults to false and is set to true on first initialization of the accumulator.
It would avoid an extra pass over the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, nice suggestion.
But now I am tempted to just do a full scan ahead of time to do this bit column optimization...
@@ -108,7 +139,7 @@ fn multiexp_serial<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C], acc: &mut | |||
// (a) + b + | |||
// ((a) + b) + c | |||
let mut running_sum = C::Curve::identity(); | |||
for exp in buckets.into_iter().rev() { | |||
for exp in buckets.into_iter().take(max_bits).rev() { | |||
running_sum = exp.add(running_sum); | |||
*acc = *acc + &running_sum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need a local accumulator initialized with the first non-zero running sum because we have no info on the global accumulator here.
I've looked into many implementations, the first one to do extensive tests was libff:
then arkworks:
Barretenberg uses empiric formulas: Dalek has precise cost, but doesn't use it: BLST uses empiric formula that takes into account memory bandwidth limitations: Gnark uses precise cost formulas I've tested the various outputs of libraries in https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7c1706993e534d19ebf444ee2e807ca In Constantine I use a precise cost formula with penalties when hitting memory bandwidth bottlenecks and tuned them based on empirical benchmarks: |
Looks like https://github.com/shuklaayush/halo2-keccak256 uses a very similar approach to the "bit based" keccak implementation: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/86f2c5e16d1d29acd562b1119fca81cc940ec7e2/zkevm-circuits/src/keccak_circuit/keccak_bit.rs, where all operations are done on bits and so can be done using only custom gates. If so, it may be possible to optimize even further because the MSM on binary values is just conditional additions, so I think probably even faster to not even use pippenger. I never tested it so don't know how big of a difference it would make, and you do have to know that the columns are just binary to be able to do it though, so less of a general solution than the one in this PR. But if you're trying to optimize as much as possible, it could be worth it. The bit based implementation one was removed from the PSE branch because it uses a lot more cells because each bit requires its own cell, compared to the lookup approach where more data is packed in a single cell. The bit based approach on its own was ~2x faster though, but at the cost of higher memory requirements and slower recursion, so was not ideal because of the super circuit approach for the zkevm-cirucits. At the time it also looked like lookups were getting much faster, so that version was kept. It looks like that will eventually be true, though we're not there yet of course in practice. Some more info comparing the approaches: Some more info comparing the two approaches can be found here: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q, |
Benchmarks (on my M2Max mac): in https://github.com/shuklaayush/halo2-keccak256, run
This benchmark does 85 keccak_f permutations in with Now I patch Also FYI if I run it with @Brechtpd Yup I saw your comments about the keccak_bits, and had wanted to run it directly but didn't have time to try to update the versioning with zkevm and ff 0.13 etc, so the above repo was an easier approximation. Just to say, this PR is just trying to pick low hanging fruit. It is by no means an endgame for Pippenger's optimization. It seems there is a lot more surface area if we do a scan of all columns to determine bit counts per column. (Or user needs to specify bit counts per column, but that's a bigger change.) Just for bit columns, indeed just adding is faster. I guess that could be special cased by scanning the column first? More generally, @mratsim I wonder if you know the best parallel MSM for multiple bit columns (or byte columns). We MSM per column, but the bases are the same (it's |
Also yes, the above circuit uses ~2400-2500 columns, so the recursion cost is quite high. |
No. I think this is already fixed within Can you confirm @kilic ? If so, then we can close this. |
Do you mean in privacy-scaling-explorations/halo2curves#130 ? I don't remember seeing the doubling skipping in that PR. |
@jonathanpwang In this PR #282 we removed the msm implementation from halo2 in favor of using the implementation from halo2curves (we just had the same code repeated in both places). Could you port this PR to halo2curves, and possibly extend it to the cycloneMSM that we recently merged? |
Sounds good, nice to hear cyclone was merged. I will need to study it a little. I have been interested in implementations of MSM that reduce peak memory usage - do you know if cyclone is any better than the original for that? We were also using |
I don't know about the memory usage of cyclone VS the previous MSM implementation; the only benchmarks that Onur shared where showing time; but it would be great to include memory benchmarks in future MSM iterations. BTW, for memory usage I think there's a low hanging fruit improvement which is avoiding this step https://github.com/privacy-scaling-explorations/halo2curves/blob/8af4f1ebab640405c799e65d9873847a4acf04f8/src/msm.rs#L290 which clones the coefficients, and instead directly iterate over the |
For each
c
bit "segment", we get the bits from each coefficient incoeffs
ahead of time and determine what the largest segment is. We use this to determine the largest nonzero segment and skip higher segments.Before this, we always looked at all 256 bits and did a doubling regardless. Now, if all the coefficients in the MSM are 64 bits, we will detect this first and only do pippenger up to 64 bits. This only saves doublings because the pippenger buckets for addition were already dynamically created.
I was previously unsure how much this saved because I benchmarked on zkevm-keccak and there was almost no difference because almost all scalars used are 254 bits due to the way words are packed (bits are packed in base 256). However we found that in https://github.com/shuklaayush/halo2-keccak256, where most numbers are 64 bits, there was a 40% speedup in proving time! So I figured that justified this PR.
Also not sure if current choice of
c
using natural logarithm is optimal? Why not log_2? Edit: empirically on my laptopln
is indeed faster.