Merge celo/zexe w/ arkworks #127

jon-chuang · 2020-12-11T07:37:12Z

Features:

Batch Affine ops
Batch and ordinary w-NAF
GLV framework. GLV precomputation script.
GLV impl for both batch affine and projective mul
Batch bucket addition tree
Batched MSM (1.4-1.7x bump)
Batch subgroup verification based on addition to random buckets
12-limb .s asm impl for montgomery mul and constant-time add/sub
~~2D test matrix feature-gating with axes (curve, test type)~~
~~Refactor tests to use common template~~
~~Code timing instrumentation with fine grained control of functions and callers~~

Combined speedup (with w-NAF) for:

Batched mul (with GLV): 2.3-2.7x.
Non-batch projective mul (with GLV): 2.1x.
Without GLV: 1.65-1.8x and 1.4x resp.

Speedup for subgroup verification:

Over old impl 25-80x.
Over new batch mul: 10-30x.

Spec sheet (also included in code comments): https://hackmd.io/@celo/S10UeWUBD

I am not willing to break this PR into smaller PRs as that was already done in Zexe, and now I have had to repeat work to merge again to arkworks, so it's too much work, no thank you.

TODOs:

Add glv generation script.
Sister PR for curves which implement glv, cuda, test feature gating, tests for new features etc.
Feature gate for BigInt768 asm (.s rather than ff-asm).

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

Targeted PR against correct branch (main)
Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
Wrote unit tests
Updated relevant documentation in the code
Added a relevant changelog entry to the Pending section in CHANGELOG.md
Re-reviewed Files changed in the Github PR explorer

Pratyush · 2020-12-11T08:10:25Z

ff/src/fields/arithmetic.rs

            // Checking the modulus at compile time
            let first_bit_set = P::MODULUS.0[$limbs - 1] >> 63 != 0;
            let mut all_bits_set = P::MODULUS.0[$limbs - 1] == !0 - (1 << 63);
            for i in 1..$limbs {
-                all_bits_set &= P::MODULUS.0[$limbs - i - 1] == core::u64::MAX;
+                all_bits_set &= P::MODULUS.0[$limbs - i - 1] == !0u64;


Suggested change

all_bits_set &= P::MODULUS.0[$limbs - i - 1] == !0u64;

all_bits_set &= P::MODULUS.0[$limbs - i - 1] == core::u64::MAX;

Pratyush · 2020-12-11T08:58:04Z

Thanks for the PR and adapting it to arkworks!

Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.

Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)

ValarDragon · 2020-12-11T16:27:42Z

Thanks for working on adapting these great changes to arkworks and upstreaming them!

I'd like to ideally split this PR into many smaller PR's for easier review, would you like me to take a stab at doing this, or would you like to?

To start with, I'd like to move the ark-serialize changes into its own PR, since that should be modular
Similary, a separate PR for the bw6 ASM

ValarDragon · 2020-12-11T16:28:51Z

Ah sorry, didn't see the comment regarding not being willing to split up this PR. Is it fine if Pratyush and I go through and split up this into smaller PRs, and review/merge the smaller ones accordingly?

ValarDragon · 2020-12-11T16:41:08Z

ff/bw6-assembly/modadd768.S

@@ -0,0 +1,181 @@
+// void modadd768(const uint64_t x[12], const uint64_t y[12], const uint64_t m[13], uint64_t z[12])


Was there any automated process for how you got this assembly, or was it hand-written?

If there was an automated process, which was then edited for optimizations, could you share it please?

If it was hand-written, would it be possible to add some comments for how you derived whats going on?
(And also what features its doing, that leads it to be faster than native in expectation)

I believe that there was automatic process, and we eventually want to convert all the asm code to using .S, I think, as personally I really dislike inline ASM and also its the wrong usecase, since inline asm is meant for very short segments of code that one does indeed intend to be "inlined". Further, this would, at least notionally, remove the use of unsafe for asm, although, extern "C" technically is also unsafe, is it not?

@kobigurk should know more details on the automated process done by dag arne.

jon-chuang · 2021-01-21T06:22:35Z

Hi thanks all for the comments. I apologise that I have been feeling rather fatigued wrt to the PR due to the tediousness and also its size.

With regards to areas about which I have some doubts,

the CUDA code could be improved, maybe some thinking about how we would like to organise the code/expose interfaces/conditionally compile CUDA support. Currently, there is a lot of intermixing of locations and dependencies, and no clean separation of duties. One thing I think could make the difference is to have a CUDAKernel trait that is only a bound for relevant parameters etc when conditionally compiled with feature = "cuda". This is very satisfactory to me

To be continued.

jon-chuang · 2021-01-21T06:53:22Z

Thanks for the PR and adapting it to arkworks!

Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.

So, this is specifically for batch multiplication. One thing I would like to improve is to create a more general framework for incorporating GPU kernels of all sorts, (including variablebaseMSM etc), which require instantiation at the parameter level due to needing to compile for specific curves without generics.

In the future, perhaps something so ugly will not have to be relied upon, e.g. if something like EmbarkStudios/rust-gpu#387 comes to be.

Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)

I will do this.

jon-chuang · 2021-03-15T13:09:36Z

Currently I'm restarting this PR and will break into 4 components:

batch inversion group arithmetic
glv framework and scripts
things using batch inversion-based methods, like batch verification and batch MSM
cuda batch scalar mul

Number 4 is low priority as the code is spaghetti and also the API is very uncertain.

This PR is considered number 4, and I will stage the PRs so they build atop one another. Since this introduces breaking changes with curves, the corresponding curves changes will also be made.

I've decided I will complete this PR, as sickening as it is. I've already wasted enough time on porting this to arkworks that I decided I should waste even more time to at least get it done.

I'm very sickened that instead of having to write code and get things done, I have to waste time doing stupid refactoring and changing things everywhere. What a useless waste of time.

jon-chuang · 2021-03-20T03:51:39Z

I'm thinking of migrating the GPU segments of the code towards rust-gpu, partly because

I don't feel this segment is particularly urgent
The workarounds to use accel and also the modifications requiring maintaining a segment of it are not worthwhile.
accel = spaghetti code
rust-gpu is obviously the future ;P

So there you have it. The timeline for rust-gpu to become mature enough coincides with my imagined timeline for the development of a robust gpu component to arkworks.

The major targets are compute intensive workloads like proving (MSM), batch scalar muls (currently implemented), batch verifications (on curve checks, signatures). Currently, such functionality centers around small number of functions.

Other potential things are FFTs.

Something else I would like to look into is how to combine something like rayon with rust-gpu.

Pratyush reviewed Dec 11, 2020

View reviewed changes

Pratyush requested review from ValarDragon and Pratyush December 11, 2020 08:58

ValarDragon reviewed Dec 11, 2020

View reviewed changes

ValarDragon mentioned this pull request Dec 13, 2020

Celo: Serialization changes #130

Closed

6 tasks

ValarDragon marked this pull request as draft December 16, 2020 16:19

Merge celo/zexe w/ arkworks

54ccc1f

Merge branch 'master' into jonch/celo-merge

1689686

jon-chuang force-pushed the jonch/celo-merge branch 3 times, most recently from a613ac8 to fe7faef Compare March 15, 2021 13:08

hotfix

4328859

jon-chuang force-pushed the jonch/celo-merge branch from fe7faef to 4328859 Compare March 15, 2021 13:10

jon-chuang mentioned this pull request Mar 17, 2021

Miscellaneous Changes: Serialization, additional helper methods, reorg #235

Closed

6 tasks

zhenfeizhang mentioned this pull request Jul 3, 2021

Add GLVParameters trait definition #301

Merged

6 tasks

ghost mentioned this pull request Apr 5, 2022

Batch subgroup checks anoma/ferveo#80

Open

Pratyush closed this Oct 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge celo/zexe w/ arkworks #127

Merge celo/zexe w/ arkworks #127

jon-chuang commented Dec 11, 2020 •

edited by ValarDragon

Loading

Pratyush Dec 11, 2020

Pratyush commented Dec 11, 2020

ValarDragon commented Dec 11, 2020 •

edited

Loading

ValarDragon commented Dec 11, 2020

ValarDragon Dec 11, 2020

jon-chuang Jan 21, 2021 •

edited

Loading

jon-chuang commented Jan 21, 2021

jon-chuang commented Jan 21, 2021 •

edited

Loading

jon-chuang commented Mar 15, 2021

jon-chuang commented Mar 20, 2021

	all_bits_set &= P::MODULUS.0[$limbs - i - 1] == !0u64;
	all_bits_set &= P::MODULUS.0[$limbs - i - 1] == core::u64::MAX;

		@@ -0,0 +1,181 @@
		// void modadd768(const uint64_t x[12], const uint64_t y[12], const uint64_t m[13], uint64_t z[12])

Merge celo/zexe w/ arkworks #127

Merge celo/zexe w/ arkworks #127

Conversation

jon-chuang commented Dec 11, 2020 • edited by ValarDragon Loading

Pratyush Dec 11, 2020

Choose a reason for hiding this comment

Pratyush commented Dec 11, 2020

ValarDragon commented Dec 11, 2020 • edited Loading

ValarDragon commented Dec 11, 2020

ValarDragon Dec 11, 2020

Choose a reason for hiding this comment

jon-chuang Jan 21, 2021 • edited Loading

Choose a reason for hiding this comment

jon-chuang commented Jan 21, 2021

jon-chuang commented Jan 21, 2021 • edited Loading

jon-chuang commented Mar 15, 2021

jon-chuang commented Mar 20, 2021

jon-chuang commented Dec 11, 2020 •

edited by ValarDragon

Loading

ValarDragon commented Dec 11, 2020 •

edited

Loading

jon-chuang Jan 21, 2021 •

edited

Loading

jon-chuang commented Jan 21, 2021 •

edited

Loading