Add native num implementation; modular inverse and Jacobi symbol without GMP #290

apoelstra · 2015-08-20T05:03:49Z

Implement num.h API without GMP backend; benchmarks show modular inversion taking about 65% longer than with GMP, but much faster than the constant-time version.

I will update this PR with an implementation of the Jacobi symbol computation; for now I'm pushing it in the hopes that somebody will see some more perf improvements that will let us catch up with GMP. (In fact, since we don't require arbitrary size numbers or allocation we should in principle be able to beat them..). This has a modular inverse but no Jacobi symbol.

This is required for #262 because we do not want to require a GMP dependency.

Edit: I have a paper by the author of the GMP code which describes (mostly) what they are doing. With this I am able to read the GMP code; I think I see how to optimize it for our special case. So we should hold off any detailed review until I've written new gcd code.

sipa · 2015-08-21T19:03:22Z

configure.ac

@@ -117,7 +117,7 @@ AC_ARG_ENABLE(module_schnorr,
 AC_ARG_WITH([field], [AS_HELP_STRING([--with-field=64bit|32bit|auto],
 [Specify Field Implementation. Default is auto])],[req_field=$withval], [req_field=auto])

-AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|no|auto],
+AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|64bit|auto],


Seems there is a 32bit as well?

Thanks, fixed.

sipa · 2015-09-04T18:27:42Z

Needs rebase.

sipa · 2015-09-22T21:30:14Z

Need more rebase.

apoelstra · 2015-10-25T23:38:27Z

Rebased.

peterdettman · 2015-10-26T03:11:39Z

How is the inversion performance looking? I ask because I've implemented an inversion based on the GCD in http://perso.ens-lyon.fr/damien.stehle/downloads/recbinary.pdf (except non-recursive). Currently it does 1 million inversions in ~2.3s, which is at least %10 faster than GMP according to benchmark_internal. There's room for improvement because this is currently in Java, using 32 bit multiplies.

If it is proving difficult to improve on GMP's performance using the current approach, perhaps this would be more fruitful. Due to time constraints, I think I would need help with the C conversion and secp256k1 integration though.

apoelstra · 2015-10-26T12:54:22Z

@peterdettman I'm surprised that you were able to beat my code (let alone GMP's!!) with binary GCD. Nobody in this space has published benchmarks anywhere but the papers I've read have consistently implied that Euclid + Lehmer would be a bit faster because it involved fewer iterations. I did some back-of-envelope calculations that were consistent with this.

And GMP doesn't use pure Euclid+Lehmer, but a recursive "half GCD" algorithm invented by Niels Möller. It is sorta described here https://www.jstor.org/stable/40234522 but the actual GMP code is heavily optimized beyond that for their data structures.

peterdettman · 2015-10-26T14:45:50Z

@apoelstra I was a bit surprised myself and I suppose it may yet turn out that I've mis-measured something, but I don't believe so. The focus in most GCD papers is on improving the big-O bounds for very large numbers; at small sizes like 256 bits the performance limit of an algorithm often depends on serendipity - whether things work out "cleanly". Ultimately the proof is in the pudding I guess.

Still, if you're interested, the best place to start is perhaps to just get you a copy of the Java code and have you confirm the performance comparison?

peterdettman · 2015-10-26T15:15:38Z

BTW, it's strictly speaking not binary GCD, although similar. In particular it calculates exact "quotients" from the least significant bits only. Best to read the paper!

apoelstra · 2015-10-26T15:53:52Z

Rebased (correctly this time, I think).

@peterdettman Sure, I'm interested. You can email me at apoelstra@wpsoftware.net if this is not public code. Note that I'm also time-constrained, plus I'm much less comfortable in Java than in C.

I'll read your paper as soon as I have have some cycles to work on this again. @sipa had some ideas about using Jacobi symbols for fast batch inversion, so the priority of this patch might be bumped up again..

For historic curiosity:

If I remember correctly, I actually did implement a binary GCD to do some measurements. Because proper bin-GCD requires lots of left-shifting, and I was working with fixed-size buffers, and I didn't really believe in this approach enough to dedicate lots of time to it, I basically did a modular reduction after every left shift (given the functions I had access to already, and C, this was actually easier than using a variable-sized buffer :)). This of course was hundreds, if not thousands, of times slower than a properly written bin-GCD, so I could not benchmark it directly. Instead I added a bunch of trace code and simply printed out some iteration counts to get a feel for "how much work a proper bin-GCD would do". If I remember right, The result was 4-600 iterations for a random 256-bit number, vs 20-30 for Euclid-Lehmer. A bingcd iteration is a left-shift-then-subtract (so on 5-word bignums, roughly 10 word operations) while an Euclid-Lehmer iteration is multiplication of a bignum 2-vector by a 2x2 word matrix (roughly 60 operations I think). So my intuition was that bingcd was going to be slower pretty-much no matter what I did.

Obviously these numbers are full of SWAGs and don't take into account compiler or CPU optimizations. :) And since your paper isn't really binary GCD maybe they're totally inapplicable. But there they are.

peterdettman · 2015-10-26T16:41:46Z

Yeah, your intuition was right about that class of algorithms (Laszlo Hars has a good paper rounding up the variations: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.384.8124). About 18 months ago I tried implementing a couple of them and could never get better than very roughly half as fast as GMP at 256 bits, maybe worse.

This algorithm is really more comparable to Euclid-Lehmer. It works on the least significant double-words, calculating exact "quotient" matrices (of the Generalised Binary Division, a sort of dual to plain
division, for 2-adic numbers) of a little less than word size without regard to the absolute or relative sizes of the full values. No actual division is needed, and no "fix-up" procedure. Then the matrices are used to update the full-length values.

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9), and a 64-bit oriented one would comfortably finish in 5 iterations (fewer full-length updates, but the amount of LSB-processing doesn't change much).

The raw output (of the modular inversion) can be larger than 256 bits (there's some analysis in the paper of how large the quotients can get, but 512 bits is plenty), and an adjustment is needed at the end to account for extra powers of 2 that are collected in the result (total coda cost is around 2 field multiplies on average, using a tiny precomputed table per prime).

Anyway, I'll tidy things up a little and get a copy to you in the next few days, just for benchmarking in the first instance.

apoelstra · 2015-10-26T19:45:31Z

(Note that the build failures are due to timeouts on the unit tests, which are due to the very bad perf on the current Jacobi symbol implementation vs GMP's.)

gmaxwell · 2015-10-27T00:09:58Z

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9),

How close? -1/F(-x) = 1/F(x). So if having a smaller magnitude input helps, you can shave off a bit that way.

peterdettman · 2015-10-27T08:27:58Z

I collected some better stats on the rounds and it turns out it's about 0.5% 9, 98% 10 and 1.5% 11. After nine rounds there's on average about 14 bits left, so each round averages just under 27 bits. Saying its close to 9 was an exaggeration that probably stems from earlier in the process before I had implemented the "stop heuristics" correctly. Still, a few extra bits in the average case would skew things more towards 9 and away from 11.

@gmaxwell In this context the smaller magnitude doesn't help, although a subtraction can get you a low order zero, which could - but the quotient calculation is already built around generating low-order zeroes and it's cheaper to just let it do its thing instead of manipulating the full values (apart from the single update per round).

sipa · 2015-11-02T01:21:43Z

src/num_native_impl.h

+    int shift_w, shift_b;
+    secp256k1_num_word rem_high;  /* This slides as rem decreases */
+    secp256k1_num_word div_high;  /* This one stays constant */
+    secp256k1_num div;


Declaration of div shadows a global declaration.

sipa · 2015-11-02T01:44:50Z

Fails inexplicably, but needs a trivial rebase.

apoelstra · 2015-11-02T14:22:34Z

@sipa the inexplicable failures are that my Jacobi implementation is so slow compared to GMP (like 50x; they do some magic I don't yet understand to make Jacobi way faster than gcd, though the code is superficially similar) that it times out.

But sure, I'll rebase this.

sipa · 2015-11-02T15:29:38Z

I'd like to see this in, even if inefficient. Both the Jacobi construct and the ability to have a num interface without GMP.

apoelstra · 2015-11-02T15:33:48Z

Is it OK if I add an #ifdef to the tests binary which drops the number of Jacobi iterations by a factor of 100 when it's testing my implementation?

Also I'll need to re-run kcov on this; I think I'm missing coverage for some error cases and I don't want to hurt @gmaxwell's numbers :)

sipa · 2015-11-02T15:47:14Z

How about just dropping the iteration count by a factor 100 unconditionally? :) And yes, not hurting the coverage numbers would be nice!

This function isn't used anywhere and will cause test failures if we implement the full num.h API for a fixed-width 256-bit numeric type. We lose the unit test for secp256k1_scalar_mul_shift_var; we compensate by improving the unit test for secp256k1_scalar_split_lambda (which is the only user of this function) to test that the algebraic relation `N = s_lam * lambda + s_1` actually holds for the lambda decomposition.

This num.h implementation works using fixed-size arrays large enough to hold a 256-bit number (plus one word for slop). It includes a modular inversion. Typical perf numbers on my 64-bit system are: scalar_inverse: constant time: min 13.4us / avg 13.5us / max 13.8us native num.h: min 5.18us / avg 4.55us / max 5.43us gmp num.h: min 2.65us / avg 2.68us / max 2.70us field_inverse: constant time: min 6.02us / avg 6.09us / max 6.15us native num.h: min 5.48us / avg 4.94us / max 5.68us gmp num.h: min 2.96us / avg 3.02us / max 3.09us

apoelstra · 2015-11-07T18:06:12Z

Rebased, and also:

Got full test coverage.
Fixed a couple edge-case bugs revealed by having full test coverage (in particular computing the Jacobi symbol of 0 wrt any modulus would cause an invalid memory access!!)
Dropped the number of Jacobi iterations in bench_internal by a factor of 100 (saving ~70s on my system with the 32-bit num implementation :P)
Renamed helper function secp256k1_num_div_mul_word to secp256k1_num_mul_word_shift because the old name made no sense to me.

Also add native Jacobi symbol test (Andrew) Rebased-by: Andrew Poelstra

…rd" case Originally we computed q by dividing the top words of the dividend and divisor, then corrected it by at most 2. However, the ratio of the top words can only ever be 1, so rather than having three cases (original correct; need to subtract 1; need to subtract 2), we actually only have two (1 is correct; 0 is correct). This simplifies the algorithm a little bit and gets us to full test coverage, which we didn't have before since the "correct by 2" case was impossible to hit.

sipa · 2015-11-10T22:27:33Z

Still way too slow for tests.

sipa · 2017-03-22T00:56:38Z

Any interest in picking this up in the near future?

spartucus · 2019-08-26T02:27:52Z

Any progress on this?

real-or-random · 2019-08-26T07:36:22Z

Also, is there any progress on @peterdettman's approach?

@peterdettman Are there newer developments? Are you still interested in this? Any code, even in java, would be useful.

elichai · 2019-10-23T20:06:58Z

Just to prevent duplication of work, I want to mention that in light of these results #667 (comment)
I'm picking this up and working on implementing a potentially faster subquadratic algorithm for the jacobi symbol.

And potentially a faster algorithm for calculating modulo (i.e. without doing actual division).

If i'll be able to understand and implement correctly I hope it will result in faster jacobi symbol calculation :) (or at least more comparable to gmp)

Edit:
In case anyone wants to comment/is interested I'm basing these on: https://arxiv.org/pdf/1004.2091.pdf and maybe also https://eprint.iacr.org/2014/755.pdf

elichai · 2019-10-24T08:49:32Z

@apoelstra did you actually implement Schönhage algorithm in the end? because you mentioned it but I can't see it in the code. only lehmer's algorithm. (It's fine if you don't remember, it's 4 years ago lol)

gmaxwell · 2019-10-25T05:44:42Z

Related, someone who doesn't find C++ numeric code unreadable might find this codebase interesting: https://github.com/JeanLucPons/VanitySearch/blob/master/IntMod.cpp#L110

apoelstra · 2019-10-29T20:29:13Z

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works, because I could not make sense of the paper :). I have some more experience now and maybe could get further if I tried again. So no, I did not implement it.

roconnor-blockstream · 2021-03-24T13:00:30Z

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

real-or-random · 2021-03-24T21:13:10Z

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works,

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

I'm all in favor of optimizing implementations but even I didn't expect #831 to come with such large CO2 savings.

apoelstra force-pushed the jacobi branch from ac63dda to 3833f51 Compare August 20, 2015 05:14

sipa reviewed Aug 21, 2015
View reviewed changes

apoelstra force-pushed the jacobi branch 4 times, most recently from 5726a17 to 3caa2a1 Compare August 28, 2015 21:02

apoelstra force-pushed the jacobi branch from 3caa2a1 to cfabb11 Compare September 6, 2015 14:31

apoelstra mentioned this pull request Oct 2, 2015

Default selection of field/scalar inverse implementation #331

Closed

sipa mentioned this pull request Oct 11, 2015

Moves all test-only code into tests and re-enables -Wunused-function. #323

Closed

apoelstra force-pushed the jacobi branch from cfabb11 to 28d17d8 Compare October 25, 2015 23:37

apoelstra force-pushed the jacobi branch from 28d17d8 to 6ecb0b6 Compare October 26, 2015 15:39

sipa reviewed Nov 2, 2015
View reviewed changes

Move endianness macros into util.h

e38eb3b

apoelstra added 2 commits November 6, 2015 11:40

apoelstra force-pushed the jacobi branch from 6ecb0b6 to f3599f3 Compare November 7, 2015 18:05

peterdettman and others added 2 commits November 7, 2015 12:10

Add Jacobi symbol test via GMP

953894a

Also add native Jacobi symbol test (Andrew) Rebased-by: Andrew Poelstra

apoelstra force-pushed the jacobi branch from f3599f3 to 5f7c1c9 Compare November 7, 2015 18:11

spartucus mentioned this pull request Aug 26, 2019

Why not use GMP num? mimblewimble/secp256k1-zkp#50

Closed

apoelstra mentioned this pull request Oct 22, 2019

Is the compiler optimizing out some of the benchmarks? #667

Closed

elichai mentioned this pull request Mar 29, 2020

Use the extened euclidian algorithm to compute scalar inverse in variable time #730

Closed

elichai mentioned this pull request Aug 24, 2020

WIP: "safegcd" field and scalar inversion #767

Closed

real-or-random closed this Mar 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

apoelstra commented Aug 20, 2015

sipa Aug 21, 2015

apoelstra Aug 23, 2015

sipa commented Sep 4, 2015

sipa commented Sep 22, 2015

apoelstra commented Oct 25, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

peterdettman commented Oct 26, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

gmaxwell commented Oct 27, 2015

peterdettman commented Oct 27, 2015

sipa Nov 2, 2015

sipa commented Nov 2, 2015

apoelstra commented Nov 2, 2015

sipa commented Nov 2, 2015 via email

apoelstra commented Nov 2, 2015

sipa commented Nov 2, 2015 via email

apoelstra commented Nov 7, 2015

sipa commented Nov 10, 2015

sipa commented Mar 22, 2017

spartucus commented Aug 26, 2019

real-or-random commented Aug 26, 2019

elichai commented Oct 23, 2019 •

edited

Loading

elichai commented Oct 24, 2019 •

edited

Loading

gmaxwell commented Oct 25, 2019

apoelstra commented Oct 29, 2019

roconnor-blockstream commented Mar 24, 2021

real-or-random commented Mar 24, 2021

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

Conversation

apoelstra commented Aug 20, 2015

sipa Aug 21, 2015

Choose a reason for hiding this comment

apoelstra Aug 23, 2015

Choose a reason for hiding this comment

sipa commented Sep 4, 2015

sipa commented Sep 22, 2015

apoelstra commented Oct 25, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

peterdettman commented Oct 26, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

peterdettman commented Oct 26, 2015

apoelstra commented Oct 26, 2015

gmaxwell commented Oct 27, 2015

peterdettman commented Oct 27, 2015

sipa Nov 2, 2015

Choose a reason for hiding this comment

sipa commented Nov 2, 2015

apoelstra commented Nov 2, 2015

sipa commented Nov 2, 2015 via email

apoelstra commented Nov 2, 2015

sipa commented Nov 2, 2015 via email

apoelstra commented Nov 7, 2015

sipa commented Nov 10, 2015

sipa commented Mar 22, 2017

spartucus commented Aug 26, 2019

real-or-random commented Aug 26, 2019

elichai commented Oct 23, 2019 • edited Loading

elichai commented Oct 24, 2019 • edited Loading

gmaxwell commented Oct 25, 2019

apoelstra commented Oct 29, 2019

roconnor-blockstream commented Mar 24, 2021

real-or-random commented Mar 24, 2021

elichai commented Oct 23, 2019 •

edited

Loading

elichai commented Oct 24, 2019 •

edited

Loading