Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

Closed
wants to merge 5 commits into from

Conversation

apoelstra
Copy link
Contributor

Implement num.h API without GMP backend; benchmarks show modular inversion taking about 65% longer than with GMP, but much faster than the constant-time version.

I will update this PR with an implementation of the Jacobi symbol computation; for now I'm pushing it in the hopes that somebody will see some more perf improvements that will let us catch up with GMP. (In fact, since we don't require arbitrary size numbers or allocation we should in principle be able to beat them..). This has a modular inverse but no Jacobi symbol.

This is required for #262 because we do not want to require a GMP dependency.

Edit: I have a paper by the author of the GMP code which describes (mostly) what they are doing. With this I am able to read the GMP code; I think I see how to optimize it for our special case. So we should hold off any detailed review until I've written new gcd code.

@@ -117,7 +117,7 @@ AC_ARG_ENABLE(module_schnorr,
AC_ARG_WITH([field], [AS_HELP_STRING([--with-field=64bit|32bit|auto],
[Specify Field Implementation. Default is auto])],[req_field=$withval], [req_field=auto])

AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|no|auto],
AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|64bit|auto],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems there is a 32bit as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@apoelstra apoelstra force-pushed the jacobi branch 4 times, most recently from 5726a17 to 3caa2a1 Compare August 28, 2015 21:02
@sipa
Copy link
Contributor

sipa commented Sep 4, 2015

Needs rebase.

@sipa
Copy link
Contributor

sipa commented Sep 22, 2015

Need more rebase.

@apoelstra
Copy link
Contributor Author

Rebased.

@peterdettman
Copy link
Contributor

How is the inversion performance looking? I ask because I've implemented an inversion based on the GCD in http://perso.ens-lyon.fr/damien.stehle/downloads/recbinary.pdf (except non-recursive). Currently it does 1 million inversions in ~2.3s, which is at least %10 faster than GMP according to benchmark_internal. There's room for improvement because this is currently in Java, using 32 bit multiplies.

If it is proving difficult to improve on GMP's performance using the current approach, perhaps this would be more fruitful. Due to time constraints, I think I would need help with the C conversion and secp256k1 integration though.

@apoelstra
Copy link
Contributor Author

@peterdettman I'm surprised that you were able to beat my code (let alone GMP's!!) with binary GCD. Nobody in this space has published benchmarks anywhere but the papers I've read have consistently implied that Euclid + Lehmer would be a bit faster because it involved fewer iterations. I did some back-of-envelope calculations that were consistent with this.

And GMP doesn't use pure Euclid+Lehmer, but a recursive "half GCD" algorithm invented by Niels Möller. It is sorta described here https://www.jstor.org/stable/40234522 but the actual GMP code is heavily optimized beyond that for their data structures.

@peterdettman
Copy link
Contributor

@apoelstra I was a bit surprised myself and I suppose it may yet turn out that I've mis-measured something, but I don't believe so. The focus in most GCD papers is on improving the big-O bounds for very large numbers; at small sizes like 256 bits the performance limit of an algorithm often depends on serendipity - whether things work out "cleanly". Ultimately the proof is in the pudding I guess.

Still, if you're interested, the best place to start is perhaps to just get you a copy of the Java code and have you confirm the performance comparison?

@peterdettman
Copy link
Contributor

BTW, it's strictly speaking not binary GCD, although similar. In particular it calculates exact "quotients" from the least significant bits only. Best to read the paper!

@apoelstra
Copy link
Contributor Author

Rebased (correctly this time, I think).

@peterdettman Sure, I'm interested. You can email me at apoelstra@wpsoftware.net if this is not public code. Note that I'm also time-constrained, plus I'm much less comfortable in Java than in C.

I'll read your paper as soon as I have have some cycles to work on this again. @sipa had some ideas about using Jacobi symbols for fast batch inversion, so the priority of this patch might be bumped up again..

For historic curiosity:

If I remember correctly, I actually did implement a binary GCD to do some measurements. Because proper bin-GCD requires lots of left-shifting, and I was working with fixed-size buffers, and I didn't really believe in this approach enough to dedicate lots of time to it, I basically did a modular reduction after every left shift (given the functions I had access to already, and C, this was actually easier than using a variable-sized buffer :)). This of course was hundreds, if not thousands, of times slower than a properly written bin-GCD, so I could not benchmark it directly. Instead I added a bunch of trace code and simply printed out some iteration counts to get a feel for "how much work a proper bin-GCD would do". If I remember right, The result was 4-600 iterations for a random 256-bit number, vs 20-30 for Euclid-Lehmer. A bingcd iteration is a left-shift-then-subtract (so on 5-word bignums, roughly 10 word operations) while an Euclid-Lehmer iteration is multiplication of a bignum 2-vector by a 2x2 word matrix (roughly 60 operations I think). So my intuition was that bingcd was going to be slower pretty-much no matter what I did.

Obviously these numbers are full of SWAGs and don't take into account compiler or CPU optimizations. :) And since your paper isn't really binary GCD maybe they're totally inapplicable. But there they are.

@peterdettman
Copy link
Contributor

Yeah, your intuition was right about that class of algorithms (Laszlo Hars has a good paper rounding up the variations: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.384.8124). About 18 months ago I tried implementing a couple of them and could never get better than very roughly half as fast as GMP at 256 bits, maybe worse.

This algorithm is really more comparable to Euclid-Lehmer. It works on the least significant double-words, calculating exact "quotient" matrices (of the Generalised Binary Division, a sort of dual to plain
division, for 2-adic numbers) of a little less than word size without regard to the absolute or relative sizes of the full values. No actual division is needed, and no "fix-up" procedure. Then the matrices are used to update the full-length values.

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9), and a 64-bit oriented one would comfortably finish in 5 iterations (fewer full-length updates, but the amount of LSB-processing doesn't change much).

The raw output (of the modular inversion) can be larger than 256 bits (there's some analysis in the paper of how large the quotients can get, but 512 bits is plenty), and an adjustment is needed at the end to account for extra powers of 2 that are collected in the result (total coda cost is around 2 field multiplies on average, using a tiny precomputed table per prime).

Anyway, I'll tidy things up a little and get a copy to you in the next few days, just for benchmarking in the first instance.

@apoelstra
Copy link
Contributor Author

(Note that the build failures are due to timeouts on the unit tests, which are due to the very bad perf on the current Jacobi symbol implementation vs GMP's.)

@gmaxwell
Copy link
Contributor

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9),

How close? -1/F(-x) = 1/F(x). So if having a smaller magnitude input helps, you can shave off a bit that way.

@peterdettman
Copy link
Contributor

I collected some better stats on the rounds and it turns out it's about 0.5% 9, 98% 10 and 1.5% 11. After nine rounds there's on average about 14 bits left, so each round averages just under 27 bits. Saying its close to 9 was an exaggeration that probably stems from earlier in the process before I had implemented the "stop heuristics" correctly. Still, a few extra bits in the average case would skew things more towards 9 and away from 11.

@gmaxwell In this context the smaller magnitude doesn't help, although a subtraction can get you a low order zero, which could - but the quotient calculation is already built around generating low-order zeroes and it's cheaper to just let it do its thing instead of manipulating the full values (apart from the single update per round).

int shift_w, shift_b;
secp256k1_num_word rem_high; /* This slides as rem decreases */
secp256k1_num_word div_high; /* This one stays constant */
secp256k1_num div;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declaration of div shadows a global declaration.

@sipa
Copy link
Contributor

sipa commented Nov 2, 2015

Fails inexplicably, but needs a trivial rebase.

@apoelstra
Copy link
Contributor Author

@sipa the inexplicable failures are that my Jacobi implementation is so slow compared to GMP (like 50x; they do some magic I don't yet understand to make Jacobi way faster than gcd, though the code is superficially similar) that it times out.

But sure, I'll rebase this.

@sipa
Copy link
Contributor

sipa commented Nov 2, 2015 via email

@apoelstra
Copy link
Contributor Author

Is it OK if I add an #ifdef to the tests binary which drops the number of Jacobi iterations by a factor of 100 when it's testing my implementation?

Also I'll need to re-run kcov on this; I think I'm missing coverage for some error cases and I don't want to hurt @gmaxwell's numbers :)

@sipa
Copy link
Contributor

sipa commented Nov 2, 2015 via email

This function isn't used anywhere and will cause test failures if we
implement the full num.h API for a fixed-width 256-bit numeric type.

We lose the unit test for secp256k1_scalar_mul_shift_var; we compensate
by improving the unit test for secp256k1_scalar_split_lambda (which is
the only user of this function) to test that the algebraic relation
`N = s_lam * lambda + s_1` actually holds for the lambda decomposition.
This num.h implementation works using fixed-size arrays large enough
to hold a 256-bit number (plus one word for slop). It includes a
modular inversion. Typical perf numbers on my 64-bit system are:

  scalar_inverse:
    constant time: min 13.4us / avg 13.5us / max 13.8us
     native num.h: min 5.18us / avg 4.55us / max 5.43us
        gmp num.h: min 2.65us / avg 2.68us / max 2.70us

  field_inverse:
    constant time: min 6.02us / avg 6.09us / max 6.15us
     native num.h: min 5.48us / avg 4.94us / max 5.68us
        gmp num.h: min 2.96us / avg 3.02us / max 3.09us
@apoelstra
Copy link
Contributor Author

Rebased, and also:

  1. Got full test coverage.
  2. Fixed a couple edge-case bugs revealed by having full test coverage (in particular computing the Jacobi symbol of 0 wrt any modulus would cause an invalid memory access!!)
  3. Dropped the number of Jacobi iterations in bench_internal by a factor of 100 (saving ~70s on my system with the 32-bit num implementation :P)
  4. Renamed helper function secp256k1_num_div_mul_word to secp256k1_num_mul_word_shift because the old name made no sense to me.

peterdettman and others added 2 commits November 7, 2015 12:10
Also add native Jacobi symbol test (Andrew)

Rebased-by: Andrew Poelstra
…rd" case

Originally we computed q by dividing the top words of the dividend and
divisor, then corrected it by at most 2. However, the ratio of the top
words can only ever be 1, so rather than having three cases (original
correct; need to subtract 1; need to subtract 2), we actually only have
two (1 is correct; 0 is correct).

This simplifies the algorithm a little bit and gets us to full test coverage,
which we didn't have before since the "correct by 2" case was impossible to
hit.
@sipa
Copy link
Contributor

sipa commented Nov 10, 2015

Still way too slow for tests.

@sipa
Copy link
Contributor

sipa commented Mar 22, 2017

Any interest in picking this up in the near future?

@spartucus
Copy link

Any progress on this?

@real-or-random
Copy link
Contributor

Also, is there any progress on @peterdettman's approach?

@peterdettman Are there newer developments? Are you still interested in this? Any code, even in java, would be useful.

@elichai
Copy link
Contributor

elichai commented Oct 23, 2019

Just to prevent duplication of work, I want to mention that in light of these results #667 (comment)
I'm picking this up and working on implementing a potentially faster subquadratic algorithm for the jacobi symbol.

And potentially a faster algorithm for calculating modulo (i.e. without doing actual division).

If i'll be able to understand and implement correctly I hope it will result in faster jacobi symbol calculation :) (or at least more comparable to gmp)

Edit:
In case anyone wants to comment/is interested I'm basing these on: https://arxiv.org/pdf/1004.2091.pdf and maybe also https://eprint.iacr.org/2014/755.pdf

@elichai
Copy link
Contributor

elichai commented Oct 24, 2019

@apoelstra did you actually implement Schönhage algorithm in the end? because you mentioned it but I can't see it in the code. only lehmer's algorithm. (It's fine if you don't remember, it's 4 years ago lol)

@gmaxwell
Copy link
Contributor

Related, someone who doesn't find C++ numeric code unreadable might find this codebase interesting: https://github.com/JeanLucPons/VanitySearch/blob/master/IntMod.cpp#L110

@apoelstra
Copy link
Contributor Author

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works, because I could not make sense of the paper :). I have some more experience now and maybe could get further if I tried again. So no, I did not implement it.

@roconnor-blockstream
Copy link
Contributor

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

@real-or-random
Copy link
Contributor

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works,

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

I'm all in favor of optimizing implementations but even I didn't expect #831 to come with such large CO2 savings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants