Rework of 10x26 field mul/sqr #815
base: master
Conversation
- avoid overly-wide multiplications
- save a few multiplies, masks and shifts
- final residual left in r[9] instead of r[2]
@peterdettman That's great, I didn't see the widemul optimizations even though I was staring at this code for some time. I get the general idea, but can you explain your approach more systematically? E.g., what algorithm did you follow to create these changes? I think this helps reviewers, so we don't need to rediscover your thoughts.
The overall algorithm is "schoolbook" multiplication to calculate 19 products p00..p18, with the high part reduced modulo the field prime P using multiplication by (2^256 - P), and a final carry propagation to ensure a magnitude 1 output. Operationally, we run two accumulators, 10 limbs (260 bits of "weight") apart, and interleave the modular reduction: the least-significant limb of the upper accumulator can be accumulated into the lower accumulator after multiplying by [ R1, R0 ], the two-limb representation of (2^256 - P). The accumulation begins with p07, p17. Squaring is the same, modified only so that duplicate limb multiplications are consolidated when calculating the products p00..p18.
I'm intending to push arbitrary-degree Karatsuba changes in this PR shortly, but I'd like to get some baseline performance numbers in place first.
The latest commit adds this (only relevant to mul, not sqr). It saves 45 limb multiplies, but adds the equivalent of 52 limb additions (counting 64-bit adds as 2). However, that may not be the full story; from the paper: […]
On my 64-bit machine this is slower (as expected), but on real 32-bit hardware there's a good chance this is faster. I'd need someone to help with measuring that, though. From what I can see on godbolt, the choice of compiler leads to large differences in instruction count.
@gmaxwell I don't want to impose on you, but I had hoped you might collect some benchmarks for this PR.
Whoops! Will do. I'm excited about this work-- but I've been redoing all the power at home, have had systems in various states of off and disconnected, couldn't immediately try it, and then forgot about it. Give me a day or two.
@gmaxwell Still hoping for some benchmarks here (separately for the first "rework" commit and the latter "Karatsuba" one). If/when this is accepted, I am hoping someone can update the corresponding asm to leave its final carry in the top limb (or, in case Karatsuba proves fast, either abandon the asm or reimplement it along those lines). At that point field implementations could be modified to support a maximum magnitude of 16. Some easy (but small) performance improvements then follow, and I expect it will give a little more room for per-group-operation magnitude limits (on input and output field elements) to be profitable.
@peterdettman Oh crap. I am terribly sorry for completely forgetting your request. I've tested on two different 32-bit ARM systems: a faster superscalar Cortex-A9, and a Cortex-A8. Common hardware wallets are usually Cortex-M4, which like the A8 is an in-order core.

Results seem a little paradoxical. On the A8, rework seems to slow things down and karatsuba […]. With the older GCC the rework is a huge speedup. As the ASM (which is IIRC the same algorithm as master, just with hand scheduling and […]).

May be of interest to @laanwj too.

i.MX53 (Cortex-A8)
- Master w/ arm asm:
- Master:
- Rework:
- Karatsuba:

i.MX6 (Cortex-A9)
- Master w/ arm asm:
- Master:
- Rework:
- Karatsuba:
Given that it was so compiler-sensitive, I thought it would be useful to upgrade to a newer GCC and try clang too:

Cortex-A8, gcc version 10.2.0 (GCC)
- Rework:
- Karatsuba:

Cortex-A8, clang version 11.1.0
- Rework:
- Karatsuba:

...chaos
Thanks, @gmaxwell. Somewhat irksome results, though. In the rework I tried loading a0..a9 so that the results can be written out to r immediately, but maybe it's worth testing a variation that doesn't attempt that change (so no a0..a9, but t0..t6 restored instead, and written to r only after all reads from a). Karatsuba might do better with manual scheduling, as you say, so I hope someone can try an asm version of it (well, of both versions ideally).
If this isn't faster on new compilers, it's unclear whether we should spend time on this. @peterdettman, is there a smaller set of changes that we could consider for benchmarking? With work progressing on fiat-crypto (#1261), changes like this may become more appealing if they can be integrated into their optimization framework, as that simplifies our job of convincing ourselves they're correct.
@gmaxwell It looks faster to me, but if you could collect some results for real 32-bit hardware that would be very helpful.