Rework of 10x26 field mul/sqr #815

peterdettman · 2020-09-13T16:38:52Z

avoid overly-wide multiplications
save a few multiplies, masks and shifts
final residual left in r[9] instead of r[2]

@gmaxwell It looks faster to me, but if you could collect some results for real 32-bit hardware that would be very helpful.

- avoid overly-wide multiplications - save a few multiplies, masks and shifts - final residual left in r[9] instead of r[2]

real-or-random · 2020-09-13T21:02:37Z

@peterdettman That's great, I didn't see the widemul optimizations even though I was staring at this code for some time.

I get the general idea but can you explain something like a more systematic approach to this? E.g., what was the algorithm you followed to create these changes? I think this helps reviewers and then we don't need to rediscover your thoughts.

peterdettman · 2020-09-14T05:11:53Z

The overall algorithm is "schoolbook" multiplication to calculate 19 products p00..p18, with the high part reduced modulo the field prime P using multiplication by (2^256 - P), and a final carry propagation to ensure a magnitude 1 output.

Operationally, we run two accumulators, 10 limbs (260 bits of "weight") apart, and interleave the modular reduction: the least-significant limb of the upper accumulator can be accumulated into the lower accumulator after multiplying by [ R1, R0 ], where (R1 << 26) + R0 == (2^256 - P) << 4 (the << 4 is because 10 limbs is 260 bits, not 256).

The accumulation begins with p07, p17. d accumulates p07..p16, while c accumulates p17..p18 as the upper accumulator then p00..p06 as the lower accumulator. There are a couple of final carry propagation steps for c. The choice of starting accumulator positions affects where the final carry will land. There is an ulterior motive for finishing the carry propagation at r[9]: it will support an increased magnitude limit for field elements of 16 (currently 8). a[9], b[9] also have their input bounds changed to 27 bits (from 26) in anticipation of this.

Squaring is the same, modified only so that duplicate limb multiplications are consolidated when calculating the products p00..p18.

peterdettman · 2020-09-14T05:14:26Z

I'm intending to push arbitrary-degree Karatsuba changes in this PR shortly, but I'd like to get some baseline performance numbers in place first.

peterdettman · 2020-09-14T14:05:53Z

I'm intending to push arbitrary-degree Karatsuba changes in this PR shortly

The latest commit adds this (only relevant to mul, not sqr). It saves 45 limb multiplies, but adds the equivalent of 52 limb additions (counting 64-bit adds as 2). However that may not be the full story; from the paper:

However an inspection of the generated code also confirmed our suspicion that the ADK code generated more register-register move instructions and memory accesses. On some processors this could offset the ADK advantage.

On my 64-bit machine this is slower (as expected), but on real 32-bit hardware there's a good chance this is faster. I'd need someone to help with measuring that though. From what I can see on godbolt the choice of compiler leads to large differences in instruction count.

peterdettman · 2020-10-16T13:59:08Z

@gmaxwell I don't want to impose on you, but I had hoped you might collect some benchmarks for this PR.

gmaxwell · 2020-10-17T01:40:05Z

Woops! Will do. I'm excited about this work-- but I've been redoing all the power at home and have had systems in various states of off and disconnected and couldn't immediately try it and then forgot about it. Give me a day or two.

peterdettman · 2021-05-07T08:34:17Z

@gmaxwell Still hoping for some benchmarks here (separately for the first "rework" commit and the latter "Karatsuba" one).

If/when this is accepted I am hoping someone can update the corresponding asm to leave its final carry in the top limb (or in case Karatsuba proves fast, either abandon the asm or reimplement it along those lines).

At that point field implementations could be modified to support a maximum magnitude of 16. Some easy (but small) performance improvements then follow, and I am expecting it will give a little more room for per-group-operation magnitude limits (on input and output field elements) to be profitable.

gmaxwell · 2021-05-07T19:44:49Z

@peterdettman Oh crap. I am terribly sorry for completely forgetting your request.

I've tested on two different 32-bit arm systems: A faster superscalar cortex-a9,
and a slower in order cortex-a8 (which can dual issue, but only if the instructions
are paired right).

Common hardware wallets are usually Cortex-M4, which like the a8 is an in order core
with dual issue but due to pipeline differences I'm not sure how similar the results
would be. Some use M3 which also is 3-5 cycles "depending on values" (uh oh) for
a 32x32->64 mul. The slower M0 has a 32 cycle multiply... I'm not aware of anything
using this library on those, though I suppose that karatsuba would be a huge win there.
(I'd kinda like to see those numbers!)

Results seem a little paradoxical. On a8, rework seems to slow things down and karatsuba
speeds the mul back up. On a9 rework is a slight speedup, but karatsuba is a slight slowdown (though still faster than master).

With the older GCC the rework is a huge speedup.

As the ASM (which is IIRC the same algorithm as master, just with hand scheduling and
register allocation) and the older GCC results show, the performance is really sensitive
to compiler decisions. This might make it hard to draw conclusions on the algorithms
overall. I wouldn't be surprised if the small disadvantage for karatsuba on a9 went
away with somewhat more careful scheduling.

May be of interest to @laanwj too.

i.MX53 (Cortex-a8)
gcc version 9.3.0 (GCC)

Master w/ arm asm:
field_sqr: min 0.598us / avg 0.598us / max 0.599us
field_mul: min 0.810us / avg 0.811us / max 0.812us
ecdsa_verify: min 1556us / avg 1556us / max 1557us

Master:
field_sqr: min 0.666us / avg 0.666us / max 0.668us
field_mul: min 1.10us / avg 1.10us / max 1.11us
ecdsa_verify: min 1895us / avg 1896us / max 1896us

Rework:
field_sqr: min 0.716us / avg 0.718us / max 0.728us
field_mul: min 1.18us / avg 1.18us / max 1.19us
ecdsa_verify: min 2025us / avg 2025us / max 2027us

Karatsuba:
field_sqr: min 0.715us / avg 0.715us / max 0.717us
field_mul: min 0.983us / avg 0.984us / max 0.986us
ecdsa_verify: min 1829us / avg 1829us / max 1829us

i.MX6 (Cortex-a9)
gcc version 9.3.0 (GCC)

Master w/ arm asm:
field_sqr: min 0.292us / avg 0.292us / max 0.292us
field_mul: min 0.361us / avg 0.361us / max 0.362us
ecdsa_verify: min 740us / avg 740us / max 740us

Master:
field_sqr: min 0.342us / avg 0.343us / max 0.343us
field_mul: min 0.496us / avg 0.496us / max 0.497us
ecdsa_verify: min 919us / avg 919us / max 920us

Rework:
field_sqr: min 0.329us / avg 0.330us / max 0.330us
field_mul: min 0.495us / avg 0.495us / max 0.496us
ecdsa_verify: min 906us / avg 906us / max 907us

Karatsuba:
field_sqr: min 0.331us / avg 0.331us / max 0.331us
field_mul: min 0.500us / avg 0.501us / max 0.501us
ecdsa_verify: min 910us / avg 910us / max 911us

i.MX6 (Cortex-a9)
gcc version 4.9.2 (Debian 4.9.2-10+deb8u2)

Master w/ arm asm:
field_sqr: min 0.292us / avg 0.292us / max 0.292us
field_mul: min 0.361us / avg 0.361us / max 0.361us
ecdsa_verify: min 739us / avg 739us / max 740us

Master:
field_sqr: min 0.815us / avg 0.815us / max 0.816us
field_mul: min 0.940us / avg 0.941us / max 0.942us
ecdsa_verify: min 1755us / avg 1756us / max 1757us

Rework:
field_sqr: min 0.554us / avg 0.555us / max 0.556us
field_mul: min 0.597us / avg 0.598us / max 0.599us
ecdsa_verify: min 1164us / avg 1165us / max 1171us

Karatsuba:
field_sqr: min 0.555us / avg 0.556us / max 0.557us
field_mul: min 0.662us / avg 0.665us / max 0.668us
ecdsa_verify: min 1228us / avg 1229us / max 1229us

gmaxwell · 2021-05-09T02:48:47Z

Given that it was so compiler sensitive I thought it would be useful to upgrade to a newer GCC and try clang too:

Cortex-a8 gcc version 10.2.0 (GCC)
Master:
field_sqr: min 0.658us / avg 0.658us / max 0.668us
field_mul: min 1.13us / avg 1.13us / max 1.14us

Rework:
field_sqr: min 0.666us / avg 0.668us / max 0.671us
field_mul: min 1.03us / avg 1.03us / max 1.04us

Karatsuba:
field_sqr: min 0.667us / avg 0.667us / max 0.675us
field_mul: min 1.07us / avg 1.07us / max 1.08us

Cortex-a8 clang version 11.1.0
Master:
field_sqr: min 0.614us / avg 0.620us / max 0.643us
field_mul: min 1.33us / avg 1.34us / max 1.35us

Rework:
field_sqr: min 0.747us / avg 0.750us / max 0.760us
field_mul: min 1.35us / avg 1.36us / max 1.37us

Karatsuba:
field_sqr: min 0.746us / avg 0.752us / max 0.786us
field_mul: min 1.01us / avg 1.02us / max 1.03us

...chaos

peterdettman · 2021-05-11T14:10:55Z

Thanks, @gmaxwell. Somewhat irksome results though.

In the rework I tried loading a0..a9 so that the results can be written out to r immediately, but maybe it's worth testing a variation that doesn't attempt that change (so no a0..a9, but t0..t6 restored instead and written to r only after all reads from a).

Karatsuba might do better with manual scheduling as you say, so I hope someone can try an asm version of it (well of both versions ideally).

sipa · 2023-05-09T15:06:34Z

If this isn't faster on new compilers it's unclear if we should spend time on this. @peterdettman is there a smaller set of changes that we could consider for benchmarking?

With work progressing on fiat-crypto (#1261) changes like this may become more appealing, if they can get integrated into their optimization framework, as it simplifies our job in convincing it's correct.

Rework of 10x26 field mul/sqr

9388b77

- avoid overly-wide multiplications - save a few multiplies, masks and shifts - final residual left in r[9] instead of r[2]

peterdettman mentioned this pull request Sep 13, 2020

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

Closed

peterdettman added 2 commits September 14, 2020 13:55

Formatting

85db039

Arbitrary-degree Karatsuba for 10x26 mul

9618abc

laanwj mentioned this pull request Oct 17, 2020

Experimental features should be deprecated #817

Open

real-or-random mentioned this pull request Nov 24, 2020

Tighter verification bounds in mul/sqr #167

Closed

real-or-random mentioned this pull request Dec 1, 2022

64x64->64 muls are not constant-time with MSVC on 32bit x86 #1164

Open

real-or-random added the performance label May 9, 2023

peterdettman mentioned this pull request May 23, 2023

fiat-crypto + CryptOpt tracking issue #1261

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework of 10x26 field mul/sqr #815

Rework of 10x26 field mul/sqr #815

peterdettman commented Sep 13, 2020

real-or-random commented Sep 13, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Oct 16, 2020

gmaxwell commented Oct 17, 2020

peterdettman commented May 7, 2021 •

edited

Loading

gmaxwell commented May 7, 2021 •

edited

Loading

gmaxwell commented May 9, 2021

peterdettman commented May 11, 2021

sipa commented May 9, 2023

Rework of 10x26 field mul/sqr #815

Are you sure you want to change the base?

Rework of 10x26 field mul/sqr #815

Conversation

peterdettman commented Sep 13, 2020

real-or-random commented Sep 13, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Sep 14, 2020

peterdettman commented Oct 16, 2020

gmaxwell commented Oct 17, 2020

peterdettman commented May 7, 2021 • edited Loading

gmaxwell commented May 7, 2021 • edited Loading

gmaxwell commented May 9, 2021

peterdettman commented May 11, 2021

sipa commented May 9, 2023

peterdettman commented May 7, 2021 •

edited

Loading

gmaxwell commented May 7, 2021 •

edited

Loading