Implement lazy carries and reductions #15

mratsim · 2020-02-23T23:55:02Z

Context

Currently after each addition or substraction steps there is a reduction done if the result is over the field modulus.

Due to constant-time constraints, there is no shortcut if it is unnecessary, the memory accesses are always done.

Instead, at the cost of a couple bits, we could use lazy carries/reductions.

Instead of using 31 bits over 32-bit words or 63-bit over 64-bit word, we use less bits in each words. For example assuming we use 26 bits and 52 bits respectively from 32 or 64-bit words, we only need to reduce every 6 or 12 additions respectively.
This is desirable in particular for addition chains

Choosing a default: Maximizing the usage of the word overhead

256-bit curves are quite common:

secp256k1 for blockchain ECDSA / transaction signing
P256/secp256r1 or Curve25519 for Diffie-Hellman

Or closely 254-bit Barreto-Naehrig for zkSNARKS.

Representing a 256 or 254 bits curve in the most compact way on 64-bit arch requires 4 words. Assuming

we want at most a 1 word overhead per 256-bit
we want to maximize the laziness of reduction

52-bit logical words and 12 lazy bits or 51-bit logical words and 13 lazy bits would be the best logical word bitsizes. They both can represent 2^255 integers in 5 words (but a radix 2^52 representation can also represent the numbers 2^255 and 2^256 in 5 words)

Side-note on SIMD

This may also enable opportunities for SIMD vectorization using either integer or floating-point math.

2^24+1 is the first integer that cannot be represented in float32 though leaving 8-bit on the table might arguably be too much.
2^53+1 is the first integer that cannot be represented in float64
The new AVX512_IFMA instructions (Integer Fused-Multiply-Add) supports multiply-add of 52-bit integers: VPMADD52LUQ and VPMADD52HUQ

Using floating point for pairings is covered in this paper:

New software speed records for cryptographic pairings
Michael Naehrig, Ruben Niederhagen, and Peter Schwabe
https://cryptojedi.org/papers/dclxvi-20100714.pdf
Pairings on elliptic curves -- parameter selection and efficient computation
Michael Naehrig
https://cryptosith.org/michael/data/talks/2010-10-19-ECC.pdf

Side-note on "final substraction"-less Montgomery Multiplication/Exponentiation

With well-chosen word-size that allow redundant representations we can avoid final substraction in Montgomery Multiplication and Exponentiation:

Montgomery Exponentiation Needs No Final Subtractions
Colin D. Walter, 1999
https://pdfs.semanticscholar.org/0e6a/3e8f30b63b556679f5dff2cbfdfe9523f4fa.pdf
Montgomery’s Multiplication Technique:How to Make it Smaller and Faster
Colin D. Walter, 1999
https://colinandmargaret.co.uk/Research/CDW_CHES_99.pdf
Montgomery Exponentiation with no Final Subtractions: Improved Results
Gael Hachez and Jean-Jacques Quisquater, 2000
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.3181&rep=rep1&type=pdf
Montgomery Arithmetic from a Software Perspective
Joppe W. Bos and Peter L. Montgomery, 2017
https://eprint.iacr.org/2017/1057.pdf

Implementation strategy

The implementatons steps would be:

Change the WordBitSize at

constantine/constantine/config/common.nim

Lines 17 to 27 in d831011

    
           type Word* = Ct[uint32] 
        
             ## Logical BigInt word 
        
             ## A logical BigInt word is of size physical MachineWord-1 
        
           type DoubleWord* = Ct[uint64] 
        
           type BaseType* = uint32 
        
             ## Physical BigInt for conversion in "normal integers" 
        
           const 
        
             WordPhysBitSize* = sizeof(Word) * 8 
        
             WordBitSize* = WordPhysBitSize - 1

This can be made a {.intdefine.} for compile-time configuration

Ensure the encoding/decoding routine in io_bigints.nim properly deal with more than 1 unused bit
Field elements should now have a countdown that tracks how many potential carries are left given the free bits. If it reaches 0, they should call a normalization/reduction proc.
To be researched: add and sub may have to return a carry Word instead of a CtBool as the carry is not 0 or 1 anymore (but we never add the carry, it is used as an input for a optional reduction so might be unneeded)

References

Incomplete Reduction in Modular Arithmetic
T. Yanik, E. Savaş, and C. K.Koç, 2002
http://pdfs.semanticscholar.org/6f63/a8ae9a43777c710be44dfd7ee5b4c0f3defc.pdf
Curve25519: new Diffie-Hellman speed records
Daniel J. Bernstein, 2006
https://cr.yp.to/ecdh/curve25519-20060209.pdf
Faster Explicit Formulas for Computing Pairings over Ordinary Curves
Diego F. Aranha, Koray Karabina, Patrick Longa,Catherine H. Gebotys, Julio López, 2011
https://www.iacr.org/archive/eurocrypt2011/66320047/66320047.pdf
Elliptic Curve Cryptography at High Speeds
Patrick Longa, 2011
http://ecc2011.loria.fr/slides/longa.pdf
Efficient Implementation of Bilinear Pairings on ARM Processors
Gurleen Grewal, Reza Azarderakhsh, Patrick Longa, Shi Hu, and David Jao, 2012
https://eprint.iacr.org/2012/408.pdf
Efficient implementation of finite-field arithmetic
Peter Schwabe
https://cryptojedi.org/peter/data/pairing-20131122.pdf
Faster Compact Diffie-Hellman: Endomorphisms on the x-line
Craig Costello, Huseyin Hisil, Benjamin Smith, 2014
https://hal.inria.fr/hal-00932952/document
Software and Hardware Implementation of Elliptic Curve Cryptography
Jérémie Detrey, 2015
https://www.math.u-bordeaux.fr/~aenge/ecc2015/documents/detrey.pdf
Fast Software Implementations of Bilinear Pairings
Reza Azarderakhsh, Dieter Fishbein, Gurleen Grewal, Shi Hu, David Jao, Patrick Longa, and Rajeev Verma, 2016
http://cacr.uwaterloo.ca/techreports/2016/cacr2016-03.pdf
Slothful reduction
Michael Scott, 2017
https://eprint.iacr.org/2017/437

The text was updated successfully, but these errors were encountered:

mratsim · 2020-02-26T13:50:48Z

Some clarifications

As mentioned in Milagro docs, lazy carry and lazy reduction are separate, independant concepts.

Lazy reductions

Most prime modulus bitsize do not match exactly with a multiple of the logical word size of the library (2^3 or 2^63), for example the BN254 curve or BLS12-381 curve will leave respectively 63*5-254 = 61 and 63*7-381 = 60 unused bits in the last word. This can be used to accumulate up to 60~61 additions without worrying about overflowing in the physical representation.
And we would need to substract up to 60~61 * P to return to a fully reduced representation.

Not fully reducing

As an optimization we can stay in a unreduced representation by conditionally substracting half that range after that many substractions, taking advantage that it's unlikely that operands are in the same order of magnitude as the prime 2^254 or 2^381.

Fully reducing in logarithmic time

Or we can use a logarithmic approach by conditionally substracting half then a quarter then a eightth then ... of the excess range.
Given a word-bit size of 2^31 or 2^63, the excess bits are at most 30 or 62 and so we would need 4~5 conditional substractions every 30~62 additions.

As far as I know this is a novel approach that is not mentioned in the literature and was never used in software.

Dealing with substraction

Substraction of a reduced/unreduced operand by an unreduced operand must be handled specially:

either we follow Milagro path and we use negate + addition
or we follow Yanik or Schwabe path by using signed integers (or an emulation of)
or we follow the paper
High-Speed Elliptic Curve Cryptographyon the NVIDIA GT200 Graphics Processing Unit
Shujie Cui, Johann Großschadl, ZheLiu and Qiuliang Xu
https://www.doc.ic.ac.uk/~scui2/papers/Cui2014_GPU.pdf
which rely on the fact that substraction implemented as 4p + a - b will always be in the correct range when used in curve point addition and point doubling. Note that this may not hold for extension fields.
edit: This is also mentioned in Michael Scott paper: Slothful reduction p6 and p8

Removing the need of final substraction

With lazy reduction we have a redundant representation that can represent 2p 3p 4p ... 8p if there are enough excess bits. This may help Montgomery Multiplication and exponentiation by avoiding the need of the last conditional substraction (Walter, 1999, Hachez & Quisquater, 2000, Bos & Montgomery, 2017 and many hardware papers like Ors-Batina-Prenel-Vandewalle or Walter, 2017)

Note that the Almost Montgomery Multiplication of Gueron that checks the MSB to know if reduction is needed will leak information and we would still need to estimate if we need to remove p, 2p, ..., MaxExcess * p after each substraction

Lazy Carry

Lazy Carries are (almost) independent from Lazy Reductions, here there are excess bits within a machine word.
This allows to replace

constantine/constantine/arithmetic/bigints_raw.nim

Lines 313 to 324 in 2aec16d

    
           func add*(a: BigIntViewMut, b: BigIntViewAny): CTBool[Word] = 
        
             ## Constant-time in-place addition 
        
             ## Returns the carry 
        
             ## 
        
             ## a and b MAY be the same buffer 
        
             ## a and b MUST have the same announced bitlength (i.e. `bits` static parameters) 
        
             checkMatchingBitlengths(a, b) 
        
             for i in 0 ..< a.numLimbs(): 
        
               a[i] = a[i] + b[i] + Word(result) 
        
               result = a[i].isMsbSet() 
        
               a[i] = a[i].mask()

by

func add*(a: BigIntViewMut, b: BigIntViewAny) =
  checkMatchingBitlengths(a, b)

  for i in 0 ..< a.numLimbs():
    a[i] += b[i]
    a[i] = a[i].mask()

func modular_add(a: BigIntViewMut, b: BigIntViewAny, m: BigIntViewConst) =
  ## This is pseudocode and would be implemented at the Field-level and not BigInt level
  a.add(b)
  if a.potential_carries == MaxPotentialCarries:
    a.normalizeCarries()

Removing loop-carried dependencies and enabling easy vectorization by the compiler.
The only potential limitation is that the compiler cannot infer the loop unrolling, but the function becomes small enough that monomorphization+inlining would probably not increase codesize (separate functions require parameters and stack bookeeping before call and after call and also within the actual function call, this can easily require 10 to 20 instructions before doing anything useful, an inlined multilimb addition + mask on less than 20 limbs wouldn't increase codesize)

They also enable multiplication by constant less than the current word excess without converting them to Montgomery form and using full-blown Montgomery multiplication. This is useful for extension field that uses sqrt(-2) or sqrt(-5) as the solution to irreducible polynomials.

For normalization of lazy carries there is optional coupling with lazy reduction (hence the almost at the beginning of the paragraph) in that all the lazy carries must be accumulated somewhere, either they fit in the extra space of the last word WordBitSize - ModulusBitSize - Word Excess and we can handle this unreduced representation like in the lazy reduction case or we need an extra temporary word.

mratsim · 2020-03-07T15:03:33Z

Beginning of a proof-of-concept here: mratsim/finite-fields@2673a04

Lazy carry

This is very easy to build for addition/substraction/negation as we can ensure that assuming u = modulus.limbs[i] we always have u <= 2^ExcessBits * modulus.limbs[i] in particular for u == 0. With lazy carries all addition/substraction/negation are carryless thanks to the extra space per word.

Full reduction is also straightforward

However with multiplication, limbs multiplication can carry over to the next one, can it also exceed u <= 2^ExcessBits * modulus.limbs[i]?
Dealing with those may introduce a lot of complexity.

Lazy reduction

Lazy reduction instead is probably much less error prone to implement since only the high word holds the excess carries. It however is quite restricted as we can only have 2 excess bits for BN254 (4*64=256) or 3 for BLS12-381 (6*64 = 384)

mratsim · 2020-03-21T12:23:24Z

None of the pairing curves have a special form that make lazy reduction and logical word smaller than the physical word worth it.
Scott's paper Slothful reduction or the logarithmic reduction I mentioned makes debugging more complex and just delay the inevitable.

However, in some select cases we can use the technique in

High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves
Jean-Luc Beuchat and Jorge Enrique González Díaz and Shigeo Mitsunari and Eiji Okamoto and Francisco Rodríguez-Henríquez and Tadanori Teruya, 2010
https://eprint.iacr.org/2010/354

to delay reduction, but making that constant time would still require a table lookup that scans all the potential n*p multiple of the field modulus p to select the appropriate one for reduction.

Furthermore, #17, adds no-carry Montgomery multiplication and squaring that is usable for most pairing-friendly curves (up to 15% perf improvement). Optimizing multiplications and squarings is probably more important than additions and forcing a reduction check before each is costly. Even if we use a redundant representation with 2 extra bits to allow storing up to 4p to allow for "final substraction-less" modular multiplication that may me paying for the substraction upfront.

mratsim · 2020-06-06T08:49:17Z

For posterity, Patrick Longa's PhD Thesis dedicates 30 pages to lazy reduction and scheduling techniques to reduce or erase data dependencies

Followed by a dedicated section for pairings.
Notably we have these substraction formula to canonicalize a lazy reduced field element in case of double-precision arithmetic (i.e. using only 32-bit out of a 64-bit limb and leaving the extra 32 for carries)

This is followed by multiplication formula in FP2, Fp6 and Fp12 and squarings in Fp12 that uses the lazy reduction

https://uwspace.uwaterloo.ca/handle/10012/5857

However the space cost is high, especially in constant-time settings where you have to iterate over all the space all the time. Using more space also doesn't play well with CPU caches that are significantly more expensive than a data-dependency. Lastly since that time while the ADC instruction data dependency cost was 6 cycles, recent Intel CPUs only have a data-dependency cost of 2 cycles (but only 1 execution port AFAIK) provided you update a register (and not a memory location) and AMD is even less costly.

mratsim mentioned this issue Mar 7, 2020

Serialisation additional bits cfrg/draft-irtf-cfrg-bls-signature#18

Open

mratsim added a commit to mratsim/finite-fields that referenced this issue Mar 7, 2020

Lay out lazy carry according to mratsim/constantine#15

2673a04

mratsim mentioned this issue Mar 15, 2020

Internals refactor + renewed focus on perf #17

Merged

mratsim closed this as completed Mar 21, 2020

This was referenced Apr 6, 2020

Benchmarks Consensys/goff#12

Closed

Internal API: in-place vs result #21

Closed

mratsim added the performance 🏁 label Jun 6, 2020

mratsim mentioned this issue Jan 31, 2021

FpDbl revisited #144

Merged

4 tasks

mratsim mentioned this issue Feb 8, 2021

Double-Precision towering #155

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement lazy carries and reductions #15

Implement lazy carries and reductions #15

mratsim commented Feb 23, 2020 •

edited

Loading

mratsim commented Feb 26, 2020 •

edited

Loading

mratsim commented Mar 7, 2020 •

edited

Loading

mratsim commented Mar 21, 2020

mratsim commented Jun 6, 2020

Implement lazy carries and reductions #15

Implement lazy carries and reductions #15

Comments

mratsim commented Feb 23, 2020 • edited Loading

Context

Choosing a default: Maximizing the usage of the word overhead

Side-note on SIMD

Side-note on "final substraction"-less Montgomery Multiplication/Exponentiation

Implementation strategy

References

mratsim commented Feb 26, 2020 • edited Loading

Lazy reductions

Not fully reducing

Fully reducing in logarithmic time

Dealing with substraction

Removing the need of final substraction

Lazy Carry

mratsim commented Mar 7, 2020 • edited Loading

Lazy carry

Lazy reduction

mratsim commented Mar 21, 2020

mratsim commented Jun 6, 2020

mratsim commented Feb 23, 2020 •

edited

Loading

mratsim commented Feb 26, 2020 •

edited

Loading

mratsim commented Mar 7, 2020 •

edited

Loading