Crandall primes #445

Merged
merged 16 commits into from
Dec 3, 2024

@mratsim mratsim commented Jul 27, 2024

This closes #11 for primes of the form 2ᵐ-c (Crandall primes / pseudo-Mersenne primes), such as the ones used for Curve25519 and secp256k1 (Ethereum/Bitcoin).
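To illustrate why primes of the form 2ᵐ-c allow a cheap reduction, here is a minimal sketch on a toy modulus. It is not the code from this PR: the prime 2³²-5 and the names `reduce` and `mul_mod` are assumptions chosen so that everything fits in a `u64`, and the final branch is not constant-time. The idea is that 2ᵐ ≡ c (mod p), so the high part of a product can be folded back into the low part with one small multiplication.

```rust
/// Toy Crandall prime p = 2^32 - 5 (assumption: picked only so products fit in u64).
const M: u32 = 32;
const C: u64 = 5;
const P: u64 = (1u64 << M) - C;

/// Reduce any 64-bit value modulo P using the identity 2^M ≡ C (mod P).
fn reduce(x: u64) -> u64 {
    let mask = (1u64 << M) - 1;
    // First fold: x = hi * 2^M + lo ≡ hi * C + lo. The result is below 6 * 2^32.
    let t = (x >> M) * C + (x & mask);
    // Second fold: the high part is now at most 5, so the result is at most 2^32 + 24.
    let t = (t >> M) * C + (t & mask);
    // One conditional subtraction brings the value into [0, P).
    if t >= P { t - P } else { t }
}

/// Multiply two already-reduced values modulo P.
fn mul_mod(a: u64, b: u64) -> u64 {
    // Inputs below P < 2^32 guarantee the product fits in u64.
    debug_assert!(a < P && b < P);
    reduce(a * b)
}
```

For example, `mul_mod(P - 1, P - 1)` evaluates to 1, since P - 1 ≡ -1. A multi-limb implementation follows the same folding pattern on 256-bit values, with carries and constant-time selection instead of the `if`.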

Bench Fp vs Constantine master

Previous

image
image

Current

image
image

Analysis

  • Fp[Edwards25519] mul 1.27x improvement
  • Fp[Edwards25519] square 1.43x improvement
  • Fp[Secp256k1] mul 1.94x improvement
  • Fp[Secp256k1] square 1.28x improvement

Bench EC vs Constantine master

Previous

image

Current

image

Analysis

  • EC add projective constant-time improved by 1.36x
  • EC add jacobian constant-time improved by 1.34x
  • EC add projective vartime improved by 1.24x
  • EC add jacobian vartime improved by 1.37x
  • EC dbl projective constant-time improved by 1.31x
  • EC dbl jacobian constant-time improved by 1.06x

Bench vs bitcoin/secp256k1

image

  • field_sqr 12.4ns vs 8ns -> 1.55x
  • field_mul 15.8ns vs 10ns -> 1.58x
  • field_inv_ct 1410ns vs 1203ns -> 1.17x
  • field_inv_vt 820ns vs 848ns -> 0.97x
  • EC add jacobian var 247ns vs 97ns -> 2.55x
  • EC dbl jacobian var 97.8ns vs 145ns -> 0.67x
  • EC mixed add ct 189ns vs 225ns -> 0.84x
  • EC mixed add var 173ns vs 98ns -> 1.77x

image

  • EC scalar-mul ct 28100ns vs 40196ns -> 0.70x
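The ratios in these lists appear to be the other library's timing divided by Constantine's timing, so a value below 1 marks an operation where Constantine is slower. That reading is an assumption inferred from the numbers. The helper names below are hypothetical, and the data is copied from the bullets above.

```rust
/// Ratio of the other library's time to Constantine's time (assumed convention).
fn ratio(other_ns: f64, constantine_ns: f64) -> f64 {
    other_ns / constantine_ns
}

/// (operation, bitcoin/secp256k1 ns, Constantine ns), taken from the lists above.
const LIBSECP256K1_ROWS: [(&str, f64, f64); 9] = [
    ("field_sqr", 12.4, 8.0),
    ("field_mul", 15.8, 10.0),
    ("field_inv_ct", 1410.0, 1203.0),
    ("field_inv_vt", 820.0, 848.0),
    ("EC add jacobian var", 247.0, 97.0),
    ("EC dbl jacobian var", 97.8, 145.0),
    ("EC mixed add ct", 189.0, 225.0),
    ("EC mixed add var", 173.0, 98.0),
    ("EC scalar-mul ct", 28100.0, 40196.0),
];

/// Names of the operations where the ratio is below 1, i.e. Constantine is slower.
fn slower_ops<'a>(rows: &[(&'a str, f64, f64)]) -> Vec<&'a str> {
    rows.iter()
        .filter(|row| ratio(row.1, row.2) < 1.0)
        .map(|row| row.0)
        .collect()
}
```

Calling `slower_ops(&LIBSECP256K1_ROWS)` singles out the variable-time inversion, the Jacobian doubling, the constant-time mixed addition and the scalar multiplication.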

Analysis

The fact that field operations are about 1.5x faster while the elliptic curve operations are sometimes slower is suspicious. We probably need to check the EC formulae.

TODO

  • fix Windows
  • bounds checks for lazy reduction and lazily reduced field exponentiation in the 256-bit case, as IACR ePrint 2018/985
    indicates in Theorem 4 that its partial reduction may grow by 1 bit in the 256-bit case.
  • optimize the EC implementation to avoid the if/else check for ADX and limit input/output movement
  • optimize mixed add
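One possible way to avoid a per-call if/else check for ADX is to detect the CPU feature once and keep a function pointer to the selected kernel. The sketch below only illustrates that idea under stated assumptions: `mul_generic`, `mul_adx`, `select_mul` and `sum_of_products` are hypothetical names, both kernels are plain stand-ins rather than real assembly, and Constantine itself is written in Nim.

```rust
use std::sync::OnceLock;

type MulFn = fn(u64, u64) -> u128;

// Hypothetical stand-ins for a generic kernel and an ADX/BMI2-specialised kernel.
fn mul_generic(a: u64, b: u64) -> u128 { (a as u128) * (b as u128) }
fn mul_adx(a: u64, b: u64) -> u128 { (a as u128) * (b as u128) }

#[cfg(target_arch = "x86_64")]
fn cpu_has_adx() -> bool {
    std::is_x86_feature_detected!("adx") && std::is_x86_feature_detected!("bmi2")
}

#[cfg(not(target_arch = "x86_64"))]
fn cpu_has_adx() -> bool { false }

/// Detect the CPU features on first use and cache the chosen kernel.
fn select_mul() -> MulFn {
    static CHOSEN: OnceLock<MulFn> = OnceLock::new();
    *CHOSEN.get_or_init(|| {
        let f: MulFn = if cpu_has_adx() { mul_adx } else { mul_generic };
        f
    })
}

/// A caller performing many multiplications looks up the kernel once,
/// so the inner loop contains no feature check.
fn sum_of_products(pairs: &[(u64, u64)]) -> u128 {
    let mul = select_mul();
    pairs.iter().map(|&(a, b)| mul(a, b)).sum()
}
```

Applied to an EC addition, the same pattern would mean choosing the ADX or non-ADX path once per point operation instead of once per field multiplication.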

mratsim commented Jul 27, 2024

Bench vs RustCrypto/elliptic-curves

https://github.com/RustCrypto/elliptic-curves/ is the current record holder on https://programming-language-benchmarks.vercel.app/problem/secp256k1

We modify it to benchmark some of its internals.

Field implementation

cargo bench --features expose-field -- field

with an extra benchmark:

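// Times ten consecutive field additions per iteration; only the last result is returned from the closure.
fn bench_field_element_10adds<'a, M: Measurement>(group: &mut BenchmarkGroup<'a, M>) {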
    let x = test_field_element_x();
    let y = test_field_element_y();
    group.bench_function("10 adds", |b| b.iter(
        || {
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y);
            &black_box(x) + &black_box(y)
        }
    ));
}

image

  • 10 adds: 25ns vs 12ns - 2.08x
  • mul (partially normalized in k256): 17.825ns vs 10ns - 1.78x
  • sqr (partially normalized in k256): 13.846ns vs 8ns - 1.73x

EC implementation (projective with Renes2015 formulae)

use criterion::{
    black_box, criterion_group, criterion_main, measurement::Measurement, BenchmarkGroup, Criterion,
};
use k256::ProjectivePoint;
use elliptic_curve::{
    rand_core::SeedableRng,
    group::Group,
};
use rand_xorshift::XorShiftRng;

fn bench_ec_add<'a, M: Measurement>(group: &mut BenchmarkGroup<'a, M>) {
    let mut rng = XorShiftRng::seed_from_u64(1234u64);
    let p = ProjectivePoint::random(&mut rng);
    let q = ProjectivePoint::random(&mut rng);
    group.bench_function("EC Add", |b| {
        b.iter(|| &black_box(p) + &black_box(q))
    });
}

fn bench_ec_dbl<'a, M: Measurement>(group: &mut BenchmarkGroup<'a, M>) {
    let mut rng = XorShiftRng::seed_from_u64(1234u64);
    let p = ProjectivePoint::random(&mut rng);
    group.bench_function("EC Dbl", |b| {
        b.iter(|| black_box(p).double())
    });
}

fn bench_ec(c: &mut Criterion) {
    let mut group = c.benchmark_group("EC operations");
    bench_ec_add(&mut group);
    bench_ec_dbl(&mut group);
    group.finish();
}

criterion_group!(benches, bench_ec);
criterion_main!(benches);

image

  • EC add proj ct: 195.83ns vs 232ns - 0.84x
  • EC dbl proj ct: 130.83ns vs 153ns - 0.86x

Analysis

The fact that field operations are 1.7x to 2x faster while the elliptic curve operations run at only about 0.85x the speed is extremely suspicious, especially since we implement the same formulae from the Renes2015 paper.

There might be useless copies or parameter passing overhead similar to #21 and #146.

@mratsim mratsim linked an issue Dec 3, 2024 that may be closed by this pull request
@mratsim mratsim merged commit 585f803 into master Dec 3, 2024
12 checks passed
@mratsim mratsim deleted the crandall-primes branch December 3, 2024 12:22
Linked issues that may be closed by this pull request: "Windows: Secp256k1 tests assembly test frozen" and "Finite field computation for moduli of special form".