Kyber ASM ARMv7E-M/ARMv7-M: added assembly code #7706

Merged (1 commit) on Oct 3, 2024

@SparkiDev SparkiDev commented Jul 3, 2024

Description

Improved performance by reworking kyber_ntt, kyber_invntt, kyber_basemul_mont, and kyber_basemul_mont_add in assembly.
Replaced WOLFSSL_SP_NO_UMAAL with WOLFSSL_ARM_ARCH_7M.
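
For readers unfamiliar with what these routines compute, here is a minimal C sketch of the arithmetic an NTT over the Kyber modulus is typically built from: a Montgomery reduction modulo q = 3329 and one Cooley-Tukey butterfly. It follows the style of the public Kyber reference code and is not the assembly added by this PR, which is not shown in this thread. The names mont_reduce, fqmul and ct_butterfly are illustrative assumptions, not wolfSSL identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define KYBER_Q     3329     /* Kyber modulus q */
#define KYBER_QINV  (-3327)  /* q^-1 mod 2^16, as a signed 16-bit value */

/* Montgomery reduction: returns r with r * 2^16 == a (mod q) and |r| < q.
 * Narrowing casts to int16_t assume the usual two's-complement wrap. */
static int16_t mont_reduce(int32_t a)
{
    int16_t t;

    /* Precondition that keeps the result inside (-q, q). */
    assert(a > -(KYBER_Q * 32768) && a < (KYBER_Q * 32768));

    /* t = a * q^-1 mod 2^16 (low 16 bits, read as signed). */
    t = (int16_t)((int16_t)a * KYBER_QINV);

    /* a - t*q is an exact multiple of 2^16, so this division is exact. */
    return (int16_t)((a - (int32_t)t * KYBER_Q) / 65536);
}

/* Multiply two coefficients and reduce (result carries a 2^-16 factor). */
static int16_t fqmul(int16_t a, int16_t b)
{
    return mont_reduce((int32_t)a * b);
}

/* One Cooley-Tukey butterfly on a coefficient pair, using twiddle zeta
 * (zeta is expected in Montgomery form, i.e. pre-multiplied by 2^16). */
static void ct_butterfly(int16_t *lo, int16_t *hi, int16_t zeta)
{
    int16_t t = fqmul(zeta, *hi);

    *hi = (int16_t)(*lo - t);
    *lo = (int16_t)(*lo + t);
}
```

A full NTT applies this butterfly across several layers with a table of twiddle factors; in C the reduction costs several instructions per coefficient, which is the kind of inner loop that hand-written assembly usually targets.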

Testing

./configure '--disable-shared' '--enable-experimental' '--enable-kyber' '--enable-cryptonly' '--disable-rsa' '--disable-dh' '--disable-ecc' 'LDFLAGS=--static' '--host=armv7m' 'CC=arm-linux-gnueabi-gcc' '--enable-armasm'

Checklist

  • added tests
  • updated/added doxygen
  • updated appropriate READMEs
  • Updated manual and documentation

@SparkiDev SparkiDev self-assigned this Jul 3, 2024

dgarske commented Jul 3, 2024

Tested on STM32H7A3ZI at 240 MHz (Cortex-M7)

Using:

#define WOLFSSL_EXPERIMENTAL_SETTINGS

#define WOLFSSL_SHA3
#define WOLFSSL_SHAKE128
#define WOLFSSL_SHAKE256

#define WOLFSSL_HAVE_KYBER
#define WOLFSSL_WC_KYBER
//#define WOLFSSL_KYBER_SMALL

#define WOLFSSL_ARMASM
#define WOLFSSL_ARMASM_INLINE
#define WOLFSSL_ARMASM_NO_HW_CRYPTO
#define WOLFSSL_ARMASM_NO_NEON
#define WOLFSSL_ARMASM_CRYPTO_SHA3
#define WOLFSSL_ARM_ARCH 7

Current Master (before this PR):

RNG                        975 KiB took 1.024 seconds,  952.148 KiB/s
SHA-256                      3 MiB took 1.004 seconds,    3.088 MiB/s
SHA3-224                     1 MiB took 1.012 seconds,    1.399 MiB/s
SHA3-256                     1 MiB took 1.016 seconds,    1.322 MiB/s
SHA3-384                     1 MiB took 1.000 seconds,    1.025 MiB/s
SHA3-512                   750 KiB took 1.016 seconds,  738.189 KiB/s
SHAKE128                     2 MiB took 1.004 seconds,    1.605 MiB/s
SHAKE256                     1 MiB took 1.015 seconds,    1.323 MiB/s
KYBER512    128  key gen       220 ops took 1.008 sec, avg 4.582 ms, 218.254 ops/sec
KYBER512    128    encap       202 ops took 1.000 sec, avg 4.950 ms, 202.000 ops/sec
KYBER512    128    decap       182 ops took 1.000 sec, avg 5.495 ms, 182.000 ops/sec
KYBER768    192  key gen       142 ops took 1.011 sec, avg 7.120 ms, 140.455 ops/sec
KYBER768    192    encap       124 ops took 1.000 sec, avg 8.065 ms, 124.000 ops/sec
KYBER768    192    decap       114 ops took 1.008 sec, avg 8.842 ms, 113.095 ops/sec
KYBER1024   256  key gen        92 ops took 1.011 sec, avg 10.989 ms, 90.999 ops/sec
KYBER1024   256    encap        82 ops took 1.012 sec, avg 12.341 ms, 81.028 ops/sec
KYBER1024   256    decap        76 ops took 1.016 sec, avg 13.368 ms, 74.803 ops/sec

With PR 7706:

RNG                        975 KiB took 1.016 seconds,  959.646 KiB/s
SHA-256                      3 MiB took 1.004 seconds,    2.967 MiB/s
SHA3-224                     1 MiB took 1.015 seconds,    1.395 MiB/s
SHA3-256                     1 MiB took 1.000 seconds,    1.318 MiB/s
SHA3-384                     1 MiB took 1.004 seconds,    1.021 MiB/s
SHA3-512                   750 KiB took 1.019 seconds,  736.016 KiB/s
SHAKE128                     2 MiB took 1.008 seconds,    1.599 MiB/s
SHAKE256                     1 MiB took 1.004 seconds,    1.313 MiB/s
KYBER512    128  key gen       238 ops took 1.000 sec, avg 4.202 ms, 238.000 ops/sec
KYBER512    128    encap       226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
KYBER512    128    decap       212 ops took 1.000 sec, avg 4.717 ms, 212.000 ops/sec
KYBER768    192  key gen       156 ops took 1.008 sec, avg 6.462 ms, 154.762 ops/sec
KYBER768    192    encap       140 ops took 1.012 sec, avg 7.229 ms, 138.340 ops/sec
KYBER768    192    decap       132 ops took 1.007 sec, avg 7.629 ms, 131.082 ops/sec
KYBER1024   256  key gen       102 ops took 1.016 sec, avg 9.961 ms, 100.394 ops/sec
KYBER1024   256    encap        90 ops took 1.000 sec, avg 11.111 ms, 90.000 ops/sec
KYBER1024   256    decap        86 ops took 1.000 sec, avg 11.628 ms, 86.000 ops/sec

Note: the benchmark won't run Kyber unless -kyber is passed or this patch is applied:

diff --git a/wolfcrypt/benchmark/benchmark.c b/wolfcrypt/benchmark/benchmark.c
index 964f9ebd0..1082de63c 100644
--- a/wolfcrypt/benchmark/benchmark.c
+++ b/wolfcrypt/benchmark/benchmark.c
@@ -3593,17 +3593,17 @@ static void* benchmarks_do(void* args)
 #ifdef WOLFSSL_HAVE_KYBER
     if (bench_all || (bench_pq_asym_algs & BENCH_KYBER)) {
     #ifdef WOLFSSL_KYBER512
-        if (bench_pq_asym_algs & BENCH_KYBER512) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER512)) {
             bench_kyber(KYBER512);
         }
     #endif
     #ifdef WOLFSSL_KYBER768
-        if (bench_pq_asym_algs & BENCH_KYBER768) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER768)) {
             bench_kyber(KYBER768);
         }
     #endif
     #ifdef WOLFSSL_KYBER1024
-        if (bench_pq_asym_algs & BENCH_KYBER1024) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER1024)) {
             bench_kyber(KYBER1024);
         }
     #endif
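
The effect of the patch can be sketched in isolation. This assumes, as the note above implies, that a run with bench_all set leaves the per-variant bits in bench_pq_asym_algs cleared. The bit value and helper names below are hypothetical, chosen only for illustration; the real BENCH_KYBER* constants live in benchmark.c.

```c
#include <assert.h>

/* Hypothetical bit value; the real constant is defined in benchmark.c. */
#define EX_BENCH_KYBER512 0x01u

/* Inner check before the patch: only the per-variant bit is consulted,
 * so bench_all alone never reaches bench_kyber(). */
static int runs_variant_before(int bench_all, unsigned int mask,
                               unsigned int variant_bit)
{
    assert(variant_bit != 0u);
    (void)bench_all; /* ignored by the old inner condition */
    return (mask & variant_bit) != 0u;
}

/* Inner check after the patch: bench_all also enables each variant. */
static int runs_variant_after(int bench_all, unsigned int mask,
                              unsigned int variant_bit)
{
    assert(variant_bit != 0u);
    return bench_all || (mask & variant_bit) != 0u;
}
```

With bench_all set and an empty mask, the first helper returns 0 and the second returns 1, which matches the behaviour described: the outer condition passed, but each inner variant check was skipped.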

Improved performance by reworking kyber_ntt, kyber_invntt,
kyber_basemul_mont, kyber_basemul_mont_add, kyber_rej_uniform_c to be
in assembly.
Replace WOLFSSL_SP_NO_UMAAL with WOLFSSL_ARM_ARCH_7M
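
For context on the macro change: UMAAL is a DSP-extension instruction available on ARMv7E-M cores (for example Cortex-M4/M7) but not on plain ARMv7-M (for example Cortex-M3), which is presumably why the switch is now expressed as an architecture macro. The sketch below emulates the instruction's semantics in portable C; the function name umaal_emulated is an assumption for illustration and does not appear in wolfSSL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates UMAAL RdLo, RdHi, Rn, Rm:
 *   RdHi:RdLo = Rn * Rm + RdLo + RdHi   (unsigned, 64-bit result)
 * The sum cannot overflow 64 bits:
 *   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal_emulated(uint32_t *rd_lo, uint32_t *rd_hi,
                           uint32_t rn, uint32_t rm)
{
    uint64_t acc;

    assert(rd_lo != NULL && rd_hi != NULL);

    acc = (uint64_t)rn * rm + *rd_lo + *rd_hi;
    *rd_lo = (uint32_t)acc;         /* low 32 bits  */
    *rd_hi = (uint32_t)(acc >> 32); /* high 32 bits */
}
```

On a core without UMAAL, the same result has to be assembled from a multiply plus separate carry-propagating additions, so code paths that rely on it need an alternative for ARMv7-M.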
@SparkiDev

retest this please

@SparkiDev SparkiDev changed the title Kyber ASM ARMv7E-M: added assembly code Kyber ASM ARMv7E-M/ARMv7-M: added assembly code Oct 3, 2024

dgarske commented Oct 3, 2024

Re-ran on the same target STM32H7A3ZI at 240MHz with -Os:
Seems to be about 50% faster!

Please select one of the above options:
Running wolfCrypt Benchmarks...
wolfCrypt Benchmark (block bytes 1024, min 1.0 sec each)
RNG                          1 MiB took 1.011 seconds,    1.087 MiB/s
AES-128-CBC-enc              1 MiB took 1.000 seconds,    1.489 MiB/s
AES-128-CBC-dec              1 MiB took 1.004 seconds,    1.483 MiB/s
AES-192-CBC-enc              1 MiB took 1.012 seconds,    1.254 MiB/s
AES-192-CBC-dec              1 MiB took 1.007 seconds,    1.236 MiB/s
AES-256-CBC-enc              1 MiB took 1.016 seconds,    1.081 MiB/s
AES-256-CBC-dec              1 MiB took 1.008 seconds,    1.066 MiB/s
AES-128-GCM-enc           1000 KiB took 1.004 seconds,  996.016 KiB/s
AES-128-GCM-dec           1000 KiB took 1.016 seconds,  984.252 KiB/s
AES-192-GCM-enc            900 KiB took 1.020 seconds,  882.353 KiB/s
AES-192-GCM-dec            875 KiB took 1.000 seconds,  875.000 KiB/s
AES-256-GCM-enc            800 KiB took 1.004 seconds,  796.813 KiB/s
AES-256-GCM-dec            800 KiB took 1.016 seconds,  787.402 KiB/s
AES-128-GCM-enc-no_AAD       1 MiB took 1.019 seconds,    0.982 MiB/s
AES-128-GCM-dec-no_AAD    1000 KiB took 1.004 seconds,  996.016 KiB/s
AES-192-GCM-enc-no_AAD     900 KiB took 1.012 seconds,  889.328 KiB/s
AES-192-GCM-dec-no_AAD     900 KiB took 1.019 seconds,  883.219 KiB/s
AES-256-GCM-enc-no_AAD     800 KiB took 1.000 seconds,  800.000 KiB/s
AES-256-GCM-dec-no_AAD     800 KiB took 1.012 seconds,  790.514 KiB/s
GMAC Table 4-bit             3 MiB took 1.000 seconds,    2.907 MiB/s
CHACHA                       7 MiB took 1.000 seconds,    7.056 MiB/s
CHA-POLY                     5 MiB took 1.004 seconds,    4.936 MiB/s
POLY1305                    29 MiB took 1.000 seconds,   29.297 MiB/s
SHA-256                      3 MiB took 1.000 seconds,    3.027 MiB/s
SHA3-224                     1 MiB took 1.012 seconds,    1.423 MiB/s
SHA3-256                     1 MiB took 1.000 seconds,    1.343 MiB/s
SHA3-384                     1 MiB took 1.012 seconds,    1.037 MiB/s
SHA3-512                   750 KiB took 1.008 seconds,  744.048 KiB/s
SHAKE128                     2 MiB took 1.012 seconds,    1.640 MiB/s
SHAKE256                     1 MiB took 1.016 seconds,    1.346 MiB/s
HMAC-SHA256                  3 MiB took 1.000 seconds,    3.003 MiB/s
RSA     2048   public       100 ops took 1.004 sec, avg 10.040 ms, 99.602 ops/sec
RSA     2048  private         4 ops took 1.619 sec, avg 404.750 ms, 2.471 ops/sec
DH      2048  key gen         7 ops took 1.149 sec, avg 164.143 ms, 6.092 ops/sec
DH      2048    agree         8 ops took 1.309 sec, avg 163.625 ms, 6.112 ops/sec
KYBER512    128  key gen       362 ops took 1.000 sec, avg 2.762 ms, 362.000 ops/sec
KYBER512    128    encap       352 ops took 1.000 sec, avg 2.841 ms, 352.000 ops/sec
KYBER512    128    decap       260 ops took 1.004 sec, avg 3.862 ms, 258.964 ops/sec
KYBER768    192  key gen       224 ops took 1.008 sec, avg 4.500 ms, 222.222 ops/sec
KYBER768    192    encap       210 ops took 1.004 sec, avg 4.781 ms, 209.163 ops/sec
KYBER768    192    decap       160 ops took 1.000 sec, avg 6.250 ms, 160.000 ops/sec
KYBER1024   256  key gen       136 ops took 1.008 sec, avg 7.412 ms, 134.921 ops/sec
KYBER1024   256    encap       130 ops took 1.008 sec, avg 7.754 ms, 128.968 ops/sec
KYBER1024   256    decap       104 ops took 1.004 sec, avg 9.654 ms, 103.586 ops/sec
ECC   [      SECP256R1]   256  key gen       226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
ECDHE [      SECP256R1]   256    agree       116 ops took 1.008 sec, avg 8.690 ms, 115.079 ops/sec
ECDSA [      SECP256R1]   256     sign       104 ops took 1.008 sec, avg 9.692 ms, 103.175 ops/sec
ECDSA [      SECP256R1]   256   verify        68 ops took 1.024 sec, avg 15.059 ms, 66.406 ops/sec
ML-DSA    44  key gen        76 ops took 1.020 sec, avg 13.421 ms, 74.510 ops/sec
ML-DSA    44     sign        30 ops took 1.047 sec, avg 34.900 ms, 28.653 ops/sec
ML-DSA    44   verify        74 ops took 1.008 sec, avg 13.622 ms, 73.413 ops/sec
ML-DSA    65  key gen        44 ops took 1.024 sec, avg 23.273 ms, 42.969 ops/sec
ML-DSA    65     sign        18 ops took 1.137 sec, avg 63.167 ms, 15.831 ops/sec
ML-DSA    65   verify        44 ops took 1.000 sec, avg 22.727 ms, 44.000 ops/sec
ML-DSA    87  key gen        26 ops took 1.019 sec, avg 39.192 ms, 25.515 ops/sec
ML-DSA    87     sign        14 ops took 1.090 sec, avg 77.857 ms, 12.844 ops/sec
ML-DSA    87   verify        26 ops took 1.004 sec, avg 38.615 ms, 25.896 ops/sec
Benchmark complete
Benchmark Test: Return code 0
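
The "about 50% faster" estimate can be checked against the ops/sec columns posted in this thread. The helper below is a small sketch for that arithmetic; speedup_percent is a hypothetical name, and the comparison assumes the earlier master run and this run are otherwise comparable (the optimization level of the earlier run is not stated).

```c
#include <assert.h>

/* Percentage gain in throughput when going from before_ops to after_ops
 * (both in ops/sec). A return value of 50.0 means 1.5x the throughput. */
static double speedup_percent(double before_ops, double after_ops)
{
    assert(before_ops > 0.0);
    return (after_ops / before_ops - 1.0) * 100.0;
}
```

Using the figures above, KYBER512 key gen goes from 218.254 to 362.000 ops/sec (roughly 66%), while KYBER512 decap goes from 182.000 to 258.964 ops/sec (roughly 42%), so the gain varies by operation around the quoted figure.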

@dgarske dgarske merged commit afe5209 into wolfSSL:master Oct 3, 2024
139 checks passed