Perf: Assembly code generator for ARM and ARM64 #200

Open
mratsim opened this issue Aug 6, 2022 · 2 comments

Comments

@mratsim
Owner

mratsim commented Aug 6, 2022

#69 introduced an assembly code generator for x86 and x86-64
at https://github.com/mratsim/constantine/blob/7d29cb9/constantine/platforms/isa/macro_assembler_x86.nim

We need the same for ARM and ARM64, for efficiency on Raspberry Pi, phones, Apple Silicon and other resource-restricted devices.

Efficient multiplication on ARM:

Related papers:

https://eprint.iacr.org/2021/185.pdf

No Silver Bullet: Optimized Montgomery Multiplication on Various 64-bit ARM Platforms

Abstract

In this paper, we firstly presented optimized implementations of Montgomery multiplication on 64-bit ARM processors by taking advantages of Karatsuba algorithm and efficient multiplication instruction sets for ARM64 architectures. The implementation of Montgomery multiplication can improve the performance of (pre-quantum and post-quantum) public key cryptography (e.g. CSIDH, ECC, and RSA) implementations on ARM64 architectures, directly. Last but not least, the performance of Karatsuba algorithm does not ensure the fastest speed record on various ARM architectures, while it is determined by the clock cycles per multiplication instruction of target ARM architectures. In particular, recent Apple processors based on ARM64 architecture show lower cycles per instruction of multiplication than that of ARM Cortex-A series. For this reason, the schoolbook method shows much better performance than the sophisticated Karatsuba algorithm on Apple processors. With this observation, we can determine the proper approach for multiplication of cryptography library (e.g. Microsoft-SIDH) on Apple processors and ARM Cortex-A process
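
As a rough illustration of the trade-off the abstract describes, the sketch below counts 64x64→128-bit limb products for a schoolbook multiplication and for a recursive subtractive Karatsuba multiplication. It is plain Python, not Constantine code and not the paper's implementation; the function names, the little-endian limb layout and the power-of-two limb count are assumptions made for the example.

```python
MASK = (1 << 64) - 1  # one 64-bit limb

def int_to_limbs(x, n):
    """Split x into n little-endian 64-bit limbs (hypothetical helper)."""
    return [(x >> (64 * i)) & MASK for i in range(n)]

def limbs_to_int(limbs):
    """Recombine little-endian 64-bit limbs into an integer (hypothetical helper)."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def schoolbook_mul(a, b, counter):
    """Operand-scanning schoolbook multiplication on limb lists.

    counter["mul"] is incremented once per 64x64->128-bit limb product.
    """
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            counter["mul"] += 1
            t = r[i + j] + ai * bj + carry  # always fits in 128 bits
            r[i + j] = t & MASK
            carry = t >> 64
        r[i + len(b)] = carry
    return r

def karatsuba_mul(x, y, n_limbs, counter):
    """Recursive subtractive Karatsuba on integers of n_limbs limbs.

    Assumes n_limbs is a power of two. Only limb products are counted;
    the extra additions, subtractions and sign handling are not.
    """
    if n_limbs == 1:
        counter["mul"] += 1
        return x * y
    half = n_limbs // 2
    shift = 64 * half
    low_mask = (1 << shift) - 1
    x0, x1 = x & low_mask, x >> shift
    y0, y1 = y & low_mask, y >> shift
    lo = karatsuba_mul(x0, y0, half, counter)
    hi = karatsuba_mul(x1, y1, half, counter)
    # (x0 - x1) * (y1 - y0) = x0*y1 + x1*y0 - lo - hi
    dx, dy = x0 - x1, y1 - y0
    sign = -1 if (dx < 0) != (dy < 0) else 1
    cross = karatsuba_mul(abs(dx), abs(dy), half, counter)
    mid = lo + hi + sign * cross  # equals x0*y1 + x1*y0
    return lo + (mid << shift) + (hi << (2 * shift))

if __name__ == "__main__":
    x, y = (1 << 256) - 189, (1 << 255) - 19  # arbitrary 4-limb operands
    sb, ka = {"mul": 0}, {"mul": 0}
    assert limbs_to_int(schoolbook_mul(int_to_limbs(x, 4), int_to_limbs(y, 4), sb)) == x * y
    assert karatsuba_mul(x, y, 4, ka) == x * y
    print("schoolbook limb products:", sb["mul"])
    print("karatsuba limb products:", ka["mul"])
```

For 4 limbs (256 bits) the schoolbook version performs 16 limb products and this Karatsuba variant 9, in exchange for additions and subtractions that the counter does not track. Whether that trade pays off depends on how expensive a multiplication is relative to an addition on the target CPU.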

@mratsim
Owner Author

mratsim commented Aug 6, 2022

Relevant:

@mratsim
Owner Author

mratsim commented Feb 11, 2024

https://eprint.iacr.org/2021/185.pdf is particularly interesting regarding general ARM CPUs and Apple CPUs:

image

Multiplications are about 3x slower than additions on the Raspberry Pi 4 (Rpi4) but run at roughly the same speed as additions on Apple CPUs.
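
A back-of-the-envelope way to read that ratio, as a minimal sketch: an algorithm that trades multiplications for additions (Karatsuba, for instance) only wins when the multiplications it saves cost more than the additions it adds. The function names and the operation counts in the example (4 multiplications saved for 8 extra additions) are hypothetical placeholders, not measurements from the paper, and the model ignores pipelining, register pressure and memory traffic.

```python
def break_even_ratio(saved_muls, extra_adds):
    """Mul/add cost ratio above which trading additions for multiplications pays off."""
    return extra_adds / saved_muls

def trade_pays_off(saved_muls, extra_adds, mul_cost, add_cost=1.0):
    """True if the multiplications saved outweigh the additions introduced."""
    return saved_muls * mul_cost > extra_adds * add_cost

if __name__ == "__main__":
    # Hypothetical counts: 4 multiplications saved, 8 additions added.
    saved, extra = 4, 8
    print("break-even mul/add ratio:", break_even_ratio(saved, extra))
    print("mul 3x slower than add:", trade_pays_off(saved, extra, mul_cost=3.0))
    print("mul as fast as add:", trade_pays_off(saved, extra, mul_cost=1.0))
```

With these placeholder counts the break-even ratio is 2, so the trade pays off at a 3x ratio and does not when multiplication and addition cost about the same.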
