Add bignum_mont{sqr,mul}_p256_neon for Arm #118

aqjune-aws · 2024-03-28T16:32:30Z

This patch adds bignum_mont{sqr,mul}_p256_neon functions.

These are vectorized and instruction-rescheduled versions of bignum_mont{sqr,mul}_p256.
They are verified using the equivalence checking tactics.
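As a point of reference, here is a small Python sketch of what the scalar and vectorized functions are both expected to compute. It assumes the usual Montgomery convention with R = 2^256, i.e. `montmul` returns (x * y / 2^256) mod p_256; that convention and the helper names `to_mont` and `from_mont` are assumptions for illustration, not taken from this patch. The patch itself establishes agreement by formal equivalence proof, not by testing against a model like this.

```python
# Reference model only; not the verified assembly or its HOL Light spec.
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1   # the NIST P-256 field prime
R = 2**256                                     # assumed Montgomery radix
R_INV = pow(R, -1, P256)                       # modular inverse (Python 3.8+)

def montmul_p256(x, y):
    """Assumed contract: (x * y / 2^256) mod p_256."""
    return (x * y * R_INV) % P256

def montsqr_p256(x):
    """Assumed contract: (x^2 / 2^256) mod p_256."""
    return montmul_p256(x, x)

def to_mont(a):
    """Hypothetical helper: map a to Montgomery form a * R mod p_256."""
    return (a * R) % P256

def from_mont(a):
    """Hypothetical helper: map back from Montgomery form."""
    return (a * R_INV) % P256

if __name__ == "__main__":
    a, b = 123456789, 987654321
    prod = from_mont(montmul_p256(to_mont(a), to_mont(b)))
    print(prod == (a * b) % P256)
```

Any `_neon` variant should agree with the scalar function on all inputs covered by the specification, which is what the equivalence checking tactics are used to show.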

A new bash script tools/external/slothy.sh is added to help reproduce the optimized output.
The 'intermediate' versions of the two functions are included as comments in the two assembly files.

Additionally,

- A new instruction umull2 is formalized and added to the simulator in order to verify the new functions.
- Old *_neon functions' proofs are refactored a bit.
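For readers unfamiliar with the instruction, the following is a rough Python model of `umull2` in its `2D`/`4S` arrangement as described in the Arm architecture documentation: it multiplies the upper two unsigned 32-bit lanes of each source vector and produces two 64-bit products. This is only an illustration, not the formalization added in this patch; the function names and the list-of-lanes representation are made up for the example, and it is an assumption that this is the arrangement the new functions use.

```python
MASK32 = (1 << 32) - 1

def umull2_2d(vn, vm):
    """Model of UMULL2 Vd.2D, Vn.4S, Vm.4S.

    vn, vm: lists of four unsigned 32-bit lanes, lane 0 first.
    Returns two unsigned 64-bit lanes built from lanes 2 and 3.
    """
    assert len(vn) == 4 and len(vm) == 4
    return [(vn[i] & MASK32) * (vm[i] & MASK32) for i in (2, 3)]

def umull_2d(vn, vm):
    """Model of UMULL Vd.2D, Vn.2S, Vm.2S (lower half), for comparison."""
    return [(vn[i] & MASK32) * (vm[i] & MASK32) for i in (0, 1)]

if __name__ == "__main__":
    print(umull2_2d([1, 2, 3, 4], [5, 6, 7, 8]))
```

Since each product of two 32-bit values fits in 64 bits, no truncation is needed on the result lanes.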

aqjune · 2024-04-01T16:45:11Z

TODO: add instructions for reproducing the rescheduling optimization

hanno-becker · 2024-04-02T05:40:11Z

@aqjune Thanks a lot! Yes, it would be good to have

- the hybrid code that was used as the input to the SLOTHY optimization
- the exact command you used to optimize the code
- the commit hash of the SLOTHY version you were using (unfortunately, I haven't yet started versioning SLOTHY properly)

Which uArchs benefit from this new version?

aqjune-aws · 2024-04-02T06:02:36Z

> the commit hash of the SLOTHY version you were using (unfortunately, I haven't yet started versioning SLOTHY properly)

Actually, I was updating slothy/targets/aarch64/cortex_a55.py on my local repo branch to add instruction latencies/throughputs for Neoverse N1, because curve25519 was optimized with the Cortex A55 model, IIRC. It seems that the Cortex A55 instruction cost model is not equivalent to Neoverse N1's, though.
The local repo branch is https://github.com/aqjune-aws/slothy/tree/customize .

Would it be a good idea to backport the instruction cost models for Neoverse N1 to the official SLOTHY? If so, which file should we update (neoverse_n1_experimental.py)?

Rerunning SLOTHY will not be a big problem, modulo its running time. The proofs can be updated mechanically.

hanno-becker · 2024-04-02T07:34:13Z

@aqjune Yes, we should be aiming to build & use the Neoverse N1 model, I think. If it turns out that the A55 model performs better on N1, then this is something to be investigated -- let's see!

Could you move your changes from the A55 model (which, as I understand, use N1 SWOG data?) over to the N1 model, and rerun your optimization scripts using the N1 model? And, once everything works, open a PR on the SLOTHY repository with the model enhancements (nothing else)?

aqjune-aws · 2024-04-10T22:32:38Z

Okay, CI checks are running. Fingers crossed...

jargh

This now looks great to me, thank you!

aqjune force-pushed the equiv-muls2 branch from 2e048d2 to b0d0022 Compare March 28, 2024 16:33

aqjune force-pushed the equiv-muls2 branch from b0d0022 to c01b9f0 Compare April 9, 2024 20:41

aqjune-aws force-pushed the equiv-muls2 branch 3 times, most recently from 458c6af to 8aaa240 Compare April 9, 2024 20:55

aqjune force-pushed the equiv-muls2 branch from 8aaa240 to e7c791c Compare April 10, 2024 22:23

aqjune-aws changed the title ~~Add bignum_mont{sqr,mul}_p256_neon~~ Add bignum_mont{sqr,mul}_p256_neon for Arm Apr 10, 2024

aqjune force-pushed the equiv-muls2 branch from e7c791c to 762a44a Compare April 10, 2024 22:28

aqjune force-pushed the equiv-muls2 branch from 762a44a to 060749a Compare April 10, 2024 22:44

aqjune-aws marked this pull request as ready for review April 11, 2024 03:26

aqjune force-pushed the equiv-muls2 branch from 060749a to 0752997 Compare April 17, 2024 02:14

aqjune-aws mentioned this pull request Apr 17, 2024

Faster proofs for arm/proofs/bignum_k{mul,sqr}_*_neon.ml #121

Closed

jargh approved these changes Apr 20, 2024

jargh merged commit 0a3b3f3 into awslabs:main Apr 20, 2024
6 checks passed
