AES-GCM AArch64: Store swapped Htable values #1403

Merged 4 commits on Jul 11, 2024

Conversation

hanno-becker
Contributor

@hanno-becker hanno-becker commented Jan 15, 2024

Implementations of AES-GCM in AWS-LC may use an "H-Table" to precompute and cache common computations across multiple invocations of AES-GCM using the same key, thereby improving performance.

The main example of such common precomputation is the computation of powers of the H-value used in the GHASH algorithm -- giving the H-Table its name. However, despite the name, the structure of the H-Table is opaque to the code invoking AES-GCM, and implementations are free to populate it with arbitrary data.
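To illustrate why caching powers of H helps, here is a minimal Python sketch. It is not the AWS-LC code: `gf128_mul`, `build_h_powers`, `ghash_serial` and `ghash_with_table` are hypothetical names, and the multiplication is the plain bitwise reference algorithm for the GHASH field rather than anything optimized. It shows that a table of H^1..H^8 lets several blocks be multiplied independently and then combined, giving the same result as the serial Horner-style evaluation.

```python
# Illustrative sketch only; all names here are hypothetical, not AWS-LC APIs.
R = 0xE1 << 120  # GCM reduction constant for x^128 + x^7 + x^2 + x + 1


def gf128_mul(x, y):
    """Multiply two 128-bit GHASH field elements (bitwise reference algorithm)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # GCM numbers bits from the most significant end
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def build_h_powers(h, count=8):
    """Return [H^1, H^2, ..., H^count]; computed once per key and cached."""
    powers = [h]
    for _ in range(count - 1):
        powers.append(gf128_mul(powers[-1], h))
    return powers


def ghash_serial(h, blocks):
    """Horner-style evaluation: each step depends on the previous one."""
    y = 0
    for block in blocks:
        y = gf128_mul(y ^ block, h)
    return y


def ghash_with_table(powers, blocks):
    """Same result, but every product is independent of the others."""
    n = len(blocks)  # assumes n <= len(powers)
    acc = 0
    for i, block in enumerate(blocks):
        acc ^= gf128_mul(block, powers[n - 1 - i])  # block i pairs with H^(n-i)
    return acc
```

The table-based form is what makes wide unrolling attractive: the multiplications no longer form a dependency chain.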

This freedom is already being leveraged: currently, the AArch64 implementation of AES-GCM stores not only powers of H in the H-Table (H1-H8 in the code), but also their 'Karatsuba preprocessings', which are the EORs of the low and high halves. These occur naturally when using Karatsuba's algorithm to reduce a 128-bit polynomial multiplication over GF(2) to three 64-bit polynomial multiplications.
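As a rough illustration of where the low-EOR-high value comes from, the following Python sketch multiplies two 128-bit polynomials over GF(2) with three 64-bit carry-less multiplications. The names `clmul64`, `karatsuba_pre` and `clmul128_karatsuba` are hypothetical, the product is left unreduced, and the optional `a_pre` argument stands in for a value that would be read from the table instead of being recomputed.

```python
# Illustrative sketch only; names are hypothetical, the result is not reduced.
MASK64 = (1 << 64) - 1


def clmul64(a, b):
    """Carry-less product of two 64-bit polynomials over GF(2)."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result


def karatsuba_pre(x):
    """'Karatsuba preprocessing' of a 128-bit value: low half EOR high half."""
    return (x & MASK64) ^ (x >> 64)


def clmul128_karatsuba(a, b, a_pre=None):
    """128x128 -> 255-bit carry-less product using three 64-bit products."""
    if a_pre is None:
        a_pre = karatsuba_pre(a)  # could instead be cached alongside a
    lo = clmul64(a & MASK64, b & MASK64)
    hi = clmul64(a >> 64, b >> 64)
    # (a_lo ^ a_hi) * (b_lo ^ b_hi) = lo ^ hi ^ cross terms
    mid = clmul64(a_pre, karatsuba_pre(b)) ^ lo ^ hi
    return (hi << 128) ^ (mid << 64) ^ lo
```

When `a` is a fixed power of H, `a_pre` depends only on the key, which is why it is a natural candidate for caching.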

This PR changes the structure of the H-Table for AArch64 implementations of AES-GCM slightly to obtain a small performance gain:

It is observed that every time a power of H is loaded from the H-Table (H1-H8), the first operation that happens to it in both `aesv8-gcm-armv8.pl` and `aesv8-gcm-armv8-unroll8.pl` is to swap low and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps can be precomputed, and the H{1-8} values stored in swapped form in the HTable, thereby eliminating the swaps from the critical loop of AES-GCM.
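The idea can be modelled in a few lines of Python. This is only a sketch: `ext8` models the effect of `ext v.16b, v.16b, v.16b, #8` on a 128-bit integer, and the four table and loop helpers are hypothetical stand-ins for the initialization routine and the critical loop, not the actual assembly.

```python
# Illustrative model only; helper names are hypothetical.
MASK64 = (1 << 64) - 1


def ext8(x):
    """Swap the two 64-bit halves of a 128-bit value."""
    return ((x & MASK64) << 64) | (x >> 64)


def init_table_plain(h_powers):
    """Old layout: store H1..H8 as computed."""
    return list(h_powers)


def init_table_swapped(h_powers):
    """New layout: pay for the swap once, at initialization."""
    return [ext8(h) for h in h_powers]


def loop_operands_old(table):
    """Old loop: every load is followed by a swap."""
    return [ext8(entry) for entry in table]


def loop_operands_new(table):
    """New loop: loaded values are used directly."""
    return list(table)
```

Both variants hand the same operands to the multiplication code; only the place where the swap is paid for changes.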

This gives a small performance gain for AES-GCM on Graviton3, at the cost of slightly slower one-off initialization. For Graviton2, the AES-GCM AArch64 assembly loads the H-table only once, outside of the critical loop; hence, there is no performance benefit.

Testing:

  • Locally: ssl_test and crypto/crypto_test
  • CI: TBD
  • Performance measurements using bssl speed: TBD

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

@hanno-becker hanno-becker requested a review from a team as a code owner January 15, 2024 17:13
@codecov-commenter

codecov-commenter commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.20%. Comparing base (622366f) to head (70fbe05).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1403      +/-   ##
==========================================
+ Coverage   78.19%   78.20%   +0.01%     
==========================================
  Files         571      571              
  Lines       95465    95465              
  Branches    13704    13705       +1     
==========================================
+ Hits        74653    74663      +10     
+ Misses      20201    20191      -10     
  Partials      611      611              


@hanno-becker hanno-becker changed the title DRAFT: AES-GCM AArch64: Store twisted Htable values DRAFT: AES-GCM AArch64: Store swapped Htable values Jan 16, 2024
@torben-hansen torben-hansen marked this pull request as draft January 30, 2024 18:52
Contributor

@andrewhop andrewhop left a comment

Benchmarks on a Graviton 3 instance running 1,000 iterations for each benchmark. All values are in microseconds (lower is better), the ratio is simply before/after (greater than 1 is better).

Graviton 3 AEAD-AES-256-GCM seal

| num bytes or init | Main min | Main avg | Main max | PR min | PR avg | PR max | min ratio | avg ratio | max ratio |
|---|---|---|---|---|---|---|---|---|---|
| init | 0.09242 | 0.09401 | 0.09685 | 0.0943 | 0.09651 | 0.0994 | 0.98003 | 0.97413 | 0.97426 |
| 16 | 0.0904 | 0.0923 | 0.0998 | 0.09 | 0.0917 | 0.0993 | 1.0055 | 1.0061 | 1.0045 |
| 256 | 0.1327 | 0.1346 | 0.1402 | 0.1317 | 0.1338 | 0.1390 | 1.0077 | 1.0058 | 1.008 |
| 1350 | 0.3191 | 0.3211 | 0.3285 | 0.3112 | 0.3136 | 0.3198 | 1.0256 | 1.0237 | 1.0272 |
| 8192 | 1.3207 | 1.3251 | 1.337 | 1.278 | 1.2828 | 1.2908 | 1.0334 | 1.0330 | 1.0361 |
| 16384 | 2.5457 | 2.553 | 2.575 | 2.4582 | 2.4675 | 2.487 | 1.0356 | 1.0348 | 1.0352 |

Graviton 3 AEAD-AES-128-GCM seal

| num bytes or init | Main min | Main avg | Main max | PR min | PR avg | PR max | min ratio | avg ratio | max ratio |
|---|---|---|---|---|---|---|---|---|---|
| init | 0.09558 | 0.09738 | 0.09997 | 0.09812 | 0.09981 | 0.10255 | 0.97414 | 0.97571 | 0.97487 |
| 16 | 0.0948 | 0.0967 | 0.1026 | 0.0939 | 0.0954 | 0.1019 | 1.0095 | 1.0139 | 1.0072 |
| 256 | 0.1425 | 0.1449 | 0.1507 | 0.1422 | 0.1442 | 0.1499 | 1.0019 | 1.0044 | 1.0057 |
| 1350 | 0.3632 | 0.3655 | 0.3724 | 0.3534 | 0.3560 | 0.3623 | 1.0277 | 1.0266 | 1.0278 |
| 8192 | 1.5352 | 1.5415 | 1.5601 | 1.4948 | 1.5018 | 1.51 | 1.0270 | 1.0264 | 1.0284 |
| 16384 | 2.9717 | 2.9817 | 3.0072 | 2.8912 | 2.9030 | 2.9 | 1.0278 | 1.027 | 1.0263 |

Graviton 2 AEAD-AES-128-GCM seal

| num bytes | Main min | Main avg | Main max | PR min | PR avg | PR max | min ratio | avg ratio | max ratio |
|---|---|---|---|---|---|---|---|---|---|
| init | 0.1351 | 0.1371 | 0.1392 | 0.1370 | 0.1393 | 0.1423 | 0.9864 | 0.9841 | |
| 16 | 0.1267 | 0.1326 | 0.13 | 0.128 | 0.1333 | 0.1374 | 0.98 | 0.9951 | 0.9897 |
| 256 | 0.2212 | 0.2264 | 0.2315 | 0.2202 | 0.2256 | 0.2300 | 1.0046 | 1.0033 | 1.0064 |
| 1350 | 0.7260 | 0.7306 | 0.7359 | 0.7272 | 0.7321 | 0.739 | 0.9984 | 0.9979 | 0.9951 |
| 8192 | 3.6756 | 3.6830 | 3.7103 | 3.6793 | 3.6985 | 3.7166 | 0.99 | 0.995 | 0.998 |
| 16384 | 7.23 | 7.2479 | 7.28 | 7.24 | 7.2778 | 7.39 | 0.998 | 0.9958 | 0.9853 |

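Overall, on Graviton 3 the init is about 2-3% slower, while encrypting is faster at every size: the gain is under 1.5% for 16 and 256 bytes and grows to roughly 3.5% (AES-256) and 2.7% (AES-128) at 16384 bytes. On Graviton 2 the init is also slightly slower (about 1.5%), with basically no change in encryption time.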
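For anyone who wants to re-derive the ratio columns, a small Python sketch follows. The helper names `ratio` and `speedup_percent` are made up for illustration; the sample values are the Main avg and PR avg entries from the Graviton 3 AEAD-AES-256-GCM table. Because the tabulated times are already rounded, recomputed ratios may differ from the table in the last digit.

```python
# Illustrative helpers; names are hypothetical.
def ratio(main_us, pr_us):
    """before/after ratio as used in the tables: > 1 means the PR is faster."""
    return main_us / pr_us


def speedup_percent(main_us, pr_us):
    """The same comparison expressed as a percentage change."""
    return (ratio(main_us, pr_us) - 1.0) * 100.0


# (Main avg, PR avg) in microseconds, Graviton 3 AEAD-AES-256-GCM seal.
g3_aes256_avg = {
    "init": (0.09401, 0.09651),
    "1350": (0.3211, 0.3136),
    "8192": (1.3251, 1.2828),
}

for label, (main_us, pr_us) in g3_aes256_avg.items():
    print(f"{label:>5}: ratio {ratio(main_us, pr_us):.4f} "
          f"({speedup_percent(main_us, pr_us):+.2f}%)")
```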

@hanno-becker hanno-becker changed the title DRAFT: AES-GCM AArch64: Store swapped Htable values AES-GCM AArch64: Store swapped Htable values Mar 12, 2024
@hanno-becker hanno-becker marked this pull request as ready for review March 12, 2024 19:21
Implementations of AES-GCM in AWS-LC may use an "H-Table" to
precompute and cache common computations across multiple
invocations of AES-GCM using the same key, thereby improving
performance.

The main example of such common precomputation is the
computation of powers of the H-value used in the GHASH algorithm
-- giving the H-Table its name. However, despite the name, the
structure of the H-Table is opaque to the code invoking AES-GCM,
and implementations are free to populate it with arbitrary data.

This freedom is already being leveraged: Currently, the AArch64
implementation of AES-GCM not only stores powers of H in the
HTable (H1-H8 in the code), but also their 'Karatsuba
preprocessing's, which are the EORs of the low and high halves.
Those naturally occur when using Karatsuba's algorithm to reduce a
128-bit polynomial multiplication over GF(2) to 3x 64-bit
polynomial multiplications.

This commit changes the structure of the H-Table for AArch64
implementations of AES-GCM slightly to obtain a small performance gain:

It is observed that every time a power of H is loaded from the
H-Table (H1-H8), the first operation that happens to it in both
aesv8-gcm-armv8.pl and aesv8-gcm-armv8-unroll8.pl is to swap low
and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps
can be precomputed, and the H{1-8} values stored in swapped form in the
HTable, thereby eliminating the swaps from the critical loop of AES-GCM.

This commit modifies the H-table precomputation ghash_init_v8 in the
simplest way possible to introduce the desired swaps, bracketing store
instructions for H-table values X with `vext.8 X, X, X, #8`. The resulting
initialization code is slightly slower than the original one and will
be simplified in the next commit.

This commit simplifies the pre-computation of the H-table by
'absorbing' the newly introduced swap instructions `vext` into
the surrounding code. This brings the performance of the H-table
initialization on par with the previous initialization routine.
@hanno-becker
Contributor Author

@nebeid @andrewhop Let me know if there is something I can do to facilitate the review.

@hanno-becker
Contributor Author

@nebeid @andrewhop @dkostic Any update on this?

Contributor

Thank you @hanno-becker for this change. Since you dove into the details of this implementation, I suggest adding comments at the beginning that explain what is calculated and where it is stored in the H table, maybe using an ASCII representation of the table.

Contributor Author

@hanno-becker hanno-becker Jul 8, 2024

@nebeid It is a good idea to document better what is being stored in the HTable. However, it is not necessary to vet this PR, I think: The main point is that certain entries in the HTable are always swapped right after loading -- so one may just store the swapped versions to begin with. This does not rely on knowledge of what it is that is being stored.

@nebeid nebeid merged commit 90315e2 into aws:main Jul 11, 2024
99 of 100 checks passed
pennyannn added a commit to pennyannn/LNSym-public that referenced this pull request Aug 1, 2024
shigoel added a commit to leanprover/LNSym that referenced this pull request Aug 5, 2024
### Description:

The AES-GCM programs were updated in the following two PRs:
aws/aws-lc#1403 and
aws/aws-lc#1639. This updates them in LNSym as well.

### Testing:

Make all succeeds and conformance testing is successful. 

### License:

By submitting this pull request, I confirm that my contribution is
made under the terms of the Apache 2.0 license.

Co-authored-by: Shilpi Goel <shigoel@gmail.com>