Fix compile issues when build on RHEL5_64 with gcc 4.9.4 #8

bryce-shang · 2020-06-19T23:55:51Z

Issues

This change fixes compile issues when building on RHEL5_64 with gcc 4.9.4.

Resolves parsing issues for ARMv8 assembly with clang7 on Ubuntu 20.04 in the FIPS static build (found through PR #566 for the SHA3 assembly implementation).

- Fix parsing issue in `delocate.peg` for ARM assembly.
- Edit rule `RegisterOrConstant` to allow shifting a register/constant by a two-digit value (e.g., the case of the ARMv8 mask for SHA3 hardware support) instead of just one digit.
- Add a new rule allowing addition, subtraction and multiplication in the offset. (Note: useful for looping address accesses, e.g., `#8*($i+2)`.)
  - Add a set of `OffsetOperator` to define the operations allowed in the offset.
  - Add a new set of `Offset` rule operations, interpreted depending on the location of parentheses, if any.

Note: The parentheses in the `Offset` rule should be either both included or both left out; i.e., the parenthesis set should be closed. The `OffsetOperator` includes addition, subtraction and multiplication only.

This change was tested successfully in PR #566.

Implementations of AES-GCM in AWS-LC may use an "H-Table" to precompute and cache common computations across multiple invocations of AES-GCM using the same key, thereby improving performance. The main example of such a common precomputation is the computation of powers of the H-value used in the GHASH algorithm, which gives the H-Table its name. However, despite the name, the structure of the H-Table is opaque to the code invoking AES-GCM, and implementations are free to populate it with arbitrary data. This freedom is already being leveraged: currently, the AArch64 implementation of AES-GCM stores not only powers of H in the H-Table (H1-H8 in the code), but also their "Karatsuba preprocessing" values, which are the EORs of the low and high halves. Those occur naturally when using Karatsuba's algorithm to reduce a 128-bit polynomial multiplication over GF(2) to three 64-bit polynomial multiplications. This commit changes the structure of the H-Table for AArch64 implementations slightly for better performance: every time a power of H is loaded from the H-Table (H1-H8), the first operation applied to it in both aesv8-gcm-armv8.pl and aesv8-gcm-armv8-unroll8.pl is to swap the low and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps can be precomputed and the Hi values stored in swapped form in the H-Table, thereby eliminating the swaps from the critical loop of AES-GCM.
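The Karatsuba step mentioned above can be illustrated with a minimal Python sketch. It is not the AWS-LC assembly: the function names `clmul` and `clmul128_karatsuba` are hypothetical, and Python integers stand in for the 64-bit and 128-bit registers. It shows why the XOR of the low and high halves of each operand is the value worth caching next to each power of H.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product of a and b, i.e. multiplication in GF(2)[x]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul128_karatsuba(a: int, b: int) -> int:
    """128-bit carry-less product built from three 64-bit products."""
    a_lo, a_hi = a & MASK64, a >> 64
    b_lo, b_hi = b & MASK64, b >> 64

    lo = clmul(a_lo, b_lo)  # low halves
    hi = clmul(a_hi, b_hi)  # high halves
    # "Karatsuba preprocessing": the XOR of the two halves of each operand.
    # For a fixed operand (a power of H) this XOR can be computed once and cached.
    mid = clmul(a_lo ^ a_hi, b_lo ^ b_hi) ^ lo ^ hi

    return (hi << 128) ^ (mid << 64) ^ lo
```

A schoolbook split would need four 64-bit products; here the middle term is recovered from one extra product plus two XORs.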

Implementations of AES-GCM in AWS-LC may use an "H-Table" to precompute and cache common computations across multiple invocations of AES-GCM using the same key, thereby improving performance. The main example of such common precomputation is the computation of powers of the H-value used in the GHASH algorithm, which gives the H-Table its name. However, despite the name, the structure of the H-Table is opaque to the code invoking AES-GCM, and implementations are free to populate it with arbitrary data. This freedom is already being leveraged: currently, the AArch64 implementation of AES-GCM stores not only powers of H in the H-Table (H1-H8 in the code), but also their "Karatsuba preprocessing" values, which are the EORs of the low and high halves. Those occur naturally when using Karatsuba's algorithm to reduce a 128-bit polynomial multiplication over GF(2) to three 64-bit polynomial multiplications. This commit changes the structure of the H-Table for AArch64 implementations of AES-GCM slightly to obtain a small performance gain: every time a power of H is loaded from the H-Table (H1-H8), the first operation applied to it in both aesv8-gcm-armv8.pl and aesv8-gcm-armv8-unroll8.pl is to swap the low and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps can be precomputed and the H{1-8} values stored in swapped form in the H-Table, thereby eliminating the swaps from the critical loop of AES-GCM.

Implementations of AES-GCM in AWS-LC may use an "H-Table" to precompute and cache common computations across multiple invocations of AES-GCM using the same key, thereby improving performance. The main example of such common precomputation is the computation of powers of the H-value used in the GHASH algorithm, which gives the H-Table its name. However, despite the name, the structure of the H-Table is opaque to the code invoking AES-GCM, and implementations are free to populate it with arbitrary data. This freedom is already being leveraged: currently, the AArch64 implementation of AES-GCM stores not only powers of H in the H-Table (H1-H8 in the code), but also their "Karatsuba preprocessing" values, which are the EORs of the low and high halves. Those occur naturally when using Karatsuba's algorithm to reduce a 128-bit polynomial multiplication over GF(2) to three 64-bit polynomial multiplications. This commit changes the structure of the H-Table for AArch64 implementations of AES-GCM slightly to obtain a small performance gain: every time a power of H is loaded from the H-Table (H1-H8), the first operation applied to it in both aesv8-gcm-armv8.pl and aesv8-gcm-armv8-unroll8.pl is to swap the low and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps can be precomputed and the H{1-8} values stored in swapped form in the H-Table, thereby eliminating the swaps from the critical loop of AES-GCM. This commit modifies the H-Table precomputation ghash_init_v8 in the simplest way possible to introduce the desired swaps, bracketing the store instructions for H-Table values X with `vext.8 X, X, X, #8`. The resulting initialization code is slightly slower than the original and will be simplified in the next commit.

This is the first in a series of commits aiming to rewrite gcm_ghash_v8 to work directly with the swapped H-Table values, rather than swapping them back after loading and falling back to the old code. As a first step, the swapping of A = {H,H2} is removed and all uses of

```
pmull.64 Y, A, X
```

are replaced by the equivalent

```
vext.8 X, X, X, #8
pmull2.64 Y, A, X
vext.8 X, X, X, #8
```

(and similarly for pmull2). This works so long as X and Y don't alias. Of course, the above conversion makes the code much less efficient, and is not final. The next commit will eliminate `vext`.
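The equivalence used in this step can be checked with a small Python model. This is a sketch under stated assumptions rather than the actual assembly: 128-bit registers are modelled as integers below 2^128, and the helpers `ext8`, `pmull` and `pmull2` are hypothetical stand-ins for `vext.8 X, X, X, #8` and for the low-half and high-half polynomial multiplies.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product of a and b, i.e. multiplication in GF(2)[x]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def ext8(x: int) -> int:
    """Model of `vext.8 X, X, X, #8`: swap the two 64-bit halves of x."""
    return ((x & MASK64) << 64) | (x >> 64)


def pmull(a: int, x: int) -> int:
    """Model of pmull: multiply the low 64-bit halves."""
    return clmul(a & MASK64, x & MASK64)


def pmull2(a: int, x: int) -> int:
    """Model of pmull2: multiply the high 64-bit halves."""
    return clmul(a >> 64, x >> 64)


def pmull_with_swapped_table(a_swapped: int, x: int) -> int:
    """Compute what pmull(a, x) produced, given only the swapped table entry."""
    return pmull2(a_swapped, ext8(x))
```

If `a_swapped == ext8(a)`, then `pmull_with_swapped_table(a_swapped, x)` equals `pmull(a, x)`, because the low half of `a` sits in the high half of the swapped entry. The second `vext.8` in the assembly only restores `X` for later use.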

`In` and `t1` are swapped versions of each other. Therefore,

```
vext.8 $In, $In, $In, #8
vpmull2.p64 $Xln,$H,$In @ H·Ii+1
vext.8 $In, $In, $In, #8
```

is equivalent to

```
vext.8 $In, $In, $In, #8
vpmull2.p64 $Xln,$H,$t1 @ H·Ii+1
vext.8 $In, $In, $In, #8
```

is equivalent to

```
vpmull2.p64 $Xln,$H,$t1 @ H·Ii+1
vext.8 $In, $In, $In, #8
vext.8 $In, $In, $In, #8
```

is equivalent to

```
vpmull2.p64 $Xln,$H,$t1 @ H·Ii+1
```

In the context of the change, `t1` and `IN` are the same after

```
veor $IN,$t0,$t2 @ inp^=Xi
veor $t1,$t0,$t2 @ $t1 is rotated inp^Xi
```

Moreover, after all of

```
vpmull2.p64 $Xl,$H,$IN @ H.lo·Xi.lo
vext.8 $IN, $IN, $IN, #8
veor $t1,$t1,$IN @ Karatsuba pre-processing
vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi)
vext.8 $IN, $IN, $IN, #8
```

`IN` is unchanged because it was swapped twice, and `t1` only feeds into the computation of `Xm` and is not used afterwards. Hence, the above is equivalent to

```
vpmull2.p64 $Xl,$H,$IN @ H.lo·Xi.lo
vext.8 $t1, $IN, $IN, #8
veor $t1,$t1,$IN @ Karatsuba pre-processing
vpmull.p64 $Xm,$Hhl,$t1 @ (H.lo+H.hi)·(Xi.lo+Xi.hi)
```

removing one `vext`.
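The register argument above can be replayed in a few lines of Python. This is an illustrative model only: the function names are hypothetical, registers are integers below 2^128, and it assumes that `t1` holds the same value as `IN` on entry, as the two `veor` instructions establish. The multiplies are left out because they do not modify `IN` or `t1`.

```python
MASK64 = (1 << 64) - 1


def ext8(x: int) -> int:
    """Model of `vext.8 X, X, X, #8`: swap the two 64-bit halves of x."""
    return ((x & MASK64) << 64) | (x >> 64)


def seq_original(inp: int):
    """Sequence with two swaps of IN around the Karatsuba pre-processing."""
    t1 = inp            # assumption: t1 == IN on entry
    inp = ext8(inp)     # vext.8 $IN, $IN, $IN, #8
    t1 ^= inp           # veor   $t1, $t1, $IN
    inp = ext8(inp)     # vext.8 $IN, $IN, $IN, #8 (restores IN)
    return inp, t1


def seq_simplified(inp: int):
    """Sequence with a single swap written straight into t1."""
    t1 = ext8(inp)      # vext.8 $t1, $IN, $IN, #8
    t1 ^= inp           # veor   $t1, $t1, $IN
    return inp, t1
```

Both functions return the same pair: `IN` is unchanged, and both halves of `t1` hold the XOR of the low and high halves of `IN`, which is the Karatsuba pre-processed value consumed by the `Xm` multiply.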

AArch64 assembly implementations of AES-GCM in AWS-LC use an "H-Table" to precompute and cache common computations across multiple invocations of AES-GCM using the same key, thereby improving performance. The main example of such common precomputation is the computation of powers of the H-value used in the GHASH algorithm, which gives the H-Table its name. However, despite the name, the structure of the H-Table is opaque to the code invoking AES-GCM, and implementations are free to populate it with arbitrary data. This freedom is already being leveraged: currently, the AArch64 implementation of AES-GCM stores not only powers of H in the H-Table (H1-H8 in the code), but also their "Karatsuba preprocessing" values, which are the EORs of the low and high halves. Those occur naturally when using Karatsuba's algorithm to reduce a 128-bit polynomial multiplication over GF(2) to three 64-bit polynomial multiplications. This commit changes the structure of the H-Table for AArch64 implementations of AES-GCM slightly to obtain a small performance gain: every time a power of H is loaded from the H-Table (H1-H8), the first operation applied to it in both aesv8-gcm-armv8.pl and aesv8-gcm-armv8-unroll8.pl is to swap the low and high halves via `ext arg.16b, arg.16b, arg.16b, #8`. Those swaps can be precomputed and the H{1-8} values stored in swapped form in the H-Table, thereby eliminating the swaps from the critical loop of AES-GCM. This gives a small performance gain for AES-GCM on Graviton3, at the cost of slightly slower one-off initialization. For Graviton2, the AES-GCM AArch64 assembly loads the H-Table only once, outside of the critical loop; hence, there is no performance benefit.

bryce-shang added 3 commits June 19, 2020 16:45

Fix build error '__u32 does not name a type' for RHEL5_64.

552eb0f

Rename variables to avoid gcc4.9 sensitive shadow warning.

e35eab4

Define PTRACE_O_TRACESYSGOOD if it does not exist.

f000c78

bryce-shang requested review from andrewhop and drucker-nir June 19, 2020 23:56

bryce-shang closed this Jul 6, 2020

darylmartin100 deleted the gcc4.9 branch August 26, 2020 15:14

bryce-shang mentioned this pull request Jan 22, 2021

Add EVP_PKEY_RSA_PSS ameth. #86

Merged
