
Use sha2-asm for feature compress on aarch64? #220

Closed
dgbo opened this issue Jan 26, 2021 · 5 comments · Fixed by #224

Comments

@dgbo

dgbo commented Jan 26, 2021

Hi,

We hit a real performance issue when testing the application lotus (which depends on sha2 with the compress feature) on an aarch64 platform.
The root cause is that, on aarch64 with only the compress feature, sha2::compress256 does not use the aarch64 SHA accelerator instructions (e.g. sha256su0) and degrades to the software implementation (soft::compress).

We tried three ways to solve this.

A. Enable the compress and asm features in applications on aarch64 when using sha2::compress256.
To keep x86 platforms on the intrinsics instead of asm, the application's Cargo.toml would be modified as follows:

+[target."cfg(target_arch = \"aarch64\")".dependencies]
+sha2 = { version = "0.9.1", features = [ "compress", "asm" ] }
+[target."cfg(target_arch = \"x86_64\")".dependencies]
+sha2 = { version = "0.9.1", features = [ "compress" ] }

However, this enables both "compress" and "asm" no matter whether we build for aarch64 or x86.
This is a known Cargo bug (see [1][2]); the fix is still unstable.

B. Implement the intrinsic-based backend like we did for x86. We need core::arch::aarch64::* for this,
which is still unstable on aarch64. We also investigated the assembly generated for vsha256su0q_u32:
there are two extra memory loads before the actual sha256su0.
So, as of now, the intrinsics cannot deliver the same performance as the asm backend.
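For illustration, a rough, untested sketch of what the intrinsic-based message-schedule update in option B could look like, assuming the unstable core::arch::aarch64 SHA-2 intrinsics (nightly-only at the time of writing; the exact feature gates may differ):

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "sha2")]
unsafe fn next_schedule_words(
    w0: core::arch::aarch64::uint32x4_t, // W[i-16..=i-13]
    w1: core::arch::aarch64::uint32x4_t, // W[i-12..=i-9]
    w2: core::arch::aarch64::uint32x4_t, // W[i-8..=i-5]
    w3: core::arch::aarch64::uint32x4_t, // W[i-4..=i-1]
) -> core::arch::aarch64::uint32x4_t {
    use core::arch::aarch64::{vsha256su0q_u32, vsha256su1q_u32};
    // sha256su0 + sha256su1 expand the next four message-schedule words;
    // the extra memory loads mentioned above appear around these calls.
    vsha256su1q_u32(vsha256su0q_u32(w0, w1), w2, w3)
}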

C. Use sha2-asm for aarch64 with feature compress, like:

diff --git a/sha2/Cargo.toml b/sha2/Cargo.toml
index 98bd626..10f9ed1 100644
--- a/sha2/Cargo.toml
+++ b/sha2/Cargo.toml
@@ -35,6 +35,6 @@ hex-literal = "0.2"
 default = ["std"]
 std = ["digest/std"]
 asm = ["sha2-asm", "libc"]
-compress = [] # Expose compress function
+compress = ["sha2-asm", "libc"] # Expose compress function
 force-soft = [] # Force software implementation
 asm-aarch64 = ["asm"] # DEPRECATED: use `asm` instead
diff --git a/sha2/src/sha256.rs b/sha2/src/sha256.rs
index 3eef745..56c5c7a 100644
--- a/sha2/src/sha256.rs
+++ b/sha2/src/sha256.rs
@@ -142,7 +142,7 @@ cfg_if::cfg_if! {
     if #[cfg(feature = "force-soft")] {
         mod soft;
         use soft::compress;
-    } else if #[cfg(all(feature = "asm", target_arch = "aarch64", target_os = "linux"))] {
+    } else if #[cfg(all(any(feature = "compress", feature = "asm"), target_arch = "aarch64", target_os = "linux"))] {
         mod soft;
         mod aarch64;
         use aarch64::compress;

We want to fix the performance issue via option C; it would work for both aarch64 and x86.
With it, the package is easy for applications to import and delivers the same performance with the compress feature as with asm.

Suggestions?
Regards.

[1] rust-lang/cargo#2524
[2] rust-lang/cargo#7914

@newpavlov
Member

IIUC your use case, you can use sha2-asm directly on aarch64 and fall back to sha2::compress256 on other targets:

# Cargo.toml
[target."cfg(not(target_arch = \"aarch64\"))".dependencies]
sha2 = { version = "0.9.1", features = [ "compress" ] }
[target."cfg(target_arch = \"aarch64\")".dependencies]
sha2-asm = "0.5"
// code
#[cfg(not(target_arch = "aarch64"))]
fn compress256(state: &mut [u32; 8], blocks: &[Block]) {
    sha2::compress256(state, blocks);
}

#[cfg(target_arch = "aarch64")]
fn compress256(state: &mut [u32; 8], blocks: &[Block]) {
    for block in blocks {
        sha2_asm::compress256(state, block);
    }
}

You will need some glue code to handle conversion between GenericArray<u8, U64> and [u8; 64], but nothing too difficult.
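For reference, a minimal sketch of that glue for the aarch64 branch, assuming Block is generic_array::GenericArray<u8, U64> (the conversion itself needs no unsafe code):

use core::convert::TryInto;
use generic_array::{typenum::U64, GenericArray};

type Block = GenericArray<u8, U64>;

#[cfg(target_arch = "aarch64")]
fn compress256(state: &mut [u32; 8], blocks: &[Block]) {
    for block in blocks {
        // A Block is always exactly 64 bytes, so this conversion cannot fail.
        let raw: &[u8; 64] = block.as_slice().try_into().expect("64-byte block");
        sha2_asm::compress256(state, raw);
    }
}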

we investigated the assembly generated for vsha256su0q_u32, there are extra two memory loads before the actual sha256su0.

It could be worth opening an issue in the Rust language repository for that. The current assembly version is mostly intended as a temporary workaround for the instability of the ARM intrinsics. Also note that for the best possible performance we probably have to work on RustCrypto/asm-hashes#26.

@dgbo
Author

dgbo commented Jan 26, 2021

IIUC your use case, you can use sha2-asm directly on aarch64 and fall back to sha2::compress256 on other targets [...] You will need some glue code to handle conversion between GenericArray<u8, U64> and [u8; 64], but nothing too difficult.

Correct, this would solve my problem.
But in that glue code we first have to check whether the running hardware implements SHA-256.
It seems the glue code would need exactly the same code as sha2::compress256 and everything in src/sha256/aarch64.rs.
It is a little weird to me that we would copy code the library already has into the applications.
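For illustration, the hardware check itself is small; a rough sketch using the libc crate on Linux/aarch64 (mirroring the idea, not sha2's exact internals):

// Returns true if the CPU advertises the SHA-2 extension via AT_HWCAP.
// On aarch64 Linux, HWCAP_SHA2 is bit 6 of the hwcap flags.
#[cfg(all(target_arch = "aarch64", target_os = "linux"))]
fn sha2_hw_available() -> bool {
    const HWCAP_SHA2: libc::c_ulong = 1 << 6;
    unsafe { libc::getauxval(libc::AT_HWCAP) & HWCAP_SHA2 != 0 }
}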

we investigated the assembly generated for vsha256su0q_u32, there are extra two memory loads before the actual sha256su0.

It could be worth opening an issue in the Rust language repository for that. The current assembly version is mostly intended as a temporary workaround for the instability of the ARM intrinsics.

Totally agree, it is definitely an issue.
As of now, I'm not sure whether the extra memory loads are introduced by Rust or by LLVM.
I am still working on it; this needs some time.
Before we have aarch64 intrinsics, can we provide an easy-to-use workaround (maybe option C)? :)

Also note that for the best possible performance we probably have to work on RustCrypto/asm-hashes#26.

Great idea! I think I can do it for aarch64.

@newpavlov
Member

newpavlov commented Jan 26, 2021

Hm, I think a better solution would be to check availability of the SHA extension on x86 and use the intrinsic backend even if the asm feature is enabled, i.e. enabling it will simply replace the software fallback with the ASM one.
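A rough sketch of that idea (not the actual sha2 internals; soft and x86 stand in for the existing backend modules referenced in the diff below):

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn compress(state: &mut [u32; 8], blocks: &[[u8; 64]]) {
    if is_x86_feature_detected!("sha") {
        // SHA-NI available: always use the intrinsic backend,
        // even when the asm feature is enabled.
        x86::compress(state, blocks)
    } else {
        // Only the fallback differs: soft, or the ASM backend
        // when the asm feature is enabled.
        soft::compress(state, blocks)
    }
}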

@dgbo
Author

dgbo commented Jan 27, 2021

Sounds reasonable. Is this OK?

--- a/sha2/src/sha256.rs
+++ b/sha2/src/sha256.rs
@@ -146,6 +146,10 @@ cfg_if::cfg_if! {
         mod soft;
         mod aarch64;
         use aarch64::compress;
+    } else if #[cfg(all(feature = "compress", any(target_arch = "x86", target_arch = "x86_64")))] {
+        mod soft;
+        mod x86;
+        use x86::compress;
     } else if #[cfg(all(feature = "asm", any(target_arch = "x86", target_arch = "x86_64")))] {
         fn compress(state: &mut [u32; 8], blocks: &[[u8; 64]]) {
             for block in blocks {

@dgbo
Author

dgbo commented Jan 28, 2021

Hm, I think a better solution would be to check availability of the SHA extension on x86 and use the intrinsic backend even if the asm feature is enabled, i.e. enabling it will simply replace the software fallback with the ASM one.

Hi, does the code below match your thoughts?

--- a/sha2/src/sha256.rs
+++ b/sha2/src/sha256.rs
@@ -146,6 +146,10 @@ cfg_if::cfg_if! {
         mod soft;
         mod aarch64;
         use aarch64::compress;
+    } else if #[cfg(all(feature = "compress", any(target_arch = "x86", target_arch = "x86_64")))] {
+        mod soft;
+        mod x86;
+        use x86::compress;
     } else if #[cfg(all(feature = "asm", any(target_arch = "x86", target_arch = "x86_64")))] {
         fn compress(state: &mut [u32; 8], blocks: &[[u8; 64]]) {
             for block in blocks {

The SHA extension on x86 is checked in x86::compress.
We use the x86 intrinsic backend even with --features "compress,asm".
If only asm is enabled, the ASM backend is used as before.

With this, applications can just use --features "compress,asm" on both x86 and aarch64: x86 keeps using the intrinsic backend as before, while aarch64 uses the ASM backend for better performance.
It would also be fine to use only --features "compress" on x86.
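For example, the application-side Cargo.toml from option A above would then shrink to a single, target-independent dependency (version shown is illustrative):

[dependencies]
sha2 = { version = "0.9.1", features = ["compress", "asm"] }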

I'm sorry if I have misunderstood what you mean.

Regards.
