AES implementation should be chosen at runtime rather than compile time #25
Switching between implementations should be done at as high a level as possible, not at the block cipher level. There is a […]
I believe ideally we need support for this at the language level; I've described my view here.
What I'd like in the short term:
Because it silently falls back to only using the software implementation of AES. In fact, if you compiled with […]. I have plastered Miscreant with instructions to use […]. It would be nice if the ergonomics of target features were better, but failing that, if the […]
I don't think that's sound. Once you enable +ssse3 for the whole crate, the compiler could choose to use ssse3 anywhere. For groestl-aesni, I used a combination of: […]
It works fine; the version with aes detected at runtime runs just as fast as compiling with --target-feature=+aes, and much faster than the soft-aes path.
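The per-function approach being described can be sketched roughly as follows. This is a hedged illustration only: the byte summation is a stand-in for real SIMD code, and SSSE3 is used purely as an example feature; only the detect-then-dispatch pattern matters.

```rust
// Sketch: runtime feature detection combined with per-function
// #[target_feature], so the feature is never enabled crate-wide.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn sum_bytes_ssse3(xs: &[u8]) -> u64 {
    // Compiled with SSSE3 enabled for this function only; the body is a
    // placeholder where vectorized code would go in a real implementation.
    xs.iter().map(|&x| u64::from(x)).sum()
}

fn sum_bytes(xs: &[u8]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: SSSE3 support was verified at runtime just above.
            return unsafe { sum_bytes_ssse3(xs) };
        }
    }
    // Portable fallback path.
    xs.iter().map(|&x| u64::from(x)).sum()
}

fn main() {
    assert_eq!(sum_bytes(&[1, 2, 3]), 6);
    println!("ok");
}
```

Because `#[target_feature]` applies to a single `unsafe fn`, the compiler cannot leak SSSE3 instructions into code paths that run before the runtime check, which is the soundness concern raised above.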
That's better than what this crate presently does; however, unless I'm mistaken, it still appears to require the crate's consumer to enable the […].

I think all builds for all x86(_64) targets should always attempt to detect and use AES-NI, and if the required target features aren't enabled, the crate should refuse to compile. That could potentially have a way to opt out for users who really don't want AES-NI, but the default should be to attempt to use it, IMO. Barring that, I think no one will enable the target features and nothing will even attempt to use AES-NI.
No, that's what […]
(Btw, the […])
Interesting. I’ll have to look into it more.
Gah. What I've been doing doesn't actually work. I was fooled by rust-lang/stdarch#323, and the functions I thought were LLVM's fallbacks are actually just Rust's pointless wrappers. The bug is here: cryptocorrosion/cryptocorrosion#7 |
@kazcw oof, sorry to hear, almost thought you had this one licked! Any plans for how to address it?
@tarcieri Yeah, it won't be too bad. The upshot is this is forcing me to put everything through traits, which was on the roadmap anyway (cryptocorrosion/cryptocorrosion#6) but I'd been putting it off because ad-hoc polymorphism by switching out imports took zero work. But it's not turning out to be as hard as I thought, and the new API is cleaner (the user parameterizes code that will be compiled with different features sets by a Machine trait, whose impls are ZSTs that define a set of types indicating what kind of registers to use, what cpu flags are enabled, etc; possessing a particular Machine instance is used as a marker that its associated types can safely be created). This also removes the barrier to AVX2, or rather what I needed to do to get AVX2 working turns out to also be what I need to make everything else work anyway. I already have ports of JH and BLAKE working on the new API. I can probably do ChaCha later today, and then I'll release the new versions, with packed_simd disabled for now. Putting packed_simd through the traits will be about as much work as implementing them on coresimd has been, and packed_simd is less crucial now that ppv-lite86 can do AVX2. |
We are hitting this error on Debian when trying to package these RustCrypto AES crates, as Debian's x64 baseline does not include AESNI. All you need to do is use this: https://doc.rust-lang.org/std/macro.is_x86_feature_detected.html
@infinity0 we are aware of that macro and have made past attempts to use it. Offhand I don’t remember what the objections were. |
I saw the old discussion above about stdarch, and I believe today's stdarch uses these newer macros I mentioned. Hopefully the objections can be overcome straightforwardly?
See this PR: RustCrypto/universal-hashes#11

To summarize: it appears doing […]. I think it’d be good to confirm the performance impact empirically.
@infinity0 can you spell out Debian's requirements a little more specifically? Why can't you omit the RUSTFLAGS to enable AES-NI? |
In order to support as many machines as possible, the Debian "amd64" architecture guarantees support even for x64 CPUs without AES-NI. That means we can't use these RUSTFLAGS. Debian also doesn't want to build everything twice just to support CPUs with AES-NI. |
And I assume this is the same for pretty much any other FOSS OS distro and not just Debian. |
Yes, that’s what I’m saying: to even use AES-NI you have to configure RUSTFLAGS to do so, so if you want to avoid AES-NI and fall back on the bitsliced implementation, simply don’t configure them. |
Yes, that's what I'm already doing. However this means no Debian users will be able to use AESNI, even if their CPU supports it. It would be better to detect it at runtime and enable it on that condition. |
Ok, yes I agree, but the tentative plan is to do that sort of runtime selection in higher-level crates like the AEAD crates (e.g. […])
Well, "how high do you want to go", why not an even higher-level application crate? If performance is a concern can't you use |
We need to benchmark the performance of various strategies |
Fair enough, although I would expect/hope that the macro itself is already pretty well-optimised so that in practice every call except the first would be as cheap as checking a boolean.
My understanding is it invokes the CPUID instruction every time, which is cheap but may have negative impacts on pipelining. |
The CPUID instruction itself certainly is not cheap (measured throughput can be as high as 1500 cycles), but […]. A much more serious problem is incompatible layouts between the bitsliced and AES-NI implementations. Sure, we could use an enum, but I guess we should benchmark runtime detection on the level of the […]
Another option to consider is gating runtime detection behind a cargo feature (which itself would depend on […])
Yes, that is unfortunate, but I'm not sure what else there is to be done (other than a more efficient bitsliced implementation). |
Another option would be to detect at runtime once, then set some function pointers and call through those. Branch predictors are very good at noticing that every call through a function pointer goes to the same place. |
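The detect-once-then-call-through-a-pointer idea can be sketched as below. Everything here is hypothetical scaffolding: `encrypt_soft` and `encrypt_hw` are trivial placeholders (both just increment each byte) standing in for the bitsliced and AES-NI routines, so only the one-time-selection pattern is real.

```rust
use std::sync::OnceLock;

type EncryptBlock = fn(&mut [u8; 16]);

// Stand-in for the bitsliced software implementation.
fn encrypt_soft(block: &mut [u8; 16]) {
    for b in block.iter_mut() {
        *b = b.wrapping_add(1);
    }
}

// Placeholder for an AES-NI routine; identical behavior here so the
// sketch runs anywhere. A real build would put accelerated code here.
fn encrypt_hw(block: &mut [u8; 16]) {
    encrypt_soft(block);
}

// Detect once; afterwards every call goes through the same function
// pointer, which branch predictors handle well.
fn encrypt_block(block: &mut [u8; 16]) {
    static BACKEND: OnceLock<EncryptBlock> = OnceLock::new();
    let f = BACKEND.get_or_init(|| {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("aes") {
                return encrypt_hw as EncryptBlock;
            }
        }
        encrypt_soft as EncryptBlock
    });
    f(block);
}

fn main() {
    let mut block = [0u8; 16];
    encrypt_block(&mut block);
    assert_eq!(block, [1u8; 16]);
    println!("ok");
}
```

The `OnceLock` ensures the CPUID-based detection runs at most once per process, regardless of how many threads call in.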
As I wrote in my previous message, branching is the least of our problems here (and the pointer chasing will probably be worse than a simple branch on an enum tag). The main problem is that by default we keep cipher state on the stack, and states in […]. I guess we could optionally store […]
I think we could feature-gate autodetection. Yes, it would increase AES-NI stack usage considerably, but in some cases being able to ship a relocatable binary which supports both is more important than that.
@newpavlov So, the blocker is having a software implementation to fall back to that has a reasonably sized state? (As well as ensuring that […].)

@tarcieri Yes, eating some extra stack or heap space (especially if it doesn't need initializing when not used) would be a small price to pay to be able to ship binaries that support systems without hardware acceleration but can use hardware acceleration if available. Also, I'd happily have a heap-based state for the software implementation if that means the hardware-accelerated implementation has less overhead and doesn't have a pile of unused memory.
Why not let the user choose? If it's on "the stack", then the user can simply move it to the heap by boxing it, right? This could be documented somewhere, and have the default […]. Alternatively, maybe a mechanism like ManagedSlice could be investigated, where the backing storage of the object is selected by the user (either a reference, or an owned object if the alloc feature is enabled).
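A minimal sketch of the "user boxes it" idea, using a hypothetical oversized soft-AES state (the type name and field size are illustrative, not the crate's actual layout):

```rust
// Hypothetical software-AES state whose size makes stack placement
// costly; the field is illustrative, not the crate's real layout.
struct SoftAesState {
    round_keys: [u32; 88],
}

impl SoftAesState {
    fn new() -> Self {
        SoftAesState { round_keys: [0; 88] }
    }
}

fn main() {
    // Default: the state lives on the stack.
    let on_stack = SoftAesState::new();

    // A user who prefers the heap simply boxes it; no API change needed.
    let on_heap: Box<SoftAesState> = Box::new(SoftAesState::new());

    // Same state size either way; only its placement differs.
    assert_eq!(std::mem::size_of_val(&on_stack), 88 * 4);
    assert_eq!(std::mem::size_of_val(&*on_heap), 88 * 4);
    let _ = (on_stack.round_keys[0], on_heap.round_keys[0]);
    println!("ok");
}
```

Since `Box<T>` derefs to `T`, any API taking `&T` or `&mut T` works unchanged with either placement, which is why this needs only documentation rather than new code in the crate.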
I think there's something even better we can do to get the best of both worlds. First, add an […]. Next, define enums for the various key sizes which look like this:

```rust
// Hypothetically we could wrap these enums in an opaque struct too
pub enum Aes128 {
    AesNi(aesni::Aes128),
    AesSoft(Box<aes_soft::Aes128>),
}
```

In other words, only box […]
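A constructor for such an enum could then pick the variant at runtime. In this sketch the `aesni::Aes128` and `aes_soft::Aes128` types are replaced with empty/illustrative placeholders, since only the selection logic and the size effect of boxing are being shown:

```rust
// Placeholders standing in for aesni::Aes128 and aes_soft::Aes128.
struct AesNiAes128;
struct SoftAes128 {
    _round_keys: [u32; 88], // illustrative size; boxed to keep the enum small
}

pub enum Aes128 {
    AesNi(AesNiAes128),
    AesSoft(Box<SoftAes128>),
}

impl Aes128 {
    pub fn new() -> Self {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("aes") {
                return Aes128::AesNi(AesNiAes128);
            }
        }
        Aes128::AesSoft(Box::new(SoftAes128 { _round_keys: [0; 88] }))
    }
}

fn main() {
    let cipher = Aes128::new();
    // Whichever backend is chosen, the enum itself stays small, because
    // the large software state lives behind the Box.
    assert!(std::mem::size_of::<Aes128>() <= 16);
    match cipher {
        Aes128::AesNi(_) => println!("aesni"),
        Aes128::AesSoft(_) => println!("soft"),
    }
}
```

Boxing only the software variant means the AES-NI fast path pays no heap allocation, while binaries that fall back to software keep the enum (and anything embedding it) stack-cheap.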
Now that we've implemented AES fixslicing (see #176, #177, and #185 among others), the memory consumed by the […]
Or in other words, AES-NI and 32-bit fixslicing use the same amount of space, and 64-bit fixslicing needs twice as much as the other two. This is because AES-NI needs separate sets of encryption and decryption round keys, 32-bit fixslicing can share round keys but operates over 2 blocks in parallel, and 64-bit fixslicing can also share them but operates over 4 blocks in parallel.

With the size difference eliminated/reduced, it seems like my suggestion above of using […]
I've opened a PR which implements runtime autodetection for AES-NI: #208 |
The AES implementation should be chosen at runtime rather than at compile time; otherwise it is very hard for people to ship products built on this crate, because they cannot choose the environment a product will run on.