
Chacha20-Poly1305 encryption #14249

Open: robn wants to merge 1 commit into master from chapoly
Conversation

robn (Member) commented Dec 2, 2022

I’ve added a new encryption option to OpenZFS, chacha20-poly1305, which implements the Chacha20-Poly1305 AEAD described in RFC 8439. I’m interested in feedback, code review, testing and benchmarking from anyone interested in such things!

Important note: I have no particular skill in math, cryptography or secure software. I’m as confident as a reasonably good programmer can be that this is right, but I wouldn’t go trusting your secrets to this just yet!

Closes #8679.

Rationale

AES (and particularly the default AES-GCM mode used in OpenZFS) is known to be very slow on systems without hardware assistance. There are many such machines out there that could make good use of OpenZFS, especially low-cost machines and small boards that would otherwise make very nice storage machines. The Raspberry Pi series of machines are a good example.

The best option for these systems is an encryption option that performs well in software. Chacha20-Poly1305 is the current “standard” option for this in many contexts, and is a good choice for OpenZFS.

More information about the rationale can be found in Google’s work on Adiantum (though their solution is necessarily different because unlike OpenZFS it operates on whole disk blocks).

Implementation

Core algorithms

I’ve taken the “core” Chacha20 and Poly1305 implementations from Loup Vaillant’s Monocypher. They are small and easy to read and understand, and the author has written extensively about its development and in particular the mistakes made along the way, which (at least to me) inspires confidence in the implementation.

I considered a number of other implementations, including those in Sodium and OpenSSL as well as a number of other smaller implementations. Despite the relative simplicity of the algorithms, there is a lot of variability in the overall quality and usability of different implementations, so I opted for the most straightforward ones I could find.

ICP/KCF module

I’ve written a new KCF-style module for the ICP that implements the AEAD (the “construction” that binds the cipher (ChaCha20) and the authenticator (Poly1305) together) and the glue to make it available to OpenZFS proper.
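
For reference, here is a minimal sketch of the RFC 8439 "seal" operation this construction performs, assuming placeholder primitive names (chacha20_block, chacha20_xor and the poly1305_* calls stand in for the bundled Monocypher/ICP entry points; they are not the actual function names in this PR):

#include <stddef.h>
#include <stdint.h>

/* Placeholder prototypes -- not the real Monocypher or ICP names. */
typedef struct { uint8_t opaque[512]; } poly1305_ctx_t;
void chacha20_block(uint8_t out[64], const uint8_t key[32],
    const uint8_t nonce[12], uint32_t counter);
void chacha20_xor(uint8_t *out, const uint8_t *in, size_t len,
    const uint8_t key[32], const uint8_t nonce[12], uint32_t counter);
void poly1305_init(poly1305_ctx_t *ctx, const uint8_t key[32]);
void poly1305_update(poly1305_ctx_t *ctx, const uint8_t *msg, size_t len);
void poly1305_final(poly1305_ctx_t *ctx, uint8_t tag[16]);

static void
le64(uint8_t out[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

/*
 * RFC 8439 seal: derive a one-time Poly1305 key from ChaCha20 block 0,
 * encrypt with the keystream starting at block 1, then authenticate
 * aad || pad16(aad) || ct || pad16(ct) || le64(aadlen) || le64(ctlen).
 */
void
chapoly_seal(uint8_t tag[16], uint8_t *ct,
    const uint8_t *pt, size_t ptlen, const uint8_t *aad, size_t aadlen,
    const uint8_t key[32], const uint8_t nonce[12])
{
	uint8_t block0[64], lens[16];
	static const uint8_t zpad[16] = { 0 };
	poly1305_ctx_t mac;

	chacha20_block(block0, key, nonce, 0);
	chacha20_xor(ct, pt, ptlen, key, nonce, 1);

	le64(lens, aadlen);
	le64(lens + 8, ptlen);

	poly1305_init(&mac, block0);	/* first 32 bytes are the MAC key */
	poly1305_update(&mac, aad, aadlen);
	poly1305_update(&mac, zpad, (16 - aadlen % 16) % 16);
	poly1305_update(&mac, ct, ptlen);
	poly1305_update(&mac, zpad, (16 - ptlen % 16) % 16);
	poly1305_update(&mac, lens, 16);
	poly1305_final(&mac, tag);
}

The open (decrypt) path is the same construction in reverse, with the tag compared in constant time before any plaintext is released.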

Illumos does not currently support these algorithms, so it wasn’t possible to just take a module from there. The module I’ve written implements only what OpenZFS needs and is not sufficiently general-purpose to contribute back to Illumos; I have no plans to do that.

The rest of the module is as simple as I could figure out how to make it, especially the buffer management, which is quite complex in the AES modes.

Tests

I’ve added a test program that runs the test vectors in RFC 8439. For Chacha20 and Poly1305 this is mostly a basic sanity check to assist in the future with possible alternate implementations (eg hardware-accelerated versions), as there are many implementations out there that have surprising quirks even when used correctly (most commonly endianness problems).

There’s also a full module test, checking that the AEAD is constructed correctly (again using the test vectors) and also making sure the ICP module interface works.

The tests are only there to confirm basic operation; they have nothing to say about security properties or correctness.
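
As an illustration of the known-answer shape such tests take (a sketch only, not the actual test program from this PR; the vector arrays are left zeroed as placeholders to be filled from RFC 8439, and chapoly_seal is the hypothetical helper sketched above):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void chapoly_seal(uint8_t tag[16], uint8_t *ct,
    const uint8_t *pt, size_t ptlen, const uint8_t *aad, size_t aadlen,
    const uint8_t key[32], const uint8_t nonce[12]);

int
main(void)
{
	/* Placeholder sizes and (zeroed) values; the real vectors come
	 * straight from RFC 8439. */
	static const uint8_t key[32], nonce[12], aad[16];
	static const uint8_t plaintext[64], want_ct[64], want_tag[16];

	uint8_t ct[sizeof (plaintext)], tag[16];

	chapoly_seal(tag, ct, plaintext, sizeof (plaintext),
	    aad, sizeof (aad), key, nonce);

	if (memcmp(ct, want_ct, sizeof (ct)) != 0 ||
	    memcmp(tag, want_tag, sizeof (tag)) != 0) {
		fprintf(stderr, "AEAD vector mismatch\n");
		return (1);
	}
	printf("AEAD vector ok\n");
	return (0);
}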

Performance

I have only done very light performance tests to gain a rough idea of how much “better” or “worse” this is than aes-256-gcm on a handful of machines I have here. I do not know how to drive fio very well and I did not make much effort to control variables like other programs running on the machine, so I will not defend these numbers very hard. I’ve run them enough times to be reasonably confident they’re not wildly off the mark.

Method and fio output: https://gist.github.com/robn/4b40251ba84728cf8e669649ce77ad56

  • Broadcom BCM2711 (Raspberry Pi 4B): ~2.4x faster (no AES hardware accel)
  • Celeron J3455 (Fitlet2): ~1.3x faster (partial AES hardware accel)
  • Core i7-8565U (Thinkpad X1 Gen7): ~1.2x slower (full AES hardware accel)

(ChaPoly in software coming so close to AES in hardware makes me think there are probably still optimisation opportunities in the AES code, but I have not really considered this even a little bit.)

Other notes

  • This is only for Linux right now. It should be straightforward to implement for FreeBSD too, which has the necessary crypto primitives available in the kernel. I will try to add that soon. (Update: now with FreeBSD support.)
  • Related to that, I have not fully updated the tests yet because I don't want them to break any automatic PR test stuff that might run on FreeBSD. (Update: done.)
  • XChacha20 (the “better” Chacha variant) isn’t an option because OpenZFS cannot provide a large enough nonce to initialise it. In practice it wouldn’t matter because OpenZFS already takes steps to mitigate nonce reuse.

TODO

After discussion in monthly call:

  • figure out if anyone else wants this feature
  • document when to choose chapoly vs AES
  • test chapoly dataset on older ZFS (failure modes)
    • get encryption property
    • attempt to mount
    • (other dataset ops?)
  • test send/recv raw send
    • both sides support
    • source supports, receiver doesn't
    • source doesn't support (?), receiver does
    • neither support (?)
  • add feature flag

Future work

Just stuff I thought about along the way that I’m not committing to but might like to look at in the future.

  • I want to at least try accelerated versions (read: assembly implementations of Chacha20 and/or Poly1305) of this in the future, to see if there are significant gains to be had. That will likely make a runtime selection mechanism necessary (like icp_aes_impl and icp_gcm_impl).
  • The entire ICP/KCF system has a lot of unused stuff and overhead that I’d like to see radically slimmed down. I think it’s probably possible to lift zio_crypt.c up to be common to all ports, with a much smaller layer that calls down to OS-provided cryptographic functions or uses the bundled ones where the system doesn’t have something available. That’s not just for Linux, but would help in (for example) Illumos, which has AES in the kernel but not ChaPoly. I’d probably wait to see what the Windows and Mac ports look like when they’re finally merged before considering this seriously.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn (Member, Author) commented Dec 4, 2022

Last push has support for FreeBSD, which is very straightforward because all the crypto code is already available in the kernel.

Importantly, I've done an interop test: a pool with a chacha20-poly1305 dataset created on Linux can be imported, read and written on FreeBSD, and back again. I expected that, but it's nice to see it in action.

lundman (Contributor) commented Dec 4, 2022

I appreciate very much that you made it part of the ICP; it means nearly no work needs to be done on my side.

robn marked this pull request as ready for review on December 5, 2022
behlendorf added the Status: Design Review Needed label on Dec 5, 2022
robn (Member, Author) commented Dec 13, 2022

So, what's the next step for this? If it's just waiting until volunteers have time to review, no problem, I get it! If there's something I could do or need to do though, I'd like to get on to that.

(Should I be dropping in on the monthly call to talk about it?)

ryao (Contributor) commented Dec 13, 2022

> So, what's the next step for this? If it's just waiting until volunteers have time to review, no problem, I get it! If there's something I could do or need to do though, I'd like to get on to that.

Yes. There are a number of PRs waiting review and reviewers have limited bandwidth to look at them. Larger PRs like this take substantially longer to see review.

> (Should I be dropping in on the monthly call to talk about it?)

That might help attract reviewers.

evrim commented Jan 13, 2023

Hey there,

I must admit I'm a little bit confused here. Linux already has support for Chacha20Poly1305 under lib/ and the relevant arch/ directories, so why do we need another implementation?

robn (Member, Author) commented Jan 13, 2023

@evrim there are two ways to look at it. One is that Linux's crypto code is only available to GPL kernel modules, which OpenZFS is not. The other is that by doing it in the ICP, other ICP-using ports of OpenZFS (Windows and macOS) can use it.

jittygitty commented Feb 2, 2023

@robn Some of us (#11357) are building ZFS and setting it as a "GPL" module when we build; how can we then use the existing Linux built-in GPL implementations of chacha20 etc.? Thanks.

robn (Member, Author) commented Feb 3, 2023

@jittygitty there's no code in ZFS right now that can call into the kernel crypto API, regardless of how you build it - you will always be using the ICP for crypto even for AES modes. You'd need a variant of zio_crypt for that, which is well outside the scope of this PR.

robn (Member, Author) commented Feb 26, 2023

We talked about this at the 31 Jan call. Feedback was generally favourable, thanks all!

I was waiting to check my notes against the recording but that hasn't surfaced yet and I want to get moving on this a bit this week. So here's what I wrote down. If it turns out I got something wrong, I will of course adjust it.

  • We agreed that it was fine to use Monocypher, under a more general principle that we don't really have the resources or skills to review crypto implementations ourselves, and trusting external implementations that appear to know what they're doing is a reasonable approach for us. The existence of a proof was welcomed (Add machine-checked proofs for limb overflows LoupVaillant/Monocypher#246).
  • There should be wording in the manpage to give guidance about using AES vs chapoly that mostly boils down to "AES if you have hardware acceleration, chapoly otherwise".
  • More testing is required. I didn't write down specifics, but I've been thinking about it, and this is at least around upgrade/downgrade from/to ZFS versions with/without chapoly support, and how send/receive copes with same. Possibly this will require a new feature (like we do with checksum & compression algorithms)

Finally, and most importantly, it was agreed that this is obviously a nice feature, but if no one else wants it then perhaps we shouldn't be taking on the maintenance burden. I will soon go and try to get a rough show of hands to see if anyone else would use this. Probably a carefully-worded query to the mailing list, discussion forum and reddit, and go from there.

I'll return this to draft status for the moment, and add some additional todo items to the top post.

mcmilk (Contributor) commented Mar 3, 2023

> We talked about this at the 31 Jan call. Feedback was generally favourable, thanks all!
>
> I was waiting to check my notes against the recording but that hasn't surfaced yet and I want to get moving on this a bit this week. So here's what I wrote down. If it turns out I got something wrong, I will of course adjust it.
>
>   • We agreed that it was fine to use Monocypher, under a more general principle that we don't really have the resources or skills to review crypto implementations ourselves, and trusting external implementations that appear to know what they're doing is a reasonable approach for us. The existence of a proof was welcomed (Add machine-checked proofs for limb overflows LoupVaillant/Monocypher#246).
>   • There should be wording in the manpage to give guidance about using AES vs chapoly that mostly boils down to "AES if you have hardware acceleration, chapoly otherwise".
>   • More testing is required. I didn't write down specifics, but I've been thinking about it, and this is at least around upgrade/downgrade from/to ZFS versions with/without chapoly support, and how send/receive copes with same. Possibly this will require a new feature (like we do with checksum & compression algorithms).
>
> Finally, and most importantly, it was agreed that this is obviously a nice feature, but if no one else wants it then perhaps we shouldn't be taking on the maintenance burden. I will soon go and try to get a rough show of hands to see if anyone else would use this. Probably a carefully-worded query to the mailing list, discussion forum and reddit, and go from there.
>
> I'll return this to draft status for the moment, and add some additional todo items to the top post.

I find this feature useful and would help with this PR if you want.

There is also RFC 7905, which updates TLS/DTLS to use ChaCha20-Poly1305, so why shouldn't OpenZFS move with the times and have this option too?

robn (Member, Author) commented Mar 4, 2023

I did all the testing I wanted to do with a view to adding a feature flag.

I created a new pool on Linux with master 620a977 + this PR, exported it, then rebuilt on master alone and imported the pool.

All the encryption properties are returned except for encryption itself:

root@fitlet:~# zfs get encryption,encryptionroot,keyformat,keylocation,keystatus sandisk
NAME     PROPERTY        VALUE               SOURCE
sandisk  encryption      -                   -
sandisk  encryptionroot  sandisk             -
sandisk  keyformat       passphrase          -
sandisk  keylocation     file:///tmp/zfskey  local
sandisk  keystatus       unavailable         -

This appears to be because the underlying value is out of bounds for this enum, so the default (ZIO_CRYPT_OFF) is returned.

Attempting to load the key fails, in a slightly weird way:

root@fitlet:~# zfs load-key sandisk
Key load error: Incorrect key provided for 'sandisk'.

and of course it can't be mounted:

root@fitlet:~# zfs mount sandisk
cannot mount 'sandisk': encryption key not loaded

I say it fails in a slightly weird way, because I think it's just lucky this time.

If we're compiled with --enable-debug, it fails a lot earlier:

root@fitlet:~# zfs load-key sandisk

Message from syslogd@fitlet at Mar  4 20:25:08 ...
 kernel:[1281777.454629] VERIFY3(crypt < ZIO_CRYPT_FUNCTIONS) failed (9 < 9)

Message from syslogd@fitlet at Mar  4 20:25:08 ...
 kernel:[1281777.454642] PANIC at zio_crypt.c:568:zio_crypt_key_unwrap()

When asserts are disabled, it seems like it's only doing OK because a crypto table entry does actually exist for crypt == 9: it's the NULL entry at the end. On Linux, zio_crypt_key_unwrap sets up to decrypt the key and even tries, fails with ECKSUM, which is then swallowed and EACCES is returned. On FreeBSD I think it will fall out a little earlier, at the top of zio_do_crypt_uio_opencrypto, and return ENOTSUP, which again gets swallowed and replaced with EACCES.

I do think that behaviour is lucky. For the next new cipher suite, crypt == 10, it'd be running off the end of the crypto table. Even now, on Linux, it's trying to decrypt using only a partial state, so it could easily crash in other circumstances.
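
For illustration, the kind of early, explicit check that would turn this into a clean failure rather than a lucky one (a sketch only, not the actual zio_crypt.c code; zio_crypt_check_suite is a made-up name, while ZIO_CRYPT_FUNCTIONS, SET_ERROR and ZFS_ERR_CRYPTO_NOTSUP are the existing OpenZFS definitions):

static int
zio_crypt_check_suite(uint64_t crypt)
{
	/*
	 * Reject unknown suites up front instead of indexing
	 * zio_crypt_table at its NULL terminator (or past the end).
	 */
	if (crypt >= ZIO_CRYPT_FUNCTIONS)
		return (SET_ERROR(ZFS_ERR_CRYPTO_NOTSUP));
	return (0);
}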

So probably this alone is enough to suggest a feature flag is warranted.

robn (Member, Author) commented Mar 4, 2023

I did further send/receive testing as well.

When sender and receiver both support chapoly, it works exactly as it does for AES: a raw send doesn't require the dataset key to be loaded on the sender, a non-raw send does. All fine.

When there's no chapoly support on the sending side, raw send still works. A stream is produced, and if that's later received into a chapoly-supporting pool, it goes in just fine, the key can be loaded and it can be mounted.

Receiving a chapoly stream into a pool with no support fails:

root@fitlet:~# zfs recv -v sandisk/recv@snap < snap-from-chapoly.zfs                                                           
receiving full stream of sandisk@snap into sandisk/recv@snap                                                                   
cannot receive new filesystem stream: invalid backup stream     

It's not totally clear to me, without looking inside, what that's indicating. Presumably an invalid property of some sort. I'll check more closely tomorrow.

But again, it's likely that a feature flag is going to help here.

robn (Member, Author) commented Mar 4, 2023

> I find this feature useful and would help with this PR if you want.

@mcmilk sorry I missed this! Thanks! At this point I don't think there's much else to do other than as much code review as we can. The main thing I want to feel good about is that I'm not misusing the crypto stuff in some important way. General bugs would be bad but survivable, but the crypto being no good is worse than not having it at all. I think it's probably fine, but if you've got good ideas that'd be great.

robn (Member, Author) commented Mar 4, 2023

Ehh, I've got my brain in backwards this evening. A feature in the dataset isn't going to make an old implementation magically able to cope with it.

That said, I'll see if I can do something now to help for the next new crypto algorithm.

LoupVaillant commented

@ryao I’ve just run Monocypher’s speed tests under -fno-strict-aliasing; there was no measurable effect. This is not surprising considering Monocypher almost exclusively deals with byte buffers, and is explicitly written to allow them to alias in many cases. Things other than bytes tend to be local variables instead.

@mcmilk before resorting to assembly you might want to take a look at compiler intrinsics instead. I would start with Chacha20. Some day I’ll write a high performance version of Monocypher for vector capable 64-bit processors, but if you don’t want to wait an unspecified amount of time I recommend you take a look at Dolbeau’s implementation (featured in libsodium).

ryao (Contributor) commented May 9, 2023

> @ryao I’ve just run Monocypher’s speed tests under -fno-strict-aliasing; there was no measurable effect. This is not surprising considering Monocypher almost exclusively deals with byte buffers, and is explicitly written to allow them to alias in many cases. Things other than bytes tend to be local variables instead.
>
> @mcmilk before resorting to assembly you might want to take a look at compiler intrinsics instead. I would start with Chacha20. Some day I’ll write a high performance version of Monocypher for vector capable 64-bit processors, but if you don’t want to wait an unspecified amount of time I recommend you take a look at Dolbeau’s implementation (featured in libsodium).

It is possible to implement high level vector operations in GNU C and it is much easier to use/review than compiler intrinsics. In addition, the same code can be used on multiple architectures, which is more difficult to do with compiler intrinsics. Clang and GCC also do not always agree on compiler intrinsics, but they do agree on GNU C’s vector support, so you have a choice of both compilers. You can see me use GNU C’s vector extensions here:

#14234
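
As a small illustration of the style being described (a sketch, not code from this PR or from #14234): a ChaCha20 quarter-round written with GNU C vector types, operating on the four columns of the state at once. It builds with both GCC and Clang.

#include <stdint.h>

typedef uint32_t v4u32 __attribute__((vector_size(16)));

static inline v4u32
rotl32(v4u32 x, int n)
{
	/* All operators below work element-wise on the four lanes. */
	return ((x << n) | (x >> (32 - n)));
}

static inline void
chacha_quarter_round(v4u32 *a, v4u32 *b, v4u32 *c, v4u32 *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

The diagonal rounds additionally need per-lane rotations of b, c and d (e.g. __builtin_shuffle on GCC or __builtin_shufflevector on Clang), which is where the compiler-version caveats mentioned below start to bite.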

The biggest downside is that GCC does not optimize it correctly when you have a larger vector width than the hardware, due to a vector lowering bug (with the workaround I found only applying to fairly recent GCC versions), so my plan is to cache the assembly output from clang in the source tree and build with that when GCC is used. When the compiler is doing good optimization, the end result can outperform hand-written assembly. The WIP code in that PR already outperforms the existing hand-written assembly because the compiler was able to do loop optimizations that are not possible when the loop is written in C around inline assembly. I am not sure offhand if compiler intrinsics would be similarly optimized.

The other downsides of GNU C’s vector extensions are that the builtins intended to extend them are not supported by all of the compiler versions we support, and that getting the best results requires hand-tweaking the code in places where the assembly output shows the compiler’s optimization passes are weak.

Intel has its own alternative that is likely similarly easy to use and, according to Intel, more performant than compiler intrinsics, but it is presumably limited to x86:

https://www.intel.com/content/www/us/en/developer/articles/technical/simd-made-easy-with-intel-ispc.html

LoupVaillant commented

> The biggest downside is that GCC does not optimize it correctly when you have a larger vector width than the hardware due to a vector lowering bug (with the workaround I found only applying to fairly recent GCC versions), so my plan is to cache the assembly output from clang in the source tree and build with that when gcc is used.

Ah, I see. Caching generated assembly might be good too. Note that Chacha20 has 32 integers worth of state, which means as many vectors, plus any intermediate variable. I believe the typical CPU doesn’t have that many vectors, so you’d have to suffer manual register allocation. I wouldn’t look forward to that, so it’s good that you can steal generated code from a non-buggy compiler.

ryao (Contributor) commented May 12, 2023

> The biggest downside is that GCC does not optimize it correctly when you have a larger vector width than the hardware due to a vector lowering bug (with the workaround I found only applying to fairly recent GCC versions), so my plan is to cache the assembly output from clang in the source tree and build with that when gcc is used.
>
> Ah, I see. Caching generated assembly might be good too. Note that Chacha20 has 32 integers worth of state, which means as many vectors, plus any intermediate variable. I believe the typical CPU doesn’t have that many vectors, so you’d have to suffer manual register allocation. I wouldn’t look forward to that, so it’s good that you can steal generated code from a non-buggy compiler.

I just realized that this conflicts with my earlier suggestion that Monocypher is a candidate to rewrite in WUFFS. If you were to adopt WUFFS, you would be forced to use compiler intrinsics to make a vectorized version, since WUFFS does not support GNU C's vector extension. However, you could use GNU C for an initial version, do some tweaking to get good code generation out of clang based on how the code performs, and then rewrite it in WUFFS using intrinsics that map to the instruction choices that the compiler made. That would likely give a better result than manually guessing which intrinsics are the most appropriate.

Passing the WUFFS compiler would prove certain properties of the code, although implementing vectorized code in WUFFS would have the downside that you would need to implement a different version for each major architecture unless you were to rely on SIMDe. Ironically, SIMDe has code to use the GNU C vector extension, which would bring things back to GNU C vector code, although it would likely be less optimal than the hypothetical original that I suggested.

By the way, it occurs to me that there is an easier way to prove the absence of integer overflows in monocypher than a rewrite in WUFFS, which would be to run the code through NASA's IKOS:

https://github.com/NASA-SW-VnV/ikos

There is a presentation on it here, which includes a list of the various issues that it is designed to detect:

https://ntrs.nasa.gov/api/citations/20220007431/downloads/SWS_TC3_IKOS_Tutorial.pdf

If it complains, it does not necessarily mean that there is an issue, but if it does not complain, then it has proven the absence of the issues that it is designed to detect, which includes integer overflow issues. It is unfortunate that IKOS does not support concurrency, which has kept me from using it to find bugs in ZFS.

ryao (Contributor) commented May 12, 2023

It turns out that IKOS can prove a superset of what the WUFFS compiler proves, so if the monocypher code passes inspection by IKOS, it would be proven to have all of the correctness properties that the WUFFS compiler guarantees (minus hermeticity, which IKOS does not check, although it is fairly obvious that the monocypher code is hermetic).

It occurs to me that we should look into using IKOS for the portions of ZFS that are non-concurrent, like compression/decompression, encryption/decryption, checksumming and parity calculations. Since it relies on the Clang frontend, it presumably can check code using GNU C's vector extension too. :)

robn (Member, Author) commented May 13, 2023

@behlendorf thanks for the nice list, getting on it tomorrow!

>   • Register a new feature flag for Chacha20-Poly1305 and enable it for the relevant datasets. As you noticed above in a production build you get a bit lucky and just get a strange error, but in a debug build it ASSERTs immediately. We need to detect that incompatibility early with the feature flag.

I didn't really want to do a feature flag, because it seems like missing support means being unable to unlock and mount the dataset, but still allows the pool to be imported, scrubbed and otherwise manipulated, and not being able to unlock the dataset is actually a stated feature of ZFS encryption (remote backups without needing the keys). Blocking import entirely feels like a bit much. This is what I was hoping to head off with #14577 (which I have to get back into 2.1).

But, receive doesn't work (even though it could), and obviously none of this helps older ZFS, which at least will check for feature flags, and tripping assertions is an indicator of some danger. So I'll do it, and I'm not really arguing, just a bit bummed about it.

>   • For zfs send/recv when adding a compression or checksum type we've added a new DMU_BACKUP_ flag for it. That said, for encryption it looks to me like that's not going to be needed (which is good since we're out of bits for flags). For a normal send stream there's no issue since we're not sending anything encrypted. And for a raw send stream the receiving side will already validate that the key is for a known crypto suite (dsl_crypto_recv_raw() -> dsl_crypto_recv_key_check() -> dsl_crypto_recv_raw_key_check()). What still might be nice here is to look into printing as useful an error as possible in the ZFS_ERR_CRYPTO_NOTSUP case.

Sorry, what are you asking for specifically? Right now zfs recv will handle ZFS_ERR_CRYPTO_NOTSUP by printing stream uses crypto parameters not compatible with this pool. Are you thinking more than that?

I suppose we could do a feature flag check at that point as well, and not accept the stream if the admin has explicitly disabled the feature.

>   • Look into extending some of the encryption related ZTS test cases to use the new feature. Most of those test cases have "encrypt" in the file name and use encryption=on. However, a few of them do test all the algorithms, for example cli_root/zpool_create/zpool_create_crypt_combos.ksh.

I've already added chacha20-poly1305 to the list tested by zpool_create_crypt_combos and zfs_create_crypt_combos. I'll extend those in some way to include feature tests.

I didn't check them all, but a quick look suggests the other tests don't seem to be dependent on the type of crypto in use. I could change the ones that explicitly call for an AES suite to just say encryption=on, and then extend that to have some test helper function choose the encryption if the test doesn't care (cycling or something).

The receive tests will need to make sure they do the right thing when the feature is disabled, of course.

>   • If you're convinced the CodeQL warnings are in fact false positives just let me know and I can mark those as false positives.

I am satisfied there's no problem per #14249 (comment). Please mark them as false positives.

robn force-pushed the chapoly branch 2 times, most recently from 02498c9 to 5924253 on May 21, 2023
robn (Member, Author) commented May 21, 2023

Top commit adds chacha20_poly1305 feature, and checks/activates at dataset creation and raw receive. Tests to come.

robn force-pushed the chapoly branch 2 times, most recently from aa3c336 to 5c73546 on May 23, 2023
jimthedj65 commented May 25, 2023 via email

Code scanning / CodeQL flagged four "Multiplication result converted to larger type" (High) failures on the Poly1305 limb products in module/icp/monocypher.c, all of the form "Multiplication result may overflow 'unsigned int' before it is converted to 'unsigned long long'.":

const u32 s4 = h4 + end;

// (h + c) * r, without carry propagation
const u64 x0 = s0*r0+ s1*rr3+ s2*rr2+ s3*rr1+ s4*rr0;
const u64 x1 = s0*r1+ s1*r0 + s2*rr3+ s3*rr2+ s4*rr1;
const u64 x2 = s0*r2+ s1*r1 + s2*r0 + s3*rr3+ s4*rr2;
const u64 x3 = s0*r3+ s1*r2 + s2*r1 + s3*r0 + s4*rr3;
robn (Member, Author) commented Oct 28, 2023

Last push updates Monocypher to 4.0.2, adds a test to ensure replication streams with encrypted datasets are accepted or not depending on whether or not the feature flag is set, and collapses it down into a single commit.

The CodeQL failures are the same as the ones we saw right at the beginning, and are safe.

@behlendorf how do you feel about this nearly a year on? :) I don't think there's any more to do.

robszy commented Nov 23, 2023

Did anyone think about the wireguard zinc implementation, as in this comment: #8679 (comment)?
It would be, out of the box, a high-performance (vector extensions), formally verified implementation from wireguard.

Today we will only have an implementation that doesn't use vector extensions, targeted only at Raspberry Pi-like devices :(

ErrorNoInternet (Contributor) commented Apr 13, 2024

Hi, thanks for working on this!

I decided to try this out on a Raspberry Pi 4B (4 core, 4 GB RAM) and a 1 TB HDD connected via a USB enclosure.

One issue though: I noticed I'm only getting about 30 MB/s read speeds and massive amounts of CPU usage. I can get 100 MB/s write speeds just fine (and this matches the speed of the chacha20 Rust crate without neon).

Without any encryption, I get 150 MB/s write and 170 MB/s read.

Upon further inspection, it seems to be because of the massive number of messages in dmesg. I don't get them with AES-256-GCM though.


[77650.336483] CPU: 3 PID: 1235586 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.336491] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.336494] Call trace:
[77650.336498]  dump_backtrace+0x9c/0x128
[77650.336507]  show_stack+0x20/0x38
[77650.336511]  dump_stack_lvl+0x48/0x60
[77650.336520]  dump_stack+0x18/0x28
[77650.336526]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.336557]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.337177]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.337596]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.337971]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.338522] Large kmem_alloc(131328, 0x0), please file an issue at:
               https://github.com/openzfs/zfs/issues/new
[77650.338349]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.338802]  zio_decrypt+0xf4/0x498 [zfs]
[77650.339167]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.339572]  zio_done+0x12c/0x10e0 [zfs]
[77650.339991]  zio_execute+0xb4/0x160 [zfs]
[77650.340353]  taskq_thread+0x2cc/0x518 [spl]
[77650.340388]  kthread+0xe8/0xf8
[77650.340395]  ret_from_fork+0x10/0x20
[77650.340403] CPU: 2 PID: 1235585 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.340409] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.340412] Call trace:
[77650.340415]  dump_backtrace+0x9c/0x128
[77650.340424]  show_stack+0x20/0x38
[77650.340428]  dump_stack_lvl+0x48/0x60
[77650.340437]  dump_stack+0x18/0x28
[77650.340443]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.340474]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.341008]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.341407]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.341814]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.342376] Large kmem_alloc(131328, 0x0), please file an issue at:
               https://github.com/openzfs/zfs/issues/new
[77650.342204]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.342694]  zio_decrypt+0xf4/0x498 [zfs]
[77650.343064]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.343436]  zio_done+0x12c/0x10e0 [zfs]
[77650.343805]  zio_execute+0xb4/0x160 [zfs]
[77650.344177]  taskq_thread+0x2cc/0x518 [spl]
[77650.344211]  kthread+0xe8/0xf8
[77650.344218]  ret_from_fork+0x10/0x20
[77650.344227] CPU: 3 PID: 1235586 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.344234] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.344238] Call trace:
[77650.344241]  dump_backtrace+0x9c/0x128
[77650.344251]  show_stack+0x20/0x38
[77650.344255]  dump_stack_lvl+0x48/0x60
[77650.344263]  dump_stack+0x18/0x28
[77650.344269]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.344300]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.344873]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.345284]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.345685]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.346261] Large kmem_alloc(131328, 0x0), please file an issue at:
               https://github.com/openzfs/zfs/issues/new
[77650.346059]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.346512]  zio_decrypt+0xf4/0x498 [zfs]
[77650.346873]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.347266]  zio_done+0x12c/0x10e0 [zfs]
[77650.347755]  zio_execute+0xb4/0x160 [zfs]
[77650.348141]  taskq_thread+0x2cc/0x518 [spl]
[77650.348179]  kthread+0xe8/0xf8
[77650.348187]  ret_from_fork+0x10/0x20
[77650.348195] CPU: 2 PID: 1235585 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.348202] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.348205] Call trace:
[77650.348209]  dump_backtrace+0x9c/0x128
[77650.348218]  show_stack+0x20/0x38
[77650.348222]  dump_stack_lvl+0x48/0x60
[77650.348231]  dump_stack+0x18/0x28
[77650.348236]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.348267]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.348746]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.349117]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.349540]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.350165] Large kmem_alloc(131328, 0x0), please file an issue at:
               https://github.com/openzfs/zfs/issues/new
[77650.349927]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.350404]  zio_decrypt+0xf4/0x498 [zfs]
[77650.350766]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.351133]  zio_done+0x12c/0x10e0 [zfs]
[77650.351500]  zio_execute+0xb4/0x160 [zfs]
[77650.351916]  taskq_thread+0x2cc/0x518 [spl]
[77650.351959]  kthread+0xe8/0xf8
[77650.351968]  ret_from_fork+0x10/0x20
[77650.351977] CPU: 3 PID: 1235586 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.351984] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.351987] Call trace:
[77650.351990]  dump_backtrace+0x9c/0x128
[77650.352001]  show_stack+0x20/0x38
[77650.352005]  dump_stack_lvl+0x48/0x60
[77650.352013]  dump_stack+0x18/0x28
[77650.352019]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.352049]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.352645]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.353010]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.353396]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.353765]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.354077] Large kmem_alloc(131328, 0x0), please file an issue at:
               https://github.com/openzfs/zfs/issues/new
[77650.354217]  zio_decrypt+0xf4/0x498 [zfs]
[77650.354597]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.354982]  zio_done+0x12c/0x10e0 [zfs]
[77650.355356]  zio_execute+0xb4/0x160 [zfs]
[77650.355727]  taskq_thread+0x2cc/0x518 [spl]
[77650.355761]  kthread+0xe8/0xf8
[77650.355769]  ret_from_fork+0x10/0x20
[77650.355778] CPU: 2 PID: 1235585 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.355786] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.355789] Call trace:
[77650.355792]  dump_backtrace+0x9c/0x128
[77650.355802]  show_stack+0x20/0x38
[77650.355806]  dump_stack_lvl+0x48/0x60
[77650.355814]  dump_stack+0x18/0x28
[77650.355820]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.355850]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.356370]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.356742]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.357155]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.357576]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.358118]  zio_decrypt+0xf4/0x498 [zfs]
[77650.358631]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.359133]  zio_done+0x12c/0x10e0 [zfs]
[77650.359632]  zio_execute+0xb4/0x160 [zfs]
[77650.360101]  taskq_thread+0x2cc/0x518 [spl]
[77650.360141]  kthread+0xe8/0xf8
[77650.360148]  ret_from_fork+0x10/0x20
[77650.360160] CPU: 0 PID: 1857 Comm: z_rd_int Tainted: P         C O       6.6.26 #1-NixOS
[77650.360181] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[77650.360185] Call trace:
[77650.360188]  dump_backtrace+0x9c/0x128
[77650.360198]  show_stack+0x20/0x38
[77650.360203]  dump_stack_lvl+0x48/0x60
[77650.360212]  dump_stack+0x18/0x28
[77650.360218]  spl_kmem_alloc+0x140/0x158 [spl]
[77650.360252]  chapoly_decrypt_atomic+0x74/0x2a0 [zfs]
[77650.360861]  crypto_decrypt+0x80/0x1c0 [zfs]
[77650.361271]  zio_do_crypt_uio+0x1b8/0x228 [zfs]
[77650.361756]  zio_do_crypt_data+0x1e8/0xf98 [zfs]
[77650.362434]  spa_do_crypt_abd+0x1fc/0x308 [zfs]
[77650.363077]  zio_decrypt+0xf4/0x498 [zfs]
[77650.363705]  zio_pop_transforms+0x68/0x98 [zfs]
[77650.364241]  zio_done+0x12c/0x10e0 [zfs]
[77650.364844]  zio_execute+0xb4/0x160 [zfs]
[77650.365425]  taskq_thread+0x2cc/0x518 [spl]
[77650.365469]  kthread+0xe8/0xf8
[77650.365478]  ret_from_fork+0x10/0x20

zpool iostat -vw during read:

ata-WDC_WD10EZEX-08M2NA0_WD-WCC3F6RAKJ32     total_wait     disk_wait    syncq_wait    asyncq_wait
latency                                      read  write   read  write   read  write   read  write  scrub   trim  rebuild
------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns                                             0      0      0      0      0      0      0      0      0      0      0
3ns                                             0      0      0      0      0      0      0      0      0      0      0
7ns                                             0      0      0      0      0      0      0      0      0      0      0
15ns                                            0      0      0      0      0      0      0      0      0      0      0
31ns                                            0      0      0      0      0      0      0      0      0      0      0
63ns                                            0      0      0      0      0      0      0      0      0      0      0
127ns                                           0      0      0      0      0      0      0      0      0      0      0
255ns                                           0      0      0      0      0      0      0      0      0      0      0
511ns                                           0      0      0      0      0      0      0      0      0      0      0
1us                                             0      0      0      0      0      0      0      0      0      0      0
2us                                             0      0      0      0      0      0      0      0      0      0      0
4us                                             0      0      0      0      0      0      0      0      0      0      0
8us                                             0      0      0      0      0      0      0      0      0      0      0
16us                                            0      0      0      0      0      0      0      0      0      0      0
32us                                            0      0      0      0      0      0      0      0      0      0      0
65us                                            0      0      0      0      0      0      0      0      0      0      0
131us                                           0      0      0      0      0      0      0      0      0      0      0
262us                                           0      0      0      0      0      0      0      0      0      0      0
524us                                           0      0      0      0      0      0      0      0      0      0      0
1ms                                             0      0      0      0      0      0      0      0      0      0      0
2ms                                             0      0      0      0      0      0      0      0      0      0      0
4ms                                             0      0      0      0      0      0      0      0      0      0      0
8ms                                             0      0      0      0      0      0      0      0      0      0      0
16ms                                            0      0      0      0      0      0      0      0      0      0      0
33ms                                            0      0      0      0      0      0      0      0      0      0      0
67ms                                            0      0      0      0      0      0      0      0      0      0      0
134ms                                           0      0     29      0      0      0      0      0      0      0      0
268ms                                           0      0      0      0      0      0      0      0      0      0      0
536ms                                           0      0      0      0      0      0      0      0      0      0      0
1s                                              0      0      0      0      0      0      0      0      0      0      0
2s                                             29      0      0      0      0      0     29      0      0      0      0
4s                                              0      0      0      0      0      0      0      0      0      0      0
8s                                              0      0      0      0      0      0      0      0      0      0      0
17s                                             0      0      0      0      0      0      0      0      0      0      0
34s                                             0      0      0      0      0      0      0      0      0      0      0
68s                                             0      0      0      0      0      0      0      0      0      0      0
137s                                            0      0      0      0      0      0      0      0      0      0      0
-----------------------------------------------------------------------------------------------------------------------

robn (Member, Author) commented Apr 13, 2024

@ErrorNoInternet thanks for the report! I've pushed a small change that should at least deal with the log spam.
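
For the curious, a guess at the general shape of such a change (a sketch only, not the actual commit): the SPL prints the "Large kmem_alloc" warning when kmem_alloc() is asked for buffers above its warning threshold, as with the ~128K scratch buffer in chapoly_decrypt_atomic() here, and vmem_alloc()/vmem_free() are the usual way to get large temporary buffers without the noise. The helper names below are made up.

/* Hypothetical helpers; kmem/vmem and KM_SLEEP come from the SPL headers. */
static void *
chapoly_scratch_alloc(size_t len)
{
	/* vmem_alloc() avoids the "Large kmem_alloc" warning for big buffers. */
	return (vmem_alloc(len, KM_SLEEP));
}

static void
chapoly_scratch_free(void *buf, size_t len)
{
	vmem_free(buf, len);
}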

ErrorNoInternet (Contributor) commented Apr 13, 2024

Nice! Getting stable 110+ MB/s read speeds now :) Thank you!

robn (Member, Author) commented Apr 13, 2024

Great to know, thanks :)

This commit implements the Chacha20-Poly1305 AEAD from RFC 8439 as a new
algorithm option for encrypted datasets.

AES (and particularly the default AES-GCM mode used in OpenZFS) is known
to be very slow on systems without hardware assistance. There are many
such machines out there that could make good use of OpenZFS, especially
low-cost machines and small boards that would otherwise make very nice
storage machines. The Raspberry Pi series of machines are a good
example.

The best option for these systems is an encryption option that performs
well in software. Chacha20-Poly1305 is the current "standard" option for
this in many contexts, and is a good choice for OpenZFS.

The core Chacha20 and Poly1305 implementations are taken from Loup
Vaillant's Monocypher. These were chosen because they are compact, easy
to read, easy to use and the author has written extensively about its
development, all of which give me confidence that there are unlikely to
be any surprises.

I've added a KCF-style module to the ICP to implement the AEAD. This
implements just enough for OpenZFS, and is not suitable as a
general-purpose KCF for Illumos (though it could be the starting point
for one).

For FreeBSD, which does not use the ICP, I've instead hooked it up to
FreeBSD's builtin crypto stack.

The rest is adding an enabling property value and a feature flag,
hooking it up to all the right touch points, and documentation updates.

The existing tests that cycle through the possible encryption options
have been extended to add one more.

I've added a test to ensure that raw receives of chacha20-poly1305
datasets do the right thing based on the state of the feature flag on
the receiving side.

There's also a test unit that runs the test vectors in RFC 8439 against
Chacha20, Poly1305 and the AEAD in the ICP that combines them. This is
most useful as a sanity check during future work to add alternate
(accelerated) implementations.

Finally, manual interop testing has been done to confirm that pools and
streams can be moved between Linux and FreeBSD correctly.

Light and uncontrolled performance testing on a Raspberry Pi 4B
(Broadcom BCM2711, no hardware AES) writing to a chacha20-poly1305
dataset was ~2.4x faster than aes-256-gcm on the same hardware. On a
Fitlet2 (Celeron J3455, AES-NI but no AVX (openzfs#10846)) it was ~1.3x faster.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Labels: Status: Design Review Needed

Successfully merging this pull request may close these issues:

  • Feature: ChaCha20-Poly1305 Encryption