Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various changes to ChaChaRng #243

Merged
merged 11 commits into from
Jan 25, 2018
Merged

Various changes to ChaChaRng #243

merged 11 commits into from
Jan 25, 2018

Conversation

pitdicker
Copy link
Contributor

This PR is based on top of #233.

Parts of the changes here come from my attempts with a with BlockRngCore trait, like #242. And better to finish this seperately, so all functional changes and changes to the documentation can be reviewed better.

The init function contained useful documentation that was not visible because init was not a public function. So I moved that to the documentation of ChaChaRng. Also I wrote some more in a similar style as HC-128.

I added some tests and better documentation for set_counter. Trying to use it to set a nonce is difficult, considering the trouble with endianness and different standards specifying different schemes for the nonce and counter. So I documented it to be not a supported use of set_rounds.

It turned out to be easy to make the number of rounds the core ChaCha algorithm runs configurable, just the addition of a set_rounds method.

And I stumbled on a way to improve the performance, by using an usize for the number of rounds and keeping the core algorithm a seperate function. No idea why it works though.

Benchmarks before:

test gen_bytes_chacha   ... bench:   2,782,004 ns/iter (+/- 5,165) = 368 MB/s
test gen_u32_chacha     ... bench:      10,774 ns/iter (+/- 11) = 371 MB/s
test gen_u64_chacha     ... bench:      21,564 ns/iter (+/- 75) = 370 MB/s

Benchmarks after this and #242:

test gen_bytes_chacha   ... bench:   1,898,515 ns/iter (+/- 6,578) = 539 MB/s
test gen_u32_chacha     ... bench:       8,357 ns/iter (+/- 24) = 478 MB/s
test gen_u64_chacha     ... bench:      16,207 ns/iter (+/- 47) = 493 MB/s

Finally remembered why it is a good idea to wait with generating results until the first use: it makes things like ChaChaRng::new().set_counter(500) possible.

cc @burdges, because I know you have an interest in ChaCha.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Just curious, how does ChaCha with 8 or 12 rounds compare to HC128 and ISAAC in terms of performance?

// https://tools.ietf.org/html/draft-nir-cfrg-chacha20-poly1305-04
let seed = [0u8; 32];
let mut rng: ChaChaRng = SeedableRng::from_seed(seed);
rng.set_counter(0, 2u64.to_be());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user needs to call .to_be() on the indexes to get portable results? That is counter-intuitive for a counter. Are you sure it's correct, given that actually set_counter converts each u64 to two u32s via bit-shift? I suspect it's not. (In fact, it's different from your example in the documentation.)

Think we should replace set_counter with something easier to use correctly? Would it be better to use four u32s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from the test vector, which sets a 96-bit nonce to 00 00 00 00 00 00 00 00 00 00 00 02. I am not sure I got it right with the .to_be() call. Probably should just not add that test, because in the comment I claimed we do not support using it for a nonce.

Think we should replace set_counter with something easier to use correctly? Would it be better to use four u32s?

If we of think ChaCha in terms of an RNG and not a stream cipher, an u128 counter is best. And two u64s an ok alternative. I don't think four u32s are handy to set a counter.

If we really want to set a nonce it should take a byte array, to match from_seed.

Just thought of another reason not to support setting a nonce: we still use it as part of a counter, and not as a constant value. So if you run the RNG long enough, eventually the counter will tick over and increment the nonce., producing 'incorrect' results.

Copy link
Member

@dhardy dhardy Jan 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then probably that test should use:

let counter = 2u64;
(counter >> 32) | (counter << 32)

— but I don't understand why the test passes as-is (it sounds like state[15] should be set to 2u32).

To be honest, I'm not really sure where this function would be useful, if at all. Perhaps we should also add:

fn set_nonce(&mut self, counter: u64, nonce: [u8; 8])

So if you run the RNG long enough, eventually the counter will tick over and increment the nonce., producing 'incorrect' results.

The only correct behaviour would be to stop with an error, which conflicts with our primary usage (as a good RNG). If used to encrypt/decrypt a longer byte stream, I guess it has to be up to external code to handle this part correctly (e.g. split into chunks, or just continue encrypting with a note on compatibility).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue: dhardy#86

For now just try shifting the nonce as I suggested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then probably that test should use:

let counter = 2u64;
(counter >> 32) | (counter << 32)

— but I don't understand why the test passes as-is (it sounds like state[15] should be set to 2u32).

Your code misses the endian troubles... But my 2u64.to_be() was wrong also because it does not get converted on BE. rng.set_counter(0, 2u64 << 56); works. But I am quite happy to see the test gone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slice [0u8, 0, 0, 2] is the same as 0x0200_0000u32 in little-endian, the format we standardised upon inside the RNG. Together with the u32s that are in opposite order the value is 0x0200_0000_0000_0000u64.

///
/// [1]: D. J. Bernstein, [*ChaCha, a variant of
/// Salsa20*](https://cr.yp.to/chacha.html)
/// ChaCha uses add-rotate-xor (ARX) operations as basis. These are safe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"as its basis"

///
/// With the ChaCha algorithm it is possible to choose the number of rounds the
/// core algorithm should run. By default `ChaChaRng` is created as ChaCha20,
/// with means 20 rounds. The number of rounds is a tradeoff between performance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"which means"

/// With the ChaCha algorithm it is possible to choose the number of rounds the
/// core algorithm should run. By default `ChaChaRng` is created as ChaCha20,
/// with means 20 rounds. The number of rounds is a tradeoff between performance
/// an security, 8 rounds are considered the minimum to be secure. A different
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and security

/// core algorithm should run. By default `ChaChaRng` is created as ChaCha20,
/// with means 20 rounds. The number of rounds is a tradeoff between performance
/// an security, 8 rounds are considered the minimum to be secure. A different
/// number of rounds can be set with [`set_rounds`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set using set_rounds, or via

@pitdicker
Copy link
Contributor Author

pitdicker commented Jan 21, 2018

Good comments, thank you.

how does ChaCha with 8 or 12 rounds compare to HC128 and ISAAC in terms of performance?

Still slowest, but not bad at all (with 8 rounds):

test gen_bytes_chacha   ... bench:     902,419 ns/iter (+/- 15,331) = 1134 MB/s
test gen_bytes_hc128    ... bench:     441,192 ns/iter (+/- 2,454) = 2320 MB/s
test gen_bytes_isaac    ... bench:     710,000 ns/iter (+/- 2,664) = 1442 MB/s
test gen_bytes_isaac64  ... bench:     381,510 ns/iter (+/- 870) = 2684 MB/s

gen_uint_new!(gen_u64_std, u64, StdRng);
gen_uint_new!(gen_u64_os, u64, OsRng);

// Do not test JitterRng like the others by running it RAND_BENCH_N times per,
// measurement, because it is way to slow. Only run it once
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's time you stopped confusing 'to' and 'too'!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I will try too not make to many mistakes with them anymore. Starting tomorrow 😄.

}

/// Refill the internal output buffer (`self.buffer`)
fn update(&mut self) {
core(&mut self.buffer, &self.state);
// For some reason extracting this part into a seperate function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate

@pitdicker
Copy link
Contributor Author

I tried using the code from @PeterReid's implementation of ChaCha, integrating it was super easy.

Results on stable:

test gen_bytes_chacha20 ... bench:   2,088,500 ns/iter (+/- 3,958) = 490 MB/s
test gen_u32_chacha20   ... bench:       9,195 ns/iter (+/- 31) = 435 MB/s
test gen_u64_chacha20   ... bench:      18,828 ns/iter (+/- 67) = 424 MB/s

Results on nightly with #[repr(simd)]:

test gen_bytes_chacha20 ... bench:   1,796,552 ns/iter (+/- 4,974) = 569 MB/s
test gen_u32_chacha20   ... bench:       7,722 ns/iter (+/- 24) = 518 MB/s
test gen_u64_chacha20   ... bench:      15,510 ns/iter (+/- 90) = 515 MB/s

Compared with our current implementation:

test gen_bytes_chacha   ... bench:   1,898,515 ns/iter (+/- 6,578) = 539 MB/s
test gen_u32_chacha     ... bench:       8,357 ns/iter (+/- 24) = 478 MB/s
test gen_u64_chacha     ... bench:      16,207 ns/iter (+/- 47) = 493 MB/s

The result with SIMD can be improved again with ±3% by making the state also #[repr(simd)]. The assembly for SIMD looks pretty much optimal to me. Yet it regresses stable by another 12%.

I am not sure if this change is worth it at the moment, because it is only faster on nightly. But nice to know that our implementation is actually pretty decent already.

@dhardy
Copy link
Member

dhardy commented Jan 25, 2018

Think I'll go ahead and merge manually. It's clear from the Travis failures that GitHub messed up the merge :-(

@dhardy dhardy merged commit c320c7d into rust-random:master Jan 25, 2018
@pitdicker pitdicker deleted the chacha branch January 25, 2018 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants