
Reseeding perf #76

Merged: 4 commits, Jan 1, 2018
Conversation

@pitdicker

I wanted to try inverting the counter in ReseedingRng as discussed in #59, but it turned out the performance was pretty bad to begin with.

Benchmarks before:

test reseeding_xorshift_bytes ... bench:     559,635 ns/iter (+/- 5,160) = 1829 MB/s
test reseeding_xorshift_u32   ... bench:       4,265 ns/iter (+/- 27) = 937 MB/s
test reseeding_xorshift_u64   ... bench:       4,893 ns/iter (+/- 30) = 1634 MB/s

After:

test reseeding_xorshift_bytes ... bench:     562,940 ns/iter (+/- 980) = 1819 MB/s
test reseeding_xorshift_u32   ... bench:       2,323 ns/iter (+/- 9) = 1721 MB/s
test reseeding_xorshift_u64   ... bench:       2,918 ns/iter (+/- 3) = 2741 MB/s

And plain Xorshift for comparison:

test gen_bytes_xorshift       ... bench:     555,592 ns/iter (+/- 10,734) = 1843 MB/s
test gen_u32_xorshift         ... bench:       1,372 ns/iter (+/- 12) = 2915 MB/s
test gen_u64_xorshift         ... bench:       2,643 ns/iter (+/- 28) = 3026 MB/s

I don't like several parts of the current design of ReseedingRng, but that is for another issue.

@dhardy (Owner) left a comment

It seems you never cease to find something to optimise!

Looks good other than the one thing. I suppose swapping the check, generate logic allows some degree of parallelism. Not sure what you mean about not liking the way it works though?

src/reseeding.rs Outdated
if e.kind.should_wait() {
// Delay reseeding
self.bytes_until_reseed = self.threshold >> 8;
break;
@dhardy (Owner)

This sets bytes_until_reseed twice. Should it return instead?

I admit that the logic of this function is weird but I guess something like this is useful.
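The suggested change can be illustrated with a toy model of the control flow (the struct, enum, and shift amount here are all hypothetical, not the actual reseeding.rs code): returning early on a transient error keeps the counter reset at the end of the function from overwriting the delayed value.

```rust
// Toy model of the control flow under discussion; names are illustrative.
pub struct ToyReseeding {
    pub bytes_until_reseed: u64,
    pub threshold: u64,
}

pub enum ReseedOutcome {
    Done,
    ShouldWait, // transient failure: try again soon
}

impl ToyReseeding {
    pub fn reseed(&mut self, outcome: ReseedOutcome) {
        if let ReseedOutcome::ShouldWait = outcome {
            // Delay reseeding: retry after 1/256th of the usual threshold.
            self.bytes_until_reseed = self.threshold >> 8;
            return; // `break` would fall through to the reset below
        }
        self.bytes_until_reseed = self.threshold;
    }
}
```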

@pitdicker (Author)

Hee, you're right.

@pitdicker (Author)

Not sure what you mean about not liking the way it works though?

Some parts seem not very ergonomic. For example the new function takes an already existing RNG. Why does it need you to initialize the wrapped rng separately? If it knows how to reseed an rng, surely it can also initialize one?

The from_reseeder function does not make much sense to me. The main difference with new is that it comes with a default threshold. The DEFAULT_RESEEDING_THRESHOLD is very small and basically always wrong. The threshold value depends on the paranoia of the application/library, and on the wrapped RNG, so I don't think having a default is a good idea. Also it takes a fixed seed, instead of using the SeedFromRng trait.

And I didn't like the Reseeder trait. Why can't ReseedingRng just use some other Rng, without requiring a custom wrapper? Those were my thoughts yesterday though; I may be coming around on the last issue...

@pitdicker (Author)

I have been trying out a wild idea today to reduce the overhead of ThreadRng, which uses ReseedingRng. Because that seems like the commonly used interface, I don't really want it to look bad :-). In rand in the nursery its performance is only 50% of the RNG it wraps. After sprinkling some #[inline]s and removing some wrappers it gets to 80% of the RNG.

It seems sensible to assume that if you want to reseed an RNG, it will probably be a cryptographic RNG, not a simple one. The variants I have seen so far all generate blocks of results, not one result at a time. One way to reduce the overhead of ReseedingRng is to do the reseed 'bookkeeping' only when a new block of results is generated, not on every next_* call.

As a test I have added an RngCore trait:

pub trait RngCore: Sized {
    type Results: AsRef<[u32]>;

    fn init(seed: &[u32]) -> Self;
    fn generate(&mut self, results: &mut Self::Results);
    fn results_empty() -> Self::Results;
}

It just exposes the core algorithm of an RNG, and does not include the buffering etc. necessary to implement the Rng trait. It does not really make the implementation of HC-128 less clean, and should also not be too hard to implement for ISAAC and ChaCha.

Because ReseedingRng now has access to the RNG's algorithm, it can use just that part and implement its own buffering and bookkeeping. This should (not completely true for next_u32 yet) bring the overhead of ReseedingRng down to almost 0%, and bring the performance of ThreadRng within 90% of the wrapped RNG.

Do you think this a direction worth pursuing?

This is my current super ugly, many things comment out, WIP pitdicker@25dfbdd
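The per-block bookkeeping idea can be sketched roughly like this. Everything below (the trait shape, the counter handling, the toy counting core) is illustrative, not the actual WIP code: the point is only that the reseed counter is touched once per block, on the refill path, rather than on every next_u32.

```rust
// Hypothetical core trait: exposes only the block-generation step.
pub trait BlockRngCore {
    type Results: AsRef<[u32]> + Default;
    fn generate(&mut self, results: &mut Self::Results);
}

// Reseeding wrapper that owns the buffer, so all reseed bookkeeping
// happens on the slow path (once per block), not on every next_u32.
pub struct ReseedingBlockRng<R: BlockRngCore> {
    core: R,
    results: R::Results,
    index: usize,
    bytes_until_reseed: i64,
    threshold: i64,
}

impl<R: BlockRngCore> ReseedingBlockRng<R> {
    pub fn new(core: R, threshold: i64) -> Self {
        ReseedingBlockRng {
            core,
            results: R::Results::default(),
            index: usize::MAX, // forces a generate() on the first call
            bytes_until_reseed: threshold,
            threshold,
        }
    }

    pub fn next_u32(&mut self) -> u32 {
        let len = self.results.as_ref().len();
        if self.index >= len {
            // Slow path: account for the whole block at once.
            self.bytes_until_reseed -= (len * 4) as i64;
            if self.bytes_until_reseed <= 0 {
                // A real implementation would reseed `self.core` here.
                self.bytes_until_reseed = self.threshold;
            }
            self.core.generate(&mut self.results);
            self.index = 0;
        }
        let value = self.results.as_ref()[self.index];
        self.index += 1;
        value
    }
}

// Toy core for demonstration: fills the block with consecutive integers.
pub struct CountingCore { state: u32 }
impl BlockRngCore for CountingCore {
    type Results = [u32; 16];
    fn generate(&mut self, results: &mut [u32; 16]) {
        for r in results.iter_mut() {
            self.state = self.state.wrapping_add(1);
            *r = self.state;
        }
    }
}
```

The fast path is then just a bounds check, an index, and an increment; the counter arithmetic and the reseed decision are amortised over the whole block.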

@dhardy (Owner) commented Dec 18, 2017

This trait requires the implementation to use u32 internally though, right? Can that be made a parameter? Also, I don't much like using a &[u32] seed; maybe you can do something like:

pub trait RngCore<T>: SeedableRng {
    type Results: AsRef<[T]>;

    fn init(seed: &<Self as SeedableRng>::Seed) -> Self;
}

Not quite sure what I think right now; this would make ReseedingRng only usable for certain classes of RNGs, right? But maybe that's an advantage allowing reseed to combine both current state (or output buffer) with a fresh seed instead of simply replace with a new seed. You're right, the wrapper is not very useful for fast, weak PRNGs.

@pitdicker (Author)

I was already a little proud I got the extra abstraction working, including the AsRef slice trick to be generic over arrays. But what you write is much better!

But maybe that's an advantage allowing reseed to combine both current state (or output buffer) with a fresh seed instead of simply replace with a new seed.

It is more that it combines filling a new output buffer with the bookkeeping for when to reseed the RNG. Checking and managing counters can take almost as much time as filling the output buffer with fresh values, so every little bit of work we can avoid on each step helps.

Not quite sure what I think right now; this would make ReseedingRng only usable for certain classes of RNGs, right?

I think I named it ReseedingBlockRng? I think nothing prevents using the current ReseedingRng with all kinds of RNGs. But this should be a faster alternative for what are basically CryptoRngs. And if reseeding other kinds of RNGs does not make much sense, we could remove ReseedingRng.

What do you think about introducing an RngCore trait? In a way I find it a bit ugly, because the main motivation is just to increase the performance of a reseeding wrapper. But because of ThreadRng this may be worthwhile.

On the other hand, IsaacRng, ChaChaRng and Hc128Rng now all have an impl block with the core algorithm, and this trait would make that a bit more formal. Maybe it even becomes possible to share all the extra code for impl Rng between these implementations.

@dhardy (Owner) commented Dec 18, 2017

Maybe the trait should be named BlockRng instead? But I think this means CSPRNGs would have to implement:

  • Rng
  • CryptoRng
  • BlockRng
  • SeedableRng
  • SeedFromRng

Is that a few too many traits? I'm wondering whether Rng should be implemented automatically for every BlockRng (perhaps via trait extension with default impls). Actually, if init is provided, then SeedableRng and SeedFromRng could be implemented automatically too? I'm not totally sure this will work, but it may simplify CSPRNG implementations.

@pitdicker (Author)

Maybe there are tricks with traits I don't know, but I couldn't see a way to do this automatically. Rng implementations would have a struct that includes the struct from RngCore, an output buffer, and a counter/index. But a macro could certainly do it.
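A macro along these lines could stamp out the wrapping boilerplate. This is purely illustrative (no such macro exists in rand); it just shows the struct shape described above: core plus output buffer plus index, with a next_u32 that refills on demand.

```rust
// Illustrative macro: generates a wrapper struct holding the core RNG,
// an output buffer, and an index, plus a next_u32 that refills on demand.
macro_rules! buffered_rng {
    ($name:ident, $core:ty, $len:expr) => {
        pub struct $name {
            pub core: $core,
            pub results: [u32; $len],
            pub index: usize,
        }

        impl $name {
            pub fn next_u32(&mut self) -> u32 {
                if self.index >= $len {
                    self.core.generate(&mut self.results);
                    self.index = 0;
                }
                let value = self.results[self.index];
                self.index += 1;
                value
            }
        }
    };
}

// Example "core" providing the generate method the macro expects.
pub struct DummyCore { state: u32 }
impl DummyCore {
    fn generate(&mut self, results: &mut [u32; 4]) {
        for r in results.iter_mut() {
            self.state = self.state.wrapping_add(1);
            *r = self.state;
        }
    }
}

buffered_rng!(DummyRng, DummyCore, 4);
```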

@pitdicker (Author)

It turns out AsRef<[T]> is not implemented for [u32; 256], so that doesn't work as a bound. @dhardy do you happen to have an idea on how to have a generic Results type, but still be able to do something useful like index into it or getting the length?

@pitdicker (Author)

Found a terrible workaround: use a newtype with AsRef and Deref implementations:

#[derive(Copy, Clone)]
pub struct IsaacArray([u32; RAND_SIZE]);
impl ::core::convert::AsRef<[u32]> for IsaacArray {
    fn as_ref(&self) -> &[u32] {
        &self.0[..]
    }
}
impl ::core::ops::Deref for IsaacArray {
    type Target = [u32; RAND_SIZE];
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}
impl ::core::ops::DerefMut for IsaacArray {
    fn deref_mut(&mut self) -> &mut [u32; RAND_SIZE] {
        &mut self.0
    }
}

@dhardy (Owner) commented Dec 21, 2017

Sorry, no, but I guess this is another thing that will be fixed by constant generics eventually.

@pitdicker (Author)

@dhardy I now have this mostly working, and am cleaning up the changes. A lot of code has to move around... Implementing RNGs with the BlockRng trait looks pretty clean now, here for example HC-128. The overhead of ReseedingBlockRng with this method is not really measurable, as hoped.

As trait I now have:

pub trait BlockRng<T>: Sized {
    type Results: AsRef<[T]> + Default;

    fn generate(&mut self, results: &mut Self::Results);
}

But there is one place I am stuck with the traits: https://github.com/pitdicker/rand/blob/blockrng_part2/rand_core/src/lib.rs#L168. I want to add a BlockRngWrapper to implement Rng for a BlockRng. And it needs a separate implementation for those that return [u32] and [u64].

The implementation for [u64] is commented out at the moment, otherwise I get the error:

error[E0119]: conflicting implementations of trait `Rng` for type `BlockRngWrapper<_, _>`

Do you know how to fix this?

@dhardy (Owner) commented Dec 27, 2017

Merry Christmas @pitdicker. I had a look, and I think the problem is that it would be possible for some R to implement both BlockRng<u32> and BlockRng<u64> since essentially they are separate traits. Perhaps it would be useful to prevent a type implementing both traits, but I'm not sure if that's possible.

It might be possible to use specialisation somehow by making BlockRng<u32> more general than BlockRng<u64> (i.e. the latter extends the former).

Alternatively you could just use two separate BlockRngWrapper traits.

BTW I think the names would be better like this:

  • BlockRng → BlockRngCore
  • ChaCha → ChaChaCore
  • BlockRngWrapper → BlockRng
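The "two separate wrappers" alternative can be sketched as follows. All names are illustrative, and the wrappers use inherent methods instead of the Rng trait to stay self-contained; the point is that the coherence conflict disappears because each impl is attached to its own concrete wrapper type.

```rust
// Two word-size-specific core traits; a type may implement either (or
// even both) without the wrappers' impls ever conflicting.
pub trait BlockRng32 {
    fn generate(&mut self, results: &mut [u32; 4]);
}
pub trait BlockRng64 {
    fn generate(&mut self, results: &mut [u64; 4]);
}

pub struct Wrapper32<R: BlockRng32> { pub core: R }
pub struct Wrapper64<R: BlockRng64> { pub core: R }

// Each impl targets a distinct wrapper type, so there is no E0119
// "conflicting implementations" error as with one generic blanket impl.
impl<R: BlockRng32> Wrapper32<R> {
    pub fn next_u32(&mut self) -> u32 {
        let mut buf = [0u32; 4];
        self.core.generate(&mut buf);
        buf[0]
    }
}
impl<R: BlockRng64> Wrapper64<R> {
    pub fn next_u64(&mut self) -> u64 {
        let mut buf = [0u64; 4];
        self.core.generate(&mut buf);
        buf[0]
    }
}

// Minimal demo core.
pub struct FixedCore;
impl BlockRng32 for FixedCore {
    fn generate(&mut self, results: &mut [u32; 4]) { results[0] = 42; }
}
```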

@pitdicker (Author)

Thank you! And I like the names you listed better.

essentially they are separate traits

Ah, that explains it. Back to the drawing board then. The trait system must be logical; I should really get a handle on it, but I don't know of a good resource...

My hope was to end up with something like:
ReseedingBlockRng > BlockRng > ReseedingCore > (ChaChaCore, Isaac64Core, etc.)
Two traits would help with implementing RNGs, but not with the reseeding mechanism?

I think these changes are only worth it if the end result is reasonably clean, and I am starting to give up.
On the other hand, the not-quite-perfect code for ReseedingBlockRng I have now is a win for anything except ISAAC-64.

@dhardy (Owner) commented Dec 28, 2017

@pitdicker the logic you want is essentially this:

pub trait A {}  // BlockRng<u32>
pub trait B {}  // BlockRng<u64>
pub trait T {}  // BlockRngWrapper

impl<X: A> T for X {}
impl<X: B> T for X {}

The compiler won't allow the second impl because if an X were to implement both, that type would have two impls for T. I don't think it's possible to tell the compiler no type can implement both A and B. Alternatively it would be nice to say impl<X: B> T for X where not X: A {}, but I don't think that's possible either.

There's a workaround: specialization allows multiple implementations, so long as one is more specific than the other. But that's not stable yet. Example.

BTW feel free to bring this up on https://internals.rust-lang.org/ but I doubt there will be any rapid progress on it. You may also like N. Matsakis's blog, but it's a bit off-topic here.

Edit: found something related

@dhardy (Owner) commented Dec 31, 2017

So @pitdicker should I merge this PR while the BlockRng thing is left on the side for now? If so you could open a tracking issue for that.

@pitdicker (Author)

Yes, I think merging this is a good idea.

I am not really sure how to proceed with the BlockRng idea. What do you think about implementing it for 32-bit RNGs for now, with the possibility open to extend it to 64-bit once specialisation is ready?

I somewhere messed up this branch but will fix it in a moment.

* Move the check whether it is time to reseed out of `try_reseed_if_necessary`,
  and make sure that function does not get inlined.
* Invert the counter direction. This way we can compare against 0 instead of
  `self.threshold`.
* Doing the reseed check after generating a value turns out to be a bit faster.
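The inverted-counter change can be illustrated with a toy countdown (illustrative names only): by counting down, the hot-path comparison is against the constant 0 rather than a value loaded from the struct.

```rust
// Toy countdown counter: counting down lets the hot path compare
// against the constant 0 instead of loading `self.threshold`.
pub struct Countdown {
    bytes_until_reseed: i64,
    threshold: i64,
}

impl Countdown {
    pub fn new(threshold: i64) -> Self {
        Countdown { bytes_until_reseed: threshold, threshold }
    }

    /// Subtracts `bytes` and reports whether it is time to reseed.
    pub fn consume(&mut self, bytes: i64) -> bool {
        self.bytes_until_reseed -= bytes;
        if self.bytes_until_reseed <= 0 {
            self.bytes_until_reseed = self.threshold;
            return true;
        }
        false
    }
}
```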
@dhardy (Owner) commented Dec 31, 2017

So for now 64-bit block RNGs would not use BlockRng? How then will ReseedingRng work on StdRng? I guess if we switch StdRng to HC-128 first that doesn't matter so much.

@pitdicker (Author)

That is the idea 😄. And otherwise it can still work, and be faster, with the ReseedingRng from this PR.

Travis seems very busy today...

@dhardy (Owner) commented Dec 31, 2017

Yes it is. I wonder if it's something to do with the new year?

@pitdicker (Author)

Oh wow, maybe... But it's ready after all.

@dhardy (Owner) left a comment

After another look, I think it would be worth changing the benchmarks. It will probably reduce the apparent impact of your improvement, but is more realistic and pertinent, especially when considering BlockRng later.

src/reseeding.rs Outdated
@@ -44,13 +43,14 @@ impl<R: Rng, Rsdr: Reseeder<R>> ReseedingRng<R, Rsdr> {
/// # Arguments
///
/// * `rng`: the random number generator to use.
/// * `generation_threshold`: the number of bytes of entropy at which to reseed the RNG.
/// * `threshold`: the amount of generated bytes after which to reseed the RNG.
@dhardy (Owner)

I believe "the number of [generated] bytes" is correct English; amount is typically used for "uncountable" things (e.g. water, money, food). But I'm not really fussed (I know I accepted something similar recently anyway).

@pitdicker (Author)

Please keep correcting my English. It is not my first language, and it is better if the language in the documentation is correct.


#[bench]
fn reseeding_xorshift_bytes(b: &mut Bencher) {
let mut rng = ReseedingRng::new(XorShiftRng::new().unwrap(),
@dhardy (Owner)

Wouldn't it make more sense to benchmark the PRNG we're interested in (ISAAC or HC128)? Especially given that your BlockRng idea integrates tighter with the PRNG algorithm.

@pitdicker (Author) commented Dec 31, 2017

With HC-128 the overhead of reseeding is much larger:

test gen_bytes_hc128       ... bench:     445,483 ns/iter (+/- 16,674) = 2298 MB/s
test gen_u32_hc128         ... bench:       2,809 ns/iter (+/- 194) = 1423 MB/s
test gen_u64_hc128         ... bench:       4,254 ns/iter (+/- 360) = 1880 MB/s
test init_hc128            ... bench:       4,539 ns/iter (+/- 412)
test reseeding_hc128_bytes ... bench:     451,584 ns/iter (+/- 25,463) = 2267 MB/s
test reseeding_hc128_u32   ... bench:       3,690 ns/iter (+/- 125) = 1084 MB/s
test reseeding_hc128_u64   ... bench:       5,907 ns/iter (+/- 157) = 1354 MB/s

And before this PR:

test reseeding_hc128_bytes ... bench:     449,635 ns/iter (+/- 5,755) = 2277 MB/s
test reseeding_hc128_u32   ... bench:       6,418 ns/iter (+/- 84) = 623 MB/s
test reseeding_hc128_u64   ... bench:       7,693 ns/iter (+/- 124) = 1039 MB/s

This makes sense, because the overhead of checking and indexing in the results array makes up 20~40%. With ReseedingRng that percentage gets doubled because it also does checks for reseeding.

You are right, this benchmark is testing something nonsensical. Reseeding Xorshift?! I don't think it matters much, because both show the overhead of ReseedingRng, but I'll change it.

@dhardy dhardy merged commit d5d9c75 into dhardy:master Jan 1, 2018
@pitdicker pitdicker deleted the reseeding_perf branch January 1, 2018 11:43