
Add RngCore::bytes_per_round #396

Closed · wants to merge 4 commits

Conversation

@pitdicker (Contributor) commented Apr 12, 2018

From the doc comment:

Number of bytes generated per round of this RNG.

Some algorithms would benefit from knowing some basic properties of the RNG. In terms of performance, an algorithm may want to know whether an RNG is best at generating u32s, or whether it can provide u64s or more at little to no extra cost.

For many RNGs a simple definition is: the smallest number of bytes this RNG can generate without throwing away part of the generated value.

bytes_per_round has a default implementation that returns 4 (bytes).

Over the last couple of months I have thought quite a few times: it would be great to know whether this is a 32-bit or 64-bit RNG; then I could implement this algorithm more efficiently. Now, while playing with SIMD, this became even more apparent.

I added one example in this PR: generating a u128. For most RNGs the current method of combining two u64s is optimal. But for OsRng and SIMD RNGs it would be twice as fast to use fill_bytes. Now it can make the choice.

The same is true for HighPrecision01. The implementation for f32 currently always uses next_u32. With a 64-bit RNG it always throws away half the generated bits, and when it finds out it needs more, it generates another 32. It could easily be made more efficient, if only we knew whether the RNG is best at generating 32 or 64 bits at a time.

And a bit more controversial: gen_bool currently uses 32 bits to make its decision. Fast and usually good enough. Using 64 bits could halve the performance for many RNGs. But if the RNG produces 64 bits at a time (and throws away half of them), it could just as well use them all to increase precision at no extra cost.
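A minimal sketch of the u128 case, assuming the proposed bytes_per_round method; the function name gen_u128_example and the 16-byte threshold are illustrative only, not code from this PR:

use rand_core::RngCore;

// Sketch only: pick between combining two u64 rounds and filling a byte
// buffer, depending on how many bytes the RNG produces per round.
fn gen_u128_example<R: RngCore>(rng: &mut R) -> u128 {
    if rng.bytes_per_round() >= 16 {
        // e.g. OsRng or a SIMD RNG: one fill_bytes call wastes no generated bits.
        let mut buf = [0u8; 16];
        rng.fill_bytes(&mut buf);
        u128::from_le_bytes(buf)
    } else {
        // The current approach: combine two u64s.
        ((rng.next_u64() as u128) << 64) | (rng.next_u64() as u128)
    }
}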

/// Number of bytes generated per round of this RNG.
///
/// For many RNGs a simple definition is: the smallest number of bytes this
/// RNG can generate without throwing away part of the generated value.
///
/// `bytes_per_round` has a default implementation that returns `4` (bytes).
fn bytes_per_round(&self) -> usize { 4 }
Collaborator:

Could this be an associated constant?

Contributor Author:

I tried something like that in a not-really-thought-through attempt: #377 (comment). The problem is that we then can't make RngCore into a trait object.

@dhardy (Member), Apr 13, 2018:

Weird; doesn't sound like associated constants should prevent a trait from becoming object-safe.

Contributor Author:

☹️

error[E0038]: the trait `rand_core::RngCore` cannot be made into an object
    --> src\lib.rs:1189:21
     |
1189 |         let mut r = Box::new(rng) as Box<RngCore>;
     |                     ^^^^^^^^^^^^^ the trait `rand_core::RngCore` cannot be made into an object
     |
     = note: the trait cannot contain associated consts like `BYTES_PER_ROUND`
     = note: required because of the requirements on the impl of `std::ops::CoerceUnsized<std::boxed::Box<rand_core::RngCore>>` for `std::boxed::Box<test::TestRng<StdRng>>`
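For context, the associated-constant alternative would look roughly like the sketch below (simplified trait, not the real rand_core definition); as the error above shows, at the time any associated const made the trait unusable as a trait object:

// Sketch of the rejected alternative: an associated constant instead of a method.
pub trait RngCore {
    const BYTES_PER_ROUND: usize; // this alone prevents `Box<RngCore>` trait objects
    fn next_u32(&mut self) -> u32;
    fn next_u64(&mut self) -> u64;
    fn fill_bytes(&mut self, dest: &mut [u8]);
}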

@vks (Collaborator) commented Apr 12, 2018

This makes me wonder whether we should make next_* optional and add a default implementation using fill_bytes.

@pitdicker (Contributor Author) commented Apr 12, 2018

> This makes me wonder whether we should make next_* optional and add a default implementation using fill_bytes.

I was curious about this too for the last couple of days, and tried that before adding bytes_per_round. But at least for sizes <= 128 bits there is no way for fill_bytes to perform as well as the direct methods. I tried about 6 variants to optimize fill_bytes, even hard-coding parts just for testing, but 60% of the performance of the direct methods seemed to be the maximum. (I'll have to make a PR with the best method I found...). The number of CPU instructions is exactly the same, and most of the instructions are similar. Probably things are just harder to get into the right registers, or into an execution order that can be parallelized, or something.

Edit: on reading your comment again, you meant something else. But the idea is the same: what would happen if fill_bytes were the primary interface to an RNG. And the answer is the same: worse performance for normal integer sizes.
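For reference, a default implementation along the lines vks suggests could look like this sketch (simplified trait, not something proposed in this PR):

// Sketch: next_u32/next_u64 with default implementations in terms of
// fill_bytes, leaving fill_bytes as the only required method.
pub trait RngCore {
    fn fill_bytes(&mut self, dest: &mut [u8]);

    fn next_u32(&mut self) -> u32 {
        let mut buf = [0u8; 4];
        self.fill_bytes(&mut buf);
        u32::from_le_bytes(buf)
    }

    fn next_u64(&mut self) -> u64 {
        let mut buf = [0u8; 8];
        self.fill_bytes(&mut buf);
        u64::from_le_bytes(buf)
    }
}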

@vks (Collaborator) commented Apr 12, 2018 via email

@pitdicker (Contributor Author) commented Apr 12, 2018

> I like this idea a lot!

Ah, great 😄.

> I think bytes_per_round can be generalized to represent the optimal number of bytes per round. For some RNGs this will be usize::MAX. This should probably be the default value.

Then we have different ideas in mind of how the value should work. Can you explain it a bit more?

In the meantime, here is my thinking:
I did not go with the size that produces the best MB/s in the benchmarks, but with what often gives the best ns/iter.

For example: Isaac64Rng::next_u64() has better MB/s than Isaac64Rng::next_u32(). Still I did not change the default of 4 bytes (i.e. next_u32), because next_u32 needs fewer ns/iter (in theory; it seems like it regressed a bit somewhere).
Isaac64Rng::next_u32 would be better in algorithms like the ones mentioned in the first post, such as HighPrecision01 and gen_bool, because the entire algorithm runs faster with the method taking fewer ns/iter. Another algorithm where I could use it this way is Rng::shuffle.

With Xoroshiro128+, 8 would be the best value, because there is no real way to produce 32-bit values from it faster other than by throwing away half of each result. Other algorithms could then say: I am going to need multiple 32-bit values, so let's get them two at a time, so that half of the bits don't have to be thrown away. Although such a strategy works reasonably well for 32-bit RNGs too.

But it is difficult to describe the performance trade-offs with a single value...
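To illustrate the "get them two at a time" idea, a sketch assuming the proposed bytes_per_round method; next_two_u32s is an invented name, not code from this PR:

use rand_core::RngCore;

// Sketch: when a second u32 may be needed, a 64-bit RNG can hand out both
// halves of a single next_u64 call instead of discarding half of two results.
fn next_two_u32s<R: RngCore>(rng: &mut R) -> (u32, u32) {
    if rng.bytes_per_round() >= 8 {
        let v = rng.next_u64();
        (v as u32, (v >> 32) as u32)
    } else {
        (rng.next_u32(), rng.next_u32())
    }
}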

@dhardy (Member) left a review comment:

Interesting idea... but using this to optimise large generators, versus the next_u32 / next_u64 we use for smaller ones, feels like a strange mixture of methods. Still, it may be the best compromise.

Do you think there will be many other uses of bytes_per_round?

What if bytes_per_round cannot be evaluated at compile time (e.g. if R is unsized and therefore cannot be inlined)?

/// RNG can generate without throwing away part of the generated value.
///
/// `bytes_per_round` has a default implementation that returns `4` (bytes).
fn bytes_per_round(&self) -> usize { 4 }
Member:

Should we really give a default implementation here?

Did you forget to implement for Jitter?

Contributor Author:

I was not sure. A disadvantage of a default implementation is that you can easily forget to implement it for a wrapper.

> Did you forget to implement for Jitter?

JitterRng::next_u32() is about twice as fast as next_u64(), so 4 bytes would be the best fit there.


// Implement `RngCore` for references to an `RngCore`.
// Force inlining all functions, so that it is up to the `RngCore`
// implementation and the optimizer to decide on inlining.
Member:

Actually if R is unsized these cannot be inlined

Contributor Author:

No, true.

But before, fill_bytes was basically never inlined, because we always use the RNG through this implementation, i.e. through a reference (and for some reason LLVM really does not like our abstractions...). So now we can at least control things when it is not a trait object.
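For reference, the forwarding impl under discussion looks roughly like this simplified sketch (the real impl also forwards try_fill_bytes):

// Implement `RngCore` for `&mut R`, forcing inlining of the forwarding methods
// so the decision is left to the wrapped implementation and the optimizer.
impl<'a, R: RngCore + ?Sized> RngCore for &'a mut R {
    #[inline(always)]
    fn next_u32(&mut self) -> u32 { (**self).next_u32() }

    #[inline(always)]
    fn next_u64(&mut self) -> u64 { (**self).next_u64() }

    #[inline(always)]
    fn fill_bytes(&mut self, dest: &mut [u8]) { (**self).fill_bytes(dest) }
}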

let b_ptr = &mut *(ptr as *mut u128 as *mut [u8; 16]);
rng.fill_bytes(b_ptr);
}
val.to_le()
Member:

It would be nice to see benchmarks for this on a BE platform

Contributor Author:

I sometimes just change things to to_be() and measure on x86_64. But it really only starts to show improvements for things like my SIMD experiment (twice as fast there).

@dhardy (Member) commented Apr 13, 2018

In the case of HighPrecision01 for f32 the extra bits are needed so rarely that there won't be a measurable improvement to benchmarks.

Overall I wonder if you're over-estimating the utility of the extra method?

Especially because it cannot always be evaluated at compile time I'm less convinced.

@nagisa (Contributor) commented Apr 13, 2018 via email

@pitdicker (Contributor Author):

> Overall I wonder if you're over-estimating the utility of the extra method?

Maybe, hard to say without using it in various situations. But I think it can be pretty useful.

> Especially because it cannot always be evaluated at compile time I'm less convinced.

I thought about this too, especially in combination with unsized RNGs. But even if it can't be done at compile time, the branch predictor should be able to figure things out after a few rounds I think?

@dhardy (Member) commented Apr 13, 2018

> I thought about this too, especially in combination with unsized RNGs. But even if it can't be done at compile time, the branch predictor should be able to figure things out after a few rounds I think?

I suppose so, but there's still an extra operation involved. It should be easy enough to add a few benchmarks using &mut RngCore or Box<RngCore> generators.

@pitdicker (Contributor Author):

Benchmarked with HC-128 as a trait object. Without the change it performs at 78% of the direct (non-trait-object) speed, and with the change at 75%.

test gen_u128_hc128           ... bench:       7,281 ns/iter (+/- 248) = 2197 MB/s
test gen_u128_hc128_trait_obj ... bench:       9,310 ns/iter (+/- 559) = 1718 MB/s (without `bytes_per_round`)
test gen_u128_hc128_trait_obj ... bench:       9,691 ns/iter (+/- 458) = 1651 MB/s (with `bytes_per_round`)

@dhardy (Member) commented Apr 15, 2018

That's better than I expected. And what about a small generator like Xorshift?

@pitdicker (Contributor Author):

Also better than I expected. Xorshift a little less so:

test gen_u128_xorshift           ... bench:       5,354 ns/iter (+/- 14) = 2988 MB/s
test gen_u128_xorshift_trait_obj ... bench:       6,354 ns/iter (+/- 30) = 2518 MB/s (without `bytes_per_round`)
test gen_u128_xorshift_trait_obj ... bench:       7,285 ns/iter (+/- 13) = 2196 MB/s (with `bytes_per_round`)

After a few more days you get a bit more objective 😄.
I think for most situations there are good, even better, solutions than this bytes_per_round. I see only 3 situations where it could be useful and where there is no alternative:

  • When you need one value, a u32 for example, and may need another but don't know that yet. If this is a 64-bit RNG it makes sense to get two u32s at once, instead of possibly generating a u64 twice and throwing away half of it each time. But if this is a 32-bit RNG, getting multiple u32s at once while they may not be needed is a waste. I think this is not uncommon in rejection sampling.
  • As already mentioned, when you could make use of some extra precision, but don't want the slowdown this would cost on a 32-bit RNG.
  • It can open ways to improve performance in combination with SIMD RNGs.

So we could do fine without this method, but I believe it can enable optimizations that would otherwise not be possible.

@dhardy (Member) commented Apr 15, 2018

A bit more objective maybe — but I haven't had much time the last few days (travelling)!

  • For rejection sampling, fair enough (though sometimes the extra samples are needed too rarely to make much difference)
  • For extra precision — this comes across as quite strange to me; ideally you'd choose how much precision you need to start with rather than get weird extra effects of choosing some RNGs
  • Okay, but this approach does seem a little hacky. How does the performance of gen_u128 compare using Xorshift with the fill_bytes method for example? If it's much worse we could also add next_u128 again for this case — though then there's also the question whether fill_bytes works well for 256-bits etc.

So I'm still unconvinced on this one.

@pitdicker (Contributor Author):

I would like to see the first commit here in rand_core 0.1, but don't mind if the rest follows later, or maybe never.

@dhardy (Member) commented Apr 16, 2018

The first commit looks fine if you want to merge that separately.

@dhardy added the X-enhancement, E-question (Participation: opinions wanted) and F-new-int (Functionality: new, within Rand) labels on Apr 16, 2018
@pitdicker (Contributor Author):

I'm going to close this PR, at least for now.

@pitdicker closed this on Apr 27, 2018
@dhardy (Member) commented Apr 27, 2018

If you like. I don't mind it remaining as a discussion item for now, though it is tidier to close.
