
Add RngCore::bytes_per_round #396

Closed · wants to merge 4 commits

Conversation

@pitdicker (Contributor) commented Apr 12, 2018

From the doc comment:

Number of bytes generated per round of this RNG.

Some algorithms would benefit from knowing some basic properties of the RNG. In terms of performance, an algorithm may want to know whether an RNG is best at generating u32s, or whether it can provide u64s or more at little to no extra cost.

For many RNGs a simple definition is: the smallest number of bytes this RNG can generate without throwing away part of the generated value.

bytes_per_round has a default implementation that returns 4 (bytes).

Over the last couple of months I have thought quite a few times: it would be great to know whether this is a 32-bit or 64-bit RNG; then I could implement this algorithm more efficiently. Now, while playing with SIMD, this became even more apparent.

I added one example in this PR: generating a u128. For most RNGs the current method of combining two u64s is optimal. But for OsRng and SIMD RNGs it would be twice as fast to use fill_bytes. Now it can make the choice.

The same is true for HighPrecision01. The implementation for f32 currently always uses next_u32. With a 64-bit RNG it always throws away half the generated bits, and when it finds out it needs more, it generates another 32. It could easily be made more efficient, if only we knew whether the RNG is best at generating 32 or 64 bits at a time.

And a bit more controversial: gen_bool currently uses 32 bits to make its decision. Fast and usually good enough. Using 64 bits could halve the performance for many RNGs. But if the RNG produces 64 bits at a time (and throws away half of them), it could just as well use them all to increase precision at no extra cost.
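A minimal sketch of the u128 case, assuming the proposed bytes_per_round method; the function name gen_u128_example and the 16-byte threshold are illustrative only, not code from this PR:

use rand_core::RngCore;

// Sketch only: pick between combining two u64 rounds and filling a byte
// buffer, depending on how many bytes the RNG produces per round.
fn gen_u128_example<R: RngCore>(rng: &mut R) -> u128 {
    if rng.bytes_per_round() >= 16 {
        // e.g. OsRng or a SIMD RNG: one fill_bytes call wastes no generated bits.
        let mut buf = [0u8; 16];
        rng.fill_bytes(&mut buf);
        u128::from_le_bytes(buf)
    } else {
        // The current approach: combine two u64s.
        ((rng.next_u64() as u128) << 64) | (rng.next_u64() as u128)
    }
}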

/// Number of bytes generated per round of this RNG.
///
/// For many RNGs a simple definition is: the smallest number of bytes this
/// RNG can generate without throwing away part of the generated value.
///
/// `bytes_per_round` has a default implementation that returns `4` (bytes).
fn bytes_per_round(&self) -> usize { 4 }
Collaborator:

Could this be an associated constant?

Contributor Author:

I tried something like that in a not-really-thought-through attempt: #377 (comment). The problem is that we then can't make RngCore into a trait object.

@dhardy (Member), Apr 13, 2018:

Weird; doesn't sound like associated constants should prevent a trait from becoming object-safe.

Contributor Author:

☹️

error[E0038]: the trait `rand_core::RngCore` cannot be made into an object
    --> src\lib.rs:1189:21
     |
1189 |         let mut r = Box::new(rng) as Box<RngCore>;
     |                     ^^^^^^^^^^^^^ the trait `rand_core::RngCore` cannot be made into an object
     |
     = note: the trait cannot contain associated consts like `BYTES_PER_ROUND`
     = note: required because of the requirements on the impl of `std::ops::CoerceUnsized<std::boxed::Box<rand_core::RngCore>>` for `std::boxed::Box<test::TestRng<StdRng>>`
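For context, the associated-constant alternative would look roughly like the sketch below (simplified trait, not the real rand_core definition); as the error above shows, at the time any associated const made the trait unusable as a trait object:

// Sketch of the rejected alternative: an associated constant instead of a method.
pub trait RngCore {
    const BYTES_PER_ROUND: usize; // this alone prevents `Box<RngCore>` trait objects
    fn next_u32(&mut self) -> u32;
    fn next_u64(&mut self) -> u64;
    fn fill_bytes(&mut self, dest: &mut [u8]);
}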

@vks (Collaborator) commented Apr 12, 2018

This makes me wonder whether we should make next_* optional and add a default implementation using fill_bytes.

@pitdicker (Contributor Author) commented Apr 12, 2018

> This makes me wonder whether we should make next_* optional and add a default implementation using fill_bytes.

I was curious about this too for the last couple of days, and tried that before adding bytes_per_round. But at least for sizes <= 128 bits there is no way for fill_bytes to perform as well as the direct methods. I tried about 6 variants to optimize fill_bytes, even hard-coding parts just for testing, but 60% of the performance of the direct methods seemed to be the maximum. (I'll have to make a PR with the best method I found...). The number of CPU instructions is exactly the same, and most of the instructions are similar. Probably things are just harder to get into the right registers, or into an execution order that can be parallelized, or something.

Edit: on reading your comment again, you meant something else. But the idea is the same: what would happen if fill_bytes were the primary interface to an RNG. And the answer is the same: worse performance for normal integer sizes.
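For reference, a default implementation along the lines vks suggests could look like this sketch (simplified trait, not something proposed in this PR):

// Sketch: next_u32/next_u64 with default implementations in terms of
// fill_bytes, leaving fill_bytes as the only required method.
pub trait RngCore {
    fn fill_bytes(&mut self, dest: &mut [u8]);

    fn next_u32(&mut self) -> u32 {
        let mut buf = [0u8; 4];
        self.fill_bytes(&mut buf);
        u32::from_le_bytes(buf)
    }

    fn next_u64(&mut self) -> u64 {
        let mut buf = [0u8; 8];
        self.fill_bytes(&mut buf);
        u64::from_le_bytes(buf)
    }
}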

@vks (Collaborator) commented Apr 12, 2018 via email

@pitdicker (Contributor Author) commented Apr 12, 2018

> I like this idea a lot!

Ah, great 😄.

> I think bytes_per_round can be generalized to represent the optimal number of bytes per round. For some RNGs this will be usize::MAX. This should probably be the default value.

Then we have different ideas in mind of how the value should work. Can you explain it a bit more?

In the meantime, here is my thinking:
I did not go with the size that produces the best MB/s in the benchmarks, but with what often gives the best ns/iter.

For example: Isaac64Rng::next_u64() has better MB/s than Isaac64Rng::next_u32(). Still I did not change the default of 4 bytes (i.e. next_u32), because next_u32 needs fewer ns/iter (in theory; it seems like it regressed a bit somewhere).
Isaac64Rng::next_u32 would be better in algorithms like the ones mentioned in the first post, such as HighPrecision01 and gen_bool, because the entire algorithm runs faster with the method taking fewer ns/iter. Another algorithm where I could use it this way is Rng::shuffle.

With Xoroshiro128+, 8 would be the best value, because there is no real way to produce 32-bit values from it faster other than by throwing away half of each result. Other algorithms could then say: I am going to need multiple 32-bit values, so let's get them two at a time, so that half of the bits don't have to be thrown away. Although such a strategy works reasonably well for 32-bit RNGs too.

But it is difficult to describe the performance trade-offs with a single value...
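To illustrate the "get them two at a time" idea, a sketch assuming the proposed bytes_per_round method; next_two_u32s is an invented name, not code from this PR:

use rand_core::RngCore;

// Sketch: when a second u32 may be needed, a 64-bit RNG can hand out both
// halves of a single next_u64 call instead of discarding half of two results.
fn next_two_u32s<R: RngCore>(rng: &mut R) -> (u32, u32) {
    if rng.bytes_per_round() >= 8 {
        let v = rng.next_u64();
        (v as u32, (v >> 32) as u32)
    } else {
        (rng.next_u32(), rng.next_u32())
    }
}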

@dhardy (Member) left a review comment:

Interesting idea... but using this to optimise large generators, versus the next_u32 / next_u64 we use for smaller ones, feels like a strange mixture of methods. Still, it may be the best compromise.

Do you think there will be many other uses of bytes_per_round?

What if bytes_per_round cannot be evaluated at compile time (e.g. if R is unsized and therefore cannot be inlined)?

/// RNG can generate without throwing away part of the generated value.
///
/// `bytes_per_round` has a default implementation that returns `4` (bytes).
fn bytes_per_round(&self) -> usize { 4 }
Member:

Should we really give a default implementation here?

Did you forget to implement for Jitter?

Contributor Author:

I was not sure. A disadvantage of a default implementation is that you can easily forget to implement it for a wrapper.

> Did you forget to implement for Jitter?

JitterRng::next_u32() is about twice as fast as next_u64(), so 4 bytes would be the best fit there.


// Implement `RngCore` for references to an `RngCore`.
// Force inlining all functions, so that it is up to the `RngCore`
// implementation and the optimizer to decide on inlining.
Member:

Actually if R is unsized these cannot be inlined

Contributor Author:

No, true.

But before, fill_bytes was basically never inlined, because we always use the RNG through this implementation, i.e. through a reference (and for some reason LLVM really does not like our abstractions...). So now we can at least control things when it is not a trait object.
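For reference, the forwarding impl under discussion looks roughly like this simplified sketch (the real impl also forwards try_fill_bytes):

// Implement `RngCore` for `&mut R`, forcing inlining of the forwarding methods
// so the decision is left to the wrapped implementation and the optimizer.
impl<'a, R: RngCore + ?Sized> RngCore for &'a mut R {
    #[inline(always)]
    fn next_u32(&mut self) -> u32 { (**self).next_u32() }

    #[inline(always)]
    fn next_u64(&mut self) -> u64 { (**self).next_u64() }

    #[inline(always)]
    fn fill_bytes(&mut self, dest: &mut [u8]) { (**self).fill_bytes(dest) }
}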

let b_ptr = &mut *(ptr as *mut u128 as *mut [u8; 16]);
rng.fill_bytes(b_ptr);
}
val.to_le()
Member:

It would be nice to see benchmarks for this on a BE platform

Contributor Author:

I sometimes just change things to to_be() and measure on x86_64. But it really only starts to show improvements for things like my SIMD experiment (twice as fast there).

@dhardy (Member) commented Apr 13, 2018

In the case of HighPrecision01 for f32 the extra bits are needed so rarely that there won't be a measurable improvement to benchmarks.

Overall I wonder if you're over-estimating the utility of the extra method?

Especially because it cannot always be evaluated at compile time I'm less convinced.

@nagisa (Contributor) commented Apr 13, 2018 via email

@pitdicker (Contributor Author):

> Overall I wonder if you're over-estimating the utility of the extra method?

Maybe, hard to say without using it in various situations. But I think it can be pretty useful.

> Especially because it cannot always be evaluated at compile time I'm less convinced.

I thought about this too, especially in combination with unsized RNGs. But even if it can't be done at compile time, the branch predictor should be able to figure things out after a few rounds I think?

@dhardy (Member) commented Apr 13, 2018

> I thought about this too, especially in combination with unsized RNGs. But even if it can't be done at compile time, the branch predictor should be able to figure things out after a few rounds I think?

I suppose so, but there's still an extra operation involved. It should be easy enough to add a few benchmarks using &mut RngCore or Box<RngCore> generators.

@pitdicker (Contributor Author):

Benchmarked with HC-128 as a trait object. Without the change it performs at 78% of the direct (non-trait-object) speed, and with the change at 75%.

test gen_u128_hc128           ... bench:       7,281 ns/iter (+/- 248) = 2197 MB/s
test gen_u128_hc128_trait_obj ... bench:       9,310 ns/iter (+/- 559) = 1718 MB/s (without `bytes_per_round`)
test gen_u128_hc128_trait_obj ... bench:       9,691 ns/iter (+/- 458) = 1651 MB/s (with `bytes_per_round`)

@dhardy (Member) commented Apr 15, 2018

That's better than I expected. And what about a small generator like Xorshift?

@pitdicker (Contributor Author):

Also better than I expected. Xorshift a little less so:

test gen_u128_xorshift           ... bench:       5,354 ns/iter (+/- 14) = 2988 MB/s
test gen_u128_xorshift_trait_obj ... bench:       6,354 ns/iter (+/- 30) = 2518 MB/s (without `bytes_per_round`)
test gen_u128_xorshift_trait_obj ... bench:       7,285 ns/iter (+/- 13) = 2196 MB/s (with `bytes_per_round`)

After a few more days you get a bit more objective 😄.
I think for most situations there are good, even better, solutions than this bytes_per_round. I see only 3 situations where it could be useful and where there is no alternative:

  • When you need one value, a u32 for example, and may need another but don't know that yet. If this is a 64-bit RNG it makes sense to get two u32s at once, instead of possibly generating a u64 twice and throwing away half of it each time. But if this is a 32-bit RNG, getting multiple u32s at once while they may not be needed is a waste. I think this is not uncommon in rejection sampling.
  • As already mentioned, when you could make use of some extra precision, but don't want the slowdown this would cost on a 32-bit RNG.
  • It can open ways to improve performance in combination with SIMD RNGs.

So we could do fine without this method, but I believe it can enable optimizations that would otherwise not be possible.

@dhardy (Member) commented Apr 15, 2018

A bit more objective maybe — but I haven't had much time the last few days (travelling)!

  • For rejection sampling, fair enough (though sometimes the extra samples are needed too rarely to make much difference)
  • For extra precision — this comes across as quite strange to me; ideally you'd choose how much precision you need to start with rather than get weird extra effects of choosing some RNGs
  • Okay, but this approach does seem a little hacky. How does the performance of gen_u128 compare using Xorshift with the fill_bytes method for example? If it's much worse we could also add next_u128 again for this case — though then there's also the question whether fill_bytes works well for 256-bits etc.

So I'm still unconvinced on this one.

@pitdicker (Contributor Author):

I would like to see the first commit here in rand_core 0.1, but don't mind if the rest follows later, or maybe never.

@dhardy (Member) commented Apr 16, 2018

The first commit looks fine if you want to merge that separately.

@dhardy added the X-enhancement, E-question (Participation: opinions wanted) and F-new-int (Functionality: new, within Rand) labels on Apr 16, 2018
@pitdicker (Contributor Author):

I'm going to close this PR, at least for now.

@pitdicker closed this on Apr 27, 2018
@dhardy (Member) commented Apr 27, 2018

If you like. I don't mind it remaining as a discussion item for now, though it is tidier to close.
