Avoid low bits #68

pitdicker · 2017-12-07T19:48:02Z

Many small RNG's are of lower quality in their least significant bits. For example LCG's with a power of two modulus, MCG's, Xorshift+ and Xoroshiro+.

If it costs us nothing to avoid the least significant bits it seems sensible to me to do so. Even if none of these RNG's becomes part of rand itself.

pitdicker · 2017-12-08T09:43:13Z

Please don't look at it yet, I did things stupidly...

dhardy · 2017-12-08T11:44:29Z

src/distributions/range.rs

+                            // least significant bits. Because many RNG's of
+                            // lower quality are weak in their least significant
+                            // bits, we use division.
+                            return self.low.wrapping_add((v / range) as $ty);


Maybe you meant let n = (zone+1) / range; return v / n (avoiding the overflow)?

That means an extra division somewhere in new_*. Since n = (unsigned_max - range + 1) / range + 1 you may be able to compute both remainder and quotient in a single op with something like div (I thought there was a Rust equivalent but can't find it). Even then there's an extra copy and addition IIUC so I expect performance will suffer.

So I'm not convinced this is worth it...

Yes, I think I was sleeping while doing the math 🙁.

To avoid two divisions I was thinking along these lines in new_*:

let range = (high as $u_large) .wrapping_sub(low as $u_large) .wrapping_add(1); let divisor = unsigned_max / range; let zone = range * divisor;

Two subtractions are traded for one multiply, and one modulus for a division. But have to think about it more careful yet.

Asleep, I see :D

You need (unsigned_max + 1) / range which of course overflows, so you need this:

let divisor = (unsigned_max - range + 1) / range + 1; let zone = range * divisor; // may wrap to zero

Simultaneous division & remainder would probably still be faster (but more CPU dependent maybe).

There is a Rust equivalent, on num::Integer and as an unstable intrinsic.

I have used them together before, LLVM is able to optimize the two operations into one when they are close together.

https://github.com/rust-num/num/blob/master/integer/src/lib.rs#L309

dhardy · 2017-12-08T11:45:08Z

src/distributions/mod.rs

@@ -176,7 +178,7 @@ fn ziggurat<R: Rng+?Sized, P, Z>(
                } else {
                    bits.closed_open01_fixed()
                };
-        let i = (bits & 0xff) as usize;
+        let i = ((bits >> 3) & 0xff) as usize;


Were there three unused bits? I thought you used them all?

There are 11 spare bits, and we only need 8

pitdicker · 2017-12-10T21:14:35Z

I have replaced the modulus in RangeInt with a widening multiply, see https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/. The post does not describe how to turn this into a uniform distribution though, but comparing the low word to zone does the trick.

Benchmarks (see the diff colums for the results with the time Xorshift takes removed):

x86:

	baseline	old	new	old (diff)	new (diff)	percent
i8	1466	5602	4224	4136	2758	66,7%
i16	1471	5834	4062	4363	2591	59,4%
i32	1471	6201	4709	4730	3238	68,5%
i64	4751	12112	10562	7361	5811	78,9%
i64 (i128_support)	4751	12112	8530	7361	3779	51,3%

x86_64:

	baseline	old	new	old (diff)	new (diff)	percent
i8	1380	4864	2683	3484	1303	37,4%
i16	1379	4862	2862	3483	1483	42,6%
i32	1381	5416	3390	4035	2009	49,8%
i64	2650	11225	5895	8575	3245	37,8%
i64 (i128_support)	2650	11225	3226	8575	576	6,7%

Like using a division, widening multiply depends more on the most significant bits than the least significant ones.

I have changed the tests for WeightedChoice because they assume a modulus is used for the range reduction.

dhardy · 2017-12-11T11:02:11Z

Great work, but I'm still wondering: do we want this? Is it a question between using an RNG like Xorshift with these changes or using another RNG which doesn't appear to have these weaknesses in the low bits? You should be able to answer this question better than anyone based on the research you've been doing lately, I guess.

pitdicker · 2017-12-11T11:55:28Z

Even when we pick an RNG that is good and does not need those changes, there would still be RNG's in other crates that would benefit from them.

The change to the ziggurat layer makes it little bit slower, 3~4%. I wonder if some indexing tricks can recover that...
For bools I have just added a benchmark, it improves from 1,405 ns/iter to 1,030 ns/iter, 27% faster.
And the range code is now about twice as fast.

So two of the three are definitely faster, and it helps with weaker RNG's. Seems win-win to me 😄

dhardy · 2017-12-11T12:05:00Z

The numbers you posted are ns/iter? Ah, ok. 😆 👍

dhardy · 2017-12-11T12:07:34Z

Can you also benchmark constructing and sampling from a range? Single-use ranges are quite common, e.g. the code in the seq module uses this heavily.

pitdicker · 2017-12-11T12:48:19Z

Results:

Before:

test gen_range_i16            ... bench:       8,537 ns/iter (+/- 583) = 234 MB/s
test gen_range_i32            ... bench:       9,105 ns/iter (+/- 54) = 439 MB/s
test gen_range_i64            ... bench:      17,442 ns/iter (+/- 410) = 458 MB/s
test gen_range_i8             ... bench:       8,521 ns/iter (+/- 146) = 117 MB/s

After:

test gen_range_i16            ... bench:       6,996 ns/iter (+/- 206) = 285 MB/s
test gen_range_i32            ... bench:       6,835 ns/iter (+/- 16) = 585 MB/s
test gen_range_i64            ... bench:      13,603 ns/iter (+/- 189) = 588 MB/s
test gen_range_i8             ... bench:       6,563 ns/iter (+/- 9) = 152 MB/s

So there is some improvement, but they are both terrible... Maybe it is worth thinking about a single-use range?

dhardy · 2017-12-11T12:58:39Z

gen_range is a single-use range so it doesn't need API adjustment. But is any optimisation possible anyway?

pitdicker · 2017-12-11T13:15:02Z

I think it is possible to get within 70% of the normal version.

If we don't search for the optimal zone with a modulus but pick one with bitshifts that should be a win. On average 75% of the RNG's results should still be acceptable. And with everything in one function it can maybe optimise a bit better.

dhardy · 2017-12-11T15:42:05Z

Okay, shall I go ahead and merge?

It looks like the CI failure is something unrelated I messed up (seeding-minimal branch I think).

pitdicker · 2017-12-11T17:09:24Z

I can't figure out how to shuffle the traits to get something like sample_single in RangeImpl. Can you help me with it? https://gist.github.com/pitdicker/ab086693964a71c031d63b34ccfb8dba

Feel free to merge this though.

dhardy · 2017-12-11T17:43:22Z

I think you want X::T:

    pub fn sample_single<X: SampleRange, R: Rng+?Sized>(low: X, high: X, rng: &mut R) -> X {
        X::T::sample_single(low, high, rng)
    }

pitdicker · 2017-12-11T17:44:40Z

Thank you, that could have taken me hours!

dhardy · 2017-12-11T17:48:11Z

No problem! It was a bit of a weird error and something you wonder why Rustc didn't figure out.

dhardy reviewed Dec 8, 2017

View reviewed changes

pitdicker added 3 commits December 9, 2017 06:38

Use a sign test for bools

43267f7

Ignore the 3 least significant bits for the ziggurat layer

c8c720e

Use widening multiply instead of modulus in RangeInt

3f22ddd

pitdicker force-pushed the avoid_low_bits branch from a8641a2 to 3f22ddd Compare December 10, 2017 21:00

Add a few extra distr benchmarks

68edc9f

pitdicker force-pushed the avoid_low_bits branch from 625f86a to 68edc9f Compare December 11, 2017 11:58

Benchmark constructing and sampling from a range

173c2b4

dhardy merged commit 51cc0ed into dhardy:master Dec 11, 2017

pitdicker deleted the avoid_low_bits branch December 11, 2017 17:15

This was referenced Dec 11, 2017

Add sample_single to Range #69

Merged

Speed up range sampling. rust-random/rand#115

Closed

Change the Rand trait for bool rust-random/rand#113

Closed

Rename AsciiWordChar and optimize Codepoint #75

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid low bits #68

Avoid low bits #68

pitdicker commented Dec 7, 2017

pitdicker commented Dec 8, 2017 •

edited

Loading

dhardy Dec 8, 2017 •

edited

Loading

pitdicker Dec 8, 2017

dhardy Dec 8, 2017 •

edited

Loading

dhardy Dec 8, 2017

pitdicker Dec 8, 2017

pitdicker Dec 8, 2017

dhardy Dec 8, 2017 •

edited

Loading

pitdicker Dec 8, 2017

pitdicker commented Dec 10, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

Avoid low bits #68

Avoid low bits #68

Conversation

pitdicker commented Dec 7, 2017

pitdicker commented Dec 8, 2017 • edited Loading

dhardy Dec 8, 2017 • edited Loading

Choose a reason for hiding this comment

pitdicker Dec 8, 2017

Choose a reason for hiding this comment

dhardy Dec 8, 2017 • edited Loading

Choose a reason for hiding this comment

dhardy Dec 8, 2017

Choose a reason for hiding this comment

pitdicker Dec 8, 2017

Choose a reason for hiding this comment

pitdicker Dec 8, 2017

Choose a reason for hiding this comment

dhardy Dec 8, 2017 • edited Loading

Choose a reason for hiding this comment

pitdicker Dec 8, 2017

Choose a reason for hiding this comment

pitdicker commented Dec 10, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 11, 2017

dhardy commented Dec 11, 2017

pitdicker commented Dec 8, 2017 •

edited

Loading

dhardy Dec 8, 2017 •

edited

Loading

dhardy Dec 8, 2017 •

edited

Loading

dhardy Dec 8, 2017 •

edited

Loading