Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid low bits #68

Merged
merged 5 commits into from
Dec 11, 2017
Merged

Avoid low bits #68

merged 5 commits into from
Dec 11, 2017

Conversation

pitdicker
Copy link

This implements #52 (comment).

Many small RNG's are of lower quality in their least significant bits. For example LCG's with a power of two modulus, MCG's, Xorshift+ and Xoroshiro+.

If it costs us nothing to avoid the least significant bits it seems sensible to me to do so. Even if none of these RNG's becomes part of rand itself.

@pitdicker
Copy link
Author

pitdicker commented Dec 8, 2017

Please don't look at it yet, I did things stupidly...

// least significant bits. Because many RNG's of
// lower quality are weak in their least significant
// bits, we use division.
return self.low.wrapping_add((v / range) as $ty);
Copy link
Owner

@dhardy dhardy Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant let n = (zone+1) / range; return v / n (avoiding the overflow)?

That means an extra division somewhere in new_*. Since n = (unsigned_max - range + 1) / range + 1 you may be able to compute both remainder and quotient in a single op with something like div (I thought there was a Rust equivalent but can't find it). Even then there's an extra copy and addition IIUC so I expect performance will suffer.

So I'm not convinced this is worth it...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I was sleeping while doing the math 🙁.

To avoid two divisions I was thinking along these lines in new_*:

let range = (high as $u_large)
            .wrapping_sub(low as $u_large)
            .wrapping_add(1);
let divisor = unsigned_max / range;
let zone = range * divisor;

Two subtractions are traded for one multiply, and one modulus for a division. But have to think about it more careful yet.

Copy link
Owner

@dhardy dhardy Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asleep, I see :D

You need (unsigned_max + 1) / range which of course overflows, so you need this:

let divisor = (unsigned_max - range + 1) / range + 1;
let zone = range * divisor;  // may wrap to zero

Simultaneous division & remainder would probably still be faster (but more CPU dependent maybe).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a Rust equivalent, on num::Integer and as an unstable intrinsic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used them together before, LLVM is able to optimize the two operations into one when they are close together.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -176,7 +178,7 @@ fn ziggurat<R: Rng+?Sized, P, Z>(
} else {
bits.closed_open01_fixed()
};
let i = (bits & 0xff) as usize;
let i = ((bits >> 3) & 0xff) as usize;
Copy link
Owner

@dhardy dhardy Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were there three unused bits? I thought you used them all?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 11 spare bits, and we only need 8

@pitdicker
Copy link
Author

I have replaced the modulus in RangeInt with a widening multiply, see https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/. The post does not describe how to turn this into a uniform distribution though, but comparing the low word to zone does the trick.

Benchmarks (see the diff colums for the results with the time Xorshift takes removed):

x86:

baseline old new old (diff) new (diff) percent
i8 1466 5602 4224 4136 2758 66,7%
i16 1471 5834 4062 4363 2591 59,4%
i32 1471 6201 4709 4730 3238 68,5%
i64 4751 12112 10562 7361 5811 78,9%
i64 (i128_support) 4751 12112 8530 7361 3779 51,3%

x86_64:

baseline old new old (diff) new (diff) percent
i8 1380 4864 2683 3484 1303 37,4%
i16 1379 4862 2862 3483 1483 42,6%
i32 1381 5416 3390 4035 2009 49,8%
i64 2650 11225 5895 8575 3245 37,8%
i64 (i128_support) 2650 11225 3226 8575 576 6,7%

Like using a division, widening multiply depends more on the most significant bits than the least significant ones.

I have changed the tests for WeightedChoice because they assume a modulus is used for the range reduction.

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

Great work, but I'm still wondering: do we want this? Is it a question between using an RNG like Xorshift with these changes or using another RNG which doesn't appear to have these weaknesses in the low bits? You should be able to answer this question better than anyone based on the research you've been doing lately, I guess.

@pitdicker
Copy link
Author

Even when we pick an RNG that is good and does not need those changes, there would still be RNG's in other crates that would benefit from them.

The change to the ziggurat layer makes it little bit slower, 3~4%. I wonder if some indexing tricks can recover that...
For bools I have just added a benchmark, it improves from 1,405 ns/iter to 1,030 ns/iter, 27% faster.
And the range code is now about twice as fast.

So two of the three are definitely faster, and it helps with weaker RNG's. Seems win-win to me 😄

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

The numbers you posted are ns/iter? Ah, ok. 😆 👍

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

Can you also benchmark constructing and sampling from a range? Single-use ranges are quite common, e.g. the code in the seq module uses this heavily.

@pitdicker
Copy link
Author

Results:

Before:

test gen_range_i16            ... bench:       8,537 ns/iter (+/- 583) = 234 MB/s
test gen_range_i32            ... bench:       9,105 ns/iter (+/- 54) = 439 MB/s
test gen_range_i64            ... bench:      17,442 ns/iter (+/- 410) = 458 MB/s
test gen_range_i8             ... bench:       8,521 ns/iter (+/- 146) = 117 MB/s

After:

test gen_range_i16            ... bench:       6,996 ns/iter (+/- 206) = 285 MB/s
test gen_range_i32            ... bench:       6,835 ns/iter (+/- 16) = 585 MB/s
test gen_range_i64            ... bench:      13,603 ns/iter (+/- 189) = 588 MB/s
test gen_range_i8             ... bench:       6,563 ns/iter (+/- 9) = 152 MB/s

So there is some improvement, but they are both terrible... Maybe it is worth thinking about a single-use range?

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

gen_range is a single-use range so it doesn't need API adjustment. But is any optimisation possible anyway?

@pitdicker
Copy link
Author

I think it is possible to get within 70% of the normal version.

If we don't search for the optimal zone with a modulus but pick one with bitshifts that should be a win. On average 75% of the RNG's results should still be acceptable. And with everything in one function it can maybe optimise a bit better.

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

Okay, shall I go ahead and merge?

It looks like the CI failure is something unrelated I messed up (seeding-minimal branch I think).

@pitdicker
Copy link
Author

I can't figure out how to shuffle the traits to get something like sample_single in RangeImpl. Can you help me with it? https://gist.github.com/pitdicker/ab086693964a71c031d63b34ccfb8dba

Feel free to merge this though.

@dhardy dhardy merged commit 51cc0ed into dhardy:master Dec 11, 2017
@pitdicker pitdicker deleted the avoid_low_bits branch December 11, 2017 17:15
@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

I think you want X::T:

    pub fn sample_single<X: SampleRange, R: Rng+?Sized>(low: X, high: X, rng: &mut R) -> X {
        X::T::sample_single(low, high, rng)
    }

@pitdicker
Copy link
Author

Thank you, that could have taken me hours!

@dhardy
Copy link
Owner

dhardy commented Dec 11, 2017

No problem! It was a bit of a weird error and something you wonder why Rustc didn't figure out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants