Shootout: small, fast PRNGs #52
I have some benchmarks:
However, I don't trust them that much, because I experienced systematic differences just by reordering the benchmarks.
I think the most interesting ones are xoroshiro128+ and xorshift1024*, as suggested by Vigna. I don't have an opinion on cryptographic generators. It is important to understand what the binary rank test does when judging the failures, see http://xoroshiro.di.unimi.it/:
(I did not look up Marsaglia's paper to verify this claim.)
I adapted your crate to run these generators through PractRand. In light of your arguments about binary rank tests, it might be interesting to try PractRand with those disabled (if it has other useful tests). But I am not convinced we should ignore them entirely. Test output:
Splitmix64 is recommended by Vigna for seeding xoroshiro128+ and xorshift1024*. It works with any seed (even zero), but it has a period of 2^64, which might be a bit short.
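For reference, a sketch of SplitMix64 in Rust (the constants are from Vigna's splitmix64.c; the wrapper type and seeding helper are only illustrative):

```rust
struct SplitMix64 { x: u64 }

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        // golden-ratio increment, then two multiply-xorshift mixing steps
        self.x = self.x.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.x;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// seeding the two xoroshiro128+ state words from a single u64:
fn seed_xoroshiro128(seed: u64) -> (u64, u64) {
    let mut sm = SplitMix64 { x: seed };
    (sm.next_u64(), sm.next_u64())
}
```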
I have done a lot of experiments on Xoroshiro128+, SplitMix etc. Xoroshiro128+ is always the fastest, with a margin of up to 20%. But it has several weaknesses, and should be used with a lot of care unless you use it to generate floating-point numbers. I personally like truncated Xorshift*, and PCG. I have optimized and well-tested implementations here that are mostly ready. They are flawless under PractRand and TestU01. Benchmarks from a month ago:
The Xorshift variants in table form:
As @vks says, the period of SplitMix is much too small to be used as a general PRNG, unless we also use streams, which come with their own set of problems.
I'm not sure xoroshiro's failures of the binary rank test qualify as a weakness in practice, see this discussion.
I saw it after my post 😄. This is from memory, but I ran Xoroshiro+ through the test suite with the binary rank tests disabled, and there were other tests it failed as well. I also tested it with the bits reversed, but don't remember the results. Simply put, it doesn't mix (avalanche) its bits enough. Xorshift (or plain Xoroshiro, if that were a thing) has patterns. A simple addition reduces those patterns, but is not enough to remove them; something like a multiply should be used instead. Using a large state also helps to mask those patterns. Not to say that I don't think Xoroshiro+ can have its uses. Xorshift has its uses too. Especially when converted to floating point, the weaker least significant bits should just about never be interesting. But any user of it should be careful. @dhardy already raised good points. Do you know the impact of the weaker bits on: (1) generating booleans, (2) the ziggurat table lookup, (3) sampling from a range, (4) our custom, high precision conversions to double? And those are just a few of the uses in rand. For general use, I think there are better PRNGs we can go with. Then we can have a simple guideline: if you need a good RNG but don't need to worry about predictability (e.g. no adversaries), use Xorshift* or PCG. If you need unpredictability, use a cryptographic RNG. Of course performance is a thing. But I consider anything that is on average close to or better than our current Xorshift extremely fast, and good enough.
What kind of patterns? If I understand correctly, all linear generators have patterns, the question is whether they are observable.
Why? (Are you basing this on O'Neill's blog post?) It's not clear that this is better. I recently had an email exchange with Sebastiano Vigna about the merits of xoroshiro* compared to xoroshiro+. He did some experiments showing that xoroshiro+ has some Hamming dependencies after a lot of values; in this regard xoroshiro* is slightly better. But the lowest two bits of xoroshiro* are the same LFSR, while for xoroshiro+ the second-lowest bit is a perturbed LFSR, so in that regard xoroshiro+ is slightly better.
I'm not sure which patterns you mean, but the linear dependencies don't go away by using a larger state. Large states also have trouble recovering from lots of zeros in the initial state, and require more careful seeding.
Well, empirically there seems to be no practical impact whatsoever. All popular RNGs (that is, the Mersenne Twister for most programming languages, and xorshift for browsers and the current rand crate) are an LFSR in all bits. AFAIK, the only known impact is on the calculation of the rank of random binary matrices. You are saying users should be careful about the linear dependencies. Why?
Xorshift* has linear dependencies in the lowest two bits as well. I don't see a reason to use it over xoroshiro+, unless you are talking about xorshift1024* vs. xoroshiro128+.
On the other hand, performance is the only reason to use such a generator over a cryptographic one.
Completely agree. And whether they are observable depends on how the random numbers are used. From the first paper by Marsaglia it was noted that it was best to use Xorshift as a base generator, and to apply some output function or combine it with a second generator of a different type. All the variants over the years, like XORWOW, XSadd, xorgens, RanQ1, Xorshift*, truncated Xorshift*, Xorshift+ and Xoroshiro+, differ in their choice of output function. Xoroshiro+ just trades quality for a little bit of performance.
Ah, yes, I was imprecise. It is true that just multiplying affects the last two bits about as little as addition does. I once made a table of the avalanching effect of a single multiply. What makes a good output function is a multiply, followed by truncating the result to only the high bits.
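A sketch of that kind of output function, on top of a plain xorshift64 step (the shift and multiplier constants are Vigna's xorshift64*; the struct name is illustrative):

```rust
struct TruncatedXorshiftStar { s: u64 }

impl TruncatedXorshiftStar {
    fn next_u32(&mut self) -> u32 {
        // plain xorshift64 step
        self.s ^= self.s >> 12;
        self.s ^= self.s << 25;
        self.s ^= self.s >> 27;
        // the multiply avalanches low bits upward; keeping only the
        // high 32 bits discards the weakest (low) half
        (self.s.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 32) as u32
    }
}
```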
I think we have listed a couple of the problematic cases already. Can you have a look at them?
The performance is so good already that the difference between two PRNG algorithms is not much more than the difference caused by using one more register, or having one instruction that cannot run in parallel with another on the processor. There are only 8 or 9 instructions used per new random number. At such a point it seems to me that almost any use of the random number will take quite a bit longer than generating one. Of course there may be uses where that last cycle makes the difference, and quality is less important. Then Xoroshiro128+ is great! Note: I will be the last to say good performance is not a good quality. Optimising RNGs is fun ;-)
@vks I (we) are guilty of derailing this issue 😄. If you want to reply, could you open a new issue or something and mention my name?
Is this discussion off-topic? I thought the shootout was about performance and quality of the generators.
I did, and my point was that basically everyone uses RNGs in these cases where all bits have linear dependencies. I'm not aware of any problems with this. You keep saying linear dependencies are problematic, but you never say why.
No, you're not really off-topic, but I do fear the thread will be long and not so easy to read later! Great that you already did significant research on this @pitdicker; I didn't realise you'd done so much. Have you come across v3b? I noticed it mentioned in a comment thread earlier; can't find a source for it however. It sounds like there may be reason to include multiple "small fast PRNGs" in rand.
Okay, let me try to make a list of some of the problems in rand.

1) generating booleans

Current code for generating booleans:

```rust
impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> bool {
        rng.next_u32() & 1 == 1
    }
}
```

This should change to compare against some other bit than the last one, to generate bools that really have a 50% chance (a sketch follows at the end of this comment).

2) the ziggurat table lookup

The 11 least significant bits of the random number are used to pick the 'layer' of the ziggurat, and the 53 most significant bits as the fraction for an f64. The f64 is then multiplied by some value that belongs to the layer. If the few bits used for the layer are not actually random, this has relatively large effects on the shape of the distribution. This has a mostly easy fix: only 8 bits are needed for the layer, so don't use the 8 least significant bits, but bits 3..10.

3) sampling from a range

In rand, sampling from a range reduces the generated number to the much smaller range with a modulus operation. If we assume an RNG with weak least significant bits, those weaker bits will remain weak also in the much smaller range. A solution would be to use division instead of a modulus; at least I remember reading that that works. Without changes we can't really keep the promise of a uniformly distributed range.

4) high precision conversions to double

It turns out this part is ok. We use the 53 least significant bits for the fraction, and the remaining 11 most significant bits for the exponent. This means only the last bits of the fraction are weak. If it were reversed, some exponents would occur more often than others, and the effect of that would be huge for floats.

As you see, it takes some nontrivial thinking to see if your algorithm works well with a generator with some weaknesses. Now we could adapt this code to make using it with Xoroshiro+ safe. That seems like a good idea to me anyway. But what to do when there is some other RNG that happens to have weaker most significant bits? Okay, I have not heard of such. But how much should generic code cater to the weaknesses of one RNG?
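A sketch of the fix for case 1 (whether to use the sign bit or some other high bit is a judgment call):

```rust
impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> bool {
        // use the sign bit: for LFSR-style generators the high bits
        // are generally the strongest
        (rng.next_u32() as i32) < 0
    }
}
```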
Do linear dependencies in the low bits imply bias (P(1) ≠ 1/2)? I assumed not(?). Do they imply high predictability within these two bits over a very short sequence? As @vks says, the weakness may not be so significant.
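The bias half of that question is easy to check empirically; a sketch, with a plain xorshift64 step standing in for the LFSR:

```rust
fn main() {
    let mut s: u64 = 0x1234_5678_9ABC_DEF0;
    let (mut ones, n) = (0u64, 10_000_000u64);
    for _ in 0..n {
        // plain xorshift64 step (Marsaglia's 13/7/17 triple)
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        ones += s & 1;
    }
    // prints a fraction very close to 0.5: the low bit is unbiased,
    // it is merely linearly dependent on earlier outputs
    println!("P(1) ~ {}", ones as f64 / n as f64);
}
```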
v3b comes from here: http://cipherdev.org/. I know nothing about it though.
Maybe of interest: an evaluation of Xoroshiro by the author of PractRand, which has gained some replies from Sebastiano Vigna since I last looked.
For fun I wrote a Xoroshiro64+. It is not meant to be used. I had to calculate new constants for a, b and c. Creating a smaller version is useful, because for something like PractRand it is easier to analyse.

```rust
use std::num::Wrapping as w; // the code below relies on this alias

#[derive(Clone, Debug)]
pub struct Xoroshiro64PlusRng {
    s0: u32,
    s1: u32,
}

impl SeedFromRng for Xoroshiro64PlusRng {
    fn from_rng<R: Rng>(mut other: R) -> Result<Self, Error> {
        // reject the all-zero state, which would be a fixed point
        let mut tuple: (u32, u32);
        loop {
            tuple = (other.next_u32(), other.next_u32());
            if tuple != (0, 0) {
                break;
            }
        }
        let (s0, s1) = tuple;
        Ok(Xoroshiro64PlusRng { s0: s0, s1: s1 })
    }
}

impl Rng for Xoroshiro64PlusRng {
    #[inline]
    fn next_u32(&mut self) -> u32 {
        let s0 = w(self.s0);
        let mut s1 = w(self.s1);
        let result = (s0 + s1).0;

        s1 ^= s0;
        self.s0 = (w(s0.0.rotate_left(19)) ^ s1 ^ (s1 << 13)).0; // a, b
        self.s1 = s1.0.rotate_left(10); // c
        result
    }

    #[inline]
    fn next_u64(&mut self) -> u64 {
        ::rand_core::impls::next_u64_via_u32(self)
    }

    #[cfg(feature = "i128_support")]
    fn next_u128(&mut self) -> u128 {
        ::rand_core::impls::next_u128_via_u64(self)
    }

    fn fill_bytes(&mut self, dest: &mut [u8]) {
        ::rand_core::impls::fill_bytes_via_u32(self, dest)
    }

    fn try_fill(&mut self, dest: &mut [u8]) -> Result<(), Error> {
        Ok(self.fill_bytes(dest))
    }
}
```

Results (only the last results; all the other failures before are similar):
Another generator and test suite that may be worth looking into: http://gjrand.sourceforge.net/boast.html |
I like the idea of having multiple PRNGs, but personally would have an extension crate for them. I feel like Rust should have a canonical implementation for getting a random number. P.S. Sorry for continuing to recommend more crates, maybe I just love making crates :P.
As I said in the first post, that discussion is for another issue! It may be that...
@pitdicker I think you are confused about the properties of LFSRs.
What makes you think this is not the case for LFSRs? If you look at the distribution, it is fine. (This is not a very hard test; a generator outputting...
What do you mean by "not actually random"? Do you mean the bits don't have a uniform distribution? This is not the case for LFSRs. In fact, they are usually designed to be equidistributed. If there were a large effect on the distribution of the sampled values, no one would be using LFSR-based generators like the Mersenne Twister or xorshift.
Yes, but I don't think this "weakness" is a problem in practice.
Let me emphasise I have nothing against Xoroshiro128+. It is one of the fastest small PRNGs currently in use, and it does not perform terribly on most statistical tests. But there are other RNGs that have up to 80% of the performance of Xoroshiro128+, and that don't have detectable statistical weaknesses. I think the question should be:
@vks You raise a good point by questioning how much statistical weaknesses matter for real applications. That depends on the algorithm the RNG is used in, and requires evaluation for every case. I can't answer it. Do we want every user to ask himself that question? Again I want to note that a 20~25% performance difference sounds like much, but it is a difference of only one or two clock cycles.
You are right, my bad. It follows patterns, but the chance for one bit to be 1 or 0 should remain about 50%.
@dhardy Do you plan on collecting all the small RNGs mentioned here in one repository/crate, for easy comparison? That would be an interesting collection. Or just mention them here?
I'm considering that, but don't know if I will get around to it or not. It's not a priority anyway.
It also depends on the statistical weakness in question. I looked up the paper where the binary rank test was introduced (Marsaglia, Tsay 1985):
So the lowest two bits in xoroshiro might give you some problems when generating random incidence matrices. I don't think this is a common use case, as those linear dependencies are the status quo of the currently used RNGs. It seems the authors of xoroshiro thought it was worth the tradeoff (xoroshiro website):
So the question is whether we want to improve on the status quo at the cost of some performance. In my opinion, it is not necessary, because it does not seem that users expect this property from an RNG. Should we write something like an RFC? All this binary-rank discussion is a bit scattered at the moment...
Yeah, exposing... Ultimately, we have to consider what the...
@Lokathor I just looked at your implementation, and while you don't expose... This is easy to fix, though, just by doing a little bit-mixing on the input.
I thought so too, but it is not as bad as it seems. A quote from myself:
Combined this means it takes 2^27 initializations before there is a chance of about 1 in a million that part of a window of 2^48 results is reused. Seems good enough to me. I really tried to get a scheme using Xorshift jumps to work. In our conversations I may sound negative, but I like the Xorshift variants. Especially the papers of Vigna I have studied several times, and I have played with the code supporting his papers. I wrote a jumping function, and tried calculating a custom jump polynomial. A fixed jump takes as many rounds of the RNG as it has state bits. Together with bookkeeping, this makes a jump as slow as requesting new bytes from the OS. I also experimented with variable jumps. For now I calculated the jump polynomial by hand in Excel :-(. A variable jump is at least several times slower than a fixed jump. Another idea was to use variable jumps that are multiples of 2^50. Picking them at random gives a similar chance of duplicate streams as PCG streams. But this improves on the birthday problem only a little, while taking very much more time. In the end, using jumps to make the birthday problem when initializing RNGs less of an issue seemed to not really work out. Unless you have a single 'mother' RNG that is jumped every time a new 'child' RNG is split off. And even then it is slow.
@Ichoran I admit that my code is not fully idiot proof, because it's mostly intended for only me to ever use it, but I already also provide... One thing is though that... @pitdicker I forget my birthday problem formula exactly, but your math seems wrong just because its logic is wrong. The period of the PCG used has absolutely nothing to do with stream selection overlap. It's not the case that the...
@Lokathor |
Ah, my mistake then.
I'm getting this result as well with the approximate solution of the birthday problem:
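The usual approximation here is a birthday-style bound: n sequences, each consuming L outputs of a cycle with period P, overlap with probability roughly n²·L/P. A sketch (whether this matches the exact computation quoted above is an assumption):

```rust
/// Rough overlap bound, with all arguments as base-2 logarithms:
/// p ~= n^2 * L / P  =>  2^(2 * n_log2 + len_log2 - period_log2)
fn overlap_probability(n_log2: i32, len_log2: i32, period_log2: i32) -> f64 {
    2f64.powi(2 * n_log2 + len_log2 - period_log2)
}

// e.g. 2^27 initializations, windows of 2^48 results, period 2^128:
// overlap_probability(27, 48, 128)
```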
I just tried this out for the XSL RR 128/64 (MCG) variant. The advantage of a custom permutation is that the truncation to u32 can happen earlier. Because on x86_64 64-bit operations are about as fast as 32-bit ones, it did not change the benchmarks at all... On x86 the speed was already abysmal, and it remained so. And for the 64/32 variants there is not much creative we can do with the output functions, right?
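For reference, my reading of the XSL RR 128/64 output permutation (this matches the PCG reference implementation as far as I can tell; the free-function form is just for illustration):

```rust
// XSL RR 128/64: fold the high half into the low half with an xor
// ("xorshift low"), then rotate by the top 6 bits of the state.
fn output_xsl_rr_128_64(state: u128) -> u64 {
    let rot = (state >> 122) as u32; // 128 - log2(64) = 122
    (((state >> 64) as u64) ^ (state as u64)).rotate_right(rot)
}
```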
Wow, the extension method for PCG is complex! And the C++ template stuff and lack of comments don't help either. I tried implementing it. That claim does not seem true though, as it is implemented with an extension array of 32 32-bit words. The extension mechanism comes with two choices: we can pick a size, and whether we want k-dimensional equidistribution (kdd). It is best if the size is a power of two; this makes the point when the extension table should be updated easier to recognise. To generate a new random number, the output of the base generator (PCG XSH RR 64/32 in my case) is xored with a randomly picked value from the extension array. Which function is used to pick a value from the extension array depends on whether we want kdd. The PCG paper explains:
Sometimes the values in the extension table need to be updated. PCG uses the following scheme:
Every value in the extension array is its own little PCG-RXS-M-XS RNG. The process to update a value is complex, slow, and in my opinion ugly. First the inverse of the RXS-M-XS output function is applied, which includes using a recursive un-xorshift function twice, and multiplying by the modular inverse of the multiplier of RXS-M-XS. Then the state recovered in this way is advanced as if it were an LCG. Next the RXS-M-XS output function is applied to get the new value for the extension array. Repeat until all values are updated. In about 50% (?) of the cases a value is also advanced a second time, to break the different array values out of lockstep (if the base generator is an MCG, at least). I did not finish my implementation of PCG with an extension array. It certainly does not fit in the 'small, fast RNG' category. I suppose the PCG EXT variants only look so fast in the benchmarks when the whole table-update part does not happen.
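As far as I can tell, the fast path (when no table update happens) is just an xor with one entry of the array. A rough sketch of that understanding, with illustrative names, the thread's own XSH RR constants, and the update machinery elided:

```rust
struct PcgExt {
    state: u64,
    ext: [u32; 32], // extension array: 32 x 32-bit words
}

impl PcgExt {
    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        // base LCG step
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // the expensive table update only triggers rarely; the exact
        // tick condition varies per variant (state == 0 is the simplest)
        if self.state == 0 {
            self.update_ext_array();
        }
        // normal PCG XSH RR 64/32 output...
        let xsh = (((old >> 18) ^ old) >> 27) as u32;
        let base = xsh.rotate_right((old >> 59) as u32);
        // ...xored with a value picked from the extension array
        // (with kdd the index cycles; here: low state bits)
        base ^ self.ext[(old & 31) as usize]
    }

    fn update_ext_array(&mut self) {
        // omitted: per entry, un-xorshift twice, multiply by the modular
        // inverse, advance as an LCG, re-apply RXS-M-XS (see above)
    }
}
```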
Yeah, the extension for k-dimensional equidistribution is tricky. The Xoroshiro-style extension with cycling through array positions doesn't give k-dimensional equidistribution, but it is dead simple. Unfortunately, I'm pretty sure (I think?) that it doesn't grow the period very quickly: 2^n state slots in an array gives (again, I think) +n to your period (e.g. 2^64 becomes 2^(64+n) instead). I'm not sure what a good answer is, but I'll try to keep thinking about it in the next few days.
@Lokathor I'm pretty sure the period is as large as it can be, since there is only one cycle for these RNGs.
False. Please read the PCG paper.
It really is worth reading the PCG paper. It's amazingly clear and approachable for what often seems like an arcane and difficult topic. With regard to the extension: you need a more complex scheme, and some of those have self-similarity problems. So I'd agree with @pitdicker that it isn't really a small fast RNG any more. There aren't many instructions executed per clock cycle, but the logic for advancing is somewhat complex, and not completely trivial mathematically. (I'd still characterize it as straightforward, but there are plenty of opportunities for implementation errors.) Anyway, is there a compelling reason not to just pick one of the non-extension schemes and have that be the default? Having fast yet decent-quality random numbers (even if on some architectures it's not as fast as others) seems like a sizable improvement over the status quo; and you can always leave in the existing implementations for people who have reason to prefer the old algorithm. The nice thing about the PCG family is that not only are the algorithms close to as fast as they can be, there's also a theoretical framework that helps reassure us that it's unlikely that a really problematic non-random structure is lurking in there somewhere that just doesn't happen to be tested by the typical tests. This is a great reassurance for a standard library to have. (Note: we only have that reassurance for a single stream, not for comparison between multiple streams.)
Well, the only problem is that the default PCG you want kinda depends on the output you want. If you want mostly u32 values then 64/32 will probably serve you better than 128/64, simply because it takes less space and is somewhere between slightly faster and much faster depending on the machine (it's not unreasonable to think that Rust will regularly be run on 32-bit devices as well). We could provide both and then explain why you'd want one or the other. Later on we might even be able to provide a pcg extras crate with macros that build a PCG type for you on the fly, complete with an Rng impl and such. Fancy stuff can come later once we pick a good default.
I think 32-bit x86 is pretty dead by now, but 32-bit ARM is still common, so there is some value. Note that the default generator does not need to be the same on all platforms; however I don't think we can switch the algorithm depending on whether more...
The generator stepping and the permuting of the LCG output are separate phases of the process. As long as the generator stepping is consistent for both modes, you can use a different permutation for each mode. And yeah, I own a 32-bit ARM device that I use often enough, the Raspberry Pi board series. I'm sure that plenty of other single-board computers are also 32-bit ARM devices, and that people want to be able to use Rust on them.
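If we did want a platform-dependent default, it could be as simple as a cfg'd type alias; the type names here are hypothetical:

```rust
// hypothetical: a 128-bit-state generator where 64-bit math is cheap,
// and a 64-bit-state one on 32-bit targets
#[cfg(target_pointer_width = "64")]
pub type DefaultSmallRng = Pcg64; // e.g. PCG XSL RR 128/64

#[cfg(not(target_pointer_width = "64"))]
pub type DefaultSmallRng = Pcg32; // e.g. PCG XSH RR 64/32
```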
For a quick summary: an RNG that outputs u32s is not great, because producing one u64 then means combining two outputs. This is more than twice as slow, and also reduces the period. One RNG that can generate u64s directly with good statistical quality is the 128-bit variant of PCG. Another is the 64-bit Xorshift/Xoroshiro with a widening multiply to 128 bits as an output function. Both are great on x86_64, but very slow on x86, because both need 128-bit multiplies, which are not available there and need to be emulated. What we are trying here is:
As the PCG paper notes, the problem of the RXS M XS variants is that every random number appears exactly once over a period of 2^64. It is relatively quickly possible to see that the results are not truly random, because there are no duplicates. Of course a PRNG is never truly random, but it should appear so. The question is: does the extension mechanism of PCG not only enlarge the period, but also fix this problem with RXS M XS? I think the problem is simple: a requirement to get the proper number of duplicates according to the generalized birthday problem is that at least 128 bits of state need to get updated frequently. It seems to me a simple solution could be good enough: xor the output of PCG RXS M XS 64 with a 64-bit counter. And if we use a Weyl sequence instead of a counter, we can maybe even get away with using an MCG as a base generator (and if we want, we can still have streams). So just about no slowdown 😄. I think this gives the proper distribution, but has the consequence that some results will not appear at all... Something like this:

```rust
fn next_u64(&mut self) -> u64 {
    // MCG step
    self.m = self.m.wrapping_mul(MULTIPLIER);
    // Weyl sequence step (note: add to `w`, not `m`)
    self.w = self.w.wrapping_add(INCREMENT);
    let state = self.m ^ self.w;
    output_rxs_m_xs(state)
}
```

It will take a few days before I can test this though. It should also be possible to test the distribution of the results with a 32-bit variant; that would need only 4 GB of memory.
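That distribution test could look something like this sketch (`McgWeyl32` is a hypothetical 32-bit variant of the generator above):

```rust
fn check_distribution(mut rng: McgWeyl32) {
    // one u8 counter per possible 32-bit output = 4 GiB
    let mut counts = vec![0u8; 1usize << 32];
    for _ in 0..(1u64 << 32) {
        let x = rng.next_u32() as usize;
        counts[x] = counts[x].saturating_add(1);
    }
    // a truly random function gives counts ~ Poisson(1), so about 36.8%
    // of all values never occur; a bijection gives exactly one of each
    let zeros = counts.iter().filter(|&&c| c == 0).count();
    println!("never produced: {:.1}%",
             100.0 * zeros as f64 / (1u64 << 32) as f64);
}
```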
I'm not sure the Weyl sequence adds anything beyond a simple incrementer, given the mixing afterwards (assuming we use the PCG mixers). An MCG is a bit risky given that it's degenerate when the state is zero.
It should. It is my understanding that the 64/64 permutation has problems with each output appearing exactly once precisely because the period is too small. If you used it with a larger period, it would be fine. The reason that the 128/64 scheme works fine with 64-bit output is because the period is 2^128, not because the permutation is magically better on its own.
This sums it up nicely.
That sounds like a very improper solution! I'd be upset to use a generator where some results can't possibly happen, despite the fact that any particular result only has a 1/(2^64) chance to begin with. Proposed alternate solution: we could wait a release cycle or two for 128-bit Rust to become stable (assuming it's out in the next cycle or two?), write the 128/64 PCG (which will have great 64-bit output), and then just accept that it will run very slowly on a 32-bit machine and tell people in the docs. The reason you'd use 128/64 is because you want to focus on 64-bit output, and if you're doing something that needs 64 bits at a time but running it on a 32-bit machine, I hardly know what you're doing to begin with. That's just goofy. People don't normally think about the 32-bit/64-bit jump at all, but PRNGs are one of the things where it was a big deal, it continues to be a big deal, and you do have to think about it. That's not something we can fix ourselves, because it's just part of how the math and hardware work out.
Found time to do some testing already. Code:

```rust
fn next_u64(&mut self) -> u64 {
    // MCG
    self.m = self.m.wrapping_mul(6364136223846793005);
    // Weyl sequence
    self.w = self.w.wrapping_add(1442695040888963407);
    let mut state = self.m ^ self.w;

    // output function RXS M XS:
    // random xorshift, mcg multiply, fixed xorshift
    const BITS: u64 = 64;
    const OP_BITS: u64 = 5; // as in the PCG reference implementation
    const MASK: u64 = BITS - 1;
    let rshift = (state >> (BITS - OP_BITS)) & MASK;
    state ^= state >> (OP_BITS + rshift);
    state = state.wrapping_mul(6364136223846793005);
    state ^ (state >> ((2 * BITS + 2) / 3))
}

fn next_u32(&mut self) -> u32 {
    self.m = self.m.wrapping_mul(6364136223846793005);
    self.w = self.w.wrapping_add(1442695040888963407);
    let state = self.m ^ self.w;

    // output function XSH RR: xorshift high (bits), followed by a random rotate
    const IN_BITS: u32 = 64;
    const OUT_BITS: u32 = 32;
    const OP_BITS: u32 = 5; // log2(OUT_BITS)
    const ROTATE: u32 = IN_BITS - OP_BITS; // 59
    const XSHIFT: u32 = (OUT_BITS + OP_BITS) / 2; // 18
    const SPARE: u32 = IN_BITS - OUT_BITS - OP_BITS; // 27
    let xsh = (((state >> XSHIFT) ^ state) >> SPARE) as u32;
    xsh.rotate_right((state >> ROTATE) as u32)
}
```
Performance is not bad, but not as good as I hoped: about 15~25% better than combining two outputs from PCG XSH RR 64/32. x86_64:
x86:
PractRand seems pretty happy with it until now (half a terabyte tested).
Good point. We would have to make sure the seed is not 0, just like we have to for Xorshift/Xoroshiro and PCG with MCG as a base generator.
Yes. Although it also depends a little on how the period works. For example, imagine a scheme where the base generator first gives every number between 0 and 2^64 in one order, and for the next period gives every number again, only once, but in some other order. That is why I wanted to know the details of the extension mechanism.
I agree it is not nice. On the other hand, I don't think it matters. You only know which values are missing after generating and keeping track of 2^64 numbers. I don't think that is even possible. And for every seed, the numbers that occur double or triple, and the results that are missing, are different. Thanks both for thinking along seriously! I am not going to push this RNG too far, but it seems to work well, and it is faster than the other alternatives for generating good-quality u64s on x86.
Another possible solution: the XSH output function needs only 6 bits to do its work, and should be able to output up to 58 bits. The mantissa of an f64 can only store 53 bits. So we could make something like PCG XSH RR 64/53 work. I see a few disadvantages though:
Well, converting one or more...
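For context, the standard 53-bit mapping referred to above, as a sketch:

```rust
// map the high 53 bits of a u64 to an f64 uniformly distributed in [0, 1)
fn u64_to_f64(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}
```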
We've already had a lot of discussion on this. Let's summarise algorithms here.
This is about small, fast PRNGs. Speed, size and performance in tests like PractRand and TestU01 are of interest here; cryptographic quality is not.
Edit: see @pitdicker's work in #60.