Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RDRAND-based output is (too) biased #228

Closed
briansmith opened this issue Sep 30, 2021 · 8 comments · Fixed by #335
Closed

RDRAND-based output is (too) biased #228

briansmith opened this issue Sep 30, 2021 · 8 comments · Fixed by #335
Labels
bug Something isn't working

Comments

@briansmith
Copy link
Contributor

briansmith commented Sep 30, 2021

See

getrandom/src/rdrand.rs

Lines 33 to 42 in 30308ae

if rdrand_step(&mut el) == 1 {
// AMD CPUs from families 14h to 16h (pre Ryzen) sometimes fail to
// set CF on bogus random data, so we check these values explicitly.
// See https://github.com/systemd/systemd/issues/11810#issuecomment-489727505
// We perform this check regardless of target to guard against
// any implementation that incorrectly fails to set CF.
if el != 0 && el != !0 {
return Ok(el.to_ne_bytes());
}
// Keep looping in case this was a false positive.
, merged from PR #48:

 if rdrand_step(&mut el) == 1 {
            // AMD CPUs from families 14h to 16h (pre Ryzen) sometimes fail to
            // set CF on bogus random data, so we check these values explicitly.
            // See https://github.com/systemd/systemd/issues/11810#issuecomment-489727505
            // We perform this check regardless of target to guard against
            // any implementation that incorrectly fails to set CF.
            if el != 0 && el != !0 {
                return Ok(el.to_ne_bytes());
            }
            // Keep looping in case this was a false positive.
        }

The condition if el != 0 && el != !0 is testing that the value returned by RDRAND, after it reports success, is not zero or all-one bits, i.e. never usize::MIN or usize::MAX. Consequently, getrandom::getrandom will never return a result where there a single word is zero or all-one bits, where as word is a 4-byte chunk on 32-bit x86 or a 8-byte chunk on 64-bit x86, when the RDRAND implementation is being used. Such values are expected to occur every 2/2^N words on average. As a result, any use of getrandom::getrandom returns results that are wrongly biased; 2/2^N values are rejected on an N-bit platform.

32-bit x86 support for the RDRAND feature was added in PR #134, after PR #48. The analysis on the bias when the code was implemented was based on the probability 2/2^64 which is correct when this code runs on a 64-bit CPU, but not for when it runs on a 32-bit CPU. I suspect that when PR #134 was under consideration, the difference in the bias was perhaps overlooked. The bias is notably worse on 32-bit platforms since 2/2^32 is much more likely than 2/2^64.

Outside of cryptography, I find the present solution particularly unfortunate because if getrandom's output is used for randomized testing of a 64-bit/32-bit function on x86_64/x86 on a target for which the RDRAND implementation is used, the important boundary conditions of the values usize::MIN and usize::MAX will never be reached, ever!

In the context of implementing cryptographic functions, and especially keygen and nonce generation, I suspect any user of this crate is likely to get pestered about this bias and will have to address it in some way.

In any case, I think the present solution is weird enough that it is worth having more eyes on it and more discussion, and also I hope we could find a better solution.

Incidentally, BoringSSL's code has this to say:

      // Also disable for family 0x17, models 0x70–0x7f, due to possible RDRAND
      // failures there too.

It seems like Google has reason to believe that some family 0x17 models may also be bad. So the comment indicating the problem is only with families 14h-16h may be worth updating, at least.

@briansmith
Copy link
Contributor Author

In the rdrand crate, I saw very similar logic:

if el == 0 || el == !0 {
                    // Certain AMD CPUs may return broken data for `rdrand`. In those cases they
                    // will return a bitpattern where all-ones is set. In case this was a
                    // false-positive we just try again. If this happens `RETRY_LIMIT` times, we
                    // will naturally fall into the `HardwareFailure` branch.

Interestingly, the rdrand crate source code says that the bit pattern is all-ones when an AMD CPU fails, but it also checks all-zeroes, but doesn't explain why. And getrandom does the same, without any explanation.

I dug into the systemd patch at poettering/systemd@c6372be and it also checks both values. It has a clearer explanation of the reasoning:

    /* Apparently on some AMD CPUs RDRAND will sometimes (after a suspend/resume cycle?) report success
     * via the carry flag but nonetheless return the same fixed value -1 in all cases. This appears to be
     * a bad bug in the CPU or firmware. Let's deal with that and work-around this by explicitly checking
     * for this special value (and also 0, just to be sure) and filtering it out. 

So the value 0 is being screened out "just in case." Doesn't sound good to me.

I think that what might be not too terrible for systemd in that particular circumstance may be not quite good enough for a general purpose getrandom library.

@dhardy
Copy link
Member

dhardy commented Oct 1, 2021

I try not to comment on this repo too much (since I passed off maintainership), but I do remember a little about this issue.

RDRAND-based output is (too) biased

Biased, yes, and you are right that this was initially only considered in the case of u64 values. But is 1-in-2^31 bias too much? Depends on the application I guess; certainly this case is more arguable than for 1-in-2^63. We could perhaps set the minimum output size at 64-bits?

So the value 0 is being screened out "just in case." Doesn't sound good to me.
It seems like Google has reason to believe that some family 0x17 models may also be bad.

I wish we could just go to the AMD website and find a listing of all possible failure cases. Alas, we lack comprehensive sources of information, and also the possibility that this might later be discovered in future CPUs, so is there much more we can do?

@dhardy
Copy link
Member

dhardy commented Oct 1, 2021

Apparently this also affects some of the latest AMD CPUs, but in a harder-to-detect way: collisions on Ryzen 5900X.

(Perhaps the better answer is never to trust RDRAND on an AMD CPU? Not to say there aren't potential issues with other vendors.)

@briansmith
Copy link
Contributor Author

Thanks @dhardy. I remember that second systemd issue now. Here's an important comment from that issue:

systemd uses RDRAND to generate UUIDs only, nothing else, and exactly as suggested by the whitepapers. if this generates non-unique UUIDs that so easily collide then the RNG is just rubbish.

I don't know if that claim is accurate, but if it is accurate then it explains why systemd is comfortable with using a non-uniform RNG.

The rdrand crate was also doing the same thing as this; over the weekend they changed their implementation to avoid filtering out usize::MIN and usize::MAX; see nagisa/rust_rdrand#16 (comment).

@newpavlov
Copy link
Member

I think we should introduce separate code paths for affected families and everyone else. I have proposed it previously, but we chose the current approach for simplicity sake. It would be nice to have more info on the 0x17 family.

So the value 0 is being screened out "just in case." Doesn't sound good to me.

This hardware bug is really scary, so I think "just in case" is perfectly fine here. I think it's better to have small bias, than potentially zeroed nonces/keys. Personally, I think that on affected families it may be worth to go even further and check that we do not encounter collisions by calling the step function twice and checking that it returns different results each time.

@briansmith
Copy link
Contributor Author

briansmith commented Oct 20, 2021

So the value 0 is being screened out "just in case." Doesn't sound good to me.

This hardware bug is really scary, so I think "just in case" is perfectly fine here.

Why only all-zeroes and all-ones are scary, but no other values? My understanding is that AMD documented and/or somebody observed that all-ones happen in the case of the specific AMD failure; did AMD document or did somebody observe that the all-zero value occurs for these failures?

Personally, I think that on affected families it may be worth to go even further and check that we do not encounter collisions by calling the step function twice and checking that it returns different results each time.

False positives would occur. (This type of test is documented in NIST/FIPS standards and implementation guidance, FYI.)

As I mentioned in #230, I don't think it is a good idea to default to using RDRAND for any target. What I would prefer is that there is no purely-RDRAND-based implementation in getrandom; instead, when targeting the OS-less targets, users can register their own custom RNG using the existing mechanism, and if they want their custom RNG to use (just) RDRAND then they could use the rdrand crate for that. That would remove the use of RDRAND completely from getrandom. Then people could build "real" CSPRNGs that use RDRAND only as an entropy source instead of as the entire CSPRNG.

@josephlr
Copy link
Member

Why only all-zeroes and all-ones are scary, but no other values? My understanding is that AMD documented and/or somebody observed that all-ones happen in the case of the specific AMD failure; did AMD document or did somebody observe that the all-zero value occurs for these failures?

Some context on why that check is the way it is. The check is basically "did the implementation forget to set the CF flag". Both AMD and Intel document what value is put in the destination register on failure. For AMD it is all 1s (as discussed above), for Intel it is all 0s (see section 5.2 of Intel's DRNG whitepaper). So we just check for both.

This is obviously not an ideal state, as ideally the RDRAND instruction would never have bugs on any architecture, but it is what it is.

That would remove the use of RDRAND completely from getrandom.

I agree on many of the things in #230, but removing RDRAND completely is unlikely. Many users of this crate (myself included) rely on this functionality and its simple ergonomics, and I wouldn't want to break them.

@josephlr josephlr added the bug Something isn't working label Oct 23, 2022
@josephlr
Copy link
Member

josephlr commented Jan 28, 2023

I was looking into this issue today, and I think we should just mimic what the Linux Kernel does here:

  • Check if cpuid claims to support RDRAND
  • Unconditionally disable on AMD families 0x15 (Bulldozer / Piledriver / Steamroller / Excavator) and 0x16 (Jaguar / Puma)
    • These are the known-bad implementations where AMD CPUs will report success in RDRAND despite returning usize::MAX. This is the bug the current code tries to avoid.
    • Linux implementation in init_amd_bd and init_amd_jg
    • Given that families older than 0x15 didn't even have RDRAND, it's fine to just check is_amd && family < 0x17.
  • When doing the CPU checks in init, also run a small collision test to make sure we aren't repeating values
  • Don't check/bias the output when doing the normal buffer fills as part of getrandom_inner. This should avoid any issues around bias (either on 32-bit or 64-bit platforms). It also happens to make the generated code smaller and easier to inline.

This would also make our code almost match that in BoringSSL as well, except for them disabling RDRAND on family 0x17, models 0x70–0x7f, as @briansmith noted in #228 (comment). This change was added in BoringSSL change 37604, and I did some internal and external digging to figure out why ths change was added.

It turns out it was added in response to the Zen 2 RDRAND issues (Phoronix, Reddit post). As far as I can tell, a bad BIOS caused RDRAND to always fail, but still signal failure correctly via CF. On BoringSSL, this would be an issue, so the feature is disabled. However, for this library, we will just (always) return an error, so no blacklisting for that family/model is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants