Seed HashMaps thread-locally, straight from the OS. #31356
Conversation
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
cc @rust-lang/libs
I only had a few minutes to write this, and wanted to get the conversation started if there is one. The next step will be a patch that moves the rest of the private rand support out of std.
Can you expand on what this means and why it is? I don't see why changing the source of randomness would affect the behavior.
```rust
let mut bytes = [0_u8; 16];
internal_rand::fill_bytes(&mut bytes);

let keys: (u64, u64) = unsafe { mem::transmute(bytes) };
```
Weird transmute. Is this really ok? Seems like the alignment of the u8 array is probably less than the u64 tuple.
Agreed, should start with keys, and transmute keys to bytes before passing them to internal_rand.
A transmute like this should work fine: it's reinterpreting the value, not the memory location. That is, `let out = transmute::<A, B>(x)` is/should be equivalent to (in C):

```c
B out;
memcpy(&out, &x, sizeof(B));
```

It would definitely be problematic if it was `&[u8; 16]` to `&(u64, u64)`.

(If the above is not the case, it seems to me like it is an unnecessary footgun.)
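To make the distinction concrete, here is a minimal sketch (not from the PR) of the by-value transmute under discussion; the reference-to-reference version is the one that could run into alignment trouble:

```rust
use std::mem;

fn main() {
    let bytes = [0xAB_u8; 16];

    // By value: the 16 bytes are copied into a fresh, properly aligned
    // (u64, u64), just like the memcpy above.
    let keys: (u64, u64) = unsafe { mem::transmute(bytes) };
    println!("{:016x} {:016x}", keys.0, keys.1);

    // A &[u8; 16] -> &(u64, u64) transmute would instead reinterpret the
    // original memory location, and that is where alignment could bite.
}
```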
(Do we guarantee tuple layout?)
No, but there's no way for that to matter here. If the transmute succeeds then you've loaded up 128 bits with randomness, and that's all that matters.
I am intrigued by the by-val transmute issue... I don't think I see a lot of transmuting of non-pointers... I have no idea how it should behave!
Adding padding will make the tuple larger, and hence cause the transmute to fail. Also, note that this is in the standard library, and hence can rely on current behaviours of the compiler.
@gankro You shouldn't see many transmutes of pointers either, since those are covered by regular casts (and `&*` to create a reference).
Maybe a more natural way to write this is to create a `[u64; 2]` and take a slice of it as `&mut [u8]` (its byte representation) and pass it to fill_bytes?
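A rough sketch of that alternative (with `fill_bytes` stubbed out here, since the OS backend isn't shown):

```rust
use std::slice;

// Stand-in for the OS entropy source.
fn fill_bytes(buf: &mut [u8]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b = i as u8;
    }
}

fn main() {
    let mut keys = [0_u64; 2];
    // View the two u64s as their 16-byte representation and fill them in
    // place, trading the transmute for some unsafe pointer fiddling.
    let byte_view: &mut [u8] = unsafe {
        slice::from_raw_parts_mut(keys.as_mut_ptr() as *mut u8, 16)
    };
    fill_bytes(byte_view);
    println!("{:016x} {:016x}", keys[0], keys[1]);
}
```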
> That is, `let out = transmute::<A, B>(x)` is/should be equivalent to (in C) ... (memcpy)

should be, but istm llvm could decide there are alignment-dependent instructions that would work better. just handwaving
> Maybe a more natural way to write this is to create a `[u64; 2]` and take a slice of it as `&mut [u8]` (its byte representation) and pass it to fill_bytes?

I did this originally, but it is significantly more fiddly.

> llvm could decide there are alignment-dependent instructions that would work better. just handwaving

I think that would mean that LLVM is explicitly disobeying our instructions and hence would be a miscompilation. (I had a look at the IR, and it does compile down to a `memcpy`.)
@brson today HashMap::new pulls two fresh keys from the thread rng for each new map; with the PR, every map created on a thread reuses a pair of keys seeded once for that thread, straight from the OS.
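Roughly, the new scheme looks like the following sketch (hypothetical, not the actual std internals; `os_fill_bytes` stands in for the OS entropy source):

```rust
use std::mem;

// Stand-in for reading 16 bytes straight from the OS (e.g. /dev/urandom).
fn os_fill_bytes(buf: &mut [u8; 16]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b = i as u8;
    }
}

thread_local! {
    // Seeded once per thread; every HashMap created on this thread reuses it.
    static KEYS: (u64, u64) = {
        let mut bytes = [0_u8; 16];
        os_fill_bytes(&mut bytes);
        unsafe { mem::transmute(bytes) }
    };
}

fn keys_for_new_map() -> (u64, u64) {
    KEYS.with(|&k| k)
}

fn main() {
    // Two maps created on the same thread now end up with identical keys,
    // which is exactly the determinism concern discussed below.
    assert_eq!(keys_for_new_map(), keys_for_new_map());
}
```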
Today if you make two HashMaps and perform identical operations on them, you'll get different iteration orders. With this patch, they will iterate identically. Historically programmers have been all too happy to latch onto any perceived determinism in HashMap iteration order. This is why JavaScript now specifies that an object is semantically a LinkedHashMap: people relied on that fact, and when it didn't hold, drop-downs got shuffled. Now, this is less of a risk for Rust, because the JS case relies on individual maps being really reliable, and that doesn't hold today or with this patch. However, one could imagine something to the effect of:

```rust
fn test() {
    let mut results: Vec<Vec<u8>> = vec![];
    for vec in get_data_sets() {
        let mut map = HashMap::new();
        for k in vec {
            map.insert(k, compute(k));
        }
        results.push(map.values().cloned().collect());
    }
    assert!(results[0] == results[1]);
}
```

Today this would be hopelessly broken, but with this patch it will "happen to work" if the input/output sequences line up. This is one of those situations where we can wag our finger and tell people not to do it, but at the end of the day it's externally observable, so it can and will be leveraged. Note that the following will still be hopelessly and obviously broken:

```rust
fn test() {
    // I got this when I ran it the first time, shouldn't change!
    let expected = vec![1, 3, 5, 6, 11, 13];
    let mut map = HashMap::new();
    for k in get_test_input() {
        map.insert(k, compute(k));
    }
    let result: Vec<_> = map.values().cloned().collect();
    assert_eq!(result, expected);
}
```

Which is the thing I would most worry about.
```rust
pub use self::imp::fill_bytes;

#[cfg(all(unix, not(target_os = "ios"), not(target_os = "openbsd")))]
```
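For context, a hypothetical sketch of the platform-dispatch pattern this hunk is part of (the module bodies are placeholders, not the PR's actual implementations):

```rust
// Re-export whichever platform-specific `imp` module was compiled in.
pub use self::imp::fill_bytes;

#[cfg(all(unix, not(target_os = "ios"), not(target_os = "openbsd")))]
mod imp {
    pub fn fill_bytes(_buf: &mut [u8]) {
        // e.g. read from /dev/urandom
        unimplemented!()
    }
}

#[cfg(target_os = "openbsd")]
mod imp {
    pub fn fill_bytes(_buf: &mut [u8]) {
        // e.g. arc4random_buf
        unimplemented!()
    }
}
```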
Would it be possible to deduplicate this with the existing randomness support? At least in terms of an internal implementation detail it'd be good to not have to keep this in sync in two places.
As I mention above, I was/am going to remove the existing support from std since we only need what's in this file now. However, the `rand` crate will still need to exist for internal testing purposes; should it still be dedup'd with that? (It's definitely possible, it just means slightly more boilerplate in here.)
Ah sorry for letting this slip through the cracks, but yeah that seems fine to me!
Wow those are some impressive numbers! In the past I don't remember them being quite so dramatic... If we're moving to thread-locals I wouldn't mind moving to full-blown process-globals. We've concluded in the past that there's no need in terms of DoS protection to have per-hashmap or even per-thread keys; a global one should be good enough.
Currently each hashmap has different keys, since each one retrieves new keys from the thread-local rng on creation. This patch changes that: the keys are generated once per thread (straight from the OS) and reused by every map created on that thread.
It'll probably depend on the platform (incl. Linux kernel version), since that'll change the performance characteristics of retrieving random numbers (the old thread rng requires retrieving kilobytes, while this just needs a few bytes).
Yeah, definitely possible, but this implementation is particularly simple (don't have to worry about ...).
It would be possible to address @gankro's concern by generating the actual keys with a per-thread PRNG seeded with the OS random data. EDIT: but isn't this what the old implementation was already doing?
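A sketch of that idea (hypothetical: a deliberately simple splitmix64-style step as the per-thread generator; nothing here is the actual old or new std code):

```rust
use std::cell::Cell;

// Stand-in for 8 bytes of OS randomness.
fn os_random_u64() -> u64 {
    0x1234_5678_9ABC_DEF0
}

thread_local! {
    // Per-thread PRNG state, seeded once from the OS.
    static STATE: Cell<u64> = Cell::new(os_random_u64());
}

// One splitmix64 step: cheap and not crypto-safe, but enough to give each
// map on this thread different keys.
fn next_key() -> u64 {
    STATE.with(|s| {
        let mut z = s.get().wrapping_add(0x9E37_79B9_7F4A_7C15);
        s.set(z);
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    })
}

fn keys_for_new_map() -> (u64, u64) {
    (next_key(), next_key())
}

fn main() {
    // Each new map gets fresh keys, restoring per-map nondeterminism.
    assert_ne!(keys_for_new_map(), keys_for_new_map());
}
```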
The improvement from 130_000 ns to 3000 ns looks huge. Would it make sense to investigate why the thread rng is so expensive to initialise?
Yes, it was, and it's why I was sure to point out this change has that downside in the PR. If we absolutely want to proactively help people avoid relying on hashmap order, a simpler and more efficient method would be to just increment one of the cached keys each time. (NB. this doesn't work quite so well with global rather than thread-local values: the locked increment/contention makes things slower.)
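A minimal sketch of the incrementing idea with thread-local keys (assumed names and seed values, not the PR code):

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread cached keys; imagine them seeded from the OS.
    static KEYS: Cell<(u64, u64)> =
        Cell::new((0x1234_5678_9ABC_DEF0, 0x0FED_CBA9_8765_4321));
}

fn keys_for_new_map() -> (u64, u64) {
    KEYS.with(|cell| {
        let keys = cell.get();
        // Bump one key so the next map created on this thread hashes differently.
        cell.set((keys.0.wrapping_add(1), keys.1));
        keys
    })
}

fn main() {
    // Consecutive maps on a thread no longer share identical keys.
    assert_ne!(keys_for_new_map(), keys_for_new_map());
}
```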
Maybe, but it's almost certainly fundamental to the algorithm (it requires an extensive initialisation). In any case, the thread_rng in std is only used here, and hence is being/can be removed.
If the update of the global cached keys is that simple, we could do atomic increments. There might be some contention, but it should still be way faster than locking. Another option would be to keep the thread-local approach and use a simpler (non-crypto-safe) RNG to have cheaper initialisation and key generation. In the repo for the ...
This reduces how much rand is required in std, and makes creating hash-maps noticeably faster. The first invocation of HashMap::new() in a thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later invocations go from 10-20 ns to a more consistent 1.8-2.0ns. The mean for many invocations in a loop, on a single thread, goes from ~77ns to ~1.9ns, and the standard deviation drops from 2800ns to 33ns (the *massive* variance of the old scheme comes from the occasional reseeding of the thread rng). These new numbers are only slightly higher than creating the `RandomState` outside the loop and calling `with_state` in it (i.e. only measuring the non-random parts of hashmap initialisation): 1.3 and 18 ns respectively. This new scheme has the slight downside of being consistent on a single thread, so users may unintentionally rely on the order being fixed (e.g. two hashmaps with the same contents). Closes rust-lang#27243.
(Force-pushed from 27f9142 to 3f84916.)
(Incidentally, my method of measuring had a lot of overhead due to the timers, I've updated the numbers.)
Yes, but... locking was never proposed? In the single threaded case (i.e. no contention), doing the atomic increment seems to be more than 2.8 times slower than the non-incrementing global version and both thread local ones (incrementing or not): it's about 5.4 ns / invocation.
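A sketch of the global, atomically bumped variant measured here (hypothetical; real code would seed the statics from the OS rather than use fixed constants):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-global keys; imagine them seeded once at startup.
static KEY0: AtomicU64 = AtomicU64::new(0x1234_5678_9ABC_DEF0);
static KEY1: AtomicU64 = AtomicU64::new(0x0FED_CBA9_8765_4321);

fn keys_for_new_map() -> (u64, u64) {
    // One atomic read-modify-write per map; this is the ~5.4 ns/invocation
    // cost mentioned above.
    let k0 = KEY0.fetch_add(1, Ordering::Relaxed);
    let k1 = KEY1.load(Ordering::Relaxed);
    (k0, k1)
}

fn main() {
    assert_ne!(keys_for_new_map(), keys_for_new_map());
}
```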
One could say that the incrementing was exactly this sort of non-crypto-safe RNG (and is about as fast as you can get). There are two factors at play here:

- an attacker being able to predict the keys/hashes and mount a collision (DoS) attack, and
- programmers observing a deterministic iteration order and coming to rely on it.

The first we are assuming (see e.g. #27243) is guaranteed by the combination of 128 bits of randomness plus the design of SipHash, so this discussion is all about the second.
I'm still a fan of this. @huonw is this ready to go?
The libs team discussed this during triage yesterday and the conclusion was that we'd like to land this, but keep some per-map nondeterminism (e.g. by bumping the cached keys for each new map) rather than having every map on a thread share identical keys.

Once that's in place this should be good to go, thanks @huonw!
I suggest preserving nondeterministic iteration order only in debug builds, not release builds. However, there's no rush to change the team's decision. (It seems the only flag for conditional compilation in debug builds is debug_assertions.)
alex's solution sounds good. Making it conditional on debug_assertions inside libstd doesn't affect most users: almost all users will use a libstd compiled with debug_assertions off; until recently not even our buildbots would test such a configuration.
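For reference, a hypothetical sketch of gating the per-map key bump on debug_assertions (illustrative only; as noted, it only takes effect in a libstd built with debug assertions enabled):

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread cached keys; imagine them seeded from the OS.
    static KEYS: Cell<(u64, u64)> =
        Cell::new((0x1234_5678_9ABC_DEF0, 0x0FED_CBA9_8765_4321));
}

fn keys_for_new_map() -> (u64, u64) {
    KEYS.with(|cell| {
        let keys = cell.get();
        // Only perturb the keys per map when debug assertions are on, so
        // release builds keep the fully cached, cheapest path.
        if cfg!(debug_assertions) {
            cell.set((keys.0.wrapping_add(1), keys.1));
        }
        keys
    })
}

fn main() {
    println!(
        "consecutive maps share keys: {}",
        keys_for_new_map() == keys_for_new_map()
    );
}
```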
☔ The latest upstream changes (presumably #32635) made this pull request unmergeable. Please resolve the merge conflicts.
I've started to continue this in #33318 so I'm gonna close this in favor of that.
This is a rebase and extension of rust-lang#31356 where we cache the keys in thread local storage. This should give us a nice speed boost in creating hash maps along with mostly retaining the property that all maps have a nondeterministic iteration order. Closes rust-lang#27243
std: Cache HashMap keys in TLS

This is a rebase and extension of #31356 where we not only cache the keys in thread local storage but we also bump each key every time a new `HashMap` is created. This should give us a nice speed boost in creating hash maps along with retaining the property that all maps have a nondeterministic iteration order. Closes #27243
This reduces how much rand is required in std, and makes creating
hash-maps noticeably faster. The first invocation of HashMap::new() in a
thread goes from ~130_000 ns to ~1500 ns (i.e. 64x faster) and later
invocations (non-reseeding ones) go from 10-20 ns to a more consistent
1.8-2.0ns.
The mean for many invocations in a loop, on a single thread, goes from
78ns to ~1.9ns, and the standard deviation drops from 2800ns to
33ns (the massive variance of the old scheme comes from the occasional
reseeding of the thread rng).
These new numbers are only slightly higher than creating the
`RandomState` outside the loop and calling `with_state` in it (i.e.
only measuring the non-random parts of hashmap initialisation): 1.3 and
18 ns respectively.
This new scheme has the slight downside of being consistent on a single
thread, so users may unintentionally rely on the order being
fixed (e.g. two hashmaps with the same contents).
Closes #27243.