Cache conscious hashmap table #36692
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override)
976004c
to
006c6ba
Compare
@@ -371,8 +370,7 @@ impl<K, V, M> EmptyBucket<K, V, M>
     pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
         unsafe {
             *self.raw.hash = hash.inspect();
-            ptr::write(self.raw.key as *mut K, key);
-            ptr::write(self.raw.val as *mut V, value);
+            ptr::write(self.raw.pair as *mut (K, V), (key, value));
It would feel more natural to have two writes here and skip making a tuple. Does it matter either direction for performance?
I looked at the disassembly and the end result seems to be the same for both versions:
for (usize, usize) both compile to a MOVDQU
for (usize, [u64; 10]) it's
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
unsafe {
*self.raw.hash = hash.inspect();
ec1d: 49 89 39 mov %rdi,(%r9)
ec20: 48 8b 8d 10 fe ff ff mov -0x1f0(%rbp),%rcx
ec27: 49 89 0c 24 mov %rcx,(%r12)
ec2b: 0f 28 85 50 ff ff ff movaps -0xb0(%rbp),%xmm0
ec32: 41 0f 11 44 24 48 movups %xmm0,0x48(%r12)
ec38: 0f 28 85 10 ff ff ff movaps -0xf0(%rbp),%xmm0
ec3f: 0f 28 8d 20 ff ff ff movaps -0xe0(%rbp),%xmm1
ec46: 0f 28 95 30 ff ff ff movaps -0xd0(%rbp),%xmm2
ec4d: 0f 28 9d 40 ff ff ff movaps -0xc0(%rbp),%xmm3
ec54: 41 0f 11 5c 24 38 movups %xmm3,0x38(%r12)
ec5a: 41 0f 11 54 24 28 movups %xmm2,0x28(%r12)
ec60: 41 0f 11 4c 24 18 movups %xmm1,0x18(%r12)
ec66: 41 0f 11 44 24 08 movups %xmm0,0x8(%r12)
ec6c: 4c 8b 75 b8 mov -0x48(%rbp),%r14
let pair_mut = self.raw.pair as *mut (K, V);
ptr::write(&mut (*pair_mut).0, key);
ptr::write(&mut (*pair_mut).1, value);
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
unsafe {
*self.raw.hash = hash.inspect();
ec1d: 49 89 39 mov %rdi,(%r9)
ec20: 48 8b 8d 10 fe ff ff mov -0x1f0(%rbp),%rcx
ec27: 49 89 0c 24 mov %rcx,(%r12)
ec2b: 0f 28 85 50 ff ff ff movaps -0xb0(%rbp),%xmm0
ec32: 41 0f 11 44 24 48 movups %xmm0,0x48(%r12)
ec38: 0f 28 85 10 ff ff ff movaps -0xf0(%rbp),%xmm0
ec3f: 0f 28 8d 20 ff ff ff movaps -0xe0(%rbp),%xmm1
ec46: 0f 28 95 30 ff ff ff movaps -0xd0(%rbp),%xmm2
ec4d: 0f 28 9d 40 ff ff ff movaps -0xc0(%rbp),%xmm3
ec54: 41 0f 11 5c 24 38 movups %xmm3,0x38(%r12)
ec5a: 41 0f 11 54 24 28 movups %xmm2,0x28(%r12)
ec60: 41 0f 11 4c 24 18 movups %xmm1,0x18(%r12)
ec66: 41 0f 11 44 24 08 movups %xmm0,0x8(%r12)
ec6c: 4c 8b 75 b8 mov -0x48(%rbp),%r14
let pair_mut = self.raw.pair as *mut (K, V);
ptr::write(pair_mut, (key, value));
for (String, usize) it's
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
unsafe {
*self.raw.hash = hash.inspect();
f670: 4d 89 20 mov %r12,(%r8)
f673: 48 8b 45 90 mov -0x70(%rbp),%rax
f677: 49 89 06 mov %rax,(%r14)
f67a: f3 41 0f 7f 46 08 movdqu %xmm0,0x8(%r14)
f680: 49 89 5e 18 mov %rbx,0x18(%r14)
f684: 48 8b 5d c8 mov -0x38(%rbp),%rbx
let pair_mut = self.raw.pair as *mut (K, V);
ptr::write(pair_mut, (key, value));
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
unsafe {
*self.raw.hash = hash.inspect();
f670: 4d 89 20 mov %r12,(%r8)
f673: 48 8b 45 90 mov -0x70(%rbp),%rax
f677: 49 89 06 mov %rax,(%r14)
f67a: f3 41 0f 7f 46 08 movdqu %xmm0,0x8(%r14)
f680: 49 89 5e 18 mov %rbx,0x18(%r14)
f684: 48 8b 5d c8 mov -0x38(%rbp),%rbx
let pair_mut = self.raw.pair as *mut (K, V);
// ptr::write(pair_mut, (key, value));
ptr::write(&mut (*pair_mut).0, key);
ptr::write(&mut (*pair_mut).1, value);
Nice. As usual, the compiler is very smart.
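For readers who want to try the two variants outside of `libstd`, here is a minimal sketch. It does not use the real `RawBucket` type; a `MaybeUninit<(K, V)>` slot stands in for `self.raw.pair`, and the names `put_tuple` and `put_fields` are made up for illustration. It only shows that both forms initialize the slot to the same value, not that they compile to the same instructions.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical helper: initialize the slot with one write of a tuple.
fn put_tuple<K, V>(slot: &mut MaybeUninit<(K, V)>, key: K, value: V) {
    // Moves the whole pair into the uninitialized slot in one go.
    unsafe { ptr::write(slot.as_mut_ptr(), (key, value)) }
}

/// Hypothetical helper: initialize the slot with two field writes.
fn put_fields<K, V>(slot: &mut MaybeUninit<(K, V)>, key: K, value: V) {
    let pair = slot.as_mut_ptr();
    unsafe {
        // addr_of_mut! takes field addresses without creating references
        // to memory that is not initialized yet.
        ptr::write(ptr::addr_of_mut!((*pair).0), key);
        ptr::write(ptr::addr_of_mut!((*pair).1), value);
    }
}

fn main() {
    let mut a = MaybeUninit::<(String, usize)>::uninit();
    let mut b = MaybeUninit::<(String, usize)>::uninit();
    put_tuple(&mut a, String::from("key"), 7);
    put_fields(&mut b, String::from("key"), 7);
    // Both slots are fully initialized at this point, so assume_init is sound.
    let (a, b) = unsafe { (a.assume_init(), b.assume_init()) };
    assert_eq!(a, b);
    println!("{:?} == {:?}", a, b);
}
```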
This subset is the .contains_key() benchmarks (retrieving the key but not the value).
So this would be the drawback, where the old layout had better cache usage. It seems ok to give this up in return for the rest?
Results for x86
x86 again (with usize hashes from #36595, thus 31 hash bits)
Maybe with bigger hashmaps? To make sure it's well out of the cpu cache size.
After the 3000x look I finally saw that iter_keys_big_value was busted, here are several others for good measure:
If the potential downside is wasted space, shouldn't there be some memory benchmarks as well?
 ///
-/// This design uses less memory and is a lot faster than the naive
+/// This design uses is a lot faster than the naive
"uses is"
 ///
-/// This design uses less memory and is a lot faster than the naive
+/// This design uses is a lot faster than the naive
 /// `Vec<Option<u64, K, V>>`, because we don't pay for the overhead of an
Is this supposed to say `Vec<Option<(u64, K, V)>>`?
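As a rough illustration of the overhead the doc comment alludes to, the sketch below compares the size of one slot in the naive `Vec<Option<(u64, K, V)>>` with one hash word plus one `(K, V)` pair, where a reserved hash value (the `EMPTY_BUCKET = 0` constant visible in the diff) marks an empty slot instead of an `Option` tag. The function names are hypothetical and the `u64` key and value types are chosen only for the example.

```rust
use std::mem::size_of;

/// Bytes per slot in the naive table: the Option tag is stored alongside
/// the (hash, key, value) triple.
fn naive_slot<K, V>() -> usize {
    size_of::<Option<(u64, K, V)>>()
}

/// Bytes per slot when a reserved hash value marks "empty":
/// one hash word plus one key/value pair, with no separate tag.
fn packed_slot<K, V>() -> usize {
    size_of::<u64>() + size_of::<(K, V)>()
}

fn main() {
    // With plain integers there is no spare bit pattern for the Option tag,
    // so the naive slot has to grow to hold it.
    println!("naive:  {} bytes", naive_slot::<u64, u64>());
    println!("packed: {} bytes", packed_slot::<u64, u64>());
    assert!(naive_slot::<u64, u64>() > packed_slot::<u64, u64>());
}
```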
@@ -48,12 +48,14 @@ const EMPTY_BUCKET: u64 = 0;
 /// which will likely map to the same bucket, while not being confused
 /// with "empty".
 ///
-/// - All three "arrays represented by pointers" are the same length:
+/// - All two "arrays represented by pointers" are the same length:
"Both"?
Thanks, I fixed all three.
006c6ba
to
9098c5c
Compare
cc @pczarn
@Veedrac PTAL
@arthurprs Seems like a solid improvement.
@rfcbot fcp merge
Looks like we've got solid wins all around to consider merging?
FCP proposed with disposition to merge. Review requested from: No concerns currently listed.
I'm happy to go along with the experts here.
Has anyone evaluated this on a real workload? The first one that comes to mind is of course rustc.
I'm not familiar enough with the bootstrap process, but if somebody provides some guidance I could do it.
Tip from simulacrum: we can use https://github.com/rust-lang-nursery/rustc-benchmarks to test the impact on rustc. Rustc building itself is a heavier (and more important?) benchmark, but I don't know exactly what to time there.
@arthurprs short of timing an execution
@arthurprs You can run
I agree with changing the memory layout. However, the tradeoffs are subtle. The benefits and drawbacks of this change depend on circumstances such as the sizes of keys and values. There is one more drawback that you didn't describe in detail. Let's say the user wants to iterate through HashMap's keys. The user will access every key, which will waste some memory and cache bandwidth on loading the map's values. So neither layout is truly cache conscious. Both are cache conscious in different ways. Of course you have to decide if the efficiency of the keys() and values() iterators is important enough to give the change to the layout a second thought. I think the benefits outweigh the drawbacks, because accessing single map entries is very common.
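The keys() cost described above can be put into numbers with a small sketch. It assumes the key stride is `size_of::<K>()` in the split layout and `size_of::<(K, V)>()` in the interleaved one, and it borrows `[u64; 10]` from the disassembly discussion as a stand-in for a big value; the function names are hypothetical.

```rust
use std::mem::size_of;

/// Distance in bytes between consecutive keys when keys have their own
/// array (hhhkkkvvv).
fn key_stride_split<K>() -> usize {
    size_of::<K>()
}

/// Distance in bytes between consecutive keys when each key is stored
/// next to its value (hhhkvkvkv).
fn key_stride_interleaved<K, V>() -> usize {
    size_of::<(K, V)>()
}

fn main() {
    type Big = [u64; 10]; // assumed "big value" type, for illustration only

    // Small values: the walk over keys doubles its stride.
    println!("u64 -> u64: {} vs {} bytes per key",
             key_stride_split::<u64>(),
             key_stride_interleaved::<u64, u64>());

    // Big values: every key drags its whole value through the cache.
    println!("u64 -> Big: {} vs {} bytes per key",
             key_stride_split::<u64>(),
             key_stride_interleaved::<u64, Big>());
}
```

A larger stride means fewer keys per cache line, which is consistent with the direction of the `iter_keys_big_value` results, although the sketch says nothing about their magnitude.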
I don't think those tests will be feasible on my laptop, especially considering the trial and error involved. I think the benefits far outweigh the drawbacks. There's potential to waste some padding, but in the real world that's frequently not the case (try using GitHub search in the rust repo and skim some pages). We shouldn't optimize for keys() and values(), and those will definitely take a hit (as per the benchmarks).
☔ The latest upstream changes (presumably #36753) made this pull request unmergeable. Please resolve the merge conflicts.
9098c5c
to
70f9b98
Compare
Nice work @arthurprs !
⌛ Testing commit c5068a4 with merge e33334f...
💔 Test failed - auto-linux-64-nopt-t
I'll fix it.
c5068a4
to
c435821
Compare
Travis is happy again.
@arthurprs Have you looked into why the buildbot tests failed? log link Since it's in the big testsuite and I don't see the PR changing anything there. It was unfortunately green on travis before, and still the buildbot build failed.
It's the nopt builder, so presumably related to debug assertions?
Yes, I should have said "CI should be happy".
@bors: r+
📌 Commit c435821 has been approved by
⌛ Testing commit c435821 with merge 2e0a3dc...
💔 Test failed - auto-linux-cross-opt
I'm not sure it's related to the PR.
@bors: retry
⌛ Testing commit c435821 with merge 2353987...
💔 Test failed - auto-win-gnu-32-opt-rustbuild
error: pretty-printing failed in round 0 revision None
@bors retry
⌛ Testing commit c435821 with merge 40cd1fd...
Cache conscious hashmap table

Right now the internal HashMap representation is 3 unzipped arrays hhhkkkvvv. I propose to change it to hhhkvkvkv (in further iterations kvkvkvhhh may allow in-place grow). A previous attempt is at #21973.

This layout is generally more cache conscious as it makes the value immediately accessible after a key matches. Keeping the hash array separate is a _no-brainer_ because of how the Robin Hood (RH) algorithm works, and that's unchanged.

**Lookups**: Upon a successful match in the hash array the code can check the key and immediately have access to the value in the same or next cache line (effectively saving an L[1,2,3] miss compared to the current layout).

**Inserts/Deletes/Resize**: Moving values in the table (robin hooding it) is faster because it touches consecutive cache lines and uses fewer instructions.

Some backing benchmarks (besides the ones below) for the benefits of this layout can be seen here as well: http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/

The obvious drawback is that padding can be wasted between the key and value. Because of that, keys(), values() and contains() can consume more cache and be slower.

Total wasted padding between items (C being the capacity of the table):
* Old layout: C * (K-K padding) + C * (V-V padding)
* Proposed: C * (K-V padding) + C * (V-K padding)

In practice padding between K-K and V-V *can* be smaller than K-V and V-K. The overhead is capped(ish) at sizeof u64 - 1, so we can actually measure the worst case (u8 at the end of the key type and a value with alignment of 1, _hardly the average case in practice_).

Starting from the worst case the memory overhead is:
* `HashMap<u64, u8>` 46% memory overhead. (aka *worst case*)
* `HashMap<u64, u16>` 33% memory overhead.
* `HashMap<u64, u32>` 20% memory overhead.
* `HashMap<T, T>` 0% memory overhead
* Worst case based on sizeof K + sizeof V:

| x              | 16     | 24     | 32     | 64    | 128   |
|----------------|--------|--------|--------|-------|-------|
| (8+x+7)/(8+x)  | 1.29   | 1.22   | 1.18   | 1.1   | 1.05  |

I've a test repo here to run benchmarks: https://github.com/arthurprs/hashmap2/tree/layout

```
➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt
 name                            hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff %
 grow_10_000                     922,064           783,933               -138,131  -14.98%
 grow_big_value_10_000           1,901,909         1,171,862             -730,047  -38.38%
 grow_fnv_10_000                 443,544           418,674                -24,870   -5.61%
 insert_100                      2,469             2,342                     -127   -5.14%
 insert_1000                     23,331            21,536                  -1,795   -7.69%
 insert_100_000                  4,748,048         3,764,305             -983,743  -20.72%
 insert_10_000                   321,744           290,126                -31,618   -9.83%
 insert_int_bigvalue_10_000      749,764           407,547               -342,217  -45.64%
 insert_str_10_000               337,425           334,009                 -3,416   -1.01%
 insert_string_10_000            788,667           788,262                   -405   -0.05%
 iter_keys_100_000               394,484           374,161                -20,323   -5.15%
 iter_keys_big_value_100_000     402,071           620,810                218,739   54.40%
 iter_values_100_000             424,794           373,004                -51,790  -12.19%
 iterate_100_000                 424,297           389,950                -34,347   -8.10%
 lookup_100_000                  189,997           186,554                 -3,443   -1.81%
 lookup_100_000_bigvalue         192,509           189,695                 -2,814   -1.46%
 lookup_10_000                   154,251           145,731                 -8,520   -5.52%
 lookup_10_000_bigvalue          162,315           146,527                -15,788   -9.73%
 lookup_10_000_exist             132,769           128,922                 -3,847   -2.90%
 lookup_10_000_noexist           146,880           144,504                 -2,376   -1.62%
 lookup_1_000_000                137,167           132,260                 -4,907   -3.58%
 lookup_1_000_000_bigvalue       141,130           134,371                 -6,759   -4.79%
 lookup_1_000_000_bigvalue_unif  567,235           481,272                -85,963  -15.15%
 lookup_1_000_000_unif           589,391           453,576               -135,815  -23.04%
 merge_shuffle                   1,253,357         1,207,387              -45,970   -3.67%
 merge_simple                    40,264,690        37,996,903          -2,267,787   -5.63%
 new                             6                 5                           -1  -16.67%
 with_capacity_10e5              3,214             3,256                       42    1.31%
```

```
➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt
 name                           hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff %
 iter_keys_100_000              391,677           382,839                 -8,838   -2.26%
 iter_keys_1_000_000            10,797,360        10,209,898            -587,462   -5.44%
 iter_keys_big_value_100_000    414,736           662,255                247,519   59.68%
 iter_keys_big_value_1_000_000  10,147,837        12,067,938           1,920,101   18.92%
 iter_values_100_000            440,445           377,080                -63,365  -14.39%
 iter_values_1_000_000          10,931,844        9,979,173             -952,671   -8.71%
 iterate_100_000                428,644           388,509                -40,135   -9.36%
 iterate_1_000_000              11,065,419        10,042,427          -1,022,992   -9.24%
```
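The padding figures in the description can be approximated with a back-of-the-envelope model. The sketch below is an assumption-laden reading of them, not the author's measurement: it charges each entry one 8-byte hash word, uses `size_of::<K>() + size_of::<V>()` for the old layout and `size_of::<(K, V)>()` for the proposed one, and all function names are hypothetical.

```rust
use std::mem::size_of;

/// Bytes per entry in the old layout: hash, key and value each sit in
/// their own tightly packed array.
fn split_bytes<K, V>() -> usize {
    size_of::<u64>() + size_of::<K>() + size_of::<V>()
}

/// Bytes per entry in the proposed layout: a hash array plus an array of
/// (K, V) pairs, including any padding inside the pair.
fn interleaved_bytes<K, V>() -> usize {
    size_of::<u64>() + size_of::<(K, V)>()
}

/// Extra memory of the proposed layout relative to the old one, in percent.
fn overhead_percent<K, V>() -> f64 {
    let old = split_bytes::<K, V>();
    let new = interleaved_bytes::<K, V>();
    (new - old) as f64 * 100.0 / old as f64
}

/// The worst-case ratio from the table: 7 bytes of padding on top of an
/// 8-byte hash and x bytes of key plus value.
fn worst_case_ratio(x: usize) -> f64 {
    (8 + x + 7) as f64 / (8 + x) as f64
}

fn main() {
    println!("HashMap<u64, u8>:  {:.1}%", overhead_percent::<u64, u8>());
    println!("HashMap<u64, u16>: {:.1}%", overhead_percent::<u64, u16>());
    println!("HashMap<u64, u32>: {:.1}%", overhead_percent::<u64, u32>());
    println!("HashMap<u64, u64>: {:.1}%", overhead_percent::<u64, u64>());
    for x in [16usize, 24, 32, 64, 128] {
        println!("x = {:3}: ratio {:.3}", x, worst_case_ratio(x));
    }
}
```

Under this model the `u16`, `u32` and `HashMap<T, T>` cases line up with the quoted 33%, 20% and 0%. The `u8` case comes out at about 41% rather than 46%, so the quoted worst-case figure was presumably obtained in some other way.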
All relevant subteam members have reviewed. No concerns remain.
It has been one week since all blocks to the FCP were resolved.