Make short string hashing 30% faster by splitting Hash::hash_end from Hash::hash #29139

sorear · 2015-10-18T09:25:07Z

Why?

Since hash functions are already designed to prevent collisions between a string and its prefixes, it's somewhat inelegant that we append a sentinel byte to strings for hashing. I was looking at #25237 a few days ago and realized that if we distinguish hashing contexts which are at the end of the key from those that aren't, we can suppress the sentinel byte (and also vector lengths) in the cases where they aren't needed; in addition to saving a byte of hashing, it saves a call to update and associated buffer-management overhead.
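To make the sentinel concrete, here is a minimal sketch that records the bytes a value feeds to its hasher. `Recorder` and `recorded` are hypothetical helper names, not part of std, and the exact bytes std writes for a string are an implementation detail, so the sketch relies only on the encoding being prefix-free.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical helper: a `Hasher` that stores every byte it is fed
/// instead of mixing it into a hash value.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // never used; only the recorded stream matters here
    }
}

/// Returns the byte stream that `value.hash(..)` sends to the hasher.
fn recorded<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash(&mut recorder);
    recorder.bytes
}
```

With this helper, `recorded("ab")` is longer than the two bytes of the string itself, and `recorded(&("ab", "c"))` differs from `recorded(&("a", "bc"))`. The extra marker after each string is what keeps the two tuples apart, and it is that marker which becomes redundant after the final field.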

This attacks the same problem as #28044.

How?

This adds a new method hash_end to the Hash trait, which behaves exactly as hash except that it need not produce a prefix-free encoding. It is always legal for hash_end to be the same as hash, and as such this is the default implementation. There are specialized implementations for strings and slices which remove the end/length markers.
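The shape of the change can be sketched roughly as follows. This is an approximation under stated assumptions, not the patch itself: the PR adds `hash_end` to `Hash` directly, whereas the hypothetical `HashEnd` trait below is separate so the sketch compiles on an unpatched toolchain. `Recorder` and `recorded_end` are likewise hypothetical helpers for inspecting the result.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the proposed method on `Hash`.
trait HashEnd: Hash {
    /// Like `hash`, but the encoding need not be prefix-free.
    /// Falling back to `hash` is always legal, so it is the default.
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.hash(state)
    }
}

/// Integers are fixed width, so the default is kept.
impl HashEnd for u32 {}

/// A string in final position can skip its end marker.
impl HashEnd for str {
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
    }
}

/// References forward to the value they point at.
impl<'a, T: HashEnd + ?Sized> HashEnd for &'a T {
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        (**self).hash_end(state)
    }
}

/// In a pair only the last field is "at the end"; the first field still
/// needs the prefix-free `hash`.
impl<A: Hash, B: HashEnd> HashEnd for (A, B) {
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state);
        self.1.hash_end(state);
    }
}

/// Hypothetical helper: records the bytes fed to the hasher.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // unused in this sketch
    }
}

/// Returns the byte stream that `value.hash_end(..)` sends to the hasher.
fn recorded_end<T: HashEnd + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash_end(&mut recorder);
    recorder.bytes
}
```

Here `recorded_end("ab")` yields just the two string bytes, while `recorded_end(&("ab", "c"))` and `recorded_end(&("a", "bc"))` still differ, because the first field keeps its marker.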

How much?

Here's a small benchmark script:

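use std::collections::HashSet;
use std::env;
use std::hash::{Hash, Hasher, SipHasher};

// Usage: pass the mode (0-4) as the first argument.
// Mode 1 calls hash_end, so it only builds with the patched compiler.
fn main() {
    let args: Vec<String> = env::args().collect();
    let mut acc = 0u64;
    match &*args[1] {
        // (0) Allocation only: format and drop the strings, no hashing.
        "0" => {
            for i in 1..10_000_000 {
                acc += format!("{}", i).len() as u64;
            }
        }
        // (1) Hash each string with the new hash_end (no sentinel byte).
        "1" => {
            for i in 1..10_000_000 {
                let mut h = SipHasher::new();
                format!("{}", i).hash_end(&mut h);
                // wrapping_add keeps unoptimized builds from panicking on overflow
                acc = acc.wrapping_add(h.finish());
            }
        }
        // (2) Hash each string with the existing hash.
        "2" => {
            for i in 1..10_000_000 {
                let mut h = SipHasher::new();
                format!("{}", i).hash(&mut h);
                acc = acc.wrapping_add(h.finish());
            }
        }
        // (3) HashSet insertions.
        "3" => {
            let mut s = HashSet::new();
            for i in 1..10_000_000 {
                s.insert(format!("{}", i));
            }
            acc = s.len() as u64;
        }
        // (4) HashSet queries against a small set; most lookups miss.
        "4" => {
            let mut s = HashSet::new();
            for i in 1..100_000 {
                s.insert(format!("{}", i));
            }
            for i in 1..10_000_000 {
                if s.contains(&format!("{}", i)) {
                    acc += 1;
                }
            }
        }
        _ => {}
    }
    // Print the accumulator so the optimizer cannot discard the loops.
    println!("{}", acc);
}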

I ran it in each mode on the patched and baseline Rust compilers (with -O, on x86_64 OS X), taking the median of 27 runs each time, for the following timings:

            (0)   (1)   (2)   (3)   (4)
PATCHED   0.864 1.133 1.276 5.564 1.619
BASELINE  0.852 ----- 1.298 5.654 1.736

Subtracting out the baseline (0) case which just allocates and frees strings, it looks like a 34% improvement on short string hashing, 13% on hashset queries, and 2% on hashset insertions. Uncertainty for the medians seems to be around 10ms.
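As a rough check of that arithmetic, the sketch below recomputes the percentages from the table. The comment does not say exactly which cells were compared, so the pairing used here (patched mode 2 against patched mode 1 for hashing, baseline against patched for the HashSet modes) is an assumption; it lands close to, but not exactly on, the quoted figures. The function and constant names are hypothetical.

```rust
/// Fraction of the non-allocation time saved: the difference between the
/// two measurements, relative to `before` minus the allocation-only
/// `floor` (mode 0).
fn improvement(before: f64, after: f64, floor: f64) -> f64 {
    (before - after) / (before - floor)
}

/// Median timings copied from the table above; columns are modes (0)..(4).
/// `NAN` marks the baseline cell for mode (1), which has no measurement.
const PATCHED: [f64; 5] = [0.864, 1.133, 1.276, 5.564, 1.619];
const BASELINE: [f64; 5] = [0.852, f64::NAN, 1.298, 5.654, 1.736];

/// hash_end versus hash on the patched compiler: about 0.35.
fn short_string_hashing() -> f64 {
    improvement(PATCHED[2], PATCHED[1], PATCHED[0])
}

/// HashSet queries, baseline versus patched: about 0.13.
fn hashset_queries() -> f64 {
    improvement(BASELINE[4], PATCHED[4], BASELINE[0])
}

/// HashSet insertions, baseline versus patched: about 0.02.
fn hashset_insertions() -> f64 {
    improvement(BASELINE[3], PATCHED[3], BASELINE[0])
}
```

Each function returns a fraction; multiplying by 100 gives the percentage. With a stated uncertainty of about 10 ms per median, differences of a percentage point or so between these results and the quoted ones are within the noise.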

What's the catch?

Naturally, this changes hash values. It's not clear how big a deal that is, especially with regard to semver.
More importantly: anybody who needs two types to have the same hash values (in particular Borrow implementers) can no longer generally do so by forwarding the hash method; hash_end must be forwarded as well. This situation exists exactly once in the compiler. #[derive(Hash)] has been modified to forward hash_end, so newtype-ish wrappers will just work (outside of the compiler; the compiler needs to implement hash_end itself when it's needed for Borrow, because we can't rely on the stage0 to do it.)
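For context, this is the kind of code the caveat is about, written against today's `Hash` trait. `Name` and `contains_name` are hypothetical names for illustration. The lookup works only because `Name` hashes exactly like `str`; under the proposal the same forwarding would presumably have to be repeated for `hash_end`.

```rust
use std::borrow::Borrow;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical newtype used only for illustration.
#[derive(PartialEq, Eq)]
struct Name(String);

impl Hash for Name {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Forward to the inner string so `Name` hashes exactly like `str`.
        // Under the proposal, `hash_end` would need the same forwarding.
        self.0.hash(state);
    }
}

impl Borrow<str> for Name {
    fn borrow(&self) -> &str {
        &self.0
    }
}

/// Builds a set of `Name`s and probes it with a plain `&str`.
fn contains_name(names: &[&str], probe: &str) -> bool {
    let set: HashSet<Name> = names.iter().map(|n| Name(n.to_string())).collect();
    set.contains(probe)
}
```

For example, `contains_name(&["alice", "bob"], "bob")` finds the entry without building a `Name` for the probe. If `Name` and `str` ever hashed differently, the lookup would silently miss.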

Wait!

hash_end should probably be feature gated. It isn't in this version of the patch, because when I tried feature-gating it, deriving broke; I'm not sure how to tell rustc to ignore feature gates in deriving-generated code.

I'm not sure whether this belongs as an RFC.

By distinguishing the end hash operations from middle hash operations, we can avoid hashing unnecessary sentinels. For instance, (String, String) only needs a 0xFF in the middle, not at the end.

rust-highfive · 2015-10-18T09:25:21Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

bluss · 2015-10-18T11:29:19Z

> More importantly: anybody who needs two types to have the same hash values (in particular Borrow implementers) can no longer generally do so by forwarding the hash method; hash_end must be forwarded as well.

I'm a bit worried this will run into the same problem @gankro found in the first attempt: this is a breaking change for any external impls of Hash, especially those that try to hash the same as str or slices and do so by method forwarding; they would now need to define this method too.

I'm absolutely interested in any approach to solve or improve short input hashing and accommodating other hash algorithms.

I offer one argument in favour of the approach in my PR: whether to care about prefix-freeness or not, and how to solve it, should be a property of the Hasher, not of the value to be hashed (the Hash trait). I also think it has a much lower backward-compatibility risk.

alexcrichton · 2015-10-19T21:19:30Z

I think that with this and #28044 we may be at the point where we should hold off for an RFC to work through the design space here. I'm personally a little unsure about what the constraints are and, e.g., where it falls down today.

sorear · 2015-10-21T06:03:30Z

@alexcrichton What would the path forward for that be? Shall I reformat my version of the proposal as an RFC and take it there?

alexcrichton · 2015-10-21T16:18:06Z

@sorear yeah I think that may be the best path forward, you may want to work with @bluss on the RFC and at least have a mention of #28044 in the alternatives section

bors · 2015-10-25T18:38:35Z

☔ The latest upstream changes (presumably #29254) made this pull request unmergeable. Please resolve the merge conflicts.

bstrie · 2015-10-27T03:09:40Z

@sorear Always happy to see yet another DCSS developer here. :P

brson · 2015-11-23T21:28:48Z

Seems like an RFC was desired. Closing.

Split Hash::hash_end from Hash::hash

a0aef13

rust-highfive assigned brson Oct 18, 2015

brson closed this Nov 23, 2015

sorear mentioned this pull request Jul 2, 2016

hashmap: use siphash-1-3 as default hasher #33940

Merged

pczarn mentioned this pull request Jul 4, 2016

Extend the Hasher trait with fn delimit to support one-shot hashing rust-lang/rfcs#1666

Closed
