First phase of a Hash reimplementation #11863

erickt · 2014-01-28T04:52:23Z

This PR merges IterBytes and Hash into a trait that allows for generic non-stream-based hashing. It makes use of @eddyb's default type parameter support in order to have a similar usage to the old Hash framework.

Fixes #8038.

Todo:

- [x] Better documentation
- [ ] Benchmark
- [ ] Parameterize HashMap on a Hasher.

huonw · 2014-01-28T12:43:58Z

src/libstd/hash2/mod.rs

+}
+
+/// The `StreamHash` represents a value that can be hashed by iterating over
+/// the bytes of the value.


s/the bytes of the value/its bytes/

erickt · 2014-01-31T16:59:14Z

@huonw / @thestinger: I just pushed an update. Can you re-review? I'll squash the commits down once it looks good.

huonw · 2014-02-01T01:40:28Z

src/libstd/hash2/mod.rs

+
+/// The `Hasher` trait represents an object that can compute a hash of another
+/// value.
+pub trait Hasher<H, Result> {


I'm ... not sure about Result. I guess it depends what we're trying to use this for, i.e. just HashMap or a more general hashing interface.

If it's just HashMap (and similar structures) then I imagine that Result will essentially always be u64, and so there's not much point having it generic. (Under this scheme, hashes that return u32, like, say, xxHash, would just have the high bits zeroed, and hashes that return larger results, like an AES based one, can just truncate.)

Of course, if we're trying to aim this for more general use, then Result is probably OK.

I added Result so that it could support other hashing schemes like xxHash or CRC32. I figured that there might be circumstances where it may be more efficient to deal with u32 values. It doesn't add that much syntactic overhead since I doubt too many people will be directly using Hasher, so I think it's worth keeping it around.

Ok, sounds fine.

I'm slightly concerned that it will make things less composable (i.e. HashMap will always want u64, so one cannot use xxHash (where Result = u32) with it directly), but I guess we can always define some thin adaptors that convert u32 to u64, etc.
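A minimal sketch of the "thin adaptor" idea, written in present-day Rust rather than the 2014 dialect of the diff. `FixedHasher`, `Fnv32`, and `Widen` are hypothetical names, the result type is modelled as an associated type instead of the `Result` type parameter under review, and FNV-1a is used only as a stand-in for a 32-bit hash such as xxHash.

```rust
/// Hypothetical trait echoing `Hasher<H, Result>`; the result type is an
/// associated type here, which is an assumption of this sketch.
pub trait FixedHasher {
    type Output;
    fn hash_bytes(&self, bytes: &[u8]) -> Self::Output;
}

/// Toy 32-bit hasher (FNV-1a), standing in for any hash that returns u32.
pub struct Fnv32;

impl FixedHasher for Fnv32 {
    type Output = u32;
    fn hash_bytes(&self, bytes: &[u8]) -> u32 {
        let mut h: u32 = 0x811c_9dc5; // FNV-1a 32-bit offset basis
        for &b in bytes {
            h ^= b as u32;
            h = h.wrapping_mul(0x0100_0193); // FNV-1a 32-bit prime
        }
        h
    }
}

/// Thin adaptor: widens any hasher whose output converts losslessly to u64.
pub struct Widen<H>(pub H);

impl<H> FixedHasher for Widen<H>
where
    H: FixedHasher,
    H::Output: Into<u64>,
{
    type Output = u64;
    fn hash_bytes(&self, bytes: &[u8]) -> u64 {
        // For a u32 result this zero-extends, so the high 32 bits stay zero.
        self.0.hash_bytes(bytes).into()
    }
}
```

A table that always wants `u64` could then accept `Widen(Fnv32)` wherever it accepts a native 64-bit hasher, at the cost of one extra wrapper type per width.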

huonw · 2014-02-01T08:39:40Z

I was away for your ping on IRC about [T] for T: Pod, and no, I don't think casting to [u8] will work, since e.g. [&T] contains Pod values, but we don't want to hash the elements based on the pointer values (which casting to &[u8] would do).
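To illustrate the concern, here is a small sketch in present-day Rust (not the 2014 API). `hash_addresses` and `hash_values` are hypothetical helpers: the first feeds the hasher the addresses stored in a `[&u32]`, which is roughly what a byte-level cast would hash, and the second hashes the pointed-to values.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the raw addresses held in the slice, approximating what a
/// byte-level cast of `[&T]` would feed to the hasher.
pub fn hash_addresses(items: &[&u32]) -> u64 {
    let mut state = DefaultHasher::new();
    for item in items {
        (*item as *const u32 as usize).hash(&mut state);
    }
    state.finish()
}

/// Hash the values behind the references instead.
pub fn hash_values(items: &[&u32]) -> u64 {
    let mut state = DefaultHasher::new();
    for item in items {
        (**item).hash(&mut state);
    }
    state.finish()
}
```

Two equal values stored at different addresses agree under `hash_values` but, in general, not under `hash_addresses`, which would break lookups by value.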

brson · 2014-02-01T18:33:01Z

I'd not expose result_bytes and result_str if they aren't needed outside siphash.
What are the tradeoffs for resettable stream hashes?
What is the benefit of Hasher::hash2 consuming the hasher?
I'm not sure what resetting entails but it sounds like that could defeat some of the cryptographic properties of SipHash. Is there a term for this kind of behavior that we could point out in the docs?

brson · 2014-02-01T18:34:19Z

This looks good. I'm glad to get rid of IterBytes. This does introduce a number of hash traits that most people will never need to know about. Are they going to end up deriving StreamHash, and if so does this provide any generality that is going to go unused (and can be simplified)?

brson · 2014-02-01T18:34:59Z

Would be good to implement additional hashing strategies besides sip hash to prove this architecture, maybe starting with http://www.reddit.com/r/rust/comments/1wqjsf/more_xxhash_benchmarks/

pcwalton · 2014-02-01T21:43:44Z

Can't we have the hash table store an instance of SipHash with the setup phase already done with the hash table's key, and just clone an instance of that for every hashing operation to avoid repeating the key setup process every time? I assume that SipHash works on the same general principle behind CBC mode, so that should be doable, no?
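A rough sketch of the "set up once, clone per operation" shape, in present-day Rust. `KeyedHashing` is a hypothetical helper, and mixing the table key into the state as a prefix is an assumption made for illustration only; it is not how SipHash is actually keyed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical table-side helper: holds one pre-initialised hasher state
/// and clones it for each key instead of rebuilding it from scratch.
pub struct KeyedHashing {
    prototype: DefaultHasher,
}

impl KeyedHashing {
    pub fn new(table_key: u64) -> Self {
        let mut prototype = DefaultHasher::new();
        // Mix the per-table key into the state once, up front.
        table_key.hash(&mut prototype);
        KeyedHashing { prototype }
    }

    pub fn hash_one<T: Hash + ?Sized>(&self, value: &T) -> u64 {
        let mut state = self.prototype.clone(); // copy of the prepared state
        value.hash(&mut state);
        state.finish()
    }
}
```

Whether this saves anything depends on how much work the hash does before the first block, which is the question the thread goes on to discuss.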

thestinger · 2014-02-02T03:05:00Z

@pcwalton: I was saying "setup" in a misleading way, as I was including the minimum level of work done before and after the actual compression in various hashes.

Some of the other hashes do some work before, but SipHash doesn't. SipHash2-4 runs 2 rounds on each block during compression and then 4 finalization rounds, so that's the minimum level of overhead I'm talking about. We could consider using a weaker form of SipHash like 1-1.
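For reference, a compact sketch of SipHash with the round counts exposed as parameters, so that `siphash(2, 4, ...)` corresponds to SipHash2-4 and `siphash(1, 1, ...)` to the weaker 1-1 variant mentioned above. It is written in present-day Rust and is an illustration, not the implementation in this PR; the function names are this sketch's own.

```rust
/// One SipRound over the four 64-bit state words.
fn sip_round(v: &mut [u64; 4]) {
    v[0] = v[0].wrapping_add(v[1]);
    v[1] = v[1].rotate_left(13) ^ v[0];
    v[0] = v[0].rotate_left(32);
    v[2] = v[2].wrapping_add(v[3]);
    v[3] = v[3].rotate_left(16) ^ v[2];
    v[0] = v[0].wrapping_add(v[3]);
    v[3] = v[3].rotate_left(21) ^ v[0];
    v[2] = v[2].wrapping_add(v[1]);
    v[1] = v[1].rotate_left(17) ^ v[2];
    v[2] = v[2].rotate_left(32);
}

/// Absorb one 64-bit message word using `c` compression rounds.
fn compress(v: &mut [u64; 4], m: u64, c: usize) {
    v[3] ^= m;
    for _ in 0..c {
        sip_round(v);
    }
    v[0] ^= m;
}

/// SipHash-c-d of `msg` under the 128-bit key `(k0, k1)`.
pub fn siphash(c: usize, d: usize, k0: u64, k1: u64, msg: &[u8]) -> u64 {
    // No key schedule: initialisation is four XORs with fixed constants.
    let mut v = [
        k0 ^ 0x736f_6d65_7073_6575,
        k1 ^ 0x646f_7261_6e64_6f6d,
        k0 ^ 0x6c79_6765_6e65_7261,
        k1 ^ 0x7465_6462_7974_6573,
    ];
    let mut chunks = msg.chunks_exact(8);
    for chunk in &mut chunks {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        compress(&mut v, u64::from_le_bytes(word), c);
    }
    // Final block: leftover bytes, with the message length in the top byte.
    let rest = chunks.remainder();
    let mut last = [0u8; 8];
    last[..rest.len()].copy_from_slice(rest);
    last[7] = msg.len() as u8;
    compress(&mut v, u64::from_le_bytes(last), c);
    // Finalization: `d` rounds after flipping the low byte of v2.
    v[2] ^= 0xff;
    for _ in 0..d {
        sip_round(&mut v);
    }
    v[0] ^ v[1] ^ v[2] ^ v[3]
}
```

Even an empty message still costs one compression block plus the `d` finalization rounds, which is the fixed overhead being described for short keys.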

pcwalton · 2014-02-02T04:13:18Z

Well in that case why not try running AESENC in CBC mode for short keys of 16 bytes or less (all node IDs)? Create a nonce, run AES via AES-NI on it, save that in the hash table, then when we want to hash a key xor it with the ciphertext of the nonce and then run AES on that. I wonder what the performance would be for small keys.

Disclaimer: I'm not going to invent a cryptosystem without consulting with crypto gurus; this is basically just a sketch of an idea and may have atrocious performance anyway. Do not implement this.

pcwalton · 2014-02-02T04:56:27Z

@thestinger Just ran some tests here. Even all 10 rounds of AES is significantly faster than SipHash for 16-byte values in CBC mode!

Here's the AES test: https://gist.github.com/pcwalton/8763232
And here's the SipHash test: https://gist.github.com/pcwalton/8763238

Results: AES finishes in 2.238s, while SipHash finishes in 3.096s.

erickt · 2014-02-08T17:50:22Z

It's taken a while, but I finally pushed up a version that allows a HashMap to be used with Equiv types. I'm still not thrilled with the design though because I now need to copy-paste multiple impls of Hash for each stream type. It seems like the only way to implement Hash once is to do:

impl<S: Stream, H: StreamHasher<S>, T: StreamHash<S>> Hash<S> for T { ... }

But that prevents other types from implementing Hash because of #9075. Can anyone think up a better way to implement this?

erickt · 2014-02-08T19:43:08Z

Figured out a way to get rid of the copy-pasted implementations of Hash :)

huonw · 2014-02-08T22:31:51Z

In summary (to check my understanding): your trick basically promotes StreamHash<H> to Hash<H>, removing the requirement that H: Stream?

(If so, looks good.)

brson · 2014-02-09T08:25:15Z

Since these are very important types we need to proceed carefully, please. I'd like to have confidence about this direction, but don't understand the issues. Hoping to get more review and opinions.

huonw · 2014-02-09T11:34:57Z

Hm, I just thought of something, doesn't impl<S: Stream> Hash<S> for uint also disallow impl Hash<SuperAwesomeUintHasher> for uint? i.e. we would be restricted to using streaming hashes with the built in types.

erickt · 2014-02-09T16:59:23Z

@huonw: Sorry I had to run when I posted that last code, so I didn't really go into a good description. My realization was that hash::Stream can be used directly as a hasher as long as the method hash doesn't return a value. This lets me remove H: StreamHasher<S> from the typarams.

Regarding impl Hash<SuperAwesomeUintHasher> for uint, yeah, we wouldn't be able to write that. I suppose one option to support it would be to go back to more of an Encodable-esque approach that I had in an old gist of mine. Instead of having a StreamHasher, we could have:

trait Hasher {
    fn hash_u8(&mut self, x: u8);
    fn hash_u16(&mut self, x: u16);
    ...
}

trait Hash<H: Hasher> {
    fn hash(&self, state: &mut H);
}

I had decided against that route because I thought the StreamHash was a bit simpler. I'll spend a couple minutes playing around with it to see if I can get it to work.
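The Encodable-esque shape can be sketched end to end in present-day Rust as follows. The two traits follow the outline in this comment; the `finish` method, the toy `Fnv64` hasher, and the `Port` type are assumptions added so the example can run.

```rust
/// Hasher that receives typed values rather than a byte stream.
/// `finish` is an assumption of this sketch, not part of the outline above.
pub trait Hasher {
    fn hash_u8(&mut self, x: u8);
    fn hash_u16(&mut self, x: u16);
    fn finish(&self) -> u64;
}

pub trait Hash<H: Hasher> {
    fn hash(&self, state: &mut H);
}

/// Toy hasher (64-bit FNV-1a) used only to exercise the traits.
pub struct Fnv64(u64);

impl Fnv64 {
    pub fn new() -> Self {
        Fnv64(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
    fn mix(&mut self, byte: u8) {
        self.0 ^= byte as u64;
        self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
}

impl Hasher for Fnv64 {
    fn hash_u8(&mut self, x: u8) {
        self.mix(x);
    }
    fn hash_u16(&mut self, x: u16) {
        // This hasher decides how a u16 becomes bytes (little-endian here).
        for &b in x.to_le_bytes().iter() {
            self.mix(b);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

impl<H: Hasher> Hash<H> for u8 {
    fn hash(&self, state: &mut H) {
        state.hash_u8(*self);
    }
}

impl<H: Hasher> Hash<H> for u16 {
    fn hash(&self, state: &mut H) {
        state.hash_u16(*self);
    }
}

/// Hypothetical composite type: hashes field by field.
pub struct Port {
    pub proto: u8,
    pub number: u16,
}

impl<H: Hasher> Hash<H> for Port {
    fn hash(&self, state: &mut H) {
        self.proto.hash(state);
        self.number.hash(state);
    }
}
```

The trade-off is a wider `Hasher` trait (one method per primitive) in exchange for letting each hasher choose how primitives are handled.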

@brson: I completely agree. I don't think this should be merged until we get a lot of input. I believe I have an approach, but I'm not sure if it's the right approach yet.

erickt · 2014-02-10T19:27:33Z

This is getting really close to my final design. It takes advantage of @eddyb's default type parameters in order to make the code compatible with the old Hash trait. I've also fleshed out the StreamState to include the ability for a StreamState to take care of converting a value into bytes. This allows it to overload how it'll hash primitive types like u64 or a &str.
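A small sketch of how a defaulted type parameter keeps the common case looking like the old single-parameter trait, in present-day Rust. `DefaultState` stands in for the PR's SipHash state (it is a toy accumulator, not SipHash), and `CountingState` and `Point` are hypothetical.

```rust
/// Stand-in for the default hashing state; a toy FNV-style accumulator.
pub struct DefaultState {
    acc: u64,
}

impl DefaultState {
    pub fn new() -> Self {
        DefaultState { acc: 0xcbf2_9ce4_8422_2325 }
    }
    pub fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.acc = (self.acc ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    pub fn result(&self) -> u64 {
        self.acc
    }
}

/// `Hash` with a defaulted state parameter.
pub trait Hash<S = DefaultState> {
    fn hash(&self, state: &mut S);
}

pub struct Point {
    pub x: u8,
    pub y: u8,
}

// No parameter named: this is `impl Hash<DefaultState> for Point`.
impl Hash for Point {
    fn hash(&self, state: &mut DefaultState) {
        state.write(&[self.x, self.y]);
    }
}

/// A second, non-default state that just counts the bytes it would see.
pub struct CountingState {
    pub bytes_seen: usize,
}

impl Hash<CountingState> for Point {
    fn hash(&self, state: &mut CountingState) {
        state.bytes_seen += 2;
    }
}
```

Code that never mentions a state type gets the default, while code that needs a different state names it explicitly.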

Things left to do:

Merge hashmap.rs and hashmap2.rs
Merge #[deriving(Hash, StreamHash)] into #[deriving(Hash)]
Change #[deriving(Hash)] to insert #[allow(default_type_params)]
Improve the documentation.
Migrate everything over to the new hash code, and remove the old one.
Parameterize HashMap with the Hasher.
Get @cmr to benchmark the before/after of this PR.
#[deriving(Hash)]'s Hash implementation should insert a use std::hash::{HashState, StreamHash} into the body.

erickt · 2014-02-15T06:52:26Z

After a long gestation, this is ready for review for merging now!

huonw · 2014-02-15T06:54:51Z

src/libsyntax/ext/deriving/generic.rs

@@ -196,6 +196,8 @@ pub struct TraitDef<'a> {
    /// The span for the current #[deriving(Foo)] header.
    span: Span,

+    attributes: ~[ast::Attribute],


@pcwalton would much prefer this to be Vec<ast::Attribute> (~[] is being removed, so no need to add more)

alexcrichton · 2014-02-20T04:55:08Z

src/libstd/io/mod.rs

@@ -969,6 +969,16 @@ pub trait Writer {
    fn write_i8(&mut self, n: i8) -> IoResult<()> {
        self.write([n as u8])
    }
+
+    /// Write a pointer.


Could you mention the endianness here? (and below)
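As an illustration of why the byte order is worth documenting, here is a sketch in present-day Rust. `write_le_usize` and `write_be_usize` are hypothetical free functions, not the `Writer` methods from the diff; note that the number of bytes written also depends on the platform's pointer width.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes a pointer-sized integer, least significant
/// byte first.
pub fn write_le_usize<W: Write>(w: &mut W, n: usize) -> io::Result<()> {
    w.write_all(&n.to_le_bytes())
}

/// Hypothetical helper: writes a pointer-sized integer, most significant
/// byte first.
pub fn write_be_usize<W: Write>(w: &mut W, n: usize) -> io::Result<()> {
    w.write_all(&n.to_be_bytes())
}
```

The two functions emit the same bytes in opposite order, so a hash computed over one encoding will not match a hash computed over the other.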

alexcrichton · 2014-02-20T05:00:07Z

This is looking fantastic to me, very nice work!

I think with a run-pass test for deriving(Hash) and a few comments here and there this is good to go.

alexcrichton · 2014-02-20T05:01:28Z

Some squashings would be nice, but this is a PR with a long history, so I'll leave it up to you (I'll r+ regardless)

erickt · 2014-02-20T05:03:50Z

One thing to point out with this latest version is I've removed hashing support for floats. The old hashing library made sure to hash 0.0 and -0.0 to the same value. If we wanted this new framework to have the same behavior, we'd either have to:

Copy-paste this special case into all hashers
Remove the special case and treat 0.0 and -0.0 as separate values that hash differently.
Remove hashing of floats

I chose the last one because @brson pointed out the plan is to switch HashMap over to using TotalEq (#5283), which will prevent floats from being used in a key. Would anyone else prefer the second option?

erickt · 2014-02-20T05:04:37Z

@alexcrichton: My plan is to squash everything together once it all looks good. I wanted to keep the history long so it would be easier to do variations.

alexcrichton · 2014-02-20T05:05:58Z

Wouldn't something like this work?

impl Hash for f32 {
    fn hash(&self, state: &mut SipState) {
        if *self == -0.0 {
            state.write_le_f32(0.0);
        } else {
            state.write_le_f32(*self);
        }
    }
}

Whenever you want to hash a float you'd need to call float.hash(state) instead of state.write_le_f32(float), but that should be fairly easy to remember regardless, right?
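The same idea can be sketched against today's standard library, where `f32` has no `Hash` impl. `HashableF32` and `hash_f32` are hypothetical names, and NaN handling is deliberately left out of this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical newtype that gives `f32` a `Hash` impl.
pub struct HashableF32(pub f32);

impl Hash for HashableF32 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // `0.0 == -0.0` is true, so this branch catches both zeros and
        // replaces them with positive zero before the bits are hashed.
        let normalized = if self.0 == 0.0 { 0.0f32 } else { self.0 };
        state.write_u32(normalized.to_bits());
    }
}

/// Convenience wrapper that hashes one float with the default hasher.
pub fn hash_f32(x: f32) -> u64 {
    let mut state = DefaultHasher::new();
    HashableF32(x).hash(&mut state);
    state.finish()
}
```

The two zeros have different bit patterns, so without the normalization step they would hash differently even though they compare equal.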

alexcrichton · 2014-02-20T05:42:51Z

src/libstd/hash/sip.rs

+}
+
+/// `Sip` computes the SipHash algorithm from a stream of bytes.
+pub struct SipStateFactory {


SipStateHasher?

alexcrichton · 2014-02-20T05:54:17Z

This is awesome. r=me with comments addressed

This patch merges IterBytes and Hash traits, which clears up the confusion of using `#[deriving(IterBytes)]` to support hashing. Instead, it now is much easier to use the new `#[deriving(Hash)]` for making a type hashable with a stream hash. Furthermore, it supports custom non-stream-based hashers, such as if a value's hash was cached in a database. This does not yet replace the old IterBytes-hash with this new version.
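As a rough illustration of the "cached hash" case, here is a sketch against today's standard library rather than the API in this patch. `Cached`, `PassThrough`, and `CachedMap` are hypothetical names.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

/// Record whose hash was computed elsewhere (for example, stored in a
/// database alongside the row).
#[derive(PartialEq, Eq)]
pub struct Cached {
    pub cached_hash: u64,
    pub name: String,
}

impl Hash for Cached {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed only the precomputed value; `name` is never streamed.
        state.write_u64(self.cached_hash);
    }
}

/// Pass-through hasher: returns the last `u64` it was given.
#[derive(Default)]
pub struct PassThrough(u64);

impl Hasher for PassThrough {
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("PassThrough only accepts write_u64");
    }
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

/// Map keyed by `Cached` that reuses the stored hash on every lookup.
pub type CachedMap<V> = HashMap<Cached, V, BuildHasherDefault<PassThrough>>;
```

This only makes sense when the cached values are already well distributed; the pass-through hasher adds no mixing of its own.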

pcwalton · 2014-02-22T21:09:26Z

This is likely to be a big performance improvement in Servo. Well done: 💯

brson · 2014-02-23T01:10:45Z

@erickt congrats! This was a huge effort.

emberian · 2014-02-23T01:41:51Z

woo!

huonw mentioned this pull request Jan 29, 2014

Implement a fast HashMap / hash function #11783

Closed

erickt added 5 commits February 21, 2014 19:57

extra: rename Uuid::to_bytes() to as_bytes()

8b81510

std: minor whitespace cleanup

87f936f

syntax: Allow syntax extensions to have attributes

bb8721d

syntax: add syntax extension helper to make simple view items

0135b52

std: fix the hash doctest

ca6d512

bors closed this Feb 23, 2014

This was referenced Feb 24, 2014

Updated to latest Rust PistonDevelopers/glfw-rs#103

Merged

Updated to latest Rust rustgd/cgmath#48

Merged
