Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First phase of a Hash reimplementation #11863

Closed
wants to merge 6 commits into from
Closed

Conversation

erickt
Copy link
Contributor

@erickt erickt commented Jan 28, 2014

This PR merges IterBytes and Hash into a trait that allows for generic non-stream-based hashing. It makes use of @eddyb's default type parameter support in order to have a similar usage to the old Hash framework.

Fixes #8038.

Todo:

  • Better documentation
  • Benchmark
  • Parameterize HashMap on a Hasher.

}

/// The `StreamHash` represents a value that can be hashed by iterating over
/// the bytes of the value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/the bytes of the value/its bytes/

@erickt
Copy link
Contributor Author

erickt commented Jan 31, 2014

@huonw / @thestinger: I just pushed an update. Can you re-review? I'll squash the commits down once it looks good.


/// The `Hasher` trait represents an object that can compute a hash of another
/// value.
pub trait Hasher<H, Result> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ... not sure about Result. I guess it depends what we're trying to use this for, i.e. just HashMap or a more general hashing interface.

If it's just HashMap (and similar structures) then I imagine that Result will essentially always be u64, and so there's not much point having it generic. (Under this scheme, hashes that return u32, like, say, xxHash, would just have the high bits zeroed, and hashes that return larger results, like an AES based one, can just truncate.)

Of course, if we're trying to aim this for more general use, then Result is probably OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added Result so that it could support other hashing schemes like xxHash or CRC32. I figured that there might be circumstances where it may be more efficient to deal with u32 bytes. It doesn't add that much syntactic overhead since I doubt too many people will be directly using Hasher, so I think it's worth keeping it around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds fine.

I'm slightly concerned that it will make things less composable (i.e. HashMap will always want u64 so one cannot use xxHash (where Result = u64) with it directly), but we can always define some thin adaptors that convert u32 to u64, etc. I guess.

@huonw
Copy link
Member

huonw commented Feb 1, 2014

I was away for your ping on IRC about [T] for T: Pod, and no, I don't think think casting to [u8] will work; since e.g. [&T] contains Pod values, but we don't want to hash the elements based on the pointer values (which casting to &[u8] would do).

@brson
Copy link
Contributor

brson commented Feb 1, 2014

  • I'd not expose result_bytes and result_str if they aren't needed outside siphash.
  • What are the tradeoffs for resettable stream hashes?
  • What is the benefit of Hasher::hash2 consuming the hasher?
  • I'm not sure what resetting entails but it sounds like that could defeat some of the cryptographic properties of SipHash. Is there a term for this kind of behavior that we could point out in the docs?

@brson
Copy link
Contributor

brson commented Feb 1, 2014

This looks good. I'm glad to get rid of IterBytes. This does introduce a number of hash traits that most people will never need to know about. Are they going to end up deriving StreamHash, and if so does this provide any generality that is going to go unused (and can be simplified)?

@brson
Copy link
Contributor

brson commented Feb 1, 2014

Would be good to implement additional hashing strategies besides sip hash to prove this architecture, maybe starting with http://www.reddit.com/r/rust/comments/1wqjsf/more_xxhash_benchmarks/

@pcwalton
Copy link
Contributor

pcwalton commented Feb 1, 2014

Can't we have the hash table store an instance of SipHash with the setup phase already done with the hash table's key, and just clone an instance of that for every hashing operation to avoid repeating the key setup process every time? I assume that SipHash works on the same general principle behind CBC mode, so that should be doable, no?

@thestinger
Copy link
Contributor

@pcwalton: I was saying "setup" in a misleading way, as I was including the minimum level of work done before and after the actual compression in various hashes.

Some of the other hashes do some work before, but SipHash doesn't. SipHash2-4 runs 2 rounds on each block during compression and then 4 finalization rounds, so that's the minimum level of overhead I'm talking about. We could consider using a weaker form of SipHash like 1-1.

@pcwalton
Copy link
Contributor

pcwalton commented Feb 2, 2014

Well in that case why not try running AESENC in CBC mode for short keys of 16 bytes or less (all node IDs)? Create a nonce, run AES via AES-NI on it, save that in the hash table, then when we want to hash a key xor it with the ciphertext of the nonce and then run AES on that. I wonder what the performance would be for small keys.

Disclaimer: I'm not going to invent a cryptosystem without consulting with crypto gurus; this is basically just a sketch of an idea and may have atrocious performance anyway. Do not implement this.

@pcwalton
Copy link
Contributor

pcwalton commented Feb 2, 2014

@thestinger Just ran some tests here. Even all 10 rounds of AES is significantly faster than SipHash for 16-byte values in CBC mode!

Here's the AES test: https://gist.github.com/pcwalton/8763232
And here's the SipHash test: https://gist.github.com/pcwalton/8763238

Results: AES finishes in 2.238s, while SipHash finishes in 3.096s.

@erickt
Copy link
Contributor Author

erickt commented Feb 8, 2014

It's taken a while, but I finally pushed up a version that allows a HashMap to be used with Equiv types. I'm still not thrilled with the design though because I now need to copy-paste multiple impls of Hash for each stream type. It seems like the only way to implement Hash once is to do:

impl<S: Stream, H: StreamHasher<S>, T: StreamHash<S> Hash<S> for T { ... }

But that prevents other types from implementing Hash because of #9075. Can anyone think up a better way to implement this?

@erickt
Copy link
Contributor Author

erickt commented Feb 8, 2014

Figured out a way to get rid of the copy-pasted implementations of Hash :)

@huonw
Copy link
Member

huonw commented Feb 8, 2014

In summary (to check my understanding): your trick basically promotes StreamHash<H> to Hash<H>, removing the requirement that H: Stream?

(If so, looks good.)

@brson
Copy link
Contributor

brson commented Feb 9, 2014

Since these are very important types we need to proceed carefully, please. I'd like to have confidence about this direction, but don't understand the issues. Hoping to get more review and opinions.

@huonw
Copy link
Member

huonw commented Feb 9, 2014

Hm, I just thought of something, doesn't impl<S: Stream> Hash<S> for uint also disallow impl Hash<SuperAwesomeUintHasher> for uint? i.e. we would be restricted to using streaming hashes with the built in types.

@erickt
Copy link
Contributor Author

erickt commented Feb 9, 2014

@huonw: Sorry I had to run when I posted that last code, so I didn't really go into a good description. My realization was that hash::Stream can be used directly as a hasher as long as the method hash doesn't return a value. This lets me remove H: StreamHasher<S> from the typarams.

Regarding impl Hash<SuperAwesomeUintHasher> for uint, yeah we wouldn't be able to write that. I suppose one option would be to support it is we could go back to more of a Encodable-esque approach that I had in an old gist of mine. Instead of having a StreamHasher, we could have:

trait Hasher {
    fn hash_u8(x: u8);
    fn hash_u16(x: u16);
    ...
}

trait Hash<H: Hasher> {
    fn hash(&self, &mut H);
}

I had decided against that route because I thought the StreamHash was a bit simpler. I'll spend a couple minutes playing around with it to see if I can get it to work.

@brson: I completely agree. I don't think this should be merged after we get a lot of input. I believe I have a approach, but I'm not sure if it's the right approach yet.

@erickt
Copy link
Contributor Author

erickt commented Feb 10, 2014

This is getting really close to my final design. It takes advantage of @eddyb's default type parameters in order to make the code compatible with the old Hash trait. I've also fleshed out the StreamState to include the ability for a StreamState to take care of converting a value into bytes. This allows it to overload how it'll hash primitive types like u64 or a &str.

Things left to do:

  • Merge hashmap.rs and hashmap2.rs
  • Merge #[deriving(Hash, StreamHash)] into #[deriving(Hash)]
  • Change #[deriving(Hash)] to insert #[allow(default_type_params)]
  • Improve the documentation.
  • Migrate everything over to the new hash code, and remove the old one.
  • Parameterize HashMap with the Hasher.
  • Get @cmr to benchmark the before/after of this PR.
  • #[deriving(Hash)]'s Hash implementation should insert a use std::hash::{HashState, StreamHash} into the body.

@erickt
Copy link
Contributor Author

erickt commented Feb 15, 2014

After a long gestation, this is ready for review for merging now!

@@ -196,6 +196,8 @@ pub struct TraitDef<'a> {
/// The span for the current #[deriving(Foo)] header.
span: Span,

attributes: ~[ast::Attribute],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcwalton would much prefer this to be Vec<ast::Attribute> (~[] is being removed, so no need to add more)

@@ -969,6 +969,16 @@ pub trait Writer {
fn write_i8(&mut self, n: i8) -> IoResult<()> {
self.write([n as u8])
}

/// Write a pointer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you mention the endianness here? (and below)

@alexcrichton
Copy link
Member

This is looking fantastic to me, very nice work!

I think with a run-pass test for deriving(Hash) and a few comments here and there this is good to go.

@alexcrichton
Copy link
Member

Some squashings would be nice, but this is a PR with a long history, so I'll leave it up to you (I'll r+ regardless)

@erickt
Copy link
Contributor Author

erickt commented Feb 20, 2014

One thing to point out with this latest version is I've removed hashing support for floats. The old hashing library made sure to hash 0.0 and -0.0 to the same value. If we wanted this new framework to have the same behavior, we'd either have to:

  • Copy-paste this special case into all hashers
  • Remove the special case and treat 0.0 and -0.0 as separate hashers.
  • Remove hashing of floats

I chose the last one because @brson pointed out the plan is to switch HashMap over to using TotalEq (#5283), which will prevent floats from being used in a key. Would anyone else prefer the second option?

@erickt
Copy link
Contributor Author

erickt commented Feb 20, 2014

@alexcrichton: My plan is to squash everything together once it all looks good. I wanted to keep the history long so it would be easier to do variations.

@alexcrichton
Copy link
Member

Wouldn't something like this work?

impl Hash for f32 {
    fn hash(&self, state: &mut SipState) {
        if *self == -0.0 {
            state.write_le_f32(0.0);
        } else {
            state.write_le_f32(*self);
        }
    }
}

Whenever you want to hash a float you'd need to call float.hash(state) instead of state.write_le_f32(float), but that should be fairly easy to remember regardless, right?

}

/// `Sip` computes the SipHash algorithm from a stream of bytes.
pub struct SipStateFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SipStateHasher?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SipHasher*

@alexcrichton
Copy link
Member

This is awesome. r=me with comments addressed

This patch merges IterBytes and Hash traits, which clears up the
confusion of using `#[deriving(IterBytes)]` to support hashing.
Instead, it now is much easier to use the new `#[deriving(Hash)]`
for making a type hashable with a stream hash.

Furthermore, it supports custom non-stream-based hashers, such as
if a value's hash was cached in a database.

This does not yet replace the old IterBytes-hash with this new
version.
@pcwalton
Copy link
Contributor

This is likely to be a big performance improvement in Servo. Well done: 💯

bors added a commit that referenced this pull request Feb 22, 2014
This PR merges `IterBytes` and `Hash` into a trait that allows for generic non-stream-based hashing. It makes use of @eddyb's default type parameter support in order to have a similar usage to the old `Hash` framework.

Fixes #8038.

Todo:

- [x] Better documentation
- [ ] Benchmark
- [ ] Parameterize `HashMap` on a `Hasher`.
@bors bors closed this Feb 23, 2014
@brson
Copy link
Contributor

brson commented Feb 23, 2014

@erickt congrats! This was a huge effort.

@emberian
Copy link
Member

woo!

On Sat, Feb 22, 2014 at 8:10 PM, Brian Anderson notifications@github.comwrote:

@erickt https://github.com/erickt congrats! This was a huge effort.


Reply to this email directly or view it on GitHubhttps://github.com//pull/11863#issuecomment-35820639
.

bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 1, 2023
Nit re `matches!` formatting

I think formatting `matches!` with `if` guards is [still unsupported](rust-lang/rustfmt#5547), which is probably why this was missed.

changelog: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]?
8 participants