First phase of a Hash reimplementation #11863
/// The `StreamHash` represents a value that can be hashed by iterating over
/// the bytes of the value.
s/the bytes of the value/its bytes/
@huonw / @thestinger: I just pushed an update. Can you re-review? I'll squash the commits down once it looks good.
/// The `Hasher` trait represents an object that can compute a hash of another
/// value.
pub trait Hasher<H, Result> {
I'm ... not sure about `Result`. I guess it depends what we're trying to use this for, i.e. just `HashMap` or a more general hashing interface.
If it's just `HashMap` (and similar structures) then I imagine that `Result` will essentially always be `u64`, and so there's not much point having it generic. (Under this scheme, hashes that return `u32`, like, say, xxHash, would just have the high bits zeroed, and hashes that return larger results, like an AES based one, can just truncate.)
Of course, if we're trying to aim this for more general use, then `Result` is probably OK.
I added `Result` so that it could support other hashing schemes like xxHash or CRC32. I figured that there might be circumstances where it may be more efficient to deal with `u32` bytes. It doesn't add that much syntactic overhead since I doubt too many people will be directly using `Hasher`, so I think it's worth keeping it around.
Ok, sounds fine.
I'm slightly concerned that it will make things less composable (i.e. `HashMap` will always want `u64` so one cannot use xxHash (where `Result = u32`) with it directly), but we can always define some thin adaptors that convert `u32` to `u64`, etc. I guess.
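The thin adaptor idea from this thread could look roughly like the sketch below. It is written in present-day Rust rather than the 2014 syntax of this PR, and `Hasher32`, `Hasher64`, `Widen`, and `Fnv1a32` are hypothetical names introduced only for illustration. The toy hasher is FNV-1a, standing in for a 32-bit hash such as xxHash.

```rust
/// Hypothetical trait for a hasher that produces a 32-bit result.
pub trait Hasher32 {
    fn hash32(&self, bytes: &[u8]) -> u32;
}

/// Hypothetical trait for what a `HashMap`-like container would require.
pub trait Hasher64 {
    fn hash64(&self, bytes: &[u8]) -> u64;
}

/// Thin adaptor: zero-extends a 32-bit hash so it can be used where
/// a 64-bit hash is expected (the high bits are always zero).
pub struct Widen<H>(pub H);

impl<H: Hasher32> Hasher64 for Widen<H> {
    fn hash64(&self, bytes: &[u8]) -> u64 {
        u64::from(self.0.hash32(bytes))
    }
}

/// Toy 32-bit hasher (FNV-1a), used here only as a stand-in.
pub struct Fnv1a32;

impl Hasher32 for Fnv1a32 {
    fn hash32(&self, bytes: &[u8]) -> u32 {
        let mut h: u32 = 0x811c_9dc5; // FNV-1a 32-bit offset basis
        for &b in bytes {
            h ^= u32::from(b);
            h = h.wrapping_mul(0x0100_0193); // FNV-1a 32-bit prime
        }
        h
    }
}
```

With this shape, a container that wants 64-bit hashes could accept `Widen(Fnv1a32)` without the 32-bit hasher knowing anything about the container.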
I was away for your ping on IRC about
This looks good. I'm glad to get rid of IterBytes. This does introduce a number of hash traits that most people will never need to know about. Are they going to end up deriving
Would be good to implement additional hashing strategies besides sip hash to prove this architecture, maybe starting with http://www.reddit.com/r/rust/comments/1wqjsf/more_xxhash_benchmarks/
Can't we have the hash table store an instance of SipHash with the setup phase already done with the hash table's key, and just clone an instance of that for every hashing operation to avoid repeating the key setup process every time? I assume that SipHash works on the same general principle behind CBC mode, so that should be doable, no?
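The "store a pre-keyed hasher and clone it per operation" idea can be sketched as follows. This uses today's `std::collections::hash_map::DefaultHasher` rather than the PR's SipHash types, and `PreKeyed` is a hypothetical helper. Feeding a seed through `write_u64` is only a stand-in for key setup, not how SipHash is actually keyed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: holds a hasher that has already absorbed the
/// table's key material, so each lookup only pays for a clone.
pub struct PreKeyed {
    prototype: DefaultHasher,
}

impl PreKeyed {
    /// Performs the (stand-in) key setup once.
    pub fn new(seed: u64) -> Self {
        let mut prototype = DefaultHasher::new();
        prototype.write_u64(seed);
        PreKeyed { prototype }
    }

    /// Clones the prepared state and hashes one key with the copy,
    /// leaving the prototype untouched for the next call.
    pub fn hash_of<K: Hash + ?Sized>(&self, key: &K) -> u64 {
        let mut hasher = self.prototype.clone();
        key.hash(&mut hasher);
        hasher.finish()
    }
}
```

Whether this saves anything depends on how much work the hash does before compression starts, which the replies in this thread go on to discuss.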
@pcwalton: I was saying "setup" in a misleading way, as I was including the minimum level of work done before and after the actual compression in various hashes. Some of the other hashes do some work before, but SipHash doesn't. SipHash2-4 runs 2 rounds on each block during compression and then 4 finalization rounds, so that's the minimum level of overhead I'm talking about. We could consider using a weaker form of SipHash like 1-1.
Well in that case why not try running AESENC in CBC mode for short keys of 16 bytes or less (all node IDs)? Create a nonce, run AES via AES-NI on it, save that in the hash table, then when we want to hash a key xor it with the ciphertext of the nonce and then run AES on that. I wonder what the performance would be for small keys. Disclaimer: I'm not going to invent a cryptosystem without consulting with crypto gurus; this is basically just a sketch of an idea and may have atrocious performance anyway. Do not implement this.
@thestinger Just ran some tests here. Even all 10 rounds of AES is significantly faster than SipHash for 16-byte values in CBC mode! Here's the AES test: https://gist.github.com/pcwalton/8763232 Results: AES finishes in 2.238s, while SipHash finishes in 3.096s.
It's taken a while, but I finally pushed up a version that allows a
But that prevents other types from implementing
Figured out a way to get rid of the copy-pasted implementations of
In summary (to check my understanding): your trick basically promotes (If so, looks good.)
Since these are very important types we need to proceed carefully, please. I'd like to have confidence about this direction, but don't understand the issues. Hoping to get more review and opinions.
Hm, I just thought of something, doesn't
@huonw: Sorry I had to run when I posted that last code, so I didn't really go into a good description. My realization was that Regarding
I had decided against that route because I thought the @brson: I completely agree. I don't think this should be merged until we get a lot of input. I believe I have an approach, but I'm not sure if it's the right approach yet.
This is getting really close to my final design. It takes advantage of @eddyb's default type parameters in order to make the code compatible with the old Hash trait. I've also fleshed out the Things left to do:
After a long gestation, this is ready for review for merging now!
@@ -196,6 +196,8 @@ pub struct TraitDef<'a> {
    /// The span for the current #[deriving(Foo)] header.
    span: Span,

    attributes: ~[ast::Attribute],
@pcwalton would much prefer this to be `Vec<ast::Attribute>` (`~[]` is being removed, so no need to add more)
@@ -969,6 +969,16 @@ pub trait Writer {
    fn write_i8(&mut self, n: i8) -> IoResult<()> {
        self.write([n as u8])
    }

    /// Write a pointer.
Could you mention the endianness here? (and below)
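To illustrate why the doc comment should state the byte order, here is a small sketch in present-day Rust. `write_le_u32` and `write_be_u32` are hypothetical helpers, not the PR's `Writer` methods, and a fixed-width `u32` is used instead of a pointer-sized integer so the output is the same on every platform.

```rust
use std::io::{self, Write};

/// Writes `n` in little-endian byte order (least significant byte first).
pub fn write_le_u32<W: Write>(w: &mut W, n: u32) -> io::Result<()> {
    w.write_all(&n.to_le_bytes())
}

/// Writes `n` in big-endian byte order (most significant byte first).
pub fn write_be_u32<W: Write>(w: &mut W, n: u32) -> io::Result<()> {
    w.write_all(&n.to_be_bytes())
}
```

The same value produces different byte streams depending on the order chosen, so a hash built on top of such writes is only stable across machines if the order is fixed and documented.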
This is looking fantastic to me, very nice work! I think with a
Some squashings would be nice, but this is a PR with a long history, so I'll leave it up to you (I'll r+ regardless)
One thing to point out with this latest version is I've removed hashing support for floats. The old hashing library made sure to hash
I chose the last one because @brson pointed out the plan is to switch
@alexcrichton: My plan is to squash everything together once it all looks good. I wanted to keep the history long so it would be easier to do variations.
Wouldn't something like this work?

    impl Hash for f32 {
        fn hash(&self, state: &mut SipState) {
            if *self == -0.0 {
                state.write_le_f32(0.0);
            } else {
                state.write_le_f32(*self);
            }
        }
    }

Whenever you want to hash a float you'd need to call
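For comparison, the same zero-normalizing idea can be sketched in present-day Rust. `f32` does not implement `Hash` in the standard library, so this uses a hypothetical wrapper `HashableF32` and a hypothetical helper `hash_f32`; it is an illustration of the suggestion above, not code from this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper that gives `f32` a `Hash` implementation.
pub struct HashableF32(pub f32);

impl Hash for HashableF32 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // `0.0 == -0.0` is true, so this branch catches both zeros and
        // maps them to the same bit pattern before hashing.
        let normalized = if self.0 == 0.0 { 0.0f32 } else { self.0 };
        state.write(&normalized.to_bits().to_le_bytes());
    }
}

/// Hashes one float with the standard library's default hasher.
pub fn hash_f32(x: f32) -> u64 {
    let mut hasher = DefaultHasher::new();
    HashableF32(x).hash(&mut hasher);
    hasher.finish()
}
```

This sketch only addresses the two zeros. NaN values have many bit patterns and never compare equal to themselves, so they would need a separate decision.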
}

/// `Sip` computes the SipHash algorithm from a stream of bytes.
pub struct SipStateFactory {
SipStateHasher?
SipHasher*
This is awesome. r=me with comments addressed
This patch merges IterBytes and Hash traits, which clears up the confusion of using `#[deriving(IterBytes)]` to support hashing. Instead, it now is much easier to use the new `#[deriving(Hash)]` for making a type hashable with a stream hash. Furthermore, it supports custom non-stream-based hashers, such as if a value's hash was cached in a database. This does not yet replace the old IterBytes-hash with this new version.
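As a rough illustration of the two cases this commit message describes, the sketch below uses present-day Rust, where the attribute is spelled `#[derive(Hash)]` rather than `#[deriving(Hash)]`. `Point`, `Cached`, and `hash_of` are hypothetical names, and the "hash cached in a database" case is approximated by a type that feeds only its stored hash to the hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derived implementation: every field is fed to the hasher in order.
#[derive(Hash)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

/// Hypothetical record whose hash was computed elsewhere, for example
/// stored alongside it in a database.
pub struct Cached {
    pub precomputed: u64,
    pub payload: String,
}

impl Hash for Cached {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Only the stored hash is used; the payload is never walked.
        state.write_u64(self.precomputed);
    }
}

/// Hashes any `Hash` value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Two `Cached` values with the same `precomputed` field hash identically regardless of payload, which is the point of skipping the byte stream.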
This is likely to be a big performance improvement in Servo. Well done: 💯
This PR merges `IterBytes` and `Hash` into a trait that allows for generic non-stream-based hashing. It makes use of @eddyb's default type parameter support in order to have a similar usage to the old `Hash` framework. Fixes #8038.

Todo:
- [x] Better documentation
- [ ] Benchmark
- [ ] Parameterize `HashMap` on a `Hasher`.
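The last todo item, parameterizing `HashMap` on a hasher, is how the standard library works today, so a present-day sketch may help show where the design was heading. `Fnv1a64` and `FnvMap` are hypothetical names, and FNV-1a is used only because it is short; none of this is code from the PR.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Toy 64-bit FNV-1a hasher implementing the standard `Hasher` trait.
pub struct Fnv1a64 {
    state: u64,
}

impl Default for Fnv1a64 {
    fn default() -> Self {
        // FNV-1a 64-bit offset basis.
        Fnv1a64 { state: 0xcbf2_9ce4_8422_2325 }
    }
}

impl Hasher for Fnv1a64 {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state ^= u64::from(b);
            // FNV-1a 64-bit prime.
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// A `HashMap` whose third type parameter selects the hasher.
pub type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a64>>;
```

A map built with `FnvMap::default()` behaves like any other `HashMap` but hashes its keys with the supplied hasher instead of the default SipHash-based one.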
@erickt congrats! This was a huge effort.
woo!