
Extend the Hasher trait with fn delimit to support one-shot hashing #1666

Closed
wants to merge 5 commits into from

Conversation

pczarn

@pczarn pczarn commented Jul 4, 2016


Related work
rust-lang/rust#28044 "WIP: Hash and Hasher update for faster common case hashing"
rust-lang/rust#29139 "Make short string hashing 30% faster by splitting Hash::hash_end from Hash::hash"
contain-rs/hashmap2#5 "Implement adaptive hashing using specialization"

@pczarn pczarn changed the title Extend the Hasher trait with fn delimit for one-shot hashing Extend the Hasher trait with fn delimit to support one-shot hashing Jul 4, 2016
@pczarn pczarn force-pushed the one-shot-hashing branch from b3f6791 to b90731c on July 4, 2016 09:40
@nrc nrc added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jul 5, 2016
@alexcrichton
Member

Thanks for the RFC @pczarn! Could you also expand the RFC with some details about how implementations of Hash or usage of Hasher through HashMap would change? I'm at least personally a little fuzzy on the details and would benefit from seeing some concrete instances of what would change.

Additionally, would it be possible to prototype this change and take a look at some performance benchmarks as well? Those would probably be the most convincing for ensuring that a change like this is merged quickly.

@pczarn
Author

pczarn commented Jul 5, 2016

@alexcrichton This RFC as-is doesn't concern current usage. Only the definition and some implementations of Hasher would change. The wording of this RFC may be confusing, since it mentions usage that will be introduced in another RFC. I'll revisit it tomorrow.
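To make the shape of the change concrete, here is a minimal sketch of the kind of trait extension under discussion. The `delimit` signature and the `CountingHasher` type below are assumptions for illustration, not quoted from the RFC text; a standalone trait stands in for the proposed change to `std::hash::Hasher`:

```rust
use std::hash::Hasher;

// Hypothetical extension trait standing in for the RFC's proposed change to
// `std::hash::Hasher`; the signature is assumed for illustration.
trait DelimitHasher: Hasher {
    // Default: keep today's behaviour and feed the delimiter into the stream.
    fn delimit(&mut self, len: usize) {
        self.write_usize(len);
    }
}

// A toy hasher that just counts bytes, overriding `delimit` as a no-op the
// way a one-shot or less rigorous hasher could.
struct CountingHasher(u64);

impl Hasher for CountingHasher {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) { self.0 += bytes.len() as u64; }
}

impl DelimitHasher for CountingHasher {
    fn delimit(&mut self, _len: usize) {} // skip the delimiter entirely
}

fn demo() -> u64 {
    let mut h = CountingHasher(0);
    h.write(b"abc");
    h.delimit(3); // no-op here; the default would have fed in 8 more bytes
    h.finish()
}

fn main() {
    assert_eq!(demo(), 3);
    println!("ok");
}
```

The point is that only the Hasher side changes: Hash implementations would call `delimit` where they currently write delimiter bytes, and streaming hashers keep the default body.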

@arthurprs

arthurprs commented Jul 6, 2016

I came up with this arthurprs/hash-rs@1a5bfd2 in order to test the outcome of this change.

Results: http://i.imgur.com/N5AVi12.png
EDIT: here's a better graph (linear Y axis) http://imgur.com/a/0ruST

Yes, it's that good.

Now that we have sip13 in core we won't even need any other algorithm for hashing the bytes in the specialized form, the good old write + finish is probably enough.

@sfackler
Member

sfackler commented Jul 6, 2016

The motivation talks about one-shot hashing a lot as motivation, but then doesn't seem to follow up on how the addition of delimit will help the one-shot case. Even if this RFC isn't proposing any additions to the standard library other than delimit, it would still be helpful to go into some explicit detail on how consumers of these APIs will benefit.

@pczarn
Author

pczarn commented Jul 6, 2016

I made an update. I hope the motivation is more clear.

@arthurprs You implemented an alternative listed in the RFC. The main goal of this RFC is basically refactoring -- there's nothing to benchmark.

Is Sip13 fast enough for all purposes? Good question.

@sfackler What do you mean by "consumers of these APIs", specifically? Did you see the update?

@arthurprs

@pczarn right, but it gives a realistic preview of the speedup unlocked by this proposal.

@bluss
Member

bluss commented Jul 10, 2016

Thanks for reviving this. We need to discuss some of the drawbacks.

You can do this to make it possible to use the FarmHasher for multi-part things:

self.result ^= farmhash::hash64(msg);  // xor is the simplest possible mixer, could be something else

Either with that or without it, a drawback remains: The Hash trait is formulated so that it is used to "stream" bytes to a hasher in slices of bytes. The implicit protocol is that it doesn't matter how you slice your bytes, and equivalent stream of bytes should hash the same way. There's no way for the usual farmhash algorithm to support that, so either way you use it as a hasher together with the Hash trait, you break the rules of the Hash trait a bit.
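The implicit protocol described here can be checked directly against the std default hasher (SipHash-backed), which is stream-based and therefore insensitive to how the bytes are sliced:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// The "implicit protocol": a byte-stream hasher gives the same result no
// matter how the input bytes are sliced across `write` calls.
fn split_vs_whole() -> (u64, u64) {
    let mut a = DefaultHasher::new();
    a.write(b"x");
    a.write(b"y");

    let mut b = DefaultHasher::new();
    b.write(b"xy");

    (a.finish(), b.finish())
}

fn main() {
    let (x, y) = split_vs_whole();
    // A one-shot algorithm like FarmHash, mixed once per `write` call,
    // would generally break this equality.
    assert_eq!(x, y);
    println!("ok");
}
```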

@alexcrichton alexcrichton self-assigned this Jul 25, 2016
@alexcrichton
Member

@pczarn the libs team talked about this awhile ago, but our conclusion was that while the motivation seems sound we were a little confused as to how this would affect the standard library. Right now this seems to be adding a bit of complexity but then not taking advantage of it? Could you expand a bit on how the standard library (e.g. HashMap) would change, and if so, reference some benchmarks in the RFC text?

@arthurprs

arthurprs commented Aug 16, 2016

@alexcrichton I agree it's a bit confusing in the current form; that's why I was trying to show another face of the improvement in #1666 (comment) above.

In that commit/test I modified write_usize to skip over the first delimiter; the same could be done for write_delimiter, giving big improvements on small keys. Of course this applies to all hashers.

Oneshot hashers benefit from this even more (as finish involves little to no work), but would still have to mix possible n-part hashes as @bluss said above.

@alexcrichton
Member

Yeah we were basically just thinking it'd be good for the RFC itself to encode the performance wins and/or changes to HashMap (if any)

@bluss
Member

bluss commented Aug 17, 2016

Is there a way to attack this problem with specialization? So that we can reach one-shot hashing for precisely the types that support it.

@pczarn
Author

pczarn commented Aug 17, 2016

Yeah we were basically just thinking it'd be good for the RFC itself to encode the performance wins and/or changes to HashMap (if any)

Exactly. There are no substantial changes in this RFC on its own.

Is there a way to attack this problem with specialization? So that we can reach one-shot hashing for precisely the types that support it.

A trait method with a default definition will reach one-shot hashing for the hashers that need it. Using specialization is possible, but does not give any advantage.

@bluss
Member

bluss commented Aug 17, 2016

What about specializing a trait for types that are compatible with one-shot hashing. For example str/String would be, but compound types that contain str/String are not.

@pczarn
Author

pczarn commented Aug 17, 2016

Yes, the plan is to make such a marker trait for specialization. The code for that trait is in contain-rs/hashmap2#5: https://github.com/pczarn/hashmap2/blob/e8fb6cacde8dcb166b009057afce7588b8d27a77/src/adaptive_map.rs#L194

@alexcrichton
Member

@pczarn unfortunately though the libs team was hesitant about considering for a merge as-is. It wasn't clear why we would do this as it doesn't detail the follow-up work and how it relates to HashMap, possibly. Could the RFC text be expanded with that?

@arthurprs

@pczarn

do you plan to pursue this PR further? I think you are onto something here. Hashing a usize for every slice is ridiculous performance-wise. Preventing the write_u8(0xFF) for strings would provide a significant gain too.
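For concreteness, the write_u8(0xFF) mentioned here is the positional delimiter libstd feeds after a str's bytes. A quick check with the default hasher (a sketch relying on libstd implementation details) shows both that a delimiter exists and what it buys:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn str_hash(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn bytes_only(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes()); // just the payload, no delimiter
    h.finish()
}

fn pair_hash(a: &str, b: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (a, b).hash(&mut h);
    h.finish()
}

fn main() {
    // A delimiter is fed after the string bytes (historically a 0xff byte),
    // so hashing a str is not the same as hashing only its bytes...
    assert_ne!(str_hash("ab"), bytes_only("ab"));
    // ...and that is what keeps ("ab", "c") and ("a", "bc") from producing
    // the same byte stream, hence the same hash.
    assert_ne!(pair_hash("ab", "c"), pair_hash("a", "bc"));
    println!("ok");
}
```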

@burdges

burdges commented Oct 21, 2016

I do like the idea of hash functions offering a special delimiter input, but..

Afaik there are no common hash functions that offer such a delimiter input, and one cannot provide it using a hash function's existing API that hides the state, as doing so allows delimiters to be faked.

We discussed hashing recently in #1768, but I can rehash ;) the direction taken there:

There was a suggestion that hashing should be built on a writer trait that cannot fail, perhaps a general-purpose writer trait with an associated error type set to !, à la type HashWriter = Writer<E=!>; or whatever. A writer trait makes sense because protocols require cross-platform hashing, including specifying endianness and being careful around isize and usize. Also, there are hashing algorithms that can fail on startup, say if Argon2 cannot allocate enough memory, possibly gigabytes, but maybe those can be ignored. Afaik none can fail during operation, though, so the hasher input methods cannot fail, which rules out the existing writer methods or crates like byteorder.

There is a wider array of hash outputs than simply a u64, too, so any finish should return an associated type, à la trait Hasher: HashWriter { type F; fn finish(&self) -> Self::F; }. Just a few examples: almost all cryptographic hash functions return a [u8; n], with several different modes for different n. SHA3's SHAKE mode returns an arbitrary byte stream, maybe a Reader<E=!> where Reader is a reader trait that cannot fail. Pairing-based cryptography commonly uses hash functions that produce an element of the elliptic curve, meaning some big but fixed-size struct. In RSA signatures, you sometimes use a full-domain hash, meaning an integer mod the RSA modulus, so finish should return a closure that takes an additional argument.

In this vein, one might handle delimiters with a trait that provides a delimit function. In fact, it could be a trait HashWriterDelimit: HashWriter { ... } whose delimit function mutably borrows the hash as a second HashWriter for delimiter input, but maybe that's needlessly complex.

@arthurprs

Afaik there are no common hash functions that offer such a delimiter input, and one cannot provide it using a hash function's existing API that hides the state, as doing so allows delimiters to be faked.

I'm not sure what you mean here. The proposed function is implemented by default using the already existing writer methods.

The other proposals/motivations seem orthogonal even if dealing with the same trait. The objective here is to avoid hashing additional stuff when not needed.

@burdges

burdges commented Oct 21, 2016

If your delimiter is created using a hash function's existing writer method, then an input can be crafted to fake a delimiter, possibly invalidating security assumptions.

I'm saying : It's a nice idea to support delimiters that cannot be faked, but since afaik no current hash functions provide them, then one should do it with another separate trait.

I'll say that stronger: it's a nice idea to support a delimiter that provably cannot be faked, relative to cryptographic assumptions, even if the hash function itself is not cryptographically secure for speed reasons. You could hash user_input||delimiter||other_data and get good performance on user_input and other_data but know that a user cannot tweak user_input to launch an attack based on faking other_data.

You can do exactly this with an HMAC construction using two-ish calls to SipHash, so the key protection properties of SipHash give the desired security. Yet, modern cryptographic hash functions like SHA3 have much more resistance to extension attacks, etc. without being used in an HMAC, so maybe it's worth asking say the SipHash authors (DJB, et al.) if a delimiter could be more efficient than an HMAC. I donno hash functions well but conceivably adding a cryptographic delimiter to SipHash could almost double the speed over HMAC constructions.

Anyways, if you want a secure delimiter right now, then you need to compose two or three invocations of SipHash into an HMAC, and make another Hasher for the result.

@burdges

burdges commented Nov 22, 2016

We need a hash API to be capable of producing specific results, like say if an existing protocol requires hashing a series of u16s. If you hash the length when hashing a &[T] then you'll break this.

@sfackler
Member

@burdges why does this Hasher trait "need" to support those use cases? It is designed to be used by HashMap and HashSet.

@seanmonstar
Contributor

seanmonstar commented Nov 22, 2016

We need a hash API to be capable of producing specific results, like say if an existing protocol requires hashing a series of u16s. If you hash the length when hashing a &[T] then you'll break this.

Yes, it would be incorrect to use the current Hasher trait (or any other trait that works with Hash types) to implement something like sha256.

@burdges

burdges commented Nov 23, 2016

There need not be any security problem with using the same traits to feed data to both cryptographic and non-cryptographic hash functions. And Go does roughly that if I recall.

It's true Hash does not worry about endianness. At present, I do manually unpack my structs to use digest methods, but that's mostly so I can see the digest construction in one place. It's easy enough to manually impl Hash on a newtype to do the endianness conversions.

I'd favor addressing the endianness issue correctly within the existing family of hashing traits, so that they can be used cleanly across the board. Remember, there are still hash tables one writes to disk, which benefit from fixing both the endianness issues, and anything that benefits in memory hash tables.

@burdges

burdges commented Nov 23, 2016

Just noticed the current impl<T: Hash> Hash for [T] already hashes the length, so this proposal actually counts as an improvement, because a Hasher can implement delimit as a no-op to disable that.

About endianness, I think the easy solution is wrapping the Hasher as follows.

struct HasherBE<H: Hasher>(H);

impl<H: Hasher> Hasher for HasherBE<H> {
    fn finish(&self) -> u64 { self.0.finish() }
    fn write(&mut self, bytes: &[u8]) { self.0.write(bytes) }

    fn write_u8(&mut self, i: u8) { self.0.write_u8(i) }
    fn write_u16(&mut self, i: u16) { self.0.write_u16(i.to_be()) }
    fn write_u32(&mut self, i: u32) { self.0.write_u32(i.to_be()) }
    fn write_u64(&mut self, i: u64) { self.0.write_u64(i.to_be()) }
    fn write_usize(&mut self, i: usize) { self.0.write_usize(i.to_be()) }
}

struct HasherLE<H: Hasher>(H);

impl<H: Hasher> Hasher for HasherLE<H> {
    fn finish(&self) -> u64 { self.0.finish() }
    fn write(&mut self, bytes: &[u8]) { self.0.write(bytes) }

    fn write_u8(&mut self, i: u8) { self.0.write_u8(i) }
    fn write_u16(&mut self, i: u16) { self.0.write_u16(i.to_le()) }
    fn write_u32(&mut self, i: u32) { self.0.write_u32(i.to_le()) }
    fn write_u64(&mut self, i: u64) { self.0.write_u64(i.to_le()) }
    fn write_usize(&mut self, i: usize) { self.0.write_usize(i.to_le()) }
}

These impls cannot be used to wrap a hasher that does something strange with signed values, but afaik none do.

@burdges

burdges commented Nov 23, 2016

I suppose the way one implements the secure delimiters I suggested upthread might be:

struct MyHasher { /* ... */ }

struct MyDelimitHasher<'a>(&'a mut MyHasher);

impl<'a> Hasher for MyDelimitHasher<'a> {
    // ... do something different ...
    fn delimit<T: Hash>(&mut self, _delimiter: T) { unreachable!() }
}

impl Hasher for MyHasher {
    // ... do the usual thing ...
    fn delimit<T: Hash>(&mut self, delimiter: T) {
        let mut s = MyDelimitHasher(self);
        delimiter.hash(&mut s);
    }
}

I'm rather happy with all this now. I do still wonder if we want some way to pass richer information to delimit(). An associated constant DELIMITER could be added to Hash that defaulted to () perhaps.

@Kimundi
Member

Kimundi commented Nov 28, 2016

@pczarn Any update on this? Adding the implications for the existing uses of Hash by, e.g., HashMap into the RFC text would be nice, even if it's just to say that there would be no changes involved.

@ticki
Contributor

ticki commented Nov 28, 2016

Looking at the various impl blocks of Hasher I can find via GitHub search, it seems there is some disagreement about how Hasher is used. Some implement write as writing a new block of data, and some as extending the stream. This exact behavior isn't specified in the docs, yet it is relied on in derive(Hash) for proper quality (e.g. there's no length padding between strings).

This is an issue, and I think the exact thing this RFC is trying to solve is solved simply by specifying that write writes a new block of data (i.e. there is an implicit delimiter). This is also the most natural choice, considering that you rarely need to construct your block in streaming fashion from variable-length chunks.

Furthermore, there is a performance advantage: it would mean that there is no need for "left over" chunks, which aren't complete enough to be committed to the state value yet. These can hurt performance a lot, to the extent that the streaming version of a hash function is up to 2x slower than the static version.

@burdges

burdges commented Nov 28, 2016

Very good point. I suppose, if you need more fine grained control over the block filling, then you need a specific hash function too, so you're non-generic enough to build a simple wrapper to handle left overs anyways.

@sfackler
Member

After rereading the RFC and associated discussion, it is still not clear to me how the addition of a delimit method to Hasher provides value.

  1. The example implementation of FarmHasher is incorrect even for [T] - you will end up with the hash of the last value of the slice, not the slice as a whole. An "after" version of the example hasher would be very helpful to assess what the actual impact of the changes proposed would be.
  2. Hasher is fundamentally inappropriate for one shot hashing - the signature simply does not make sense. This RFC seems to be proposing an addition that will not allow one shot hashing to be supported correctly, just to be supported incorrectly but faster.
  3. The Hasher/Hash traits exist to be used by HashMap/HashSet. Why should we make a change to them that apparently nothing in the standard library will be able to take advantage of?

I'm personally inclined to FCP to close - cc @rust-lang/libs.

@bluss
Member

bluss commented Dec 19, 2016

Less rigorous hashers like FnvHasher (or similar) can choose to skip hashing the delimiters altogether. That's a marginal but noticeable improvement.
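That suggestion can be sketched with a toy FNV-1a hasher; `delimit` here is a hypothetical inherent method standing in for the proposed trait method (its signature is assumed for illustration):

```rust
use std::hash::Hasher;

// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

struct Fnv(u64);

impl Fnv {
    fn new() -> Fnv { Fnv(FNV_OFFSET) }
    // Hypothetical no-op delimit: a less rigorous hasher just drops delimiters.
    fn delimit(&mut self, _len: usize) {}
}

impl Hasher for Fnv {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ u64::from(b)).wrapping_mul(FNV_PRIME);
        }
    }
}

// Roughly what [T]'s Hash does today: mix the length in before the payload.
fn hash_with_delimiter(payload: &[u8]) -> u64 {
    let mut h = Fnv::new();
    h.write_usize(payload.len()); // 8 extra bytes mixed in on 64-bit targets
    h.write(payload);
    h.finish()
}

// With a no-op delimit, only the payload bytes touch the state.
fn hash_without_delimiter(payload: &[u8]) -> u64 {
    let mut h = Fnv::new();
    h.delimit(payload.len());
    h.write(payload);
    h.finish()
}

fn main() {
    assert_ne!(hash_with_delimiter(b"abc"), hash_without_delimiter(b"abc"));
    println!("ok");
}
```

For a 3-byte key, the no-op version mixes 3 bytes into the state instead of 11, which is where the "marginal but noticeable" savings come from.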

@sfackler
Member

@rfcbot fcp close

@rfcbot
Collaborator

rfcbot commented Dec 20, 2016

Team member @sfackler has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@arthurprs

I posted benchmarks upthread that show a 5-10% speedup in siphash-backed hashmap benchmarks using short strings (by skipping the first delimiter, so the hasher stays "correct" in the case of multi-part keys). So it's applicable to any hasher while remaining correct.

@sfackler
Member

The RFC text's motivation is focused entirely on one shot hashing - is that not actually the motivation for this change?

The change made in that benchmark seems like it will break third party types that implement Borrow<str> - is that not the case?

@arthurprs

I agree the RFC's motivation (as it's written now) may not sound appealing. The part I'm defending is just a single line under Alternatives in the RFC.

I don't think it breaks anything, as the calls to the hasher remain the same, e.g. string.hash(&mut s); and wrapper.0.hash(&mut hasher);

@pczarn
Author

pczarn commented Jan 3, 2017

@sfackler
If I understand you right, you're saying it's better to have two APIs (for streaming and one-shot hashing), instead of one more general hasher API with an additional method. I agree now. The former API would be messy, and the latter adds some duplicated code, but is the right thing.

I may change the RFC's motivation to focus on speedup, but I'm not convinced by @arthurprs's benchmark. Do you think the speedup will matter in real use cases? Is it important to optimize for siphash when adaptive hashing might work without siphash most of the time? If not, I'll close the RFC.

@arthurprs

arthurprs commented Jan 3, 2017

Just to be extra clear, I really want to see the proposal of this RFC merged. I was just defending it from a different POV: avoiding hashing stuff we don't have to (thus performance gains).

There's of course the POV of allowing non-multi-part-ish types to be really hashed in a single shot. But I'd argue again that the end motivation is performance, as the current trait can be implemented with a non-streaming hash (e.g. farmhash) that mixes the intermediate hash results. It's just neither as good (hash quality) nor as fast as it could be.

@aturon
Member

aturon commented Jan 6, 2017

Ping @sfackler, can you work with @arthurprs and @pczarn to reach a resolution?

@arthurprs

Hasher is fundamentally inappropriate for one shot hashing - the signature simply does not make sense. This RFC seems to be proposing an addition that will not allow one shot hashing to be supported correctly, just to be supported incorrectly but faster.

It can be implemented correctly; mixing the intermediate hashes is still correct even if quality-wise it isn't optimal. Fnv, Fx, farmhash, and others all do this. Having fn delimit will make the single-component case optimal.

The Hasher/Hash traits exist to be used by HashMap/HashSet. Why should we make a change to them that apparently nothing in the standard library will be able to take advantage of?

All hashers can probably take advantage of the added fn delimit. The only requirement is fixing &str to hash more like &[u8]. The latter uses a length-based delimiter instead of a positional one.
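The &str vs &[u8] contrast can be sketched against the std default hasher; the manual helper below assumes libstd's current length-prefix scheme for slices (an implementation detail, not a documented guarantee):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// `&[u8]` hashing is length-prefixed: the length is hashed first (through
// write_usize by default), then the payload bytes.
fn slice_hash(s: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn manual(s: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_usize(s.len()); // the length-based delimiter, written up front
    h.write(s);             // then the payload
    h.finish()
}

fn main() {
    assert_eq!(slice_hash(&[1, 2, 3]), manual(&[1, 2, 3]));
    println!("ok");
}
```

A length prefix is a delimiter a hasher can skip or fold in cheaply, whereas &str's trailing byte is interleaved with the payload, which is why the thread treats the slice scheme as the friendlier one.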

@sfackler
Member

If we're going to merge this, I think the RFC text needs to be updated to reflect the current thoughts on motivation - the FarmHash example in the RFC is still incorrect!

I'm still a bit worried about downstream breakage if we change str's Hash impl, but I could also see it working out fine, and it's probably something we can resolve without too much pain if it does come up.

@arielb1
Contributor

arielb1 commented Feb 1, 2017

The other day I was talking with @michaelwoerister about the "specification" of Hasher, and he was of the opinion that Hasher need not always return the same result for h.write(b"x"); h.write(b"y"); and h.write(b"xy"); (and in fact, rustc contains a Hasher implementation where h.write_u16(0); is not equal to h.write(&[0, 0]);).

The requirements we had in mind are:

  1. Hasher is supposed to be pure - return the same result for the same sequence of calls.
  2. Hasher is required to be "almost"-injective for "type-compatible" inputs - if x.finish() == y.finish() and no hash collision occurred, then the first different function calls to x and y (if any exist) were calls to different functions or to write with different lengths.

@arthurprs

arthurprs commented Feb 1, 2017

That's how most non-streaming hashers implement it, FNV being the biggest example in crates.io

@alexcrichton
Member

ping @BurntSushi, @brson (checkboxes)

@rfcbot rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substantial objections are raised. label Mar 14, 2017
@rfcbot
Collaborator

rfcbot commented Mar 14, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot
Collaborator

rfcbot commented Mar 24, 2017

The final comment period is now complete.

@aturon
Member

aturon commented Mar 27, 2017

I'm going to close this RFC for the time being, as per the commentary above. We're definitely still interested in motion in this area! Please ping @sfackler if you'd like to take up the mantle.

@aturon aturon closed this Mar 27, 2017