
Extend the Hasher trait with fn delimit to support one-shot hashing #1666

Closed
wants to merge 5 commits into from

Conversation

pczarn

@pczarn pczarn commented Jul 4, 2016


Related work
rust-lang/rust#28044 "WIP: Hash and Hasher update for faster common case hashing"
rust-lang/rust#29139 "Make short string hashing 30% faster by splitting Hash::hash_end from Hash::hash"
contain-rs/hashmap2#5 "Implement adaptive hashing using specialization"

@pczarn pczarn changed the title Extend the Hasher trait with fn delimit for one-shot hashing Extend the Hasher trait with fn delimit to support one-shot hashing Jul 4, 2016
@pczarn pczarn force-pushed the one-shot-hashing branch from b3f6791 to b90731c on July 4, 2016 09:40
@nrc nrc added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jul 5, 2016
@alexcrichton
Member

Thanks for the RFC @pczarn! Could you also expand the RFC with some details about how implementations of Hash or usage of Hasher through HashMap would change? I'm at least personally a little fuzzy on the details and would benefit from seeing some concrete instances of what would change.

Additionally, would it be possible to prototype this change and take a look at some performance benchmarks as well? Those would probably be the most convincing for ensuring that a change like this is merged quickly.

@pczarn
Author

pczarn commented Jul 5, 2016

@alexcrichton This RFC as-is doesn't concern current usage. Only the definition and some implementations of Hasher would change. The wording of this RFC may be confusing, since it mentions usage that will be introduced in another RFC. I'll revisit it tomorrow.
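To make the shape of the change concrete, here is a minimal sketch of the kind of trait extension under discussion. The `delimit` signature and the `CountingHasher` type below are assumptions for illustration, not quoted from the RFC text; a standalone trait stands in for the proposed change to `std::hash::Hasher`:

```rust
use std::hash::Hasher;

// Hypothetical extension trait standing in for the RFC's proposed change to
// `std::hash::Hasher`; the signature is assumed for illustration.
trait DelimitHasher: Hasher {
    // Default: keep today's behaviour and feed the delimiter into the stream.
    fn delimit(&mut self, len: usize) {
        self.write_usize(len);
    }
}

// A toy hasher that just counts bytes, overriding `delimit` as a no-op the
// way a one-shot or less rigorous hasher could.
struct CountingHasher(u64);

impl Hasher for CountingHasher {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) { self.0 += bytes.len() as u64; }
}

impl DelimitHasher for CountingHasher {
    fn delimit(&mut self, _len: usize) {} // skip the delimiter entirely
}

fn demo() -> u64 {
    let mut h = CountingHasher(0);
    h.write(b"abc");
    h.delimit(3); // no-op here; the default would have fed in 8 more bytes
    h.finish()
}

fn main() {
    assert_eq!(demo(), 3);
    println!("ok");
}
```

The point is that only the Hasher side changes: Hash implementations would call `delimit` where they currently write delimiter bytes, and streaming hashers keep the default body.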

@arthurprs

arthurprs commented Jul 6, 2016

I came up with this arthurprs/hash-rs@1a5bfd2 in order to test the outcome of this change.

Results: http://i.imgur.com/N5AVi12.png
EDIT: here's a better graph (linear Y axis) http://imgur.com/a/0ruST

Yes, it's that good.

Now that we have sip13 in core we won't even need any other algorithm for hashing the bytes in the specialized form, the good old write + finish is probably enough.

@sfackler
Member

sfackler commented Jul 6, 2016

The motivation talks about one-shot hashing a lot as motivation, but then doesn't seem to follow up on how the addition of delimit will help the one-shot case. Even if this RFC isn't proposing any additions to the standard library other than delimit, it would still be helpful to go into some explicit detail on how consumers of these APIs will benefit.

@pczarn
Author

pczarn commented Jul 6, 2016

I made an update. I hope the motivation is more clear.

@arthurprs You implemented an alternative listed in the RFC. The main goal of this RFC is basically refactoring -- there's nothing to benchmark.

Is Sip13 fast enough for all purposes? Good question.

@sfackler What do you mean by "consumers of these APIs", specifically? Did you see the update?

@arthurprs

@pczarn right, but it gives a realistic preview of the speedup unlocked by this proposal.

@bluss
Member

bluss commented Jul 10, 2016

Thanks for reviving this. We need to discuss some of the drawbacks.

You can do this to make it possible to use the FarmHasher for multi-part things:

self.result ^= farmhash::hash64(msg);  // xor is the simplest possible mixer, could be something else

Either with that or without it, a drawback remains: The Hash trait is formulated so that it is used to "stream" bytes to a hasher in slices of bytes. The implicit protocol is that it doesn't matter how you slice your bytes, and equivalent stream of bytes should hash the same way. There's no way for the usual farmhash algorithm to support that, so either way you use it as a hasher together with the Hash trait, you break the rules of the Hash trait a bit.
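The implicit protocol described here can be checked directly against the std default hasher (SipHash-backed), which is stream-based and therefore insensitive to how the bytes are sliced:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// The "implicit protocol": a byte-stream hasher gives the same result no
// matter how the input bytes are sliced across `write` calls.
fn split_vs_whole() -> (u64, u64) {
    let mut a = DefaultHasher::new();
    a.write(b"x");
    a.write(b"y");

    let mut b = DefaultHasher::new();
    b.write(b"xy");

    (a.finish(), b.finish())
}

fn main() {
    let (x, y) = split_vs_whole();
    // A one-shot algorithm like FarmHash, mixed once per `write` call,
    // would generally break this equality.
    assert_eq!(x, y);
    println!("ok");
}
```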

@alexcrichton alexcrichton self-assigned this Jul 25, 2016
@alexcrichton
Member

@pczarn the libs team talked about this awhile ago, but our conclusion was that while the motivation seems sound we were a little confused as to how this would affect the standard library. Right now this seems to be adding a bit of complexity but then not taking advantage of it? Could you expand a bit on how the standard library (e.g. HashMap) would change, and if so, reference some benchmarks in the RFC text?

@arthurprs

arthurprs commented Aug 16, 2016

@alexcrichton I agree it's a bit confusing in the current form; that's why I was trying to show another face of the improvement in #1666 (comment) above.

In that commit/test I modified write_usize to skip over the first delimiter; the same could be done for write_delimiter, giving big improvements on small keys. Of course this applies to all hashers.

Oneshot hashers benefit from this even more (as finish involves little to no work), but would still have to mix possible n-part hashes as @bluss said above.

@alexcrichton
Member

Yeah we were basically just thinking it'd be good for the RFC itself to encode the performance wins and/or changes to HashMap (if any)

@bluss
Member

bluss commented Aug 17, 2016

Is there a way to attack this problem with specialization? So that we can reach one-shot hashing for precisely the types that support it.

@pczarn
Author

pczarn commented Aug 17, 2016

Yeah we were basically just thinking it'd be good for the RFC itself to encode the performance wins and/or changes to HashMap (if any)

Exactly. There are no substantial changes in this RFC on its own.

Is there a way to attack this problem with specialization? So that we can reach one-shot hashing for precisely the types that support it.

A trait method with a default definition will reach one-shot hashing for the hashers that need it. Using specialization is possible, but does not give any advantage.

@bluss
Member

bluss commented Aug 17, 2016

What about specializing a trait for types that are compatible with one-shot hashing. For example str/String would be, but compound types that contain str/String are not.

@pczarn
Author

pczarn commented Aug 17, 2016

Yes, the plan is to make such a marker trait for specialization. The code for that trait is in contain-rs/hashmap2#5: https://github.com/pczarn/hashmap2/blob/e8fb6cacde8dcb166b009057afce7588b8d27a77/src/adaptive_map.rs#L194

@alexcrichton
Member

@pczarn unfortunately though the libs team was hesitant about considering for a merge as-is. It wasn't clear why we would do this as it doesn't detail the follow-up work and how it relates to HashMap, possibly. Could the RFC text be expanded with that?

@arthurprs

@pczarn

do you plan to pursue this PR further? I think you are onto something here. Hashing a usize for every slice is ridiculous performance-wise. Preventing the write_u8(0xFF) for strings would provide a significant gain too.
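For concreteness, the write_u8(0xFF) mentioned here is the positional delimiter libstd feeds after a str's bytes. A quick check with the default hasher (a sketch relying on libstd implementation details) shows both that a delimiter exists and what it buys:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn str_hash(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn bytes_only(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes()); // just the payload, no delimiter
    h.finish()
}

fn pair_hash(a: &str, b: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (a, b).hash(&mut h);
    h.finish()
}

fn main() {
    // A delimiter is fed after the string bytes (historically a 0xff byte),
    // so hashing a str is not the same as hashing only its bytes...
    assert_ne!(str_hash("ab"), bytes_only("ab"));
    // ...and that is what keeps ("ab", "c") and ("a", "bc") from producing
    // the same byte stream, hence the same hash.
    assert_ne!(pair_hash("ab", "c"), pair_hash("a", "bc"));
    println!("ok");
}
```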

@burdges

burdges commented Oct 21, 2016

I do like the idea of hash functions offering a special delimiter input, but..

Afaik there are no common hash functions that offer such a delimiter input, and one cannot provide it using a hash function's existing API that hides the state, as doing so allows delimiters to be faked.

We discussed hashing recently in #1768, but I can rehash ;) the direction taken there:

There was a suggestion that hashing should be built on a writer trait that cannot fail, perhaps a general-purpose writer trait with an associated error type set to !, à la type HashWriter = Writer<E=!>; or whatever. A writer trait makes sense because protocols require cross-platform hashing, including specifying endianness and being careful around isize and usize. Also, there are hashing algorithms that can fail on startup, say if Argon2 cannot allocate enough memory, possibly gigabytes, but maybe those can be ignored. Afaik none can fail during operation, though, so the hasher input methods cannot fail, which rules out the existing writer methods or crates like byteorder.

There is a wider array of hash outputs than simply a u64, too, so any finish should return an associated type, à la trait Hasher: HashWriter { type F; fn finish(&self) -> Self::F; }. Just a few examples: almost all cryptographic hash functions return a [u8; n], with several different modes for different n. SHA3's SHAKE mode returns an arbitrary byte stream, maybe a Reader<E=!> where Reader is a reader trait that cannot fail. Pairing-based cryptography commonly uses hash functions that produce an element of the elliptic curve, meaning some big but fixed-size struct. In RSA signatures, you sometimes use a full-domain hash, meaning an integer mod the RSA modulus, so finish should return a closure that takes an additional argument.

In this vein, one might handle delimiters with a trait that provides a delimit function. In fact, it could be a trait HashWriterDelimit: HashWriter { ... } whose delimit function mutably borrows the hash as a second HashWriter for delimiter input, but maybe that's needlessly complex.

@arthurprs

Afaik there are no common hash functions that offer such a delimiter input, and one cannot provide it using a hash function's existing API that hides the state, as doing so allows delimiters to be faked.

I'm not sure what you mean here. The proposed function is implemented by default using the already existing writer methods.

The other proposals/motivations seem orthogonal even if dealing with the same trait. The objective here is to avoid hashing additional stuff when not needed.

@burdges

burdges commented Oct 21, 2016

If your delimiter is created using a hash function's existing writer method, then an input can be crafted to fake a delimiter, possibly invalidating security assumptions.

I'm saying : It's a nice idea to support delimiters that cannot be faked, but since afaik no current hash functions provide them, then one should do it with another separate trait.

I'll say that stronger: it's a nice idea to support a delimiter that provably cannot be faked, relative to cryptographic assumptions, even if the hash function itself is not cryptographically secure for speed reasons. You could hash user_input||delimiter||other_data and get good performance on user_input and other_data but know that a user cannot tweak user_input to launch an attack based on faking other_data.

You can do exactly this with an HMAC construction using two-ish calls to SipHash, so the key protection properties of SipHash give the desired security. Yet, modern cryptographic hash functions like SHA3 have much more resistance to extension attacks, etc. without being used in an HMAC, so maybe it's worth asking say the SipHash authors (DJB, et al.) if a delimiter could be more efficient than an HMAC. I donno hash functions well but conceivably adding a cryptographic delimiter to SipHash could almost double the speed over HMAC constructions.

Anyways, if you want a secure delimiter right now, then you need to compose two or three invocations of SipHash into an HMAC, and make another Hasher for the result.

@burdges

burdges commented Nov 22, 2016

We need a hash API to be capable of producing specific results, like say if an existing protocol requires hashing a series of u16s. If you hash the length when hashing a &[T] then you'll break this.

@sfackler
Member

@burdges why does this Hasher trait "need" to support those use cases? It is designed to be used by HashMap and HashSet.

@seanmonstar
Contributor

seanmonstar commented Nov 22, 2016

We need a hash API to be capable of producing specific results, like say if an existing protocol requires hashing a series of u16s. If you hash the length when hashing a &[T] then you'll break this.

Yes, it would be incorrect to use the current Hasher trait (or any other trait that works with Hash types) to implement something like sha256.

@burdges

burdges commented Nov 23, 2016

There need not be any security problem with using the same traits to feed data to both cryptographic and non-cryptographic hash functions. And Go does roughly that if I recall.

It's true Hash does not worry about endianness. At present, I do manually unpack my structs to use digest methods, but that's mostly so I can see the digest construction in one place. It's easy enough to manually impl Hash on a newtype to do the endianness conversions.

I'd favor addressing the endianness issue correctly within the existing family of hashing traits, so that they can be used cleanly across the board. Remember, there are still hash tables one writes to disk, which benefit from fixing both the endianness issues, and anything that benefits in memory hash tables.

@burdges

burdges commented Nov 23, 2016

Just noticed the current impl<T: Hash> Hash for [T] already hashes the length, so this proposal actually counts as an improvement, because a Hasher can implement delimit as a no-op to disable that.

About endianness, I think the easy solution is wrapping the Hasher as follows.

struct HasherBE<H: Hasher>(H);

impl<H: Hasher> Hasher for HasherBE<H> {
    fn finish(&self) -> u64 { self.0.finish() }
    fn write(&mut self, bytes: &[u8]) { self.0.write(bytes) }

    fn write_u8(&mut self, i: u8) { self.0.write_u8(i) }
    fn write_u16(&mut self, i: u16) { self.0.write_u16(i.to_be()) }
    fn write_u32(&mut self, i: u32) { self.0.write_u32(i.to_be()) }
    fn write_u64(&mut self, i: u64) { self.0.write_u64(i.to_be()) }
    fn write_usize(&mut self, i: usize) { self.0.write_usize(i.to_be()) }
}

struct HasherLE<H: Hasher>(H);

impl<H: Hasher> Hasher for HasherLE<H> {
    fn finish(&self) -> u64 { self.0.finish() }
    fn write(&mut self, bytes: &[u8]) { self.0.write(bytes) }

    fn write_u8(&mut self, i: u8) { self.0.write_u8(i) }
    fn write_u16(&mut self, i: u16) { self.0.write_u16(i.to_le()) }
    fn write_u32(&mut self, i: u32) { self.0.write_u32(i.to_le()) }
    fn write_u64(&mut self, i: u64) { self.0.write_u64(i.to_le()) }
    fn write_usize(&mut self, i: usize) { self.0.write_usize(i.to_le()) }
}

These impls cannot be used to wrap a hasher that does something strange with signed values, but afaik none do.

@burdges

burdges commented Nov 23, 2016

I suppose the way one implements the secure delimiters I suggested upthread might be:

struct MyHasher { /* ... */ }

struct MyDelimitHasher<'a>(&'a mut MyHasher);

impl<'a> Hasher for MyDelimitHasher<'a> {
    // ... do something different ...
    fn delimit<T: Hash>(&mut self, _delimiter: T) { unreachable!() }
}

impl Hasher for MyHasher {
    // ... do the usual thing ...
    fn delimit<T: Hash>(&mut self, delimiter: T) {
        let mut s = MyDelimitHasher(self);
        delimiter.hash(&mut s);
    }
}

I'm rather happy with all this now. I do still wonder if we want some way to pass richer information to delimit(). An associated constant DELIMITER could be added to Hash that defaulted to () perhaps.

@Kimundi
Member

Kimundi commented Nov 28, 2016

@pczarn Any update on this? Adding the implications for the existing uses of Hash by, e.g., HashMap into the RFC text would be nice, even if it's just to say that there would be no changes involved.

@ticki
Contributor

ticki commented Nov 28, 2016

Looking at the various impl blocks of Hasher I can find via GitHub search, it seems there is some disagreement about how Hasher is used. Some implement write as writing a new block of data, and some as extending the stream. This exact behavior isn't specified in the docs, yet it is relied on in derive(Hash) for proper quality (e.g. there's no length padding between strings).

This is an issue, and I think the exact thing this RFC is trying to solve is solved simply by specifying that write writes a new block of data (i.e. there is an implicit delimiter). This is also the most natural choice, considering that you rarely need to construct your block in streaming fashion from variable-length chunks.

Furthermore, there is a performance advantage: it would mean that there is no need for "left over" chunks, which aren't complete enough to be committed to the state value yet. These can hurt performance a lot, to the extent that the streaming version of a hash function is up to 2x slower than the static version.

@burdges

burdges commented Nov 28, 2016

Very good point. I suppose, if you need more fine grained control over the block filling, then you need a specific hash function too, so you're non-generic enough to build a simple wrapper to handle left overs anyways.

@sfackler
Member

After rereading the RFC and associated discussion, it is still not clear to me how the addition of a delimit method to Hasher provides value.

  1. The example implementation of FarmHasher is incorrect even for [T] - you will end up with the hash of the last value of the slice, not the slice as a whole. An "after" version of the example hasher would be very helpful to assess what the actual impact of the changes proposed would be.
  2. Hasher is fundamentally inappropriate for one shot hashing - the signature simply does not make sense. This RFC seems to be proposing an addition that will not allow one shot hashing to be supported correctly, just to be supported incorrectly but faster.
  3. The Hasher/Hash traits exist to be used by HashMap/HashSet. Why should we make a change to them that apparently nothing in the standard library will be able to take advantage of?

I'm personally inclined to FCP to close - cc @rust-lang/libs.

@bluss
Member

bluss commented Dec 19, 2016

Less rigorous hashers like FnvHasher (or similar) can choose to skip hashing the delimiters altogether. That's a marginal but noticeable improvement.
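That suggestion can be sketched with a toy FNV-1a hasher; `delimit` here is a hypothetical inherent method standing in for the proposed trait method (its signature is assumed for illustration):

```rust
use std::hash::Hasher;

// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

struct Fnv(u64);

impl Fnv {
    fn new() -> Fnv { Fnv(FNV_OFFSET) }
    // Hypothetical no-op delimit: a less rigorous hasher just drops delimiters.
    fn delimit(&mut self, _len: usize) {}
}

impl Hasher for Fnv {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ u64::from(b)).wrapping_mul(FNV_PRIME);
        }
    }
}

// Roughly what [T]'s Hash does today: mix the length in before the payload.
fn hash_with_delimiter(payload: &[u8]) -> u64 {
    let mut h = Fnv::new();
    h.write_usize(payload.len()); // 8 extra bytes mixed in on 64-bit targets
    h.write(payload);
    h.finish()
}

// With a no-op delimit, only the payload bytes touch the state.
fn hash_without_delimiter(payload: &[u8]) -> u64 {
    let mut h = Fnv::new();
    h.delimit(payload.len());
    h.write(payload);
    h.finish()
}

fn main() {
    assert_ne!(hash_with_delimiter(b"abc"), hash_without_delimiter(b"abc"));
    println!("ok");
}
```

For a 3-byte key, the no-op version mixes 3 bytes into the state instead of 11, which is where the "marginal but noticeable" savings come from.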

@sfackler
Member

@rfcbot fcp close

@rfcbot
Collaborator

rfcbot commented Dec 20, 2016

Team member @sfackler has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@arthurprs

I posted benchmarks upthread that show a 5-10% speedup in siphash-backed hashmap benchmarks using short strings (by skipping the first delimiter, so the hasher stays "correct" in the case of multi-part keys). So it's applicable to any hasher while remaining correct.

@sfackler
Member

The RFC text's motivation is focused entirely on one shot hashing - is that not actually the motivation for this change?

The change made in that benchmark seems like it will break third party types that implement Borrow<str> - is that not the case?

@arthurprs

I agree the RFC's motivation (as it's written now) may not sound appealing. The part I'm defending is just a single line under Alternatives in the RFC.

I don't think it breaks anything, as the calls to the hasher remain the same, e.g. string.hash(&mut s); and wrapper.0.hash(&mut hasher);

@pczarn
Author

pczarn commented Jan 3, 2017

@sfackler
If I understand you right, you're saying it's better to have two APIs (for streaming and one-shot hashing), instead of one more general hasher API with an additional method. I agree now. The former API would be messy, and the latter adds some duplicated code, but is the right thing.

I may change the RFC's motivation to focus on speedup, but I'm not convinced by @arthurprs's benchmark. Do you think the speedup will matter in real use cases? Is it important to optimize for siphash when adaptive hashing might work without siphash most of the time? If not, I'll close the RFC.

@arthurprs

arthurprs commented Jan 3, 2017

Just to be extra clear, I really want to see the proposal of this RFC merged. I was just defending it from a different POV: avoiding hashing stuff we don't have to (thus performance gains).

There's of course the POV of allowing non-multi-part-ish types to be really hashed in a single shot. But I'd argue again that the end motivation is performance, as the current trait can be implemented with a non-streaming hash (e.g. farmhash) that mixes the intermediate hash results. It's just neither as good (hash quality) nor as fast as it could be.

@aturon
Member

aturon commented Jan 6, 2017

Ping @sfackler, can you work with @arthurprs and @pczarn to reach a resolution?

@arthurprs

Hasher is fundamentally inappropriate for one shot hashing - the signature simply does not make sense. This RFC seems to be proposing an addition that will not allow one shot hashing to be supported correctly, just to be supported incorrectly but faster.

It can be implemented correctly; mixing the intermediate hashes is still correct even if quality-wise it isn't optimal. Fnv, Fx, farmhash, and others all do this. Having fn delimit will make the single-component case optimal.

The Hasher/Hash traits exist to be used by HashMap/HashSet. Why should we make a change to them that apparently nothing in the standard library will be able to take advantage of?

All hashers can probably take advantage of the added fn delimit. The only requirement is fixing &str to hash more like &[u8]. The latter uses a length-based delimiter instead of a positional one.
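The &str vs &[u8] contrast can be sketched against the std default hasher; the manual helper below assumes libstd's current length-prefix scheme for slices (an implementation detail, not a documented guarantee):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// `&[u8]` hashing is length-prefixed: the length is hashed first (through
// write_usize by default), then the payload bytes.
fn slice_hash(s: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn manual(s: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_usize(s.len()); // the length-based delimiter, written up front
    h.write(s);             // then the payload
    h.finish()
}

fn main() {
    assert_eq!(slice_hash(&[1, 2, 3]), manual(&[1, 2, 3]));
    println!("ok");
}
```

A length prefix is a delimiter a hasher can skip or fold in cheaply, whereas &str's trailing byte is interleaved with the payload, which is why the thread treats the slice scheme as the friendlier one.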

@sfackler
Member

If we're going to merge this, I think the RFC text needs to be updated to reflect the current thoughts on motivation - the FarmHash example in the RFC is still incorrect!

I'm still a bit worried about downstream breakage if we change str's Hash impl, but I could also see it working out fine, and it's probably something we can resolve without too much pain if it does come up.

@arielb1
Contributor

arielb1 commented Feb 1, 2017

The other day I was talking with @michaelwoerister about the "specification" of Hasher, and he was of the opinion that Hasher need not always return the same result for h.write(b"x"); h.write(b"y"); and h.write(b"xy"); (and in fact, rustc contains a Hasher implementation where h.write_u16(0); is not equal to h.write(&[0, 0]);).

The requirements we had in mind are:

  1. Hasher is supposed to be pure - return the same result for the same sequence of calls.
  2. Hasher is required to be "almost"-injective for "type-compatible" inputs - if x.finish() == y.finish() and no hash collision occurred, then the first different function calls to x and y (if any exist) were calls to different functions or to write with different lengths.

@arthurprs

arthurprs commented Feb 1, 2017

That's how most non-streaming hashers implement it, FNV being the biggest example in crates.io

@alexcrichton
Member

ping @BurntSushi, @brson (checkboxes)

@rfcbot rfcbot added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substantial objections are raised. label Mar 14, 2017
@rfcbot
Collaborator

rfcbot commented Mar 14, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot
Collaborator

rfcbot commented Mar 24, 2017

The final comment period is now complete.

@aturon
Member

aturon commented Mar 27, 2017

I'm going to close this RFC for the time being, as per the commentary above. We're definitely still interested in motion in this area! Please ping @sfackler if you'd like to take up the mantle.

@aturon aturon closed this Mar 27, 2017