Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test for stable hash uniqueness of adjacent field values #93193

Merged
merged 1 commit into from
Jan 28, 2022

Conversation

Kobzol
Copy link
Contributor

@Kobzol Kobzol commented Jan 22, 2022

This PR adds a simple test to check that stable hash will produce a different hash if the order of two values that have the same combined bit pattern changes.

r? @the8472

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 22, 2022
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 22, 2022
@Kobzol Kobzol changed the title Add test stable hash uniqueness of adjacent field values Add test for stable hash uniqueness of adjacent field values Jan 22, 2022
@rust-log-analyzer

This comment has been minimized.

@Kobzol Kobzol force-pushed the stable-hash-permutation-test branch from 90449b9 to 1ffd043 Compare January 24, 2022 14:47
@Kobzol Kobzol marked this pull request as ready for review January 24, 2022 14:51
@the8472
Copy link
Member

the8472 commented Jan 27, 2022

@bors r+ rollup

@bors
Copy link
Contributor

bors commented Jan 27, 2022

📌 Commit 1ffd043 has been approved by the8472

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 27, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 27, 2022
…askrgr

Rollup of 8 pull requests

Successful merges:

 - rust-lang#91641 (Define c_char using cfg_if rather than repeating 40-line cfg)
 - rust-lang#92899 (Mention std::iter::zip in Iterator::zip docs)
 - rust-lang#93193 (Add test for stable hash uniqueness of adjacent field values)
 - rust-lang#93325 (Introduce a limit to Levenshtein distance computation)
 - rust-lang#93339 (rustdoc: add test case for multiple traits and erased names)
 - rust-lang#93357 (Clarify the `usage-of-qualified-ty` error message.)
 - rust-lang#93363 (`#[rustc_pass_by_value]` cleanup)
 - rust-lang#93365 (More arena cleanups)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit bc26f97 into rust-lang:master Jan 28, 2022
@rustbot rustbot added this to the 1.60.0 milestone Jan 28, 2022
@Kobzol Kobzol deleted the stable-hash-permutation-test branch January 28, 2022 15:05
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 3, 2022
…sion, r=the8472

Compress amount of hashed bytes for `isize` values in StableHasher

This is another attempt to land rust-lang#92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was [reverted](rust-lang#93014) because it could produce the [same hash](rust-lang#92103 (comment)) for different values even in quite simple situations. I have since added a basic [test](rust-lang#93193) that should guard against that situation, I also added a new test in this PR, specialised for this optimization.

## Why this optimization helps
Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashing `i64` actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where is this datatype used directly in the compiler, and this place has also been showing up in traces that I used to find out when is `isize` being hashed. This place is `rustc_span::FileName::DocTest`, however, I suppose that isizes also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it [quite a lot](rust-lang#92103 (comment)).

Now, I'm not sure if special casing `isize` is the correct solution here, maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and other few benchmarks that I was testing).

## New approach
Since the most common case is a single byte, I added a fast path for hashing `isize` values which positive value fits within a single byte, and a cold path for the rest of the values.

To avoid the previous correctness problem, we need to make sure that each unique `isize` value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed).

We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we just simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen.

To avoid this situation, values `[0, 0xFE]` are hashed as a single byte. When we hash a larger (treating `isize` as `u64`) value, we first hash an additional byte `0xFF`. Since `0xFF` cannot occur when we apply the single byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`:
1) When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different.
2) When neither `a` and `b` are within `[0, 0xFE]`, their hash streams will be different.
3) When `a` is within `[0, 0xFE]` and `b` isn't, when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`. When we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different.

r? `@the8472`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants