
Speed up SipHasher128. #68914

Merged (2 commits) Feb 12, 2020

Conversation

@nnethercote (Contributor) commented Feb 7, 2020

The current code in SipHasher128::short_write is inefficient. It uses
u8to64_le (which is complex and slow) to extract just the right number of
bytes of the input into a u64 and pad the result with zeroes. It then
left-shifts that value in order to bitwise-OR it with self.tail.

For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes
to fill up self.tail. The current code uses u8to64_le to construct
0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and
zero-extended to a u64. The code then left-shifts that value by five bytes --
discarding the 0x00 byte that replaced the 0xII byte! -- to give
0xHHGG_FF00_0000_0000. It then ORs that value with self.tail.

There's a much simpler way to do it: zero-extend to u64 first, then left shift.
E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then
left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude
the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids
multiple occurrences of unsafe.

There's a similar story with the setting of self.tail at the method's end.
The current code uses u8to64_le to extract the remaining part of the input,
but the same effect can be achieved more quickly with a right shift on the
zero-extended input.

This commit changes SipHasher128 to use the simpler shift-based approach. The
code is also smaller, which means that short_write is now inlined where
previously it wasn't, which makes things faster again. This gives big
speed-ups for all incremental builds, especially "baseline" incremental
builds.
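To make this concrete, here is a minimal, self-contained sketch of the zero-extend-then-shift buffering scheme (illustrative names only, not the actual rustc code; it models just the byte-buffering, not the SipHash rounds):

// Sketch only: buffers input integers into little-endian-packed 64-bit words
// the way the new `short_write` does. Not the real implementation.
struct Buf {
    tail: u64,       // pending bytes, packed little-endian
    ntail: usize,    // number of valid bytes in `tail`
    words: Vec<u64>, // completed 64-bit words, as SipHash would consume them
}

impl Buf {
    // `x` is the input zero-extended to u64, `size` its width in bytes (1..=8).
    fn short_write(&mut self, x: u64, size: usize) {
        // Zero-extend first, then shift: bytes that don't fit are simply
        // shifted out the top, so no masking or byte extraction is needed.
        self.tail |= x << (8 * self.ntail);
        if self.ntail + size < 8 {
            self.ntail += size;
            return;
        }
        // The buffer is full: emit one word, then recover the leftover bytes
        // of `x` with a right shift instead of a byte-slice read.
        let needed = 8 - self.ntail;
        self.words.push(self.tail);
        self.tail = if needed < 8 { x >> (8 * needed) } else { 0 };
        self.ntail = size - needed;
    }
}

fn main() {
    // A 9-byte stream: write_u32, write_u8, write_u32 (the same byte values
    // that come up later in this thread).
    let mut b = Buf { tail: 0, ntail: 0, words: Vec::new() };
    b.short_write(0xDDCC_BBAAu32 as u64, 4);
    b.short_write(0xEEu8 as u64, 1);
    b.short_write(0x1234_5678u32 as u64, 4);
    assert_eq!(b.words, vec![0x345678EE_DDCCBBAAu64]);
    assert_eq!(b.tail, 0x12); // one leftover byte
}

Running the sketch reproduces the packed word and spilled byte that come up in the discussion below.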

r? @michaelwoerister

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) Feb 7, 2020
@nnethercote (Contributor Author)

BTW, I'm planning to make the equivalent change to SipHasher in core, but I will do that as a separate PR.

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors commented Feb 7, 2020

⌛ Trying commit e606fe7 with merge 05cb825...

bors added a commit that referenced this pull request Feb 7, 2020
Speed up `SipHasher128`.

@nnethercote (Contributor Author) commented Feb 7, 2020

Local results (check builds only) are excellent:

ctfe-stress-4-check
        avg: -4.7%?     min: -13.1%?    max: -0.1%?
clap-rs-check
        avg: -3.0%      min: -9.7%      max: -0.4%
coercions-check
        avg: -3.4%?     min: -5.7%?     max: -0.6%?
tuple-stress-check
        avg: -3.4%      min: -4.9%      max: -1.0%
ucd-check
        avg: -2.6%      min: -4.4%      max: -0.7%
html5ever-check
        avg: -2.0%      min: -3.9%      max: -0.6%
serde-check
        avg: -1.8%      min: -3.7%      max: -0.4%
unicode_normalization-check
        avg: -1.8%      min: -3.5%      max: -0.4%
keccak-check
        avg: -1.5%      min: -3.5%      max: -0.2%
issue-46449-check
        avg: -1.1%      min: -3.3%      max: -0.4%
piston-image-check
        avg: -2.3%      min: -3.1%      max: -0.7%
serde-serde_derive-check
        avg: -1.6%      min: -3.0%      max: -0.6%
await-call-tree-check
        avg: -1.6%      min: -3.0%      max: -0.6%
cranelift-codegen-check
        avg: -2.1%      min: -3.0%      max: -0.7%
deep-vector-check
        avg: -2.4%      min: -3.0%      max: -1.1%
script-servo-check
        avg: -2.3%      min: -2.9%      max: -0.8%
regex-check
        avg: -2.5%      min: -2.9%      max: -0.7%
ripgrep-check
        avg: -2.0%      min: -2.9%      max: -0.6%
webrender-check
        avg: -2.2%      min: -2.9%      max: -0.7%
encoding-check
        avg: -2.2%      min: -2.9%      max: -0.8%
syn-check
        avg: -2.2%      min: -2.8%      max: -0.8%
inflate-check
        avg: -1.2%      min: -2.8%      max: -0.2%
webrender-wrench-check
        avg: -1.6%      min: -2.8%      max: -0.5%
cargo-check
        avg: -1.7%      min: -2.7%      max: -0.5%
unused-warnings-check
        avg: -2.2%      min: -2.6%      max: -1.4%
futures-check
        avg: -1.8%      min: -2.6%      max: -0.4%
helloworld-check
        avg: -1.3%      min: -2.5%      max: -0.6%
style-servo-check
        avg: -2.0%      min: -2.5%      max: -0.7%
tokio-webpush-simple-check
        avg: -1.4%      min: -2.4%      max: -0.4%
regression-31157-check
        avg: -1.5%      min: -2.4%      max: -0.4%
hyper-2-check
        avg: -1.6%      min: -2.2%      max: -0.5%
deeply-nested-check
        avg: -1.1%      min: -2.0%      max: -0.2%
unify-linearly-check
        avg: -1.1%      min: -2.0%      max: -0.3%
packed-simd-check
        avg: -1.2%      min: -1.9%      max: -0.5%
wg-grammar-check
        avg: -1.0%      min: -1.9%      max: -0.1%
wf-projection-stress-65510-che...
        avg: -0.6%      min: -1.6%      max: -0.0%
token-stream-stress-check
        avg: -0.2%      min: -0.3%      max: -0.1%

It's notable that every benchmark except for token-stream-stress-check got at least a 1.6% speedup for one of the runs. It's rare for any speed improvement to have such a wide effect.

@bors commented Feb 7, 2020

☀️ Try build successful - checks-azure
Build commit: 05cb825 (05cb82590d75ad280719e8d36c7a39653538be09)

@rust-timer (Collaborator)

Queued 05cb825 with parent 442ae7f, future comparison URL.

@michaelwoerister (Member)

Thanks for the PR, @nnethercote. Looks like a great find! We are doing lots of hashing :)

Did you think about what implications these changes might have on big-endian systems? Hashing needs to be stable across platforms for cross-compilation. The changes are probably fine in this respect, but it's something to look out for.

I'll review in detail soon.

@nnethercote (Contributor Author)

I did think about big-endian. AFAIK, the new code will work fine there. Things are simpler because the code operates mostly on integers, with fewer conversions. But I'm happy to hear a second opinion.

Is there a way to know for sure if I'm right about this? Do you know if the tests cover BE, or if we have access to any BE machines for testing?

@michaelwoerister (Member)

We do run tests on some big endian platforms, I think (right, @rust-lang/infra?)

In the past we had problems with symbol name mismatches when compiling some things on little-endian and the rest on big-endian, because the symbol hashes didn't match up. But we now know which kinds of bug reports to look out for after a change like this and testing should be better too now.

@michaelwoerister (Member)

https://cfarm.tetaneutral.net/ provides access to some big endian systems.

@kennytm (Member) commented Feb 7, 2020

@michaelwoerister the CI runs tests on ARM, x86 and WASM only, so no big-endian platforms.

@nnethercote force-pushed the speed-up-SipHasher128 branch from e606fe7 to a2da54b on February 10, 2020 03:25
@nnethercote (Contributor Author)

@michaelwoerister: After some thought I see that my original code was not correct for big-endian. Fortunately there is a cheap and easy fix.

Consider this 9 byte stream:

  • write_u32(0xDDCCBBAA)
  • write_u8(0xEE)
  • write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]). SipHash parses the input stream as 8-byte little-endian words, so it must process the first 8 bytes of this stream as 0xHHGGFF_EE_DDCCBBAA, and the second 8 bytes would be 0x??????????????_II.

On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). SipHash parses the input stream as 8-byte little-endian words, so it must process the first 8 bytes of this stream as 0xGGHHII_EE_AABBCCDD, and the second 8 bytes would be 0x??????????????_FF.

The new short_write works correctly for little-endian, i.e. given the above write_u32/write_u8/write_u32 sequence, the first 8 byte value produced from the stream is 0xHHGGFF_EE_DDCCBBAA, with 0xII leftover.

To make it work for big-endian, we just need to call to_le to do a byte-swap on the integer inputs in write_* before doing anything else, and then things work out. E.g. it's easy to see that the 8 byte value 0xHHGGFF_EE_DDCCBBAA becomes 0xGGHHII_EE_AABBCCDD (with 0xFF left over) when the individual integers are byte-swapped, and that's the value we want on big-endian.

I have updated the PR to do this. I haven't tested it on a big-endian machine but I'm fairly confident it's correct. But this stuff is tricky to think about so, again, I'm happy to hear second opinions.
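As a tiny illustration of that fix (a hypothetical snippet, just showing the to_le behaviour being relied on here):

fn main() {
    let x: u32 = 0xDDCC_BBAA;
    // `to_le` is a no-op on little-endian targets and a byte swap on
    // big-endian ones, so the value handed to `short_write` packs the same
    // byte stream on both.
    #[cfg(target_endian = "little")]
    assert_eq!(x.to_le(), 0xDDCC_BBAA);
    #[cfg(target_endian = "big")]
    assert_eq!(x.to_le(), 0xAABB_CCDD);
}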

@nnethercote force-pushed the speed-up-SipHasher128 branch from a2da54b to 7edcdc8 on February 10, 2020 03:31
@nnethercote force-pushed the speed-up-SipHasher128 branch from 7edcdc8 to f8a0286 on February 10, 2020 04:54
@michaelwoerister (Member)

Note that we already call to_le() one layer above, in StableHasher. That's probably sufficient then? It's annoying that we don't have tests running on big-endian platforms :/

@michaelwoerister (Member)

OK, I wrote the following test program that compares hash values before and after this PR:

https://github.com/michaelwoerister/sip-endian/blob/master/main.rs

On a little endian machine everything works as expected. However, when I tried it on a big endian machine (gcc110 from cfarm.tetaneutral.net), I got different values until I removed the to_le() calls from the PR's implementation. Once I did that the values matched those on the little endian machine (and those of the current implementation).

The requirement here is that the same sequence of write_xyz() calls with the same numeric values must produce the same final hash value, independent of endianness. For the current implementation this is achieved by treating everything as byte slices and making sure that all such slices are brought into a platform independent order (by calling to_le() in StableHasher).

However, the implementation in this PR does not operate on byte slices anymore, so there is no need to do the whole byte-swapping dance. The new short_write only uses bit operations and those are endian independent.

So the correct fix, in my opinion, is to remove the to_le() calls from both the short_write() invocations in SipHasher128 and from the write_xyz() calls in StableHasher. (Funnily enough, the current version of this PR probably works too because it swaps the bytes once in StableHasher and then swaps them back again in SipHasher128.)

@michaelwoerister (Member) commented Feb 10, 2020

Also, we should be able to replace u8to64_le() with something more straightforward that does just a single memcpy (and lets us get rid of this weird load_int_le! macro):

    /// Loads up to 7 bytes from a byte-slice into a u64.
    #[inline]
    fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
        assert!(len <= 8 && start + len <= buf.len());
        let mut out = 0u64;

        unsafe {
            let out_ptr = &mut out as *mut _ as *mut u8;
            ptr::copy_nonoverlapping(buf.as_ptr().offset(start as isize), out_ptr, len);
        }

        #[cfg(target_endian = "big")]
        {
            // If this is a big endian system we swap bytes, so that the first
            // byte ends up in the lowest order byte, like SipHash expects.
            out = out.swap_bytes();
        }

        out
    }

@nnethercote (Contributor Author)

I requested access to the GCC farm on Saturday, but I am still waiting for a response.

The requirement here is that the same sequence of write_xyz() calls with the same numeric values must produce the same final hash value, independent of endianness.

Hmm. I was using this code as the basis for my reasoning:

/// Writes a single `u16` into this hasher.
#[inline]
#[stable(feature = "hasher_write", since = "1.3.0")]
fn write_u16(&mut self, i: u16) {
    self.write(&i.to_ne_bytes())
}

/// Writes a single `u32` into this hasher.
#[inline]
#[stable(feature = "hasher_write", since = "1.3.0")]
fn write_u32(&mut self, i: u32) {
    self.write(&i.to_ne_bytes())
}

/// Writes a single `u64` into this hasher.
#[inline]
#[stable(feature = "hasher_write", since = "1.3.0")]
fn write_u64(&mut self, i: u64) {
    self.write(&i.to_ne_bytes())
}

/// Writes a single `u128` into this hasher.
#[inline]
#[stable(feature = "i128", since = "1.26.0")]
fn write_u128(&mut self, i: u128) {
    self.write(&i.to_ne_bytes())
}

/// Writes a single `usize` into this hasher.
#[inline]
#[stable(feature = "hasher_write", since = "1.3.0")]
fn write_usize(&mut self, i: usize) {
    self.write(&i.to_ne_bytes())
}

The use of to_ne_bytes shows that, by default, for a given sequence of write_xyz calls, any hasher will give different results on little-endian vs. big-endian. Going back to my example:

  • write_u32(0xDDCCBBAA)
  • write_u8(0xEE)
  • write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]). On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). Clearly the results will be different.

I was taking this equivalence to be axiomatic (i.e. required). But it makes sense that StableHasher requires the same results on little-endian and big-endian, therefore it must violate this equivalence. I guess that's ok, so long as it's consistent?

But should SipHasher128 violate this equivalence? Likewise, what about SipHasher in core? I'm not sure. My instinct is that SipHasher128/SipHasher should not violate the equivalence, in which case the endian-independence should be provided by StableHasher -- and it currently does this by using to_le in its write_xyz methods.
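For illustration, a hypothetical sketch of that split of responsibilities (not the actual StableHasher code): the inner hasher keeps the libstd native-endian behaviour, and the wrapper converts each integer to little-endian before writing, so the inner hasher sees the same byte stream on every platform.

use std::hash::Hasher;

// Hypothetical StableHasher-like wrapper around any libstd-style hasher.
struct StableLike<H: Hasher> {
    inner: H,
}

impl<H: Hasher> StableLike<H> {
    fn write_u32(&mut self, i: u32) {
        // On big-endian `to_le` byte-swaps, so `inner` ends up hashing the
        // same bytes as it would on a little-endian machine.
        self.inner.write_u32(i.to_le());
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    let mut h = StableLike { inner: DefaultHasher::new() };
    h.write_u32(0xDDCC_BBAA);
    println!("{:x}", h.inner.finish());
}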

On a little endian machine everything works as expected. However, when I tried it on a big endian machine (gcc110 from cfarm.tetaneutral.net), I got different values until I removed the to_le() calls from the PR's implementation. Once I did that the values matched those on the little endian machine (and those of the current implementation).

Thank you for doing this checking. Here's what I was expecting from SipHasher128:

  • le-old-code == le-new-code
  • be-old-code == be-new-code
  • le-old-code != be-old-code
  • le-new-code != be-new-code

Can you write down the values you got for the four combinations?

For StableHasher, I would expect:

  • le-old-code == le-new-code == be-old-code == be-new-code

because of the extra to_le in StableHasher::write_xyz.

@nnethercote (Contributor Author) commented Feb 11, 2020

In case it helps, here is what I think should happen in the four SipHasher128 cases for the above example.

-----------------------------------------------------------------------------
little-endian
-----------------------------------------------------------------------------
SipHasher128, old code
- write_u32(0xDDCCBBAA)
  - short_write([AA, BB, CC, DD])
  - needed = 8, fill = 4
  - self.tail |= u8to64_le(msg, 0, 4) << 0 --> 0xDDCCBBAA
- write_u8(0xEE)
  - short_write([EE])
  - needed = 4, fill = 1
  - self.tail |= u8to64_le(msg, 0, 1) << 4*8 --> 0xEE_DDCCBBAA
- write_u32(0xIIHHGGFF)
  - short_write([FF, GG, HH, II])
  - needed = 3, fill = 3
  - self.tail |= u8to64_le(msg, 0, 3) << 5*8 --> 0xHHGGFF_EE_DDCCBBAA
  - process
  - self.tail = u8to64_le(msg, 3, 1) --> 0xII

SipHasher128, new code
- write_u32(0xDDCCBBAA)
  - short_write(0x00000000_DDCCBBAA)
  - needed = 8, fill = 4
  - self.tail |= x << 0 --> 0xDDCCBBAA
- write_u8(0xEE)
  - short_write(0x00000000_000000EE)
  - needed = 4, fill = 1
  - self.tail |= x << 4*8 --> 0xEE_DDCCBBAA
- write_u32(0xIIHHGGFF)
  - short_write(0x00000000_IIHHGGFF)
  - needed = 3, fill = 3
  - self.tail |= x << 5*8 --> 0xHHGGFF_EE_DDCCBBAA
  - process
  - self.tail = x >> 3*8 --> 0xII

-----------------------------------------------------------------------------
big-endian
-----------------------------------------------------------------------------
SipHasher128, old code
- write_u32(0xDDCCBBAA)
  - short_write([DD, CC, BB, AA])
  - needed = 8, fill = 4
  - self.tail |= u8to64_le(msg, 0, 4) << 0 --> 0xAABBCCDD
- write_u8(0xEE)
  - short_write([EE])
  - needed = 4, fill = 1
  - self.tail |= u8to64_le(msg, 0, 1) << 4*8 --> 0xEE_AABBCCDD
- write_u32(0xIIHHGGFF)
  - short_write([II, HH, GG, FF])
  - needed = 3, fill = 3
  - self.tail |= u8to64_le(msg, 0, 3) << 5*8 --> 0xGGHHII_EE_AABBCCDD
  - process
  - self.tail = u8to64_le(msg, 3, 1) --> 0xFF

SipHasher128, new code
- write_u32(0xDDCCBBAA)
  - short_write(0x00000000_AABBCCDD)    // was byte-swapped, then zero-extended
  - needed = 8, fill = 4
  - self.tail |= x << 0 --> 0xAABBCCDD
- write_u8(0xEE)
  - short_write(0x00000000_000000EE)    // was zero-extended
  - needed = 4, fill = 1
  - self.tail |= x << 4*8 --> 0xEE_AABBCCDD
- write_u32(0xIIHHGGFF)
  - short_write(0x00000000_FFGGHHII)    // was byte-swapped, then zero-extended
  - needed = 3, fill = 3
  - self.tail |= x << 5*8 --> 0xGGHHII_EE_AABBCCDD
  - process
  - self.tail = x >> 3*8 --> 0xFF

I have confirmed that the two little-endian cases are correct, I haven't been able to confirm the big-endian cases.

@nnethercote (Contributor Author)

>         #[cfg(target_endian = "big")]
>         {
>             // If this is a big endian system we swap bytes, so that the first
>             // byte ends up in the lowest order byte, like SipHash expects.
>             out = out.swap_bytes();
>         }
>
>         out

This whole snippet can be simplified to out.to_le().
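i.e. something along these lines (the snippet above with the cfg block collapsed into to_le; a sketch, not necessarily the exact final code):

/// Loads up to 8 bytes from a byte-slice into a u64.
#[inline]
fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
    assert!(len <= 8 && start + len <= buf.len());
    let mut out = 0u64;
    unsafe {
        let out_ptr = &mut out as *mut u64 as *mut u8;
        std::ptr::copy_nonoverlapping(buf.as_ptr().add(start), out_ptr, len);
    }
    // `to_le` is a no-op on little-endian and a byte swap on big-endian,
    // which is exactly what the explicit cfg(target_endian = "big") block did.
    out.to_le()
}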

@nnethercote (Contributor Author)

@michaelwoerister: I have added some debugging eprintln statements to your code:
https://github.com/nnethercote/sip-endian/tree/add-some-printlns

This is the output I get on little-endian:

old early  : 0xddccbbaa, 4
old early  : 0xeeddccbbaa, 5
old process: 0x345678eeddccbbaa, 5
old spill  : 0x12, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xddccbbaa, 4
new early  : 0xeeddccbbaa, 5
new process: 0x345678eeddccbbaa, 5
new spill  : 0x12, 1
new: 20f554e44fa4ca9 d68f01a898684a41

This is the output I expect on big-endian:

old early  : 0xaabbccdd, 4
old early  : 0xeeaabbccdd, 5
old process: 0x563412eeaabbccdd, 5
old spill  : 0x78, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xaabbccdd, 4
new early  : 0xeeaabbccdd, 5
new process: 0x563412eeaabbccdd, 5
new spill  : 0x78, 1
new: <something> <something>

Can you check the big-endian results?

@michaelwoerister (Member)

Here is what I get on the big endian machine:

old early  : 0xddccbbaa, 4
old early  : 0xeeddccbbaa, 5
old process: 0x345678eeddccbbaa, 5
old spill  : 0x12, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xddccbbaa, 4
new early  : 0xeeddccbbaa, 5
new process: 0x345678eeddccbbaa, 5
new spill  : 0x12, 1
new: 20f554e44fa4ca9 d68f01a898684a41

@michaelwoerister (Member) commented Feb 11, 2020

So it looks the same as on little-endian. This is what I expected because the code in question operates on integers, not on byte sequences, i.e. the number 0xaabbccdd might have a different memory layout on big-endian, but it still has the same numeric value and print!("{:x}", n) prints the same on all architectures.

The use of to_ne_bytes shows that, by default, for a given sequence of write_xyz calls, any hasher will give different results on little-endian vs. big-endian. Going back to my example:

write_u32(0xDDCCBBAA)
write_u8(0xEE)
write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]). On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). Clearly the results will be different.

I see, that is interesting. I didn't know that the libstd implementation worked this way. It's clear that that must give different results depending on endianness. At the same time StableHasher must give the same result on all platforms for that sequence of calls. I think it's fine for SipHasher128 to handle this differently than libstd, as long as we document it. I don't think there's an actual requirement that write_u32() corresponds to hashing any specific sequence of bytes. The only real requirement I see for the generic Hasher is that any sequence of calls deterministically results in the same hash value on the same platform (i.e. the minimum requirement for making it usable with a hashtable). I think the main reason the standard library hashes things in native byte order is performance, not because it's a strict requirement.

So I think our options for SipHasher128 are:

  1. Don't do any endianness conversions on short_write arguments and rely on short_write to be implemented in an endian independent way (which it is as long as it only does bitwise and arithmetic operations).
  2. Make short_write take a byte slice again and then make sure that StableHasher makes things endian independent by always converting to little endian. (~= the current implementation)
  3. Try to make SipHasher128 behave exactly the same way as std::hash::Hasher (i.e. giving different results depending on endianness) while still using integer arguments for short_write and then let StableHasher pre-process the integers in a way that leads to endian independent hash values. (~= the current version of this PR?)

I prefer option (1) as it is just simpler.

This whole snippet can be simplified to out.to_le().

Yeah, I know. I just find to_le() confusing in most contexts. E.g. why does x.to_le().to_le() give me big-endian encoding on a big endian system? I personally prefer to call swap_bytes() which is just more explicit. What I usually really want is to_le_bytes(), which makes a lot more sense to me. Anyway, if you strongly prefer to_le() to my more verbose version, I won't fight you on it. I mostly want to get rid of the weird sequence of if statements in u8to64_le.

@nnethercote (Contributor Author)

So it looks the same as on little-endian. This is what I expected because the code in question operates on integers, not on byte sequences, i.e. the number 0xaabbccdd might have a different memory layout on big-endian, but it still has the same numeric value and print!("{:x}", n) prints the same on all architectures.

The new code operates on integers, so I agree that the new short_write is endian-independent; that's why I added the to_le calls in the write_xyz methods, so that the results on big-endian would be different to little-endian, as is done for libstd.

But the old code involves byte sequences in short_write_gen/short_write, and so should not be endian-independent. And the fact that StableHasher has the to_le calls makes sense with this theory -- because SipHasher128 is not endian-independent, StableHasher::write_xyz has to byte-swap on big-endian to be endian-independent.

I just looked more closely at your sip-endian code and I now understand why it gives the same results for all four cases: le-old-code, le-new-code, be-old-code, be-new-code. You didn't copy the SipHasher128 implementations exactly -- you added to_le calls to the old write_xyz functions, and removed them from the new write_xyz functions! So you effectively emulated StableHasher, which is supposed to get the same result on big-endian and little-endian. This shows that the PR as written is correct, yay! (If you undo those changes and re-run, you should get the big-endian outputs I predicted above, starting with old early : 0xaabbccdd, 4.)

I see, that is interesting. I didn't know that the libstd implementation worked this way. It's clear that that must give different results depending on endianess. At the same time StableHasher must give the same result on all platforms for that sequence of calls.

Yes.

I think it's fine for SipHasher128 to handle this differently than libstd, as long as we document it.

I'd prefer it to handle it the same way...

So I think our options for SipHasher128 are:

  1. Don't do any endianness conversions on short_write arguments and rely on short_write to be implemented in an endian independent way (which it is as long as it only does bitwise and arithmetic operations).

  2. Make short_write take a byte slice again and then make sure that StableHasher makes things endian independent by always converting to little endian. (~= the current implementation)

  3. Try to make SipHasher128 behave exactly the same way as std::hash::Hasher (i.e. giving different results depending on endianness) while still using integer arguments for short_write and then let StableHasher pre-process the integers in a way that leads to endian independent hash values. (~= the current version of this PR?)

I prefer option (1) as it is just simpler.

I prefer option (3). My desires are:

  • I don't want to change SipHasher's current behaviour, which is the libstd behaviour (i.e. different results on big-endian and little-endian), because it's exposed to every Rust program and so changing it seems like a very bad idea.
  • I want SipHasher and SipHasher128 to be as similar as possible.
    • Because the latter was clearly derived from the former and I want that derivation to be obvious.
    • Because subtle (i.e. big-endian-only) differences between the two could be confusing.
    • Because I want SipHasher to get the same speed-ups that SipHasher128 is getting.

The only way to satisfy all of these is via (3), which the current PR code implements. The downside is extra to_le calls in both StableHasher::write_xyz and SipHasher128::write_xyz, but I think that's reasonable to satisfy the desires above.

Does that sound reasonable?

I just find to_le() confusing in most contexts. E.g. why does x.to_le().to_le() give me big-endian encoding on a big endian system? I personally prefer to call swap_bytes() which is just more explicit.

That's interesting. I find lots of things about little-endian/big-endian confusing, but I don't have trouble with to_le. It's just a no-op on little-endian and a byte-swap on big-endian. So I do prefer the to_le form. I definitely agree that either version is a clear improvement to u8to64_le and that eliminating load_int_le! is a good thing.

@nnethercote (Contributor Author)

One more tidbit: the unit tests have a test of short_write that checks that it has the libstd behaviour, i.e. different results on big-endian vs little-endian. Line 412 is the endian-dependent one:

#[test]
fn test_write_short_works() {
    let test_usize = 0xd0c0b0a0usize;
    let mut h1 = SipHasher128::new_with_keys(0, 0);
    h1.write_usize(test_usize);
    h1.write(b"bytes");
    h1.write(b"string");
    h1.write_u8(0xFFu8);
    h1.write_u8(0x01u8);
    let mut h2 = SipHasher128::new_with_keys(0, 0);
    h2.write(unsafe {
        slice::from_raw_parts(&test_usize as *const _ as *const u8, mem::size_of::<usize>())
    });
    h2.write(b"bytes");
    h2.write(b"string");
    h2.write(&[0xFFu8, 0x01u8]);
    assert_eq!(h1.finish128(), h2.finish128());
}

So the test confirms that the libstd behaviour is intended.

This makes it faster and also changes it to a safe function. (Thanks to
Michael Woerister for the suggestion.) `load_int_le!` is also no longer
necessary.
@nnethercote (Contributor Author)

@michaelwoerister: I have left the previous changes in place, because I think option (3) is the best. I have also added another commit that makes the u8to64_le changes you suggest. I think this is in a good enough state to land, though if you are able to do a full test run on a big-endian machine that would be welcome.

@michaelwoerister (Member) commented Feb 12, 2020

The only way to satisfy all of these is via (3), which the current PR code implements. The downside is extra to_le calls in both StableHasher::write_xyz and SipHasher128::write_xyz, but I think that's reasonable to satisfy the desires above.

Does that sound reasonable?

Yes, I'm OK with that.

That's interesting. I find lots of things about little-endian/big-endian confusing, but I don't have trouble with to_le. It's just a no-op on little-endian and a byte-swap on big-endian.

I think what I find even more confusing about to_le is when it is used as a substitute for converting from little endian to native, i.e.:

// We have some bytes that encode the number 1 as a 32 bit integer in LE format.
let le_bytes = [1u8, 0, 0, 0];

// Load the bytes into a `u32` verbatim
let x: u32 = unsafe { std::ptr::read_unaligned(le_bytes.as_ptr() as *const u32) };

// On a big endian machine x is now the number 16777216, so we convert
// to big endian by, obviously, calling `to_le()`
x.to_le()

It does the right thing but it spells the opposite of what it does. What I usually really want, and I'm glad that Rust now has it, is from_le_bytes():

// We have some bytes that encode the number 1 as a 32 bit integer in LE format, 
let le_bytes = [1, 0, 0, 0];

// So much nicer!
let x = u32::from_le_bytes(le_bytes);

I would actually prefer implementing u8to64_le in terms of from_le_bytes if that optimizes as well as the current version. But I don't want to block this PR on it. Thanks for looking into things so thoroughly!

@bors r+

@bors commented Feb 12, 2020

📌 Commit 9aea154 has been approved by michaelwoerister

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) Feb 12, 2020
@bors commented Feb 12, 2020

💡 This pull request was already approved, no need to approve it again.

@bors commented Feb 12, 2020

📌 Commit 9aea154 has been approved by michaelwoerister

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Feb 12, 2020
…r=michaelwoerister

Speed up `SipHasher128`.

bors added a commit that referenced this pull request Feb 12, 2020
Rollup of 8 pull requests

Successful merges:

 - #67585 (Improve `char::is_ascii_*` codegen)
 - #68914 (Speed up `SipHasher128`.)
 - #68994 (rustbuild: include channel in sanitizers installed name)
 - #69032 (ICE in nightly-2020-02-08: handle TerminatorKind::Yield in librustc_mir::transform::promote_consts::Validator method)
 - #69034 (parser: Remove `Parser::prev_token_kind`)
 - #69042 (Remove backtrace header text)
 - #69059 (Remove a few unused objects)
 - #69089 (Properly use the darwin archive format on Apple targets)

Failed merges:

r? @ghost
@bors merged commit 9aea154 into rust-lang:master Feb 12, 2020
@nnethercote deleted the speed-up-SipHasher128 branch February 12, 2020 21:43
@nnethercote (Contributor Author)

I would actually prefer implementing u8to64_le in terms of from_le_bytes

from_le_bytes takes a [u8; 8] argument, so I'm having trouble seeing how you would write u8to64_le with it. I might be overlooking something.
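For illustration only, one hypothetical shape would be to copy the input bytes into a zeroed 8-byte array first; whether that optimizes as well as copy_nonoverlapping into a u64 is exactly the question:

// Hypothetical: stage the `len` input bytes in a zeroed [u8; 8], then let
// `from_le_bytes` do the endian-aware conversion.
fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
    assert!(len <= 8 && start + len <= buf.len());
    let mut bytes = [0u8; 8];
    bytes[..len].copy_from_slice(&buf[start..start + len]);
    u64::from_le_bytes(bytes)
}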

nnethercote added a commit to nnethercote/rust that referenced this pull request Feb 20, 2020
`SipHasher128`'s `u8to64_le` function was simplified in rust-lang#68914.
Unfortunately, the new version is slower, because it introduces `memcpy`
calls with non-statically-known lengths.

This commit reverts the change, and adds an explanatory comment (which
is also added to `libcore/hash/sip.rs`). This barely affects
`SipHasher128`'s speed because it doesn't use `u8to64_le` much, but it
does result in `SipHasher128` once again being consistent with
`libcore/hash/sip.rs`.
bors added a commit that referenced this pull request Feb 22, 2020
…lwoerister

Revert `u8to64_le` changes from #68914.

nnethercote added a commit to nnethercote/rust that referenced this pull request Feb 24, 2020
This commit changes `sip::Hasher` to use the faster `short_write` approach
that was used for `SipHasher128` in rust-lang#68914.

This has no effect because `sip::Hasher::short_write` is currently
unused. See the next commit for more details, and a fix.

(One difference with rust-lang#68914 is that this commit doesn't apply the
`u8to64_le` change from that PR, because I found it is slower, because
it introduces `memcpy` calls with non-statically-known lengths.
Therefore, this commit also undoes the `u8to64_le` change in
`SipHasher128` for this reason. This doesn't affect `SipHasher128` much
because it doesn't use `u8to64_le` much, but I am making the change to
keep the two implementations consistent.)