Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison to twox-hash? #10

Closed
stepancheg opened this issue Nov 27, 2021 · 12 comments
Closed

Comparison to twox-hash? #10

stepancheg opened this issue Nov 27, 2021 · 12 comments

Comments

@stepancheg
Copy link

Would be helpful to mention in readme how this is different from twox-hash. Is it faster? Is it easier to use? Or just an alternative implementation?

@DoumanAsh
Copy link
Owner

DoumanAsh commented Nov 27, 2021

I didn't do proper benchmarks, only some for my own reference.
twox-hash uses older implementation as basis so it lacks proper xxh3 implementation.

In addition to that it doesn't provide one-shot version of hashing algorithm which tends to be times faster than streaming one.

In general xxhash-rust closely matches to original C implementation in terms of performance, although there can be some minor performance degradations which I didn't really investigate.

So to answer concrete questions:

Is it faster?

Generally speaking yes (albeit in streaming variants of xxh64 you only gain marginal improvement)

Is it easier to use?

Hashing functions are pretty trivial to use in general.
So it is pretty much similar you have single function for one-shot variant and structs that implement Hash trait

Or just an alternative implementation?

It is just implementation of xxhash algorithm.
But it is not just simple alternative because it provides you with const fn variants of all algorithms

@stepancheg
Copy link
Author

it doesn't provide one-shot version of hashing algorithm

Looks like it does:

https://github.com/shepmaster/twox-hash/blob/b312df680851a11be7d77ecd07615b9f86cd383c/src/xxh3.rs#L44

@DoumanAsh
Copy link
Owner

Ah my bad, for some reason I thought it doesn't.
Well then there is that

@firasuke
Copy link

I'm interested in comparison benchmarks between the two crates.

@DoumanAsh
Copy link
Owner

DoumanAsh commented Feb 13, 2022

@firasuke The only algorithm that is worth benchmarking is xxh3, which is not available in its latest version for twox-hash so there is really not much to benchmark(twox-hash uses WIP version of xxh3)
Sorry I kinda missed your message

P.s. Just for reference, I strive to achieve as much parity with C implementation as possible (which is at the current moment very close to perfect)

@firasuke
Copy link

@DoumanAsh Thanks for the reply, and no worries!

I already switched to xxhash-rust as it felt more minimal and had XXH3_128bits working like a charm.

@josephg
Copy link

josephg commented May 10, 2022

The only algorithm that is worth benchmarking is xxh3,

I'm not super familiar with xxh3 (or xxhash in general). I want a good hashing function for data integrity - potentially to replace CRC32. Why xxh3?

@DoumanAsh
Copy link
Owner

In first place CRC32 is not intended as general purpose hashing function so I can only suggest to use xxhash if you need non-cryptographic hashing function.

xxh3 should be your choice due to improved performance comparing to older hashing functions in xxh family.
you might consider xxh32 on platforms where you lack 64bit integers or even 32bit integers

@josephg
Copy link

josephg commented May 10, 2022

Yeah; I'm just throwing a checksum at the end of a new binary file format (for CRDTs) for the sake of detecting silent data corruption. CRC32c is probably fine, but if xxhash would be a better choice, then nows the time to make that decision. xxh3 looks good on paper, but I'm also targetting wasm where code size matters. And xxh32 adds much less code size compared to xxh3. A 32 bit checksum is plenty for what I'm using it for.

But if xxh3 is "the future" then I'm worried that xxh32 might stay niche and not be widely implemented going forward? Stability & portability is very important. O_o I have no idea what the best call is here.

@DoumanAsh
Copy link
Owner

DoumanAsh commented May 10, 2022

xxh32/xxh64 are good for code size I suppose, but if you want something akin to checksum you need to reduce potential for collisions
Only xxh3 with 128bit variant allows it (128bit is wide enough for collisions to be non-existing in case of xxh3 while 64bit variant has some small chance of collision), I do not think xxh3 code is all that big honestly, not to mention I specifically designed library to enable/disable specific algorithms.

Considering your purpose is to use hash as checksum, I believe xxh3 with 128bit variant is the best option.
Hash is not inherently designed to be used as checksum, but as long as collisions are non-existing it should work for your case
See wiki of xxhash original implementation
https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#testing-128-bit-hashes-=

P.s. 32bit hash is not enough for your purpose, do not look at code size, WASM is not optimized for that anyway

P.s.s if you worry about code size, always enable optimizations for code size:

[profile.release]
opt-level = "z"     # Optimize for size.
lto = true          # Enable Link Time Optimization
codegen-units = 1   # Reduce number of codegen units to increase optimizations.
panic = "abort"     # Abort on panic
strip = true        # Automatically strip symbols from the binary.

@josephg
Copy link

josephg commented May 12, 2022

Thanks! I've spent a lot of time tweaking compilation flags. My CRDT library is ~250kb, and I'm keeping a careful eye on it because I want it to get smaller, not bigger.

In theory TCP / filesystems should keep data consistent anyway. The checksum is really mostly there out of paranoia. Years ago I read that checksums fail in practice (even in situations they shouldn't) much more often than you expect. So I'm a bit curious to see if / when my checksums start failing.

CRDT patches get tiny sometimes since we send one patch per keystroke. A 128 bit hash would make sense for a 1mb file or something, but when I'm sending 20 bytes of patch content, adding 16 bytes of 128-bit xxh3 hash seems excessive.

I'll stick with crc32c for now. Thanks for humoring me! I really appreciate it!

@DoumanAsh
Copy link
Owner

Ah I understand, in this case yeah the best is to choose between xxh32 and crc32.
Unfortunately I'm not really familiar with their collision rates, both should have problem like that due to 32bit not being enough for perfect hash, but which is worse.... that's unclear to me.
So yeah sticking to crc32 is probably fine since it was designed for checksums rather than general purpose hashes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants