pseudorandom probing for hash collision #13418
Conversation
Here are some superficial comments on your WIP status. Feel free to ignore them if you don't like them.
Superb work, should also be backported, if possible. @narimiran
Awesome!
```nim
proc sortedPairs*[T](t: T): auto =
  ## helps when writing tests involving tables in a way that's robust to
  ## changes in hashing functions / table implementation.
  toSeq(t.pairs).sorted

template sortedItems*(t: untyped): untyped =
  ## helps when writing tests involving tables in a way that's robust to
  ## changes in hashing functions / table implementation.
  sorted(toSeq(t))
```
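For context, a self-contained example of how these helpers make table tests robust to iteration order (the helper bodies are repeated inline here rather than imported from any particular module):

```nim
import tables, algorithm, sequtils

proc sortedPairs[T](t: T): auto = toSeq(t.pairs).sorted
template sortedItems(t: untyped): untyped = sorted(toSeq(t))

let t = {"b": 2, "a": 1, "c": 3}.toTable
# stable regardless of hash()/iteration order, unlike comparing against `$t`
doAssert t.sortedPairs == @[("a", 1), ("b", 2), ("c", 3)]
doAssert sortedItems(t.values) == @[1, 2, 3]
```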
IMO you should use `toSeq(t.pairs).sorted` and `sorted(toSeq(t))` in the tests instead. This is bloat IMO and saves like 5 keypresses.
- sure, it's a bit "bloaty" but it simplifies tests (stdlib + other ones) and encourages using it instead of flaky alternatives like `$t`, which has caused much churn in the past every time hashes/table/etc algorithms change; I've had to fix a number of tests (both in stdlib + in other nimble packages) to make those tests non-flaky; this simplifies that.
- it's a bit more than 5 keypresses in practice: `toSeq(t.pairs).sorted` prevents using method call syntax/UFCS (because `pairs` is an iterator), but `sortedPairs` can be used with MCS; in practice tests often involve expressions to define `t`, so having MCS makes the test simpler to read/write
- (also, it needs 1 import (`testutils`) instead of 2 imports (`algorithm, sequtils`) to achieve this goal, but that argument is very minor; in practice `import sequtils` can cause further issues (see nim issues involving `import sequtils except toSeq`), so it'd in fact be `import algorithm`, `from sequtils import toSeq`, etc)

all those complications I've had in the past when fixing tests led me to the simple realization that testutils is needed.
As I said, use `toSeq(...)` in the tests. Even if it is only "a bit bloaty", it is too much to be a candidate for adding to the stdlib IMO. All the other arguments are very minor, as you admit yourself. Please remove this module.
Looking over the one set of benchmarks I could find which you used to justify "low performance impact" (https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim, right???), I see only tests with like a million 8 byte keys in hot loops. That's 100% in L3 cache, which has approximately 10X better latency than dynamic RAM. In short, it doesn't really test anything interesting about this table structure change.
Perhaps more constructively - Robin Hood hashing is honestly pretty simple, arguably simpler than the explanation for this new probe sequence. It just keeps collision chains sorted by their "distance from their home cell" (where they would hash to in an empty table). That's it! You shift cells around to maintain that invariant. There are a lot of articles on this if you search for them. E.g., this http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/ . I've done it twice before, once in C and once in Nim. If you need me to, I can do a PR, but it might have to wait until next week.
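For readers unfamiliar with the scheme, here is a minimal sketch of Robin Hood insertion; it is not code from Nim or from any PR discussed here, the names `RhTable`/`put` are made up, and lookup, deletion, and resizing are omitted:

```nim
import hashes

type
  RhCell[K, V] = object
    hcode: Hash              # 0 marks an empty slot (a simplification)
    key: K
    val: V
  RhTable[K, V] = object
    data: seq[RhCell[K, V]]  # length kept a power of two

proc initRhTable[K, V](cap = 16): RhTable[K, V] =
  result.data = newSeq[RhCell[K, V]](cap)

proc dist(hcode: Hash; slot, mask: int): int =
  ## distance of `slot` from the home cell of `hcode`
  (slot - (int(hcode) and mask) + mask + 1) and mask

proc put[K, V](t: var RhTable[K, V]; key: K; val: V) =
  ## Robin Hood insertion: a probing entry that is farther from home than the
  ## resident entry steals the slot; the displaced entry keeps probing.
  ## (No resize logic: assumes the table never fills up.)
  let mask = t.data.len - 1
  var cell = RhCell[K, V](hcode: hash(key), key: key, val: val)
  var i = int(cell.hcode) and mask
  var d = 0
  while true:
    if t.data[i].hcode == 0:                       # empty slot: claim it
      t.data[i] = cell
      return
    if t.data[i].hcode == cell.hcode and t.data[i].key == cell.key:
      t.data[i].val = cell.val                     # update existing key
      return
    let dResident = dist(t.data[i].hcode, i, mask)
    if dResident < d:                              # resident is "richer": swap
      swap(t.data[i], cell)
      d = dResident
    i = (i + 1) and mask
    inc d

when isMainModule:
  var t = initRhTable[string, int]()
  t.put("a", 1)
  t.put("b", 2)
  t.put("a", 10)   # overwrites the earlier value for "a"
```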
Rust used a Robin Hood table for quite a few years (until it switched to the much more complex port of Google's "Swiss" Tables).
..and those "Swiss" tables mostly only get speed through vectorization tricks and much fine-tuning, which is an ongoing maintenance burden as CPUs evolve/diversify; also, IIRC, they do not have the minimum variance guarantees of Robin Hood or its consequent 90+% memory utilization. And RH has no extra hard-to-tune parameters. Same old "when to resize". That's it. RH is even friendly to being used as a disk-resident table, like B-Trees.

These days fast computers have a budget of like 500 cycles, multiplied by 4-way or 5-way superscalar issue, or 2000-2500 dynamic CPU instructions, per cache miss. At that scale, almost anything that is supposed to handle larger-than-L3-cache data should be done almost the same way as if you did it "on disk". Random probing decidedly does not have that trait.
Oh, and incidentally - to protect users who don't know how tables and hashes work from long collision chains, we should just count the loops made in the comparisons. Experts should be able to opt out with a special define. Then you can write out some kind of run-time warning message (probably exactly once per run).

Anyway, doing that, if you're a naive user and hashes misbehave then you get told what to do - use a better hash function. Also, if you didn't take care against adversaries and you got attacked, then this would at least write a message somewhere. E.g., if you had code not designed to be safe from attack that was then run in an unsafe environment.
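A minimal sketch of that idea; the threshold `tooDeepThresh`, the proc `reportDeepProbe`, and the opt-out define `-d:noDeepProbeWarning` are all hypothetical names, not an existing Nim API:

```nim
const tooDeepThresh = 64      # far above the ~log2(len) depth a decent hash gives

var warnedDeepProbe = false   # report at most once per run

proc reportDeepProbe(depth: int) =
  ## intended to be called from a table's raw lookup/insert loop
  when defined(noDeepProbeWarning):   # hypothetical expert opt-out switch
    discard
  else:
    if depth > tooDeepThresh and not warnedDeepProbe:
      warnedDeepProbe = true
      stderr.writeLine "table probe depth " & $depth &
        " is far above expectation; use a better hash() for these keys or a B-tree"

when isMainModule:
  for depth in [1, 8, 80]:
    reportDeepProbe(depth)    # only the 80 triggers the one-time warning
```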
I've also run tests on 20M keys and 100M keys (with key, val, hcode each int64, i.e. cell size with padding = 24B); the latter is 2.4GB, which definitely doesn't fit in my 16MB cache.

robin hood hashing

@c-blake if you think you can improve on the numbers I posted in #13440, by all means please do :) ; a draft PR is all that's needed for a start (i.e. no need to fix everything, just enough to run the benchmark I posted in that PR and compare apples to apples #13440 against robin hood hashing or swiss tables or your own version of it)

one concern: seems to me robin hood suffers from #13393 as well (e.g. try with keys of the form described in that issue)
As mentioned, tombstones cause performance lumpiness grief for steady-state, constant-membership, deletion-heavy workloads like a cache. Also, while I am generally in favor of data-drivenness and benchmarks, as mentioned, hot loop benchmarks can be so misleading that theoretical arguments tell you more when the primary cost driver is DIMM latency. The CPU branch predictor detects the hot loop and works ahead by 1000 instructions to mask dynamic RAM latency; that severely pollutes these numbers. Even careful benchmarking can mislead here, but you don't have to fly completely blind.

On my main box at home I have 65 ns latency and 32 GB/s bandwidth. So, a scan of up to 65*32=2080 bytes, or half a memory page, can be hidden. That is very fast RAM, though. Most would be more like (80-120)*(20..24) or 1600..2880 bytes. These are all really large regions for hash cells < 32B. Regular linear probing with a good hash has a longest probe depth per table of about log2(table size). So, for lookup latency, by far the most sensible thing to do is just fall back on "well, you cannot do better than 1 fetch + a linear-in-physical-memory scan, as long as the scan is short-ish". Your threshold thing kind of gets there, but tombstones create unnecessary aging/rehashing trouble and temptation to use slow probe sequences whose slowness is not detected by naive benchmarks, as happened here.

While you can fall down a rabbit hole of ever more complex probe sequences, and there are many papers on this, you know what? That's all about memory systems from 40+ years ago. None will be as optimized by the CPU+memory system as a linear scan. Fortunately or unfortunately, the entire CPU+memory system has been optimized to hide latency in benchmarks. Even having an isolated nextTry proc and tombstones and "ease of experimentation" is probably asking for trouble like what happened here. But yeah, you do need your hash to do some kind of minimal spreading out of the keys.

Appeal to "but Python did it this way" is also wrong for multiple reasons. Python was always slow and uses these things as part of its basic object model so much that it's all selected for in-L1-cache tiny tables rather than broad workloads. CPython also no longer even does it this way for tables, only sets. Also, some 1.5-2.0x performance hit at 75% full without the threshold and tombstones would not bother Python users anyway, who would "outsource" performance sensitive stuff to some more carefully done C thing.

This depend-on-the-whole-hash-code probe sequence doesn't help with attacks - attackers can just collide in the code instead of the modulus. We all get that. It only helps with weak user hashes, but those could also collide the whole code. So, it's still a band-aid on totally awful hashes, and @Araq is absolutely correct that the right solution for scaling guarantees is a B-tree. Some warning message like "far more than expected hash loops - use a better hash function or B-tree" might well direct someone to do exactly that.
Anyway, in short - A) your hot loops measure something not of interest besides the hot loop itself, so your 1.06x number is not quite meaningless but not what you should base decisions/comparisons upon, B) the real thing of interest is hard enough to measure that people agreeing on "realistic workloads" will be nearly impossible, C) there exists a simple alternative dodging all this, namely robin hood, used by performance freak/non-outsourcing Rustaceans for like 5 years until they moved to something with CPU architecture-specific tuning with its obvious ongoing maintenance burden, D) counting loops in rawGet or whatever and warning users 1) will always be necessary, 2) is super easy both to have and to disable and to regulate to avoid zillions of messages, and 3) can immediately guide them to a better solution - the only thing needing agreement is how the runtime should warn.
Oh yeah please!
The code duplication also helps. The compiler hashes Nim identifiers, not "random numbers", and so a specialized implementation can exploit this; a stdlib impl can't. The compiler also never deletes anything inside a table, so you can leave out the tombstone logic.
@c-blake you can take a look at this WIP for robin hood hashing: it doesn't yet allow bootstrapping nim from it, but it should be usable as a library, at least for benchmarking.

seems to work, but performance seems unequivocally worse compared to #13440 using the same benchmark as on that PR; of course that could very possibly be due to bugs in my implementation, which I wrote quickly as a POC; feel free to improve / come up with a better one in your own PR + hopefully report better performance numbers. As I mentioned earlier, it does suffer from #13393, and better hashing would also incur runtime cost.
I would hope, with all the work done to improve tables/hashes, that the stdlib implementation is now good enough for such use cases without any compromise on performance, and that at least some of (strtabs, compiler/astalgo, compiler/treetab, system/cellsets, pure/strtabs, sets) could simply reuse tables, or at least a common denominator that would allow fine tuning based on runtime/compile time properties, e.g. depthThres, mustRehash, fillRatio, canDeleteCell. I don't think this would create a monster, quite the opposite.
Maybe, but I don't see the point: instead of optimizing N different implementations you "optimize" a single one for diverse use cases. But this makes the problem much harder to solve; optimization is specialization. For example, cellsets hashes pointers, nothing else. The only reason so much effort has to be put into the stdlib's hash tables is that we cannot anticipate how they're used.
I agree having >1 entry point makes sense, and "optimization is specialization" is a very quotable saying. Some require no delete, which is another wrinkle besides the ones @Araq mentioned. I'm not sure the right number is the exact current number, but each candidate for conversion should be considered very carefully. Adversarial inputs or bad user hashes are other dimensions the impls do not all share. So, at least 3 bits/8 possibilities. Maybe 2 bits/4 if you want to coalesce adversarial & bad. There are surely others. There is this VLDB 2015 paper "A Seven-Dimensional Analysis of Hashing Methods" with a crazy complicated flowchart at the end. Nothing is perfect, but Robin Hood has the most "arrows in" of the bunch. (I also bet it'd be easy to poke holes in their benchmarks, though.)
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794. Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`.
* Fix `data.len` -> `dataLen` problem.
* This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib). Add version1 of Wang Yi's hash specialized to 8 byte integers. This gives simple help to users having trouble with overly colliding hash(key)s. I.e., A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table` or B) more globally, compile with `nim c -d:hashWangYi1`. No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy fall back via "grade school multi-digit multiplication" for weaker deployment contexts. Some people might want to stampede ahead unbridled, but my view is that a good plan is to A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out), B) have them report problems (if any), C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back. This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all *at least* pass the SMHasher suite since the idea here is to become more friendly to novices who do not generally understand hashing failure modes.
* Re-organize to work around `when nimvm` limitations; Add some tests; Add a changelog.md entry.
* Add less than 64-bit CPU when fork.
* Fix decl instead of call typo.
* First attempt at fixing range error on 32-bit platforms; Still do the arithmetic in doubled up 64-bit, but truncate the hash to the lower 32-bits, but then still return `uint64` to be the same. So, type correct but truncated hash value. Update `thashes.nim` as well.
* A second try at making 32-bit mode CI work.
* Use a more systematic identifier convention than Wang Yi's code.
* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a very long time, I think). `$a`/`$b` depend on iteration order which varies with table range reduced hash order which varies with range for some `hash()`. With 3 elements, 3!=6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test.
* Fix another stringified test depending upon hash order.
* Oops - revert the string-keyed test.
* Fix another stringify test depending on hash order.
* Add a better than always zero `defined(js)` branch.
* It turns out to be easy to just work all in `BigInt` inside JS and thus guarantee the same low order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity with C/C++ back ends cut off is 32, not 53 bits.) Since HashSet & Table only use the low order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends which seems a nice-to-have trait.
* These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections?
* Oops. That failure was from me manually patching string hash in hashes. Revert.
* Import more test improvements from #13410
* Fix bug where I swapped order when reverting the test. Ack.
* Oh, just accept either order like more and more hash tests.
* Iterate in the same order.
* `return` inside `emit` made us skip `popFrame` causing weird troubles.
* Oops - do Windows branch also.
* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic resolutions are "1 == identity", 1 for Nim Version 1, 1 for first/simplest/fastest in a series of possibilities. Should be very easy to remember.)
* Re-organize `when nimvm` logic to be a strict `when`-`else`.
* Merge other changes.
* Lift constants to a common area.
* Fall back to identity hash when `BigInt` is unavailable.
* Increase timeout slightly (probably just real-time perturbation of CI system performance).
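As a concrete illustration of the opt-in usage that commit message describes (assuming `hashWangYi1` is exported from `hashes` as stated there; `MyId` is just a made-up example key type whose values cluster badly):

```nim
import hashes, tables

type MyId = distinct int          # example key type with badly clustered values

proc hash(x: MyId): Hash = hashWangYi1(int(x))   # opt-in stronger integer hash
proc `==`(a, b: MyId): bool = int(a) == int(b)

var t = initTable[MyId, string]()
t[MyId(1 shl 20)] = "a"
t[MyId(2 shl 20)] = "b"           # low bits all zero: would cluster under identity hash
t[MyId(3 shl 20)] = "c"
doAssert t.len == 3
```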
* Add neglected API call `find` to heapqueue.
* Add a changelog.md entry, `since` annotation and rename parameter to be `heap` like all the other procs for consistency.
* Add missing import.
(EDIT: see followup #13440 with a "best of both worlds" approach and additional benchmarks)
A performance benchmark shows this can work; see below.

This PR investigates pseudorandom probing for hash collisions, as done in python.
The main idea is to use this recursion in `nextTry`:

`h = ((5*h) + 1 + perturb) and maxHash`

instead of linear probing:

`h = (h + 1) and maxHash`

consequences

For the why of `((5*h) + 1 + perturb) and maxHash`, refer to python internal docs (the cpython implementation of dictionaries) or other material which explains it pretty well; the basic idea is that the recursion is affected by all bits of hash(key) (so that different hash keys will not follow exactly the same path and clustering is avoided), and if a free slot still isn't found by the time perturb = 0, the recursion becomes `h = ((5*h) + 1) and maxHash`, which is guaranteed to visit each slot exactly once. The visit occurs in pseudorandom order, and that order depends on the full 64 bit hash(key). (This could be extended to `h = ((5*h) + 1 + perturb + seed) and maxHash`, with seed initialized upon application startup.)

consequence for deletion
We can no longer apply Algorithm R (Deletion with linear probing) from Knuth Volume 3 to our new recursion, since multiple unrelated paths can pass through a given index (unlike with linear probing `h = (h + 1) and maxHash`, or even `h = (h + c) and maxHash` with c constant). So I instead keep track of deletions (tombstones) and adapt the table implementation accordingly.

note:
this is a slight modification from python: I'm using `translateBits`, which results in large (eg 30x) speedups in cases like #13393 at 0 extra cost; it makes the hash influence the path earlier on. python doesn't do `translateBits` and does suffer from #13393 as well (although not catastrophically (eg 1000x), thanks to `PERTURB_SHIFT`), as you can see by trying the keys from that issue; but with this PR, this isn't the case.
TODO for future PRs

do the same with:
strtabs
compiler/astalgo.nim
compiler/treetab.nim
system/cellsets
(I don't quite understand why we have so many Table variants; seems to me like at least some of them could be reusing existing ones)
investigate whether the following refinement would improve speed even further: probe linearly for the first few slots and only then switch to the pseudorandom recursion, the idea being to improve cache locality unless we detect we're in a dense collision cluster, in which case we switch to pseudo-random probing to break out of the cluster. In some benchmarks at least this led to some speedups, but it'd require a bit of code refactoring to avoid introducing code duplication. So this is left for future work.
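A rough sketch of that refinement, with a hypothetical `linearTries` threshold (this is not code from this PR):

```nim
import hashes

const linearTries = 4   # hypothetical threshold: a few slots ~ one or two cache lines

proc nextTryHybrid(h, maxHash: Hash; depth: int; perturb: var uint64): Hash =
  ## first few probes are plain linear (cache friendly); after `linearTries`
  ## collisions, switch to the pseudorandom recursion to escape dense clusters
  if depth < linearTries:
    result = (h + 1) and maxHash
  else:
    result = Hash((5 * uint64(h) + 1 + perturb) and uint64(maxHash))
    perturb = perturb shr 5
```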
[EDIT] OrderedTable.del is O(n) (including before this PR), because every `del` rebuilds the whole table; instead, using tombstones as introduced in this PR would make it O(1)

[EDIT] support `dec`, and having zero counts, in `CountTable` using tombstones

SharedTable should auto-init, like other tables; it should have some iterators, and simply ask the user to protect access via `withLock` instead of the current API, which is rather cumbersome and unlike other tables
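A hypothetical sketch of the suggested usage style, using a plain `Table` plus an explicit lock as a stand-in for `SharedTable` (this is not the current `SharedTable` API; the idea is that the caller brackets compound operations in one critical section, e.g. via a `withLock`-style template, rather than using callback-based accessors):

```nim
import tables, locks

var
  guard: Lock                        # one lock protecting every access to `shared`
  shared = initTable[string, int]()  # stand-in for an auto-initialized shared table

initLock(guard)

proc bump(key: string) =
  ## a compound read-modify-write done under a single caller-held lock
  acquire(guard)
  try:
    shared[key] = shared.getOrDefault(key) + 1
  finally:
    release(guard)

bump("requests")
bump("requests")
doAssert shared["requests"] == 2
```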