Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pseudorandom probing for hash collision #13418

Merged
merged 19 commits into from
Feb 19, 2020

Conversation

timotheecour
Copy link
Member

@timotheecour timotheecour commented Feb 16, 2020

(EDIT: see followup #13440 with a "best of both worlds" approach and additional benchmarks)

performance benchmark shows this can work, see below.

this investigates pseudorandom probing for hash collisions, as done in python.
The main idea is to use this recursion in nextTry:

perturb = perturb shr PERTURB_SHIFT # with perturb initialized to hash(key) before recursion
h = ((5*h) + 1 + perturb) and maxHash

instead of linear probing:
h = (h + 1) and maxHash

consequences

  • at least similar (within 10%) performance in all cases I tested compared to before PR (see fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410 for test cases)
  • no catastrophic slowdowns (issue Slow insertion of uint in Tables and Sets Collections depending on the value #13393) which were causing 100x-1000x (and likely unbounded) slowdowns
  • hash key computation is now only required to produce good hash keys modulo 64 bit, not modulo n bits for n<64, since the whole hash key is used in probing sequence. That means hash key computation for primitive types can remain fast.
  • for theory on justification of ((5*h) + 1 + perturb) and maxHash, refer to python internal docs (cpython implementation of dictionaries) or other material which explains pretty well; the basic idea is that ((5*h) + 1 + perturb) and maxHash recursion is affected by all bits of hash(key) (so that different hashkeys will not follow exactly the same path and clustering is avoided), and if a free slot still isn't found by the time perturb = 0, then the recursion becomes h = ((5*h) + 1) and maxHash which is guaranteed to visit each slot exactly once. The visit occurs in pseudorandom order, and that order depends on the full 64 bit hash(key).
  • should be resilient to adversarial attacks (but can still be made more resilient with adding an (unguessable) fixed seed in the recursion, eg: h = ((5*h) + 1 + perturb + seed) and maxHash with seed initialized upoon application startup)

consequence for deletion

  • there is no way to adapt Algorithm R (Deletion with linear probing) from Knuth Volume 3 to our new recursion, since multiple unrelated paths can pass through a given index (unlike with linear probing h = (h + 1) and maxHash or even h = (h + c) and maxHash with c constant). So I instead keep track of deletions, and adapt the table implementation as follows:
  • each cell is in 3 states: empty (hash=0), deleted(h=-1), valid(h!=0, h!=1)
  • rehash when count(empty) < 1/3 numSlots

note:

this is a slight modification from python: I'm using translateBits which results in large (eg 30x) speedups in cases like #13393 at 0 extra cost; it makes the hash influence the path earlier on. python doesn't do translateBits and does suffer from #13393 (although not catastrophically (eg 1000x) thanks to PERTURB_SHIFT) as well as you can see with:

  n = 10_000_000
  t={}
  for i in range(n):
    key = i << 32 # 68 sec 
    # key = i # 3 sec
    t[key] = str(key)

but with this PR, this isn't the case.

TODO for future PR's

  • do same with:
    strtabs
    compiler/astalgo.nim
    compiler/treetab.nim
    system/cellsets
    (I don't quite understand why we have so many Table variants; seems to me like at least some of them could be reusing existing ones)

  • investigate whether following refinement would improve speed even further:

  if iteration < thres: # eg thres = 4
    result = (h + 1) and maxHash
  else:
    perturb = perturb shr PERTURB_SHIFT
    result = ((5*h) + 1 + perturb) and maxHash
  iteration.inc

with the idea being to improve cache locality unless we're detecting we're in a dense collision cluster, in which case we're switching to pseudo-random probing to break out of the cluster. In some benchmarks at least this led to some speedups, but it'd require a bit of code refactoring to avoid introducing code duplication. So this is left for future work.

  • [EDIT] OrderedTable.del is O(n) (including before PR), because every del rebuilds whole table; instead, using tombstones as introduced in this PR would make it O(1)

  • [EDIT] support dec, and having zero counts, in CountTable using tombstones

  • SharedTable should auto-init, like other tables; it should have some iterators, and simply ask user to protect access via withLock instead of current API which is rather cumbersome and unlike other tables

Copy link
Contributor

@krux02 krux02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is some superficial comments on your WIP status. Feel free to ignore it if you don't like it.

lib/pure/collections/hashcommon.nim Outdated Show resolved Hide resolved
lib/pure/collections/hashcommon.nim Outdated Show resolved Hide resolved
lib/pure/collections/hashcommon.nim Show resolved Hide resolved
lib/pure/collections/hashcommon.nim Outdated Show resolved Hide resolved
@nim-lang nim-lang deleted a comment from krux02 Feb 17, 2020
@timotheecour timotheecour changed the title [WIP] pseudorandom probing for hash collision pseudorandom probing for hash collision Feb 19, 2020
lib/pure/testutils.nim Outdated Show resolved Hide resolved
@Araq Araq added the merge_when_passes_CI mergeable once green label Feb 19, 2020
@Araq
Copy link
Member

Araq commented Feb 19, 2020

Superb work, should also be backported, if possible. @narimiran

Copy link
Contributor

@Clyybber Clyybber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

Comment on lines +7 to +15
proc sortedPairs*[T](t: T): auto =
## helps when writing tests involving tables in a way that's robust to
## changes in hashing functions / table implementation.
toSeq(t.pairs).sorted

template sortedItems*(t: untyped): untyped =
## helps when writing tests involving tables in a way that's robust to
## changes in hashing functions / table implementation.
sorted(toSeq(t))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO you should use toSeq(t.pairs).sorted and sorted(toSeq(t)) in the tests instead.
This is bloat IMO and saves like 5 keypresses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • sure, it's a bit "bloaty" but it simplifies tests (stdlib + other ones) and encourages using it instead of flaky alternatives like $t which has caused much churn in the past every time hashes/table/etc algorithms change; I've had to fix a number of tests (both in stdlib + in other nimble packages) to make those tests non-flaky; this simplifies this.
  • it's a bit more than 5 keypresses in practice: toSeq(t.pairs).sorted prevents using method call syntax/UFCS (because pairs is an iterator), but sortedPairs can be used with MCS; in practice tests often involve expressions to define t so having MCS makes the test simpler to read/write
  • (also, needs 1 import testutils instead of 2 imports algorithm, sequtils to achive this goal, but that argument is very minor; in practice import sequtils can cause further issues (see nim issues involving import sequtils except toSeq) so it'd in fact be imports algorithm, from sequtils import toSeq`; etc)

all those complications I've had in the past when fixing test led me to the simple realization that testutils is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, use toSeq(.... in the tests.
Even if it is only "a bit bloaty" it is too much to be a candidate for adding to the stdlib IMO.
All the other arguments are very minor, as you admit yourself.
Please remove this module.

lib/pure/collections/intsets.nim Outdated Show resolved Hide resolved
lib/pure/collections/hashcommon.nim Outdated Show resolved Hide resolved
@narimiran narimiran merged commit 8c22518 into nim-lang:devel Feb 19, 2020
@timotheecour timotheecour deleted the pr_pseudorand_probing branch February 19, 2020 20:43
@c-blake
Copy link
Contributor

c-blake commented Feb 19, 2020

Looking over at your one set of benchmarks I could find which you used to justify "low performance impact" (https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim, right???) I see only tests with like a million 8 byte keys in hot loops. That's 100% in L3 cache which has approximately 10X better latency than dynamic RAM. In short, it doesn't really test anything interesting about this table structure change.

@c-blake
Copy link
Contributor

c-blake commented Feb 19, 2020

Perhaps more constructively - Robin Hood hashing is honestly pretty simple..arguably simpler than the explanation for this new probe sequence. It just keeps collision chains sorted by their "distance from their home cell" (where they would hash to in an empty table). That's it!

You can shift cells around with moveMem. Since we save hash codes we can easily and quickly compute the home cell. As a bonus, for unsuccessful lookups (such as at insert new key time) you can stop, on average, halfway down the collision chain (once you pass the spot where a key "should have been" if it were present.

There are a lot of articles on this if you search for them. E.g., this http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/ . I've done it twice before once in C and once in Nim. If you need me to, I can do a PR, but it might have to wait until next week.

@c-blake
Copy link
Contributor

c-blake commented Feb 19, 2020

Rust used a Robin Hood table for quite a few years (until it switched to the much more complex port of Google's "Swiss" Tables).

@c-blake
Copy link
Contributor

c-blake commented Feb 19, 2020

..and those "Swiss" tables mostly only get speed through vectorization tricks and much fine-tuning which is an ongoing maintenance burden as CPUs evolve/diversify..and also IIRC do not have the minimum variance guarantees of Robin Hood or its consequent 90+% memory utilization. And RH has no extra hard to tune parameters. Same old, "when to resize". That's it.

RH is even friendly to being used as a disk-resident table, like B-Trees. These days fast computers have a budget of like 500 cycles multiplied by 4-way or 5-way superscalar or 2000-2500 dynamic CPU instructions per cache miss. At that scale, almost anything that is supposed to handle larger than L3 cache data should be done almost the same way as if you did it "on disk". Random probing decidedly does not have that trait.

@c-blake
Copy link
Contributor

c-blake commented Feb 19, 2020

Oh, and incidentally - to support users who don't know how tables and hashes work from long collision chains we should just count the loops made in the comparisons. Experts should be able to opt out of with a special define (or maybe as part of danger)..Somehow, though.

Then you can write out some kind of run-time warning message (probably exactly once per Table instance or maybe use a global variable to make it once per program or once per thread using a Table). I'm not sure if there are other similar run-time "things are not going as planned but maybe we should not quit/raise exceptions" warnings in the stdlib or if so how they are reported. It seems unlikely excessive hash collisions are the only case of something like this, though we may currently punt on all of them.

Anyway, doing that if you're a naive user and hashes misbehave then you get told what to do - use a better hash function. Also, if you didn't take care against adversaries and you got attacked then this would at least write a message somewhere. E.g., if you had code not designed to be safe from attack that was then run in an unsafe environment.

@timotheecour
Copy link
Member Author

timotheecour commented Feb 20, 2020

I see only tests with like a million 8 byte keys in hot loops. That's 100% in L3 cache which has approximately 10X better latency than dynamic RAM. In short, it doesn't really test anything interesting about this table structure change.

I've also run tests on 20M keys and 100M keys (with key, val, hcode are int64, ie cell size with padding = 24B); that is 2.4GB which definitely doesn't fit on my 16MB cache.

robin hood hashing

If you need me to, I can do a PR, but it might have to wait until next week.

@c-blake if you think you can improve on the numbers I posted in #13440, by all means please do :) ; a draft PR is all that's needed for a start (ie no need to fix everything, just enough to run the benchmark I posted in that PR and compare apples to apples #13440 against robinhood hashing or swiss tables or your own version of it)
Note that, whether with pseudorandom hashing or robin hood hash or other collision algorithms, the major difficulty is not really implementing the algorithm, but to adapt all the existing code to work with it; there is quite a bit of code duplication that needs to be refactored before we should make further improvements.

one concern: seems to me robin hood suffers from #13393 as well (eg try with keys of the form i shl 32); if you use a better hash with robin hood, you'll pay a price to compute the hash (see performance numbers i have); if you don't you'll end up with a few large clusters and large probe sizes.
However, IMO the main idea of robin hood can be combined with the mixed approach of #13440: for low depths we proceed with linear probe and swap along the search as done in robinhood; and for depth > thres we switch to pseudorandom probing.

@c-blake
Copy link
Contributor

c-blake commented Feb 20, 2020

As mentioned, tombstones cause performance lumpiness grief for steady-state constant membership heavy deletion workloads like a cache.

Also, while I am generally in favor of data-drivenness and benchmarks, as mentioned hot loop benchmarks can be so misleading that theoretical arguments tell you more when the primary cost driver is DIMM latency. The CPU branch predictor detects the hot loop and works ahead by 1000 instructions to mask dynamic RAM latency it severely pollutes these numbers. Even if you benchmark array_access[random_element] where you use a PRNG for the element you get misled about what happens when you don't have a perfectly oiled loop machine. CPU performance counters for aggregate cache misses do not even tell the whole story because of pre-fetching logic. And when you try to use more complicated things as benchmarks then all that complication pollutes your numbers, too. There are model specific registers you can use to turn off pre-fetching but then the numbers are suspect for other reasons.

You don't have to fly completely blind. On my main box at home I have 65 ns latency and 32 GB/s bandwidth. So, a scan of up to 65*32=2080 or half a memory page can be hidden. That is very fast RAM, though. Most would be more like (80-120)*(20..24) or 1600..2880 bytes. These are all really large regions for hash cells < 32B. Regular linear probing with a good hash has a longest probe depth per table of about log-base2(table size). So, for lookup latency, what is by far the most sensible thing to do is just fall back on "well, you cannot do better than 1 fetch + a linear-in-physical-memory scan as long as the scan is short-ish". Your threshold thing kind of gets there, but tombstones create unnecessary aging rehashing trouble and temptation to use slow probe sequences whose slowness is not detected by naive benchmarks as happened here.

While you can fall down a rabbit hole of ever more complex probe sequences and there are many papers on this, you know what? That's all about memory systems from 40+ years ago. None will be as optimized by the CPU+memory system as a linear scan. Fortunately or unfortunately the entire CPU+memory system has been optimized to hide latency in benchmarks. Even having a nextTry isolated proc and tombstomes and "ease of experimentation" is probably asking for trouble like what happened here. But yeah, you do need your hash to do some kind of minimal spreading out

Appeal to "but Python did it this way" is also wrong for multiple reasons. Python was always slow and uses these things as part of its basic object model so much that it's all selected for in-L1-cache tiny tables than broad workloads. CPython also no longer even does it this way for tables, only sets. Also, some 1.5-2.0x performance hit at 75% full without the threshold and tombstones would not bother Python users anyway who would "outsource" performance sensitive stuff to some more carefully done C thing.

This depend-on-the-whole hash code sequence doesn't help with attacks - attackers can just collide in the code instead of the modulus. We all get that. It only helps with weak user hashes, but those could also collide the whole code. So, it's still a band-aid on totally awful hashes and @Araq is absolutely correct that the right solution for scaling guarantees is a B-tree. Some warning message like "far more than expected hash loops - use a better hash function or B-tree" might well direct someone to do exactly that.

Anyway, in short - A) your hot loops are measure something not of interest besides the hot loop itself, so your 1.06x number is not quite meaningless but not what you should base decisions/comparisons upon, B) the real thing of interest is hard enough to measure that people agreeing on "realistic workloads" will be nearly impossible, C) there exists a simple alternative dodging all this, namely robin hood used by performance freak/non outsourcing Rustaceans for like 5 years until they moved to something with CPU architecture-specific tuning with its obvious ongoing maintenance burden, D) counting loops in rawGet or whatever and warning users will 1) always be necessary, 2) is super easy both to have and to disable and to regulate to avoid zillions of messages, and 3) and can immediately guide them to a better solution - the only thing needing agreement is how the runtime should warn.

@Araq
Copy link
Member

Araq commented Feb 20, 2020

If you need me to, I can do a PR, but it might have to wait until next week.

Oh yeah please!

@Araq
Copy link
Member

Araq commented Feb 20, 2020

with further code improvements + refactorings, I expect it will eventually become easier to both experiment with and customize the hashing+collision resolution and have it work across all Table types (there are many); right now there is much code duplication making this hard.

The code duplication also helps. The compiler hashes Nim identifiers, not "random numbers" and so a specialized implementation can exploit this, a stdlib impl can't. The compiler also never deletes anything inside a table so you can leave out the tombstone logic.

@timotheecour
Copy link
Member Author

timotheecour commented Feb 20, 2020

@c-blake you can take a look at this WIP for robin hood hashing:
timotheecour@96a2d49

doesn't yet allow bootstrapping nim from it but shd be usable as a library at least for Table; can be used as follows:

nim c --lib:lib -r -d:danger $vitanim_D/testcases/tableutils/benchmain.nim

seems to work but performance seems unequivocally worse compared to #13440 using same benchmark as on that PR; of course that could very possibly be due to bugs in my implementation which I wrote quickly as a POC; feel free to improve / come up with a better one in your own PR + hopefully report better performance numbers. As I mentioned earlier, it does suffer from #13393 and better hashing would also incur runtime cost.

The compiler hashes Nim identifiers, not "random numbers" and so a specialized implementation can exploit this, a stdlib impl can't.

I would hope, with all the work done to improve tables/hashes, that stdlib implementation is now good enough for such use cases without any compromise on performance, and that at least some of (strtabs, compiler/astalgo, compiler/treetab, system/cellsets, pure/strtabs, sets) could simply reuse tables, or at least a common denominator that would allow fine tuning based on runtime/compile time properties, eg depthThres, mustRehashfillRatio, canDeleteCell. I don't think this would create a monster, quite the opposite.

@Araq
Copy link
Member

Araq commented Feb 20, 2020

I don't think this would create a monster, quite the opposite.

Maybe but I don't see the point, instead of optimizing N different implementations you "optimize" a single one for diverse use cases. But this makes the problem much harder to solve, optimization is specialization. For example cellsets hashes pointers, nothing else. The only reason so much effort has to be put into the stdlib's hash tables is that we cannot anticipate how it's used.

@c-blake
Copy link
Contributor

c-blake commented Feb 20, 2020

I agree having >1 entry point makes sense and "optimization is specialization" is a very quotable saying. Some require no delete which is another wrinkle besides the ones @Araq mentioned. I'm not sure the right number is the exact current number, but each candidate for conversion should be considered very carefully. Adversarial inputs or bad user hashes are other dimensions the impls do not all share. So, at least 3 bits/8 possibilities. Maybe 2 bits/4 if you want to coalesce adversarial & bad. There are surely others.

There is this VLDB 2015 paper "A Seven Dimensional Analysis of Hashing Methods" with a crazy complicated flowchart at the end. Nothing is perfect, but Robin Hood has the most "arrows in" of the bunch. (I also bet it'd be easy to poke holes in their benchmarks, though.)

@dom96 dom96 mentioned this pull request Mar 28, 2020
narimiran pushed a commit that referenced this pull request Mar 31, 2020
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.
@c-blake c-blake mentioned this pull request Apr 9, 2020
Araq pushed a commit that referenced this pull request Apr 15, 2020
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.

* This is an alternate resolution to #13393
(which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers.  This gives
simple help to users having trouble with overly colliding hash(key)s.  I.e.,
  A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))`
      in the instantiation context of a `HashSet` or `Table`
or
  B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to
scramble well by the SMHasher test suite (a necessarily limited but far
more thorough test than prior proposals here), B) only a few ALU ops on
many common CPUs, and C) possesses an easy via "grade school multi-digit
multiplication" fall back for weaker deployment contexts.

Some people might want to stampede ahead unbridled, but my view is that a
good plan is to
  A) include this in the stdlib for a release or three to let people try it
     on various key sets nim-core could realistically never access/test
     (maybe mentioning it in the changelog so people actually try it out),
  B) have them report problems (if any),
  C) if all seems good, make the stdlib more novice friendly by adding
     `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1`
     with some `when defined` rearranging so users can `-d:hashIdentity`
     if they want the old behavior back.
This plan is compatible with any number of competing integer hashes if
people want to add them.  I would strongly recommend they all *at least*
pass the SMHasher suite since the idea here is to become more friendly to
novices who do not generally understand hashing failure modes.

* Re-organize to work around `when nimvm` limitations; Add some tests; Add
a changelog.md entry.

* Add less than 64-bit CPU when fork.

* Fix decl instead of call typo.

* First attempt at fixing range error on 32-bit platforms; Still do the
arithmetic in doubled up 64-bit, but truncate the hash to the lower
32-bits, but then still return `uint64` to be the same.  So, type
correct but truncated hash value.  Update `thashes.nim` as well.

* A second try at making 32-bit mode CI work.

* Use a more systematic identifier convention than Wang Yi's code.

* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a
very long time, I think).  `$a`/`$b` depend on iteration order which
varies with table range reduced hash order which varies with range for
some `hash()`.  With 3 elements, 3!=6 is small and we've just gotten
lucky with past experimental `hash()` changes.  An alternate fix here
would be to not stringify but use the HashSet operators, but it is not
clear that doesn't alter the "spirit" of the test.

* Fix another stringified test depending upon hash order.

* Oops - revert the string-keyed test.

* Fix another stringify test depending on hash order.

* Add a better than always zero `defined(js)` branch.

* It turns out to be easy to just work all in `BigInt` inside JS and thus
guarantee the same low order bits of output hashes (for `isSafeInteger`
input numbers).  Since `hashWangYi1` output bits are equally random in
all their bits, this means that tables will be safely scrambled for table
sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the
integer keys are all < 2**53 (also likely fine).  (I'm unsure why the
infidelity with C/C++ back ends cut off is 32, not 53 bits.)

Since HashSet & Table only use the low order bits, a quick corollary of
this is that `$` on most int-keyed sets/tables will be the same in all
the various back ends which seems a nice-to-have trait.

* These string hash tests fail for me locally.  Maybe this is what causes
the CI hang for testament pcat collections?

* Oops. That failure was from me manually patching string hash in hashes.  Revert.

* Import more test improvements from #13410

* Fix bug where I swapped order when reverting the test.  Ack.

* Oh, just accept either order like more and more hash tests.

* Iterate in the same order.

* `return` inside `emit` made us skip `popFrame` causing weird troubles.

* Oops - do Windows branch also.

* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic
resolutions are "1 == identity", 1 for Nim Version 1, 1 for
first/simplest/fastest in a series of possibilities.  Should be very
easy to remember.)

* Re-organize `when nimvm` logic to be a strict `when`-`else`.

* Merge other changes.

* Lift constants to a common area.

* Fall back to identity hash when `BigInt` is unavailable.

* Increase timeout slightly (probably just real-time perturbation of CI
system performance).
dom96 pushed a commit that referenced this pull request Jun 10, 2020
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.

* Add neglected API call `find` to heapqueue.

* Add a changelog.md entry, `since` annotation and rename parameter to be
`heap` like all the other procs for consistency.

* Add missing import.
@timotheecour timotheecour changed the title [TODO] pseudorandom probing for hash collision pseudorandom probing for hash collision Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge_when_passes_CI mergeable once green
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow insertion of uint in Tables and Sets Collections depending on the value
6 participants