Issues with hash-function for float64 version of klib's hash-map #21866
Comments
A little over my head but can you clarify the problem further? Is this something visible from the end user perspective? I wouldn't really consider the items you've labeled as workarounds to actually be such, as |
Because of the workarounds, I'm not aware of a way of triggering an error in the NAN-case (as long as one doesn't care exactly which NAN it is). There is however a way to trigger inconsistent behavior for
The size of b is
I do understand that this is quite an esoteric case. My main issue with the implementation of the float64-table as it is: there is a trap which has obviously already bitten at least twice, and it will strike again in the future. The problem is not the equal-operator (which is rightly extended with This SO-question helped me to understand the issue, maybe it is better than my issue description.
Certainly would take an alternative hash function - as the long comment in the code indicates - we used to use Python's hash for doubles, but that caused issues due to size truncation, so we're using a generic bit-shuffling one, the same one that ints use. As you show But I think our approach for NaNs is fine? It's special cased, yes, but
A very esoteric bug indeed, but nonetheless a bug. At the very least, |
Added my suggestion as PR21904; it fixes both cases: NaNs and signed zero. I think both are necessary: Using directly the
or
PS: I don't know the right place for the whatsnew entry; I hope that, if the changes are OK, I will be guided to the right place...
Problem description
Hash-maps for float64 use the following hash-function
However, in order to guarantee consistent behavior, the following constraints must be met:
== must be an equivalence relation
x==y => hash(x)==hash(y)
Under IEEE-754, == on floats is not an equivalence relation (NaN != NaN breaks reflexivity); this is fixed by defining "equal" as
Thus, apart from trivial equivalence classes, there are the following two:
{0.0, -0.0}
{x | x is NAN}
for which the second constraint doesn't hold:
hash(0.0)=0 != 1073741824=2^30=hash(-0.0)
hash(float("nan")) = 1073479680 != 2147221504 =hash(-float("nan"))
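The numbers above can be reproduced outside of pandas with a small Python sketch. The hash formula in `bit_shuffle_hash` is an assumption: it is a generic 64-bit to 32-bit bit-shuffle chosen because it reproduces the four values quoted here, not a copy of the actual klib macro. `table_equal` and `bucket` are likewise hypothetical helpers; `bucket` assumes a power-of-two table that masks the hash.

```python
import struct

MASK64 = (1 << 64) - 1


def float64_bits(x):
    """Return the IEEE-754 bit pattern of x as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def table_equal(a, b):
    """Equality extended so that every NaN compares equal to every NaN."""
    return a == b or (a != a and b != b)


def bit_shuffle_hash(x):
    """Assumed bit-shuffling hash of the raw bits, truncated to 32 bits."""
    key = float64_bits(x)
    return ((key >> 33) ^ key ^ ((key << 11) & MASK64)) & 0xFFFFFFFF


def bucket(x, n_buckets):
    """Bucket index, assuming n_buckets is a power of two used as a mask."""
    return bit_shuffle_hash(x) & (n_buckets - 1)


nan = float("nan")
for a, b in [(0.0, -0.0), (nan, -nan)]:
    # "equal" under the extended definition, yet the hashes differ
    print(table_equal(a, b), bit_shuffle_hash(a), bit_shuffle_hash(b))
# True 0 1073741824
# True 1073479680 2147221504

# The zeros differ only in bit 30, so a narrower mask hides the difference.
print(bucket(0.0, 2**30) == bucket(-0.0, 2**30))  # True
print(bucket(0.0, 2**31) == bucket(-0.0, 2**31))  # False
```

The NaN values assume that `float("nan")` has the usual quiet-NaN bit pattern 0x7FF8000000000000, which is what CPython produces on common platforms.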
Due to the way klib uses this hash value, the values 0.0 and -0.0 end up in different buckets only if there are already more than 2^29 (ca. 5.4e8) elements in the hash-map. The same holds for float("nan") and -float("nan"). However, in the case of not-a-number there are many more elements in the equivalence class, so the inconsistent behavior can be triggered for much smaller sizes, for example for
Proposed solution:
There are already some workarounds for this problem scattered throughout pandas, for example for pd.unique(...) here or for isin here. However, this
{0.0, -0.0}
A better approach would be to fix the hash-function of the float64-hash-map; one possibility would be:
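As an illustration of the general idea (not necessarily the variant proposed in this issue or in the PR), one way to satisfy x==y => hash(x)==hash(y) is to map every member of an equivalence class to a single representative bit pattern before hashing. The sketch below is in Python for readability; `normalized_hash` and the helpers are hypothetical names, and the bit-shuffle is the same assumed formula as before rather than the real klib macro.

```python
import struct

MASK64 = (1 << 64) - 1
CANONICAL_NAN_BITS = 0x7FF8000000000000  # one quiet NaN chosen as representative


def float64_bits(x):
    """Return the IEEE-754 bit pattern of x as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def shuffle_bits(key):
    """Assumed 64-bit to 32-bit bit-shuffle (not the actual klib macro)."""
    return ((key >> 33) ^ key ^ ((key << 11) & MASK64)) & 0xFFFFFFFF


def normalized_hash(x):
    """Hash a float64 so that all NaNs collide and +0.0/-0.0 collide."""
    if x != x:        # any NaN, regardless of sign or payload
        key = CANONICAL_NAN_BITS
    elif x == 0.0:    # true for both 0.0 and -0.0
        key = 0
    else:
        key = float64_bits(x)
    return shuffle_bits(key)


# A NaN with a non-default payload, built directly from its bit pattern.
payload_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]

print(normalized_hash(0.0) == normalized_hash(-0.0))                    # True
print(normalized_hash(float("nan")) == normalized_hash(-float("nan")))  # True
print(normalized_hash(float("nan")) == normalized_hash(payload_nan))    # True
```

With this normalization the hash no longer depends on which member of {0.0, -0.0} or {x | x is NAN} is inserted, so equal keys always land in the same bucket regardless of table size. The cost is two extra comparisons per hashed value.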