Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with hash-function for float64 version of klib's hash-map #21866

Closed
realead opened this issue Jul 11, 2018 · 5 comments · Fixed by #21904
Closed

Issues with hash-function for float64 version of klib's hash-map #21866

realead opened this issue Jul 11, 2018 · 5 comments · Fixed by #21904
Labels
Bug Internals Related to non-user accessible pandas implementation Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@realead
Copy link
Contributor

realead commented Jul 11, 2018

Problem description

Hash-maps for float64 use the following hash-function

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

However, in order to guarantee consistent behavior, the following constrains must be met:

  1. == must be an equivalence relation
  2. x==y => hash(x)==hash(y)

Following IEEE-754, floats aren't an equivalence relation, this is fixed defining "equal" as

#define kh_float64_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))

Thus, apart from trivial equivalence classes, there are the following two:

  1. {0.0, -0.0}
  2. {x | x is NAN}

for which the second constrain doesn't hold:

  • hash(0.0)=0 != 1073741824=2^30=hash(-0.0)
  • hash(float("nan")) = 1073479680 != 2147221504 =hash(-float("nan"))

Due to the way klib uses this hash value, the values "0.0" and "-0.0" end up in different buckets only if there are more than 2^29 (ca. 6e8) elements in the hash-map already. The same holds for float("nan") and -float("nan"). However in the case of not-a-number there are much more elements in the equivalence class, so the inconsistent behavior can be triggered for much smaller sizes, for example for

import struct
NAN1=struct.unpack("d", struct.pack("L", 9221120237041090560))[0]
NAN2=struct.unpack("d", struct.pack("L", 9221120237041090561))[0]

Proposed solution:

There are already some workarounds for this problem scattered throughout pandas, for example for pd.unique(...) here or for isin here.

However, this

  1. is not the right place to work around the bug, as it makes the code not really clearer
  2. fixes only the not-a-number case, but not {0.0, -0.0}
  3. most probably everyone reusing the hash-map-implementation in the fututre will first get the NAN-case wrong and then fix it with a further workaround.

A better approach would be to fix the hash-function of float64-hash-map, a possibility would be:

   #include <math.h>
    //right for all but not -0.0 and NANs
    #define kh_float64_hash_func_0_NAN(key) (khint32_t)((f64_to_i64(key))>>33^(f64_to_i64(key))^(f64_to_i64(key))<<11)
    //right for all except NANs
    #define kh_float64_hash_func_NAN(key) ((key)==0.0 ? kh_float64_hash_func_0_NAN(0.0) : kh_float64_hash_func_0_NAN(key))
    //right for all, also- 0.0 and NANs
    #define kh_float64_hash_func(key) ((key) != (key) ? kh_float64_hash_func_NAN(NAN) : kh_float64_hash_func_NAN(key))
@WillAyd
Copy link
Member

WillAyd commented Jul 11, 2018

A little over my head but can you clarify the problem further? Is this something visible from the end user perspective?

I wouldn't really consider the items you've labeled as workarounds to actually be such, as np.nan != np.nan generally, i.e. that's not just a pandas construct

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jul 11, 2018
@realead
Copy link
Contributor Author

realead commented Jul 12, 2018

Because of workarounds, I'm not aware of a way for trigging an error in NAN-case (as long as one doesn't care exactly which NAN it is). There is however a way to trigger inconsistent behavior for 0.0 and -0.0 (I hope you have enough RAM:)):

#tested with pandas 0.22.0
import numpy as np
import pandas as pd

a=np.arange(12*10**8, dtype=np.float64)
a[0]=.1
a[-1]=0.0
a[-2]=-0.0

b=pd.unique(a)
print(b.size)

The size of b is 6*10^8, but should be 6*10^8-1, because 0.0 and -0.0 are usually handled as equal, e.g.:

a=np.zeros(2, dtype=np.float64)
a[-1]=0.0
a[-2]=-0.0

b=pd.unique(a)
print(b.size) # size is 1 now, 0.0 and -0.0 are equal

I do understand, that this is quite an esoteric case. My main issue with the implementation of float64-table as it is: There is a trap which obviously already have bitten at least twice and it will struck again in the future.

The problem is not the equal-operator (which is rightly extended with np.nan != np.nan) but the way the hash function is calculated. The proposed fix of the hash-function would eliminate the need for workarounds, because the float64-hashtable would handle NANs and signed zeros correctly out-of-the-box.

This SO-question helped me to understand the issue, maybe it is better than my issue description.

@chris-b1
Copy link
Contributor

Certainly would take an alternative hash function - as the long comment in the code indicates - we used to use python's hash for doubles, but that caused issues due to size truncation, so we're using a generic bit-shuffling one, same that ints use. As you show 0.0, -0.0 could be an issue.

But I think our approach for NaNs is fine? It's special cased, yes, but NaN is a special case throughout pandas, so don't really see a need for purity here? But open to suggestions if there's a cleaner way to embed our assumptions into the hash function itself.

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Internals Related to non-user accessible pandas implementation Bug and removed Needs Info Clarification about behavior needed to assess issue labels Jul 13, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 13, 2018

A very esoteric bug indeed, but nonetheless a bug. At the very least, 0.0, -0.0 should be addressed.

@realead
Copy link
Contributor Author

realead commented Jul 13, 2018

Added my suggestion as PR21904, it fixes both cases NaNs and signed zero. I think both are necessary: Using directly the Float64HashTable (instead of pd.unique()) it is much easier to trigger the bug for NaNs and 0.0/-0.0:

from pandas._libs import hashtable as ht
m = ht.Float64HashTable(12 * 10**8) 
m.set_item(np.nan, 0)
m.set_item(-np.nan, 0)
assert len(m) == 1 

or

from pandas._libs import hashtable as ht
m = ht.Float64HashTable(12 * 10**8)
m.set_item(np.nan, -0.0)
m.set_item(-np.nan, -0.0)
assert len(m) == 1 

PS: don't know the right place for whatsnew entry, I hope, that in case the changes are ok, I will be guided to the right place...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Internals Related to non-user accessible pandas implementation Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants