Issues with hash-function for float64 version of klib's hash-map #21866

realead · 2018-07-11T19:31:54Z

Problem description

Hash-maps for float64 use the following hash-function

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

However, in order to guarantee consistent behavior, the following constrains must be met:

== must be an equivalence relation
x==y => hash(x)==hash(y)

Following IEEE-754, floats aren't an equivalence relation, this is fixed defining "equal" as

#define kh_float64_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))

Thus, apart from trivial equivalence classes, there are the following two:

{0.0, -0.0}
{x | x is NAN}

for which the second constrain doesn't hold:

hash(0.0)=0 != 1073741824=2^30=hash(-0.0)
hash(float("nan")) = 1073479680 != 2147221504 =hash(-float("nan"))

Due to the way klib uses this hash value, the values "0.0" and "-0.0" end up in different buckets only if there are more than 2^29 (ca. 6e8) elements in the hash-map already. The same holds for float("nan") and -float("nan"). However in the case of not-a-number there are much more elements in the equivalence class, so the inconsistent behavior can be triggered for much smaller sizes, for example for

import struct
NAN1=struct.unpack("d", struct.pack("L", 9221120237041090560))[0]
NAN2=struct.unpack("d", struct.pack("L", 9221120237041090561))[0]

Proposed solution:

There are already some workarounds for this problem scattered throughout pandas, for example for pd.unique(...) here or for isin here.

However, this

is not the right place to work around the bug, as it makes the code not really clearer
fixes only the not-a-number case, but not {0.0, -0.0}
most probably everyone reusing the hash-map-implementation in the fututre will first get the NAN-case wrong and then fix it with a further workaround.

A better approach would be to fix the hash-function of float64-hash-map, a possibility would be:

   #include <math.h>
    //right for all but not -0.0 and NANs
    #define kh_float64_hash_func_0_NAN(key) (khint32_t)((f64_to_i64(key))>>33^(f64_to_i64(key))^(f64_to_i64(key))<<11)
    //right for all except NANs
    #define kh_float64_hash_func_NAN(key) ((key)==0.0 ? kh_float64_hash_func_0_NAN(0.0) : kh_float64_hash_func_0_NAN(key))
    //right for all, also- 0.0 and NANs
    #define kh_float64_hash_func(key) ((key) != (key) ? kh_float64_hash_func_NAN(NAN) : kh_float64_hash_func_NAN(key))

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-07-11T20:52:09Z

A little over my head but can you clarify the problem further? Is this something visible from the end user perspective?

I wouldn't really consider the items you've labeled as workarounds to actually be such, as np.nan != np.nan generally, i.e. that's not just a pandas construct

realead · 2018-07-12T06:22:08Z

Because of workarounds, I'm not aware of a way for trigging an error in NAN-case (as long as one doesn't care exactly which NAN it is). There is however a way to trigger inconsistent behavior for 0.0 and -0.0 (I hope you have enough RAM:)):

#tested with pandas 0.22.0
import numpy as np
import pandas as pd

a=np.arange(12*10**8, dtype=np.float64)
a[0]=.1
a[-1]=0.0
a[-2]=-0.0

b=pd.unique(a)
print(b.size)

The size of b is 6*10^8, but should be 6*10^8-1, because 0.0 and -0.0 are usually handled as equal, e.g.:

a=np.zeros(2, dtype=np.float64)
a[-1]=0.0
a[-2]=-0.0

b=pd.unique(a)
print(b.size) # size is 1 now, 0.0 and -0.0 are equal

I do understand, that this is quite an esoteric case. My main issue with the implementation of float64-table as it is: There is a trap which obviously already have bitten at least twice and it will struck again in the future.

The problem is not the equal-operator (which is rightly extended with np.nan != np.nan) but the way the hash function is calculated. The proposed fix of the hash-function would eliminate the need for workarounds, because the float64-hashtable would handle NANs and signed zeros correctly out-of-the-box.

This SO-question helped me to understand the issue, maybe it is better than my issue description.

chris-b1 · 2018-07-12T19:46:22Z

Certainly would take an alternative hash function - as the long comment in the code indicates - we used to use python's hash for doubles, but that caused issues due to size truncation, so we're using a generic bit-shuffling one, same that ints use. As you show 0.0, -0.0 could be an issue.

But I think our approach for NaNs is fine? It's special cased, yes, but NaN is a special case throughout pandas, so don't really see a need for purity here? But open to suggestions if there's a cleaner way to embed our assumptions into the hash function itself.

gfyoung · 2018-07-13T00:30:09Z

A very esoteric bug indeed, but nonetheless a bug. At the very least, 0.0, -0.0 should be addressed.

realead · 2018-07-13T22:22:22Z

Added my suggestion as PR21904, it fixes both cases NaNs and signed zero. I think both are necessary: Using directly the Float64HashTable (instead of pd.unique()) it is much easier to trigger the bug for NaNs and 0.0/-0.0:

from pandas._libs import hashtable as ht
m = ht.Float64HashTable(12 * 10**8) 
m.set_item(np.nan, 0)
m.set_item(-np.nan, 0)
assert len(m) == 1

or

from pandas._libs import hashtable as ht
m = ht.Float64HashTable(12 * 10**8)
m.set_item(np.nan, -0.0)
m.set_item(-np.nan, -0.0)
assert len(m) == 1

PS: don't know the right place for whatsnew entry, I hope, that in case the changes are ok, I will be guided to the right place...

WillAyd added the Needs Info Clarification about behavior needed to assess issue label Jul 11, 2018

gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Internals Related to non-user accessible pandas implementation Bug and removed Needs Info Clarification about behavior needed to assess issue labels Jul 13, 2018

realead mentioned this issue Jul 13, 2018

BUG: issues with hash-function for Float64HashTable (GH21866) #21904

Merged

4 tasks

jreback added this to the 0.24.0 milestone Jul 14, 2018

jbrockmendel closed this as completed in #21904 Jul 25, 2018

This was referenced Jul 29, 2018

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

Closed

CLN: Remove special handling of nans in the float64-case of isin #22117

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with hash-function for float64 version of klib's hash-map #21866

Issues with hash-function for float64 version of klib's hash-map #21866

realead commented Jul 11, 2018 •

edited

Loading

WillAyd commented Jul 11, 2018

realead commented Jul 12, 2018 •

edited

Loading

chris-b1 commented Jul 12, 2018

gfyoung commented Jul 13, 2018

realead commented Jul 13, 2018

Issues with hash-function for float64 version of klib's hash-map #21866

Issues with hash-function for float64 version of klib's hash-map #21866

Comments

realead commented Jul 11, 2018 • edited Loading

Problem description

Proposed solution:

WillAyd commented Jul 11, 2018

realead commented Jul 12, 2018 • edited Loading

chris-b1 commented Jul 12, 2018

gfyoung commented Jul 13, 2018

realead commented Jul 13, 2018

realead commented Jul 11, 2018 •

edited

Loading

realead commented Jul 12, 2018 •

edited

Loading