BUG: issues with hash-function for Float64HashTable (GH21866) #21904

realead · 2018-07-13T22:09:40Z

The following issues

1) hash(0.0) != hash(-0.0)
2) hash(x) != hash(y) for different x,y which are nans

are solved by setting:

1) hash(-0.0):=hash(0.0)
2) hash(x):=hash(np.nan) for every x which is nan

closes Issues with hash-function for float64 version of klib's hash-map #21866
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
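
To make the two issues concrete, here is a minimal Python sketch that emulates the original `kh_float64_hash_func` macro on the raw bits of a double. It assumes `asint64` reinterprets the double as an unsigned 64-bit integer and that the result is truncated to 32 bits; the helper names `float_bits`, `bits_float` and `old_hash` are made up for this illustration and are not part of pandas or klib.

```python
import struct

MASK64 = (1 << 64) - 1  # emulate 64-bit wrap-around of the C shift
MASK32 = (1 << 32) - 1  # emulate the (khint32_t) cast


def float_bits(x: float) -> int:
    """Reinterpret a double as an unsigned 64-bit integer (stand-in for asint64)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_float(bits: int) -> float:
    """Inverse of float_bits: build a double from its raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def old_hash(x: float) -> int:
    """Emulates (khint32_t)(k >> 33 ^ k ^ k << 11), with k the raw bits of x."""
    k = float_bits(x)
    return ((k >> 33) ^ k ^ ((k << 11) & MASK64)) & MASK32


nan_a = float("nan")
# Flip one payload bit: still a NaN, but with a different bit pattern.
nan_b = bits_float(float_bits(nan_a) ^ 1)

# Issue 1: the keys compare equal, yet their hashes differ.
print(0.0 == -0.0, old_hash(0.0) == old_hash(-0.0))  # True False

# Issue 2: both values are NaN, yet their hashes differ.
print(nan_a != nan_a, nan_b != nan_b, old_hash(nan_a) == old_hash(nan_b))
```

A hash table can only find a key if keys that it treats as equal also hash to the same value, which is why both cases lead to duplicate entries.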

jbrockmendel · 2018-07-14T03:49:47Z

Test failure appears unrelated. Can you push again to re-run it?

codecov · 2018-07-14T05:16:44Z

Codecov Report

Merging #21904 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #21904   +/-   ##
=======================================
  Coverage   91.99%   91.99%           
=======================================
  Files         167      167           
  Lines       50578    50578           
=======================================
  Hits        46530    46530           
  Misses       4048     4048

Flag        Coverage Δ
#multiple   90.4% <ø> (ø)
#single     42.17% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 322dbf4...124b095.

jreback · 2018-07-14T14:11:42Z

can you run an asv and report any anomalies

jreback

does this have any user level visible effects?
can you add a whatsnew note (0.24.0), depending on your answer above either API changes or other Enhancements section.

realead · 2018-07-14T14:57:05Z

@jbrockmendel No, it was a problem in one of the added test cases: somewhat naively, 16GB of memory was reserved (but not committed!), yet obviously different operating systems with different resources react differently to such a request.

It is a little bit strange that the testing just died and didn't recover from that...

I removed these silly tests and now it looks better (at least it is clear what goes wrong).

jreback · 2018-07-14T14:58:19Z

pandas/_libs/src/klib/khash_python.h

-#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
+
+// correct for all inputs but not -0.0 and NaNs
+#define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)


can you add a blank between cases

@jreback Sorry for the silly question: Do you expect me to add a new commit with the improvements to the branch and you will fixup it when merging or should I amend the current commit?

generally push new commits.

needs to add a whatsnew in any event

realead · 2018-07-15T20:24:18Z

pandas/_libs/src/klib/khash_python.h

+
+// correct for all
+#define kh_float64_hash_func(key) ((key) != (key) ?                       \
+                                   kh_float64_hash_func_NAN(Py_NAN) :     \


Not sure about Py_NAN:

Must the case of Py_NO_NAN be taken into account?

There is PANDAS_NAN, but here Py_NAN didn't require additional includes.

PS: NAN from math.h isn't defined for some platforms.
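
The normalization discussed here can be sketched in Python as follows. This is a sketch under assumptions, not the macro itself: the diff fragment above is truncated, so the non-NaN branch follows the PR description (`hash(-0.0):=hash(0.0)`), `float("nan")` stands in for `Py_NAN`, and the names `raw_hash` and `fixed_hash` are hypothetical.

```python
import math
import struct

MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def raw_hash(x: float) -> int:
    """Bit-mixing hash on the raw bits; wrong for -0.0 and for NaNs."""
    k = struct.unpack("<Q", struct.pack("<d", x))[0]
    return ((k >> 33) ^ k ^ ((k << 11) & MASK64)) & MASK32


def fixed_hash(x: float) -> int:
    """Map every NaN to one canonical NaN and -0.0 to 0.0 before hashing."""
    if math.isnan(x):        # corresponds to (key) != (key) in the C macro
        x = float("nan")     # assumed stand-in for Py_NAN
    elif x == 0.0:           # true for both 0.0 and -0.0
        x = 0.0
    return raw_hash(x)


# A NaN whose bit pattern differs from the canonical one by a payload bit.
canonical_bits = struct.unpack("<Q", struct.pack("<d", float("nan")))[0]
other_nan = struct.unpack("<d", struct.pack("<Q", canonical_bits ^ 1))[0]

print(fixed_hash(-0.0) == fixed_hash(0.0))               # True
print(fixed_hash(other_nan) == fixed_hash(float("nan")))  # True
print(fixed_hash(1.5) == raw_hash(1.5))                   # True
```

All other inputs keep their original hash, so only the two special cases pay for the extra comparisons.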

jreback · 2018-07-16T10:32:28Z

doc/source/whatsnew/v0.24.0.txt

@@ -235,7 +235,7 @@ Other API Changes
  a ``KeyError`` (:issue:`21678`).
 - Invalid construction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)
 - Trying to reindex a ``DataFrame`` with a non unique ``MultiIndex`` now raises a ``ValueError`` instead of an ``Exception`` (:issue:`21770`)
-
+- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)


this is not public, my question below was whether this has a public change?

It fixes the bug #21866, i.e. produces right results for some esoteric corner cases. This change of behavior can be observed by the end-user, but is this then a public change worth mentioning?

#21866 (comment)

is this the public case? you can list this, but just make it related to the effect on .unique()

also add a test w.r.t. unique
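
The user-visible effect on `.unique()` can be illustrated with a pure-Python sketch of the intended semantics: 0.0 and -0.0 count as one key and all NaNs count as one key. The function `unique_floats` is hypothetical and is not how pandas implements `unique`; keeping the first-seen representative is an assumption of this sketch.

```python
import math


def unique_floats(values):
    """Order-preserving unique with one key for 0.0/-0.0 and one for all NaNs."""
    seen = set()
    seen_nan = False
    result = []
    for v in values:
        if math.isnan(v):
            # NaN != NaN, so NaNs need explicit handling to collapse into one key.
            if not seen_nan:
                seen_nan = True
                result.append(v)
        elif v not in seen:
            # Python sets already treat 0.0 and -0.0 as the same key.
            seen.add(v)
            result.append(v)
    return result


result = unique_floats([-0.0, 0.0, 1.0, float("nan"), float("nan"), 1.0])
print(len(result))  # 3
```

A test along these lines would assert that the result holds a single zero and a single NaN, whatever mix of signed zeros and NaN bit patterns the input contains.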

jreback · 2018-07-17T00:53:58Z

doc/source/whatsnew/v0.24.0.txt

@@ -84,6 +84,7 @@ Other Enhancements
 - :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
 - :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
 - :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
+- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)


right, can you reword to just focus on .unique()

jreback

doc comments. rebase & ping on green.

jreback · 2018-07-20T12:56:16Z

pandas/tests/test_algos.py

@@ -500,6 +501,23 @@ def test_obj_none_preservation(self):

        tm.assert_numpy_array_equal(result, expected, strict_nan=True)

+    def test_signed_zero(self):
+        a = np.array([-0.0, 0.0])


can you add the issue number here as a comment (and on test below)

jreback · 2018-07-20T12:57:11Z

doc/source/whatsnew/v0.24.0.txt

@@ -84,6 +84,7 @@ Other Enhancements
 - :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
 - :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
 - :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
+- :func:`unique` handles signed zeros consistently: it is no longer possible to have both, 0.0 and -0.0, in the same resulting array (:issue:`21866`)


move to bug fix / Numeric section.

realead · 2018-07-21T17:22:24Z

@jreback Done.

jreback · 2018-07-25T10:36:25Z

@jbrockmendel any comments? if ok pls merge.

jbrockmendel · 2018-07-25T19:13:24Z

@realead good job catching a subtle bug. Thanks for taking point on this.

it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.

realead force-pushed the hash_for_float64_GH21866 branch from 0d7fe27 to 2cec96e on July 14, 2018 05:16

realead force-pushed the hash_for_float64_GH21866 branch from 2cec96e to 94b7087 on July 14, 2018 13:41

jreback requested changes Jul 14, 2018

jreback added the Algos label (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) Jul 14, 2018

jreback reviewed Jul 14, 2018

realead commented Jul 15, 2018

jreback requested changes Jul 16, 2018

realead force-pushed the hash_for_float64_GH21866 branch from 4bf5983 to 0f77145 on July 16, 2018 18:42

jreback reviewed Jul 17, 2018

jreback requested changes Jul 20, 2018

jreback added this to the 0.24.0 milestone Jul 20, 2018

realead added 2 commits July 21, 2018 08:45

BUG: issues with hash-function for Float64HashTable (GH21866)

7f12a1d

The following issues 1) hash(0.0) != hash(-0.0) 2) hash(x) != hash(y) for different x,y which are nans are solved by setting: 1) hash(-0.0):=hash(0.0) 2) hash(x):=hash(np.nan) for every x which is nan

add the id of the issue to tests

124b095

realead force-pushed the hash_for_float64_GH21866 branch from 11131c9 to 124b095 on July 21, 2018 06:46

jreback approved these changes Jul 25, 2018

jbrockmendel merged commit 2c7c797 into pandas-dev:master Jul 25, 2018

realead deleted the hash_for_float64_GH21866 branch August 9, 2018 19:34

realead mentioned this pull request Aug 12, 2018

BUG: don't mangle NaN-float-values and pd.NaT (GH 22295) #22296

Merged

5 tasks
