Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: issues with hash-function for Float64HashTable (GH21866) #21904

Merged
merged 2 commits into from
Jul 25, 2018

Conversation

realead
Copy link
Contributor

@realead realead commented Jul 13, 2018

The following issues

  1. hash(0.0) != hash(-0.0)
  2. hash(x) != hash(y) for different x,y which are nans

are solved by setting:

  1. hash(-0.0):=hash(0.0)
  2. hash(x):=hash(np.nan) for every x which is nan

@jbrockmendel
Copy link
Member

Test failure appears unrelated. Can you push again to re-run it?

@realead realead force-pushed the hash_for_float64_GH21866 branch from 0d7fe27 to 2cec96e Compare July 14, 2018 05:16
@codecov
Copy link

codecov bot commented Jul 14, 2018

Codecov Report

Merging #21904 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21904   +/-   ##
=======================================
  Coverage   91.99%   91.99%           
=======================================
  Files         167      167           
  Lines       50578    50578           
=======================================
  Hits        46530    46530           
  Misses       4048     4048
Flag Coverage Δ
#multiple 90.4% <ø> (ø) ⬆️
#single 42.17% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 322dbf4...124b095. Read the comment docs.

@realead realead force-pushed the hash_for_float64_GH21866 branch from 2cec96e to 94b7087 Compare July 14, 2018 13:41
@jreback
Copy link
Contributor

jreback commented Jul 14, 2018

can you run an asv and report any anomalies

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have any user level visible effects?
can you add a whatsnew note (0.24.0), depending on your answer above either API changes or other Enhancements section.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 14, 2018
@realead
Copy link
Contributor Author

realead commented Jul 14, 2018

@jbrockmendel No, it was a problem in the one of the added test cases: somewhat naively 16GB memory were reserved (but not committed!), yet obviously different operating systems with different resources reacts differently to such a request.

It is a little bit strange, that the testing just died and didn't recover from that...

I removed these silly tests and now it looks better (at least it is clear, what goes wrong).

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

// correct for all inputs but not -0.0 and NaNs
#define kh_float64_hash_func_0_NAN(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a blank between cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Sorry for the silly question: Do you expect me to add a new commit with the improvements to the branch and you will fixup it when merging or should I amend the current commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally push new commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to add a whatsnew in any event


// correct for all
#define kh_float64_hash_func(key) ((key) != (key) ? \
kh_float64_hash_func_NAN(Py_NAN) : \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about Py_NAN:

  1. Must the case of Py_NO_NAN be taken into account?
  2. There is PANDAS_NAN, but here Py_NAN didn't require additional includes.

PS: NAN from math.h isn't defined for some plattforms.

@@ -235,7 +235,7 @@ Other API Changes
a ``KeyError`` (:issue:`21678`).
- Invalid construction of ``IntervalDtype`` will now always raise a ``TypeError`` rather than a ``ValueError`` if the subdtype is invalid (:issue:`21185`)
- Trying to reindex a ``DataFrame`` with a non unique ``MultiIndex`` now raises a ``ValueError`` instead of an ``Exception`` (:issue:`21770`)
-
- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not public, my question below was whether this has a public change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fixes the bug #21866, i.e. produces right results for some esoteric corner cases. This change of behavior can be observed by the end-user, but is this then a public change worth mentioning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#21866 (comment)

is this the public case? you can list this, but just make it related to the effect on .unique()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a test w.r.t. unique

@realead realead force-pushed the hash_for_float64_GH21866 branch from 4bf5983 to 0f77145 Compare July 16, 2018 18:42
@@ -84,6 +84,7 @@ Other Enhancements
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
- :class:`Float64HashTable` handles zeros/signed zeros and all flavors of NaNs consistently: it is no longer possible to have both, zero and signed-zero, as keys at the same time in a table, also there can be at most one NaN-key in a table (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can you reword to just focus on .unique()

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comments. rebase & ping on green.

@@ -500,6 +501,23 @@ def test_obj_none_preservation(self):

tm.assert_numpy_array_equal(result, expected, strict_nan=True)

def test_signed_zero(self):
a = np.array([-0.0, 0.0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number here as a comment (and on test below)

@@ -84,6 +84,7 @@ Other Enhancements
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
- :func:`unique` handles signed zeros consistently: it is no longer possible to have both, 0.0 and -0.0, in the same resulting array (:issue:`21866`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to bug fix / Numeric section.

@jreback jreback added this to the 0.24.0 milestone Jul 20, 2018
realead added 2 commits July 21, 2018 08:45
The following issues

   1)  hash(0.0) != hash(-0.0)
   2)  hash(x) != hash(y) for different x,y which are nans

are solved by setting:

   1) hash(-0.0):=hash(0.0)
   2) hash(x):=hash(np.nan) for every x which is nan
@realead realead force-pushed the hash_for_float64_GH21866 branch from 11131c9 to 124b095 Compare July 21, 2018 06:46
@realead
Copy link
Contributor Author

realead commented Jul 21, 2018

@jreback Done.

@jreback
Copy link
Contributor

jreback commented Jul 25, 2018

@jbrockmendel any comments? if ok pls merge.

@jbrockmendel
Copy link
Member

@realead good job catching a subtle bug. Thanks for taking point on this.

@jbrockmendel jbrockmendel merged commit 2c7c797 into pandas-dev:master Jul 25, 2018
@realead realead deleted the hash_for_float64_GH21866 branch August 9, 2018 19:34
realead added a commit to realead/pandas that referenced this pull request Aug 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 21, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Aug 23, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 5, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 8, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 12, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 15, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
realead added a commit to realead/pandas that referenced this pull request Sep 17, 2018
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
…-dev#21904)

* BUG: issues with hash-function for Float64HashTable (GH21866)

The following issues

   1)  hash(0.0) != hash(-0.0)
   2)  hash(x) != hash(y) for different x,y which are nans

are solved by setting:

   1) hash(-0.0):=hash(0.0)
   2) hash(x):=hash(np.nan) for every x which is nan

* add the id of the issue to tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues with hash-function for float64 version of klib's hash-map
3 participants