BUG: pd.unique treats 0 and False as equivalent #18111

Closed
mroeschke opened this issue Nov 4, 2017 · 4 comments
Assignees
Labels
Bug Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@mroeschke
Member

In [1]: import pandas as pd

# Simple example
In [2]: pd.unique([0, False])
Out[2]: array([0])

# Testing issue in test_datetime_bool for PR 17077
In [4]: pd.unique([0, False, pd.NaT, 0.0])
Out[4]: array([0, nan], dtype=object)

Problem description

Currently a testing blocker (test_datetime_bool) for PR #17077

I am guessing False is getting coerced to 0 when determining uniqueness; however, there may be cases when the user wants False to be distinct from 0.

Expected Output

# Simple example
In [2]: pd.unique([0, False])
Out[2]: array([0, False])

# Testing issue in test_datetime_bool for PR 17077
In [4]: pd.unique([0, False, pd.NaT, 0.0])
Out[4]: array([0, False, nan], dtype=object)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 86e9dcc
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0.dev0+50.g86e9dcc
pytest: 3.2.1
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.3.2
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 4, 2017

If you change https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L840
from _checknan to checknull (cimported from lib), then this should work. _checknan is basically a fancy isnan, while checknull is used everywhere and is dtype-aware.

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype Difficulty Intermediate labels Nov 4, 2017
@jreback jreback added this to the 0.21.1 milestone Nov 4, 2017
@mroeschke
Member Author

I replaced _checknan with checknull, but the issue still exists.

I found one issue in _ensure_arraylike in pandas/core/algorithms.py where numpy coerced the False to 0 like so:

In [4]: np.asarray([0, False])
Out[4]: array([0, 0])

Including 'mixed-integer' in https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L171 ensures that the array is object dtyped and False is not coerced.
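
The coercion described above can be reproduced outside pandas. The following is a minimal sketch that only uses np.asarray, not the actual code path in pandas/core/algorithms.py; the variable names are made up for illustration.

```python
import numpy as np

# Without an explicit dtype, NumPy picks a common integer dtype for
# [0, False], so the bool is converted to the integer 0.
coerced = np.asarray([0, False])

# With dtype=object, NumPy stores the original Python objects, so the
# bool survives as a bool.
preserved = np.asarray([0, False], dtype=object)

print(coerced.dtype, [type(v).__name__ for v in coerced.tolist()])
print(preserved.dtype, [type(v).__name__ for v in preserved.tolist()])
```

Keeping the values as objects only stops the early coercion; as noted below, it does not by itself make 0 and False distinct during hashing.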

However, that didn't fully fix the issue. I don't entirely understand what's happening in https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L841-L844, which is where the problem may be. Is it also problematic that False and 0 have the same hash?

In [6]: hash(0) == hash(False)
Out[6]: True
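
For context, plain Python containers show the same effect: a hash-based container treats two keys as one when their hashes match and they compare equal. This sketch uses only built-in set and dict, not the pandas hashtable; the variable names are illustrative.

```python
# 0 and False share a hash and compare equal, so a set keeps one of them.
zero_and_false = set()
zero_and_false.add(0)
zero_and_false.add(False)

# In a dict, assigning through an equal key updates the value but keeps
# the key object that was inserted first.
lookup = {0: "int zero"}
lookup[False] = "bool false"
first_key = next(iter(lookup))

print(len(zero_and_false))
print(lookup, type(first_key).__name__)
```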

@realead
Contributor

realead commented Aug 13, 2018

The first issue can be side-stepped with an explicit cast to a NumPy array:

>>> pd.unique(np.array([False, 0, 0.0], dtype=np.object))
array([False], dtype=object)

I'm not sure the result is unexpected. False == 0 and False == 0.0 both evaluate to True in Python, and because the underlying hash map uses Python's PyObject_RichCompareBool, we get the result we get.

The result is similar for

>>> pd.unique(np.array([True, 1, 1.0], dtype=np.object))
array([True], dtype=object)

So there are probably two alternatives:

  1. Leave it as it is. If users don't like Python's equivalence of False and 0 (and of True and 1), they should preprocess their data.
  2. Handle this case in pyobject_cmp(...), where we already have special handling for float NaNs. That handling didn't hurt performance.

To me, the first option seems totally OK.
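
As a rough illustration of the preprocessing in the first option, a user could deduplicate on a (type, value) key so that 0, False and 0.0 stay apart. This is a sketch in plain Python; unique_by_type is a hypothetical helper, not part of pandas, and it assumes all values are hashable.

```python
def unique_by_type(values):
    """Order-preserving unique that keeps equal values of different types apart.

    Hypothetical helper for illustration only; not a pandas API.
    """
    seen = set()
    result = []
    for value in values:
        # Pairing the value with its type makes (int, 0) and (bool, False)
        # different keys, even though 0 == False.
        key = (type(value), value)
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


print(unique_by_type([0, False, 0.0, 1, True, 1.0, 0]))
# [0, False, 0.0, 1, True, 1.0]
```

Note that this sketch does not add any special treatment for missing values such as NaN or NaT, which pandas handles separately.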

@mroeschke
Member Author

Yeah, given that 0 and False are equivalent in Python, I agree that the current behavior is probably okay. Closing.
