BUG: pd.unique treats 0 and False as equivalent #18111

mroeschke · 2017-11-04T19:03:03Z

In [1]: import pandas as pd

# Simple example
In [2]: pd.unique([0, False])
Out[2]: array([0])

# Testing issue in test_datetime_bool for PR 17077
In [4]: pd.unique([0, False, pd.NaT, 0.0])
Out[4]: array([0, nan], dtype=object)

Problem description

Currently a testing blocker (test_datetime_bool) for PR #17077

I am guessing False is getting coerced to 0 when determining uniqueness; however, there may be cases when the user wants False to be distinct from 0.

Expected Output

# Simple example
In [2]: pd.unique([0, False])
Out[2]: array([0, False])

# Testing issue in test_datetime_bool for PR 17077
In [4]: pd.unique([0, False, pd.NaT, 0.0])
Out[4]: array([0, False, nan], dtype=object)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 86e9dcc
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0.dev0+50.g86e9dcc
pytest: 3.2.1
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.3.2
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jreback · 2017-11-04T19:14:53Z

If you change https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L840
from _checknan to checknull (cimported from lib), then this should work. _checknan is basically a fancy isnan, while checknull is used everywhere and is dtype-aware.

mroeschke · 2017-11-05T00:16:08Z

I replaced _checknan with checknull, but the issue still exists.

I found one issue in _ensure_arraylike in pandas/core/algorithms.py where numpy coerced the False to 0 like so:

In [4]: np.asarray([0, False])
Out[4]: array([0, 0])

Including 'mixed-integer' in https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L171 ensures that the array is object dtyped and False is not coerced.

However, that wasn't enough to fix the issue. The remaining problem may be in https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L841-L844, but I don't entirely understand what's happening there. Is it also problematic that False and 0 have the same hash?

In [6]: hash(0) == hash(False)
Out[6]: True
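
On the hash question: equal hashes alone do not merge two keys in a hash-based container; the keys also have to compare equal. The following is a minimal pure-Python sketch of that distinction. It uses built-in sets rather than the pandas hashtable, and `SameHash` is a hypothetical class that exists only for this illustration.

```python
class SameHash:
    """Hypothetical helper: every instance hashes to 0, equality is by name."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 0

    def __eq__(self, other):
        return isinstance(other, SameHash) and self.name == other.name


# Same hash, but not equal: both objects survive in the set.
distinct = {SameHash("a"), SameHash("b")}
print(len(distinct))

# Same hash AND equal: 0 and False collapse into a single entry.
merged = {0, False}
print(len(merged))
```

So the shared hash only puts 0 and False in the same bucket; it is `0 == False` that makes them one key.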

realead · 2018-08-13T20:11:24Z

The first issue can be side-stepped with an explicit cast to a NumPy array:

>>> pd.unique(np.array([False, 0, 0.0], dtype=object))
array([False], dtype=object)

I'm not sure the result is unexpected. False == 0 and False == 0.0 both evaluate to True in Python, and because the underlying hash map uses Python's PyObject_RichCompareBool, we get the result we get.

The result is similar for

>>> pd.unique(np.array([True, 1, 1.0], dtype=object))
array([True], dtype=object)
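
The same "first value seen wins" behavior can be reproduced without pandas. This is only an analogy, assuming a plain `dict` as a stand-in for the hash map: dict lookup, like PyObject_RichCompareBool, checks identity first and falls back to `==`. The helper name `first_seen_unique` is made up for this sketch.

```python
def first_seen_unique(values):
    """Keep the first occurrence of each key, using dict key semantics."""
    return list(dict.fromkeys(values))


# False is seen first, so it represents 0 and 0.0 as well.
falsy = first_seen_unique([False, 0, 0.0])
print(falsy, type(falsy[0]))

# Reversing the order keeps the int instead.
zero_first = first_seen_unique([0, False])
print(zero_first, type(zero_first[0]))

# Identity is checked before ==, so the same NaN object is found again
# even though nan != nan.
nan = float("nan")
same_nan = first_seen_unique([nan, nan])
print(len(same_nan))
```

Unlike this plain dict, the pandas object hashtable has extra handling for float NaNs, so NaN results there can differ from this sketch.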

So there are probably two alternatives:

1. Leave it as it is. Users who don't want Python's equivalence of False and 0 (and of True and 1) should preprocess their data.
2. Handle this case in pyobject_cmp(...), which already has special handling for float NaNs. That special handling didn't hurt performance.

To me, the first option seems totally OK.
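
One possible shape for the preprocessing route, as a hedged sketch rather than anything pandas provides: deduplicate on a key that records whether the value is a bool, so that False and 0 stay distinct while 0 and 0.0 still merge. The function name `unique_keep_bools` is hypothetical, and the sketch assumes all values are hashable.

```python
def unique_keep_bools(values):
    """Order-preserving unique that keeps bools apart from equal numbers.

    Assumes every value is hashable.
    """
    seen = set()
    result = []
    for value in values:
        # The bool flag keeps (True, False) apart from (False, 0),
        # while (False, 0) and (False, 0.0) still compare equal.
        key = (isinstance(value, bool), value)
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


kept = unique_keep_bools([0, False, 0.0, True, 1, 0])
print(kept)
```

The result can then be passed on as an object array if a pandas or NumPy container is needed. Keying on `type(value)` instead of the bool flag would also separate 0 from 0.0, which is a different choice from the expected output in the original report.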

mroeschke · 2021-06-11T05:55:10Z

Yeah given 0 and False are equivalent in Python, agreed that the current behavior is probably okay. Closing

jreback added the Bug, Missing-data, Datetime, and Difficulty Intermediate labels Nov 4, 2017

jreback added this to the 0.21.1 milestone Nov 4, 2017

mroeschke mentioned this issue Nov 6, 2017

PERF: Add cache keyword to to_datetime (#11665) #17077

Merged

4 tasks

jorisvandenbossche modified the milestones: 0.21.1, 0.22.0 Nov 30, 2017

jorisvandenbossche modified the milestones: 0.23.0, Next Major Release Mar 29, 2018

Dr-Irv mentioned this issue Apr 12, 2018

Series.is_unique has errors on objects with __ne__ defined #20661

Closed

mroeschke mentioned this issue Aug 13, 2018

pd.to_datetime() throws if caching is on with Null-like arguments #22305

Closed

jbrockmendel self-assigned this Jan 28, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke closed this as completed Jun 11, 2021
