qcut: Using cut with IntervalIndex provided by qcut producing wrong NaN values #17284

prcastro · 2017-08-18T19:18:01Z

Code Sample

>>> x.isnull().sum()
0

>>> x.value_counts()
0.000000     693
12.561725      1
13.568112      1
12.521249      1
13.007628      1
6.993961       1
14.815512      1
6.017280       1
12.944714      1
Name: 0, dtype: int64

>>> categorized = pd.qcut(x, 10, duplicates='drop')
>>> categorized.isnull().sum()
0

>>> categorized.cat.categories  # Notice how all values of x are contained in the only interval
IntervalIndex([(-0.001, 14.816]]
              closed='right',
              dtype='interval[float64]')

>>> res = pd.cut(x, categorized.cat.categories)
>>> res.isnull().sum()
701

Copy pastable

import pandas as pd

x = pd.read_csv('x.csv', header=None).iloc[:, 0]  # x.csv is provided in a comment below
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum()

Problem description

When I use qcut to get the IntervalIndex corresponding to the quantiles of a float64 series, and then use it as the bins of cut on the same float64 series, it doesn't work. It produces a new series with many NaN values, even though the original series contained no NaN and all of its values fall within the single interval of the IntervalIndex.

Expected Output

The results of qcut and cut should be the same, but they are not.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


prcastro · 2017-08-18T19:33:59Z

To be clear: Even if the feature in #17282 is implemented, this one would persist.

jreback · 2017-08-19T16:47:28Z

you would have to show your original data input. it looks like object dtype.

jreback · 2017-08-22T12:14:41Z

@prcastro pls show a complete copy-pastable example in the top section (just update). This should be a minimal example which shows the problem.

prcastro · 2017-08-22T14:10:26Z

Updated

jreback · 2017-08-24T12:41:58Z

@prcastro your example in the top is not copy-pastable. This needs to be a minimal example.

prcastro · 2017-08-28T03:16:42Z

@jreback Is this considered a minimal example?

x = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.9997242948235283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.554204550787334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.688134638736401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.200711854240529, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5832425505088623, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0647107369924282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.373043556642607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5822156198526636, 0.0, 0.0, 0.0, 0.0, 1.672944473242426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5384474167160302, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.667706820558076, 0.0, 0.0, 4.034952986707273, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.422144328051685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum()

jreback · 2017-08-29T12:42:52Z

@prcastro yes thank you. I am not exactly sure what is going on here. If you could debug and see where it's going wrong, that would be great.

prcastro · 2017-08-30T02:58:57Z

@jreback I traced the problem to IntervalIndex.get_indexer method. If I do:

import pandas as pd
x = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.9997242948235283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.554204550787334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.688134638736401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.200711854240529, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5832425505088623, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0647107369924282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.373043556642607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5822156198526636, 0.0, 0.0, 0.0, 0.0, 1.672944473242426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5384474167160302, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.667706820558076, 0.0, 0.0, 4.034952986707273, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.422144328051685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
categorized = pd.qcut(x, 10, duplicates='drop')
bins = categorized.cat.categories
bins.get_indexer(x)

It wrongly returns a lot of -1 values (which are later converted to NaN, resulting in the original problem). After entering the if statement here, the result in the stop variable seems to be wrong.

I took some time to investigate why this was happening. What I found was that when computing stop, in the method IntervalIndex._searchsorted_monotonic, the program was not entering the else clause here. If it does, everything seems to work fine.

However, I don't understand get_indexer well enough to pin down the cause of this bug, and I don't know why, or even whether, it should enter the else clause in IntervalIndex._searchsorted_monotonic that I pointed to above.
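
The link between the -1 entries from get_indexer and the NaN values from cut can be checked side by side. The following is a minimal sketch, not part of the original report: the series is a small synthetic stand-in (an assumption) chosen so that qcut collapses to a single interval, as in the reported data. On an affected version both counts would be expected to be nonzero; on a release that includes a fix, both would be expected to be zero.

```python
import pandas as pd

# Synthetic stand-in for the reported data (an assumption, not the original x):
# mostly zeros plus one larger value, so every inner quantile edge is 0.
x = pd.Series([0.0] * 99 + [1.0])

categorized = pd.qcut(x, 10, duplicates='drop')
bins = categorized.cat.categories  # a single interval covering all of x

# get_indexer marks values it cannot place with -1.
codes = bins.get_indexer(x)
unplaced = int((codes == -1).sum())

# cut reports values it cannot place as NaN.
missing = int(pd.cut(x, bins).isnull().sum())

print(len(bins), unplaced, missing)
```

Comparing the two counts, rather than hard-coding an expected number, keeps the check meaningful on both affected and fixed versions.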

jschendel · 2017-08-30T06:02:32Z

To distill this down a bit further, I think the root of the issue occurs when an IntervalIndex only has a single element:

In [2]: idx1 = pd.IntervalIndex.from_tuples([(0, 5)])

In [3]: idx1.contains(3)
Out[3]: False

In [4]: idx1.get_indexer([3])
Out[4]: array([-1], dtype=int64)

The same operations work when there are two elements:

In [5]: idx2 = pd.IntervalIndex.from_tuples([(0, 5), (5, 10)])

In [6]: idx2.contains(3)
Out[6]: True

In [7]: idx2.get_indexer([3])
Out[7]: array([0], dtype=int64)

A quick fix could be to write a special case in IntervalIndex._searchsorted_monotonic for when there's only a single element. It would probably be beneficial to examine the logic as a whole to see if it could be made more robust, though.
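
For illustration only, here is a sketch of a sorted-interval lookup that needs no special case for a single element. This is not pandas code and not the fix that was eventually merged; the function name `locate` and the plain list-of-tuples representation are hypothetical, and the intervals are assumed to be sorted, non-overlapping, and closed on the right.

```python
from bisect import bisect_left


def locate(intervals, value):
    """Return the position of the right-closed interval (left, right]
    that contains value, or -1 if no interval contains it."""
    rights = [right for _, right in intervals]
    # First interval whose right edge is >= value.
    pos = bisect_left(rights, value)
    # The candidate only matches if value lies strictly above its left edge.
    if pos < len(intervals) and intervals[pos][0] < value:
        return pos
    return -1


single = [(0, 5)]
double = [(0, 5), (5, 10)]

print(locate(single, 3))   # one-element case
print(locate(double, 3))   # two-element case
print(locate(double, 11))  # value outside every interval
```

Because the search runs over the right edges and the left edge is verified afterwards, the one-element case goes through the same code path as any other length.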

shoyer · 2017-08-30T06:06:13Z

This sounds related to the IntervalIndex indexing issues being tracked down in #16316 and #16386.

gfyoung added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 18, 2017

jreback added the Can't Repro label Aug 22, 2017

jreback added Bug and removed Can't Repro labels Aug 29, 2017

This was referenced May 2, 2018

BUG: IntervalIndex.get_loc fails when there is only one entry #20921

Closed

BUG: Fix IntervalIndex.get_loc/get_indexer for IntervalIndex of length one #20946

Merged

jreback added this to the 0.23.0 milestone May 4, 2018

jreback closed this as completed in #20946 May 5, 2018
