qcut: Using cut with IntervalIndex provided by qcut producing wrong NaN values #17284
Comments
To be clear: Even if the feature in #17282 is implemented, this one would persist. |
you would have to show your original data input. it looks like object dtype. |
@prcastro pls show a complete copy-pastable example in the top section (just update). This should be a minimal example which shows the problem. |
Updated |
@prcastro your example in the top is not copy-pastable. This needs to be a minimal example. |
@jreback Is this considered a minimal example?
import pandas as pd
x = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.9997242948235283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.554204550787334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.688134638736401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.200711854240529, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5832425505088623, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0647107369924282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.373043556642607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5822156198526636, 0.0, 0.0, 0.0, 0.0, 1.672944473242426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5384474167160302, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.667706820558076, 0.0, 0.0, 4.034952986707273, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.422144328051685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum() |
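A smaller variant of the same check might be easier to step through. The sketch below is not the reporter's data: the input of 19 zeros and a single 1.0 is a hypothetical reduction, chosen on the assumption that heavy duplication makes most quantile edges coincide so that duplicates='drop' leaves very few bins. It only counts how many values cut fails to place; the count would be 0 if cut and qcut agree.

```python
import pandas as pd

# Hypothetical reduced input (not from the report): mostly zeros, one non-zero value.
x = pd.Series([0.0] * 19 + [1.0])

# Quantile-based binning; duplicate bin edges are dropped rather than raising.
categorized = pd.qcut(x, 10, duplicates="drop")
bins = categorized.cat.categories  # IntervalIndex of the bins that survived

# Re-bin the same data using that IntervalIndex as explicit bins.
res = pd.cut(x, bins)

# Number of values that cut could not assign to any interval.
n_missing = int(res.isnull().sum())
print(len(bins), n_missing)
```

Printing len(bins) alongside the count makes it easy to see whether the failure coincides with a very short IntervalIndex.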
@prcastro yes, thank you. I am not exactly sure what is going on here. If you could debug and see where it's going wrong, that would be great. |
@jreback I traced the problem to
It wrongly returns a lot of I took some time to investigate why this was happening. What I found was that when computing However, I don't understand |
To distill this down a bit further, I think the root of the issue occurs when an
In [2]: idx1 = pd.IntervalIndex.from_tuples([(0, 5)])
In [3]: idx1.contains(3)
Out[3]: False
In [4]: idx1.get_indexer([3])
Out[4]: array([-1], dtype=int64)
The same operations work when there are two elements:
In [5]: idx2 = pd.IntervalIndex.from_tuples([(0, 5), (5, 10)])
In [6]: idx2.contains(3)
Out[6]: True
In [7]: idx2.get_indexer([3])
Out[7]: array([0], dtype=int64)
Quick fix could be to write a special case in |
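For reference, the lookup behaviour one would expect for right-closed intervals can be sketched in plain Python. This is not how pandas implements get_indexer; interval_positions is a hypothetical helper, and it assumes the intervals are sorted and non-overlapping. The point is only that a single-interval input should behave the same as a two-interval one.

```python
from bisect import bisect_left


def interval_positions(intervals, values):
    """For each value, return the position of the right-closed interval
    (left, right] that contains it, or -1 if none does.

    Assumes `intervals` is a sorted list of non-overlapping (left, right) pairs.
    """
    rights = [right for _, right in intervals]
    result = []
    for v in values:
        # First interval whose right edge is >= v.
        pos = bisect_left(rights, v)
        # It contains v only if v is strictly greater than its left edge.
        if pos < len(intervals) and intervals[pos][0] < v:
            result.append(pos)
        else:
            result.append(-1)
    return result


print(interval_positions([(0, 5)], [3]))           # [0]
print(interval_positions([(0, 5), (5, 10)], [3]))  # [0]
print(interval_positions([(0, 5)], [0, 5, 7]))     # [-1, 0, -1]
```

With this reference behaviour, the value 3 maps to position 0 whether the index holds one interval or two, which is what the second session above shows and the first does not.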
xref #17282
Code Sample
Copy pastable
Problem description
When I use qcut to get the IntervalIndex corresponding to the quantiles of a float64 series, and then use this as the bins of cut on the same float64 series, it doesn't work. It produces a new series with a lot of NaN values, even though the original series contained no NaN and all of its values fall within the intervals of the IntervalIndex.
Expected Output
The results of qcut and cut should be the same, but they are not.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None