qcut: Option to return -inf/inf as lower/upper bound #17282

Closed
prcastro opened this issue Aug 18, 2017 · 3 comments · May be fixed by dberenbaum/pandas#4
prcastro commented Aug 18, 2017

Code Sample

>>> import pandas as pd
>>> x = pd.qcut([2,3,4,5,6,7,8,9], q=5, duplicates='drop')
>>> x.categories
IntervalIndex([(1.999, 3.4], (3.4, 4.8], (4.8, 6.2], (6.2, 7.6], (7.6, 9.0]]
              closed='right',
              dtype='interval[float64]')

>>> pd.cut([1,5,6,10], x.categories)
[NaN, (4.8, 6.2], (4.8, 6.2], NaN]
Categories (5, interval[float64]): [(1.999, 3.4] < (3.4, 4.8] < (4.8, 6.2] < (6.2, 7.6] < (7.6, 9.0]]

Problem description

I'm currently using qcut and cut together for machine learning: qcut splits the training data into quantile bins, and cut then assigns the test data to those same bins.

However, if a value in the test data is lower than the smallest bin edge or higher than the largest one, it falls outside every category created by qcut, and cut returns NaN for it. One solution would be an option that makes qcut return -inf/inf as the lower/upper bound of the categories.

Expected Output

>>> x = pd.qcut([2,3,4,5,6,7,8,9], q=5, duplicates='drop', inf_bounds=True)
>>> x.categories
IntervalIndex([(-inf, 3.4], (3.4, 4.8], (4.8, 6.2], (6.2, 7.6], (7.6, inf]]
              closed='right',
              dtype='interval[float64]')
>>> pd.cut([1,5,6,10], x.categories)
[(-inf, 3.4], (4.8, 6.2], (4.8, 6.2], (7.6, inf]]
Categories (5, interval[float64]): [(-inf, 3.4] < (3.4, 4.8] < (4.8, 6.2] < (6.2, 7.6] < (7.6, inf]]
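
For reference, a similar result can probably be obtained today without a new option. This is only a sketch of one possible workaround, not the proposed `inf_bounds` implementation: it uses the existing `retbins=True` argument of `qcut` to get the bin edges back, replaces the two outermost edges with -inf/inf, and passes the widened edges to `cut`. The variable names `train`, `test`, `edges` and `binned` are just placeholders for illustration.

```python
import numpy as np
import pandas as pd

train = [2, 3, 4, 5, 6, 7, 8, 9]   # data used to learn the quantile bins
test = [1, 5, 6, 10]               # new data, partly outside the training range

# retbins=True makes qcut also return the array of bin edges it computed.
_, edges = pd.qcut(train, q=5, retbins=True, duplicates="drop")

# Widen the outermost edges so every finite value falls into some bin.
edges = np.asarray(edges, dtype=float)
edges[0], edges[-1] = -np.inf, np.inf

# Reuse the widened edges to bin the test data.
binned = pd.cut(test, bins=edges)
print(binned)
```

With the widened edges, 1 and 10 should land in the first and last bins instead of becoming NaN. The interior edges are still the quantiles learned from the training data.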

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

hurcy commented Aug 19, 2017

I would like to work on this issue.

gfyoung commented Aug 19, 2017

Go for it! No need to ask for permission, unless someone else has already claimed it (e.g., via a PR).

jreback added this to the Contributions Welcome milestone Jul 28, 2018
dberenbaum added a commit to dberenbaum/pandas that referenced this issue Jul 30, 2018
dberenbaum added a commit to dberenbaum/pandas that referenced this issue Aug 3, 2018
dberenbaum added a commit to dberenbaum/pandas that referenced this issue Aug 3, 2018
jreback modified the milestones: Contributions Welcome, 0.24.0 Sep 18, 2018
jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
jbrockmendel added the "quantile" label Nov 1, 2019
mroeschke added the "cut" label and removed the "Reshaping" and "quantile" labels Apr 5, 2020
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke

Thanks for the request, but it appears this feature request hasn't gained much traction in years, so closing.
