qcut: Option to return -inf/inf as lower/upper bound #22185

dberenbaum · 2018-08-03T02:30:46Z

closes qcut: Option to return -inf/inf as lower/upper bound #17282
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

gfyoung · 2018-08-03T02:55:15Z

pandas/core/reshape/tile.py

+            bins = bins.astype(np.float64)
+        bins[0] = -np.inf
+        bins[-1] = np.inf
+        pass


Why do we have a pass here?

gfyoung · 2018-08-03T02:55:29Z

pandas/tests/reshape/test_tile.py

@@ -479,6 +479,14 @@ def test_cut_read_only(self, array_1_writeable, array_2_writeable):
        tm.assert_categorical_equal(cut(hundred_elements, array_1),
                                    cut(hundred_elements, array_2))

+    def test_qcut_unbounded(self):
+        labels = qcut(range(5), 4, bounded=False)


Reference issue number as a comment above this line.

gfyoung

LGTM!

cc @jreback

codecov · 2018-08-03T09:49:58Z

Codecov Report

Merging #22185 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22185      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52363    52380      +17     
==========================================
+ Hits        48377    48393      +16     
- Misses       3986     3987       +1

Flag	Coverage Δ
#multiple	`90.81% <100%> (ø)`	⬆️
#single	`42.91% <16.66%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/reshape/tile.py	`95.08% <100%> (+0.13%)`	⬆️
pandas/util/testing.py	`88% <0%> (-0.1%)`	⬇️
pandas/core/ops.py	`94.28% <0%> (ø)`	⬆️
pandas/core/indexes/interval.py	`95.27% <0%> (ø)`	⬆️
pandas/core/arrays/interval.py	`93.12% <0%> (+0.03%)`	⬆️
pandas/core/arrays/categorical.py	`95.97% <0%> (+0.04%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d3565a...c2f194c.

dberenbaum · 2018-09-15T15:05:28Z

Checking in on this. How do I move this forward?

gfyoung · 2018-09-16T05:06:14Z

@dberenbaum : So sorry! Didn't realize this one fell through the cracks.

cc @jreback @jorisvandenbossche @TomAugspurger : This one has been ready to go IMO for over a month. Could one of you take a look?

jreback · 2018-09-18T13:14:03Z

I am not sure I like this; we are growing too many keywords here. Now we have 2 ways of specifying the bins. Is it really a problem to use np.inf in the bins specification?

dberenbaum · 2018-09-19T01:15:55Z

I'm not following how there are 2 ways of specifying the bins. I only see a way to specify quantiles in qcut (there is a bins arg in cut). I'm not sure how else to return np.inf as a bin edge using qcut.

The use cases I have are very similar to the one raised by @prcastro in #17282. When I need to 1) compare distributions or 2) bin numerical values into known categories, I use qcut to bin by quantiles and then bin the new distribution by those categories. This is problematic when the new distribution contains out of bounds values. The best alternative I've found is to avoid using qcut:

bins = np.quantile(x, [0, 0.25, 0.5, 0.75, 1])
bins[0] = -np.inf
bins[-1] = np.inf
pd.cut(x, bins)
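
The workaround above can be wrapped in a small helper. This is only a sketch: `unbounded_quantile_edges` is a hypothetical name, not a pandas function, and it assumes evenly spaced quantiles and a numeric, non-empty input.

```python
import numpy as np
import pandas as pd


def unbounded_quantile_edges(x, q):
    """Return q-quantile bin edges with the outer edges widened to -inf/inf.

    Hypothetical helper (not part of pandas) that mirrors the snippet above.
    """
    # Evenly spaced quantile levels, e.g. q=4 -> [0, 0.25, 0.5, 0.75, 1].
    edges = np.quantile(x, np.linspace(0, 1, q + 1)).astype(np.float64)
    edges[0] = -np.inf
    edges[-1] = np.inf
    return edges


x = np.arange(1, 10)                     # the values 1..9
edges = unbounded_quantile_edges(x, 4)
binned = pd.cut(x, edges)                # every value falls inside some interval
```

Because the outer edges are infinite, the minimum of `x` is included without needing `include_lowest=True`.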

jreback · 2018-12-03T01:58:42Z

@dberenbaum so I think for .cut() this is not needed at all, as a user can just specify np.inf as a bound; for .qcut() it's between 0 and 1, the quantiles. So can you give a case where this is actually useful?

dberenbaum · 2018-12-18T01:16:07Z

Please see #17282 for an example of how this is useful. That example is very similar to the cases where I would have found this useful. Does that example clarify the intended usage?

jreback · 2018-12-18T12:59:05Z

@dberenbaum did you see my comment? qcut is for quantile cutting, not arbitrary bounds, where cut suffices.

dberenbaum · 2018-12-19T01:49:37Z

@jreback I saw your comment. Did you look at the example in #17282?

Let's say I use qcut to bin a sample of data collected into quartiles:

>>> q = pd.qcut([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
>>> q
[(0.999, 3.0], (0.999, 3.0], (0.999, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, 9.0], (7.0, 9.0]]
Categories (4, interval[float64]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, 9.0]]

I subsequently collect a new data sample, which contains values > 9. I want to compare the new sample to my original data using the returned quartiles. There is no quartile for the new values > 9. With the bounded option, I could do the following:

>>> q = pd.qcut([1, 2, 3, 4, 5, 6, 7, 8, 9], 4, bounded=False)
>>> q
[(-inf, 3.0], (-inf, 3.0], (-inf, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, inf], (7.0, inf]]
Categories (4, interval[float64]): [(-inf, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, inf]]

Now, I have useful quartiles for any subsequent code that uses the output of qcut, even if I'm applying those quartiles to new data that falls outside the range of the original sample.

The bounded behavior might not always be desirable, but I've found it to at least be a useful option.
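
The scenario described above can be reproduced with current pandas. This is a sketch under assumptions: the `bounded` keyword was only proposed in this PR, so the outer edges are widened by hand, and `new_sample` is made-up data chosen to fall partly outside the original range.

```python
import numpy as np
import pandas as pd

original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
new_sample = [0, 4, 12]        # 0 and 12 lie outside the original range

# Bin edges learned from the original sample.
_, edges = pd.qcut(original, 4, retbins=True)

# Reusing the bounded edges leaves the out-of-range values unbinned (NaN).
bounded = pd.cut(new_sample, edges)

# Widening the outer edges by hand approximates the proposed bounded=False.
open_edges = np.asarray(edges, dtype=np.float64).copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
unbounded = pd.cut(new_sample, open_edges)
```

Comparing `pd.isna(bounded)` with `pd.isna(unbounded)` shows which values were dropped by the bounded edges.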

jreback · 2019-01-14T00:15:34Z

@dberenbaum ok I looked at this again and it is ok, can you merge master

TomAugspurger · 2019-01-15T03:27:11Z

pandas/core/reshape/tile.py

@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
    else:
        quantiles = q
    bins = algos.quantile(x, quantiles)
+    if not bounded and not dtype:


what about bounded and dtype? I feel like bounded should not be ignored in that case (though I don't know the correct behavior).

TomAugspurger · 2019-01-15T03:29:03Z

pandas/core/reshape/tile.py

@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
    else:
        quantiles = q
    bins = algos.quantile(x, quantiles)
+    if not bounded and not dtype:
+        if is_integer_dtype(bins):
+            bins = bins.astype(np.float64)


We probably don't want to do this. It can cause precision issues for large integers, and I suspect it may be surprising for users.

Could you instead use the min / max integer for the size?

info = np.iinfo(bins.dtype)
bins[0] = info.min
bins[-1] = info.max

dberenbaum · 2019-01-21

Thanks for the comments. I'm not sure either approach is guaranteed to avoid unexpected results for users. I think either would work for my use cases, but any approach will be a compromise since there is no way to represent infinity for int types. Looking into your other comment about dtype, the same issues arise for datetime-like types. I'm leaning towards closing this PR since I think the unbounded concept can only be naturally represented for float types and isn't worth using hacks for all other types.
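
The trade-off discussed in this review thread can be illustrated with plain NumPy. The bin values below are hypothetical and chosen only to trigger the float64 precision limit; neither option is presented as what pandas does.

```python
import numpy as np

# Hypothetical integer bin edges. The interior edge is just above 2**53,
# the point past which float64 cannot represent every integer exactly.
bins = np.array([0, 2**53 + 1, 2**53 + 10], dtype=np.int64)

# Option 1: cast to float64 so the outer edges could hold -inf/inf.
as_float = bins.astype(np.float64)
round_trip = as_float.astype(np.int64)
lost_precision = bool(round_trip[1] != bins[1])   # interior edge was rounded

# Option 2: keep the integer dtype and use its extreme values as sentinels.
info = np.iinfo(bins.dtype)
as_int = bins.copy()
as_int[0] = info.min
as_int[-1] = info.max
```

Option 2 keeps the interior edges exact, but the outer intervals then display the dtype's minimum and maximum rather than infinity, which is the compromise raised in the reply above.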

jreback · 2019-01-16T01:45:29Z

pandas/tests/reshape/test_qcut.py

@@ -197,3 +197,30 @@ def test_date_like_qcut_bins(arg, expected_bins):
    ser = Series(arg)
    result, result_bins = qcut(ser, 2, retbins=True)
    tm.assert_index_equal(result_bins, expected_bins)
+
+
+def test_qcut_unbounded():


can you parametrize over bounded

jreback · 2019-01-16T01:45:53Z

pandas/tests/reshape/test_qcut.py

+    labels = qcut(range(5), 4, bounded=False)
+    left = labels.categories.left.values
+    right = labels.categories.right.values
+    expected = np.array([-np.inf, 1.0, 2.0, 3.0, np.inf])


rather than use numpy arrays, can you construct the expected Index and use tm.assert_index_equal
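
Both review requests (parametrizing over `bounded` and comparing Index objects) might look roughly like the sketch below. Since `bounded` was never merged, `qcut_edges` is a hypothetical stand-in that widens the edges by hand, and `pd.testing.assert_index_equal` is used as the public counterpart of `tm.assert_index_equal`.

```python
import numpy as np
import pandas as pd
import pytest


def qcut_edges(x, q, bounded=True):
    """Hypothetical stand-in for the proposed ``qcut(..., bounded=...)``."""
    _, edges = pd.qcut(x, q, retbins=True)
    edges = np.asarray(edges, dtype=np.float64).copy()
    if not bounded:
        edges[0], edges[-1] = -np.inf, np.inf
    return pd.Index(edges)


@pytest.mark.parametrize(
    "bounded, expected",
    [
        (True, pd.Index([0.0, 1.0, 2.0, 3.0, 4.0])),
        (False, pd.Index([-np.inf, 1.0, 2.0, 3.0, np.inf])),
    ],
)
def test_qcut_unbounded(bounded, expected):
    # See pandas-dev issue #17282.
    result = qcut_edges(np.arange(5), 4, bounded=bounded)
    pd.testing.assert_index_equal(result, expected)
```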

WillAyd

Can you merge master?

WillAyd · 2019-02-27T23:32:50Z

doc/source/whatsnew/v0.24.0.rst

@@ -421,6 +421,7 @@ Other Enhancements
 - :func:`pandas.DataFrame.to_sql` has gained the ``method`` argument to control SQL insertion clause. See the :ref:`insertion method <io.sql.method>` section in the documentation. (:issue:`8953`)
 - :meth:`DataFrame.corrwith` now supports Spearman's rank correlation, Kendall's tau as well as callable correlation methods. (:issue:`21925`)
 - :meth:`DataFrame.to_json`, :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_XXX` etc. now support tilde(~) in path argument. (:issue:`23473`)
+- :func: qcut now accepts ``bounded`` as a keyword argument, allowing for unbounded quantiles such that the lower/upper bounds are -inf/inf (:issue:`17282`)


Move to 0.25 at this point

dberenbaum · 2019-03-01T02:19:04Z

Closing this PR per my comment above.

dberenbaum added 2 commits August 2, 2018 20:49

ENH: option to return -inf/inf as lower/upper bound for qcut quantiles, see pandas-dev#17282

c8e2d63

remove extraneous pass statement

12279de

gfyoung added the Enhancement and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Aug 3, 2018

gfyoung reviewed Aug 3, 2018

gfyoung reviewed Aug 3, 2018

clean up docs/comments for qcut bounded kwarg

66c1172

gfyoung approved these changes Aug 3, 2018

dberenbaum added 4 commits January 14, 2019 20:45

fixes merge conflicts in PR pandas-dev#22185

1d87989

fixes conflict in whatsnew doc in PR pandas-dev#22185

e5316fd

fixes bugs in qcut unbounded tests

b4e28c4

sorts imports in pandas/core/reshape/tile.py

a17cc9b

TomAugspurger reviewed Jan 15, 2019

Update pandas/core/reshape/tile.py

c2f194c

Co-Authored-By: dberenbaum <dave.berenbaum@gmail.com>

jreback requested changes Jan 16, 2019

WillAyd requested changes Feb 27, 2019

dberenbaum closed this Mar 1, 2019
