Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qcut: Option to return -inf/inf as lower/upper bound #22185

Closed
wants to merge 8 commits into from

Conversation

dberenbaum
Copy link

@gfyoung gfyoung added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 3, 2018
bins = bins.astype(np.float64)
bins[0] = -np.inf
bins[-1] = np.inf
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have a pass here?

@@ -479,6 +479,14 @@ def test_cut_read_only(self, array_1_writeable, array_2_writeable):
tm.assert_categorical_equal(cut(hundred_elements, array_1),
cut(hundred_elements, array_2))

def test_qcut_unbounded(self):
labels = qcut(range(5), 4, bounded=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference issue number as a comment above this line.

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

cc @jreback

@codecov
Copy link

codecov bot commented Aug 3, 2018

Codecov Report

Merging #22185 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22185      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52363    52380      +17     
==========================================
+ Hits        48377    48393      +16     
- Misses       3986     3987       +1
Flag Coverage Δ
#multiple 90.81% <100%> (ø) ⬆️
#single 42.91% <16.66%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/tile.py 95.08% <100%> (+0.13%) ⬆️
pandas/util/testing.py 88% <0%> (-0.1%) ⬇️
pandas/core/ops.py 94.28% <0%> (ø) ⬆️
pandas/core/indexes/interval.py 95.27% <0%> (ø) ⬆️
pandas/core/arrays/interval.py 93.12% <0%> (+0.03%) ⬆️
pandas/core/arrays/categorical.py 95.97% <0%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d3565a...c2f194c. Read the comment docs.

@dberenbaum
Copy link
Author

Checking in on this. How do I move this forward?

@gfyoung
Copy link
Member

gfyoung commented Sep 16, 2018

@dberenbaum : So sorry! Didn't realize this one fell through the cracks.

cc @jreback @jorisvandenbossche @TomAugspurger : This one has been ready to go IMO for over a month. Could one of you take a look?

@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

I am not sure I like this, we are growing too many keywords here. Now we have 2 ways of specifying the bins. Is this really a problem to use np.inf in the bins specification?

@dberenbaum
Copy link
Author

I'm not following how there are 2 ways of specifying the bins. I only see a way to specify quantiles in qcut (there is a bins arg in cut). I'm not sure how else to return np.inf as a bin edge using qcut.

The use cases I have are very similar to the one raised by @prcastro in #17282. When I need to 1) compare distributions or 2) bin numerical values into known categories, I use qcut to bin by quantiles and then bin the new distribution by those categories. This is problematic when the new distribution contains out of bounds values. The best alternative I've found is to avoid using qcut:

bins = np.quantile(x, [0, 0.25, 0.5, 0.75, 1])
bins[0] = -np.inf
bins[-1] = np.inf
pd.cut(x, bins)

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@dberenbaum so I this for .cut() this is not needed at all as a user can just specify np.inf as a bound, for .qcut() its between 0 and 1, the quantiles. So can you give a case where this is actually useful?

@dberenbaum
Copy link
Author

Please see #17282 for an example of how this is useful. That example is very similar to the cases where I would have found this useful. Does that example clarify the intended usage?

@jreback
Copy link
Contributor

jreback commented Dec 18, 2018

@dberenbaum did you see my comment. qcut is for quantile cutting, not arbitrary bounds, where cut suffices.

@dberenbaum
Copy link
Author

@jreback I saw your comment. Did you look at the example in #17282?

Let's say I use qcut to bin a sample of data collected into quartiles:

>>>q = pd.qcut([1,2,3,4,5,6,7,8,9], 4)
>>>q
[(0.999, 3.0], (0.999, 3.0], (0.999, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, 9.0], (7.0, 9.0]]
Categories (4, interval[float64]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, 9.0]]

I subsequently collect a new data sample, which contains values > 9. I want to compare the new sample to my original data using the returned quartiles. There is no quartile for the new values > 9. With the bounded option, I could do the following:

>>>q = pd.qcut([1,2,3,4,5,6,7,8,9], 4, bounded=False)
>>>q
[(-np.inf, 3.0], (0.999, 3.0], (0.999, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, 9.0], (7.0, np.inf]]
Categories (4, interval[float64]): [(-np.inf, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, np.inf]]

Now, I have useful quartiles for any subsequent code that uses the output of qcut, even if I'm applying those quartiles to new data that falls outside the range of the original sample.

The bounded behavior might not always be desirable, but I've found it to at least be a useful option.

@jreback
Copy link
Contributor

jreback commented Jan 14, 2019

@dberenbaum ok I looked at this again and it is ok, can you merge master

pandas/core/reshape/tile.py Outdated Show resolved Hide resolved
@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
else:
quantiles = q
bins = algos.quantile(x, quantiles)
if not bounded and not dtype:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about bounded and dtype? I feel like bounded should not be ignored in that case (though I don't know the correct behavior).

@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
else:
quantiles = q
bins = algos.quantile(x, quantiles)
if not bounded and not dtype:
if is_integer_dtype(bins):
bins = bins.astype(np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't want to do this. It can cause precision issues for large integers, and I suspect it may be surprising for users.

Could you instead use the min / max integer for the size?

info = np.iinf(bins.dtype)
bins[0] = info.min
bins[-1] = info.max

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. Not sure either approach is guaranteed to avoid unexpected results for users. I think either would work for my use cases, but any approach will be a compromise since there is no way to represent infinity for int types. Looking into your other comment about dtype, the same issues arise for datetime-like types. I'm leaning towards closing this PR since I think the unbounded concept can only be naturally represented for float types and isn't worth using hacks for all other types.

Co-Authored-By: dberenbaum <dave.berenbaum@gmail.com>
@@ -197,3 +197,30 @@ def test_date_like_qcut_bins(arg, expected_bins):
ser = Series(arg)
result, result_bins = qcut(ser, 2, retbins=True)
tm.assert_index_equal(result_bins, expected_bins)


def test_qcut_unbounded():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parametrize over bounded

labels = qcut(range(5), 4, bounded=False)
left = labels.categories.left.values
right = labels.categories.right.values
expected = np.array([-np.inf, 1.0, 2.0, 3.0, np.inf])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than use numpy arrays, can you construct the expected Index and use tm.assert_index_equal

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master?

@@ -421,6 +421,7 @@ Other Enhancements
- :func:`pandas.DataFrame.to_sql` has gained the ``method`` argument to control SQL insertion clause. See the :ref:`insertion method <io.sql.method>` section in the documentation. (:issue:`8953`)
- :meth:`DataFrame.corrwith` now supports Spearman's rank correlation, Kendall's tau as well as callable correlation methods. (:issue:`21925`)
- :meth:`DataFrame.to_json`, :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_XXX` etc. now support tilde(~) in path argument. (:issue:`23473`)
- :func: qcut now accepts ``bounded`` as a keyword argument, allowing for unbounded quantiles such that the lower/upper bounds are -inf/inf (:issue:`17282`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to 0.25 at this point

@dberenbaum dberenbaum closed this Mar 1, 2019
@dberenbaum
Copy link
Author

Closing this PR per my comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

qcut: Option to return -inf/inf as lower/upper bound
5 participants