-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
qcut: Option to return -inf/inf as lower/upper bound #22185
Conversation
pandas/core/reshape/tile.py
Outdated
bins = bins.astype(np.float64) | ||
bins[0] = -np.inf | ||
bins[-1] = np.inf | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have a pass
here?
pandas/tests/reshape/test_tile.py
Outdated
@@ -479,6 +479,14 @@ def test_cut_read_only(self, array_1_writeable, array_2_writeable): | |||
tm.assert_categorical_equal(cut(hundred_elements, array_1), | |||
cut(hundred_elements, array_2)) | |||
|
|||
def test_qcut_unbounded(self): | |||
labels = qcut(range(5), 4, bounded=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference issue number as a comment above this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
cc @jreback
Codecov Report
@@ Coverage Diff @@
## master #22185 +/- ##
==========================================
+ Coverage 92.38% 92.38% +<.01%
==========================================
Files 166 166
Lines 52363 52380 +17
==========================================
+ Hits 48377 48393 +16
- Misses 3986 3987 +1
Continue to review full report at Codecov.
|
Checking in on this. How do I move this forward? |
@dberenbaum : So sorry! Didn't realize this one fell through the cracks. cc @jreback @jorisvandenbossche @TomAugspurger : This one has been ready to go IMO for over a month. Could one of you take a look? |
I am not sure I like this, we are growing too many keywords here. Now we have 2 ways of specifying the bins. Is this really a problem to use np.inf in the bins specification? |
I'm not following how there are 2 ways of specifying the bins. I only see a way to specify quantiles in The use cases I have are very similar to the one raised by @prcastro in #17282. When I need to 1) compare distributions or 2) bin numerical values into known categories, I use
|
@dberenbaum so I this for |
Please see #17282 for an example of how this is useful. That example is very similar to the cases where I would have found this useful. Does that example clarify the intended usage? |
@dberenbaum did you see my comment. |
@jreback I saw your comment. Did you look at the example in #17282? Let's say I use
I subsequently collect a new data sample, which contains values > 9. I want to compare the new sample to my original data using the returned quartiles. There is no quartile for the new values > 9. With the
Now, I have useful quartiles for any subsequent code that uses the output of The |
@dberenbaum ok I looked at this again and it is ok, can you merge master |
@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'): | |||
else: | |||
quantiles = q | |||
bins = algos.quantile(x, quantiles) | |||
if not bounded and not dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about bounded
and dtype
? I feel like bounded should not be ignored in that case (though I don't know the correct behavior).
@@ -308,6 +315,11 @@ def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'): | |||
else: | |||
quantiles = q | |||
bins = algos.quantile(x, quantiles) | |||
if not bounded and not dtype: | |||
if is_integer_dtype(bins): | |||
bins = bins.astype(np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't want to do this. It can cause precision issues for large integers, and I suspect it may be surprising for users.
Could you instead use the min
/ max
integer for the size?
info = np.iinf(bins.dtype)
bins[0] = info.min
bins[-1] = info.max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. Not sure either approach is guaranteed to avoid unexpected results for users. I think either would work for my use cases, but any approach will be a compromise since there is no way to represent infinity for int types. Looking into your other comment about dtype, the same issues arise for datetime-like types. I'm leaning towards closing this PR since I think the unbounded concept can only be naturally represented for float types and isn't worth using hacks for all other types.
Co-Authored-By: dberenbaum <dave.berenbaum@gmail.com>
@@ -197,3 +197,30 @@ def test_date_like_qcut_bins(arg, expected_bins): | |||
ser = Series(arg) | |||
result, result_bins = qcut(ser, 2, retbins=True) | |||
tm.assert_index_equal(result_bins, expected_bins) | |||
|
|||
|
|||
def test_qcut_unbounded(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parametrize over bounded
labels = qcut(range(5), 4, bounded=False) | ||
left = labels.categories.left.values | ||
right = labels.categories.right.values | ||
expected = np.array([-np.inf, 1.0, 2.0, 3.0, np.inf]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than use numpy arrays, can you construct the expected Index and use tm.assert_index_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge master?
@@ -421,6 +421,7 @@ Other Enhancements | |||
- :func:`pandas.DataFrame.to_sql` has gained the ``method`` argument to control SQL insertion clause. See the :ref:`insertion method <io.sql.method>` section in the documentation. (:issue:`8953`) | |||
- :meth:`DataFrame.corrwith` now supports Spearman's rank correlation, Kendall's tau as well as callable correlation methods. (:issue:`21925`) | |||
- :meth:`DataFrame.to_json`, :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_XXX` etc. now support tilde(~) in path argument. (:issue:`23473`) | |||
- :func: qcut now accepts ``bounded`` as a keyword argument, allowing for unbounded quantiles such that the lower/upper bounds are -inf/inf (:issue:`17282`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to 0.25 at this point
Closing this PR per my comment above. |
git diff upstream/master -u -- "*.py" | flake8 --diff