
qcut: Cannot specify labels if quantiles contain duplicates #10483

Open · 2 tasks done
stinodego opened this issue Aug 14, 2023 · 9 comments

Labels: bug (Something isn't working) · P-low (Priority: low) · python (Related to Python Polars)

Comments

@stinodego (Member)

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

s = pl.Series([1, 2, 2, 3])

s.qcut([0.50, 0.51], labels=['a', 'b', 'c'])
# DuplicateError: Breaks are not unique

# OK, let's allow duplicates...

s.qcut([0.50, 0.51], labels=['a', 'b', 'c'], allow_duplicates=True)
# pyo3_runtime.PanicException: index out of bounds: the len is 2 but the index is 2

# Makes sense, we only have two bins so we should have two labels.

s.qcut([0.50, 0.51], labels=['a', 'b'], allow_duplicates=True)
# ShapeError: Wrong number of labels

Issue description

allow_duplicates does not play nice with labels.

Expected behavior

I believe the solution is for allow_duplicates to also drop the label associated with the duplicate quantile.
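A hedged sketch of what that could look like, using the example from above (which label should be dropped is exactly the open question, so the outcome shown is hypothetical):

import polars as pl

s = pl.Series([1, 2, 2, 3])

# The 0.50 and 0.51 quantiles both resolve to the same break (2.0), so with
# allow_duplicates=True one break is collapsed, leaving two bins. The proposal
# is to drop one of the associated labels as well, e.g. keeping ['a', 'c']
# (or ['a', 'b'] -- which one is the ambiguity discussed in the comments below).
s.qcut([0.50, 0.51], labels=['a', 'b', 'c'], allow_duplicates=True)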

Installed versions

main branch

@stinodego stinodego added bug Something isn't working python Related to Python Polars accepted Ready for implementation labels Aug 14, 2023
@mcrumiller (Contributor)

I must have missed the discussion, what is the point of allow_duplicates?

@magarick (Contributor)
magarick commented Aug 14, 2023

This was brought up before. The issue is that which labels need to be dropped is potentially ambiguous and care is required to not end up with nonsense. I have something reasonable figured out but I need to find the time to implement it.

Edit: #9755 (comment)

@magarick (Contributor)

> I must have missed the discussion, what is the point of allow_duplicates?

It lets you bin data even if it has enough repeated values to result in duplicate quantiles for breaks.
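For illustration, a minimal sketch of the situation being described (the data is made up; the point is that repeated values push two requested quantiles onto the same break):

import polars as pl

# With this many repeated values, the 0.25 and 0.50 quantiles both fall on 1,
# so the requested breaks are not unique.
s = pl.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

s.qcut([0.25, 0.50, 0.75])
# DuplicateError: Breaks are not unique

# allow_duplicates=True collapses the duplicated break instead of failing.
s.qcut([0.25, 0.50, 0.75], allow_duplicates=True)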

@paladin158
Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.
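The handle_duplicates='random' option is only a proposal; a manual sketch of the same idea (the noise scale and example data are assumptions):

import numpy as np
import polars as pl

# Without jitter, the 0.25 and 0.50 quantiles of this series both fall on 1.
s = pl.Series([1, 1, 1, 1, 1, 1, 2, 3])

# Tiny random noise breaks the ties; it must stay small relative to the spacing
# between distinct values so the ordering of the real values is preserved.
jittered = s + pl.Series(np.random.uniform(-1e-6, 1e-6, len(s)))
jittered.qcut([0.25, 0.50, 0.75], labels=['a', 'b', 'c', 'd'])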

@s-banach (Contributor)

> Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.

They could just do .rank().qcut() for this, since rank has different methods such as “random”.
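For example (rank(method="random") is an existing rank method that breaks ties arbitrarily; the quantiles here are chosen for illustration rather than taken from the original report):

import polars as pl

s = pl.Series([1, 2, 2, 3])

# Ranking with random tie-breaking produces all-distinct values, so the
# quantile breaks computed by qcut can no longer collide.
s.rank(method="random").qcut([0.25, 0.75], labels=['low', 'mid', 'high'])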

@qqlearn123

> Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.

> They could just do .rank().qcut() for this, since rank has different methods such as “random”.

That works only if you know in advance that the data will produce duplicate bins. It may also be better, in terms of performance, to implement the logic inside qcut.

@magarick (Contributor)

Jittering doesn't seem appropriate here. rank + qcut would actually be pretty reasonable, since qcut sorts the data anyway to compute the quantiles. If you don't care about duplicates and just want even bins no matter what, maybe ntile could be added, since it's been requested before. But adding jittering internally feels like an unnecessary mixing of concerns, complicating the code for a specific, and in my experience uncommon, solution to an edge case.

So ideally I think we'd end up with the ability to bin by quantiles and either fail or optionally combine bins if there are duplicates, AND have a function that splits data into evenly sized bins. I think the latter would be close enough to qcut when there aren't duplicates.
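ntile is not currently in Polars (per the comment, it has only been requested); a rough emulation of evenly sized bins via rank, as a sketch of what such a function could do:

import polars as pl

s = pl.Series([1, 2, 2, 3, 5, 8, 13, 21])
n_bins = 4

# Assign each value to one of n_bins groups of (nearly) equal size based on its
# rank; random tie-breaking keeps group sizes even regardless of duplicates.
ntile = ((s.rank(method="random") - 1) * n_bins / s.len()).floor().cast(pl.Int32)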

@mcrumiller (Contributor)

Agree with @magarick, don't secretly jitter data without telling the user.

@paladin158
paladin158 commented Aug 16, 2023

> Agree with @magarick, don't secretly jitter data without telling the user.

If you read my comment a bit more carefully, what I was suggesting is to give users an option to handle duplicates by adding random noise. Whether that approach is sound may be worth discussing, but it is certainly not doing anything secretly without telling users.

@stinodego stinodego added P-low Priority: low and removed accepted Ready for implementation labels Jan 12, 2024