
qcut: Cannot specify labels if quantiles contain duplicates #10483

Open · 2 tasks done
stinodego opened this issue Aug 14, 2023 · 9 comments

Labels: bug (Something isn't working) · P-low (Priority: low) · python (Related to Python Polars)

Comments

@stinodego (Member)

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

s = pl.Series([1, 2, 2, 3])

s.qcut([0.50, 0.51], labels=['a', 'b', 'c'])
# DuplicateError: Breaks are not unique

# OK, let's allow duplicates...

s.qcut([0.50, 0.51], labels=['a', 'b', 'c'], allow_duplicates=True)
# pyo3_runtime.PanicException: index out of bounds: the len is 2 but the index is 2

# Makes sense, we only have two bins so we should have two labels.

s.qcut([0.50, 0.51], labels=['a', 'b'], allow_duplicates=True)
# ShapeError: Wrong number of labels

Issue description

allow_duplicates does not play nice with labels.

Expected behavior

I believe the solution is for allow_duplicates to also drop the label associated with the duplicate quantile.
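A hedged sketch of what that could look like, using the example from above (which label should be dropped is exactly the open question, so the outcome shown is hypothetical):

import polars as pl

s = pl.Series([1, 2, 2, 3])

# The 0.50 and 0.51 quantiles both resolve to the same break (2.0), so with
# allow_duplicates=True one break is collapsed, leaving two bins. The proposal
# is to drop one of the associated labels as well, e.g. keeping ['a', 'c']
# (or ['a', 'b'] -- which one is the ambiguity discussed in the comments below).
s.qcut([0.50, 0.51], labels=['a', 'b', 'c'], allow_duplicates=True)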

Installed versions

main branch

@stinodego stinodego added bug Something isn't working python Related to Python Polars accepted Ready for implementation labels Aug 14, 2023
@mcrumiller (Contributor)

I must have missed the discussion, what is the point of allow_duplicates?

@magarick (Contributor)
magarick commented Aug 14, 2023

This was brought up before. The issue is that which labels need to be dropped is potentially ambiguous and care is required to not end up with nonsense. I have something reasonable figured out but I need to find the time to implement it.

Edit: #9755 (comment)

@magarick (Contributor)

> I must have missed the discussion, what is the point of allow_duplicates?

It lets you bin data even if it has enough repeated values to result in duplicate quantiles for breaks.
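For illustration, a minimal sketch of the situation being described (the data is made up; the point is that repeated values push two requested quantiles onto the same break):

import polars as pl

# With this many repeated values, the 0.25 and 0.50 quantiles both fall on 1,
# so the requested breaks are not unique.
s = pl.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

s.qcut([0.25, 0.50, 0.75])
# DuplicateError: Breaks are not unique

# allow_duplicates=True collapses the duplicated break instead of failing.
s.qcut([0.25, 0.50, 0.75], allow_duplicates=True)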

@paladin158
Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.
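The handle_duplicates='random' option is only a proposal; a manual sketch of the same idea (the noise scale and example data are assumptions):

import numpy as np
import polars as pl

# Without jitter, the 0.25 and 0.50 quantiles of this series both fall on 1.
s = pl.Series([1, 1, 1, 1, 1, 1, 2, 3])

# Tiny random noise breaks the ties; it must stay small relative to the spacing
# between distinct values so the ordering of the real values is preserved.
jittered = s + pl.Series(np.random.uniform(-1e-6, 1e-6, len(s)))
jittered.qcut([0.25, 0.50, 0.75], labels=['a', 'b', 'c', 'd'])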

@s-banach (Contributor)

> Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.

They could just do .rank().qcut() for this, since rank has different methods such as “random”.
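For example (rank(method="random") is an existing rank method that breaks ties arbitrarily; the quantiles here are chosen for illustration rather than taken from the original report):

import polars as pl

s = pl.Series([1, 2, 2, 3])

# Ranking with random tie-breaking produces all-distinct values, so the
# quantile breaks computed by qcut can no longer collide.
s.rank(method="random").qcut([0.25, 0.75], labels=['low', 'mid', 'high'])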

@qqlearn123

> Another way people typically handle duplicates is by adding random noise to the original data. So, adding an option like handle_duplicates='random' seems like a reasonable way to deal with duplicates as well.

> They could just do .rank().qcut() for this, since rank has different methods such as “random”.

That works only if you know in advance that the data will produce duplicate bins. It may also be better, in terms of performance, to implement the logic inside qcut.

@magarick (Contributor)

Jittering doesn't seem appropriate here. rank + qcut would actually be pretty reasonable, since qcut sorts the data anyway to compute the quantiles. If you don't care about duplicates and just want even bins no matter what, maybe ntile could be added, since it's been requested before. But adding jittering internally feels like an unnecessary mixing of concerns, complicating the code for a specific, and in my experience uncommon, solution to an edge case.

So ideally I think we'd end up with the ability to bin by quantiles and either fail or optionally combine bins if there are duplicates, AND have a function that splits data into evenly sized bins. I think the latter would be close enough to qcut when there aren't duplicates.
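ntile is not currently in Polars (per the comment, it has only been requested); a rough emulation of evenly sized bins via rank, as a sketch of what such a function could do:

import polars as pl

s = pl.Series([1, 2, 2, 3, 5, 8, 13, 21])
n_bins = 4

# Assign each value to one of n_bins groups of (nearly) equal size based on its
# rank; random tie-breaking keeps group sizes even regardless of duplicates.
ntile = ((s.rank(method="random") - 1) * n_bins / s.len()).floor().cast(pl.Int32)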

@mcrumiller (Contributor)

Agree with @magarick, don't secretly jitter data without telling the user.

@paladin158
paladin158 commented Aug 16, 2023

> Agree with @magarick, don't secretly jitter data without telling the user.

If you read my comment a bit more carefully, what I was suggesting is to give users an option to handle duplicates by adding random noise. Whether that approach is sound may be worth discussing, but it is certainly not doing anything secretly without telling users.

@stinodego stinodego added P-low Priority: low and removed accepted Ready for implementation labels Jan 12, 2024