qcut: Cannot specify labels if quantiles contain duplicates #10483
Comments
I must have missed the discussion, what is the point of |
This was brought up before. The issue is that it is potentially ambiguous which labels need to be dropped, and care is required to avoid ending up with nonsense. I have something reasonable figured out, but I need to find the time to implement it. Edit: #9755 (comment)
It lets you bin data even if it has enough repeated values to result in duplicate quantiles for breaks.
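To make the comment above concrete, here is a minimal sketch of how repeated values can produce duplicate breaks. It uses a simple nearest-rank quantile written for this example (the helper `quantile_breaks` is hypothetical and is not how Polars computes quantiles), so the exact breaks may differ from what `qcut` would use.

```python
import math


def quantile_breaks(values, probs):
    """Nearest-rank quantiles; a stand-in for illustration, not Polars' method."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest rank: the smallest value with at least p * n observations at or below it.
    return [ordered[max(math.ceil(p * n) - 1, 0)] for p in probs]


# Six of the eight observations are identical, so all three quartile
# breaks fall on the same value.
data = [1, 1, 1, 1, 1, 1, 2, 3]
breaks = quantile_breaks(data, [0.25, 0.5, 0.75])
has_duplicates = len(set(breaks)) < len(breaks)
print(breaks, has_duplicates)
```

With breaks like these, two of the four requested bins are empty, which is the situation the labels question is about.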
Another way people typically deal with duplicates is to add random noise to the original data. So, adding an option like
They could just do .rank().qcut() for this, since rank has different methods such as “random”.
That works only if you know in advance the data will have duplicated bins. It may be better in terms of performance to implement the logic inside
Jittering doesn't seem appropriate here. rank + qcut would actually be pretty reasonable, since qcut sorts the data anyway to compute the quantiles. If you don't care about duplicates and just want even bins no matter what, maybe So ideally I think we'd end up with the ability to bin by quantiles and either fail or optionally combine bins if there were duplicates, AND have a function that splits data into evenly sized bins. I think the latter would be close enough to qcut when there aren't duplicates.
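The "evenly sized bins" idea from the comment above can be sketched in plain Python as rank-then-split. This is only an illustration under assumptions: `even_bins` is a hypothetical helper, not a Polars function, and ties are broken by original position (an ordinal rank) rather than randomly.

```python
def even_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 with near-equal bin sizes."""
    n = len(values)
    # Ordinal rank: sorted() is stable, so tied values keep their original order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Map rank 0..n-1 onto bin 0..n_bins-1.
        bins[i] = rank * n_bins // n
    return bins


data = [1, 1, 1, 1, 1, 1, 2, 3]
assignment = even_bins(data, 4)
print(assignment)
```

The trade-off is visible in the result: the bins are the same size, but identical values end up in different bins, so this is not equivalent to quantile binning when the data has many ties.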
Agree with @magarick, don't secretly jitter data without telling the user.
If you could just by any chance read a little bit more carefully, what I was suggesting is to give users an option to handle duplicates by adding random noise. Whether this approach is sound may be worth discussing, but it is definitely not about secretly doing it without telling users.
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Issue description
allow_duplicates does not play nice with labels.
Expected behavior
I believe the solution is for allow_duplicates to also drop the label associated with the duplicate quantile.
Installed versions
main branch
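The expected behavior described in this issue, dropping the label that belongs to a duplicate quantile, could look roughly like the sketch below. It is one possible reading, not the Polars implementation: `dedupe_breaks_and_labels` is a hypothetical helper, and it assumes sorted breaks with label i naming the bin that ends at break i, plus one final label for everything above the last break.

```python
def dedupe_breaks_and_labels(breaks, labels):
    """Drop repeated breaks together with the labels of the empty bins they create.

    Assumes `breaks` is sorted and `labels` has len(breaks) + 1 entries.
    """
    if len(labels) != len(breaks) + 1:
        raise ValueError("expected exactly one more label than breaks")
    kept_breaks, kept_labels = [], []
    for i, b in enumerate(breaks):
        if kept_breaks and b == kept_breaks[-1]:
            # The bin between two equal breaks is empty, so its label is dropped.
            continue
        kept_breaks.append(b)
        kept_labels.append(labels[i])
    # The last label covers everything above the final break and is always kept.
    kept_labels.append(labels[-1])
    return kept_breaks, kept_labels


new_breaks, new_labels = dedupe_breaks_and_labels([1, 1, 2], ["a", "b", "c", "d"])
print(new_breaks, new_labels)
```

Here the second break repeats the first, so label "b" is removed and the remaining labels still line up one-to-one with the surviving bins.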