Redesign `cut` and `qcut` #10468
Thanks for making an issue. I'm picking this up. |
This was a mistake on my part. Aligning is good.
I don't think that deals with the core issue: one parameter can't unambiguously be used for two things. @lorentzenchr Regarding the other thread, I'm sorry if I offended you, but I'm genuinely confused about what came off as harsh language. I enjoy salty language, but I didn't think I was anywhere close to it here. |
Nothing around quantiles has anything to do with probabilities, so that's definitely a -1 from me on naming anything |
Here's what I'd suggest:

```python
def cut(
    self,
    breaks: list[float],
    labels: list[str] | None = None,
    *,
    left_closed: bool = True,
    include_intervals: bool = False,
) -> Series | DataFrame: ...

def cut_quantiles(
    self,
    breaks: list[float] | int,
    labels: list[str] | None = None,
    *,
    left_closed: bool = True,
    include_intervals: bool = False,
) -> Series | DataFrame: ...
```

Changes from current API:
With also the following change in
|
I agree with most of your proposal, @orlp. I think we should just merge a breaking redesign of
The expression variants cannot return a DataFrame; they will return a Struct column. The Series methods could also just return a Struct column, although returning a DataFrame might be more ergonomic in some instances. Either way is fine.
I don't know what the reason is for the right-closed default. pandas also has right-closed as a default. If we set left-closed as a default, the parameter should be named
What would be the performance impact on this? If it's expensive, this might not be the best idea. |
That's just not true. Ask any statistician or refer to the literature. Quantiles are, in a way, the inverse of a probability distribution. I would be very interested in any definition of quantiles without the notions of probability, sample and statistics. |
If properly implemented, nothing. To find the quantiles themselves you already need to do the sort / quickselect, which also provides you with the partition. |
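To sketch that claim: a single sort (or quickselect) of the data yields both the quantile cut points and the partition, with no extra pass over the data. The helper below is mine, not polars code; it uses rank-based (non-interpolating) cut points for simplicity.

```python
import math


def quantile_partition(xs, k):
    """One sort yields both the k-quantile values and the k-way partition.

    Illustrative sketch: cut points are taken at ranks round(i*n/k),
    a non-interpolating convention (real implementations vary).
    """
    xs = sorted(xs)                        # the only O(n log n) work
    n = len(xs)
    cuts = [round(i * n / k) for i in range(1, k)]
    breaks = [xs[c] for c in cuts]         # quantile values, read off directly
    parts = [xs[lo:hi] for lo, hi in zip([0] + cuts, cuts + [n])]
    return breaks, parts
```

So once the sort needed to find the quantiles has been paid for, the partition is just a set of slices of the sorted data.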
None of that has anything to do with the computation of the quantiles of a particular dataset. Probability, distributions, statistics only come into play when interpreting the meaning of those quantiles, assuming your dataset corresponds to a sample of an underlying distribution. We're discussing the computational method here, not any statistical interpretation of it (which may or may not be relevant for the user).
Quite simple, for the

Note that the linear interpolation is not necessary if you simply wish to partition the series. All elements with rank |
First, this issue is not about computational methods.
So you are using the order statistic of a given sample/data set. That's statistics with a notion of probability. |
No, again, you are applying a statistical interpretation to it. I quote from Wikipedia (emphasis mine):
We are talking about computational methods on

If you view the input as a statistical sample with an underlying distribution, then sure, yes, the computational method can be interpreted to compute a statistical quantity. But your reasoning is circular: you derive "this is a statistical/probabilistic process" starting from the assumption... that this is a probabilistic sample of an underlying distribution. Which is just not necessarily true. As a counterexample,

Either way, I think |
Let's go with |
To be honest, I really do not like this discussion. There is the issue of having a separate parameter to say: "Hello

Then there is the point of specifying which quantiles exactly. For the naming, I would at least look at what other libraries and languages do:
The paper Sample Quantiles in Statistical Packages is THE reference on how to calculate quantiles. So it seems evident to me that there are probabilities involved. As the logic of

BTW, I really feel offended by someone repeatedly saying quantiles have nothing to do with probabilities. As info: for a statistician, a set of data points, e.g. a polars.Series (no matter its type), is just called a sample. More disclosure: I teach at the department of mathematics at a university. |
None of the linked functions perform a quantile partition. Pandas'
We aren't computing quantiles here. I have no objection to using this nomenclature on the

Perhaps we should drop the quantile nomenclature here altogether to avoid the confusion?
But this is wrong in the proposed redesign (it does match the current, problematic implementation), and is precisely what caused a variety of issues around duplicates. Consider the array
I should clarify, I meant not necessarily having something to do with probability. Obviously in a statistical context there are probabilities involved. And of course it's nothing personal. |
@lorentzenchr Perhaps you find my argument more convincing if you consider that both ' |
Another bikeshed suggestion: |
A

Modern gradient boosting libs like XGBoost, LightGBM and scikit-learn use histograms and perform exactly the operation of |
As long as strings have an order/are sortable (which they are), they fully qualify as a sample on which to calculate quantiles with specified probability levels. |
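One way to make this claim concrete: a quantile definition that never interpolates, such as Hyndman & Fan type 1 (the inverse empirical CDF), applies to any sortable data, strings included, because it only ever picks an existing element. This is an illustrative sketch, not polars code.

```python
import math


def quantile_type1(xs, p):
    """H&F type 1: smallest element whose empirical CDF reaches level p.

    No arithmetic is performed on the elements, only comparisons,
    so any ordered type works.
    """
    xs = sorted(xs)
    if p <= 0:
        return xs[0]
    # 1-indexed rank ceil(n*p), converted to a 0-indexed position
    return xs[math.ceil(len(xs) * p) - 1]


quantile_type1(["foo", "bar", "baz"], 0.5)  # the sample median of the strings
```

Interpolating definitions (like the common type 7) do require arithmetic on the values and hence break down on strings, which is where the disagreement below comes in.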
I think that's a fair point, I will give it some more thought.
That doesn't make sense to me. What would be the quantile with probability level 0.5 for the sample |
Alright, proposed redesign, take two. We remove

```python
def bin_intervals(
    self,
    intervals: list[Any] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
    right_closed: bool = False,
) -> Series | DataFrame: ...

def bin_quantiles(
    self,
    quantiles: list[float] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
    right_closed: bool = False,
) -> Series | DataFrame: ...

def bin_ranks(
    self,
    ranks: list[float] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
) -> Series | DataFrame: ...
```

Changes from current API:
With also the following change to
The following change to
And a new function:
How does that sound @lorentzenchr ? |
Respectfully, this lengthy discussion/argument appears very unproductive. Everything in the

But I don't understand why
I'm unsure how an argument regarding the connection between quantiles and probability led to the need for changing two existing function names, their parameter names, and their default values, which breaks everything. @orlp Everything in |
It didn't. It's an accumulation of #10525, #10483, #10524, #9854, this issue, etc.
Some of the above issues disagree with that.
I know it is found in
immediately abandons the
Not identical, but definitely related, and should be consistent.
Absolutely not. Binning into equal-sized quantiles is completely different from binning based on equal-sized intervals for skewed distributions. And both are different (as @lorentzenchr pointed out with regard to assigning consistent labels to identical values) from equal-sized binning based on rank. Hence, three binning functions. Example, if we let

```python
l.bin_intervals(4)  # bins to [[0, 0, 0, 0, 0], [1, 1.3, 1.6], [2], [3]]
l.bin_quantiles(4)  # bins to [[0, 0, 0, 0, 0], [], [1, 1.3], [1.6, 2, 3]]
l.bin_ranks(4)      # bins to [[0, 0, 0], [0, 0, 1], [1.3, 1.6, 2], [3]]
```
|
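The quantile-vs-rank distinction in that example can be sketched in plain Python. The list `l` below is inferred by concatenating the bins quoted above; the conventions chosen (H&F type 7 interpolated breakpoints with right-closed intervals for quantile binning, ceil-sized chunks for rank binning) are one possibility among several and are not the polars implementation.

```python
import math


def bin_quantiles(xs, k):
    """Bin by k-quantile breakpoints (linear interpolation, right-closed
    intervals). Heavy duplication can leave some bins empty."""
    xs = sorted(xs)

    def quantile(q):
        idx = q * (len(xs) - 1)
        lo, hi = math.floor(idx), math.ceil(idx)
        return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])

    breaks = [quantile(i / k) for i in range(1, k)]
    bins = [[] for _ in range(k)]
    for x in xs:
        # first bin whose right-closed break covers x, else the last bin
        i = next((j for j, b in enumerate(breaks) if x <= b), k - 1)
        bins[i].append(x)
    return bins


def bin_ranks(xs, k):
    """Rank-based binning: near-equal chunk sizes regardless of duplicates."""
    xs = sorted(xs)
    size = math.ceil(len(xs) / k)
    return [xs[i:i + size] for i in range(0, len(xs), size)]


l = [0, 0, 0, 0, 0, 1, 1.3, 1.6, 2, 3]
```

Under these conventions `bin_quantiles(l, 4)` leaves the second bin empty (two of the interpolated breakpoints fall inside the run of zeros), while `bin_ranks(l, 4)` splits the identical zeros across chunks to keep sizes equal: exactly the trade-off motivating separate functions.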
FYI Matlab uses |
@orlp I don't see much relevance between the linked issues and my point. I didn't say you shouldn't fix anything in |
This is, um, "colorful". I'll leave the micro-bike-shedding and just state some high level facts and observations.
I think the type of debate going on here is losing sight of the big picture. People will get used to, and be comfortable with, all but the most wildly confusing and unreasonable interfaces. What they're mostly asking for is more features and more flexibility. I've had a PR outstanding for quite a while to improve the labels that seems to be languishing for no good reason. And I've been stalled in adding other things people want here because of it. These are commonly used functions so I know it'll attract a lot of attention, but I think it's more important to make sure people using them can do what they need to rather than everyone being upset about the interface not conforming exactly to their personal tastes. |
First,
I think 'drop' should be added, since sometimes we do not want multiple values for the breakpoint. In pandas, the relevant parameter is:
For polars version 0.18.15, if we set

Second, |
We have
Because one really specifies the probability level of the quantiles, not the quantiles themselves. Let's say I want qcut with just one break at the 50%-quantile; I would specify

My intention was not to change everything, just to align the parameters and behaviour of

If I were to start from scratch, I would sympathize with #10468 (comment) because I like the explicit |
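The level-versus-value distinction behind this point fits in a two-line, standard-library example: the probability level 0.5 is the input, and the quantile is the output on the data's own scale.

```python
import statistics

sample = [10, 20, 30, 40]
# The quantile at probability level 0.5 is a value on the data's scale
# (here 25.0), not the level 0.5 itself.
median = statistics.median(sample)
```

Hence a parameter holding `[0.5]` is naturally a list of probability levels, while the value 25.0 is the quantile it selects.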
That's what first / last do, they just allow more flexibility than drop (which is unclear about which breakpoint and thus which label is kept).
I have no intention of silently introducing nulls here without a very good reason. Can you give some example code that shows exactly what you'd want and why?
The problem is that this doesn't really work with half-open intervals. Adding arbitrary deltas does not seem great to me. But I'll consider it.
That's fine by me, although it does introduce an arbitrary downwards bias by flooring rather than rounding, assuming you are referring to Method 1 from H&F as numpy describes it. Because the 65% quantile in that definition would still be "baz", right?
I understand, but we've been receiving a lot of different issues surrounding the
That's good to hear. As a compromise I will make sure to mention quantile level / probability level in the documentation, but for consistency with the other methods ( Also after talking with @stinodego another advantage of calling the new functions |
@orlp Thanks.
These two parts are about the same topic, and 'drop' is not the same as first or last. Please follow this code, and you will see the meaning of 'drop' here.

Pandas:

```python
df = pd.DataFrame({
    'value': [1] * 10
})
pd.qcut(df['value'], 2, labels=False, duplicates='drop')
```

Polars:

```python
pl.from_pandas(df).select(pl.col('value').qcut(2, allow_duplicates=False))
```

'drop' is quite useful when we do not want those points which are hard to assign to a quantile. What I hope is just that the previous polars code can give the same result as pandas does. Thank you. As for the delta, I think we can add a parameter: precision. |
Thinking about it some more, the
@xuJ14 Based on this logic, it being "hard to choose which quantile they should be in" is never a thing. It's always determined exactly by the nature of the half-open intervals. So perhaps the
Why? Pandas here just starts returning NaNs. I can't imagine a scenario where a data point is so common that multiple quantile bins collapse into one... you just want to throw all of it out? Pandas' behavior here just seems awful to me. Look how it does different things for each of these cases:

```python
pd.qcut(pd.Series([1, 1, 1, 1, 1]), 2, duplicates='drop')  # 1
pd.qcut(pd.Series([0, 1, 1, 1, 1]), 2, duplicates='drop')  # 2
pd.qcut(pd.Series([1, 1, 1, 1, 2]), 2, duplicates='drop')  # 3
pd.qcut(pd.Series([0, 1, 1, 1, 2]), 2, duplicates='drop')  # 4
```

In every single case above we split on the median, which is unambiguously 1. In my opinion in every case you should get exactly two bins,
After evaluating the nonsense above, I have no intention to replicate what pandas does here.
I really don't like this idea, the more I think about it. It's just arbitrary and leads to bad edge cases. |
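The argument above — that half-open intervals make the split well defined even under heavy duplication — can be sketched as follows. This is an illustration of the argument, not actual polars behavior.

```python
import statistics


def split_at_median(xs):
    """Split xs at its median m into (-inf, m) and [m, inf).

    The half-open convention makes the assignment unambiguous even
    when every element equals m: the left bin is simply empty.
    """
    m = statistics.median(xs)
    return [x for x in xs if x < m], [x for x in xs if x >= m]
```

All four pandas cases quoted above have median 1, so under this rule each of them produces the same two bins, with duplicates never causing a collapse.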
I see the problem. You are saying that the 0th quantile is -inf, and the 100th is inf. But in most applications (Excel, R, Stata, etc.), the 0th is the minimum and the 100th is the maximum. Now let's go back to [1,1,1,...,1]: we have duplicate breaks, right? Moreover, from the concept of qcut(2), how could you evenly separate the 1's into two groups, if the domain has only {1}? Now if we consider the minimum to be the 0th quantile and the maximum to be the 100th quantile, all of your pandas examples seem more reasonable, right? And your opinion that all the cases' results should be [-inf, 1) and [1, inf) is not right to me. How about just making this qcut function more general and letting the user decide what to do? Thanks for your thought-provoking idea. BTW, Sample Quantiles in Statistical Packages is great as a reference. |
@xuJ14 No, I'm not saying the 0th quantile is -inf or 100th is inf. We could make it so that for
where
No, I don't think so. |
Maybe a further remark on the usage of
This means that ML pipelines can never use |
@lorentzenchr In my rewrite I'm writing an efficient |
Hi @orlp, just to check: are you still working on this? The whole (scorecard) credit risk modelling/insurance community is waiting anxiously for this feature 😅. As discussed above, we would need a way to calculate the list of quantiles (breakpoints) for all columns during the model training stage. This is currently not available with pl.quantile, which takes just one quantile. Instead, I'm using the following expression to generate these:

```python
(
    pl.col("A")
    .qcut(q, allow_duplicates=True, include_breaks=True)
    .struct.field("brk")
    .unique()
    .sort()
    .implode()
    .list.shift()
    .list.drop_nulls()
    .alias("A")
)
```

which is working, but definitely not optimal... In the second stage, the list of breakpoints gets passed to the cut function, which works fine. I would very much like to see this redesign happen, and would like to contribute as well, but I'm pretty much a Rust newbie (just started this week!)... can you kindly give some advice on the best way to participate? |
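For the "list of breakpoints in one call" part of this request, Python's standard library offers a direct point of comparison: `statistics.quantiles` returns all n-1 cut points at once. This computes the breakpoints eagerly outside polars, so it is only a reference point, not a replacement for the expression above.

```python
import statistics

values = [0, 1, 2, 3, 4]
# n-1 cut points dividing the data into n groups; "inclusive" interpolates
# between order statistics, treating the data as the full population.
breaks = statistics.quantiles(values, n=4, method="inclusive")
```

A `pl.quantile`-style API accepting a list of levels would be the lazy, columnar equivalent of this one-shot call.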
@markxwang I am still planning on tackling this but I've been working on other higher-priority stuff at the moment. |
Hi, I have been working a lot with

With the redesign discussed here (and any change to these functions being breaking anyway), I would like to propose the following.

Case 1:

In all cases a null value should give nulls across the row, and a dataframe group with only nulls should return nulls. This approach should cover all existing functionality, in a slightly different way. It is easy to produce the current string-based output, should this be necessary for backward compatibility. Most importantly, it does away with the category concatenation issues and the issues with null values that cause bugs. Whenever categories are used, this is always on a fixed set that is known in advance (the labels). The approach is fully consistent between

To illustrate more concretely:

```python
import polars as pl

df = pl.DataFrame({
    "group": [1, 1, 2],
    "value": [1.0, 3.0, 2.0],
})
```

Without breaks:

```python
df.with_columns(pl.col("value").qcut([0.5], include_breaks=False))
```

And with breaks:

```python
df.with_columns(pl.col("value").qcut([0.5], include_breaks=True))
```

The labeled cases should be self-explanatory. I don't know any Rust, but from looking at the codebase it seems that all the output is there to make this work. |
Problem description

The current (v0.18.13) function signatures are (similar in the expression module):

Propositions:

- `labels` is keyword only in `qcut` but allowed positional in `cut`.
- Rename `quantiles` to `quantile_levels` or `probabilities` or alike; see discussion in "depr(python): Rename first `qcut` parameter to `quantiles`" #10253.