Weighted quantiles #10726
Comments
On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.

Could you add an example with some small sample data? I'm not sure what you're looking for exactly. Possibly, the redesign discussed in #10468 may address this.

The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
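To make the "equal weights" idea concrete, here is a minimal pure-Python sketch. It is not Polars code and not a proposed implementation; the function name `weighted_qcut` and the midpoint rule for rows that straddle a bucket boundary are assumptions made for illustration.

```python
def weighted_qcut(values, weights, n_buckets):
    """Return one bucket index per row so that buckets hold roughly equal total weight.

    Hypothetical helper: rows are ranked by value, and each row goes to the bucket
    that contains the midpoint of its own span of cumulative weight.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    labels = [0] * len(values)
    running = 0.0
    for i in order:
        # Midpoint of this row's weight span, as a fraction of the total weight.
        mid = (running + 0.5 * weights[i]) / total
        labels[i] = min(int(mid * n_buckets), n_buckets - 1)
        running += weights[i]
    return labels


# Equal weights behave like an ordinary qcut; a heavy row can fill a bucket alone.
print(weighted_qcut([1, 2, 3, 4], [1, 1, 1, 1], 2))  # [0, 0, 1, 1]
print(weighted_qcut([1, 2, 3, 4], [1, 1, 1, 3], 2))  # [0, 0, 0, 1]
```

In the second call both buckets carry a total weight of 3, even though one holds three rows and the other a single row.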
Small example of weighted quantile:

>>> import polars as pl
>>> df = pl.DataFrame({"value": [1, 2, 3, 4, -1], "weight": [0.1, 0.1, 0.1, 2, 0.2]})
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 0.1    │
│ 2     ┆ 0.1    │
│ 3     ┆ 0.1    │
│ 4     ┆ 2.0    │
│ -1    ┆ 0.2    │
└───────┴────────┘

Implementation using the current Python API, without interpolation:

quantile = 0.5
# Midpoint of each row's cumulative-weight span, normalised to [0, 1].
# (cum_sum was named cumsum in older Polars releases.)
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cum_sum() - 0.5 * pl.col("weight")) / pl.col("weight").sum())
# Pick the value whose normalised position is closest to the requested quantile.
dfs.select(pl.col("value").sort_by((pl.col("cumw") - quantile).abs()).first()).item()  # returns 4
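For readers who want to check the arithmetic without Polars, the same midpoint rule can be sketched in plain Python. The function name `weighted_quantile` is hypothetical, and the tie-breaking (the lowest value wins) is an assumption rather than something specified in the thread.

```python
def weighted_quantile(values, weights, q):
    """Nearest-position weighted quantile, no interpolation.

    Each row is placed at the midpoint of its span of cumulative weight
    (normalised to [0, 1]); the value closest to q is returned.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    best_value, best_dist = None, float("inf")
    for value, weight in pairs:
        running += weight
        position = (running - 0.5 * weight) / total
        dist = abs(position - q)
        if dist < best_dist:  # strict '<' keeps the lowest value on ties
            best_value, best_dist = value, dist
    return best_value


# Same data as the example above: the row with weight 2.0 dominates the median.
print(weighted_quantile([1, 2, 3, 4, -1], [0.1, 0.1, 0.1, 2, 0.2], 0.5))  # 4
```

The normalised midpoints for the sorted rows are 0.04, 0.10, 0.14, 0.18 and 0.60, so 0.60 (value 4) is the closest to 0.5.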
I believe the following two methods are equivalent:
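I think doing both methods (subtracting

I believe you can do something like this:

def get_weighted_quantiles(df: pl.DataFrame, q: list[float]):
    # Note: df must already be sorted by its value column and have a weight column "w";
    # q must be in ascending order.
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    # Replace "w" with the normalised cumulative weight
    # (cum_sum was named cumsum in older Polars releases).
    df = df.with_columns(pl.col("w").cum_sum() / pl.sum("w")).set_sorted("w")
    # For each requested quantile, take the row whose cumulative weight is nearest.
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")

P.S. Could be wrong, not a statistician.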
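As a rough cross-check of the nearest-match idea, here is a plain-Python sketch that looks up several quantiles against the normalised cumulative weights. It does not use Polars; the name `weighted_quantiles_nearest` and the tie-breaking toward the lower row are assumptions, and `join_asof` may resolve exact ties differently.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_quantiles_nearest(values, weights, qs):
    """For each q, return the value whose normalised cumulative weight is nearest to q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = [c / total for c in accumulate(w for _, w in pairs)]
    result = []
    for q in qs:
        i = bisect_left(cum, q)  # first row with cumulative weight >= q
        if i == len(cum):
            i -= 1  # q lies beyond the last row
        elif i > 0 and q - cum[i - 1] <= cum[i] - q:
            i -= 1  # the previous row is at least as close
        result.append(pairs[i][0])
    return result


print(weighted_quantiles_nearest([1, 2, 3, 4, -1], [0.1, 0.1, 0.1, 2, 0.2], [0.0, 0.5, 0.9]))
# [-1, 3, 4]
```

On the sample data from earlier in the thread, this convention gives 3 for the median, whereas the midpoint variant posted above returns 4. The two conventions can therefore disagree when one row carries most of the weight.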
Before implementing weighted quantiles, I would suggest starting with weighted mean first!
Problem description
Hi - I would find it very useful to be able to perform the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample weight column, please.
Something like 'df.quantile(0.5, weights="w")'. (I think I would expect this to "drop" the weights column in the result.)
While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it's more useful to fix as I don't believe there is an obvious performant workaround. I hope that this operation being somewhat simple and well-defined would justify its presence as not too much of a maintenance burden.
If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!
I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:
Thoughts much appreciated! Thanks,
Tom