Weighted quantiles #10726

Open
tmct opened this issue Aug 25, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@tmct

tmct commented Aug 25, 2023

Problem description

Hi - I would find it very useful to be able to perform the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample weight column please.

Something like 'df.quantile(0.5, weights="w")'. (I think I would expect this to "drop" the weights column in the result.)

While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it's more useful to fix as I don't believe there is an obvious performant workaround. I hope that this operation being somewhat simple and well-defined would justify its presence as not too much of a maintenance burden.

If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!

I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:

  1. Add this to DataFrame only: only accepting optional column name
  2. Same for LazyFrame
  3. Allowing more general inputs than string Exprs, and potentially adding to Series, Expr etc?

Thoughts much appreciated! Thanks,
Tom

tmct added the "enhancement" label (New feature or an improvement of an existing feature) on Aug 25, 2023
@tmct
Author

tmct commented Aug 25, 2023

On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.

@stinodego
Member

stinodego commented Aug 27, 2023

Could you add an example with some small sample data? I'm not sure what you're looking for exactly.

Possibly, the redesign discussed in #10468 may address this.

@s-banach
Contributor

s-banach commented Aug 28, 2023

The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
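
To make the distinction concrete, here is a minimal plain-Python sketch of a weighted qcut. The function name `weighted_qcut` and the midpoint rule used to assign a bucket are assumptions for illustration only, not an existing Polars API.

```python
def weighted_qcut(values, weights, n_buckets):
    """Assign each row a bucket index so buckets hold roughly equal total weight.

    Hypothetical helper: a row goes to the bucket that contains the
    midpoint of its share of the cumulative weight.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    buckets = [0] * len(values)
    cum = 0.0
    for i in order:
        # Position of this row's weight midpoint on a 0..1 scale.
        mid = (cum + 0.5 * weights[i]) / total
        buckets[i] = min(int(mid * n_buckets), n_buckets - 1)
        cum += weights[i]
    return buckets


values = [1, 2, 3, 4, 5, 6]
weights = [1, 1, 1, 1, 2, 2]
print(weighted_qcut(values, weights, 2))  # [0, 0, 0, 0, 1, 1]
```

An equal-count split of these six values would put three rows in each bucket. The weighted split puts four rows in the first bucket and two in the second, so each bucket carries a total weight of 4.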

@zundertj
Collaborator

Small example of weighted quantile:

>>> df = pl.DataFrame({"value": [1,2,3,4,-1], "weight":[0.1, 0.1, 0.1, 2, 0.2]})
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 0.1    │
│ 2     ┆ 0.1    │
│ 3     ┆ 0.1    │
│ 4     ┆ 2.0    │
│ -1    ┆ 0.2    │
└───────┴────────┘

Implementation using current Python api, no interpolations:

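quantile = 0.5
# Sort by value, then place each row at the midpoint of its share of the cumulative weight.
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
# Pick the value whose midpoint position is nearest to the requested quantile.
dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4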
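
For readers who want to follow the arithmetic without Polars, the same computation can be sketched in plain Python. The function name `weighted_quantile_midpoint` is made up for this illustration. It mirrors the snippet above (midpoint positions, nearest match, no interpolation) rather than defining what a weighted quantile should be.

```python
def weighted_quantile_midpoint(values, weights, q):
    """Return the value whose weight-midpoint position is nearest to q."""
    pairs = sorted(zip(values, weights))  # sort rows by value
    total = sum(w for _, w in pairs)
    best_value, best_dist, cum = None, float("inf"), 0.0
    for v, w in pairs:
        cum += w
        # Midpoint of this row's share of the cumulative weight, scaled to 0..1.
        pos = (cum - 0.5 * w) / total
        dist = abs(pos - q)
        if dist < best_dist:
            best_value, best_dist = v, dist
    return best_value


print(weighted_quantile_midpoint([1, 2, 3, 4, -1], [0.1, 0.1, 0.1, 2, 0.2], 0.5))  # 4
```

With these weights the row with value 4 holds 2.0 of the 2.5 total weight, so the weighted median lands on 4, whereas the unweighted median of the same values is 2.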

@s-banach
Contributor

s-banach commented Aug 30, 2023

> Implementation using current Python api, no interpolations:
>
> quantile = 0.5
> dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
> dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4

I believe the following two methods are equivalent:

  • Let cw = w.cumsum() / w.sum(). To find the q quantile, choose the nearest value of cw.
  • Let cw = (w.cumsum() - 0.5 * w) / w.sum(). To find the q quantile, do backward search-sorted on cw.

I think doing both methods (subtracting 0.5 * w and searching for the nearest value) is putting a hat on a hat.
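
The claimed equivalence can be spot-checked with a small plain-Python sketch. The helper names below are hypothetical, the weights are assumed to belong to rows already sorted by value, and "backward search-sorted" is read here as "last index whose midpoint position is at most q". The two rules can differ when q falls exactly on a midpoint (a tie for the nearest rule), so the check avoids those points.

```python
from bisect import bisect_right
from itertools import accumulate


def nearest_cumulative(weights, q):
    """Method 1: index whose normalised cumulative weight is nearest to q."""
    total = sum(weights)
    cw = [c / total for c in accumulate(weights)]
    return min(range(len(cw)), key=lambda i: abs(cw[i] - q))


def backward_midpoint(weights, q):
    """Method 2: last index whose midpoint position is <= q."""
    total = sum(weights)
    mid = [(c - 0.5 * w) / total for c, w in zip(accumulate(weights), weights)]
    # bisect_right - 1 is the last index with mid <= q; clamp to 0 when q
    # lies below the first midpoint.
    return max(bisect_right(mid, q) - 1, 0)


weights = [1, 2, 1, 4]  # weights of rows already sorted by value
for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    assert nearest_cumulative(weights, q) == backward_midpoint(weights, q)
print("both methods agree on the sampled quantiles")
```

The agreement follows from the fact that each midpoint position is the average of two neighbouring cumulative weights, which is exactly where the nearest cumulative weight switches from one row to the next.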

I believe you can do something like this:

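import polars as pl

def get_weighted_quantiles(df: pl.DataFrame, q: list[float]):
    # Assumes df is already sorted by value, its weight column is named "w", and q is ascending.
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    # Replace "w" with the normalised cumulative weight (newer Polars releases spell this cum_sum).
    df = df.with_columns(pl.col("w").cumsum() / pl.sum("w")).set_sorted("w")
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")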

P.S. Could be wrong, not a statistician.

@lorentzenchr
Contributor

Before implementing weighted quantiles, I would suggest starting with the weighted mean!
Note that weighted quantiles can turn out to be a rabbit hole as to what the interpretation of weights should be. Even numpy does not (yet!) have it, see numpy/numpy#24254.

5 participants