Weighted quantiles #10726

tmct · 2023-08-25T12:11:25Z

Problem description

Hi - I would find it very useful to be able to call the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample weight column, please.

Something like 'df.quantile(0.5, weights="w")'. (I think I would expect this to "drop" the weights column in the result.)

While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it is the more useful one to add, as I am not aware of an obvious performant workaround. I hope that, being a reasonably simple and well-defined operation, it would not be too much of a maintenance burden.

If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!

I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:

1. Add this to DataFrame only, accepting just an optional column name.
2. Do the same for LazyFrame.
3. Allow more general inputs than a string, such as Exprs, and potentially add this to Series, Expr, etc.?

Thoughts much appreciated! Thanks,
Tom

tmct · 2023-08-25T15:08:50Z

On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.

stinodego · 2023-08-27T13:21:05Z

Could you add an example with some small sample data? I'm not sure what you're looking for exactly.

Possibly, the redesign discussed in #10468 may address this.

s-banach · 2023-08-28T00:05:47Z

The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
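
As a rough illustration of the difference, here is a minimal plain-Python sketch. The helper name `weighted_qcut` and the midpoint convention used to place each row are assumptions made for this example; this is not an existing Polars API.

```python
from itertools import accumulate

def weighted_qcut(values, weights, n_buckets):
    """Assign each value a bucket index so that buckets hold roughly equal total weight.

    Hypothetical helper: a row is placed according to the midpoint of its
    cumulative-weight interval, expressed as a fraction of the total weight.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # row indices sorted by value
    total = sum(weights)
    cum = list(accumulate(weights[i] for i in order))            # running weight in value order
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        mid = (cum[rank] - 0.5 * weights[i]) / total
        buckets[i] = min(int(mid * n_buckets), n_buckets - 1)
    return buckets

values = [10, 20, 30, 40]
print(weighted_qcut(values, [1, 1, 1, 1], 2))  # [0, 0, 1, 1]: equal counts per bucket
print(weighted_qcut(values, [1, 1, 1, 5], 2))  # [0, 0, 0, 1]: bucket weights 3 and 5
```

With equal weights this reduces to the count-based split; with a heavy last row, the first three rows share one bucket and the heavy row gets its own.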

zundertj · 2023-08-28T17:24:14Z

Small example of weighted quantile:

>>> import polars as pl
>>> df = pl.DataFrame({"value": [1, 2, 3, 4, -1], "weight": [0.1, 0.1, 0.1, 2, 0.2]})
>>> df
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 0.1    │
│ 2     ┆ 0.1    │
│ 3     ┆ 0.1    │
│ 4     ┆ 2.0    │
│ -1    ┆ 0.2    │
└───────┴────────┘

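Implementation using the current Python API, without interpolation:

quantile = 0.5
# Midpoint of each row's cumulative-weight interval, as a fraction of the total weight.
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
# Pick the value whose midpoint lies closest to the requested quantile.
dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4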
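
To make the arithmetic behind that result visible, the same steps can be reproduced in plain Python, independent of any Polars version. This is only a sketch of the computation above; the function name `midpoint_nearest_quantile` is made up for illustration.

```python
from itertools import accumulate

def midpoint_nearest_quantile(values, weights, q):
    """Return the value whose cumulative-weight midpoint is closest to q."""
    pairs = sorted(zip(values, weights))       # sort rows by value
    total = sum(w for _, w in pairs)
    cum = accumulate(w for _, w in pairs)      # running sum of weights
    # cumw = (cumsum - 0.5 * weight) / total, mirroring the expression above
    cumw = [(c - 0.5 * w) / total for c, (_, w) in zip(cum, pairs)]
    best = min(range(len(pairs)), key=lambda i: abs(cumw[i] - q))
    return pairs[best][0]

values = [1, 2, 3, 4, -1]
weights = [0.1, 0.1, 0.1, 2, 0.2]
print(midpoint_nearest_quantile(values, weights, 0.5))  # 4
```

For the sample data the midpoints are roughly 0.04, 0.10, 0.14, 0.18 and 0.60, so the last row (value 4) is the closest to 0.5.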

s-banach · 2023-08-30T12:53:23Z

> Implementation using current Python api, no interpolations:
>
> quantile = 0.5
> dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
> dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4

I believe the following two methods are equivalent:

1. Let cw = w.cumsum() / w.sum(). To find the q quantile, choose the nearest value of cw.
2. Let cw = (w.cumsum() - 0.5 * w) / w.sum(). To find the q quantile, do backward search-sorted on cw.

I think doing both methods (subtracting 0.5 * w and searching for the nearest value) is putting a hat on a hat.
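
The two methods can be compared side by side with a small plain-Python sketch. The function names are invented for this example, the inputs are assumed to be sorted by value already, and "backward search-sorted" is read here as "the last row whose cw is less than or equal to q".

```python
from bisect import bisect_right
from itertools import accumulate

def nearest_on_cumsum(values, weights, q):
    """Method 1: pick the row whose cw = cumsum(w) / sum(w) is nearest to q."""
    total = sum(weights)
    cw = [c / total for c in accumulate(weights)]
    best = min(range(len(cw)), key=lambda i: abs(cw[i] - q))
    return values[best]

def backward_on_midpoints(values, weights, q):
    """Method 2: pick the last row whose cw = (cumsum(w) - 0.5 * w) / sum(w) is <= q."""
    total = sum(weights)
    cw = [(c - 0.5 * w) / total for c, w in zip(accumulate(weights), weights)]
    idx = max(bisect_right(cw, q) - 1, 0)  # clamp when q lies below the first midpoint
    return values[idx]

values = [-1, 1, 2, 3, 4]    # already sorted by value
weights = [2, 1, 1, 1, 20]   # the sample weights from this thread, scaled by 10
for q in (0.05, 0.25, 0.5, 0.75):
    print(q, nearest_on_cumsum(values, weights, q), backward_on_midpoints(values, weights, q))
```

On this data the two functions agree for every q tried; they can differ only when q falls exactly on a boundary between two rows. For q = 0.5 both return 3, whereas the quoted expression, which applies both adjustments at once, returns 4.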

I believe you can do something like this:

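import polars as pl

def get_weighted_quantiles(df: pl.DataFrame, q: list[float]) -> pl.DataFrame:
    # Assumes q is ascending and df is already sorted by its value column, with weights in "w".
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    # Replace "w" with its cumulative share of the total weight (cumsum is named cum_sum in newer Polars).
    df = df.with_columns(pl.col("w").cumsum() / pl.sum("w")).set_sorted("w")
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")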

P.S. Could be wrong, not a statistician.

lorentzenchr · 2023-09-06T11:07:52Z

Before implementing weighted quantiles, I would suggest starting with a weighted mean first!
Note that weighted quantiles can turn out to be a rabbit hole when it comes to how the weights should be interpreted. Even numpy does not (yet!) have it, see numpy/numpy#24254.
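
For comparison, the weighted mean has one standard definition, sum(w * x) / sum(w), which is part of what makes it the easier starting point. A minimal plain-Python sketch using the sample data from earlier in this thread; the function name is illustrative only and is not an existing Polars API.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total

# Sample data from the example above: weighted sum 8.4 over total weight 2.5.
print(weighted_mean([1, 2, 3, 4, -1], [0.1, 0.1, 0.1, 2, 0.2]))  # approximately 3.36
```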

tmct added the enhancement label (New feature or an improvement of an existing feature) on Aug 25, 2023
