modify weighted quantile (aweights + fweights) #316

matthieugomez · 2017-11-09T00:29:36Z

Solves #313
The quantile method for frequency weights is now equivalent to the unweighted method applied to a vector of repeated values.
The quantile method for non-frequency weights now removes entries with zero weight.


codecov · 2017-11-09T01:08:35Z

Codecov Report

❗ No coverage uploaded for pull request base (master@43b5023).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #316   +/-   ##
=========================================
  Coverage          ?   91.07%           
=========================================
  Files             ?       18           
  Lines             ?     1960           
  Branches          ?        0           
=========================================
  Hits              ?     1785           
  Misses            ?      175           
  Partials          ?        0

Impacted Files	Coverage Δ
src/weights.jl	`89.38% <100%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 43b5023...2e72f48.

nalimilan

Thanks! Do you have a reference regarding the choice of normalization for non-frequency weights? In particular, why should we normalize to 1 rather than e.g. to N or sum(w)? If not, I can survey what other software does.

nalimilan · 2017-11-09T09:17:20Z

src/weights.jl

-This corresponds to  R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones
-(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
+With frequency weights, the function returns the same result as `quantile` for a vector with repeated values.
+With non frequency weights,  denote N the length of the vector, w the vector of weights normalized to sum to 1, `h = p (N - 1) + 1`  and ``S_k = 1 + (k-1) * wk + (N-1) \\sum_{i<=k}w_i/\\sum_{i<=N}w_i``, define ``x_{k+1}`` the smallest element of `x` such that ``S_{k+1}`` is strictly superior to `h`. The function returns


Could you say what S and h represent?

Remove the double spaces, use double backticks everywhere (including around variable names) and break lines at 92 chars. Also better have [frequency weights](@ref FrequencyWeights).

nalimilan · 2017-11-09T09:18:15Z

src/weights.jl

 """
-function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real}
+


No line breaks before or after the signature. I think the style of this package is not to have spaces after commas for type parameters.

nalimilan · 2017-11-09T09:28:21Z

src/weights.jl

-    vw = sort!(collect(zip(v, w.values)))
+    wvalues = w.values
+    nz = find(w.values)
+    #normalize if non frequencyweight


Space after # (also below) and between "frequency" and "weights". Better also say that the sum will be 1.

nalimilan · 2017-11-09T09:28:43Z

src/weights.jl

-    # full sort
-    vw = sort!(collect(zip(v, w.values)))
+    wvalues = w.values
+    nz = find(w.values)


Use find(!iszero, w.values) since the previous form is deprecated on 0.7. Or probably even better, just do nz = .!iszero.(w.values), since a boolean vector will be more efficient (you know the size in advance and it's smaller).

Also move this below to the place where it's used for clarity.

Finally, why is that operation needed for frequency weights? Shouldn't the algorithm be able to skip entries with zero weights on its own?

No, the algorithm does not work with zero weights in its current form. There is an issue, for instance, if the highest value has zero weight. The algorithm also only keeps track of the last visited value, whereas it should keep track of the last visited value with nonzero weight. I just think it's simpler to remove the zero-weight entries.
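
To make the two filtering options discussed in this thread concrete, here is a minimal sketch on plain vectors. The data are made up for illustration, and the sketch targets Julia 1.x, where `find` has been renamed `findall`; it is not the code from the PR.

```julia
# Illustrative data (not from the PR): values and their weights.
v = [10, 20, 30, 40, 50]
w = [0.0, 1.5, 0.0, 2.0, 0.5]

# Option 1: boolean mask, one entry per weight, size known in advance.
nz = .!iszero.(w)

# Option 2: integer indices of the nonzero weights.
# `find(!iszero, w)` on Julia 0.6/0.7 corresponds to `findall(!iszero, w)` on 1.x.
idx = findall(!iszero, w)

# Both select the same entries, so zero-weight values never reach the algorithm.
v_kept = v[nz]
w_kept = w[nz]
```

Either form drops the zero-weight entries before sorting, which sidesteps the case raised above where the largest value carries zero weight.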

nalimilan · 2017-11-09T09:32:48Z

src/weights.jl

-    cumulative_weight, Sk, Skold =  zero(W), zero(W), zero(W)
-    vk, vkold = zero(V), zero(V)
-    k = 1
+    Sk, Skold =  zero(W), zero(W)


Double space.

nalimilan · 2017-11-09T09:38:36Z

src/weights.jl

-            # happens when N or p or wsum equal zero
-            out[ppermute[i]] = vw[1][1]
-        else
+        if isa(w, FrequencyWeights)


It feels weird to use a completely different path for frequency weights. Couldn't a common path be defined, moving some type-specific computations out of the loop like you did for normalization? For example, for h = p[i] * (wsum - 1) + 1, wsum just needs to be replaced with N for non frequency weights, so you could as well define the variable before the loop.

BTW, wouldn't it make more sense to normalize the weights to sum to N (rather than 1), since it plays a role equivalent to wsum?

I have thought about this but I think it is better to do two different paths. Joining the two looks more confusing than enlightening.

nalimilan · 2017-11-09T09:39:37Z

test/weights.jl

+    # zero don't count
+    x = [1, 2, 3, 4, 5]
+    @test quantile(x, fweights([0,1,1,1,0]), p) ≈ quantile([2, 3, 4], p)
+    # repetitions dont count


nalimilan · 2017-11-09T09:40:37Z

test/weights.jl

 end

+


Not needed.

nalimilan · 2017-11-09T09:41:05Z

test/weights.jl

    )
    p = [0.0, 0.25, 0.5, 0.75, 1.0]

    srand(10)
    for i = 1:length(data)
-        @test quantile(data[i], wt[i], p) ≈ quantile_answers[i]
+        @test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3


1e-3 sounds quite high, why not keep more precision?

This comment hasn't been addressed.

BTW, it would be nice to duplicate each @test line to test pweights too.

nalimilan · 2017-11-09T09:43:13Z

test/weights.jl

+    )
+    p = [0.0, 0.25, 0.5, 0.75, 1.0]
+    for x in data
+        @test quantile(x, fweights(ones(Int64, length(x))), p) ≈ quantile(x, p)


It would be nice to test more thoroughly the results by combining various inputs with various weights, as done below with non frequency weights. You could use a rep helper function (JuliaLang/julia#16443) to generate vectors with repeated entries to call the unweighted quantile method on.

Also, there aren't any zeros, negative values, or non-integer values in the test data. Wouldn't hurt to add some.
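
A rough sketch of what such a helper could look like follows. The name `rep` and the test data are assumptions made for illustration; the helper simply repeats each value according to its integer count so that the unweighted `quantile` (from the `Statistics` standard library on Julia 1.x) can serve as the reference.

```julia
using Statistics

# Hypothetical helper: repeat x[i] exactly counts[i] times (a count of 0 drops the value).
rep(x, counts) = [xi for (xi, c) in zip(x, counts) for _ in 1:c]

# Illustrative data covering zeros, negative values, and non-integer values.
x = [-1.5, 0.0, 2.25, 4.0]
counts = [2, 0, 1, 3]
p = [0.0, 0.25, 0.5, 0.75, 1.0]

expanded = rep(x, counts)          # [-1.5, -1.5, 2.25, 4.0, 4.0, 4.0]
reference = quantile(expanded, p)  # what the frequency-weighted method should reproduce
```

In the test suite, `reference` would then be compared against `quantile(x, fweights(counts), p)`.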

matthieugomez · 2017-11-09T21:14:24Z

Thanks for the detailed comments. I commented on the ones I disagreed with.
About the normalization of weights to 1 or N, the actual normalizing factor is irrelevant; the important thing is that the algorithm does not depend on the sum of weights. It turns out that the formulas with weights normalized to 1 are simpler to digest, so that's what I ended up doing.

nalimilan · 2017-11-09T22:23:53Z

But why is it more problematic to give different results depending on the sum of weights than to normalize them arbitrarily to 1? Is there a rationale or a precedent in other software? That combined with the fact that we don't even use the same algorithm for frequency and other kinds of weights makes it sound like a totally arbitrary choice. For example, Hmisc::wtd.quantile normalizes to N when normwt=TRUE, and does not provide an argument to normalize to 1. Is that better? Is that worse? I wouldn't want to make a choice without solid arguments.

matthieugomez · 2017-11-09T22:38:08Z

Take an algorithm that does not depend on the sum of weights. You can always rewrite it as an algorithm that takes a vector of weights that sum to one. Rewriting it like that lets you replace every expression in sum(w_i) by 1. That's all there is. This does not mean that the normalization is arbitrary.
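
The scale-invariance argument can be spelled out as follows. This is a sketch that assumes the cumulative-weight definition quoted in the later docstring hunks of this review (S_k as the cumulative weight, h as the target cumulative weight) and strictly positive weights; the rescaling constant c is introduced here for illustration.

```latex
% Assumed definitions (from the docstring under review):
%   S_k = \sum_{i \le k} w_i, \qquad h = p \Big( \sum_{i \le N} w_i - w_1 \Big) + w_1 .
% Rescale every weight by a constant c > 0: w_i' = c \, w_i.
\begin{align*}
S_k' &= c \, S_k, \qquad h' = p \, (c \, S_N - c \, w_1) + c \, w_1 = c \, h, \\
S_{k+1}' > h' &\iff S_{k+1} > h \quad \text{(the same index } k \text{ is selected)}, \\
\gamma' &= \frac{h' - S_k'}{S_{k+1}' - S_k'}
         = \frac{c \, (h - S_k)}{c \, (S_{k+1} - S_k)} = \gamma .
\end{align*}
```

Since both the selected index and the interpolation factor are unchanged, the returned quantile is the same for any positive rescaling, including the one that makes the weights sum to 1.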


matthieugomez · 2017-11-09T23:14:07Z

Ok I have updated the definition of aweights. There are now only two differences between fweights and aweights in the algorithm. These differences are important. The intuition for why they need to be different is that fweights give you more datapoints on the empirical CDF than aweights do.
For instance, the following is intuitive:

quantile([1, 2], fweights([1, 2]), 0.5) == 2
# but
quantile([1, 2], weights([1, 2]), 0.5) < 2
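
The two numbers can be reproduced by hand without StatsBase. The sketch below assumes the frequency-weight result equals the unweighted quantile of the expanded vector `[1, 2, 2]`, and that the non-frequency result follows the cumulative-weight formula from the docstring hunks quoted later in this review. The variable names are illustrative.

```julia
using Statistics

x = [1, 2]
w = [1, 2]
p = 0.5

# Frequency weights: same as the unweighted quantile of the repeated values.
q_freq = quantile([1, 2, 2], p)

# Non-frequency weights, following the docstring formula:
# h = p * (sum(w) - w_1) + w_1 and S_k = cumulative weight up to k.
S = cumsum(w)                        # [1, 3]
h = p * (sum(w) - w[1]) + w[1]       # 2.0
γ = (h - S[1]) / (S[2] - S[1])       # 0.5
q_other = x[1] + γ * (x[2] - x[1])   # interpolates between 1 and 2
```

Under these assumptions `q_freq` is 2.0 and `q_other` is 1.5, consistent with the claim that frequency weights contribute more points to the empirical CDF.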

nalimilan · 2017-11-10T14:00:46Z

OK, I trust you when you say it's more intuitive, though that's not completely obvious to me. ;-)

Is there any way to check our results against another implementation? We don't return the same results as Hmisc::wtd.quantile for non-frequency weights, which is worrying since it claims to implement the same R-7 version. Of course there may well be problems in that implementation rather than in ours.

I've also found a MIT-licensed implementation of weighted quantiles for MATLAB (they use the R-8 definition by default, but support all variants). Unfortunately, even if they say that equal weights should give the same results as the unweighted version, that doesn't appear to be the case:

>> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'])

ans =

     1     2     4     7    10

>> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'], [1, 1, 1, 1, 1])

ans =

    1.0000    1.5000    3.0000    5.5000   10.0000

It's incredible how hard it is to find completely reliable implementations, and the existence of so many variants makes it difficult to compare them.
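
As a small cross-check of the unweighted baseline, the same input can be run through Julia's own `quantile`, whose default interpolation corresponds to the R-7 definition. This sketch uses the `Statistics` standard library on Julia 1.x and covers only the unweighted case; it makes no claim about the MATLAB package's weighted code path.

```julia
using Statistics

x = [7, 1, 2, 4, 10]
p = [0, 0.25, 0.5, 0.75, 1]

# Unweighted quantiles with Julia's default (R-7) interpolation.
q = quantile(x, p)   # [1.0, 2.0, 4.0, 7.0, 10.0]
```

This agrees with the first MATLAB output above (1, 2, 4, 7, 10), so the discrepancy appears only once the vector of unit weights is passed.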

nalimilan · 2017-11-10T13:13:35Z

test/weights.jl

-        f([1, 2, 3, 4, 5]),
-        f([0.1, 0.2, 0.3, 0.2, 0.1]),
-        f([1, 1, 1, 1, 1]),
+        Int[3, 1, 1, 1, 3],


Why specify Int?

nalimilan · 2017-11-10T13:27:34Z

src/weights.jl


 """
    quantile(v, w::AbstractWeights, p)

-Compute the weighted quantiles of a vector `x` at a specified set of probability
-values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`).
+Compute the weighted quantiles of a vector ``x`` at a specified set of probability


Actually, references to Julia objects should use single backticks... :-) And v should be changed to x in the signature above to match descriptions below (or the other way around).

nalimilan · 2017-11-10T13:27:54Z

src/weights.jl

-
-This corresponds to  R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones
-(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
+With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as 


Missing backticks here and below.

nalimilan · 2017-11-10T13:28:18Z

src/weights.jl

-(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
+With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as 
+`quantile` for a vector with repeated values.
+With non FrequencyWeights,  denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and 


Cut line at 92 chars. Still missing backticks.

nalimilan · 2017-11-10T13:28:55Z

src/weights.jl

+  define ``x_{k+1}`` the smallest element of ``x`` such that ``S_{k+1}`` is strictly 
+  superior to ``h``. The function returns``x_k + \\gamma (x_{k+1} -x_k)`` 
+  with  ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when ``w`` is a vector
+   of one, the function returns the same result as `quantile`.


"of ones". Also remove indentation on these lines.

nalimilan · 2017-11-10T13:30:10Z

src/weights.jl

+With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as 
+`quantile` for a vector with repeated values.
+With non FrequencyWeights,  denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and 
+ ``S_k = \\sum_{i<=k}w_i`` the cumulative weight for to each observation,
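
For readers following the docstring, here is a minimal sketch of the non-frequency-weight rule as described in these hunks: drop zero weights, sort, take cumulative weights S_k, compute h, then interpolate with γ. The function name `wquantile` is hypothetical, the code assumes nonnegative weights and Julia 1.2 or later, and it is not the implementation merged in this PR.

```julia
using Statistics

# Hypothetical sketch of the docstring rule for non-frequency weights.
# Assumes w has nonnegative entries and at least one nonzero entry.
function wquantile(x::AbstractVector{<:Real}, w::AbstractVector{<:Real}, p::Real)
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    keep = .!iszero.(w)                  # remove zero-weight entries first
    xs, ws = x[keep], w[keep]
    isempty(xs) && throw(ArgumentError("all weights are zero"))
    order = sortperm(xs)                 # sort values, carrying weights along
    xs, ws = xs[order], ws[order]
    S = cumsum(ws)                       # S_k = sum of the first k weights
    h = p * (S[end] - ws[1]) + ws[1]     # cumulative weight matching probability p
    k1 = findfirst(>(h), S)              # index k+1: first S strictly above h
    k1 === nothing && return float(xs[end])  # h reaches the total weight (p == 1)
    k1 == 1 && return float(xs[1])           # guard; not expected since h >= S_1
    γ = (h - S[k1 - 1]) / (S[k1] - S[k1 - 1])
    return xs[k1 - 1] + γ * (xs[k1] - xs[k1 - 1])
end

# With unit weights the rule reduces to the unweighted quantile.
same_as_unweighted = wquantile([7, 1, 2, 4, 10], ones(5), 0.25) ≈ quantile([7, 1, 2, 4, 10], 0.25)

# The two-point example from earlier in the thread.
two_point = wquantile([1, 2], [1, 2], 0.5)   # 1.5
```

Filtering zero weights up front mirrors the approach defended earlier in the review, and keeps the interpolation step free of zero-width intervals.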


nalimilan · 2017-11-10T13:30:26Z

src/weights.jl


-    wsum = w.sum
+    #remove zeros weights and sort


Space after #.

nalimilan · 2017-11-10T13:31:06Z

src/weights.jl


-    wsum = w.sum
+    #remove zeros weights and sort
+    wsum = sum(w.value)


Use sum(w), which is equivalent to w.sum thanks to a special method. I don't think you need to access the values field below either: AbstractWeights implements the AbstractVector interface (but does not guarantee that the values field exists).

nalimilan · 2017-11-10T13:32:41Z

src/weights.jl

-    vk, vkold = zero(V), zero(V)
-    k = 1
+    Sk, Skold = zero(W), zero(W)
+    vk, vkold= zero(V), zero(V)


Space before =.

nalimilan · 2017-11-10T13:36:33Z

test/weights.jl

    )
    p = [0.0, 0.25, 0.5, 0.75, 1.0]

    srand(10)
    for i = 1:length(data)
-        @test quantile(data[i], wt[i], p) ≈ quantile_answers[i]
+        @test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3


This comment hasn't been addressed.

BTW, it would be nice to duplicate each @test line to test pweights too.

nalimilan · 2017-11-11T08:23:16Z

BTW, should ProbabilityWeights behave like FrequencyWeights, like AnalyticWeights, or implement a third behavior?

nalimilan · 2018-01-23T08:48:37Z

Bump. Do you want to finish this?

matthieugomez · 2018-01-24T19:22:24Z

Ok. I've updated it to incorporate your comments. I think AnalyticWeights should be similar to ProbabilityWeights.

nalimilan

Thanks! I've pushed a few fixes to get the tests to pass.

matthieugomez · 2018-01-25T20:15:43Z

Thanks a lot! Really appreciate your help.

tbeason · 2018-02-23T18:40:19Z

What is the status on this PR / branch? It seems like the functionality is there. Are the tests not passing?

nalimilan · 2018-02-23T18:54:26Z

It's just that I had left the PR open in case somebody wanted to comment, but nobody objected or merged it.

modify weighted quantile (aweights + fweights)

b6295d1

Solves #313

matthieugomez added 2 commits November 8, 2017 20:58

update

b21a2cb

julia 0.6

ed3d858

nalimilan reviewed Nov 9, 2017


Update for comments

8ef4511

Simplify aweight on the model of fweight

f23b591

matthieugomez added 2 commits November 9, 2017 15:19

no need for initalization

57203fb

definition

370afd5

nalimilan mentioned this pull request Nov 10, 2017

Zero weights are not equivalent to omitting a value with wtd.quantile() harrelfe/Hmisc#81

Closed

nalimilan reviewed Nov 10, 2017


matthieugomez added 2 commits January 24, 2018 14:19

Update

f9e3157

conflict

899f395

nalimilan added 4 commits January 25, 2018 11:28

Minor fixes

0736c02

Test pweights and weights

1e12c77

Merge branch 'master' into wq

c732c82

Fix test

2e72f48

nalimilan approved these changes Jan 25, 2018


nalimilan merged commit f869ec3 into JuliaStats:master Feb 23, 2018

gdementen mentioned this pull request Mar 8, 2018

support for weighted aggregates liam2/liam2#226

Closed

nalimilan mentioned this pull request Feb 1, 2019

Quantiles don't respect weights of 0 #313

Closed

tpapp mentioned this pull request May 6, 2019

ignoring elements with 0 weight #492

Open

seberg mentioned this pull request Aug 3, 2023

Weighted quantile option in nanpercentile() numpy/numpy#8935

Closed