Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify weighted quantile (aweights + fweights) #316

Merged
merged 13 commits into from
Feb 23, 2018
Merged

modify weighted quantile (aweights + fweights) #316

merged 13 commits into from
Feb 23, 2018

Conversation

matthieugomez
Copy link
Contributor

@matthieugomez matthieugomez commented Nov 9, 2017

Solves #313
The quantile method for frequency weights is now equivalent to the unweighted method with a vector of repeated values
The quantile method for non frequency weight now removes zeros.

@codecov
Copy link

codecov bot commented Nov 9, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@43b5023). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #316   +/-   ##
=========================================
  Coverage          ?   91.07%           
=========================================
  Files             ?       18           
  Lines             ?     1960           
  Branches          ?        0           
=========================================
  Hits              ?     1785           
  Misses            ?      175           
  Partials          ?        0
Impacted Files Coverage Δ
src/weights.jl 89.38% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 43b5023...2e72f48. Read the comment docs.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Do you have a reference regarding the choice of normalization for non frequency weights? In particular, why should we normalize to 1 rather than e.g. to N or sum(w)? If not, I can survey what other software do.

src/weights.jl Outdated
This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
With frequency weights, the function returns the same result as `quantile` for a vector with repeated values.
With non frequency weights, denote N the length of the vector, w the vector of weights normalized to sum to 1, `h = p (N - 1) + 1` and ``S_k = 1 + (k-1) * wk + (N-1) \\sum_{i<=k}w_i/\\sum_{i<=N}w_i``, define ``x_{k+1}`` the smallest element of `x` such that ``S_{k+1}`` is strictly superior to `h`. The function returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you say what S and h represent?

Remove the double spaces, use double backticks everywhere (including around variable names) and break lines at 92 chars. Also better have [frequency weights](@ref FrequencyWeights).

src/weights.jl Outdated
"""
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No line breaks before nor after the signature. I think the style of this package is not to have spaces after commas for type parameters.

src/weights.jl Outdated
vw = sort!(collect(zip(v, w.values)))
wvalues = w.values
nz = find(w.values)
#normalize if non frequencyweight
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after # (also below) and between "frequency" and "weights". Better also say that the sum will be 1.

src/weights.jl Outdated
# full sort
vw = sort!(collect(zip(v, w.values)))
wvalues = w.values
nz = find(w.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use find(!iszero, w.values) since the previous form is deprecated on 0.7. Or probably even better, just do nz = .!iszero.(w.values), since a boolean vector will be more efficient (you know the size in advance and it's smaller).

Also move this below to the place where it's used for clarity.

Finally, why is that operation needed for frequency weights? Shouldn't the algorithm be able to skip entries with zero weights on its own?

Copy link
Contributor Author

@matthieugomez matthieugomez Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the algorithm does not work with zero weights in its current form. There is an issue for instance if the highest value has zero weight. The algorithm also only keeps track of the last visited value, whereas it should keep track of the last visiting value with non zero weight. I just think it's simpler to remove the zero values.

src/weights.jl Outdated
cumulative_weight, Sk, Skold = zero(W), zero(W), zero(W)
vk, vkold = zero(V), zero(V)
k = 1
Sk, Skold = zero(W), zero(W)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space.

src/weights.jl Outdated
# happens when N or p or wsum equal zero
out[ppermute[i]] = vw[1][1]
else
if isa(w, FrequencyWeights)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to use a completely different path for frequency weights. Couldn't a common path be defined, moving some type-specific computations out of the loop like you did for normalization? For example, for h = p[i] * (wsum - 1) + 1, wsum just needs to be replaced with N for non frequency weights, so you could as well define the variable before the loop.

BTW, wouldn't it make more sense to normalize the weights to sum to N (rather than 1), since it plays a role equivalent to wsum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this but I think it is better to do two different paths. Joining the two looks more confusing than enlightening.

test/weights.jl Outdated
# zero don't count
x = [1, 2, 3, 4, 5]
@test quantile(x, fweights([0,1,1,1,0]), p) ≈ quantile([2, 3, 4], p)
# repetitions dont count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"don't"

test/weights.jl Outdated
end


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed.

test/weights.jl Outdated
)
p = [0.0, 0.25, 0.5, 0.75, 1.0]

srand(10)
for i = 1:length(data)
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i]
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e-3 sounds quite high, why not keep more precision?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment hasn't been addressed.

BTW, it would be nice to duplicate each @test line to test pweights too.

test/weights.jl Outdated
)
p = [0.0, 0.25, 0.5, 0.75, 1.0]
for x in data
@test quantile(x, fweights(ones(Int64, length(x))), p) ≈ quantile(x, p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to test more thoroughly the results by combining various inputs with various weights, as done below with non frequency weights. You could use a rep helper function (JuliaLang/julia#16443) to generate vectors with repeated entries to call the unweighted quantile method on.

Also, there aren't any zeros no negative non-integer values. Wouldn't hurt to add some.

@matthieugomez
Copy link
Contributor Author

Thanks for the detailed comments. I commented on the ones I disagreed with.
About the normalization of weights to 1 or N, the actual normalizing factor is irrelevant — the important thing is that the algorithm does not depend on the sum of weights. It turns out that the formula with weights normalized to 1 are simpler to digest, so that's what I end up doing.

@nalimilan
Copy link
Member

But why is it more problematic to give different results depending on the sum of weights than to normalize them arbitrarily to 1? Is there a rationale or a precedent in other software? That combined with the fact that we don't even use the same algorithm for frequency and other kinds of weights makes it sound like a totally arbitrary choice. For example, Hmisc::wtd.quantile normalizes to N when normwt=TRUE, and does not provide an argument to normalize to 1. Is that better? Is that worse? I wouldn't want to make a choice without solid arguments.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Nov 9, 2017 via email

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Nov 9, 2017

Ok I have updated the definition of aweights. There are now only two differences between fweights and aweights in the algorithm. These differences are important. The intuition for why they need to be different is that fweights give you more datapoints on the empirical CDF than aweights do.
For instant, the following is intuitive:

quantile([1, 2], fweights([1, 2]), 0.5) = 2
# but
quantile([1, 2], weights([1, 2]), 0.5) < 2

@nalimilan
Copy link
Member

OK, I trust you when you say it's more intuitive, that's not completely obvious to me. ;-)

Is there any way to check our results against another implementation? We don't return the same results as Hmisc::wtd.quantile for non-frequency weights, which is worrying since it claims to implement the same R-7 version. Of course there may well be problems in that implementation rather than in ours,

I've also found a MIT-licensed implementation of weighted quantiles for MATLAB (they use the R-8 definition by default, but support all variants). Unfortunately, even if they say that equal weights should give the same results as the unweighted version, that doesn't appear to be the case:

>> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'])

ans =

     1     2     4     7    10

>> iosr.statistics.quantile([7, 1, 2, 4, 10], [0, .25, .5, .75, 1], [], ['R-7'], [1, 1, 1, 1, 1])

ans =

    1.0000    1.5000    3.0000    5.5000   10.0000

It's incredible how hard it is to find completely reliable implementations, and the existence of so many variants makes it difficult to compare them.

test/weights.jl Outdated
f([1, 2, 3, 4, 5]),
f([0.1, 0.2, 0.3, 0.2, 0.1]),
f([1, 1, 1, 1, 1]),
Int[3, 1, 1, 1, 3],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why specify Int?

src/weights.jl Outdated

"""
quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector `x` at a specified set of probability
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`).
Compute the weighted quantiles of a vector ``x`` at a specified set of probability
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, references to Julia objects should use single backticks... :-) And v should be changed to x in the signature above to match descriptions below (or the other way around).

src/weights.jl Outdated

This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backticks here and below.

src/weights.jl Outdated
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as
`quantile` for a vector with repeated values.
With non FrequencyWeights, denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cut line at 92 chars. Still missing backticks.

src/weights.jl Outdated
define ``x_{k+1}`` the smallest element of ``x`` such that ``S_{k+1}`` is strictly
superior to ``h``. The function returns``x_k + \\gamma (x_{k+1} -x_k)``
with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when ``w`` is a vector
of one, the function returns the same result as `quantile`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"of ones". Also remove indentation on these lines.

src/weights.jl Outdated
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as
`quantile` for a vector with repeated values.
With non FrequencyWeights, denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and
``S_k = \\sum_{i<=k}w_i`` the cumulative weight for to each observation,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for to"

src/weights.jl Outdated

wsum = w.sum
#remove zeros weights and sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after #.

src/weights.jl Outdated

wsum = w.sum
#remove zeros weights and sort
wsum = sum(w.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use sum(w), which is equivalent to w.sum thanks to a special method. I don't think you need to access the values field below either: AbstractWeight implements the AbstractVector interface (but does not guarantee that the values field exists).

src/weights.jl Outdated
vk, vkold = zero(V), zero(V)
k = 1
Sk, Skold = zero(W), zero(W)
vk, vkold= zero(V), zero(V)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before =.

test/weights.jl Outdated
)
p = [0.0, 0.25, 0.5, 0.75, 1.0]

srand(10)
for i = 1:length(data)
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i]
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment hasn't been addressed.

BTW, it would be nice to duplicate each @test line to test pweights too.

@nalimilan
Copy link
Member

BTW, should ProbabilityWeights behave like FrequencyWeights, like AnalyticWeights, or implement a third behavior?

@nalimilan
Copy link
Member

Bump. Do you want to finish this?

@matthieugomez
Copy link
Contributor Author

Ok. I've updated it to incorporate your comments. I think AnalyticalWeights should be similar to ProbabilityWeights.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've pushed a few fixes to get the tests to pass.

@matthieugomez
Copy link
Contributor Author

Thanks a lot! Really appreciate your help

@tbeason
Copy link

tbeason commented Feb 23, 2018

What is the status on this PR / branch? It seems like the functionality is there. Are the tests not passing?

@nalimilan nalimilan merged commit f869ec3 into JuliaStats:master Feb 23, 2018
@nalimilan
Copy link
Member

It's just that I had left the PR open in case somebody wanted to comment, but nobody objected nor merged it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants