Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify weighted quantile (aweights + fweights) #316

Merged
merged 13 commits into from
Feb 23, 2018
79 changes: 45 additions & 34 deletions src/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -537,10 +537,6 @@ wmedian(v::RealVector, w::AbstractWeights{<:Real}) = median(v, w)

###### Weighted quantile #####

# http://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample
# In the non weighted version, we compute a vector of index h(N, p)
# and take interpolation between floor and ceil of this index
# Here there is a supplementary function from index to weighted index k -> Sk

"""
quantile(v, w::AbstractWeights, p)
Expand All @@ -549,15 +545,15 @@ Compute the weighted quantiles of a vector `x` at a specified set of probability
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`).
Weights must not be negative. The weights and data vectors must have the same length.

The quantile for `p` is defined as follows. Denoting
``S_k = (k-1)w_k + (n-1) \\sum_{i<k}w_i``, define ``x_{k+1}`` the smallest element of `x`
such that ``S_{k+1}/S_{n}`` is strictly superior to `p`. The function returns
``(1-\\gamma) x_k + \\gamma x_{k+1}`` with ``\\gamma = (pS_n- S_k)/(S_{k+1}-S_k)``.

This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)).
With frequency weights, the function returns the same result as `quantile` for a vector with repeated values.
With non frequency weights, denote N the length of the vector, w the vector of weights normalized to sum to 1, `h = p (N - 1) + 1` and ``S_k = 1 + (k-1) * wk + (N-1) \\sum_{i<=k}w_i/\\sum_{i<=N}w_i``, define ``x_{k+1}`` the smallest element of `x` such that ``S_{k+1}`` is strictly superior to `h`. The function returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you say what S and h represent?

Remove the double spaces, use double backticks everywhere (including around variable names) and break lines at 92 chars. Also better have [frequency weights](@ref FrequencyWeights).

``x_k + \\gamma (x_{k+1} -x_k)`` with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``
In particular, when `w` is a vector of one, the function returns the same result as `quantile`.
"""
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No line breaks before nor after the signature. I think the style of this package is not to have spaces after commas for type parameters.


function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V, W <: Real}

# checks
isempty(v) && error("quantile of an empty array is undefined")
isempty(p) && throw(ArgumentError("empty quantile array"))
Expand All @@ -569,47 +565,62 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where
x < 0 && error("weight vector cannot contain negative entries")
end

# full sort
vw = sort!(collect(zip(v, w.values)))
wvalues = w.values
nz = find(w.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use find(!iszero, w.values) since the previous form is deprecated on 0.7. Or probably even better, just do nz = .!iszero.(w.values), since a boolean vector will be more efficient (you know the size in advance and it's smaller).

Also move this below to the place where it's used for clarity.

Finally, why is that operation needed for frequency weights? Shouldn't the algorithm be able to skip entries with zero weights on its own?

Copy link
Contributor Author

@matthieugomez matthieugomez Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the algorithm does not work with zero weights in its current form. There is an issue for instance if the highest value has zero weight. The algorithm also only keeps track of the last visited value, whereas it should keep track of the last visiting value with non zero weight. I just think it's simpler to remove the zero values.

#normalize if non frequencyweight
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after # (also below) and between "frequency" and "weights". Better also say that the sum will be 1.

if !isa(w, FrequencyWeights)
wvalues = wvalues / w.sum
end
wsum = sum(wvalues)

wsum = w.sum
#remove zeros weights and sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after #.

vw = sort!(collect(zip(view(v, nz), view(wvalues, nz))))
N = length(vw)

# prepare percentiles
ppermute = sortperm(p)
p = p[ppermute]
p = bound_quantiles(p)

# prepare out vector
N = length(vw)
out = Vector{typeof(zero(V)/1)}(length(p))
fill!(out, vw[end][1])

# start looping on quantiles
cumulative_weight, Sk, Skold = zero(W), zero(W), zero(W)
vk, vkold = zero(V), zero(V)
k = 1
Sk, Skold = zero(W), zero(W)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space.

vk, vkold, cumwk, wk = zero(V), zero(V), zero(V), zero(V)
k = 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single line break is enough.


for i in 1:length(p)
h = p[i] * (N - 1) * wsum
if h == 0
# happens when N or p or wsum equal zero
out[ppermute[i]] = vw[1][1]
else
if isa(w, FrequencyWeights)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to use a completely different path for frequency weights. Couldn't a common path be defined, moving some type-specific computations out of the loop like you did for normalization? For example, for h = p[i] * (wsum - 1) + 1, wsum just needs to be replaced with N for non frequency weights, so you could as well define the variable before the loop.

BTW, wouldn't it make more sense to normalize the weights to sum to N (rather than 1), since it plays a role equivalent to wsum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this but I think it is better to do two different paths. Joining the two looks more confusing than enlightening.

h = p[i] * (wsum - 1) + 1
while Sk <= h
# happens in particular when k == 1
vk, wk = vw[k]
cumulative_weight += wk
if k >= N
# out was initialized with maximum v
return out
k += 1
if k > N
# out was initialized with maximum v
return out
end
Skold, vkold = Sk, vk
vk, wk = vw[k]
Sk += wk
end
out[ppermute[i]] = vkold + min(h - Skold, 1) * (vk - vkold)
else
# https://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample
h = p[i] * (N - 1) + 1
while Sk <= h
k += 1
if k > N
# out was initialized with maximum v
return out
end
Skold, vkold = Sk, vk
cumwk += wk
vk, wk = vw[k]
Sk = (k - 1) * wk + (N - 1) * cumulative_weight
Sk = 1 + (k - 1) * wk + (N - 1) * cumwk
end
# in particular, Sk is different from Skold
g = (h - Skold) / (Sk - Skold)
out[ppermute[i]] = vkold + g * (vk - vkold)
out[ppermute[i]] = vkold + (h - Skold) / (Sk - Skold) * (vk - vkold)
end
end
return out
Expand Down
144 changes: 94 additions & 50 deletions test/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -261,7 +261,48 @@ end
@test_throws ErrorException median(data, f(wt))
end

@testset "Quantile $f" for f in weight_funcs

# Quantile fweights
@testset "Quantile fweights" begin
data = (
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10, 15],
[1, 2, 4, 7, 10, 15],
[0, 10, 20, 30],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[30, 40, 50, 60, 35],
[2, 0.6, 1.3, 0.3, 0.3, 1.7, 0.7, 1.7],
[1, 2, 2],
[3.7, 3.3, 3.5, 2.8],
[100, 125, 123, 60, 45, 56, 66],
[2, 2, 2, 2, 2, 2],
[2.3],
[-2, -3, 1, 2, -10],
[1, 2, 3, 4, 5],
[5, 4, 3, 2, 1],
[-2, 2, -1, 3, 6],
[-10, 1, 1, -10, -10],
)
p = [0.0, 0.25, 0.5, 0.75, 1.0]
for x in data
@test quantile(x, fweights(ones(Int64, length(x))), p) ≈ quantile(x, p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to test more thoroughly the results by combining various inputs with various weights, as done below with non frequency weights. You could use a rep helper function (JuliaLang/julia#16443) to generate vectors with repeated entries to call the unweighted quantile method on.

Also, there aren't any zeros no negative non-integer values. Wouldn't hurt to add some.

end
# zero don't count
x = [1, 2, 3, 4, 5]
@test quantile(x, fweights([0,1,1,1,0]), p) ≈ quantile([2, 3, 4], p)
# repetitions dont count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"don't"

@test quantile(x, fweights([0,1,2,1,0]), p) ≈ quantile([2, 3, 3, 4], p)
# Issue #313
@test quantile(x, fweights([0,1,2,1,0]), p) ≈ quantile([2, 3, 3, 4], p)
@test quantile([1, 2], fweights([1, 1]), 0.25) ≈ 1.25
@test quantile([1, 2], fweights([2, 2]), 0.25) ≈ 1.0
end


@testset "Quantile aweights" begin

data = (
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10],
Expand All @@ -284,81 +325,84 @@ end
[-10, 1, 1, -10, -10],
)
wt = (
f([1, 1/3, 1/3, 1/3, 1]),
f([1, 1, 1, 1, 1]),
f([1, 1/3, 1/3, 1/3, 1, 1]),
f([1/3, 1/3, 1/3, 1, 1, 1]),
f([30, 191, 9, 0]),
f([10, 1, 1, 1, 9]),
f([10, 1, 1, 1, 900]),
f([1, 3, 5, 4, 2]),
f([2, 2, 5, 1, 2, 2, 1, 6]),
f([0.1, 0.1, 0.8]),
f([5, 5, 4, 1]),
f([30, 56, 144, 24, 55, 43, 67]),
f([0.1, 0.2, 0.3, 0.4, 0.5, 0.6]),
f([12]),
f([7, 1, 1, 1, 6]),
f([1, 0, 0, 0, 2]),
f([1, 2, 3, 4, 5]),
f([0.1, 0.2, 0.3, 0.2, 0.1]),
f([1, 1, 1, 1, 1]),
[1, 1/3, 1/3, 1/3, 1],
[1, 1, 1, 1, 1],
[1, 1/3, 1/3, 1/3, 1, 1],
[1/3, 1/3, 1/3, 1, 1, 1],
[30, 191, 9, 0],
[10, 1, 1, 1, 9],
[10, 1, 1, 1, 900],
[1, 3, 5, 4, 2],
[2, 2, 5, 1, 2, 2, 1, 6],
[0.1, 0.1, 0.8],
[5, 5, 4, 1],
[30, 56, 144, 24, 55, 43, 67],
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
[12],
[7, 1, 1, 1, 6],
[1, 0, 0, 0, 2],
[1, 2, 3, 4, 5],
[0.1, 0.2, 0.3, 0.2, 0.1],
[1, 1, 1, 1, 1],
)
quantile_answers = (
[1.0,3.6000000000000005,6.181818181818182,8.2,10.0],
[1.0,2.0,4.0,7.0,10.0],
[1.0,4.75,8.0,10.833333333333334,15.0],
[1.0,4.75,8.0,10.833333333333334,15.0],
[0.0,6.1387900355871885,11.600000000000001,15.912500000000001,30.0],
[1.0,1.5365853658536586,2.5999999999999996,4.405405405405405,5.0],
[1.0,4.239377950569287,4.492918633712858,4.746459316856429,5.0],
[30.0,38.75,45.714285714285715,52.85714285714286,60.0],
[0.3,0.6903846153846154,1.484,1.7,2.0],
[1.0,2.0,2.0,2.0,2.0],
[2.8,3.3361111111111112,3.4611111111111112,3.581578947368421,3.7],
[45.0,59.88593155893536,100.08846153846153,118.62115384615385,125.0],
[2.0,2.0,2.0,2.0,2.0],
[2.3,2.3,2.3,2.3,2.3],
[-10.0,-5.52,-2.5882352941176467,-0.9411764705882351,2.0],
[1.0,1.75,4.25,4.625,5.0],
[1.0,1.625,2.3333333333333335,3.25,5.0],
[-2.0,-0.5384615384615388,1.5384615384615383,2.6999999999999997,6.0],
[-10.0,-10.0,-10.0,1.0,1.0]
[1.0, 3.6, 6.18182, 8.2, 10.0],
[1.0, 2.0, 4.0, 7.0, 10.0],
[1.0, 4.75, 8.0, 10.8333, 15.0],
[1.0, 4.75, 8.0, 10.8333, 15.0],
[0.0, 4.58167, 9.16335, 14.4976, 20.0],
[1.0, 1.53659, 2.6, 4.40541, 5.0],
[1.0, 4.23938, 4.49292, 4.74646, 5.0],
[30.0, 38.75, 45.7143, 52.8571, 60.0],
[0.3, 0.690385, 1.484, 1.7, 2.0],
[1.0, 2.0, 2.0, 2.0, 2.0],
[2.8, 3.33611, 3.46111, 3.58158, 3.7],
[45.0, 59.8859, 100.088, 118.621, 125.0],
[2.0, 2.0, 2.0, 2.0, 2.0],
[2.3, 2.3, 2.3, 2.3, 2.3],
[-10.0, -5.52, -2.58824, -0.941176, 2.0],
[1.0, 2.0, 3.0, 4.0, 5.0],
[1.0, 1.625, 2.33333, 3.25, 5.0],
[-2.0, -0.538462, 1.53846, 2.7, 6.0],
[-10.0, -10.0, -10.0, 1.0, 1.0],
)
p = [0.0, 0.25, 0.5, 0.75, 1.0]

srand(10)
for i = 1:length(data)
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i]
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e-3 sounds quite high, why not keep more precision?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment hasn't been addressed.

BTW, it would be nice to duplicate each @test line to test pweights too.

for j = 1:10
# order of p does not matter
reorder = sortperm(rand(length(p)))
@test quantile(data[i], wt[i], p[reorder]) ≈ quantile_answers[i][reorder]
@test quantile(data[i], aweights(wt[i]), p[reorder]) ≈ quantile_answers[i][reorder] atol = 1e-3
end
for j = 1:10
# order of w does not matter
reorder = sortperm(rand(length(data[i])))
@test quantile(data[i][reorder], f(wt[i][reorder]), p) ≈ quantile_answers[i]
@test quantile(data[i][reorder], aweights(wt[i][reorder]), p) ≈ quantile_answers[i] atol = 1e-3
end
end
# w = 1 corresponds to base quantile
for i = 1:length(data)
@test quantile(data[i], f(ones(Int64, length(data[i]))), p) ≈ quantile(data[i], p)
@test quantile(data[i], aweights(ones(Int64, length(data[i]))), p) ≈ quantile(data[i], p) atol = 1e-3
for j = 1:10
prandom = rand(4)
@test quantile(data[i], f(ones(Int64, length(data[i]))), prandom) ≈ quantile(data[i], prandom)
@test quantile(data[i], aweights(ones(Int64, length(data[i]))), prandom) ≈ quantile(data[i], prandom) atol = 1e-3
end
end

# other syntaxes
v = [7, 1, 2, 4, 10]
w = [1, 1/3, 1/3, 1/3, 1]
answer = 6.181818181818182
@test quantile(data[1], f(w), 0.5) ≈ answer
@test wquantile(data[1], f(w), [0.5]) ≈ [answer]
@test wquantile(data[1], f(w), 0.5) ≈ answer
@test wquantile(data[1], w, [0.5]) ≈ [answer]
@test wquantile(data[1], w, 0.5) ≈ answer
answer = 6.1818
@test quantile(data[1], aweights(w), 0.5) ≈ answer atol = 1e-4
@test wquantile(data[1], aweights(w), [0.5]) ≈ [answer] atol = 1e-4
@test wquantile(data[1], aweights(w), 0.5) ≈ answer atol = 1e-4
@test wquantile(data[1], w, [0.5]) ≈ [answer] atol = 1e-4
@test wquantile(data[1], w, 0.5) ≈ answer atol = 1e-4
end


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed.



end # @testset StatsBase.Weights