-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
modify weighted quantile (aweights + fweights) #316
Changes from 7 commits
b6295d1
b21a2cb
ed3d858
8ef4511
f23b591
57203fb
370afd5
f9e3157
899f395
0736c02
1e12c77
c732c82
2e72f48
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -537,25 +537,22 @@ wmedian(v::RealVector, w::AbstractWeights{<:Real}) = median(v, w) | |
|
||
###### Weighted quantile ##### | ||
|
||
# http://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample | ||
# In the non weighted version, we compute a vector of index h(N, p) | ||
# and take interpolation between floor and ceil of this index | ||
# Here there is a supplementary function from index to weighted index k -> Sk | ||
|
||
""" | ||
quantile(v, w::AbstractWeights, p) | ||
|
||
Compute the weighted quantiles of a vector `x` at a specified set of probability | ||
values `p`, using weights given by a weight vector `w` (of type `AbstractWeights`). | ||
Compute the weighted quantiles of a vector ``x`` at a specified set of probability | ||
values ``p``, using weights given by a weight vector ``w`` (of type `AbstractWeights`). | ||
Weights must not be negative. The weights and data vectors must have the same length. | ||
|
||
The quantile for `p` is defined as follows. Denoting | ||
``S_k = (k-1)w_k + (n-1) \\sum_{i<k}w_i``, define ``x_{k+1}`` the smallest element of `x` | ||
such that ``S_{k+1}/S_{n}`` is strictly superior to `p`. The function returns | ||
``(1-\\gamma) x_k + \\gamma x_{k+1}`` with ``\\gamma = (pS_n- S_k)/(S_{k+1}-S_k)``. | ||
|
||
This corresponds to R-7, Excel, SciPy-(1,1) and Maple-6 when `w` contains only ones | ||
(see [Wikipedia](https://en.wikipedia.org/wiki/Quantile)). | ||
With [FrequencyWeights](@ref FrequencyWeights), the function returns the same result as | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Missing backticks here and below. |
||
`quantile` for a vector with repeated values. | ||
With non FrequencyWeights, denote N the length of the vector, w the vector of weights, ``h = p (\\sum_{i<= N}w_i - w_1) + w_1`` the cumulative weight corresponding to the probability `p` and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Cut line at 92 chars. Still missing backticks. |
||
``S_k = \\sum_{i<=k}w_i`` the cumulative weight for to each observation, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "for to" |
||
define ``x_{k+1}`` the smallest element of ``x`` such that ``S_{k+1}`` is strictly | ||
superior to ``h``. The function returns``x_k + \\gamma (x_{k+1} -x_k)`` | ||
with ``\\gamma = (h - S_k)/(S_{k+1}-S_k)``. In particular, when ``w`` is a vector | ||
of one, the function returns the same result as `quantile`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "of ones". Also remove indentation on these lines. |
||
""" | ||
function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where {V,W<:Real} | ||
# checks | ||
|
@@ -569,47 +566,47 @@ function quantile(v::RealVector{V}, w::AbstractWeights{W}, p::RealVector) where | |
x < 0 && error("weight vector cannot contain negative entries") | ||
end | ||
|
||
# full sort | ||
vw = sort!(collect(zip(v, w.values))) | ||
|
||
wsum = w.sum | ||
#remove zeros weights and sort | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Space after |
||
wsum = sum(w.value) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Use |
||
nz = .!iszero.(w.values) | ||
vw = sort!(collect(zip(view(v, nz), view(wvalues, nz)))) | ||
N = length(vw) | ||
|
||
# prepare percentiles | ||
ppermute = sortperm(p) | ||
p = p[ppermute] | ||
p = bound_quantiles(p) | ||
|
||
# prepare out vector | ||
N = length(vw) | ||
out = Vector{typeof(zero(V)/1)}(length(p)) | ||
fill!(out, vw[end][1]) | ||
|
||
# start looping on quantiles | ||
cumulative_weight, Sk, Skold = zero(W), zero(W), zero(W) | ||
vk, vkold = zero(V), zero(V) | ||
k = 1 | ||
Sk, Skold = zero(W), zero(W) | ||
vk, vkold= zero(V), zero(V) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Space before |
||
k = 0 | ||
|
||
for i in 1:length(p) | ||
h = p[i] * (N - 1) * wsum | ||
if h == 0 | ||
# happens when N or p or wsum equal zero | ||
out[ppermute[i]] = vw[1][1] | ||
if isa(w, FrequencyWeights) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It feels weird to use a completely different path for frequency weights. Couldn't a common path be defined, moving some type-specific computations out of the loop like you did for normalization? For example, for BTW, wouldn't it make more sense to normalize the weights to sum to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have thought about this but I think it is better to do two different paths. Joining the two looks more confusing than enlightening. |
||
h = p[i] * (wsum - 1) + 1 | ||
else | ||
while Sk <= h | ||
# happens in particular when k == 1 | ||
vk, wk = vw[k] | ||
cumulative_weight += wk | ||
if k >= N | ||
# out was initialized with maximum v | ||
return out | ||
end | ||
k += 1 | ||
Skold, vkold = Sk, vk | ||
vk, wk = vw[k] | ||
Sk = (k - 1) * wk + (N - 1) * cumulative_weight | ||
h = p[i] * (wsum - vw[1][2]) + vw[1][2] | ||
end | ||
while Sk <= h | ||
k += 1 | ||
if k > N | ||
# out was initialized with maximum v | ||
return out | ||
end | ||
# in particular, Sk is different from Skold | ||
g = (h - Skold) / (Sk - Skold) | ||
out[ppermute[i]] = vkold + g * (vk - vkold) | ||
Skold, vkold = Sk, vk | ||
vk, wk = vw[k] | ||
Sk += wk | ||
end | ||
if isa(w, FrequencyWeights) | ||
out[ppermute[i]] = vkold + min(h - Skold, 1) * (vk - vkold) | ||
else | ||
out[ppermute[i]] = vkold + (h - Skold) / (Sk - Skold) * (vk - vkold) | ||
end | ||
end | ||
return out | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -261,7 +261,9 @@ end | |
@test_throws ErrorException median(data, f(wt)) | ||
end | ||
|
||
@testset "Quantile $f" for f in weight_funcs | ||
|
||
# Quantile fweights | ||
@testset "Quantile fweights" begin | ||
data = ( | ||
[7, 1, 2, 4, 10], | ||
[7, 1, 2, 4, 10], | ||
|
@@ -284,81 +286,155 @@ end | |
[-10, 1, 1, -10, -10], | ||
) | ||
wt = ( | ||
f([1, 1/3, 1/3, 1/3, 1]), | ||
f([1, 1, 1, 1, 1]), | ||
f([1, 1/3, 1/3, 1/3, 1, 1]), | ||
f([1/3, 1/3, 1/3, 1, 1, 1]), | ||
f([30, 191, 9, 0]), | ||
f([10, 1, 1, 1, 9]), | ||
f([10, 1, 1, 1, 900]), | ||
f([1, 3, 5, 4, 2]), | ||
f([2, 2, 5, 1, 2, 2, 1, 6]), | ||
f([0.1, 0.1, 0.8]), | ||
f([5, 5, 4, 1]), | ||
f([30, 56, 144, 24, 55, 43, 67]), | ||
f([0.1, 0.2, 0.3, 0.4, 0.5, 0.6]), | ||
f([12]), | ||
f([7, 1, 1, 1, 6]), | ||
f([1, 0, 0, 0, 2]), | ||
f([1, 2, 3, 4, 5]), | ||
f([0.1, 0.2, 0.3, 0.2, 0.1]), | ||
f([1, 1, 1, 1, 1]), | ||
Int[3, 1, 1, 1, 3], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why specify |
||
Int[1, 1, 1, 1, 1], | ||
Int[3, 1, 1, 1, 3, 3], | ||
Int[1, 1, 1, 3, 3, 3], | ||
Int[30, 191, 9, 0], | ||
Int[10, 1, 1, 1, 9], | ||
Int[10, 1, 1, 1, 900], | ||
Int[1, 3, 5, 4, 2], | ||
Int[2, 2, 5, 0, 2, 2, 1, 6], | ||
Int[1, 1, 8], | ||
Int[5, 5, 4, 1], | ||
Int[30, 56, 144, 24, 55, 43, 67], | ||
Int[1, 2, 3, 4, 5, 6], | ||
Int[12], | ||
Int[7, 1, 1, 1, 6], | ||
Int[1, 0, 0, 0, 2], | ||
Int[1, 2, 3, 4, 5], | ||
Int[1, 2, 3, 2, 1], | ||
Int[0, 1, 1, 1, 1], | ||
) | ||
p = [0.0, 0.25, 0.5, 0.75, 1.0] | ||
function _rep(x::AbstractVector, lengths::AbstractVector{Int}) | ||
res = similar(x, sum(lengths)) | ||
i = 1 | ||
for idx in 1:length(x) | ||
tmp = x[idx] | ||
for kdx in 1:lengths[idx] | ||
res[i] = tmp | ||
i += 1 | ||
end | ||
end | ||
return res | ||
end | ||
# quantile with fweights is the same as repeated vectors | ||
for i = 1:length(data) | ||
@test quantile(data[i], fweights(wt[i]), p) ≈ quantile(_rep(data[i], wt[i]), p) | ||
end | ||
# quantile with fweights = 1 is the same as quantile | ||
for i = 1:length(data) | ||
@test quantile(data[i], fweights(ones(wt[i])), p) ≈ quantile(data[i], p) | ||
end | ||
|
||
# Issue #313 | ||
@test quantile(x, fweights([0,1,2,1,0]), p) ≈ quantile([2, 3, 3, 4], p) | ||
@test quantile([1, 2], fweights([1, 1]), 0.25) ≈ 1.25 | ||
@test quantile([1, 2], fweights([2, 2]), 0.25) ≈ 1.0 | ||
end | ||
|
||
@testset "Quantile aweights" begin | ||
data = ( | ||
[7, 1, 2, 4, 10], | ||
[7, 1, 2, 4, 10], | ||
[7, 1, 2, 4, 10, 15], | ||
[1, 2, 4, 7, 10, 15], | ||
[0, 10, 20, 30], | ||
[1, 2, 3, 4, 5], | ||
[1, 2, 3, 4, 5], | ||
[30, 40, 50, 60, 35], | ||
[2, 0.6, 1.3, 0.3, 0.3, 1.7, 0.7, 1.7], | ||
[1, 2, 2], | ||
[3.7, 3.3, 3.5, 2.8], | ||
[100, 125, 123, 60, 45, 56, 66], | ||
[2, 2, 2, 2, 2, 2], | ||
[2.3], | ||
[-2, -3, 1, 2, -10], | ||
[1, 2, 3, 4, 5], | ||
[5, 4, 3, 2, 1], | ||
[-2, 2, -1, 3, 6], | ||
[-10, 1, 1, -10, -10], | ||
) | ||
wt = ( | ||
[1, 1/3, 1/3, 1/3, 1], | ||
[1, 1, 1, 1, 1], | ||
[1, 1/3, 1/3, 1/3, 1, 1], | ||
[1/3, 1/3, 1/3, 1, 1, 1], | ||
[30, 191, 9, 0], | ||
[10, 1, 1, 1, 9], | ||
[10, 1, 1, 1, 900], | ||
[1, 3, 5, 4, 2], | ||
[2, 2, 5, 1, 2, 2, 1, 6], | ||
[0.1, 0.1, 0.8], | ||
[5, 5, 4, 1], | ||
[30, 56, 144, 24, 55, 43, 67], | ||
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6], | ||
[12], | ||
[7, 1, 1, 1, 6], | ||
[1, 0, 0, 0, 2], | ||
[1, 2, 3, 4, 5], | ||
[0.1, 0.2, 0.3, 0.2, 0.1], | ||
[1, 1, 1, 1, 1], | ||
) | ||
quantile_answers = ( | ||
[1.0,3.6000000000000005,6.181818181818182,8.2,10.0], | ||
[1.0,2.0,4.0,7.0,10.0], | ||
[1.0,4.75,8.0,10.833333333333334,15.0], | ||
[1.0,4.75,8.0,10.833333333333334,15.0], | ||
[0.0,6.1387900355871885,11.600000000000001,15.912500000000001,30.0], | ||
[1.0,1.5365853658536586,2.5999999999999996,4.405405405405405,5.0], | ||
[1.0,4.239377950569287,4.492918633712858,4.746459316856429,5.0], | ||
[30.0,38.75,45.714285714285715,52.85714285714286,60.0], | ||
[0.3,0.6903846153846154,1.484,1.7,2.0], | ||
[1.0,2.0,2.0,2.0,2.0], | ||
[2.8,3.3361111111111112,3.4611111111111112,3.581578947368421,3.7], | ||
[45.0,59.88593155893536,100.08846153846153,118.62115384615385,125.0], | ||
[2.0,2.0,2.0,2.0,2.0], | ||
[2.3,2.3,2.3,2.3,2.3], | ||
[-10.0,-5.52,-2.5882352941176467,-0.9411764705882351,2.0], | ||
[1.0,1.75,4.25,4.625,5.0], | ||
[1.0,1.625,2.3333333333333335,3.25,5.0], | ||
[-2.0,-0.5384615384615388,1.5384615384615383,2.6999999999999997,6.0], | ||
[-10.0,-10.0,-10.0,1.0,1.0] | ||
[1.0, 4.0, 6.0, 8.0, 10.0], | ||
[1.0, 2.0, 4.0, 7.0, 10.0], | ||
[1.0, 4.75, 7.5, 10.4167, 15.0], | ||
[1.0, 4.75, 7.5, 10.4167, 15.0], | ||
[0.0, 2.6178, 5.2356, 7.8534, 20.0], | ||
[1.0, 4.0, 4.33333, 4.66667, 5.0], | ||
[1.0, 4.2475, 4.49833, 4.74917, 5.0], | ||
[30.0, 37.5, 44.0, 51.25, 60.0], | ||
[0.3, 0.7, 1.3, 1.7, 2.0], | ||
[1.0, 2.0, 2.0, 2.0, 2.0], | ||
[2.8, 3.15, 3.4, 3.56, 3.7], | ||
[45.0, 62.1493, 102.875, 117.41, 125.0], | ||
[2.0, 2.0, 2.0, 2.0, 2.0], | ||
[2.3, 2.3, 2.3, 2.3, 2.3], | ||
[-10.0, -2.78571, -2.42857, -2.07143, 2.0], | ||
[1.0, 2.0, 3.0, 4.0, 5.0], | ||
[1.0, 1.625, 2.33333, 3.25, 5.0], | ||
[-2.0, -1.33333, 0.5, 2.5, 6.0], | ||
[-10.0, -10.0, -10.0, 1.0, 1.0] | ||
) | ||
p = [0.0, 0.25, 0.5, 0.75, 1.0] | ||
|
||
srand(10) | ||
for i = 1:length(data) | ||
@test quantile(data[i], wt[i], p) ≈ quantile_answers[i] | ||
@test quantile(data[i], aweights(wt[i]), p) ≈ quantile_answers[i] atol = 1e-3 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1e-3 sounds quite high, why not keep more precision? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This comment hasn't been addressed. BTW, it would be nice to duplicate each |
||
for j = 1:10 | ||
# order of p does not matter | ||
reorder = sortperm(rand(length(p))) | ||
@test quantile(data[i], wt[i], p[reorder]) ≈ quantile_answers[i][reorder] | ||
@test quantile(data[i], aweights(wt[i]), p[reorder]) ≈ quantile_answers[i][reorder] atol = 1e-3 | ||
end | ||
for j = 1:10 | ||
# order of w does not matter | ||
reorder = sortperm(rand(length(data[i]))) | ||
@test quantile(data[i][reorder], f(wt[i][reorder]), p) ≈ quantile_answers[i] | ||
@test quantile(data[i][reorder], aweights(wt[i][reorder]), p) ≈ quantile_answers[i] atol = 1e-3 | ||
end | ||
end | ||
# w = 1 corresponds to base quantile | ||
for i = 1:length(data) | ||
@test quantile(data[i], f(ones(Int64, length(data[i]))), p) ≈ quantile(data[i], p) | ||
@test quantile(data[i], aweights(ones(Int64, length(data[i]))), p) ≈ quantile(data[i], p) atol = 1e-3 | ||
for j = 1:10 | ||
prandom = rand(4) | ||
@test quantile(data[i], f(ones(Int64, length(data[i]))), prandom) ≈ quantile(data[i], prandom) | ||
@test quantile(data[i], aweights(ones(Int64, length(data[i]))), prandom) ≈ quantile(data[i], prandom) atol = 1e-3 | ||
end | ||
end | ||
|
||
# other syntaxes | ||
# test zeros are removed | ||
for i = 1:length(data) | ||
@test quantile(vcat(1.0, data[i]), aweights(vcat(0.0, wt[i])), p) ≈ quantile_answers[i] atol = 1e-3 | ||
end | ||
# Syntax | ||
v = [7, 1, 2, 4, 10] | ||
w = [1, 1/3, 1/3, 1/3, 1] | ||
answer = 6.181818181818182 | ||
@test quantile(data[1], f(w), 0.5) ≈ answer | ||
@test wquantile(data[1], f(w), [0.5]) ≈ [answer] | ||
@test wquantile(data[1], f(w), 0.5) ≈ answer | ||
@test wquantile(data[1], w, [0.5]) ≈ [answer] | ||
@test wquantile(data[1], w, 0.5) ≈ answer | ||
answer = 6.0 | ||
@test quantile(data[1], aweights(w), 0.5) ≈ answer atol = 1e-3 | ||
@test wquantile(data[1], aweights(w), [0.5]) ≈ [answer] atol = 1e-3 | ||
@test wquantile(data[1], aweights(w), 0.5) ≈ answer atol = 1e-3 | ||
@test wquantile(data[1], w, [0.5]) ≈ [answer] atol = 1e-3 | ||
@test wquantile(data[1], w, 0.5) ≈ answer atol = 1e-3 | ||
end | ||
|
||
end # @testset StatsBase.Weights | ||
end # @testset StatsBase.Weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, references to Julia objects should use single backticks... :-) And
v
should be changed tox
in the signature above to match descriptions below (or the other way around).