Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Distribution: MvDiscreteNonParametric #1424

Open
wants to merge 24 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
34181c4
Implementation of MvDiscreteNonParametric distribution.
davibarreira Nov 15, 2021
0cc7518
Implementation of MvDiscreteNonParametric samplers.
davibarreira Nov 15, 2021
67b1890
Added ArraysOfArrays as dependency.
davibarreira Nov 15, 2021
7fba9da
Added MvDiscreteNonParametric to be exported.
davibarreira Nov 15, 2021
6f43b18
Added MvDiscreteNonParametric to be exported.
davibarreira Nov 16, 2021
c64339a
Added MvDiscreteNonParametric test set.
davibarreira Nov 16, 2021
e156052
Added MvDiscreteNonParametric samplers to samplers.jl.
davibarreira Nov 16, 2021
d11d300
Added _rand!(::GlobalRNG,...).
davibarreira Nov 16, 2021
24612da
Moved MvDiscreteNonParametric sampling to just one file. All tests pa…
davibarreira Nov 16, 2021
288401b
Merge branch 'master' into MvDiscreteNonParametric
davibarreira Nov 16, 2021
e8cf084
Apply suggestions from code review
davibarreira Nov 16, 2021
68c4d4d
Merge branch 'master' of https://github.com/JuliaStats/Distributions.…
davibarreira Nov 16, 2021
cbb6ae7
Removing ArraysOfArarys dependency.
davibarreira Nov 17, 2021
67a15dd
ArrarysOfArrays on test.
davibarreira Nov 19, 2021
5644219
Fixed GeneralDiscreteNonParametric with version from devmotion.
davibarreira Nov 20, 2021
3baece6
Reverting wrong formatting changes.
davibarreira Nov 20, 2021
10ba493
Fixed functionalities mean, var and cov for vector of vectors without…
davibarreira Nov 20, 2021
236380a
Removing testing.
davibarreira Dec 18, 2021
2947ca9
Merge branch 'master' of https://github.com/JuliaStats/Distributions.…
davibarreira Dec 18, 2021
eeba4f2
Added rand function to generaldiscretenonparametric, but _rand! is no…
davibarreira Dec 19, 2021
b208df3
Moving some functions to generaldiscretenonparametric.
davibarreira Dec 20, 2021
b2f0ec0
Moving some functions to generaldiscretenonparametric.
davibarreira Dec 21, 2021
ad6d01e
Removing ArraysOfArarys as dependency on tests.
davibarreira Dec 21, 2021
001d73d
Renamed test set.
davibarreira Dec 21, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/Distributions.jl
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,7 @@ export
FullNormalCanon,
Gamma,
DiscreteNonParametric,
GeneralDiscreteNonParametric,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's keep it unexported until everything is figured out?

Suggested change
GeneralDiscreteNonParametric,

GeneralizedPareto,
GeneralizedExtremeValue,
Geometric,
Expand Down Expand Up @@ -128,6 +129,7 @@ export
MixtureModel,
Multinomial,
MultivariateNormal,
MvDiscreteNonParametric,
MvLogNormal,
MvNormal,
MvNormalCanon,
Expand Down
79 changes: 79 additions & 0 deletions src/multivariate/generaldiscretenonparametric.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
struct GeneralDiscreteNonParametric{VF,T,P <: Real,Ts <: AbstractVector{T},Ps <: AbstractVector{P},} <: Distribution{VF,Discrete}
support::Ts
p::Ps

function GeneralDiscreteNonParametric{VF,T,P,Ts,Ps}(
support::Ts,
p::Ps;
check_args=true,
) where {VF,T,P <: Real,Ts <: AbstractVector{T},Ps <: AbstractVector{P}}
if check_args
length(support) == length(p) ||
error("length of `support` and `p` must be equal")
isprobvec(p) || error("`p` must be a probability vector")
allunique(support) || error("`support` must contain only unique values")
end
new{VF,T,P,Ts,Ps}(support, p)
end
end

function rand(rng::AbstractRNG, d::GeneralDiscreteNonParametric)
x = support(d)
p = probs(d)
n = length(p)
draw = rand(rng, float(eltype(p)))
cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i +=1]
end
return x[i]
end
Comment on lines +20 to +31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this method is correct, it messes with the rand dispatches. Maybe either change it to the univariate case only (i.e., use ::GeneralDiscreteNonParametric{Univariate})

Suggested change
function rand(rng::AbstractRNG, d::GeneralDiscreteNonParametric)
x = support(d)
p = probs(d)
n = length(p)
draw = rand(rng, float(eltype(p)))
cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i +=1]
end
return x[i]
end
function rand(rng::AbstractRNG, d::GeneralDiscreteNonParametric{Univariate})
x = support(d)
p = probs(d)
n = length(p)
draw = rand(rng, float(eltype(p)))
cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i +=1]
end
return x[i]
end

or let's remove it for now and add it in a follow-up PR that updates DiscreteNonParametric:

Suggested change
function rand(rng::AbstractRNG, d::GeneralDiscreteNonParametric)
x = support(d)
p = probs(d)
n = length(p)
draw = rand(rng, float(eltype(p)))
cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i +=1]
end
return x[i]
end


"""
support(d::MvDiscreteNonParametric)
Get a sorted AbstractVector defining the support of `d`.
"""
Comment on lines +33 to +36
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a docstring, support is already documented (and I don't think it is sorted here actually?)

Suggested change
"""
support(d::MvDiscreteNonParametric)
Get a sorted AbstractVector defining the support of `d`.
"""

support(d::GeneralDiscreteNonParametric) = d.support

"""
probs(d::MvDiscreteNonParametric)
Get the vector of probabilities associated with the support of `d`.
Comment on lines +40 to +41
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
probs(d::MvDiscreteNonParametric)
Get the vector of probabilities associated with the support of `d`.
probs(d::GeneralDiscreteNonParametric)
Return the vector of probabilities associated with the support of `d`.

"""
probs(d::GeneralDiscreteNonParametric) = d.p


Base.length(d::GeneralDiscreteNonParametric) = length(first(d.support))
Comment on lines +45 to +46
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length should be defined for the multivariate case only, for other cases we need size:

Suggested change
Base.length(d::GeneralDiscreteNonParametric) = length(first(d.support))
Base.length(d::GeneralDiscreteNonParametric{Multivariate}) = length(first(d.support))
Base.size(d::GeneralDiscreteNonParametric{<:ArrayLikeVariate}) = size(first(d.support))

IMO it is better to be a bit conservative here and only define it for ArrayLikeVariates.


function _rand!(
rng::AbstractRNG,
d::GeneralDiscreteNonParametric,
x::AbstractVector{T},
) where {T<:Real}
Comment on lines +48 to +52
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_rand! is only used by ArrayLikeVariates, hence better to restrict the method a bit - and generalize it to arbitrary sizes:

Suggested change
function _rand!(
rng::AbstractRNG,
d::GeneralDiscreteNonParametric,
x::AbstractVector{T},
) where {T<:Real}
function _rand!(
rng::AbstractRNG,
d::GeneralDiscreteNonParametric{ArrayLikeVariate{N}},
x::AbstractArray{<:Real,N},
) where {N}


length(x) == length(d) || throw(DimensionMismatch("Invalid argument dimension."))
Comment on lines +53 to +54
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dimensions are already checked before _rand! is called, no need to do it again here:

Suggested change
length(x) == length(d) || throw(DimensionMismatch("Invalid argument dimension."))

s = d.support
p = d.p

n = length(p)
draw = Base.rand(rng, float(eltype(p)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to qualify rand

Suggested change
draw = Base.rand(rng, float(eltype(p)))
draw = rand(rng, float(eltype(p)))

cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i+=1]
end
copyto!(x, s[i])
return x
end


function _logpdf(d::GeneralDiscreteNonParametric, x::AbstractVector{T}) where {T<:Real}
Comment on lines +69 to +70
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the dispatches are only used by ArrayLikeVariate distributions, so let's just use

Suggested change
function _logpdf(d::GeneralDiscreteNonParametric, x::AbstractVector{T}) where {T<:Real}
function _logpdf(d::GeneralDiscreteNonParametric{ArrayLikeVariate{N}}, x::AbstractArray{<:Real,N}) where {N}

s = support(d)
p = probs(d)
for i = 1:length(p)
if s[i] == x
return log(p[i])
end
end
Comment on lines +73 to +77
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this could be optimized with searchsortedfirst if the support would be supported. At least for ArrayLikeVariate distributions sorting is clearly defined and works well, so maybe sort the support at least in these cases?

julia> sort([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.5]])
3-element Vector{Vector{Float64}}:
 [0.0, 1.0, 0.0]
 [0.5, 0.0, 0.5]
 [1.0, 0.0, 0.0]

julia> searchsortedfirst(ans, [0.5, 0.0, 0.5])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have been implementing an algorithm that does this more efficiently. But I only got the 2D case working. My idea is to submit a future PR improving this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a custom algorithm? Can't we just use searchsortedfirst? sort etc works with arrays just fine in Julia.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant for computing the empirical cdf. Not exactly the sorting of the support.

return log(zero(eltype(p)))
end
55 changes: 55 additions & 0 deletions src/multivariate/mvdiscretenonparametric.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
const MvDiscreteNonParametric{
T<:AbstractVector{<:Real},
P<:Real,
Ts<:AbstractVector{T},
Ps<:AbstractVector{P},
} = GeneralDiscreteNonParametric{Multivariate,T,P,Ts,Ps}
Comment on lines +1 to +6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for introducing this type-alias? It seems we could work with GeneralDiscreteNonParametric directly? Also it would only cover the multivariate case even though other ArrayLikeVariate cases could be handled in the same way.

I suggest removing this file completely and moving the definitions to src/generaldiscretenonparametric.jl instead (it should not be in the multivariate subfolder since it is not necessarily a multivariate distribution).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we just drop the MvDiscreteNonParametric then? Should I change all for GeneralNonParametric?


"""
MvDiscreteNonParametric(
support::AbstractVector,
p::AbstractVector{<:Real}=fill(inv(length(support)), length(support)),
)

Construct a multivariate discrete nonparametric probability distribution with `support` and corresponding
probabilities `p`. If the probability vector argument is not passed, then
equal probability is assigned to each entry in the support.

# Examples
```julia
# rows correspond to samples
x = collect(eachrow(rand(10,2)))
μ = MvDiscreteNonParametric(x)

# columns correspond to samples
y = collect(eachcol(rand(7,12)))
ν = MvDiscreteNonParametric(y)
```
"""
function MvDiscreteNonParametric(
support::AbstractArray{<:AbstractVector{<:Real}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

Suggested change
support::AbstractArray{<:AbstractVector{<:Real}},
support::AbstractVector{<:AbstractVector{<:Real}},

in the multivariate case, and

Suggested change
support::AbstractArray{<:AbstractVector{<:Real}},
support::AbstractVector{<:AbstractArray{<:Real,N}},

only in the more general case of ArrayLikeVariate{N}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I just use the second option and change the function to GeneralDiscreteNonParametric?

p::AbstractVector{<:Real} = fill(inv(length(support)), length(support)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should add a default definition for p. In any case, we should use the non-allocating Fill instead.

Suggested change
p::AbstractVector{<:Real} = fill(inv(length(support)), length(support)),
p::AbstractVector{<:Real} = Fill(inv(length(support)), length(support)),

)
return MvDiscreteNonParametric{eltype(support),eltype(p),typeof(support),typeof(p)}(
support,
p,
)
end

Base.eltype(::Type{<:MvDiscreteNonParametric{T}}) where T = Base.eltype(T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be defined more generally:

Suggested change
Base.eltype(::Type{<:MvDiscreteNonParametric{T}}) where T = Base.eltype(T)
Base.eltype(::Type{<:GeneralDiscreteNonParametric{<:ArrayLikeVariate,T}}) where T = eltype(T)


function mean(d::MvDiscreteNonParametric)
return StatsBase.mean(hcat(d.support...), Weights(d.p, one(eltype(d.p))),dims=2)
end

function var(d::MvDiscreteNonParametric)
x = hcat(support(d)...)
p = probs(d)
return StatsBase.var(x, Weights(p, one(eltype(p))), 2,corrected = false)
end

function cov(d::MvDiscreteNonParametric)
x = hcat(support(d)...)
p = probs(d)
return cov(x, Weights(p, one(eltype(p))), 2,corrected = false)
end
Comment on lines +41 to +55
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I requested to add support for vectors of vectors to StatsBase but it might take a bit until it is available. IMO it is fine to have possibly less optimized implementations until then. However, we should avoid splatting and use

Suggested change
function mean(d::MvDiscreteNonParametric)
return StatsBase.mean(hcat(d.support...), Weights(d.p, one(eltype(d.p))),dims=2)
end
function var(d::MvDiscreteNonParametric)
x = hcat(support(d)...)
p = probs(d)
return StatsBase.var(x, Weights(p, one(eltype(p))), 2,corrected = false)
end
function cov(d::MvDiscreteNonParametric)
x = hcat(support(d)...)
p = probs(d)
return cov(x, Weights(p, one(eltype(p))), 2,corrected = false)
end
function mean(d::MvDiscreteNonParametric)
return mean(reduce(hcat, d.support), Weights(d.p, one(eltype(d.p))); dims=2)
end
function var(d::MvDiscreteNonParametric)
x = reduce(hcat, support(d))
p = probs(d)
return var(x, Weights(p, one(eltype(p))), 2; corrected = false)
end
function cov(d::MvDiscreteNonParametric)
x = reduce(hcat, support(d))
p = probs(d)
return cov(x, Weights(p, one(eltype(p))), 2; corrected = false)
end

2 changes: 2 additions & 0 deletions src/multivariates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,8 @@ end
for fname in ["dirichlet.jl",
"multinomial.jl",
"dirichletmultinomial.jl",
"generaldiscretenonparametric.jl",
"mvdiscretenonparametric.jl",
"mvnormal.jl",
"mvnormalcanon.jl",
"mvlognormal.jl",
Expand Down
13 changes: 0 additions & 13 deletions src/univariate/discrete/discretenonparametric.jl
Original file line number Diff line number Diff line change
Expand Up @@ -69,19 +69,6 @@ Base.isapprox(c1::D, c2::D) where D<:DiscreteNonParametric =

# Sampling

function rand(rng::AbstractRNG, d::DiscreteNonParametric)
x = support(d)
p = probs(d)
n = length(p)
draw = rand(rng, float(eltype(p)))
cp = p[1]
i = 1
while cp <= draw && i < n
@inbounds cp += p[i +=1]
end
return x[i]
end

Comment on lines -72 to -84
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this was removed accidentally? I guess it would be cleaner to not make any changes to DiscreteNonParametric in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. my bad

sampler(d::DiscreteNonParametric) =
DiscreteNonParametricSampler(support(d), probs(d))

Expand Down
101 changes: 101 additions & 0 deletions test/generaldiscretenonparametric.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
using Distributions
using StatsBase
using LinearAlgebra
using Random
using Test


@testset "GeneralDiscreteNonParametric" begin

@testset "Declaring MvDiscreteNonParametric" begin

Random.seed!(7)
n = 4
m = 2
A = collect(eachrow(rand(n, m)))
p = normalize!(rand(n), 1)

# Passing probabilities
μ = @inferred(MvDiscreteNonParametric(A, p))
@test support(μ) == A
@test length(μ) == m
# @test size(μ) == (m, n)
@test probs(μ) == p

# Without passing probabilities
μ = @inferred(MvDiscreteNonParametric(A))
@test support(μ) == A
@test length(μ) == m
# @test size(μ) == (m, n)
@test probs(μ) == fill(1 / n, n)

# Array of arrays without ArraysOfArrays.jl
n, m = 3, 2
p = ([3 / 5, 1 / 5, 1 / 5])
A = [[1,0],[1,1],[0,1]]
μ = @inferred(MvDiscreteNonParametric(A, p))

@test support(μ) == A
@test length(μ) == m
@test probs(μ) == p

end


@testset "Functionalities" begin

function variance(d)
v = zeros(length(d))
for i in 1:length(d)
s = hcat(μ.support...)[i,:]
mₛ = mean(d)[i]
v[i] = sum(abs2.(s .- mₛ), Weights(d.p))
end
return v
end

function covariance(d)
n = length(d)
v = zeros(n, n)
for i in 1:n, j in 1:n
s = hcat(μ.support...)[i,:]
mₛ = mean(d)[i]

u = hcat(μ.support...)[j,:]
mᵤ = mean(d)[j]

v[i,j] = sum((s .- mₛ) .* (u .- mᵤ), Weights(d.p))
end
return v
end

Random.seed!(7)
n, m = 7, 9

A = collect(eachrow(rand(n, m)))
p = normalize!(rand(n), 1)
μ = @inferred(MvDiscreteNonParametric(A, p))

@test mean(μ) ≈ mean(hcat(μ.support...), Weights(p), dims=2)[:]
@test var(μ) ≈ variance(μ)
@test cov(μ) ≈ covariance(μ)
@test pdf(μ, μ.support) ≈ μ.p
@test pdf(μ, zeros(m)) == 0.0
# @test entropy(μ) == entropy(μ.p)
# @test entropy(μ, 2) == entropy(μ.p, 2)

end

@testset "Sampling" begin
Random.seed!(7)
A = collect(eachrow(rand(3, 2)))
μ = MvDiscreteNonParametric(A, [0.9,0.1,0.0])

for i in 1:3
samples = rand(μ, 10000)
@test abs(mean([s == A[i] for s in eachcol(samples)]) - μ.p[i]) < 0.05
end
end

end

2 changes: 2 additions & 0 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ const tests = [
"pgeneralizedgaussian",
"product",
"discretenonparametric",
"generaldiscretenonparametric",
"functionals",
"chernoff",
"univariate_bounds",
"negativebinomial",
Expand Down