Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft of fweights, pweights, aweights for glms #194

Closed
wants to merge 7 commits into from

Conversation

jeffwong
Copy link

This is a code outline for #186. I am new to contributing to Julia, so please feel free to leave feedback on how this feature should be implemented

@@ -116,6 +116,8 @@ Base.cholfact!{T}(p::SparsePredChol{T}) = p.chol

invchol(x::DensePred) = inv(cholfact!(x))
invchol(x::SparsePredChol) = cholfact!(x) \ eye(size(x.X, 2))
# TODO: support fweights, pweights, and aweights
# This vcov represents σ^2 X'X
vcov(x::LinPredModel) = scale!(invchol(x.pp), dispersion(x, true))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated to invchol(x.pp * sqrt(wts))

@@ -116,6 +116,8 @@ Base.cholfact!{T}(p::SparsePredChol{T}) = p.chol

invchol(x::DensePred) = inv(cholfact!(x))
invchol(x::SparsePredChol) = cholfact!(x) \ eye(size(x.X, 2))
# TODO: support fweights, pweights, and aweights
# This vcov represents σ^2 X'X
vcov(x::LinPredModel) = scale!(invchol(x.pp), dispersion(x, true))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check dispersion(x, true) will use nobs, which will use the right behavior for fweights vs pweights

"`wrkwt`: working case weights for the Iteratively Reweighted Least Squares (IRLS) algorithm"
wrkwt::V
"`wrkresid`: working residuals for IRLS"
wrkresid::V
end

function GlmResp{V<:FPVector, D, L}(y::V, d::D, l::L, η::V, μ::V, off::V, wts::V)
function GlmResp{V<:FPVector, D, L}(y::V, d::D, l::L, η::V, μ::V, off::V, wts::V,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the constructor doesn't need a wts anymore, just fweights, pweights and aweights

Copy link
Member

@ararslan ararslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some preliminary comments, mostly minor style things.

n = length(y)
length(mu) == n || error("mismatched lengths of mu and y")
ll = length(off)
ll == 0 || ll == n || error("length of offset is $ll, must be $n or 0")
ll = length(wts)
ll == 0 || ll == n || error("length of wts is $ll, must be $n or 0")
new{V}(mu, off, wts, y)
ll = length(fweights)
ll == 0 || ll == n || error("length of fweights is $ll, must be $n or 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it's better to use typed exceptions whenever possible rather than error. In this case I think throw(ArgumentError("...")) would be appropriate.

nf = length(fweights)
np = length(pweights)
na = length(aweights)
if (nf != 0 && np == 0 && na == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifs in Julia don't need outer parentheses, and indeed that style is generally discouraged.

@@ -98,3 +98,22 @@ loglik_obs(::Binomial, y, μ, wt, ϕ) = logpdf(Binomial(Int(wt), μ), Int(y*wt))
loglik_obs(::Gamma, y, μ, wt, ϕ) = wt*logpdf(Gamma(1/ϕ, μ*ϕ), y)
loglik_obs(::Normal, y, μ, wt, ϕ) = wt*logpdf(Normal(μ, sqrt(ϕ)), y)
loglik_obs(::Poisson, y, μ, wt, ϕ) = wt*logpdf(Poisson(μ), y)

# TODO: combine fweights, pweights, aweights
function normalizeWeights(fweights::FrequencyWeights, pweights::ProbabilityWeights,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Camel case isn't used in Base Julia and the JuliaStats packages tend to match the Base style as closely as possible. For function names, typically it's all lowercase with no separation if the name isn't too long, otherwise snake case. For this function, I'd probably go with normalize_weights.

np = length(pweights)
na = length(aweights)
if (nf != 0 && np == 0 && na == 0)
return fweights |> collect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another minor stylistic point, but it's generally better to use standard calling conventions rather than piping. In this case you could actually just assign the result of the conditional to a variable (e.g. result = if ...) then return collect(result)

@@ -150,10 +152,11 @@ For linear and generalized linear models, returns the number of rows, or,
when prior weights are specified, the sum of weights.
"""
function nobs(obj::LinPredModel)
if isempty(obj.rr.wts)
if (length(obj.rr.fweights) == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is isempty not defined for the weight objects in StatsBase?

end
end
convert{V<:FPVector}(::Type{LmResp{V}}, y::V) =
LmResp{V}(zeros(y), similar(y, 0), similar(y, 0), y)
LmResp{V}(zeros(y), similar(y, 0), similar(y, 0), y,
similar(y, 0), similar(y, 0), similar(y, 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be aligned vertically with the arguments above it

type LmResp{V<:FPVector} <: ModResp # response in a linear model
mu::V # mean response
offset::V # offset added to linear predictor (may have length 0)
wts::V # prior weights (may have length 0)
fweights::FrequencyWeights
pweights::ProbabilityWeights
aweights::AnalyticWeights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should carry around three different types of weights in the response object. Perhaps just a single AbstractWeights?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should store the weights internally as a single vector (maybe keep the existing wts field), and only keep track of the necessary additional information like the actual number of observations (which is either the sum of weights or the vector length depending on the weights type).

@@ -14,6 +14,7 @@ module GLM
import Base: (\), cholfact, convert, cor, show, size
import StatsBase: coef, coeftable, confint, deviance, nulldeviance, dof, dof_residual, loglikelihood, nullloglikelihood, nobs, stderr, vcov, residuals, predict, fit, model_response, r2, r², adjr2, adjr², PValue
import StatsFuns: xlogy
import StatsBase: FrequencyWeights, ProbabilityWeights, AnalyticWeights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to bump the lower bound on StatsBase in the REQUIRE file to the version where these types were introduced.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I wonder how useful is the ability to pass multiple weights vector and have GLM combine them automatically. Is that a really common need? Couldn't we require users to do that manually first (and maybe provide convenience functions for that in StatsBase)? Are there examples of existing software which support this? I'm asking because it will make the code more complex and will affect the APIs, so we should evaluate the benefits.

elseif (nf != 0 && np != 0)
return (fweights .* pweights) |> collect
else
return zeros(Float64, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably handle this more efficiently by returning an empty vector. The existing code uses that trick.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the empty vector gets overridden at some point by ones(n). I haven't checked exactly where or why it was done. I suspect it was so that a quantity could be normalized by sum(weights).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Looks like a good use case for a UnitWeights or Ones type which would implement the AbstractArray interface and always return 1 (JuliaStats/StatsBase.jl#135). But that's not a requirement for a first implementation.

type LmResp{V<:FPVector} <: ModResp # response in a linear model
mu::V # mean response
offset::V # offset added to linear predictor (may have length 0)
wts::V # prior weights (may have length 0)
fweights::FrequencyWeights
pweights::ProbabilityWeights
aweights::AnalyticWeights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should store the weights internally as a single vector (maybe keep the existing wts field), and only keep track of the necessary additional information like the actual number of observations (which is either the sum of weights or the vector length depending on the weights type).

@jeffwong
Copy link
Author

jeffwong commented Sep 8, 2017

I think if we only supported Frequency Weights and Probability Weights this PR would be a lot easier for me. I am completely unfamiliar with Analytic Weights, and how they affect formulas for covariance of the glm parameters. It would also mean that we could easily simplify the API by only passing in 1 vector of weights. If someone needed to combine Frequency and Probability together, they could do that using fweight * pweight / sum(fweight * pweight) * sum(fweight). This could then be cast as a Frequency Weight: it has the relative weighting that the pweight brings, and the sum of this new vector still sums to sum(fweight), protecting the use of nobs in the package

@jeffwong
Copy link
Author

jeffwong commented Sep 8, 2017

Retrospectively, by limiting the scope to only Fweights and Pweights, I think we only needed a way to define nobs as a function of the weight type. Even if someone wanted to combine Fweights and Pweights, it would ultimately still behave like an Fweight. As long as the cov(parameters) was already based on $X' wts X$ instead of $X'X$ the only modification we would need is to nobs

@nalimilan
Copy link
Member

Sure, let's go with the simpler version for now. I don't know of other software accepting multiple types of weights at the same time, and we can always add support for analytic weights later.

@sergiocorreia
Copy link

sergiocorreia commented Sep 14, 2017

@jeffwong about your comment on Analytic Weights:

They represent multiple observations, like fweights, but instead of representing multiple identical obs., they represent distinct observations that were averaged out. This happens with variables such as "average grade by classroom", "accident rate per state", etc. Since groups with more individuals have less noisy means, aweights lead to heteroskedasticity.

That said, on linear models pweights are just "aweights with robust standard errors", so if the code supports pweights then aweights come for free.

@Nosferican
Copy link
Contributor

What's the status on this?

@nalimilan
Copy link
Member

Superseded by #487.

@nalimilan nalimilan closed this Sep 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants