Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add wts keyword to lm #341

Merged
merged 22 commits into from
Jan 15, 2020
Merged

Add wts keyword to lm #341

merged 22 commits into from
Jan 15, 2020

Conversation

pdeffebach
Copy link
Contributor

Addresses #258. It definitely doesn't work right now, as I haven't been able to track down all the up-stream functions that need to use the weights of a LmResp object.

@pdeffebach
Copy link
Contributor Author

I spent more time on this tonight, trying to make lm seem more like glm. I realized that the GlmResp holds a lot of things that are necessary for the weights to work. I think we should just dispatch to glm(..., Normal(), wts = wts) when the user uses wts with lm rather than alter the LmResp object and algorithm to use weights.

Does that sounds good?

@nalimilan
Copy link
Member

I'd rather keep lm always returning a LinearModel. Otherwise people who dispatch on it will get in trouble when using weights, ftest won't work, p-values won't be based on the t distribution, etc.

@pdeffebach
Copy link
Contributor Author

Thanks for the feedback. That's a valid concern.

If you have time, could you verify that we would need to add more fields to the LmResp object to accommodate actually running OLS with weights? At the moment it's hard to imagine enabling LinearModel to work with weights without basically making it look exactly like GeneralizedLinearModel.

@nalimilan
Copy link
Member

Well LmResp already has a wts field, and other ones seem irrelevant (linked to the IRLS estimation algorithm). What would you need to add?

@pdeffebach
Copy link
Contributor Author

After a decent amount of time inspecting functions to figure out what to change, I think I made the necessary changes.

julia> df = DataFrame(y = rand(10), x = rand(10), w = float.(rand(Bool, 10)))
julia> lm(@formula(y ~ x), df, wts = df.w)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   0.870868    0.447738   1.94504    0.1912   -1.05559    2.79733
x            -0.493512    0.902051  -0.5471     0.6392   -4.37472    3.3877 
────────────────────────────────────────────────────────────────────────────
julia> glm(@formula(y ~ x), df, Normal(), wts = df.w)
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   z value  Pr(>|z|)    Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   0.870868    0.447738   1.94504    0.0518  -0.00668183    1.74842
x            -0.493512    0.902051  -0.5471     0.5843  -2.2615        1.27448
──────────────────────────────────────────────────────────────────────────────

Actually it looks like the confidence intervals are all off, which will require some tweaks. @nalimilan can you let me know if I'm heading in the right direction in terms of the way I changed the updatemu! method?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks OK AFAICT. Regarding the confidence intervals, better compare with what you get by replicating each row according to its weight, as by definition the results must be equivalent. You can also check with another package (e.g. Stata's fweights). It's hard to tell whether what you showed is correct, since for a small number of observations the F distribution will be very different from the Normal distribution I think (for larger samples it should converge towards it).

Manifest.toml Outdated
@@ -0,0 +1,232 @@
# This file is machine-generated - editing it directly is not advised
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this file.

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector; wts::FPVector = similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make wts a positional argument for consistency with GlmResp.

Also, no spaces around = for arguments, and break lines at 92 chars.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to make wts a positional argument because there are 4 fields in LmResp and wts is like the 4th argument. So it seemed weird to have the arguments taken be different from the fields of the object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but then GlmResp should be changed too... Not sure it's worth it.

src/lm.jl Outdated

function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts))
LmResp(y::AbstractVector{<:Real}; wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))

src/lm.jl Outdated
@@ -126,24 +127,30 @@ end
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp)

function StatsBase.fit!(obj::LinearModel)
installbeta!(delbeta!(obj.pp, obj.rr.y))
if length(obj.rr.wts) == 0
installbeta!(delbeta!(obj.pp, obj.rr.y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

installbeta!(obj.pp) can be moved outside of the if AFAICT.

src/lm.jl Outdated
allowrankdeficient::Bool=false)
fit!(LinearModel(LmResp(y), cholpred(X, allowrankdeficient)))
function fit(::Type{LinearModel}, X::AbstractMatrix, y::V,
allowrankdeficient::Bool=false; wts::V = similar(y, 0)) where {V<:FPVector}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this signature more restrictive?

src/lm.jl Outdated
end

"""
lm(X, y, allowrankdeficient::Bool=false)
lm(X, y, allowrankdeficient::Bool=false; wts = nothing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lm(X, y, allowrankdeficient::Bool=false; wts = nothing)
lm(X, y, allowrankdeficient::Bool=false; wts=similar(y, 0))

@nalimilan
Copy link
Member

Do you think this is ready? If so, could you add a test?

@pdeffebach
Copy link
Contributor Author

No it's not ready yet. I still have to figure out why the confidence intervals don't match, as well as stuff like nobs, which I think is in StatsBase.

@nalimilan
Copy link
Member

OK. FWIW, as I noted, I don't think confidence intervals should match with glm.

@pdeffebach
Copy link
Contributor Author

Oh I see your point now. lm is more robust than glm with Normal() in that sense. I'll double check with R and Stata.

@nalimilan
Copy link
Member

You don't even need other software: just fit the same model without weights but with replicated rows. R doesn't actually support frequency weights so I don't think you can reproduce the p-values. With Stata you can with fweights.

@codecov-io
Copy link

codecov-io commented Dec 23, 2019

Codecov Report

Merging #341 into master will increase coverage by 11.67%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #341       +/-   ##
===========================================
+ Coverage   69.49%   81.17%   +11.67%     
===========================================
  Files           6        7        +1     
  Lines         531      632      +101     
===========================================
+ Hits          369      513      +144     
+ Misses        162      119       -43
Impacted Files Coverage Δ
src/lm.jl 95.12% <100%> (+22.89%) ⬆️
src/glmfit.jl 81.9% <100%> (+6.31%) ⬆️
src/negbinfit.jl 92.72% <0%> (-1.03%) ⬇️
src/GLM.jl 100% <0%> (ø)
src/ftest.jl 98.43% <0%> (+0.47%) ⬆️
src/linpred.jl 63.15% <0%> (+11.39%) ⬆️
src/glmtools.jl 69.29% <0%> (+21.67%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c97277b...f700891. Read the comment docs.

@pdeffebach
Copy link
Contributor Author

This PR is ready for merging, with a caveat.

Currently issue #84 requires lm to support y and X having different types, while GLM doesn't, so it makes sense to allow the wts keyword argument to be AbstractVector{<:Real} rather than the same type as y. Also note that canonically frequency weights are integers, so there are statistical reasons for wts to be flexible.

Currently the constructors for GLMResp in the glm function are much stricter, wts must be the same type as y. So in my tests I have weights be a vector of Floats, but the reality is that a vector of integers works for wts for lm but not for glm.

I propse merging this and then in a separate PR adding a new constructor method for GLMResp which allows for more type flexibility.

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
function LmResp(y::FPVector, wts::FPVector=similar(y, 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep short function syntax? Same below.

src/lm.jl Outdated

function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector
function updateμ!(r::LmResp{V}, linPr::V) where {V<:FPVector}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed.

src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated
@testset "lm with weights" begin
df = dataset("quantreg", "engel")
N = nrow(df)
df.weights = repeat(1.0:5.0, Int(N/5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not worth repeating this test every time: now that you know you got the correct values, just hardcode them in the test (just like we do above).

Also test other statistics like p-values, confidence intervals, R², deviance, log-likelihood, etc. (see examples in the file).

@nalimilan
Copy link
Member

I propse merging this and then in a separate PR adding a new constructor method for GLMResp which allows for more type flexibility.

Sure.

By the way, we could remove wts to weights, which is a much more explicit name. It would be nice to document clearly that these are frequency weights (in docstrings and in the manual).

pdeffebach and others added 3 commits December 30, 2019 15:22
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@pdeffebach
Copy link
Contributor Author

Thanks for requiring the R^2 be a test, there is a bug. We should be using the weighed mean to calculated the nulldeviance of the model, that is, instead of the sum of deviations from the mean, we should use the sum of deviations from the un-weighted mean.

I don't have Stata open to check that this is what fweights does, but this is what lfe does in R. The following code shows that the R^2 in lfe with weights is equivelent to the R^2 of the "long" dataset.

library("quantreg")
library("tidyverse")
library("lfe")
data(engel)
df = as_tibble(engel)
df$weights = rep(1:5, nrow(df) / 5)
felm(foodexp ~ income, data = df, weights = df$weights)
r2 = summary(felm_model)$r.squared # get r-squared
print(r2)

@pdeffebach
Copy link
Contributor Author

As an addendum to the above post: I realized that lfe doesn't document the way weights are used when calculating the variance-covariance matrix. So I broke out stata on my other computer and confirmed that Stata also uses the weighted mean when calculating the R^2.

I will go through and just use stata to ensure everything is correct rather than mess with felm.

@pdeffebach
Copy link
Contributor Author

@nalimilan I would like you to please confirm that the calculation of the loglikelihood is correct for lm with weights.

Stata assumes normality of errors when calculating the log-likelihood. This means that

reg foodexp income [fweight = weight]

gives you an e(ll) = -1445, which matches the loglikelihood from the Julia model

glm(@formula(foodexp ~ income), wts = df.weight)

However the loglikelihood of

lm(@formula(foodexp ~ income), wts = df.weight)

is different, due to the implementation here. If you could expand on why it's written like that so we know it's right I'd appreciate it.

@nalimilan
Copy link
Member

Thanks for checking against Stata.

I don't remember the details of why I implemented loglikelihood that way for linear models, and that method doesn't seem to be covered by tests with weights, so maybe it's wrong. I'm not sure why the sum of weights appears in the formula separately from n: actually I would just replace n with the sum of weights, or just n = nobs(m) which does the right thing whether there are weights or not. (Maybe I incorrectly took the formula for analytic weights...)

If you can't find the solution, you could also use the same approach as for GLMs, which should be a bit less efficient, but we don't care that much.

@pdeffebach
Copy link
Contributor Author

I didn't end up figuring out why the log-likelihood was implemented the way it was. It didn't match aweights. I implemented the glm version for lm and made sure it matched stata.

This is a weird PR because there exists a lot of code for lm with weights that has never been used since lm has never supported weights until now. For some reason the residuals function for lm multiplied each residual (i.e. x_i' * beta - y) by the weight, which is wrong and I have now fixed.

@nalimilan
Copy link
Member

Indeed, multiplying the residual for each observation by its weight doesn't make sense.

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) =
LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)

src/lm.jl Outdated
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)

LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) =
LmResp(float(y), float(wts))

src/lm.jl Outdated
sw += log(w)
y = r.y
if isempty(wts)
@show mu = mean(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@show mu = mean(y)
mu = mean(y)

src/lm.jl Outdated
ll
end

function residuals(r::LmResp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a one-liner?

src/lm.jl Outdated
@@ -126,24 +145,31 @@ end
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp)

function StatsBase.fit!(obj::LinearModel)
installbeta!(delbeta!(obj.pp, obj.rr.y))
if length(obj.rr.wts) == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if length(obj.rr.wts) == 0
if isempty(obj.rr.wts)

src/lm.jl Outdated
"""
lm(X, y, allowrankdeficient::Bool=false) = fit(LinearModel, X, y, allowrankdeficient)
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...)
lm(X, y, allowrankdeficient::Bool=false; kwargs...) =
fit(LinearModel, X, y, allowrankdeficient; kwargs...)

test/runtests.jl Outdated
@@ -46,6 +46,26 @@ linreg(x::AbstractVecOrMat, y::AbstractVector) = qr!(simplemm(x)) \ y
@test isapprox(coef(lm1), coef(lm2) .* [1., 10.])
end

@testset "weights" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@testset "weights" begin
@testset "linear model with weights" begin

test/runtests.jl Show resolved Hide resolved
src/lm.jl Show resolved Hide resolved
@pdeffebach
Copy link
Contributor Author

I can't find any reference to Pseudo-R^2. Is that something that lives in GLM? Not sure how to test it.

@nalimilan
Copy link
Member

@pdeffebach
Copy link
Contributor Author

r2 doesn't work for GLM, unfortunately. It says mss not defined for Generalized.... It should be a one-liner to implement though the way it is for lm.

@nalimilan
Copy link
Member

nalimilan commented Jan 5, 2020

I mean r²(m, :McFadden). EDIT: sorry, it used to work, but it doesn't since it moved from using nulldeviance to using nullloglikelihood (which is correct). So we need to implement that.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@nalimilan nalimilan merged commit c3fc336 into JuliaStats:master Jan 15, 2020
@getzze
Copy link

getzze commented Jan 16, 2020

The definition of loglikelihood (and nullloglikelihood) seems incorrect to me. As defined in StatsBase (https://juliastats.org/StatsBase.jl/latest/statmodels/#StatsBase.deviance), the deviance is 2*(l_S - l), where l is the loglikelihood of the model and l_S the loglikelihood of the saturated model, that is often 0. So I would expect that the loglikelihood is proportional to the deviance.

Also for GLMs, the definition of the loglikelihood is a bit confusing to me, I don't get why the deviance is used as an estimate of the second parameter of the distribution. Is the deviance supposed to be an estimate of the variance?

[EDIT] Sorry, I just realized that this was not introduced by this PR, I will open an issue. #356

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants