-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add wts keyword to lm
#341
Conversation
I spent more time on this tonight, trying to make Does that sounds good? |
I'd rather keep |
Thanks for the feedback. That's a valid concern. If you have time, could you verify that we would need to add more fields to the |
Well |
After a decent amount of time inspecting functions to figure out what to change, I think I made the necessary changes.
Actually it looks like the confidence intervals are all off, which will require some tweaks. @nalimilan can you let me know if I'm heading in the right direction in terms of the way I changed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, looks OK AFAICT. Regarding the confidence intervals, better compare with what you get by replicating each row according to its weight, as by definition the results must be equivalent. You can also check with another package (e.g. Stata's fweights
). It's hard to tell whether what you showed is correct, since for a small number of observations the F distribution will be very different from the Normal distribution I think (for larger samples it should converge towards it).
Manifest.toml
Outdated
@@ -0,0 +1,232 @@ | |||
# This file is machine-generated - editing it directly is not advised |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this file.
src/lm.jl
Outdated
|
||
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y)) | ||
LmResp(y::FPVector; wts::FPVector = similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make wts
a positional argument for consistency with GlmResp
.
Also, no spaces around =
for arguments, and break lines at 92 chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to make wts
a positional argument because there are 4 fields in LmResp
and wts
is like the 4th argument. So it seemed weird to have the arguments taken be different from the fields of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe but then GlmResp
should be changed too... Not sure it's worth it.
src/lm.jl
Outdated
|
||
function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector | ||
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts)) | |
LmResp(y::AbstractVector{<:Real}; wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts)) |
src/lm.jl
Outdated
@@ -126,24 +127,30 @@ end | |||
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp) | |||
|
|||
function StatsBase.fit!(obj::LinearModel) | |||
installbeta!(delbeta!(obj.pp, obj.rr.y)) | |||
if length(obj.rr.wts) == 0 | |||
installbeta!(delbeta!(obj.pp, obj.rr.y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
installbeta!(obj.pp)
can be moved outside of the if
AFAICT.
src/lm.jl
Outdated
allowrankdeficient::Bool=false) | ||
fit!(LinearModel(LmResp(y), cholpred(X, allowrankdeficient))) | ||
function fit(::Type{LinearModel}, X::AbstractMatrix, y::V, | ||
allowrankdeficient::Bool=false; wts::V = similar(y, 0)) where {V<:FPVector} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this signature more restrictive?
src/lm.jl
Outdated
end | ||
|
||
""" | ||
lm(X, y, allowrankdeficient::Bool=false) | ||
lm(X, y, allowrankdeficient::Bool=false; wts = nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lm(X, y, allowrankdeficient::Bool=false; wts = nothing) | |
lm(X, y, allowrankdeficient::Bool=false; wts=similar(y, 0)) |
Do you think this is ready? If so, could you add a test? |
No it's not ready yet. I still have to figure out why the confidence intervals don't match, as well as stuff like |
OK. FWIW, as I noted, I don't think confidence intervals should match with |
Oh I see your point now. |
You don't even need other software: just fit the same model without weights but with replicated rows. R doesn't actually support frequency weights so I don't think you can reproduce the p-values. With Stata you can with |
Codecov Report
@@ Coverage Diff @@
## master #341 +/- ##
===========================================
+ Coverage 69.49% 81.17% +11.67%
===========================================
Files 6 7 +1
Lines 531 632 +101
===========================================
+ Hits 369 513 +144
+ Misses 162 119 -43
Continue to review full report at Codecov.
|
This PR is ready for merging, with a caveat. Currently issue #84 requires Currently the constructors for I propse merging this and then in a separate PR adding a new constructor method for |
src/lm.jl
Outdated
|
||
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y)) | ||
function LmResp(y::FPVector, wts::FPVector=similar(y, 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keep short function syntax? Same below.
src/lm.jl
Outdated
|
||
function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector | ||
function updateμ!(r::LmResp{V}, linPr::V) where {V<:FPVector} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed.
test/runtests.jl
Outdated
@testset "lm with weights" begin | ||
df = dataset("quantreg", "engel") | ||
N = nrow(df) | ||
df.weights = repeat(1.0:5.0, Int(N/5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not worth repeating this test every time: now that you know you got the correct values, just hardcode them in the test (just like we do above).
Also test other statistics like p-values, confidence intervals, R², deviance, log-likelihood, etc. (see examples in the file).
Sure. By the way, we could remove |
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Thanks for requiring the R^2 be a test, there is a bug. We should be using the weighed mean to calculated the I don't have Stata open to check that this is what fweights does, but this is what
|
As an addendum to the above post: I realized that I will go through and just use stata to ensure everything is correct rather than mess with |
@nalimilan I would like you to please confirm that the calculation of the loglikelihood is correct for Stata assumes normality of errors when calculating the log-likelihood. This means that
gives you an
However the
is different, due to the implementation here. If you could expand on why it's written like that so we know it's right I'd appreciate it. |
Thanks for checking against Stata. I don't remember the details of why I implemented If you can't find the solution, you could also use the same approach as for GLMs, which should be a bit less efficient, but we don't care that much. |
I didn't end up figuring out why the log-likelihood was implemented the way it was. It didn't match This is a weird PR because there exists a lot of code for |
Indeed, multiplying the residual for each observation by its weight doesn't make sense. |
src/lm.jl
Outdated
|
||
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y)) | ||
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y) | |
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = | |
LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y) |
src/lm.jl
Outdated
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y)) | ||
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y) | ||
|
||
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts)) | |
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = | |
LmResp(float(y), float(wts)) |
src/lm.jl
Outdated
sw += log(w) | ||
y = r.y | ||
if isempty(wts) | ||
@show mu = mean(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@show mu = mean(y) | |
mu = mean(y) |
src/lm.jl
Outdated
ll | ||
end | ||
|
||
function residuals(r::LmResp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a one-liner?
src/lm.jl
Outdated
@@ -126,24 +145,31 @@ end | |||
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp) | |||
|
|||
function StatsBase.fit!(obj::LinearModel) | |||
installbeta!(delbeta!(obj.pp, obj.rr.y)) | |||
if length(obj.rr.wts) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if length(obj.rr.wts) == 0 | |
if isempty(obj.rr.wts) |
src/lm.jl
Outdated
""" | ||
lm(X, y, allowrankdeficient::Bool=false) = fit(LinearModel, X, y, allowrankdeficient) | ||
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...) | |
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = | |
fit(LinearModel, X, y, allowrankdeficient; kwargs...) |
test/runtests.jl
Outdated
@@ -46,6 +46,26 @@ linreg(x::AbstractVecOrMat, y::AbstractVector) = qr!(simplemm(x)) \ y | |||
@test isapprox(coef(lm1), coef(lm2) .* [1., 10.]) | |||
end | |||
|
|||
@testset "weights" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@testset "weights" begin | |
@testset "linear model with weights" begin |
I can't find any reference to Pseudo-R^2. Is that something that lives in GLM? Not sure how to test it. |
|
I mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
The definition of loglikelihood (and nullloglikelihood) seems incorrect to me. As defined in StatsBase (https://juliastats.org/StatsBase.jl/latest/statmodels/#StatsBase.deviance), the deviance is Also for GLMs, the definition of the loglikelihood is a bit confusing to me, I don't get why the deviance is used as an estimate of the second parameter of the distribution. Is the deviance supposed to be an estimate of the variance? [EDIT] Sorry, I just realized that this was not introduced by this PR, I will open an issue. #356 |
Addresses #258. It definitely doesn't work right now, as I haven't been able to track down all the up-stream functions that need to use the weights of a
LmResp
object.