Taking weighting seriously #487
Conversation
Codecov Report — Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.11%   85.38%    -4.74%
==========================================
  Files           8        8
  Lines        1123     1286      +163
==========================================
+ Hits         1012     1098       +86
- Misses        111      188       +77
```
Hey, would that fix the issue I am having: if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of the weights vector. I think the interface should allow weights to be given as a DataFrame column, which would take care of such things (as it does for the other variables).
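For illustration, a minimal sketch of the mismatch described above; the data and column names are made up, and the call uses the current `wts` keyword:

```julia
using DataFrames, GLM

# Hypothetical data: one row has a missing response.
df = DataFrame(y = [1.0, 2.0, missing, 4.0],
               x = [0.5, 1.0, 1.5, 2.0],
               w = [1.0, 2.0, 1.0, 3.0])

# The formula interface drops the row with `missing`, but `df.w` is passed as a
# plain vector and is not subset accordingly, so (as reported above) the weights
# no longer line up with the 3 remaining rows.
lm(@formula(y ~ x), df, wts = df.w)
```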
Not really, but it would be easy to make this a feature. Before digging further into this, I would like to know whether there is consensus on the approach of this PR.
FYI this appears to fix #420; a PR was started in #432, and the author closed it for lack of time to investigate CI failures. Here's the test case pulled from #432, which passes with the changes in #487.

```julia
@testset "collinearity and weights" begin
rng = StableRNG(1234321)
x1 = randn(100)
x1_2 = 3 * x1
x2 = 10 * randn(100)
x2_2 = -2.4 * x2
y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
f = @formula(y ~ x1 + x2 + x3 + x4)
lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
X = [ones(length(y)) x1_2 x2_2]
W = Diagonal(df.weights)
coef_naive = (X'W*X)\X'W*y
@test lm_model.model.pp.chol isa CholeskyPivoted
@test rank(lm_model.model.pp.chol) == 3
@test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end
```

Can this test set be added? Is there any other feedback for @gragusa? It would be great to get this merged if it's good to go.
Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.
I've read the code. Lots of comments, but all of them are minor. The main one is mostly stylistic: in most cases it seems that using `if wts isa UnitWeights` inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!

What are your thoughts regarding testing? There are a lot of combinations to test, and it's not easy to see how to integrate that into the current organization of the tests. One way would be to add code for each kind of test to each `@testset` that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.
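For concreteness, a sketch of the two styles being compared; the function names are invented purely for illustration and are not part of the PR:

```julia
using StatsBase: AbstractWeights, UnitWeights, uweights, fweights

# Style A: several methods, dispatching on the weights type.
working_n_dispatch(wts::UnitWeights) = length(wts)
working_n_dispatch(wts::AbstractWeights) = sum(wts)

# Style B (suggested in the review): a single method with an explicit branch.
working_n_branch(wts::AbstractWeights) =
    wts isa UnitWeights ? length(wts) : sum(wts)

working_n_dispatch(uweights(5))        # 5
working_n_branch(fweights([2, 3, 1]))  # 6
```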
A very nice PR. In the tests, can we have a test set that compares the results of …
CI failures on Julia 1.0 can be fixed by requiring Julia 1.6 (more and more packages have started doing that).
Sorry for the noise, but thank you @gragusa and reviewers for this big PR. As a user I've been watching for weighting for a while and appreciate the technical expertise and dedication to quality here.
@nalimilan let's give this a final push. Should I rebase this PR against #339? (Rhetorical question!) What's the most efficient way?
Yes, the PR needs to be rebased against master -- or, simpler, merge master into the branch. Most conflicts seem relatively simple to resolve. You can try doing this online on GitHub, though there's always a chance that it won't be 100% correct the first time. Otherwise you can do that locally with `git fetch; git merge origin/master`. Or I can do it in a few days if you want.
Don't worry, I am already on it.
Thanks for rebasing! I have more comments, and @bkamins had made a few above too.
```julia
            1.8686815106332157 0.0 0.0 0.0 1.8686815106332157;
            0.010149793505874801 0.010149793505874801 0.0 0.0 0.010149793505874801;
            -1.8788313148033928 -0.0 -1.8788313148033928 -0.0 -1.8788313148033928]
@test mm0_pois ≈ GLM.momentmatrix(gm_pois) atol=1e-06
```
Remove double space here and elsewhere.
```julia
f = @formula(admit ~ 1 + rank)
gm_bin = fit(GeneralizedLinearModel, f, admit_agr, Binomial(); rtol=1e-8)
gm_binw = fit(GeneralizedLinearModel, f, admit_agr, Binomial(),
              wts=aweights(admit_agr.count); rtol=1e-08)
```
Any reason to use analytic weights rather than frequency weights? Here I think the latter make more sense for this dataset.
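If that change is made, the second fit would presumably become something like the following, reusing `f` and `admit_agr` from the quoted test and `fweights` from StatsBase (a sketch, not part of the current diff):

```julia
# Frequency weights: `count` records how many times each aggregated row was observed.
gm_binw = fit(GeneralizedLinearModel, f, admit_agr, Binomial(),
              wts=fweights(admit_agr.count); rtol=1e-08)
```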
- `FrequencyWeights` describe the inverse of the sampling probability for each observation,
  providing a correction mechanism for under- or over-sampling certain population groups.
  These weights may also be referred to as sampling weights.
- `ProbabilityWeights` describe how the sample can be scaled back to the population.
  Usually are the reciprocals of sampling probabilities.
Let's use the same wording as in StatsBase for simplicity. If we want to improve it, we'll change it everywhere.
Suggested change:

- `FrequencyWeights` describe the number of times (or frequency) each observation was seen.
  These weights may also be referred to as case weights or repeat weights.
- `ProbabilityWeights` represent the inverse of the sampling probability for each observation,
  providing a correction mechanism for under- or over-sampling certain population groups.
  These weights may also be referred to as sampling weights.
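For reference, the corresponding StatsBase constructors (the values below are made up for illustration):

```julia
using StatsBase

fweights([2, 1, 3])        # FrequencyWeights: each observation was seen 2, 1, and 3 times
aweights([0.2, 0.5, 0.3])  # AnalyticWeights: relative precision of each observation
pweights([2.5, 3.3, 3.3])  # ProbabilityWeights: inverses of the sampling probabilities
uweights(3)                # UnitWeights: the unweighted case
```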
```diff
 fitted, fit, fit!, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
-cooksdistance, hasintercept, dispersion
+cooksdistance, hasintercept, dispersion, weights, AnalyticWeights, ProbabilityWeights, FrequencyWeights,
+UnitWeights, uweights, fweights, pweights, aweights, leverage
```
Add the description of the weights types to `COMMON_FIT_KWARGS_DOCS` below.
@nalimilan were there remaining fixes needed to have this PR completed? I was worried that the important work brought by this PR would lose its momentum.
Mostly time 🤣

I think there are a few things to fix (addressing all the comments of @nalimilan) and a few decisions to make…

I could find some time in the next week to finish it if @nalimilan has some time to support me.
Sure!
While we are at weights, my question is whether we should not also update the …
I agree that we would need to carefully consider all cases of weights. I have not thought about probability weights. However, for frequency weights and analytical weights, assuming we produce correct … I have just checked against the examples in Wooldridge, chapter 8, and properly scaled analytical weights produce the correct result. Also, in general, I think we should ensure that every function in GLM.jl that accepts a model estimated with weighting should either: …

(It does not have to be in this PR, but if we are taking weighting seriously, I think we should ensure this property when we make a release.) Thank you for working on this!
In survey datasets, weights are commonly calibrated to sum up to an (integral) population size.
While most applications of the F-test have integral degrees of freedom, non-integral degrees of freedom pose no problem for the distribution functions. In R:

```r
> df(1.2, df1 = 10, df2 = 20)
[1] 0.5626125
> df(1.2, df1 = 10, df2 = 20.1)
[1] 0.5630353
```

Julia and R agree:

```julia
julia> using Distributions

julia> d = FDist(10, 20)
FDist{Float64}(ν1=10.0, ν2=20.0)

julia> pdf(d, 1.2)
0.5626124566227022

julia> d = FDist(10, 20.1)
FDist{Float64}(ν1=10.0, ν2=20.1)

julia> pdf(d, 1.2)
0.5630352744353205
```

There is this StackExchange post discussing non-integral dof for t-tests, and for GAMs in this post.
The F-test is essentially the ratio of two variances. For the weighted GLM case, variances based on weighted least squares could be used to calculate the test statistic. Note: whether an (adjusted) F-test is the right approach for comparing weighted GLM models is up for debate...
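As a sketch of the computation being described, with made-up numbers for the statistic and the (non-integral) degrees of freedom:

```julia
using Distributions

# Hypothetical F statistic obtained as a ratio of two weighted least-squares
# variance estimates, with a non-integral denominator dof, e.g. coming from
# calibrated survey weights.
F   = 1.2
df1 = 10
df2 = 20.1
p   = ccdf(FDist(df1, df2), F)   # upper-tail p-value of the F-test
```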
Hmm, did any of the people who worked on Survey.jl leave comments here? @iuliadmtru @aviks
I finally found the time to rebase this PR against the latest master. I have a few days of "free" time and would like to finish this. @nalimilan, it is difficult to track the comments and which ones were addressed by the various commits. On my side, the primary decision is about weight scaling. But before engaging in a conversation, I will add documentation so that whoever contributes to the discussion can do so coherently. Tests passed!
Cool. Do you need any input from my side?
Hi there! I wonder what will happen to this PR? As I understand it, one review from a person with write access is needed?
Just wanted to give a quick update on the PR.

The PR was almost ready to go, but now, with more PRs being merged, there are a few things that need to be straightened out. I should be able to work on it again next week to make sure everything's in good shape. Then I hope somebody will help get this merged.
@gragusa Any chance that you'd be able to look at the remaining items here? It would be good to get this in for a 2.0 release.
@andreasnoack I merged my branch with base. Tests are passing (documentation is failing, but that is easy to fix). There were a few outstanding decisions to make (mostly about ftest and other peripheral methods), but I need to review the code and see where we stand. I only have a little time, but if I get some help, I could add the finishing touches. For instance, there is JuliaStats/StatsAPI.jl#16 to merge eventually.
This PR addresses several problems with the current GLM implementation.

Current status

In master, GLM/LM only accepts weights through the keyword `wts`. These weights are implicitly frequency weights.

With this PR

`FrequencyWeights`, `AnalyticWeights`, and `ProbabilityWeights` are possible. The API is the following.
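Here is a sketch of what that API presumably looks like, using the `wts` keyword spelling that appears in the tests earlier in this thread; the data frame `df` and its columns are hypothetical:

```julia
using DataFrames, GLM, StatsBase

# Hypothetical data
df = DataFrame(y = randn(100), x = randn(100),
               freq = rand(1:5, 100), w = rand(100) .+ 0.5)

lm(@formula(y ~ x), df, wts = fweights(df.freq))    # frequency weights
lm(@formula(y ~ x), df, wts = aweights(df.w))       # analytic weights
lm(@formula(y ~ x), df, wts = pweights(1 ./ df.w))  # probability weights
```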
The old behavior -- passing a vector `wts=df.wts` -- is deprecated; for the moment, the array `df.wts` is coerced to `FrequencyWeights`.

To allow dispatching on the weights, `CholPred` takes a parameter `T<:AbstractWeights`. The unweighted LM/GLM has `UnitWeights` as the parameter for the type.

This PR also implements `residuals(r::RegressionModel; weighted::Bool=false)` and `modelmatrix(r::RegressionModel; weighted::Bool=false)`. The new signature for these two methods is pending in StatsAPI. There are many changes that I had to make to make everything work. Tests are passing, but some new features need new tests. Before implementing them, I wanted to ensure that the approach taken was liked.
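For the `weighted` keyword added to `residuals` and `modelmatrix` above, usage would presumably look like this (a sketch assuming the PR's branch; data and names are hypothetical):

```julia
using DataFrames, GLM, StatsBase

df = DataFrame(y = randn(50), x = randn(50), w = rand(50) .+ 0.5)  # hypothetical data
m  = lm(@formula(y ~ x), df, wts = aweights(df.w))

residuals(m)                     # residuals on the original scale
residuals(m; weighted = true)    # residuals incorporating the weights
modelmatrix(m; weighted = true)  # model matrix incorporating the weights
```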
I have also implemented `momentmatrix`, which returns the estimating function of the estimator. I came to the conclusion that it does not make sense to have a keyword argument `weighted`, so I will amend JuliaStats/StatsAPI.jl#16 to remove such a keyword from the signature.
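A hedged sketch of how `momentmatrix` might typically be used; the sandwich-style combination below is only an illustration, not something this PR provides:

```julia
using DataFrames, GLM, StatsBase

df = DataFrame(y = randn(50), x = randn(50), w = rand(50) .+ 0.5)  # hypothetical data
m  = glm(@formula(y ~ x), df, Normal(), wts = pweights(1 ./ df.w))

M = GLM.momentmatrix(m)  # n × p matrix of per-observation contributions to the estimating equations
B = M' * M               # a typical "meat" ingredient of a sandwich-type variance estimator
```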
Update

I think I covered all the suggestions/comments, with this exception, which I still have to think about. Maybe this can be addressed later. The new standard errors (the ones for `ProbabilityWeights`) also work in the rank-deficient case (and so does `cooksdistance`).

Tests are passing, and I think they cover everything that I have implemented. I also added a section in the documentation about using `Weights` and updated the `jldoc` with the new signature of `CholeskyPivoted`.

To do:

Closes #186.