
Taking weighting seriously #487

Open — wants to merge 89 commits into base: master (showing changes from 81 of 89 commits)
1754cbd
WIP
gragusa Jun 10, 2022
1d778a5
WIP
gragusa Jun 15, 2022
12121a3
WIP
gragusa Jun 15, 2022
4363ba4
Taking weights seriously
gragusa Jun 17, 2022
ca702dc
WIP
gragusa Jun 18, 2022
e2b2d12
Taking weights seriously
gragusa Jun 21, 2022
bc8709a
Merge branch 'master' of https://github.com/JuliaStats/GLM.jl into Ju…
gragusa Jun 21, 2022
84cd990
Add depwarn for passing wts with Vector
gragusa Jun 22, 2022
cbc329f
Cosmettic changes
gragusa Jun 22, 2022
23d67f5
WIP
gragusa Jun 23, 2022
f4d90a9
Fix loglik for weighted models
gragusa Jul 4, 2022
6b7d95c
Fix remaining issues
gragusa Jul 15, 2022
c236b82
Final commit
gragusa Jul 15, 2022
d4bd0c2
Merge branch 'master'
gragusa Jul 15, 2022
8bdfb55
Fix merge
gragusa Jul 15, 2022
3eb2ca4
Fix nulldeviance
gragusa Jul 16, 2022
63c8358
Bypass crossmodelmatrix drom StatsAPI
gragusa Jul 16, 2022
e93a919
Delete momentmatrix.jl
gragusa Jul 16, 2022
7bb0959
Delete scratch.jl
gragusa Jul 16, 2022
ded17a8
Delete settings.json
gragusa Jul 16, 2022
3346774
AbstractWeights are required to be real
gragusa Sep 5, 2022
7376e78
Update src/glmfit.jl
gragusa Sep 5, 2022
a738268
Apply suggestions from code review
gragusa Sep 5, 2022
c9459e7
Merge pull request #2 from JuliaStats/master
gragusa Sep 5, 2022
6af3ca5
Throw error if GlmResp are not AbastractWeights
gragusa Sep 5, 2022
0ded1d4
Addressing review comments
gragusa Sep 5, 2022
d923e48
Reexport aweights, pweights, fweights
gragusa Sep 5, 2022
84f27d1
Fixed remaining issues with null loglikelihood
gragusa Sep 6, 2022
8804dc1
Fix nullloglikelihood tests
gragusa Sep 6, 2022
7f3aa36
Do not dispatch on Weights but use if
gragusa Sep 6, 2022
f67a8e0
Do not dispatch on Weights use if
gragusa Sep 6, 2022
23a3e87
Fix inferred test
gragusa Sep 6, 2022
5481284
Use if instead of dispatching on Weights
gragusa Sep 6, 2022
d12222e
Add doc for weights and fix output
gragusa Sep 7, 2022
a17e812
Fix docs failures
gragusa Sep 7, 2022
58dec0c
Fix pweights stderror even for rank deficient des
gragusa Sep 7, 2022
a6f5c66
Add test for pweights stderror
gragusa Sep 7, 2022
92ddb1e
Export UnitWeights
gragusa Sep 7, 2022
0c61fff
Fix documentation
gragusa Sep 7, 2022
8b0e8e1
Mkae cooksdistance work with rank deficient design
gragusa Sep 7, 2022
f609f06
Test cooksdistance with rank deficient design
gragusa Sep 7, 2022
23f3d03
Fix CholeskyPivoted signature in docs
gragusa Sep 8, 2022
2749b84
Make nancolidx v1.0 and v1.1 friendly
gragusa Sep 8, 2022
82e472b
Fix signatures
gragusa Sep 9, 2022
2d6aaed
Correct implementation of momentmatrix
gragusa Sep 9, 2022
dbc9ae9
Test moment matrix
gragusa Sep 9, 2022
e0d9cdf
Apply suggestions from code review
gragusa Sep 23, 2022
46e8f92
Incorporate suggestions of reviewer
gragusa Sep 23, 2022
6df401b
Deals with review comments
gragusa Sep 24, 2022
ca15eb8
Small fix
gragusa Sep 24, 2022
0c18ae9
Small fix
gragusa Sep 25, 2022
54d68d1
Apply suggestions from code review
gragusa Oct 3, 2022
422a8cd
Merge branch 'master' into JuliaStats-master
gragusa Oct 3, 2022
d6d4e6b
Fix vcov dispatch for vcov
gragusa Oct 3, 2022
b457d74
Fix dispatch of _vcov
gragusa Oct 3, 2022
b087679
Revert changes
gragusa Oct 3, 2022
a44e137
Update src/glmfit.jl
gragusa Oct 3, 2022
11db2c4
Fix weighted keyword in modelmatrix
gragusa Oct 3, 2022
b649d4f
perf in nulldeviance for unweighted models
gragusa Oct 3, 2022
170148c
Merge branch 'JuliaStats-master' of github.com:gragusa/GLM.jl into Ju…
gragusa Oct 3, 2022
29c43cb
Fixed std error for probability weights
gragusa Oct 19, 2022
279e533
Getting there (& switch Analytics to Importance)
gragusa Oct 20, 2022
afb145e
.= instead of copy!
gragusa Oct 20, 2022
2cead0a
Remove comments
gragusa Oct 20, 2022
a1ec49f
up
gragusa Oct 20, 2022
97bf28d
Speedup cooksdistance
gragusa Oct 23, 2022
9ce2d89
Revert back to AnalyticWeights
gragusa Oct 24, 2022
9bddf63
Add extensive tests for AnalyticWeights
gragusa Oct 24, 2022
3fe045a
Add extensive tests for AnalyticWeights
gragusa Oct 24, 2022
852e307
Delete scratch.jl
gragusa Oct 25, 2022
d1ba3e5
Delete analytic_weights.jl
gragusa Oct 25, 2022
831f280
Follow reviewer suggestions [Batch 1]
gragusa Nov 15, 2022
b00dc16
Follow reviewer's suggestions [Batch 2]
gragusa Nov 15, 2022
0825324
probability weights vcov uses momentmatrix
gragusa Nov 15, 2022
48d15fb
Fix ProbabilityWeights vcov and tests
gragusa Nov 16, 2022
3338eab
Use leverage from StasAPI
gragusa Nov 17, 2022
c27c749
Merge branch 'master' into JuliaStats-master
gragusa Nov 17, 2022
970e26e
Rebase against master
gragusa Nov 17, 2022
8832e9d
Fix test
gragusa Nov 17, 2022
9eb2390
Merge remote-tracking branch 'origin/master' into JuliaStats-master
gragusa Dec 20, 2022
587c129
Test on 1.6
gragusa Dec 20, 2022
fa63a9a
Address reviwer comments
gragusa Dec 29, 2022
807731a
Merge branch 'master' of github.com:JuliaStats/GLM.jl into JuliaStats…
gragusa Jun 16, 2023
72996fc
Merge branch 'master' into JuliaStats-master
andreasnoack Nov 19, 2024
1ee383a
Merge remote-tracking branch 'upstream/master' into JuliaStats-master
gragusa Nov 19, 2024
ba52ce9
Merge from origin
gragusa Nov 19, 2024
5e790df
Fix broken test of dof_residual
gragusa Nov 19, 2024
50c1a96
Fix testing issues
gragusa Nov 19, 2024
c4f7959
Fix docs
gragusa Nov 19, 2024
2 changes: 1 addition & 1 deletion .github/workflows/CI-stable.yml
@@ -19,7 +19,7 @@ jobs:
strategy:
fail-fast: false
matrix:
version: ['1.0', '1']
version: ['1.6', '1']
os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
arch: ['x64']
steps:
14 changes: 7 additions & 7 deletions docs/src/api.md
@@ -2,7 +2,7 @@

```@meta
DocTestSetup = quote
using CategoricalArrays, DataFrames, Distributions, GLM, RDatasets
using CategoricalArrays, DataFrames, Distributions, GLM, RDatasets, StableRNGs
end
```

@@ -22,7 +22,7 @@ GLM.ModResp

The most general approach to fitting a model is with the `fit` function, as in
```jldoctest
julia> using Random
julia> using GLM, StableRNGs

julia> fit(LinearModel, hcat(ones(10), 1:10), randn(MersenneTwister(12321), 10))
LinearModel
Expand All @@ -31,14 +31,14 @@ Coefficients:
────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────
x1 0.717436 0.775175 0.93 0.3818 -1.07012 2.50499
x2 -0.152062 0.124931 -1.22 0.2582 -0.440153 0.136029
x1 0.361896 0.69896 0.52 0.6186 -1.24991 1.9737
x2 -0.012125 0.112648 -0.11 0.9169 -0.271891 0.247641
────────────────────────────────────────────────────────────────
```

This model can also be fit as
```jldoctest
julia> using Random
julia> using GLM, StableRNGs

julia> lm(hcat(ones(10), 1:10), randn(MersenneTwister(12321), 10))
LinearModel
Expand All @@ -47,8 +47,8 @@ Coefficients:
────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────
x1 0.717436 0.775175 0.93 0.3818 -1.07012 2.50499
x2 -0.152062 0.124931 -1.22 0.2582 -0.440153 0.136029
x1 0.361896 0.69896 0.52 0.6186 -1.24991 1.9737
x2 -0.012125 0.112648 -0.11 0.9169 -0.271891 0.247641
Reviewer comment (Contributor):
Then I would add weighted lm putting lower weight to observation 10 in dataset III (an outlier), to show how the results change.

Of course these are soft suggestions, but would show the use of the things that we implement here.

────────────────────────────────────────────────────────────────
```

43 changes: 11 additions & 32 deletions docs/src/examples.md
@@ -12,8 +12,8 @@ julia> using DataFrames, GLM, StatsBase

julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])
3×2 DataFrame
Row │ X Y
│ Int64 Int64
Row │ X Y
Reviewer comment (Contributor):
trailing whitespace probably should be stripped.
Do we have doctests enabled?

│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 4
@@ -61,7 +61,7 @@ julia> dof(ols)
3

julia> dof_residual(ols)
1.0
1
nalimilan marked this conversation as resolved.

julia> round(aic(ols); digits=5)
5.84252
@@ -91,8 +91,8 @@ julia> round.(vcov(ols); digits=5)
```jldoctest
julia> data = DataFrame(X=[1,2,2], Y=[1,0,1])
3×2 DataFrame
Row │ X Y
│ Int64 Int64
Row │ X Y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 0
@@ -196,8 +196,8 @@ julia> using GLM, RDatasets

julia> form = dataset("datasets", "Formaldehyde")
6×2 DataFrame
Row │ Carb OptDen
│ Float64 Float64
Row │ Carb OptDen
│ Float64 Float64
─────┼──────────────────
1 │ 0.1 0.086
2 │ 0.3 0.269
@@ -350,8 +350,8 @@ julia> dobson = DataFrame(Counts = [18.,17,15,20,10,21,25,13,13],
Outcome = categorical([1,2,3,1,2,3,1,2,3]),
Treatment = categorical([1,1,1,2,2,2,3,3,3]))
9×3 DataFrame
Row │ Counts Outcome Treatment
│ Float64 Cat… Cat…
Row │ Counts Outcome Treatment
│ Float64 Cat… Cat…
─────┼─────────────────────────────
1 │ 18.0 1 1
2 │ 17.0 2 1
@@ -390,29 +390,8 @@ In this example, we choose the best model from a set of λs, based on minimum BI
```jldoctest
julia> using GLM, RDatasets, StatsBase, DataFrames, Optim

julia> trees = DataFrame(dataset("datasets", "trees"))
31×3 DataFrame
Row │ Girth Height Volume
│ Float64 Int64 Float64
─────┼──────────────────────────
1 │ 8.3 70 10.3
2 │ 8.6 65 10.3
3 │ 8.8 63 10.2
4 │ 10.5 72 16.4
5 │ 10.7 81 18.8
6 │ 10.8 83 19.7
7 │ 11.0 66 15.6
8 │ 11.0 75 18.2
⋮ │ ⋮ ⋮ ⋮
25 │ 16.3 77 42.6
26 │ 17.3 81 55.4
27 │ 17.5 82 55.7
28 │ 17.9 80 58.3
29 │ 18.0 80 51.5
30 │ 18.0 80 51.0
31 │ 20.6 87 77.0
16 rows omitted

julia> trees = DataFrame(dataset("datasets", "trees"));

julia> bic_glm(λ) = bic(glm(@formula(Volume ~ Height + Girth), trees, Normal(), PowerLink(λ)));

julia> optimal_bic = optimize(bic_glm, -1.0, 1.0);
112 changes: 109 additions & 3 deletions docs/src/index.md
@@ -123,6 +123,110 @@ x: 4 -0.032673 0.0797865 -0.41 0.6831 -0.191048 0.125702
───────────────────────────────────────────────────────────────────────────
```

## Weighting

nalimilan marked this conversation as resolved.
Both `lm` and `glm` allow weighted estimation. The three different
[types of weights](https://juliastats.org/StatsBase.jl/stable/weights/) defined in
[StatsBase.jl](https://github.com/JuliaStats/StatsBase.jl) can be used to fit a model:

Reviewer comment (Contributor):
what about UnitWeights?

- `AnalyticWeights` describe a non-random relative importance (usually between 0 and 1) for
each observation. These weights may also be referred to as reliability weights, precision
weights or inverse variance weights. These are typically used when the observations being
weighted are aggregate values (e.g., averages) with differing variances.
- `FrequencyWeights` describe the inverse of the sampling probability for each observation,
providing a correction mechanism for under- or over-sampling certain population groups.
These weights may also be referred to as sampling weights.
- `ProbabilityWeights` describe how the sample can be scaled back to the population.
Usually are the reciprocals of sampling probabilities.
Comment on lines +136 to +140
Reviewer comment (Member):
Let's use the same wording as in StatsBase for simplicity. If we want to improve it, we'll change it everywhere.

Suggested change
- `FrequencyWeights` describe the inverse of the sampling probability for each observation,
providing a correction mechanism for under- or over-sampling certain population groups.
These weights may also be referred to as sampling weights.
- `ProbabilityWeights` describe how the sample can be scaled back to the population.
Usually are the reciprocals of sampling probabilities.
- `FrequencyWeights` describe the number of times (or frequency) each observation was seen.
These weights may also be referred to as case weights or repeat weights.
- `ProbabilityWeights` represent the inverse of the sampling probability for each observation,
providing a correction mechanism for under- or over-sampling certain population groups.
These weights may also be referred to as sampling weights.


Reviewer comment (Contributor):
can we add a comment somewhere how these weights are later treated in estimation?

To indicate which kind of weights should be used, the vector of weights must be wrapped in
one of the three weights types, and then passed to the `weights` keyword argument.
Short-hand functions `aweights`, `fweights`, and `pweights` can be used to construct
`AnalyticWeights`, `FrequencyWeights`, and `ProbabilityWeights`, respectively.

We illustrate the API with randomly generated data.

```jldoctest weights
julia> using StableRNGs, DataFrames, GLM

julia> data = DataFrame(y = rand(StableRNG(1), 100), x = randn(StableRNG(2), 100), weights = repeat([1, 2, 3, 4], 25), );
Reviewer comment (Contributor, @bkamins, Nov 9, 2022):
The result seems inconsistent with the comment below. Passing AbstractVector does not produce the same results as using fweights below. Am I missing something?


julia> m = lm(@formula(y ~ x), data)
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.517369 0.0280232 18.46 <1e-32 0.461758 0.57298
x -0.0500249 0.0307201 -1.63 0.1066 -0.110988 0.0109382
──────────────────────────────────────────────────────────────────────────

julia> m_aweights = lm(@formula(y ~ x), data, wts=aweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.51673 0.0270707 19.09 <1e-34 0.463009 0.570451
x -0.0478667 0.0308395 -1.55 0.1239 -0.109067 0.0133333
──────────────────────────────────────────────────────────────────────────

julia> m_fweights = lm(@formula(y ~ x), data, wts=fweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 0.51673 0.0170172 30.37 <1e-84 0.483213 0.550246
x -0.0478667 0.0193863 -2.47 0.0142 -0.0860494 -0.00968394
─────────────────────────────────────────────────────────────────────────────

julia> m_pweights = lm(@formula(y ~ x), data, wts=pweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 0.51673 0.0288654 17.90 <1e-31 0.459447 0.574012
x -0.0478667 0.0266884 -1.79 0.0760 -0.100829 0.00509556
───────────────────────────────────────────────────────────────────────────
```

!!! warning

In the old API, weights were passed as `AbstractVectors` and were silently treated in
the internal computation of standard errors and related quantities as `FrequencyWeights`.
Passing weights as `AbstractVector` is still allowed for backward compatibility, but it
is deprecated. When weights are passed following the old API, they are now coerced to
`FrequencyWeights` and a deprecation warning is issued.

The type of the weights will affect the variance of the estimated coefficients and the
quantities involving this variance. The coefficient point estimates will be the same
regardless of the type of weights.
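The invariance of the point estimates can be sketched numerically outside Julia. The following pure-Python snippet (illustrative only — this is not GLM.jl code, and the data and function name are made up) shows that weighted least squares coefficients depend only on the relative weight values: rescaling the weight vector, which is essentially what distinguishes the different wrapper types at estimation time, leaves the coefficients unchanged, while the standard-error formulas differ by type.

```python
# Hypothetical illustration (not GLM.jl code): WLS point estimates for
# y ~ 1 + x depend only on relative weights, not on their scale or label.
def wls_simple(x, y, w):
    """Closed-form weighted least squares for a simple regression y ~ 1 + x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.7]
w = [1.0, 2.0, 3.0, 4.0, 1.0]

b0, b1 = wls_simple(x, y, w)
# Rescaling the weights (e.g. normalizing them, as analytic weights often
# are) leaves the point estimates unchanged:
b0s, b1s = wls_simple(x, y, [wi / sum(w) for wi in w])
assert abs(b0 - b0s) < 1e-12 and abs(b1 - b1s) < 1e-12
```

Only the variance computation distinguishes the weight types: frequency weights change the effective sample size, while probability weights call for a sandwich-type variance estimator.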

```jldoctest weights
julia> loglikelihood(m_aweights)
-16.296307561384253

julia> loglikelihood(m_fweights)
-25.51860961756451

julia> loglikelihood(m_pweights)
-16.296307561384253
```

## Comparing models with F-test

Comparisons between two or more linear models can be performed using the `ftest` function,
@@ -176,8 +280,8 @@ Many of the methods provided by this package have names similar to those in [R](
- `vcov`: variance-covariance matrix of the coefficient estimates


Note that the canonical link for negative binomial regression is `NegativeBinomialLink`, but
in practice one typically uses `LogLink`.
Note that the canonical link for negative binomial regression is `NegativeBinomialLink`,
but in practice one typically uses `LogLink`.

```jldoctest methods
julia> using GLM, DataFrames, StatsBase
@@ -209,7 +313,9 @@ julia> round.(predict(mdl, test_data); digits=8)
9.33333333
```

The [`cooksdistance`](@ref) method computes [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for each observation used to fit a linear model, giving an estimate of the influence of each data point.
The [`cooksdistance`](@ref) method computes
[Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for each observation
used to fit a linear model, giving an estimate of the influence of each data point.
Note that it's currently only implemented for linear models without weights.
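Cook's distance combines each observation's residual with its leverage. The following pure-Python sketch (illustrative only — the function name and data are hypothetical, and this is not the GLM.jl implementation) applies the textbook formula D_i = r_i² · h_ii / (p · s² · (1 − h_ii)²) to a simple regression with one high-leverage point:

```python
# Illustrative sketch (not GLM.jl code): Cook's distance for an unweighted
# simple linear regression y ~ 1 + x via the residual/leverage formula.
def cooks_distance(x, y):
    n, p = len(x), 2                       # p = number of coefficients
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    intercept = sum(y) / n - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - p)            # residual variance
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # hat-matrix diagonal
    return [r * r * h / (p * s2 * (1 - h) ** 2) for r, h in zip(resid, lev)]

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # last observation has high leverage
y = [1.1, 1.9, 3.2, 4.1, 9.5]
d = cooks_distance(x, y)
# The high-leverage last point dominates the distances.
assert d[4] == max(d)
```

In GLM.jl itself this quantity is obtained with `cooksdistance(model)`; the sketch only mirrors the standard formula.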

```jldoctest methods
11 changes: 6 additions & 5 deletions src/GLM.jl
@@ -11,17 +11,18 @@ module GLM
import LinearAlgebra: cholesky, cholesky!
import Statistics: cor
import StatsBase: coef, coeftable, coefnames, confint, deviance, nulldeviance, dof, dof_residual,
loglikelihood, nullloglikelihood, nobs, stderror, vcov,
residuals, predict, predict!,
fitted, fit, model_response, response, modelmatrix, r2, r², adjr2, adjr², PValue
loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict, predict!,
fitted, fit, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
PValue, weights, leverage
import StatsFuns: xlogy
import SpecialFunctions: erfc, erfcinv, digamma, trigamma
import StatsModels: hasintercept
import Tables
export coef, coeftable, confint, deviance, nulldeviance, dof, dof_residual,
loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict,
loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict, predict!,
fitted, fit, fit!, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
cooksdistance, hasintercept, dispersion
cooksdistance, hasintercept, dispersion, weights, AnalyticWeights, ProbabilityWeights, FrequencyWeights,
UnitWeights, uweights, fweights, pweights, aweights, leverage

Reviewer comment (Member):
Add the description of weights types to COMMON_FIT_KWARGS_DOCS below.

export
# types