Taking weighting seriously #487
base: master
Changes from 81 commits
@@ -12,8 +12,8 @@ julia> using DataFrames, GLM, StatsBase

 julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])
 3×2 DataFrame
- Row │ X      Y
-     │ Int64  Int64
+ Row │ X      Y
+     │ Int64  Int64

> Review comment: trailing whitespace probably should be stripped.

 ─────┼──────────────
    1 │     1      2
    2 │     2      4
@@ -61,7 +61,7 @@ julia> dof(ols)
 3

 julia> dof_residual(ols)
-1.0
+1

 julia> round(aic(ols); digits=5)
 5.84252
@@ -91,8 +91,8 @@ julia> round.(vcov(ols); digits=5)
 ```jldoctest
 julia> data = DataFrame(X=[1,2,2], Y=[1,0,1])
 3×2 DataFrame
- Row │ X      Y
-     │ Int64  Int64
+ Row │ X      Y
+     │ Int64  Int64
 ─────┼──────────────
    1 │     1      1
    2 │     2      0
@@ -196,8 +196,8 @@ julia> using GLM, RDatasets

 julia> form = dataset("datasets", "Formaldehyde")
 6×2 DataFrame
- Row │ Carb     OptDen
-     │ Float64  Float64
+ Row │ Carb     OptDen
+     │ Float64  Float64
 ─────┼──────────────────
    1 │     0.1    0.086
    2 │     0.3    0.269
|
@@ -350,8 +350,8 @@ julia> dobson = DataFrame(Counts = [18.,17,15,20,10,21,25,13,13], | |
Outcome = categorical([1,2,3,1,2,3,1,2,3]), | ||
Treatment = categorical([1,1,1,2,2,2,3,3,3])) | ||
9×3 DataFrame | ||
Row │ Counts Outcome Treatment | ||
│ Float64 Cat… Cat… | ||
Row │ Counts Outcome Treatment | ||
│ Float64 Cat… Cat… | ||
─────┼───────────────────────────── | ||
1 │ 18.0 1 1 | ||
2 │ 17.0 2 1 | ||
|
@@ -390,29 +390,8 @@ In this example, we choose the best model from a set of λs, based on minimum BIC.
 ```jldoctest
 julia> using GLM, RDatasets, StatsBase, DataFrames, Optim

-julia> trees = DataFrame(dataset("datasets", "trees"))
-31×3 DataFrame
- Row │ Girth    Height  Volume
-     │ Float64  Int64   Float64
-─────┼──────────────────────────
-   1 │     8.3      70     10.3
-   2 │     8.6      65     10.3
-   3 │     8.8      63     10.2
-   4 │    10.5      72     16.4
-   5 │    10.7      81     18.8
-   6 │    10.8      83     19.7
-   7 │    11.0      66     15.6
-   8 │    11.0      75     18.2
-  ⋮  │    ⋮       ⋮        ⋮
-  25 │    16.3      77     42.6
-  26 │    17.3      81     55.4
-  27 │    17.5      82     55.7
-  28 │    17.9      80     58.3
-  29 │    18.0      80     51.5
-  30 │    18.0      80     51.0
-  31 │    20.6      87     77.0
-                16 rows omitted
+julia> trees = DataFrame(dataset("datasets", "trees"));

 julia> bic_glm(λ) = bic(glm(@formula(Volume ~ Height + Girth), trees, Normal(), PowerLink(λ)));

 julia> optimal_bic = optimize(bic_glm, -1.0, 1.0);
@@ -123,6 +123,110 @@ x: 4 -0.032673 0.0797865 -0.41 0.6831 -0.191048 0.125702
 ───────────────────────────────────────────────────────────────────────────
 ```

## Weighting
Both `lm` and `glm` allow weighted estimation. The three different
[types of weights](https://juliastats.org/StatsBase.jl/stable/weights/) defined in
[StatsBase.jl](https://github.com/JuliaStats/StatsBase.jl) can be used to fit a model:

> Review comment: what about …
- `AnalyticWeights` describe a non-random relative importance (usually between 0 and 1) for
  each observation. These weights may also be referred to as reliability weights, precision
  weights or inverse variance weights. These are typically used when the observations being
  weighted are aggregate values (e.g., averages) with differing variances.
- `FrequencyWeights` describe the number of times (or frequency) each observation was
  observed. These weights may also be referred to as case weights or repeat weights.
- `ProbabilityWeights` represent the inverse of the sampling probability for each
  observation, providing a correction mechanism for under- or over-sampling certain
  population groups. These weights may also be referred to as sampling weights.

> Review comment (on lines +136 to +140): Let's use the same wording as in StatsBase for
> simplicity. If we want to improve it, we'll change it everywhere.
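As a rough illustration of the distinction (a plain-Julia sketch with made-up numbers, not GLM.jl internals): all three weight types agree on point estimates such as a weighted mean, but frequency weights change the implied sample size, which is what later drives differences in standard errors and log-likelihoods.

```julia
# Sketch: the three weight types agree on point estimates (here, a weighted
# mean) but imply different effective sample sizes. All numbers are made up.
x = [2.0, 4.0, 6.0]
w = [1.0, 2.0, 3.0]

wmean = sum(w .* x) / sum(w)   # same value whatever the weight type

n_freq  = sum(w)       # FrequencyWeights: each row counts w[i] times -> 6.0
n_other = length(x)    # Analytic/ProbabilityWeights: still 3 observations

println((wmean, n_freq, n_other))   # → (4.666666666666667, 6.0, 3)
```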
> Review comment: can we add a comment somewhere how these weights are later treated in
> estimation?

To indicate which kind of weights should be used, the vector of weights must be wrapped in
one of the three weights types, and then passed to the `weights` keyword argument.
Short-hand functions `aweights`, `fweights`, and `pweights` can be used to construct
`AnalyticWeights`, `FrequencyWeights`, and `ProbabilityWeights`, respectively.
||||||||||||||||||||||
We illustrate the API with randomly generated data. | ||||||||||||||||||||||
|
||||||||||||||||||||||
```jldoctest weights | ||||||||||||||||||||||
julia> using StableRNGs, DataFrames, GLM | ||||||||||||||||||||||
|
||||||||||||||||||||||
julia> data = DataFrame(y = rand(StableRNG(1), 100), x = randn(StableRNG(2), 100), weights = repeat([1, 2, 3, 4], 25), ); | ||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The result seems inconsistent with the comment below. Passing |
||||||||||||||||||||||
|
||||||||||||||||||||||
julia> m = lm(@formula(y ~ x), data)
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.517369   0.0280232   18.46    <1e-32   0.461758  0.57298
x            -0.0500249  0.0307201   -1.63    0.1066  -0.110988  0.0109382
──────────────────────────────────────────────────────────────────────────

julia> m_aweights = lm(@formula(y ~ x), data, wts=aweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673    0.0270707   19.09    <1e-34   0.463009  0.570451
x            -0.0478667  0.0308395   -1.55    0.1239  -0.109067  0.0133333
──────────────────────────────────────────────────────────────────────────

julia> m_fweights = lm(@formula(y ~ x), data, wts=fweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673    0.0170172   30.37    <1e-84   0.483213   0.550246
x            -0.0478667  0.0193863   -2.47    0.0142  -0.0860494  -0.00968394
─────────────────────────────────────────────────────────────────────────────

julia> m_pweights = lm(@formula(y ~ x), data, wts=pweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673    0.0288654   17.90    <1e-31   0.459447  0.574012
x            -0.0478667  0.0266884   -1.79    0.0760  -0.100829  0.00509556
───────────────────────────────────────────────────────────────────────────
```
!!! warning

    In the old API, weights were passed as `AbstractVector`s and were silently treated as
    `FrequencyWeights` in the internal computation of standard errors and related
    quantities. Passing weights as an `AbstractVector` is still allowed for backward
    compatibility, but it is deprecated. When weights are passed following the old API,
    they are now coerced to `FrequencyWeights` and a deprecation warning is issued.

The type of the weights affects the variance of the estimated coefficients and the
quantities involving this variance. The coefficient point estimates are the same
regardless of the type of weights.
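That invariance of the point estimates can be checked with plain weighted least squares (a sketch with made-up data, independent of GLM.jl): the normal-equations solution depends only on relative weights, so rescaling the weight vector, or reinterpreting it as a different weight type, leaves the coefficients unchanged.

```julia
using LinearAlgebra

# Sketch (made-up data): weighted least squares point estimates depend only
# on relative weights, so all three weight types yield the same coefficients.
X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
y = [1.1, 1.9, 3.2, 3.9]
w = [1.0, 2.0, 3.0, 4.0]

wls(wv) = (X' * Diagonal(wv) * X) \ (X' * Diagonal(wv) * y)

β  = wls(w)
β2 = wls(10 .* w)   # rescaled weights: same solution up to rounding

@assert isapprox(β, β2; rtol=1e-10)
```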
```jldoctest weights
julia> loglikelihood(m_aweights)
-16.296307561384253

julia> loglikelihood(m_fweights)
-25.51860961756451

julia> loglikelihood(m_pweights)
-16.296307561384253
```
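The pattern above — the frequency-weighted log-likelihood differing from the other two — is consistent with the usual convention that frequency weights multiply each observation's log-density, inflating the effective sample size to the sum of the weights. A plain-Julia sketch of that convention (assumed formulas with made-up residuals, for illustration only, not GLM.jl's source):

```julia
# Sketch: under frequency weighting each observation contributes w[i] copies
# of its Gaussian log-density, so the total depends on n_eff = sum(w).
# Residuals, variance, and weights below are all made up.
gauss_ll(r, s2) = -0.5 * (log(2π * s2) + r^2 / s2)

r  = [0.1, -0.2, 0.15]
w  = [1.0, 2.0, 3.0]
s2 = 0.02

ll_plain = sum(gauss_ll.(r, s2))        # unweighted: n_eff = 3
ll_freq  = sum(w .* gauss_ll.(r, s2))   # frequency-weighted: n_eff = 6

# With unit weights the two conventions coincide:
@assert sum(ones(3) .* gauss_ll.(r, s2)) == ll_plain
```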
## Comparing models with F-test

Comparisons between two or more linear models can be performed using the `ftest` function,
@@ -176,8 +280,8 @@ Many of the methods provided by this package have names similar to those in [R](
 - `vcov`: variance-covariance matrix of the coefficient estimates

-Note that the canonical link for negative binomial regression is `NegativeBinomialLink`, but
-in practice one typically uses `LogLink`.
+Note that the canonical link for negative binomial regression is `NegativeBinomialLink`,
+but in practice one typically uses `LogLink`.

 ```jldoctest methods
 julia> using GLM, DataFrames, StatsBase
@@ -209,7 +313,9 @@ julia> round.(predict(mdl, test_data); digits=8) | |||||||||||||||||||||
9.33333333 | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
|
||||||||||||||||||||||
The [`cooksdistance`](@ref) method computes [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for each observation used to fit a linear model, giving an estimate of the influence of each data point. | ||||||||||||||||||||||
The [`cooksdistance`](@ref) method computes | ||||||||||||||||||||||
[Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for each observation | ||||||||||||||||||||||
used to fit a linear model, giving an estimate of the influence of each data point. | ||||||||||||||||||||||
Note that it's currently only implemented for linear models without weights. | ||||||||||||||||||||||
|
||||||||||||||||||||||
```jldoctest methods
@@ -11,17 +11,18 @@ module GLM
 import LinearAlgebra: cholesky, cholesky!
 import Statistics: cor
 import StatsBase: coef, coeftable, coefnames, confint, deviance, nulldeviance, dof, dof_residual,
-    loglikelihood, nullloglikelihood, nobs, stderror, vcov,
-    residuals, predict, predict!,
-    fitted, fit, model_response, response, modelmatrix, r2, r², adjr2, adjr², PValue
+    loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict, predict!,
+    fitted, fit, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
+    PValue, weights, leverage
 import StatsFuns: xlogy
 import SpecialFunctions: erfc, erfcinv, digamma, trigamma
 import StatsModels: hasintercept
 import Tables
 export coef, coeftable, confint, deviance, nulldeviance, dof, dof_residual,
-    loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict,
+    loglikelihood, nullloglikelihood, nobs, stderror, vcov, residuals, predict, predict!,
     fitted, fit, fit!, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
-    cooksdistance, hasintercept, dispersion
+    cooksdistance, hasintercept, dispersion, weights, AnalyticWeights, ProbabilityWeights, FrequencyWeights,
+    UnitWeights, uweights, fweights, pweights, aweights, leverage

> Review comment: Add the description of weights types to …

 export
     # types
> Review comment: Then I would add a weighted `lm` putting lower weight to observation 10
> in dataset III (an outlier), to show how the results change. Of course these are soft
> suggestions, but they would show the use of the things that we implement here.