Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct R² formulas and add Efron's R² #400

Merged
merged 7 commits into from
Oct 3, 2018
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 37 additions & 22 deletions src/statmodels.jl
Original file line number Diff line number Diff line change
Expand Up @@ -197,7 +197,13 @@ Coefficient of determination (R-squared).
For a linear model, the R² is defined as ``ESS/TSS``, with ``ESS`` the explained sum of squares
and ``TSS`` the total sum of squares.
"""
r2(obj::StatisticalModel) = mss(obj) / deviance(obj)
function r2(obj::StatisticalModel)
Base.depwarn("The default r² method for linear models is deprecated. " *
"Packages should define their own methods.", :r2)

μ = meanresponse(obj)
mss(obj) / sum(x -> abs2(x - μ), response(obj))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these take into accounts weights since weights would have been handled in mss and deviance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That also applies to the Efron R². For the method above, since it's deprecated, we'd better keep using deviance for now. For the Efron R², I guess we need to multiply observations by their weights, but will it work for all kinds of models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely: packages can handle weights through mss(obj) and loglikelihood(obj). There's no need to provide for weights here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure for ProbabilityWeights, but should work fine for the other weights supported if not mistaken. The tests should cover the weight cases for sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Efron's R² does need special treatment, since it uses the sum of residuals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know the formula for weighted Efron R²? I guess we just need to multiply both the numerator and the denominator?

I think so. How are we going to implement it? We can retrieve weights via weights, but how do we check if weights were defined in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default is just a one array. Might be nice to follow FillArrays.jl for it to implement the UnitWeight or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call weights and assume it returns a valid AbstractVector. Then it would make sense to define UnitWeights in this package, to complement other kinds of weights (#135).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump. Can you drop the Efron R² for now so that we can merge the PR? We can discuss improvements later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since #515 we have UnitWeights, so we can require models to return this when they are unweighted.

end

"""
r2(obj::StatisticalModel, variant::Symbol)
Expand All @@ -207,27 +213,37 @@ Pseudo-coefficient of determination (pseudo R-squared).

For nonlinear models, one of several pseudo R² definitions must be chosen via `variant`.
Supported variants are:
- `:MacFadden` (a.k.a. likelihood ratio index), defined as ``1 - \\log L/\\log L0``.
- `:CoxSnell`, defined as ``1 - (L0/L)^{2/n}``
- `:Nagelkerke`, defined as ``(1 - (L0/L)^{2/n})/(1 - L0^{2/n})``, with ``n`` the number
of observations (as returned by [`nobs`](@ref)).
- `:MacFadden` (a.k.a. likelihood ratio index), defined as ``1 - \\log (L)/\\log (L_0)``;
- `:CoxSnell`, defined as ``1 - (L_0/L)^{2/n}``;
- `:Nagelkerke`, defined as ``(1 - (L_0/L)^{2/n})/(1 - L_0^{2/n})``;
- `:Efron`, defined as ``1 - \\sum_i (y_i - \\hat{y}_i)^2 / \\sum_i (y_i - \\bar{y})^2``.

In the above formulas, ``L`` is the likelihood of the model, ``L0`` that of the null model
(the model including only the intercept). These two quantities are taken to be minus half
`deviance` of the corresponding models.
In the above formulas, ``L`` is the likelihood of the model,
``L_0`` is the likelihood of the null model (the model with only an intercept),
``n`` is the number of observations, ``y_i`` are the responses,
``\\hat{y}_i`` are fitted values and ``\\bar{y}`` is the average response.
ararslan marked this conversation as resolved.
Show resolved Hide resolved

Cox and Snell's and Efron's R² should match the classical R² for linear models.
"""
function r2(obj::StatisticalModel, variant::Symbol)
ll = -deviance(obj)/2
ll0 = -nulldeviance(obj)/2

if variant == :McFadden
1 - ll/ll0
elseif variant == :CoxSnell
1 - exp(2/nobs(obj) * (ll0 - ll))
elseif variant == :Nagelkerke
(1 - exp(2/nobs(obj) * (ll0 - ll)))/(1 - exp(2/nobs(obj) * ll0))
function r2(obj::StatisticalModel, variant::Symbol)
if variant == :Efron
y = response(obj)
ŷ = fitted(obj)
μ = meanresponse(obj)
1 - sum(x -> abs2(x[1] - x[2]), zip(y, ŷ)) / sum(x -> abs2(x - μ), response(obj))
else
error("variant must be one of :McFadden, :CoxSnell or :Nagelkerke")
ll = loglikelihood(obj)
ll0 = nullloglikelihood(obj)
if variant == :McFadden
1 - ll/ll0
elseif variant == :CoxSnell
1 - exp(2 * (ll0 - ll) / nobs(obj))
elseif variant == :Nagelkerke
(1 - exp(2 * (ll0 - ll) / obs(obj))) / (1 - exp(2 * ll0 / nobs(obj)))
else
error("variant must be one of :McFadden, :CoxSnell, :Nagelkerke or :Efron")
end
end
end

Expand All @@ -252,17 +268,16 @@ adjr2(obj::StatisticalModel) = error("adjr2 is not defined for $(typeof(obj)).")
Adjusted pseudo-coefficient of determination (adjusted pseudo R-squared).

For nonlinear models, one of the several pseudo R² definitions must be chosen via `variant`.
The only currently supported variant is `:MacFadden`, defined as ``1 - (\\log L - k)/\\log L0``.
The only currently supported variant is `:MacFadden`, defined as ``1 - (\\log (L) - k)/\\log (L0)``.
In this formula, ``L`` is the likelihood of the model, ``L0`` that of the null model
(the model including only the intercept). These two quantities are taken to be minus half
`deviance` of the corresponding models. ``k`` is the number of consumed degrees of freedom
of the model (as returned by [`dof`](@ref)).
"""
ararslan marked this conversation as resolved.
Show resolved Hide resolved
function adjr2(obj::StatisticalModel, variant::Symbol)
ll = -deviance(obj)/2
ll0 = -nulldeviance(obj)/2
ll = loglikelihood(obj)
ll0 = nullloglikelihood(obj)
k = dof(obj)

if variant == :McFadden
1 - (ll - k)/ll0
else
Expand Down