Correct R² formulas and add Efron's R² #400

lbittarello · 2018-07-22T00:07:46Z

This PR fixes the R² formulas, which were previously based on the deviance instead of the likelihood function. It also adds Efron's R².
The following variants were benchmarked against Stata's fitstat: McFadden, Cox and Snell, Nagelkerke and Efron.
For linear models, Cox and Snell's and Efron's R² should match the classical R².

Closes #396.
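
For readers wondering why Cox and Snell's R² should coincide with the classical R² in the linear case, here is a short derivation. It is a sketch that assumes a Gaussian linear model whose variance is estimated by maximum likelihood and a null model containing only an intercept; RSS and TSS denote the residual and total sums of squares, which are not named in the PR text.

```latex
\begin{aligned}
\log L   &= -\frac{n}{2}\left[\log\!\left(2\pi\,\frac{\mathrm{RSS}}{n}\right) + 1\right], \qquad
\log L_0  = -\frac{n}{2}\left[\log\!\left(2\pi\,\frac{\mathrm{TSS}}{n}\right) + 1\right], \\
\left(\frac{L_0}{L}\right)^{2/n}
         &= \exp\!\left(\frac{2}{n}\,(\log L_0 - \log L)\right)
          = \frac{\mathrm{RSS}}{\mathrm{TSS}}, \\
R^2_{\text{Cox--Snell}}
         &= 1 - \left(\frac{L_0}{L}\right)^{2/n}
          = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.
\end{aligned}
```

Efron's R² equals 1 - RSS/TSS by definition, so it matches the classical R² without any distributional assumption.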

codecov · 2018-07-22T00:26:45Z

Codecov Report

Merging #400 into master will decrease coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #400      +/-   ##
==========================================
- Coverage   97.97%   97.89%   -0.08%     
==========================================
  Files          17       16       -1     
  Lines        1282     1281       -1     
==========================================
- Hits         1256     1254       -2     
- Misses         26       27       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/statmodels.jl | `97.77% <ø> (ø)` | ⬆️ |
| src/scalarstats.jl | `96.66% <0%> (-0.72%)` | ⬇️ |
| src/weights.jl | `99.29% <0%> (-0.03%)` | ⬇️ |
| src/hist.jl | `97.18% <0%> (-0.02%)` | ⬇️ |
| src/moments.jl | `100% <0%> (ø)` | ⬆️ |
| src/cov.jl | `100% <0%> (ø)` | ⬆️ |
| src/common.jl | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nalimilan

Thanks. Sorry for the delay, I had missed the PR.

nalimilan · 2018-09-02T14:21:26Z

src/statmodels.jl

@@ -197,7 +197,7 @@ Coefficient of determination (R-squared).
 For a linear model, the R² is defined as ``ESS/TSS``, with ``ESS`` the explained sum of squares
 and ``TSS`` the total sum of squares.
 """
-r2(obj::StatisticalModel) = mss(obj) / deviance(obj)
+r2(obj::StatisticalModel) = mss(obj) / sum(abs2, response(obj) .- meanresponse(obj))


Better do μ = meanresponse(obj); sum(x -> abs2(x - μ), response(obj)) to avoid allocating a copy. Maybe turn this into a long-form function for readability.

@Nosferican Is this formula OK for your package?

EDIT: actually I think we'd better deprecate this function and leave packages to implement it for linear models. That will give a clearer error message when called on a nonlinear model (instead of an error about mss not being defined).

> avoid allocating a copy

I thought that, with the dot syntax, sum neither allocated nor transformed scalars into vectors.

> we'd better deprecate this function and leave packages to implement it for linear models

Do you mean removing the function and leaving a warning or leaving the function and adding a warning?

Consider this script,

```julia
using BenchmarkTools

a(x, y) = sum(abs2, x - y)                           # allocates x - y
b(x, y) = sum(abs2, x .- y)                          # allocates x .- y
c(x, y) = sum(xy -> abs2(xy[1] - xy[2]), zip(x, y))  # iterates pairs lazily
d(x, y) = sum(abs2(a - b) for (a, b) ∈ zip(x, y))    # generator over pairs

using Random: seed!
seed!(0)
const x = rand(1000);
const y = rand(1000);

@btime a($x, $y)  # 703.416 ns (1 allocation: 7.94 KiB)
@btime b($x, $y)  # 659.205 ns (1 allocation: 7.94 KiB)
@btime c($x, $y)  # 823.250 ns (1 allocation: 32 bytes)
@btime d($x, $y)  # 969.684 ns (2 allocations: 48 bytes)
```

The dot syntax (b) seems to be marginally faster, but it certainly allocates a temporary array, unlike c or d.

If the R² matches Stata, it will match mine. Some models will still need to define the method for some panel estimators, but this seems good.

If the statsmodels functions are being moved to StatsModels, it might be best to make the decision there (better to trim there than to have lost useful code by then).

@lbittarello:

> I thought that, with the dot syntax, sum neither allocated nor transformed scalars into vectors.

As @Nosferican shows, it allocates, and it will probably stay that way in the future. Maybe another syntax will be introduced to do that.

> Do you mean removing the function and leaving a warning or leaving the function and adding a warning?

I mean print a warning using Base.depwarn(..., :r2), saying that packages should define their custom method. Then we'll turn the warning into an error just like for other function stubs in the file.

@Nosferican:

> If the R² matches Stata, it will match mine. Some models will still need to define the method for some panel estimators, but this seems good.

OK, cool.

> If the statsmodels functions are being moved to StatsModels, it might be best to make the decision there (better to trim there than to have lost useful code by then).

It's not clear yet what will happen and when, and given how small this function is that's really not an issue.

I've added a deprecation warning. Let me know if it's okay.
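
For readers following the thread, the deprecation approach discussed above could look roughly like the sketch below. It is not the code merged in this PR: `ToyModel`, `ToyLinearModel`, `toy_r2` and the first half of the warning text are hypothetical names made up for illustration. Only `Base.depwarn(msg, funcsym)` and the non-allocating `sum(x -> abs2(x - μ), ...)` pattern come from the review comments.

```julia
# Hypothetical stand-ins for StatisticalModel and a fitted linear model.
abstract type ToyModel end

struct ToyLinearModel <: ToyModel
    y::Vector{Float64}   # observed response
    ŷ::Vector{Float64}   # fitted values
end

function toy_r2(obj::ToyModel)
    # Emit a deprecation warning attributed to :toy_r2 (shown only when
    # Julia runs with deprecation warnings enabled).
    Base.depwarn("The default method for linear models is deprecated. " *
                 "Packages should define their own methods.", :toy_r2)
    μ = sum(obj.y) / length(obj.y)
    # Closures avoid allocating the temporaries that `obj.y .- μ` would create.
    mss = sum(x -> abs2(x - μ), obj.ŷ)   # explained sum of squares
    tss = sum(x -> abs2(x - μ), obj.y)   # total sum of squares
    return mss / tss
end
```

A later step, as described in the thread, would turn the warning into an error so that packages must provide their own method.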

nalimilan · 2018-09-02T14:25:39Z

src/statmodels.jl

+        y = response(obj)
+        ŷ = fitted(obj)
+        μ = meanresponse(obj)
+        1 - sum(abs2, y .- ŷ) / sum(abs2, y .- μ)


Same remark as above about avoiding allocations.

nalimilan · 2018-09-02T14:27:21Z

src/statmodels.jl

+- `:MacFadden` (a.k.a. likelihood ratio index), defined as ``1 - \\log (L)/\\log (L_0)``;
+- `:CoxSnell`, defined as ``1 - (L_0/L)^{2/n}``;
+- `:Nagelkerke`, defined as ``(1 - (L_0/L)^{2/n})/(1 - L_0^{2/n})``;
+- `:Efron`, defined as ``1 - \\sum_i (y_i - \\hat{y})^2 / \\sum_i (y_i - \\bar{y})^2``.


Should be \\hat{y}_i.
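
As an illustration of the likelihood-based variants listed in the docstring, the formulas can be written in terms of log-likelihoods. This is a minimal sketch, assuming `ll` is the model's log-likelihood, `ll0` that of the null model and `n` the number of observations; the function names are hypothetical and not part of the package.

```julia
# McFadden: 1 - log(L) / log(L0)
mcfadden(ll, ll0) = 1 - ll / ll0

# Cox and Snell: 1 - (L0 / L)^(2/n), computed on the log scale
coxsnell(ll, ll0, n) = 1 - exp(2 * (ll0 - ll) / n)

# Nagelkerke: Cox and Snell rescaled by its upper bound 1 - L0^(2/n)
nagelkerke(ll, ll0, n) = coxsnell(ll, ll0, n) / (1 - exp(2 * ll0 / n))
```

Working on the log scale avoids forming `L` and `L0` directly, which can underflow for large samples.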

nalimilan · 2018-09-02T14:29:42Z

src/statmodels.jl

-        1 - exp(2/nobs(obj) * (ll0 - ll))
-    elseif variant == :Nagelkerke
-        (1 - exp(2/nobs(obj) * (ll0 - ll)))/(1 - exp(2/nobs(obj) * ll0))
+# The following variants were benchmarked against Stata's fitstat:


I'd rather put this mention in the tests next to the expected values. The explanation about linear models can go to the docstring.

Tests in GLM.jl?

Some tests here are run using GLM, which uses this package... It might be best to test these eventually without relying on another package.

Yeah, in GLM.jl (since the check against Stata applies to GLMs AFAICT).

It's not ideal to have this split in two packages, but testing things directly here implies writing pseudo-model types. That's not so hard, but it will take some work.

nalimilan · 2018-09-02T14:32:06Z

src/statmodels.jl

@@ -259,10 +271,9 @@ In this formula, ``L`` is the likelihood of the model, ``L0`` that of the null m
 of the model (as returned by [`dof`](@ref)).
 """
 function adjr2(obj::StatisticalModel, variant::Symbol)
-    ll = -deviance(obj)/2
-    ll0 = -nulldeviance(obj)/2
+    ll = likelihood(obj)


AFAICT you want loglikelihood here. The fact it wasn't caught by testing is a bit worrying: isn't this covered by GLM's tests?

Sorry. You're right. I hadn't tested the adjusted R².

OK. Can you confirm it works now?

nalimilan · 2018-09-05T07:00:01Z

src/statmodels.jl

+        y = response(obj)
+        ŷ = fitted(obj)
+        μ = meanresponse(obj)
+        1 - sum(abs2, y .- ŷ) / sum(x -> abs2(x - μ), response(obj))


sum(abs2, y .- ŷ) also needs to be changed.

nalimilan

Looks good, thanks!

Nosferican · 2018-09-05T18:23:20Z

src/statmodels.jl

+                 "Packages should define their own methods.", :r2)
+
+    μ = meanresponse(obj)
+    mss(obj) / sum(x -> abs2(x - μ), response(obj))


Shouldn't these take weights into account, since weights would have been handled in mss and deviance?

Good catch. That also applies to the Efron R². For the method above, since it's deprecated, we'd better keep using deviance for now. For the Efron R², I guess we need to multiply observations by their weights, but will it work for all kinds of models?

Precisely: packages can handle weights through mss(obj) and loglikelihood(obj). There's no need to provide for weights here.

I'm not sure about ProbabilityWeights, but it should work fine for the other supported weights, if I'm not mistaken. The tests should cover the weighted cases for sure.

Efron's R² does need special treatment, since it uses the sum of residuals.

Do you know the formula for weighted Efron R²? I guess we just need to multiply both the numerator and the denominator?

I think so. How are we going to implement it? We can retrieve weights via weights, but how do we check if weights were defined in the first place?

I think the default is just an array of ones. It might be nice to follow FillArrays.jl and implement UnitWeight or something similar.

Let's call weights and assume it returns a valid AbstractVector. Then it would make sense to define UnitWeights in this package, to complement other kinds of weights (#135).

Bump. Can you drop the Efron R² for now so that we can merge the PR? We can discuss improvements later.

Since #515 we have UnitWeights, so we can require models to return this when they are unweighted.
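
To make the weighted proposal from this thread concrete, here is a minimal sketch. It assumes, as tentatively suggested above and not confirmed anywhere in the thread, that each squared term in the numerator and denominator is multiplied by its observation weight, and additionally that the mean response is the weighted mean. `weighted_efron_r2` is a hypothetical name; Efron's R² was dropped from this PR before merging.

```julia
# Sketch of a weighted Efron R²: 1 - Σ wᵢ (yᵢ - ŷᵢ)² / Σ wᵢ (yᵢ - μ)²,
# with μ taken to be the weighted mean of y (an assumption).
function weighted_efron_r2(y, ŷ, w)
    μ = sum(wi * yi for (wi, yi) in zip(w, y)) / sum(w)
    # Generators over zip avoid allocating temporaries such as y .- ŷ.
    rss = sum(wi * abs2(yi - ŷi) for (wi, yi, ŷi) in zip(w, y, ŷ))
    tss = sum(wi * abs2(yi - μ) for (wi, yi) in zip(w, y))
    return 1 - rss / tss
end
```

With all weights equal to one this reduces to the unweighted definition, which is the case an unweighted model would hit if it returned unit weights.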

nalimilan · 2018-10-03T06:39:08Z

Thanks! Can you file a PR against GLM to update R² tests?

lbittarello · 2018-10-03T14:13:56Z

As far as I can tell, the only tests involve linear models. This PR shouldn't affect them.

nalimilan · 2018-10-03T14:17:08Z

OK. Then it would be nice to add tests to ensure results remain correct in the future. ;-)

nalimilan reviewed Sep 2, 2018

nalimilan changed the title ~~Closes #396~~ Correct R² formulas and add Efron's R² Sep 2, 2018

Fixes

0b27eb3

nalimilan reviewed Sep 5, 2018

lbittarello and others added 3 commits September 5, 2018 09:24

Fixes

ef14acb

Deprecated default r² for linear models

a369a31

Remove empty line

f5d5550

nalimilan approved these changes Sep 5, 2018

Nosferican reviewed Sep 5, 2018

lbittarello added 2 commits September 5, 2018 14:52

Fixes

fe20386

Drop Efron's R²

7b40de8

nalimilan merged commit 1ccb352 into JuliaStats:master Oct 3, 2018

nalimilan mentioned this pull request Jan 13, 2019

Fix typo in r2 #446

Merged

nalimilan added a commit that referenced this pull request Oct 19, 2019

Remove mention of deviance in adjr2 docstring

caa0bb1

Since #400 we use the log-likelihood.

nalimilan mentioned this pull request Oct 19, 2019

Remove mention of deviance in adjr2 docstring #531

Merged

nalimilan added a commit that referenced this pull request Dec 4, 2019

Remove mention of deviance in adjr2 docstring (#531)

0bb7740

Since #400 we use the log-likelihood.

This was referenced Jan 20, 2020

Wrong loglikelihood definitions JuliaStats/GLM.jl#356

Closed

R² generalization #549

Closed
