Methods for StatsModel #396

Closed
lbittarello opened this issue Jul 17, 2018 · 15 comments · Fixed by #400
@lbittarello
Contributor

lbittarello commented Jul 17, 2018

I have some questions:

  • Could we add a function to recover the label of the response (similar to coefnames)? respname, maybe?
  • The method weights was recently added. It has the same name as the constructor of Weights. Isn't it confusing?
  • Why is r2 based on deviances rather than likelihoods? The original formulae use likelihoods and so do programs like Stata.
  • r2 gained a new method: r2(obj::StatisticalModel) = mss(obj) / deviance(obj). Wouldn't it make more sense to add the Efron R² to the standard r² instead? Something like:
function r2(obj::StatisticalModel, variant::Symbol)

    if variant == :Efron

        y = model_response(obj)
        ŷ = fitted(obj)
        μ = meanresponse(obj)
        1.0 - sum(abs2, y .- ŷ) / sum(abs2, y .- μ)

    else

        ll = loglikelihood(obj)
        ll0 = nullloglikelihood(obj)

        if variant == :McFadden
            1 - ll/ll0
        elseif variant == :CoxSnell
            1 - exp(2 * (ll0 - ll) / nobs(obj))
        elseif variant == :Nagelkerke
            (1 - exp(2 * (ll0 - ll) / nobs(obj))) / (1 - exp(2 * ll0 / nobs(obj)))
        else
            error("variant must be one of :McFadden, :CoxSnell, :Nagelkerke or :Efron")
        end
    end
end
@nalimilan
Member

Could we add a function to recover the label of the response (similar to coefnames)? respname, maybe?

Maybe, but note that it's a bit more complex than that since e.g. multinomial logistic regression models have several possible outcomes, so should we return the name of the variable or the list of outcomes? If the former, this information could be obtained from the formula (and we don't currently provide a function to get the name of the independent variables). If the latter, what name would we return for e.g. linear regression?

The method weights was recently added. It has the same name as the constructor of Weights. Isn't it confusing?

Why?

Why is r2 based on deviances rather than likelihoods? The original formulae use likelihoods and so do programs like Stata.

Because I haven't been able to compute it from the likelihood, and nobody appears to know how to compute the deviance from the likelihood, even though it's supposed to be trivial (according to all textbooks). See JuliaStats/GLM.jl#115 (comment) and https://stats.stackexchange.com/questions/136055/deviance-of-a-regression-model/186544#186544.

r2 gained a new method: r2(obj::StatisticalModel) = mss(obj) / deviance(obj). Wouldn't it make more sense to add the Efron R² to the standard r² instead? Something like:

My original idea was to require people to explicitly choose the pseudo-R² they want, since none is really better than the others. The advantage of the current one-argument method is that it works only when mss is defined, i.e. for linear models, for which it's the classic R² and not an arbitrary pseudo-R². I guess it would be nice to 1) throw a more explicit error, and 2) add :Efron to the list of supported pseudo-R².

@lbittarello
Contributor Author

Maybe, but note that it's a bit more complex than that since e.g. multinomial logistic regression models have several possible outcomes, so should we return the name of the variable or the list of outcomes?

Excellent question.

My inquiry had RegressionTables.jl in mind. It transforms response labels into label columns. Maybe @jmboehm could weigh in.

this information could be obtained from the formula

Not all statistical objects carry a Formula around, though.

Why?

I think it's a strange use of multiple dispatch. One expects a function to perform similar functions for different input types. (I do anyway.) It would be counterintuitive if foo(x::Int, y::Int) = x * y and foo(x::AbstractFloat, y::AbstractFloat) = x / y.

Because I haven't been able to compute it from the likelihood, and that nobody appears to know how to compute the deviance from the likelihood

I might be mistaken, but I don't think that the standard formulae use the deviance at all. So why would you need to compute the deviance from the likelihood?

The advantage of the current one-argument method is that it only works only when mss is defined, i.e. for linear models

That makes sense, but it's strange then that it uses the deviance. Does it have a standard definition for linear models? I've only ever found it in relation to MLE.

I guess it would be nice to 1) throw a more explicit error, and 2) add :Efron to the list of supported pseudo-R².

Agreed.

@nalimilan
Member

Not all statistical objects carry Formula around though.

Yeah, but maybe we could decide that you need to use formulas if you want to have variable names. If you build your model matrix manually, variable names don't make sense.

I think it's a strange use of multiple dispatch. One expects a function to perform similar functions for different input types. (I do anyway.) It would be counterintuitive if foo(x::Int, y::Int) = x * y and foo(x::AbstractFloat, y::AbstractFloat) = x / y.

That example is quite radical. For weights, the difference between passing a vector of reals and passing a model is much more obvious, and neither behavior could work for the other type, so the potential for confusion is much more limited. Also, I don't know what other name we could use instead...

I might be mistaken, but I don't think that the standard formulae use the deviance at all. So why would you need to compute the deviance from the likelihood?

In theory, yes. In practice, I haven't been able to reproduce R² from other software using the loglikelihood, only using the deviance. In particular, for the standard Gaussian linear model, the way the dispersion parameter should be taken into account isn't explained in textbooks, and in practice the assumption deviance = -2 loglikelihood doesn't really hold. And what R returns from logLik is computed from the deviance (actually, from the AIC) and doesn't appear to really be the log-likelihood. If you find the solution, that would be great.
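A quick numeric sketch of that point (in Python, with made-up residuals; this is not code from any of the packages discussed): for a Gaussian model with the dispersion parameter at its MLE, the deviance (the residual sum of squares) and -2·loglikelihood differ by dispersion-dependent terms, so deviance = -2 loglikelihood does not hold.

```python
import math

# Made-up residuals from a hypothetical Gaussian fit (illustrative only).
resid = [0.2, -0.1, 0.05, -0.15, 0.0]
n = len(resid)

rss = sum(e ** 2 for e in resid)  # deviance of a Gaussian GLM = RSS

# Gaussian log-likelihood with the dispersion at its MLE, sigma2 = RSS / n:
#   ll = -n/2 * (log(2*pi*sigma2) + 1)
sigma2 = rss / n
ll = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# -2*ll carries log(2*pi*sigma2) terms that the deviance does not,
# so the two quantities differ.
print(rss, -2 * ll)
```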

That makes sense, but it's strange then that it uses the deviance. Does it have a standard definition for linear models? I've only ever found it in relation to MLE.

Yes, the deviance is just the MSS for the linear model. Since it's just a special case of GLM (Gaussian distribution with identity link), the deviance has to exist.

@lbittarello
Contributor Author

I haven't been able to reproduce R² from other software using the loglikelihood, only using the deviance

I am only able to reproduce Stata's R² with the log-likelihood. I don't know how you got the R² in R (I'm not very familiar with it), but I could also match the R² from the pscl and DescTools packages.

what R returns from logLik is computed from the deviance (actually, from the AIC)

The AIC is a function of the likelihood rather than the deviance. I can match the output from logLik.

in practice the assumption deviance = -2 loglikelihood doesn't really hold

It is only true for binary models, as far as I know.

its just a special case of GLM (gaussian distribution with identity link), the deviance has to exist

OLS is consistent under much weaker conditions: as a special case of GMM, it only requires the specification of the expectation rather than the entire distribution. Therefore, the likelihood isn't intrinsically defined for linear models. Unlike the R², the deviance only exists for a special case.

@nalimilan
Member

Can you show how to replicate R²/pseudo-R² values reported by Stata and R packages using the log-likelihood for a concrete example of a linear model? There's an example in the tests where values have been checked against Stata and R IIRC.

@lbittarello
Contributor Author

Julia:

using DataFrames, GLM

Carb   = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9]
OptDen = [0.086, 0.269, 0.446, 0.538, 0.626, 0.782]
form   = DataFrame(Carb = Carb, OptDen = OptDen)
lm     = fit(LinearModel, @formula(OptDen ~ Carb), form)

r2_mcfadden  = 1.0 - loglikelihood(lm) / nullloglikelihood(lm)
r2_statsbase = r2(lm, :McFadden)

R:

require(pscl)
require(DescTools)

Carb   <- c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9)
OptDen <- c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782)
form   <- data.frame(Carb, OptDen)
lm     <- glm(OptDen ~ Carb, family = gaussian(link = identity), data = form)

pR2(lm)
PseudoR2(lm)

The current formula in StatsBase.jl uses the deviance instead of the likelihood. It works by coincidence. R and Stata report the standard R² for linear regression. It coincides with McFadden's pseudo R² from StatsBase.jl because the deviance is equal to the sum of squares in this case. If we ask for the pseudo R² via the pscl or DescTools packages for R, however, we see that StatsBase.jl is reporting the wrong number.
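As a cross-check, the classic R² for this example can be reproduced from first principles (a minimal Python sketch of ordinary least squares on the same data; the variable names are my own):

```python
# OLS on the thread's example data, computing the classic R² = 1 - RSS/TSS.
carb    = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9]
opt_den = [0.086, 0.269, 0.446, 0.538, 0.626, 0.782]
n = len(carb)

xbar = sum(carb) / n
ybar = sum(opt_den) / n
sxx = sum((x - xbar) ** 2 for x in carb)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(carb, opt_den))

slope = sxy / sxx
intercept = ybar - slope * xbar
fitted = [intercept + slope * x for x in carb]

rss = sum((y - f) ** 2 for y, f in zip(opt_den, fitted))
tss = sum((y - ybar) ** 2 for y in opt_den)

r2_classic = 1 - rss / tss
print(r2_classic)  # ≈ 0.99905, coinciding with the deviance-based value above
```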

@lbittarello
Contributor Author

lbittarello commented Jul 19, 2018

maybe we could decide that you need to use formulas if you want to have variable names

Microeconometrics.jl uses Formula internally. It wouldn't make sense to carry the intermediate input around. (I think we've had this discussion elsewhere though and I was supposed to help generalize Formula, which I never found the time to. Sorry.)

I don't know what other name we could use instead...

We could follow Stata and rename the struct Weights to ImportanceWeights (with associated constructor iweights).

@nalimilan
Member

I get this:

julia> r2_mcfadden  = 1.0 - loglikelihood(lm) / nullloglikelihood(lm)
-61.703067811928314

julia> r2_statsbase = r2(lm, :McFadden)
0.9990466748057584

The first one is consistent with what both R packages you cite report, but the second one is consistent with BaylorEdPsych::PseudoR2(lm). I haven't checked, but IIRC the second one is also consistent with Stata. And an R² of -61.7 really doesn't look right -- at that point you may as well drop that indicator...

Microeconometrics.jl uses Formula internally. It wouldn't make sense to carry the intermediate input around. (I think we've had this discussion elsewhere though and I was supposed to help generalize Formula, which I never found the time to. Sorry.)

What do you mean by "intermediate output"?

We could follow Stata and rename the struct Weights to ImportanceWeights (with associated constructor iweights).

Maybe, but "importance" doesn't mean anything special, so I think I prefer just "weights". I actually kind of like the fact that weights can be used either to construct weights from a vector of values or to retrieve the weights used with a model.

@lbittarello
Contributor Author

the second one is consistent with BaylorEdPsych::PseudoR2(lm)

It's true that BaylorEdPsych uses the deviance. It's a mistake though. You can check that it reports the wrong AIC as well (compare PseudoR2(lm) and summary(lm)).

Ultimately, you need only check the original papers, such as McFadden's or Nagelkerke's. None mentions the deviance.

the second one is also consistent with Stata

Stata does not report a pseudo R² for GLMs. It reports the classic R² for OLS (i.e. regress).

a R² of -61.7 really doesn't look right -- at that point you may as well drop that indicator...

It's true that it looks strange. It's partly because of the sample (it behaves better with larger samples), partly because of the model (these measures target binary models) and partly because of the intrinsic limitations of any pseudo R².

What do you mean by "intermediate output"?

The user inputs the components of a model: model = Dict(:response => "y", :control => "x1 + x2 + 1"). The package combines them into a Formula, ~ y + x1 + x2 + 1. It feeds this intermediate Formula to ModelFrame. So it doesn't make sense to carry it around. (This syntax may look convoluted, but it's helpful for more complex models. For example, an IV Poisson model may have five components (response, controls, treatments, instruments, offsets) – the two-sided Formula isn't really suitable.)

I actually kind of like the fact that weights can be used either to construct weights from a vector of values or to retrieve the weights used with a model

I think I've lost this battle. :)

@nalimilan
Member

Ultimately, you need only check the original papers, such as McFadden's or Nagelkerke's. None mentions the deviance.

Indeed they don't. But it's interesting to note that Nagelkerke says his pseudo-R² is equivalent to the R² for linear regression. Currently we give a very close value (not sure whether it's due to fitting approximations or whether there's a problem), but using the log-likelihood I get -8.37. Also, none of the papers says what dispersion parameter to use for the normal distribution: should we take different values for each model, or the same one (and which)?

One argument in favor of using the log-likelihood directly is that the Cox-Snell pseudo-R² we currently return isn't equal to the R² for linear regression, while it's supposed to be. But since the Nagelkerke R² is also supposed to be equivalent, I'm a bit lost. Unfortunately I don't have the time to investigate this right now.

It would be interesting to check with more reliable software than random R packages. For example, maybe in Stata fitstat would work after fitting a linear model or a GLM with normal distribution?

The user inputs the components of a model: model = Dict(:response => "y", :control => "x1 + x2 + 1"). The package combines them into a Formula, ~ y + x1 + x2 + 1. It feeds this intermediate Formula to ModelFrame. So it doesn't make sense to carry it around. (This syntax may look convoluted, but it's helpful for more complex models. For example, an IV Poisson model may have five components (response, controls, treatments, instruments, offsets) – the two-sided Formula isn't really suitable.)

What's the problem with carrying the formula around? You need something similar to a formula anyway if you want to keep track of the information it contains, including the name of dependent variables. I don't really see the point of storing it in a new structure, which would only contain part of the information.

@lbittarello
Contributor Author

lbittarello commented Jul 21, 2018

it's interesting to note that Nagelerke says his pseudo-R² is equivalent to the R² for linear regression. Currently we give a very close value (not sure whether it's due to fitting approximations or whether there's a problem), but using the log-likelihood I get -8.37

Nagelkerke's R² is being misinterpreted. His definition is ρ / max(ρ), where ρ is the R² of Cox and Snell (1989). As Nagelkerke explains, max(ρ) = 1 - exp(2 * ll₀ / n) for discrete data (e.g., binary). For continuous data, max(ρ) = 1, and Nagelkerke's R² reduces to Cox and Snell's. It's clear in the penultimate paragraph of his paper.
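To make the relationship concrete, here is a small Python sketch (the log-likelihood values and sample size are made up for illustration):

```python
import math

# Hypothetical fitted and null log-likelihoods for a model on n observations.
ll, ll0, n = -40.0, -65.0, 100

# Cox and Snell's R².
cox_snell = 1 - math.exp(2 * (ll0 - ll) / n)

# Discrete data: the maximum attainable Cox-Snell R² is below 1,
# and Nagelkerke's R² divides by it.
max_rho_discrete = 1 - math.exp(2 * ll0 / n)
nagelkerke_discrete = cox_snell / max_rho_discrete

# Continuous data: max(rho) = 1, so Nagelkerke's R² reduces to Cox and Snell's.
nagelkerke_continuous = cox_snell / 1.0

print(cox_snell, nagelkerke_discrete, nagelkerke_continuous)
```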

none of the papers says what dispersion parameter to use for the normal distribution: should we take different values for each model, or the same one (and which)

Different parameters. Cox and Snell's R² doesn't coincide with the classical definition otherwise.

What's the problem with carrying the formula around?

It wouldn't help. The intermediate Formula doesn't even have a left-hand side. I could change it, but the extent of necessary adjustments to the package would be incommensurate with the simple task of labeling the response. I concede defeat again. :)

@lbittarello
Contributor Author

lbittarello commented Jul 21, 2018

maybe in Stata fitstat would work after fitting a linear model or a GLM with normal distribution

It doesn't. On the other hand, we can check our numbers for a Poisson model. It's convenient in that the likelihood, the deviance and the residual sum of squares don't coincide.

I had to use a trick because GLM.jl doesn't compute the null likelihood and the null deviance of the Poisson model. Here's the Julia code:

using DataFrames, GLM

y = [18.0, 17.0, 15.0, 20.0, 10.0, 20.0, 25.0, 13.0, 12.0]
x = categorical(["A", "B", "C", "A", "B", "C", "A", "B", "C"])

D  = DataFrame(y = y, x = x)
P₁ = fit(GeneralizedLinearModel, @formula(y ~ 1 + x), D, Poisson())
P₀ = fit(GeneralizedLinearModel, @formula(y ~ 1), D, Poisson())

ll₁ = loglikelihood(P₁)
ll₀ = loglikelihood(P₀)
dd₁ = deviance(P₁)
dd₀ = deviance(P₀)

r2_mcfadden  = 1.0 - ll₁ / ll₀
r2_statsbase = 1.0 - dd₁ / dd₀

r2_coxsnell  = 1 - exp(2 * (ll₀ - ll₁) / nobs(P₁))
r2_statsbase = 1 - exp(2 * (dd₀ - dd₁) / nobs(P₁))

r2_nagelkerke = (1 - exp(2 * (ll₀ - ll₁) / nobs(P₁))) / (1 - exp(2 * ll₀ / nobs(P₁)))
r2_statsbase  = (1 - exp(2 * (dd₀ - dd₁) / nobs(P₁))) / (1 - exp(2 * dd₀ / nobs(P₁)))

Here's the Stata code:

input y str1 x
	18.0 "A"
	17.0 "B"
	15.0 "C"
	20.0 "A"
	10.0 "B"
	20.0 "C"
	25.0 "A"
	13.0 "B"
	12.0 "C"
end

encode x, gen(xx)

poisson y i.(xx)
fitstat

We match Stata's numbers with the likelihood.

@nalimilan
Member

OK. Feel free to make a PR, but please specify precisely which values have been checked against Stata and which values are equal to the standard R² in the case of linear models, so that we can refer to that if the issue comes up again in the future.

The fact that we would need to know whether the data is continuous or categorical to compute Nagelkerke's pseudo-R² correctly is annoying. I guess it's not the end of the world if we always use the categorical variant, given that people can use the Cox-Snell pseudo-R² instead. (I think this really illustrates the poor theoretical support for this pseudo-generalization of the R².)

@Nosferican
Contributor

I also went ahead and used the log-likelihood rather than the deviance in my implementation of Fisher scoring for vector generalized linear models. It was a while ago, but I matched Stata's output using the log-likelihood, though only for canonical-link models. For some models, it is easier to compute the log-likelihood and define the variance as a function of it. As for respname or something similar, in my implementation of multinomial logistic regression I thought of returning the outcomes.

@Nosferican
Contributor

As for r2, we need to take into account the weights in a general way.
