Methods for StatsModel #396
Comments
Maybe, but note that it's a bit more complex than that since e.g. multinomial logistic regression models have several possible outcomes, so should we return the name of the variable or the list of outcomes? If the former, this information could be obtained from the formula (and we don't currently provide a function to get the name of the independent variables). If the latter, what name would we return for e.g. linear regression?
Why?
Because I haven't been able to compute it from the likelihood, and because nobody appears to know how to compute the deviance from the likelihood, even though it's supposed to be trivial (according to all textbooks). See JuliaStats/GLM.jl#115 (comment) and https://stats.stackexchange.com/questions/136055/deviance-of-a-regression-model/186544#186544.
My original idea was to require people to explicitly choose the pseudo-R² they want, since none is really better than the others. The advantage of the current one-argument method is that it works only when |
Excellent question. My inquiry had RegressionTables.jl in mind. It transforms response labels into label columns. Maybe @jmboehm could weigh in.
Not all statistical objects carry
I think it's a strange use of multiple dispatch. One expects a function to perform similar functions for different input types. (I do anyway.) It would be counterintuitive if
I might be mistaken, but I don't think that the standard formulae use the deviance at all. So why would you need to compute the deviance from the likelihood?
That makes sense, but it's strange then that it uses the deviance. Does it have a standard definition for linear models? I've only ever found it in relation to MLE.
Agreed. |
Yeah, but maybe we could decide that you need to use formulas if you want to have variable names. If you build your matrix manually, variables don't make sense.
That example is quite radical. For
In theory, yes. In practice, I haven't been able to reproduce R² from other software using the loglikelihood, only using the deviance. In particular, for the standard Gaussian linear model, the way the dispersion parameter should be taken into account isn't explained in textbooks, and in practice the assumption
Yes, the deviance is just the MSS for the linear model. Since it's just a special case of GLM (Gaussian distribution with identity link), the deviance has to exist. |
I am only able to reproduce Stata's R² with the log likelihood. I don't know how you got the R² on R (I'm not very familiar with R), but I could also match the R² from the packages
The AIC is a function of the likelihood rather than the deviance. I can match the output from
It is only true for binary models, as far as I know.
OLS is consistent under much weaker conditions: as a special case of GMM, it only requires the specification of the expectation rather than the entire distribution. Therefore, the likelihood isn't intrinsically defined for linear models. Unlike the R², the deviance only exists for a special case. |
Can you show how to replicate R²/pseudo-R² values reported by Stata and R packages using the log-likelihood for a concrete example of a linear model? There's an example in the tests where values have been checked against Stata and R IIRC. |
Julia:
R:
The current formula in StatsBase.jl uses the deviance instead of the likelihood. It works by coincidence. R and Stata report the standard R² for linear regression. It coincides with McFadden's pseudo R² from StatsBase.jl because the deviance is equal to the sum of squares in this case. If we ask for the pseudo R² via the pscl or DescTools packages for R, however, we see that StatsBase.jl is reporting the wrong number. |
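The coincidence described in this comment can be checked by hand on a toy Gaussian linear model. The following is only a sketch in plain Julia: it does not call StatsBase.jl or GLM.jl, the data are made up, and the helper names (`avg`, `gaussian_loglik`) are hypothetical. It assumes the log-likelihood is evaluated at each model's own ML variance estimate.

```julia
# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
n = length(y)

avg(v) = sum(v) / length(v)  # hypothetical helper, avoids loading Statistics

# Closed-form least squares for a single regressor plus intercept.
xbar, ybar = avg(x), avg(y)
slope = sum((x .- xbar) .* (y .- ybar)) / sum((x .- xbar) .^ 2)
intercept = ybar - slope * xbar
yhat = intercept .+ slope .* x

rss = sum((y .- yhat) .^ 2)  # residual sum of squares = Gaussian deviance
tss = sum((y .- ybar) .^ 2)  # total sum of squares = null deviance

# Gaussian log-likelihood at the ML variance estimate ss / n (an assumption).
gaussian_loglik(ss, n) = -n / 2 * (log(2π) + log(ss / n) + 1)
ll = gaussian_loglik(rss, n)
ll0 = gaussian_loglik(tss, n)

r2_classic = 1 - rss / tss                 # standard R²
r2_mcfadden_dev = 1 - rss / tss            # deviance-based ratio, same numbers
r2_mcfadden_ll = 1 - ll / ll0              # likelihood-based McFadden
r2_cox_snell = 1 - exp(2 * (ll0 - ll) / n) # likelihood-based Cox-Snell

println((r2_classic, r2_mcfadden_dev, r2_mcfadden_ll, r2_cox_snell))
```

Under these assumptions the deviance-based ratio reproduces the standard R² by construction, while the likelihood-based McFadden value is a different number and need not even lie in [0, 1] for continuous outcomes.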
Microeconometrics.jl uses
We could follow Stata and rename the struct |
I get this:
julia> r2_mcfadden = 1.0 - loglikelihood(lm) / nullloglikelihood(lm)
-61.703067811928314
julia> r2_statsbase = r2(lm, :McFadden)
0.9990466748057584
The first one is consistent with both R packages you cite, but the second one is consistent with
What do you mean by "intermediate output"?
Maybe, but "importance" doesn't mean anything special, so I think I prefer just "weights". I actually kind of like the fact that |
It's true that Ultimately, you need only check the original papers, such as McFadden's or Nagelkerke's. None mentions the deviance.
Stata does not report a pseudo R² for GLMs. It reports the classic R² for OLS (i.e.
It's true that it looks strange. It's partly because of the sample (it behaves better with larger samples), partly because of the model (these measures target binary models) and partly because of the intrinsic limitations of any pseudo R².
The user inputs the components of a model:
I think I've lost this battle. :) |
Indeed they don't. But it's interesting to note that Nagelkerke says his pseudo-R² is equivalent to the R² for linear regression. Currently we give a very close value (not sure whether it's due to fitting approximations or whether there's a problem), but using the log-likelihood I get -8.37. Also, none of the papers says what dispersion parameter to use for the normal distribution: should we take different values for each model, or the same one (and which)? One argument in favor of using the log-likelihood directly is that the Cox-Snell pseudo-R² we currently return isn't equal to the R² for linear regression, while it's supposed to be. But since the Nagelkerke R² is also supposed to be equivalent, I'm a bit lost. Unfortunately I don't have the time to investigate this right now. It would be interesting to check with more reliable software than random R packages. For example maybe in Stata
What's the problem with carrying the formula around? You need something similar to a formula anyway if you want to keep track of the information it contains, including the name of dependent variables. I don't really see the point of storing it in a new structure, which would only contain part of the information. |
Nagelkerke's R² is being misinterpreted. His definition is:
Different parameters. Cox and Snell's R² doesn't coincide with the classical definition otherwise.
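The "different parameters" point can be made explicit with a short derivation. This is a sketch under the assumption of a Gaussian linear model with n observations, where RSS and TSS denote the residual and total sums of squares and each model's variance is replaced by its own ML estimate (RSS/n for the fitted model, TSS/n for the null model):

```latex
\ell   = -\frac{n}{2}\left[\log(2\pi) + \log\frac{\mathrm{RSS}}{n} + 1\right],
\qquad
\ell_0 = -\frac{n}{2}\left[\log(2\pi) + \log\frac{\mathrm{TSS}}{n} + 1\right]

R^2_{CS} = 1 - \exp\!\left(\frac{2}{n}\,(\ell_0 - \ell)\right)
         = 1 - \exp\!\left(\log\frac{\mathrm{RSS}}{\mathrm{TSS}}\right)
         = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}

% With one common variance \sigma^2 plugged into both models instead:
R^2_{CS} = 1 - \exp\!\left(-\frac{\mathrm{TSS} - \mathrm{RSS}}{n\,\sigma^2}\right)
```

With separate variance estimates, the Cox-Snell expression collapses to the classical R². With a single shared variance it generally does not.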
It wouldn't help. The intermediate |
It doesn't. On the other hand, we can check our numbers for a Poisson model. It's convenient in that the likelihood, the deviance and the residual sum of squares don't coincide. I had to use a trick because GLM.jl doesn't compute the null likelihood and the null deviance of the Poisson model. Here's the Julia code:
Here's the Stata code:
We match Stata's numbers with the likelihood. |
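To see why a Poisson model is a convenient test case, here is a self-contained sketch in plain Julia. It is not the code from this comment and does not use GLM.jl: the data are invented, the helper names (`avg`, `logfact`, `poisson_loglik`, `unit_dev`) are hypothetical, and the model is deliberately restricted to one binary covariate so that the ML fitted values are just group means and no iterative fitting is needed.

```julia
# Toy count data with one binary covariate, invented for illustration only.
y = [0, 1, 2, 1, 4, 6, 5, 7]
g = [0, 0, 0, 0, 1, 1, 1, 1]

avg(v) = sum(v) / length(v)

# Null model: one common mean. Fitted model: one mean per group (the Poisson MLE here).
mu0 = fill(avg(y), length(y))
mu = [avg(y[g .== gi]) for gi in g]

logfact(k) = k == 0 ? 0.0 : sum(log, 1:k)  # log(k!)

poisson_loglik(y, mu) = sum(y .* log.(mu) .- mu .- logfact.(y))

# Unit deviance, with the convention 0 * log(0) = 0.
unit_dev(yi, mi) = 2 * ((yi == 0 ? 0.0 : yi * log(yi / mi)) - (yi - mi))
poisson_deviance(y, mu) = sum(unit_dev.(y, mu))

ll, ll0 = poisson_loglik(y, mu), poisson_loglik(y, mu0)
dev, nulldev = poisson_deviance(y, mu), poisson_deviance(y, mu0)
rss, tss = sum((y .- mu) .^ 2), sum((y .- mu0) .^ 2)

r2_ll = 1 - ll / ll0        # likelihood-based McFadden
r2_dev = 1 - dev / nulldev  # deviance-based ratio
r2_ss = 1 - rss / tss       # sums-of-squares ratio

println((r2_ll, r2_dev, r2_ss))
```

On data like these the three ratios give three different numbers, so a comparison against another program's reported pseudo-R² can tell the definitions apart, which the Gaussian case cannot do.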
OK. Feel free to make a PR, but please specify precisely which values have been checked against Stata and which values are equal to the standard R² in the case of linear models, so that we can refer to that if the issue comes up again in the future. The fact that we would need to know whether the data is continuous or categorical to compute Nagelkerke's pseudo-R² correctly is annoying. I guess it's not the end of the world if we always use the categorical variant, given that people can use the Cox-Snell pseudo-R² instead. (I think this really illustrates the poor theoretical support for this pseudo-generalization of the R².) |
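For reference, the reason the outcome type matters can be read off the usual statement of Nagelkerke's rescaling. The notation below is an assumption, not taken from the thread: L_0 and \hat{L} are the likelihoods of the null and fitted models and n is the number of observations.

```latex
R^2_{CS} = 1 - \left(\frac{L_0}{\hat{L}}\right)^{2/n},
\qquad
\max R^2_{CS} = 1 - L_0^{2/n},
\qquad
R^2_{N} = \frac{R^2_{CS}}{\max R^2_{CS}}
```

The bound in the middle relies on \hat{L} \le 1, which holds when the likelihood is a product of probabilities (discrete outcomes). For continuous outcomes the likelihood is a product of densities and can exceed 1, so the same rescaling is not justified there.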
I also went ahead and used the log-likelihood rather than the deviance for my implementation of Fisher scoring for vector generalized linear models. It was a while ago, but I matched Stata's output using the log-likelihood, although I only covered canonical-link models. For some models, it is easier to compute the log-likelihood and define the variance as a function of it. For |
As for |
I have some questions:
- [...] coefnames)? respname, maybe?
- weights was recently added. It has the same name as the constructor of Weights. Isn't it confusing?
- [...] r2 based on deviances rather than likelihoods? The original formulae use likelihoods and so do programs like Stata.
- r2 gained a new method: r2(obj::StatisticalModel) = mss(obj) / deviance(obj). Wouldn't it make more sense to add the Efron R² to the standard r² instead? Something like: