Weights in ModelFrame #823
Comments
+1 I think I would go with a But then you'd also need to include the actual weights vector in I also think several kinds of weights should be supported. This has been discussed at JuliaStats/StatsBase.jl#53: people may want to use frequency weights, precision weights, sampling weights or arbitrary weights, which (when supported) give different standard errors. An additional field giving the type of weights would be enough. This type would be obtained from the type of the column in the data frame, with a default to "arbitrary" (i.e. unspecified) for plain vectors.
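To illustrate why the kind of weights matters, here is a minimal Python sketch. It is not Julia and not the StatsBase API; the names `WeightKind` and `weighted_mean_se` are hypothetical. It assumes one common convention: frequency weights count repeated observations, so the sample size is the sum of the weights, while for every other kind the sample size is taken to be the number of rows. Other conventions exist for the non-frequency cases.

```python
import math
from enum import Enum


class WeightKind(Enum):
    # Hypothetical tags for illustration only; not an existing API.
    ARBITRARY = "arbitrary"   # default for plain vectors
    FREQUENCY = "frequency"
    PRECISION = "precision"
    SAMPLING = "sampling"


def weighted_mean_se(x, w, kind=WeightKind.ARBITRARY):
    """Return (weighted mean, standard error of that mean)."""
    total = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    ss = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
    if kind is WeightKind.FREQUENCY:
        # Each row stands for w_i identical observations.
        n = total
    else:
        # Assumption: the sample size is the number of rows.
        n = len(x)
    var = ss / total * n / (n - 1)   # bias-corrected weighted variance
    return mean, math.sqrt(var / n)
```

With the same data and the same weight values, the point estimate is identical for every kind, but the standard error changes with the tag, which is why a single extra field recording the kind of weights carries real information.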
I agree that we need a design where weights (of many kinds) are available by default. But I'd also like to move towards a set of abstractions we can use for model fitting and away from specific implementations like
To me, the general model fitting procedure looks like:
With a bit of work, this kind of abstraction would make it easy for us to estimate not only GLM models but also much more advanced models like GBDTs, while reusing the best part of R's GLM machinery: the DSL for translating DB rows into matrix form. In this kind of setup, weights would be a component of specifying the objective function. It is also probably worth noting that Doug Bates seems to have already done a lot of work on thinking through the sparse matrix representation of DBs with high-cardinality categorical variables.
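As a rough illustration of "weights as a component of the objective function", here is a small Python sketch. The function names are hypothetical, and weighted least squares is only one possible objective, chosen because it has a closed-form minimizer for a single predictor.

```python
def weighted_squared_loss(beta, X, y, w):
    """Objective: sum_i w_i * (y_i - x_i . beta)^2."""
    total = 0.0
    for xi, yi, wi in zip(X, y, w):
        pred = sum(b * v for b, v in zip(beta, xi))
        total += wi * (yi - pred) ** 2
    return total


def fit_weighted_line(x, y, w):
    """Closed-form minimizer of the loss above for y = a + b * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Translating rows into matrix form: an intercept column plus the predictor.
x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 5.0]
X = [[1.0, xi] for xi in x]
intercept, slope = fit_weighted_line(x, y, [1.0, 1.0, 0.0])
```

Here the third row has weight zero, so the fitted line passes through the first two points only. Swapping in a different loss would leave the row-to-matrix step untouched, which is the separation the comment argues for.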
Agreed, that's what I had in mind too.
Feel free to re-open at https://github.com/JuliaStats/StatsModels.jl if this is still relevant.
We now have several types of weight vectors in StatsBase, and work is being done to use them with GLM.jl (JuliaStats/GLM.jl#194).
It would be nice to allow the user to specify weights when creating a ModelFrame. That would add a column for weights in the returned DataFrame and a field for the weight name in the ModelFrame object. This option would make sure that rows with missing weights are removed, that the returned DataFrame retains the weight column, and that the right msng vector is returned.
Ideally this option would also standardize the weight option across different statistical models. Currently in R, it seems like every model has a different name for weights (w, wts, wght, weights, weight, etc.).
As for the implementation, I think the weight option should accept a symbol (i.e. a column in the data frame), to be consistent with how formulas work, rather than a Vector/WeightVec. Since weights may or may not be present, I'm not sure of the best approach for the weight field (Union with nothing, Nullable, a WeightedModelFrame type, etc.).
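A minimal sketch of the proposal, written in Python rather than Julia so that it stays independent of the actual DataFrames API. `ModelFrameSketch` and `model_frame` are hypothetical names, the weight field uses the "Union with nothing" option (`Optional[str]`), and rows are plain dicts with `None` standing in for a missing value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelFrameSketch:
    # Hypothetical stand-in for ModelFrame; not the real type.
    rows: list                      # complete rows kept for fitting
    missing: list                   # one bool per input row, True if dropped
    weights: Optional[str] = None   # name of the weight column, or None


def model_frame(rows, columns, weights=None):
    """Keep rows that are complete in `columns` and, if given, in `weights`."""
    needed = list(columns) + ([weights] if weights is not None else [])
    # A row is dropped if any needed column, weights included, is missing.
    missing = [any(row.get(c) is None for c in needed) for row in rows]
    kept = [{c: row[c] for c in needed}
            for row, m in zip(rows, missing) if not m]
    return ModelFrameSketch(kept, missing, weights)


data = [
    {"y": 1.0, "x": 2.0, "w": 1.0},
    {"y": 2.0, "x": None, "w": 1.0},
    {"y": 3.0, "x": 4.0, "w": None},
]
mf = model_frame(data, ["y", "x"], weights="w")
```

Passing the weight column by name lets the same completeness check cover the weights, so the kept rows and the missing mask stay aligned with the weight column without the caller subsetting a separate vector.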