Weights in ModelFrame #823

Closed
matthieugomez opened this issue Jun 20, 2015 · 5 comments

Comments

@matthieugomez
Contributor

It would be nice to let the user specify weights when creating a ModelFrame. This would add a column for the weights to the returned DataFrame and a field for the weight column's name to the ModelFrame object.

This option would ensure that rows with missing weights are removed, that the returned DataFrame retains the weight column, and that the right msng vector is returned.

Ideally, this option would also standardize the weight option across different statistical models. Currently in R, it seems like every model has a different name for weights (w, wts, wght, weights, weight, etc.).

As for the implementation, I think the weight option should accept a symbol (i.e. a column in the DataFrame), to be consistent with how formulas work, rather than a Vector/WeightVec. Since weights may or may not be present, I'm not sure of the best approach for the weight field (Union with nothing, Nullable, a WeightedModelFrame type, etc.).
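The proposal above can be sketched in a language-neutral way. The following is a minimal Python illustration, not the DataFrames.jl API: `ModelFrameSketch`, `model_frame`, and the `kept` field are hypothetical names, `None` stands in for a missing value, and `kept` plays the role of the msng vector under the assumption that it flags the rows that survive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelFrameSketch:
    """Hypothetical stand-in for a ModelFrame; not the DataFrames.jl type."""
    columns: dict                   # column name -> values of the retained rows
    kept: list                      # kept[i] is True if original row i was retained
    weights: Optional[str] = None   # name of the weights column, or None


def model_frame(data, terms, weights=None):
    """Keep `terms` (plus `weights`, if given) and drop rows with a missing value."""
    names = list(terms) + ([weights] if weights is not None else [])
    nrows = len(next(iter(data.values())))
    # A row is retained only if every needed column, weights included, is present.
    kept = [all(data[name][i] is not None for name in names) for i in range(nrows)]
    columns = {
        name: [v for v, k in zip(data[name], kept) if k] for name in names
    }
    return ModelFrameSketch(columns, kept, weights)


data = {
    "y": [1.0, 2.0, None, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "w": [1.0, None, 2.0, 3.0],
}
weighted = model_frame(data, ["y", "x"], weights="w")
unweighted = model_frame(data, ["y", "x"])
```

Passing the column name rather than a vector lets the weights go through the same missing-value filter as the model terms, so row 2 of the example (missing weight) is dropped only in the weighted frame.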

@nalimilan
Member

+1. I think I would go with a Nullable{Symbol} field holding the name of the weights column (if any).

But then you'd also need to include the actual weights vector in ModelMatrix. I guess it could be stored as a plain Vector, as we don't really need a WeightVec here (the point of that type is to avoid recomputing the sum every time, but here we don't need the sum AFAICT, and with missing values we would need to recompute it anyway).

I also think several kinds of weights should be supported. This has been discussed at JuliaStats/StatsBase.jl#53: people may want to use frequency weights, precision weights, sampling weights or arbitrary weights, which (when supported) give different standard errors. An additional field giving the type of weights would be enough. This type would be obtained from the type of the column in the data frame, defaulting to "arbitrary" (i.e. unspecified) for plain vectors.
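To illustrate why the kind of weights matters, here is a small Python sketch of the standard error of a weighted mean under two interpretations. The function names are hypothetical, and the precision-weight formula follows one common convention (rescaling the weights to sum to the number of observations), which is an assumption rather than anything specified in this thread.

```python
import math


def weighted_mean(x, w):
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def se_frequency(x, w):
    """Weights are replication counts, so the sample size is sum(w)."""
    n = sum(w)
    m = weighted_mean(x, w)
    var = sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / (n - 1)
    return math.sqrt(var / n)


def se_precision(x, w):
    """Weights are relative precisions, so the sample size is len(x)."""
    n = len(x)
    m = weighted_mean(x, w)
    scale = n / sum(w)  # assumed convention: rescale weights to sum to n
    var = sum(scale * wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / (n - 1)
    return math.sqrt(var / n)


x = [1.0, 2.0, 4.0]
w = [2, 1, 3]
# With frequency weights, the data are equivalent to this expanded sample.
expanded = [1.0, 1.0, 2.0, 4.0, 4.0, 4.0]
```

Both interpretations give the same point estimate, but the frequency version treats the data as six observations while the precision version treats them as three, so the standard errors differ.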

@johnmyleswhite
Contributor

I agree that we need a design where weights (of many kinds) are available by default. But I'd also like to move towards a set of abstractions we can use for model fitting and away from specific implementations like ModelMatrix. Examples of things I think we should support in a generalized model fitting protocol:

  • Remote: Fitting a GLM to data in a DB that is never fully loaded into memory on the client
  • Streaming: Fitting a GLM to a stream of observations, only N of which are loaded into memory at a time

To me, the general model fitting procedure looks like:

  1. Initialize the model parameters
  2. Translate N DB rows into a matrix-like form, which may be a dense Float64 matrix, a sparse Float64 matrix or a joint dense/sparse Float64 matrix
  3. Evaluate the objective function with regard to the current N rows in matrix format and the current model parameters
  4. Perform a parameter update step
  5. Go back to step 2 and repeat until convergence

With a bit of work, this kind of abstraction would make it easy for us to estimate not only GLM models, but much more advanced models like GBDTs, while reusing the best part of R's GLM machinery -- the DSL for translating DB rows into matrix form.

In this kind of setup, weights would be a component of specifying the objective function.
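The fitting procedure described above, with weights entering through the objective, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the model is a weighted least-squares line fitted by plain gradient steps, and `batches`, `to_matrix`, `gradient`, and `fit` are hypothetical names, not part of any existing package.

```python
def batches(rows, n):
    """Yield at most n rows at a time, standing in for a DB cursor or a stream."""
    for start in range(0, len(rows), n):
        yield rows[start:start + n]


def to_matrix(batch):
    """Translate raw rows into (design row, response, weight) triples."""
    return [([1.0, r["x"]], r["y"], r["w"]) for r in batch]


def gradient(beta, matrix):
    """Gradient of the objective sum(w * (x.beta - y)^2) / sum(w)."""
    grad = [0.0] * len(beta)
    total_w = sum(w for _, _, w in matrix)
    for xrow, y, w in matrix:
        resid = sum(b * xi for b, xi in zip(beta, xrow)) - y
        for j, xi in enumerate(xrow):
            grad[j] += 2.0 * w * resid * xi / total_w
    return grad


def fit(rows, n=2, step=0.02, epochs=20000, tol=1e-10):
    beta = [0.0, 0.0]                                      # initialize parameters
    for _ in range(epochs):
        previous = list(beta)
        for batch in batches(rows, n):
            matrix = to_matrix(batch)                      # translate N rows
            grad = gradient(beta, matrix)                  # evaluate the objective
            beta = [b - step * g for b, g in zip(beta, grad)]  # update step
        # Repeat until a full pass no longer moves the parameters.
        if max(abs(b - p) for b, p in zip(beta, previous)) < tol:
            break
    return beta


# Toy data lying exactly on y = 1 + 2x, with alternating weights 1 and 2.
rows = [{"x": float(i), "y": 1.0 + 2.0 * i, "w": 1.0 + (i % 2)} for i in range(6)]
beta = fit(rows)
```

Only `to_matrix` and `gradient` know about the data layout and the weights; the loop in `fit` never holds more than `n` rows in matrix form, which is the property the remote and streaming cases rely on.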

Also probably worth noting that Doug Bates seems to have already done a lot of work on thinking through the sparse matrix representation of DBs with high-cardinality categorical variables.

@nalimilan
Member

Agreed, that's what I had in mind too. ModelFrame sounds like a good first level of abstraction, which could work with any model fitting procedure -- be it the current ModelMatrix or any more general interface.

@quinnj
Member

quinnj commented Sep 7, 2017

Feel free to re-open at https://github.com/JuliaStats/StatsModels.jl if this is still relevant.

@quinnj quinnj closed this as completed Sep 7, 2017
@nalimilan
Member

We now have several types of weight vectors in StatsBase, and work is being done to use them with GLM.jl (JuliaStats/GLM.jl#194).
