Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring #17

Closed
6 of 8 tasks
rofinn opened this issue May 15, 2019 · 3 comments
Closed
6 of 8 tasks

Refactoring #17

rofinn opened this issue May 15, 2019 · 3 comments

Comments

@rofinn
Copy link
Member

rofinn commented May 15, 2019

Between changes in the julia ecosystem and new requirements we should probably refactor our design to address specific use cases:

  • Simple Impute.fill(X), Impute.locf(X), Impute.chain(...) should stay simple
  • Deprecate impute(X, :method) calls
  • More functional interface for Chain. Maybe we can have the locf, fill, etc methods default to returning a lazy function if data isn't passed in? This would allow us to write an imputation pipeline as
data = Impute.interp(data; kwargs...) |> Impute.locf(; kwargs...) |> Impute.nocb(; kwargs...)

?

  • Drop direct dependence on DataFrames by using Tables interface (at the expense of an extra copy) Switch to Tables.jl API #20
  • Switch to using JuliaStats matrix orientation by default.
  • Introduce an IDataset type which stores original values, missing bitmask, sparse array of imputed values.
  • Alternate API where we construct an IDataset from X and pass that to different methods (e.g., @chain, @multiply).
  • Add support for dropping entire variables if there are too many missing values

NOTE: It's okay if certain imputation methods only work on certain types of data

@rofinn
Copy link
Member Author

rofinn commented Jul 4, 2019

I think the best way to handle dropping entire variables is to:

  1. Construct Imputation methods with a Context.
  2. Define DropVars/dropvars and DropObs/dropobs

This would allow you to implement a workflow like:

chain(DropVars(...), Interpolate(...), DropObs(...))

Before we make that changes we should probably:

  1. deprecate the impute(X, :method) functions first
  2. switch the matrix orientation

@rofinn
Copy link
Member Author

rofinn commented Sep 2, 2019

As an extension to the above proposed changes we may want to define a separate module for imputation iterators. This would address issues related to mutation inconsistency and Context usage by encapsulating most of the base behaviour in a collection of iterators that don't support mutation and have a reduced API. For more complex cases, we should just use a Dataset type which stores a mask of the missing values with with original and imputed datasets. We can always provide helpful methods for testing missing data patterns, but those will likely require multiple passes over the data anyways.

Iterators

  1. Only makes 1 pass through the data (even with chaining)
  2. Doesn't explicitly make a copy of the data, but also doesn't mutate the underlying data.
  3. Takes a ismissing function
  4. Takes a limit value to error if there are too many missing values

Datasets

  1. Construct missingness masks for original dataset (w/ error conditions)
  2. Impute values and store them in a sparse array
  3. Support complete, merge and analyse operations

@rofinn
Copy link
Member Author

rofinn commented Mar 6, 2020

Iterators API didn't work out. Ensuring reasonable performance for even the current list of imputation strategies was challenging and I don't know how much benefit it's likely to have. I think any future efforts would be better served to just simplify the current API, so folks can define their own methods more easily.

#60

@rofinn rofinn closed this as completed Mar 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant