Refactoring #17

rofinn · 2019-05-15T23:38:25Z

Between changes in the Julia ecosystem and new requirements, we should probably refactor our design to address specific use cases:

Simple Impute.fill(X), Impute.locf(X), Impute.chain(...) should stay simple
Deprecate impute(X, :method) calls
More functional interface for Chain. Maybe the locf, fill, etc. methods could default to returning a lazy function when data isn't passed in? That would let us write an imputation pipeline as:

data = Impute.interp(data; kwargs...) |> Impute.locf(; kwargs...) |> Impute.nocb(; kwargs...)

Drop the direct dependence on DataFrames by using the Tables interface (at the expense of an extra copy). See "Switch to Tables.jl API" #20.
Switch to using JuliaStats matrix orientation by default.
Introduce an IDataset type which stores original values, missing bitmask, sparse array of imputed values.
Alternate API where we construct an IDataset from X and pass that to different methods (e.g., @chain, @multiply).
Add support for dropping entire variables if there are too many missing values
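
The lazy-function idea in the list above could look roughly like the following. This is only a sketch: `locf_sketch` and `nocb_sketch` are hypothetical stand-ins written for illustration, not the package's `Impute.locf` / `Impute.nocb`, and they only handle vectors.

```julia
# Hypothetical stand-ins for illustration; not the Impute.jl API.

# Last observation carried forward: replace each missing value with the
# previous value. Works on a copy, so the input is not mutated.
function locf_sketch(data::AbstractVector)
    out = collect(Union{Missing, eltype(data)}, data)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i - 1]
        end
    end
    return out
end

# Next observation carried backward: replace each missing value with the
# following value, scanning from the end.
function nocb_sketch(data::AbstractVector)
    out = collect(Union{Missing, eltype(data)}, data)
    for i in (length(out) - 1):-1:1
        if ismissing(out[i])
            out[i] = out[i + 1]
        end
    end
    return out
end

# When no data is passed, return a function that waits for the data.
locf_sketch() = data -> locf_sketch(data)
nocb_sketch() = data -> nocb_sketch(data)

x = [missing, 1.0, missing, 3.0, missing]

# The zero-argument calls return closures, so they compose with |>.
y = x |> locf_sketch() |> nocb_sketch()
```

Keyword arguments could be captured by the closure in the same way, which is what would make the `Impute.locf(; kwargs...)` form in the proposed pipeline possible.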

NOTE: It's okay if certain imputation methods only work on certain types of data


rofinn · 2019-07-04T21:02:47Z

I think the best way to handle dropping entire variables is to:

Construct Imputation methods with a Context.
Define DropVars/dropvars and DropObs/dropobs

This would allow you to implement a workflow like:

chain(DropVars(...), Interpolate(...), DropObs(...))
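
To make that workflow concrete, here is a minimal sketch of how the drop steps and the chain might fit together. Everything in it is an assumption for illustration: `DropVarsSketch`, `DropObsSketch`, `apply_step` and `chain_sketch` are hypothetical names, the `limit` field is a guess at what the Context would carry, and rows are treated as observations and columns as variables.

```julia
# Hypothetical types for illustration; not the Impute.jl API.
struct DropVarsSketch
    limit::Float64   # maximum tolerated fraction of missing values per variable
end

struct DropObsSketch end

# Keep only the columns (variables) whose missing fraction is within the limit.
function apply_step(d::DropVarsSketch, X::AbstractMatrix)
    keep = [count(ismissing, col) / length(col) <= d.limit for col in eachcol(X)]
    return X[:, keep]
end

# Keep only the rows (observations) that contain no missing values.
function apply_step(::DropObsSketch, X::AbstractMatrix)
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return X[keep, :]
end

# Apply each step in order, feeding the result of one step into the next.
chain_sketch(X, steps...) = foldl((acc, step) -> apply_step(step, acc), steps; init = X)

X = [1.0     missing 3.0;
     missing missing 6.0;
     7.0     missing 9.0;
     10.0    11.0    12.0]

# Drop variables that are more than half missing, then drop incomplete rows.
result = chain_sketch(X, DropVarsSketch(0.5), DropObsSketch())
```

Dropping sparse variables first means fewer observations are lost in the later `DropObs` step, which is presumably why the order in the proposed chain matters.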

Before we make those changes we should probably:

deprecate the impute(X, :method) functions first
switch the matrix orientation

rofinn · 2019-09-02T05:21:04Z

As an extension to the above proposed changes, we may want to define a separate module for imputation iterators. This would address issues related to mutation inconsistency and Context usage by encapsulating most of the base behaviour in a collection of iterators that don't support mutation and have a reduced API. For more complex cases, we should just use a Dataset type which stores a mask of the missing values along with the original and imputed datasets. We can always provide helpful methods for testing missing data patterns, but those will likely require multiple passes over the data anyway.

Iterators

Only makes 1 pass through the data (even with chaining)
Doesn't explicitly make a copy of the data, but also doesn't mutate the underlying data.
Takes an ismissing function
Takes a limit value to error if there are too many missing values
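
A rough sketch of an iterator with the properties listed above, using LOCF as the strategy. `LOCFIter`, its `is_missing` field and its `limit` field are hypothetical names, and counting missing values as an absolute number (rather than a fraction) is an assumption made to keep the example single-pass.

```julia
# Hypothetical lazy LOCF iterator for illustration; not the Impute.jl API.
struct LOCFIter{I, F}
    data::I
    is_missing::F   # user-supplied predicate deciding what counts as missing
    limit::Int      # number of missing values tolerated before erroring
end

# Convenience constructor with defaults for the predicate and the limit.
LOCFIter(data; is_missing = ismissing, limit = typemax(Int)) =
    LOCFIter(data, is_missing, limit)

# The wrapped data may not have a known length, so do not promise one.
Base.IteratorSize(::Type{<:LOCFIter}) = Base.SizeUnknown()

# State: (next result from the wrapped iterator, last observed value, missing count).
function Base.iterate(it::LOCFIter, state = (iterate(it.data), missing, 0))
    next, last, nmissing = state
    next === nothing && return nothing
    value, inner = next
    if it.is_missing(value)
        nmissing += 1
        nmissing > it.limit && error("more than $(it.limit) missing values")
        value = last            # carry the last observed value forward
    else
        last = value
    end
    return value, (iterate(it.data, inner), last, nmissing)
end

data = [missing, 1, missing, missing, 4]

filled = collect(LOCFIter(data))

# A custom predicate lets a sentinel value stand in for missing.
sentinel = collect(LOCFIter([5, -999, 7]; is_missing = x -> x == -999))
```

Because each wrapper pulls one element at a time from the iterator it wraps, nesting two of them still walks the underlying data once, and `data` itself is never written to.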

Datasets

Construct missingness masks for original dataset (w/ error conditions)
Impute values and store them in a sparse array
Support complete, merge and analyse operations
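
The Dataset idea above might be sketched like this. The names `IDatasetSketch`, `impute_mean!` and `complete_sketch` are hypothetical, column-mean imputation is just a placeholder strategy, the `limit` keyword is an assumed form of the error condition, and the merge and analyse operations are not covered.

```julia
using SparseArrays  # standard library

# Hypothetical dataset type for illustration; not the Impute.jl API.
struct IDatasetSketch{T}
    original::Matrix{Union{Missing, T}}
    mask::BitMatrix                    # true where the original value is missing
    imputed::SparseMatrixCSC{T, Int}   # imputed values, stored only where needed
end

# Build the mask once and error if too large a fraction of values is missing.
function IDatasetSketch(X::Matrix{Union{Missing, T}}; limit = 1.0) where {T}
    mask = ismissing.(X)
    count(mask) / length(mask) <= limit || error("too many missing values")
    return IDatasetSketch{T}(X, mask, spzeros(T, size(X)...))
end

# Placeholder strategy: fill each masked cell with the mean of its column.
function impute_mean!(ds::IDatasetSketch)
    for j in axes(ds.original, 2)
        observed = collect(skipmissing(view(ds.original, :, j)))
        isempty(observed) && continue
        colmean = sum(observed) / length(observed)
        for i in axes(ds.original, 1)
            ds.mask[i, j] && (ds.imputed[i, j] = colmean)
        end
    end
    return ds
end

# Combine original and imputed values into a dense matrix, using the mask
# (not the sparsity pattern) to decide which source each cell comes from.
complete_sketch(ds::IDatasetSketch{T}) where {T} =
    T[ds.mask[i] ? ds.imputed[i] : ds.original[i] for i in CartesianIndices(ds.original)]

X = [1.0     missing;
     3.0     4.0;
     missing 8.0]

ds = impute_mean!(IDatasetSketch(X))
completed = complete_sketch(ds)
```

Keeping the original values, the mask and the imputed values separate means the original data is never overwritten, so several imputations could be compared against the same mask.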

rofinn · 2020-03-06T17:49:13Z

The Iterators API didn't work out. Ensuring reasonable performance for even the current list of imputation strategies was challenging, and I don't know how much benefit it's likely to have. I think any future efforts would be better spent simplifying the current API, so folks can define their own methods more easily.

#60

This was referenced Jul 4, 2019

API simplification #22

Closed

Change matrix orientation #23

Closed

Introduce dropvars #24

Closed

rofinn closed this as completed Mar 6, 2020
