Switch to Tables.jl API #20

Merged
merged 34 commits into from
Jul 15, 2019
Changes from 24 commits
1d85605
Started work on using Tables API.
rofinn Jul 2, 2019
78ba17f
Fixed up Context code to better fit with Tables interface changes.
rofinn Jul 3, 2019
37b7ef2
Tests and bug fixes for working with Context types directly.
rofinn Jul 3, 2019
45dfea9
Simplify exports deprecation.
rofinn Jul 4, 2019
7c6ceed
API simplification.
rofinn Jul 5, 2019
f54e2e2
Fix automerge on Project.toml
rofinn Jul 5, 2019
a0ab2ea
Drop 0.7 tests and add the deprecated file.
rofinn Jul 5, 2019
0b2bbe7
Added a deprecation for switching to the column-major convention.
rofinn Jul 5, 2019
1f99bbd
Updated tests to new API and moved existing deprecated tests to a dif…
rofinn Jul 5, 2019
20f084e
Added some more tests for Chain and mutating methods.
rofinn Jul 7, 2019
aedd1ab
Introduce dropobs and dropvars and deprecate Drop.
rofinn Jul 8, 2019
5f1f4d8
Add a test for broadcasted imputation over a groupby.
rofinn Jul 8, 2019
e512171
Review changes.
rofinn Jul 9, 2019
83a4bf5
Introduce a vardim kwarg to make the column-major convention easier t…
rofinn Jul 9, 2019
7f90aad
Cleanup docstrings and add jldoctests.
rofinn Jul 10, 2019
0d97c08
Remove test REQUIRE file.
rofinn Jul 10, 2019
eecc2d4
Cleanup docs in README and index page.
rofinn Jul 10, 2019
3edef07
More PR review cleanup.
rofinn Jul 11, 2019
9254ebf
Switched impute!(imp, data) -> impute!(data, imp)
rofinn Jul 11, 2019
2fece06
Remove matrix orientation deprecation.
rofinn Jul 11, 2019
f77421f
Update test/runtests.jl
rofinn Jul 11, 2019
e3ddd08
Update src/imputors.jl
rofinn Jul 11, 2019
cadd28d
Update src/imputors.jl
rofinn Jul 11, 2019
81fc7f8
Missed PR review fixes.
rofinn Jul 11, 2019
4b18a0d
Update src/imputors.jl
rofinn Jul 12, 2019
fafe219
Update src/context.jl
rofinn Jul 12, 2019
8f0f4b6
Throw MethodErrors in fallback table methods.
rofinn Jul 12, 2019
d8b51d4
Update src/imputors/fill.jl
rofinn Jul 15, 2019
7c70227
Update src/context.jl
rofinn Jul 15, 2019
d5ff2c5
Use selectdim for obswise and varwise.
rofinn Jul 15, 2019
5591076
Use ∘ in tests to compose imputor pipelines.
rofinn Jul 15, 2019
ec902fe
Change !any(ismissing, ...) tests to all(!ismissing, ...)
rofinn Jul 15, 2019
e823cc2
Restrict RDatasets to >=0.6.2
rofinn Jul 15, 2019
051d6ce
Don't pipe to materializer.
rofinn Jul 15, 2019
1 change: 0 additions & 1 deletion .appveyor.yml
@@ -1,6 +1,5 @@
environment:
matrix:
- julia_version: 0.7
- julia_version: 1.0
- julia_version: nightly

26 changes: 10 additions & 16 deletions .travis.yml
@@ -4,7 +4,6 @@ os:
- linux
- osx
julia:
- 0.7
- 1.0
- nightly
notifications:
@@ -18,18 +17,13 @@ matrix:
# - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
# - julia -e 'Pkg.clone(pwd()); Pkg.build("Impute"); Pkg.test("Impute"; coverage=true)'
after_success:
- |
julia -e '
VERSION >= v"0.7.0-DEV.3656" && using Pkg
VERSION >= v"0.7.0-DEV.5183" || cd(Pkg.dir("Impute"))
Pkg.add("Coverage")
using Coverage
Codecov.submit(Codecov.process_folder())
'
- |
julia -e '
VERSION >= v"0.7.0-DEV.3656" && using Pkg
VERSION >= v"0.7.0-DEV.5183" || cd(Pkg.dir("Impute"))
Pkg.add("Documenter")
include(joinpath("docs", "make.jl"))
'
- julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Codecov.submit(process_folder())'
jobs:
include:
- stage: "Documentation"
julia: 1.0
os: linux
script:
- julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- julia --project=docs/ docs/make.jl
after_success: skip
11 changes: 8 additions & 3 deletions Project.toml
@@ -4,16 +4,21 @@ authors = ["Invenia Technical Computing"]
version = "0.2.0"

[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
DataFrames = "0.17, 0.18"
DataFrames = ">= 0.16"
IterTools = "1.2"
Tables = "0.2"
julia = "1"

[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["RDatasets", "Test"]
test = ["DataFrames", "RDatasets", "Test"]
116 changes: 98 additions & 18 deletions README.md
@@ -5,30 +5,110 @@
[![Build status](https://ci.appveyor.com/api/projects/status/github/invenia/Impute.jl?svg=true)](https://ci.appveyor.com/project/invenia/Impute-jl)
[![codecov](https://codecov.io/gh/invenia/Impute.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/invenia/Impute.jl)

Impute.jl provides various data imputation methods for `Arrays` and `DataFrames` with various types of missing values.
Impute.jl provides various methods for handling missing data in Vectors, Matrices and [Tables](https://github.com/JuliaData/Tables.jl).

## Installation
```julia
Pkg.clone("https://github.com/invenia/Impute.jl")
julia> using Pkg; Pkg.add("Impute")
```

## Features
* Operate over Vectors, Matrices or DataFrames
* Chaining of methods
## Quickstart
Let's start by loading our dependencies:
```julia
julia> using DataFrames, RDatasets, Impute
```

We'll also want some test data containing missings to work with:

```julia
julia> df = dataset("boot", "neuro")
469×6 DataFrames.DataFrame
│ Row │ V1 │ V2 │ V3 │ V4 │ V5 │ V6 │
│ │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1 │ missing │ -203.7 │ -84.1 │ 18.5 │ missing │ missing │
│ 2 │ missing │ -203.0 │ -97.8 │ 25.8 │ 134.7 │ missing │
│ 3 │ missing │ -249.0 │ -92.1 │ 27.8 │ 177.1 │ missing │
│ 4 │ missing │ -231.5 │ -97.5 │ 27.0 │ 150.3 │ missing │
│ 5 │ missing │ missing │ -130.1 │ 25.8 │ 160.0 │ missing │
│ 6 │ missing │ -223.1 │ -70.7 │ 62.1 │ 197.5 │ missing │
│ 7 │ missing │ -164.8 │ -12.2 │ 76.8 │ 202.8 │ missing │
│ 462 │ missing │ -207.3 │ -88.3 │ 9.6 │ 104.1 │ 218.0 │
│ 463 │ -242.6 │ -142.0 │ -21.8 │ 69.8 │ 148.7 │ missing │
│ 464 │ -235.9 │ -128.8 │ -33.1 │ 68.8 │ 177.1 │ missing │
│ 465 │ missing │ -140.8 │ -38.7 │ 58.1 │ 186.3 │ missing │
│ 466 │ missing │ -149.5 │ -40.3 │ 62.8 │ 139.7 │ 242.5 │
│ 467 │ -247.6 │ -157.8 │ -53.3 │ 28.3 │ 122.9 │ 227.6 │
│ 468 │ missing │ -154.9 │ -50.8 │ 28.1 │ 119.9 │ 201.1 │
│ 469 │ missing │ -180.7 │ -70.9 │ 33.7 │ 114.8 │ 222.5 │
```

## Methods
Our first instinct might be to drop every observation that contains a missing value, but this leaves us with too few rows to work with (4 of 469):

* drop - remove missing
* locf - last observation carried forward
* nocb - next observation carried backward
* interp - linear interpolation of values in vector
* fill - replace with a specific value or a function which returns a value given the existing vector with missing values dropped.
```julia
julia> Impute.drop(df)
4×6 DataFrames.DataFrame
│ Row │ V1 │ V2 │ V3 │ V4 │ V5 │ V6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -247.0 │ -132.2 │ -18.8 │ 28.2 │ 81.4 │ 237.9 │
│ 2 │ -234.0 │ -140.8 │ -56.5 │ 28.0 │ 114.3 │ 222.9 │
│ 3 │ -215.8 │ -114.8 │ -18.4 │ 65.3 │ 171.6 │ 249.7 │
│ 4 │ -247.6 │ -157.8 │ -53.3 │ 28.3 │ 122.9 │ 227.6 │
```
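
To make the row-dropping behaviour concrete, here is a minimal sketch in plain Julia on a small matrix. It is a hand-rolled illustration and not Impute.jl's implementation; the names `data`, `complete` and `kept` are made up for this example.

```julia
# Hand-rolled illustration only; not Impute.jl's implementation.
# Rows are observations, columns are variables.
data = [1.0 missing; 2.0 3.0; missing 4.0; 5.0 6.0]

# One Bool per row: true when the row has no missing entries.
complete = [all(!ismissing, data[i, :]) for i in axes(data, 1)]

# Logical indexing copies the selected rows into a new matrix.
kept = data[complete, :]
```

Only the second and fourth rows survive here, which mirrors how quickly a dataset shrinks when missings are spread across many rows.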

## TODO
We could try imputing the values with linear interpolation, but that still leaves missing
data at the head and tail of our dataset:

```julia
julia> Impute.interp(df)
469×6 DataFrames.DataFrame
│ Row │ V1 │ V2 │ V3 │ V4 │ V5 │ V6 │
│ │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1 │ missing │ -203.7 │ -84.1 │ 18.5 │ missing │ missing │
│ 2 │ missing │ -203.0 │ -97.8 │ 25.8 │ 134.7 │ missing │
│ 3 │ missing │ -249.0 │ -92.1 │ 27.8 │ 177.1 │ missing │
│ 4 │ missing │ -231.5 │ -97.5 │ 27.0 │ 150.3 │ missing │
│ 5 │ missing │ -227.3 │ -130.1 │ 25.8 │ 160.0 │ missing │
│ 6 │ missing │ -223.1 │ -70.7 │ 62.1 │ 197.5 │ missing │
│ 7 │ missing │ -164.8 │ -12.2 │ 76.8 │ 202.8 │ missing │
│ 462 │ -241.025 │ -207.3 │ -88.3 │ 9.6 │ 104.1 │ 218.0 │
│ 463 │ -242.6 │ -142.0 │ -21.8 │ 69.8 │ 148.7 │ 224.125 │
│ 464 │ -235.9 │ -128.8 │ -33.1 │ 68.8 │ 177.1 │ 230.25 │
│ 465 │ -239.8 │ -140.8 │ -38.7 │ 58.1 │ 186.3 │ 236.375 │
│ 466 │ -243.7 │ -149.5 │ -40.3 │ 62.8 │ 139.7 │ 242.5 │
│ 467 │ -247.6 │ -157.8 │ -53.3 │ 28.3 │ 122.9 │ 227.6 │
│ 468 │ missing │ -154.9 │ -50.8 │ 28.1 │ 119.9 │ 201.1 │
│ 469 │ missing │ -180.7 │ -70.9 │ 33.7 │ 114.8 │ 222.5 │
```
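
The reason the head and tail stay missing can be seen in a small hand-rolled sketch of linear interpolation on a vector. This is an assumption-laden illustration in plain Julia, not Impute.jl's implementation; `interp_sketch` is a hypothetical name.

```julia
# Hand-rolled illustration only; not Impute.jl's implementation.
# Interior runs of `missing` are filled on the straight line between the
# nearest observed neighbours. Leading and trailing missings are left alone
# because they have an observed neighbour on one side only.
function interp_sketch(x::AbstractVector{<:Union{Missing, Float64}})
    out = Vector{Union{Missing, Float64}}(x)   # work on a copy
    known = findall(!ismissing, out)           # indices of observed values
    for (lo, hi) in zip(known[1:end-1], known[2:end])
        hi - lo > 1 || continue                # adjacent points, no gap to fill
        slope = (out[hi] - out[lo]) / (hi - lo)
        for i in (lo + 1):(hi - 1)
            out[i] = out[lo] + slope * (i - lo)
        end
    end
    return out
end

v = [missing, 1.0, missing, missing, 4.0, missing]
result = interp_sketch(v)
```

In this sketch the two interior gaps become `2.0` and `3.0`, while the first and last entries remain `missing`, matching the pattern in the table above.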

Finally, we can chain multiple simple methods together to give a complete dataset:

```julia
julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrames.DataFrame
│ Row │ V1 │ V2 │ V3 │ V4 │ V5 │ V6 │
│ │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1 │ -233.6 │ -203.7 │ -84.1 │ 18.5 │ 134.7 │ 222.7 │
│ 2 │ -233.6 │ -203.0 │ -97.8 │ 25.8 │ 134.7 │ 222.7 │
│ 3 │ -233.6 │ -249.0 │ -92.1 │ 27.8 │ 177.1 │ 222.7 │
│ 4 │ -233.6 │ -231.5 │ -97.5 │ 27.0 │ 150.3 │ 222.7 │
│ 5 │ -233.6 │ -227.3 │ -130.1 │ 25.8 │ 160.0 │ 222.7 │
│ 6 │ -233.6 │ -223.1 │ -70.7 │ 62.1 │ 197.5 │ 222.7 │
│ 7 │ -233.6 │ -164.8 │ -12.2 │ 76.8 │ 202.8 │ 222.7 │
│ 462 │ -241.025 │ -207.3 │ -88.3 │ 9.6 │ 104.1 │ 218.0 │
│ 463 │ -242.6 │ -142.0 │ -21.8 │ 69.8 │ 148.7 │ 224.125 │
│ 464 │ -235.9 │ -128.8 │ -33.1 │ 68.8 │ 177.1 │ 230.25 │
│ 465 │ -239.8 │ -140.8 │ -38.7 │ 58.1 │ 186.3 │ 236.375 │
│ 466 │ -243.7 │ -149.5 │ -40.3 │ 62.8 │ 139.7 │ 242.5 │
│ 467 │ -247.6 │ -157.8 │ -53.3 │ 28.3 │ 122.9 │ 227.6 │
│ 468 │ -247.6 │ -154.9 │ -50.8 │ 28.1 │ 119.9 │ 201.1 │
│ 469 │ -247.6 │ -180.7 │ -70.9 │ 33.7 │ 114.8 │ 222.5 │
```
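
As a rough picture of why the chain above leaves no gaps, the sketch below hand-rolls the two carry steps on a vector in plain Julia. It is not Impute.jl's implementation; `locf_sketch` and `nocb_sketch` are hypothetical names used only here.

```julia
# Hand-rolled illustration only; not Impute.jl's implementation.
# locf: carry the last observed value forward over missings.
function locf_sketch(x::Vector)
    out = copy(x)
    for i in 2:length(out)
        if ismissing(out[i]) && !ismissing(out[i - 1])
            out[i] = out[i - 1]
        end
    end
    return out
end

# nocb: carry the next observed value backward (locf on the reversed vector).
nocb_sketch(x::Vector) = reverse(locf_sketch(reverse(x)))

v = [missing, 1.0, missing, 4.0, missing]
filled = v |> locf_sketch |> nocb_sketch
```

The forward pass fills the trailing gap, and the backward pass fills the leading one, so any vector with at least one observed value ends up complete.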

* Dropping rows in a matrix allocates extra memory (i.e., `data[mask, :]` makes a copy).
* More sophisticated imputation methods
1. MICE
2. EM
3. kNN
4. Regression
**Warning**: Your approach should depend on the properties of your data (e.g., [MCAR, MAR, MNAR](https://en.wikipedia.org/wiki/Missing_data#Types_of_missing_data)).
9 changes: 9 additions & 0 deletions docs/Project.toml
@@ -0,0 +1,9 @@
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Impute = "f7bf1975-0170-51b9-8c5f-a992d46b9575"
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b"

[compat]
DataFrames = ">= 0.16"
Documenter = "~0.22"
2 changes: 1 addition & 1 deletion docs/make.jl
@@ -1,4 +1,4 @@
using Documenter, Impute, RDatasets
using Documenter, Impute

makedocs(
modules=[Impute],