[Fea] Data imputation limited by null conversion #2966

Closed
wphicks opened this issue Oct 13, 2020 · 2 comments
Labels: bug, Cython / Python, feature request

Comments

@wphicks
Contributor

wphicks commented Oct 13, 2020

Is your feature request related to a problem? Please describe.
In sklearn, a fairly common data imputation workflow might look something like this:

import numpy as np
import pandas
from sklearn.impute import SimpleImputer

df = pandas.DataFrame(data=[[7, 2, 3], [4, None, 6], [10, 5, 9]])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(df)

The RAPIDS equivalent would look something like this:

import cupy as cp
import cudf
from cuml.experimental.preprocessing import SimpleImputer

df = cudf.DataFrame(data=[[7, 2, 3], [4, None, 6], [10, 5, 9]])
imp = SimpleImputer(missing_values=cp.nan, strategy='mean')
imp.fit_transform(df)

Under the hood, we try to convert the cudf DataFrame to a cupy array, which fails because of null values in the DataFrame. This severely limits the usefulness of our data imputation methods.

Describe the solution you'd like
We can fix this either in cuml through special handling of DataFrame input or in cudf by providing some infrastructure for dealing with null values when we convert to cupy, though that may also require cupy changes (possibly related: cudf/5754).
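As a rough illustration of what "special handling of DataFrame input" could mean, here is a minimal pure-Python sketch that promotes every value to float and writes NaN wherever a null appears, so the result is dense and an imputer configured with missing_values=nan can consume it. The helper names `nulls_to_nan` and `frame_to_dense` are hypothetical and plain lists stand in for GPU arrays; this is not the cuml or cudf API, and not necessarily what the eventual fix does.

```python
import math
from typing import List, Optional, Sequence


def nulls_to_nan(row: Sequence[Optional[float]]) -> List[float]:
    """Hypothetical helper: return a dense float row with NaN in place of nulls."""
    return [math.nan if value is None else float(value) for value in row]


def frame_to_dense(rows: Sequence[Sequence[Optional[float]]]) -> List[List[float]]:
    """Convert row-major data containing None entries into dense float rows."""
    return [nulls_to_nan(row) for row in rows]


# Same data as the DataFrame in the examples above.
rows = [[7, 2, 3], [4, None, 6], [10, 5, 9]]
dense = frame_to_dense(rows)
```

Note the trade-off: integer columns become floating point, which changes the dtype and can lose exactness for very large integers.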

Describe alternatives you've considered
For floating-point data, we can call fillna(cp.nan) before running data imputation. For integers, we would have to either know of an integer value that cannot appear in the data or generate one.
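The integer case can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helpers `pick_unused_sentinel`, `fill_nulls`, and `impute_mean` are hypothetical names, plain lists stand in for a cudf column, and `impute_mean` mimics an imputer that is told which value marks missing entries.

```python
from typing import List, Optional, Sequence


def pick_unused_sentinel(values: Sequence[Optional[int]]) -> int:
    """Return an integer that does not occur among the non-null values."""
    present = {v for v in values if v is not None}
    # One below the minimum cannot be in the data. With fixed-width integer
    # dtypes this could overflow; Python integers do not.
    return min(present) - 1 if present else 0


def fill_nulls(values: Sequence[Optional[int]], sentinel: int) -> List[int]:
    """Replace nulls with the sentinel so the column has no missing slots."""
    return [sentinel if v is None else v for v in values]


def impute_mean(values: Sequence[int], missing_value: int) -> List[float]:
    """Replace entries equal to missing_value with the mean of the others."""
    observed = [v for v in values if v != missing_value]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v == missing_value else float(v) for v in values]


column = [2, None, 5]                      # middle column of the example data
sentinel = pick_unused_sentinel(column)    # 1
filled = fill_nulls(column, sentinel)      # [2, 1, 5]
imputed = impute_mean(filled, missing_value=sentinel)  # [2.0, 3.5, 5.0]
```

The cost of this approach is an extra pass over the data to find a safe sentinel, plus the need to pass that sentinel to the imputer as its missing-value marker.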

wphicks added the "feature request" and "? - Needs Triage" labels on Oct 13, 2020
@wphicks
Contributor Author

wphicks commented Oct 13, 2020

@viclafargue: Added this issue to follow up from our external discussion.

viclafargue added the "bug" and "Cython / Python" labels and removed the "? - Needs Triage" label on Oct 13, 2020
viclafargue self-assigned this on Oct 13, 2020
@viclafargue
Contributor

Solved with #3194
