[FEA] Feature request: train_test_split to accept numpy inputs #1619

Closed
rmccorm4 opened this issue Jan 31, 2020 · 3 comments
Labels
Cython / Python, feature request, good first issue


rmccorm4 commented Jan 31, 2020

Describe the bug

cuml.preprocessing.model_selection.train_test_split raises the following error when passed NumPy array inputs:

UnboundLocalError: local variable 'X_train' referenced before assignment
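For readers unfamiliar with this error: an UnboundLocalError of this kind usually means a function assigns a local variable only inside type-specific branches and then returns it after none of the branches matched. The sketch below reproduces that pattern in plain Python. It is an illustration under that assumption, not the actual cuML source; `FakeDeviceArray` and `split_without_fallback` are hypothetical names.

```python
class FakeDeviceArray:
    """Hypothetical stand-in for a supported GPU input type (not a real cuML class)."""

    def __init__(self, values):
        self.values = list(values)


def split_without_fallback(X, test_size=0.2):
    """Illustrative only: X_train and X_test are assigned in a single branch."""
    if isinstance(X, FakeDeviceArray):
        n_test = int(len(X.values) * test_size)
        split_at = len(X.values) - n_test
        X_train = X.values[:split_at]
        X_test = X.values[split_at:]
    # There is no else branch, so any other input type skips the assignments
    # above and the return statement fails with UnboundLocalError.
    return X_train, X_test


# Supported type: the branch runs and both names are bound.
train, test = split_without_fallback(FakeDeviceArray(range(10)))

# Unsupported type (a plain list here): no branch runs.
try:
    split_without_fallback(list(range(10)))
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

The exact wording of the message depends on the Python version, but the exception type is the same.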

Steps/Code to reproduce bug

Reproduced with both cuml==0.11 (nvcr.io/nvidia/rapidsai/rapidsai:0.11-cuda10.0-runtime-ubuntu18.04) and cuml==0.12 built from source, using the following minimal snippet:

import cuml
import cudf
import numpy as np
import pandas as pd
from cuml.preprocessing.model_selection import train_test_split

# Desired parameters
max_depth = 20
n_trees = 30
rows, cols = 10000, 74

# Generate fake data for example's sake
x = np.random.random((rows, cols))
df = pd.DataFrame(x, columns=["C{}".format(i) for i in range(cols)])
X = df.drop(columns=['C2']).to_numpy().astype(np.float32)
y = df['C2'].astype(np.float32)

# Error when passed numpy arrays (not cuda_array or cudf df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Expected behavior

Either an error that numpy input isn't allowed, or better handling/casting inside of the train_test_split function.
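The first option (a clear error for unsupported input) could look roughly like the sketch below. It assumes that supported device arrays expose the `__cuda_array_interface__` attribute and that cuDF objects come from a module whose top-level name is `cudf`. The helper names `is_device_input` and `check_split_inputs` are hypothetical and not part of cuML.

```python
def is_device_input(obj):
    """Return True if obj looks like a GPU-resident input.

    Assumptions: device arrays expose __cuda_array_interface__, and cuDF
    objects are defined in a module whose top-level package is "cudf".
    """
    if hasattr(obj, "__cuda_array_interface__"):
        return True
    return type(obj).__module__.split(".")[0] == "cudf"


def check_split_inputs(X, y):
    """Raise a descriptive TypeError instead of failing later in the split."""
    for name, value in (("X", X), ("y", y)):
        if not is_device_input(value):
            raise TypeError(
                f"{name} must be a cudf.DataFrame or a cuda_array_interface "
                f"compliant device array, got {type(value).__name__}"
            )


# Host-side inputs (plain lists here) are rejected up front.
try:
    check_split_inputs([[1.0, 2.0]], [0.0])
except TypeError as exc:
    print(exc)
```

Calling such a check at the top of the split function would turn the confusing UnboundLocalError into a message that names the offending argument and the accepted types.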

For reference, the following code works around the error by converting the inputs to cuDF DataFrames before splitting:

import cuml
import cudf
import numpy as np
import pandas as pd
from cuml.preprocessing.model_selection import train_test_split

# Desired parameters
max_depth = 20
n_trees = 30
rows, cols = 10000, 74

# Generate fake data for example's sake
x = np.random.random((rows, cols))
df = pd.DataFrame(x, columns=["C{}".format(i) for i in range(cols)])
X = df.drop(columns=['C2']).to_numpy().astype(np.float32)
y = df['C2'].astype(np.float32)

Xdf = pd.DataFrame(X)
ydf = pd.DataFrame(y)
Xgdf = cudf.DataFrame.from_pandas(Xdf)
ygdf = cudf.DataFrame.from_pandas(ydf)
X_train, X_test, y_train, y_test = train_test_split(Xgdf, ygdf, test_size=0.2)

Environment details (please complete the following information):

  • Environment location: Docker nvcr.io/nvidia/rapidsai/rapidsai:0.11-cuda10.0-runtime-ubuntu18.04
  • Linux Distro/Architecture: Ubuntu 18.04
  • GPU Model/Driver: V100 driver 418.67
  • CUDA: 10.0
  • Method of cuDF & cuML install: [conda, Docker, or from source]
    • pre-built in container for v0.11
    • built from source for v0.12:
# Use RAPIDS container for easier reproducibility
nvidia-docker run -it -v `pwd`:/mnt --workdir=/mnt nvcr.io/nvidia/rapidsai/rapidsai:0.11-cuda10.0-runtime-ubuntu18.04
# Clone CUML source
git clone https://github.com/rapidsai/cuml
# Switch to 0.12 branch (run inside the cloned repository)
git -C cuml checkout branch-0.12
# Build dev env for 0.12 branch
conda env create --name cuml_dev --file /mnt/cuml/conda/environments/cuml_dev_cuda10.0.yml
conda activate cuml_dev
# Install cuml 0.12
conda install cuml

Additional context
Discovered while trying to repro this issue: #1467 (comment)

rmccorm4 added the "? - Needs Triage" and "bug" labels on Jan 31, 2020
JohnZed removed the "? - Needs Triage" label on Feb 3, 2020

dantegd (Member) commented Feb 3, 2020

@rmccorm4 right now our train_test_split does not support NumPy inputs. From the docs:

 X : cudf.DataFrame or cuda_array_interface compliant device array
        Data to split, has shape (n_samples, n_features)

I'll change the title and tag this issue as a feature request

dantegd changed the title from "[BUG] UnboundLocalError: local variable 'X_train' referenced before assignment" to "[FEA] Feature request: train_test_split to accept numpy inputs" on Feb 3, 2020
dantegd added the "Cython / Python", "feature request", and "good first issue" labels and removed the "bug" label on Feb 3, 2020
dantegd removed their assignment on Feb 3, 2020

sxz294 commented Apr 11, 2021

Hi, I'd like to work on this ticket as a first-time contributor.

dantegd (Member) commented May 10, 2024

Closing, this has been added in #5873

dantegd closed this as completed on May 10, 2024