Support CPU object for train_test_split #5873

Merged: 10 commits, Apr 30, 2024

isVoid (Contributor) commented Apr 29, 2024

This PR adds support for CPU objects to train_test_split, leveraging the
input conversion tools defined in input_utils.py. It also adds an
output_to_df_obj_like API that converts a CumlArray back to a series/dataframe,
matching the metadata of the input.
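
As a rough illustration of what matching metadata from the input could involve, the sketch below rebuilds a pandas object from a plain array. The helper name `restore_like` and its signature are hypothetical and are not the `output_to_df_obj_like` API added in this PR; a host NumPy array stands in for a CumlArray, and only column labels and the series name are carried over.

```python
import numpy as np
import pandas as pd


def restore_like(values, like):
    """Hypothetical helper: wrap `values` in the same container type as `like`.

    Not cuML's implementation; a NumPy array stands in for a CumlArray.
    """
    values = np.asarray(values)
    if isinstance(like, pd.DataFrame):
        # Reuse the column labels; rows get a fresh default index.
        return pd.DataFrame(values, columns=like.columns)
    if isinstance(like, pd.Series):
        # Reuse the series name.
        return pd.Series(values, name=like.name)
    # Any other input type is returned as a plain array.
    return values


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
out = restore_like(df.to_numpy()[:2], df)
```

Dispatching on the type of the original input keeps the caller's container type stable across the split, which is the behavior the description above asks for.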

In the meantime, this PR reimplements the majority of train_test_split by
centralizing index computation and gathering. This reduces the number of
kernels launched, especially when stratify keys are provided.
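
The idea of centralizing index computation and gathering can be sketched on the host with NumPy. This is only an illustration under stated assumptions, not the code in `_split.py`: the function names `split_indices` and `gather` are hypothetical, and the real implementation runs on the GPU. The train and test indices are computed once (per stratum when stratify keys are given) and then reused to gather from every input.

```python
import numpy as np


def split_indices(n_samples, test_size=0.25, stratify=None, seed=0):
    """Return (train_idx, test_idx), computed once for all inputs."""
    rng = np.random.default_rng(seed)
    if stratify is None:
        perm = rng.permutation(n_samples)
        n_test = int(round(n_samples * test_size))
        return perm[n_test:], perm[:n_test]
    stratify = np.asarray(stratify)
    train_parts, test_parts = [], []
    for key in np.unique(stratify):
        # Shuffle the row positions belonging to this stratum.
        members = rng.permutation(np.flatnonzero(stratify == key))
        n_test = int(round(len(members) * test_size))
        test_parts.append(members[:n_test])
        train_parts.append(members[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


def gather(arrays, train_idx, test_idx):
    """Apply the same two index arrays to every input."""
    out = []
    for arr in arrays:
        arr = np.asarray(arr)
        out.extend([arr[train_idx], arr[test_idx]])
    return out


X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
train_idx, test_idx = split_indices(len(y), test_size=0.25, stratify=y)
X_train, X_test, y_train, y_test = gather([X, y], train_idx, test_idx)
```

Because the indices are built once, each additional input costs one gather per split rather than a fresh round of shuffling and masking.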

Closes #5619

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Apr 29, 2024
@isVoid isVoid marked this pull request as ready for review April 29, 2024 14:00
@isVoid isVoid requested a review from a team as a code owner April 29, 2024 14:00
dantegd (Member) left a comment

Overall things look great! Love the code cleaning included in the PR as well! Just had one question


cuda = gpu_only_import_from("numba", "cuda")

- test_array_input_types = ["numba", "cupy"]
+ test_array_input_types = ["numba", "cupy", "cudf", "pandas"]

Code looks great! But in the tests we're only testing dataframes; should we add tests for series as well?

isVoid (Contributor, Author) replied:

Existing tests have hardcoded 2-d array inputs that can't be naturally cast to Series. I parametrized them with 1-d arrays and a fixture that converts to a DataFrame or Series based on the data shape.
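
A shape-based conversion of this kind might look like the sketch below. The helper name `to_pandas_by_shape` is hypothetical and is not the fixture in the PR's test suite; it only shows the 1-d to Series and 2-d to DataFrame dispatch, and it could be wrapped in a pytest fixture.

```python
import numpy as np
import pandas as pd


def to_pandas_by_shape(data):
    """Hypothetical test helper: 1-d data -> Series, 2-d data -> DataFrame."""
    data = np.asarray(data)
    if data.ndim == 1:
        return pd.Series(data)
    if data.ndim == 2:
        return pd.DataFrame(data)
    raise ValueError(f"expected 1-d or 2-d data, got {data.ndim}-d")


series = to_pandas_by_shape(np.arange(5))
frame = to_pandas_by_shape(np.arange(10).reshape(5, 2))
```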

@isVoid isVoid requested a review from dantegd April 30, 2024 06:55
@dantegd dantegd added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 30, 2024
dantegd (Member) commented Apr 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 1609fcd into rapidsai:branch-24.06 Apr 30, 2024
59 checks passed
Labels: Cython / Python, improvement, non-breaking
Linked issue: [BUG] train_test_split should accept CPU dataframe and array inputs
2 participants