First check for BaseDtype when inferring the data type of an arbitrary object #13295

Merged
merged 2 commits into rapidsai:branch-23.06
May 5, 2023

@shwina shwina commented May 4, 2023

We have an internal utility called dtype() that attempts to infer the data type of an arbitrary object. One of the first things dtype() does is attempt to call np.dtype(obj). That can be slow for categorical data types with extremely large cardinality, because it copies data to host (in particular, it attempts to call the object's __repr__):

Before this PR:

import cudf
dtype = cudf.CategoricalDtype(categories=range(100_000_000))
%time x = cudf.core.dtypes.dtype(dtype)
CPU times: user 3.75 s, sys: 885 ms, total: 4.64 s
Wall time: 4.63 s

This PR ensures we attempt the far less expensive inference first (checking whether the object is already a BaseDtype) before calling np.dtype(...).

After this PR:

%time x = cudf.core.dtypes.dtype(dtype)
CPU times: user 13 µs, sys: 1 µs, total: 14 µs
Wall time: 19.1 µs
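
The reordering described above can be illustrated with a minimal, library-free sketch. It is not the cuDF implementation: BaseDtype, BigCategoricalDtype, generic_dtype, infer_dtype_slow and infer_dtype_fast are all hypothetical stand-ins. generic_dtype plays the role of a generic conversion such as np.dtype(obj), on the assumption that it formats the object into its error message when it fails, which is where the costly __repr__ call comes from. A counter replaces the real cost so the effect is observable without a GPU.

```python
class BaseDtype:
    """Hypothetical stand-in for a library's own dtype base class."""


class BigCategoricalDtype(BaseDtype):
    """Hypothetical dtype whose repr is assumed to be expensive."""

    def __init__(self, n_categories):
        self.n_categories = n_categories
        self.repr_calls = 0  # counts how often the "expensive" repr runs

    def __repr__(self):
        # In a real library this could materialize every category on the host.
        self.repr_calls += 1
        return f"BigCategoricalDtype(n_categories={self.n_categories})"


def generic_dtype(obj):
    """Stand-in for a generic conversion such as np.dtype(obj)."""
    if isinstance(obj, str) and obj in {"int64", "float64"}:
        return obj
    # Formatting obj into the message calls obj.__repr__().
    raise TypeError(f"Cannot interpret {obj!r} as a data type")


def infer_dtype_slow(obj):
    """Old order: generic conversion first, library dtypes second."""
    try:
        return generic_dtype(obj)
    except TypeError:
        pass
    if isinstance(obj, BaseDtype):
        return obj
    raise TypeError("Cannot interpret object as a data type")


def infer_dtype_fast(obj):
    """New order: cheap isinstance check first, generic conversion last."""
    if isinstance(obj, BaseDtype):
        return obj
    return generic_dtype(obj)


slow_input = BigCategoricalDtype(100_000_000)
fast_input = BigCategoricalDtype(100_000_000)
infer_dtype_slow(slow_input)
infer_dtype_fast(fast_input)
print(slow_input.repr_calls, fast_input.repr_calls)
```

Both functions return the same object for a BaseDtype instance. Only the slow ordering triggers the repr, so the fast path's cost no longer depends on the number of categories.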

@shwina shwina requested a review from a team as a code owner May 4, 2023 20:08
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 4, 2023

@bdice bdice left a comment

Good catch. How did you come across this?

@shwina shwina added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels May 4, 2023

@wence- wence- left a comment

Ooof.

shwina commented May 5, 2023

/merge

@rapids-bot rapids-bot bot merged commit ceacfa4 into rapidsai:branch-23.06 May 5, 2023
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
3 participants