Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: map builtin str alias to StringDtype #59685

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

This ensures that the builtin type str is also mapped to the future default string dtype (if infer_string is enabled), i.e. to make .astype(str) behave the same as .astype('str").

In general we rely on numpy for mapping builtin types to dtypes. For example int is mapped to np.dtype("int64") by passing that value to np.dtype(..) in pandas_dtype().
But now that we have a default dtype that is mapped to one of those builtin types, we can no longer rely on numpy doing that. Implementation wise, for pandas_dtype(str) I currently added a special case checking explicitly for str and then importing the dtype lazily and returning it.

An alternative could be to handle this through the ExtensionDtype Registry.find(). However, that currently relies on using ExtensionDtype.construct_from_string() (i.e. we would have to expand the typing there to allow type objects as well next to strings, which is of course also possible). And maybe it is better to have control centrally about builtin aliases, instead of allowing external dtypes to map to those as well.

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Sep 2, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 2, 2024
if (
hasattr(arr, "dtype")
and arr.dtype.kind in "mM"
and not hasattr(arr, "_pa_array")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ugly, but essentially this is avoiding that astype(str) for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will use ensure_string_array for that implementation, causing a recursion error.

An alternative would be a better check than arr.dtype.kind in "mM" that restricts itself to DatetimeLikeArrayMixin (essentially I need a isinstance(arr, DatetimeLikeArrayMixin), I think). Could potentially do that with some ABC check.

Another alternative is to handle astype(str) specifically in ArrowExtensionArray (currently astype is not implemented there, and it falls back entirely on the base class implementation)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eventually it would be good to implement ArrowExtensionArray.astype so I would support eventually handing str there. Could you add a TODO comment here describing when this could be removed when astype is implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO comment

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment otherwise looks good

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great - some minor things

@@ -1448,6 +1450,9 @@ def is_extension_array_dtype(arr_or_dtype) -> bool:
elif isinstance(dtype, np.dtype):
return False
else:
# TODO ugly -> move into registry find()? Or make this work with pandas_dtype?
if dtype is str and using_string_dtype():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other places where maybe we should just have a function like is_builtin_string_dtype instead of this branch? I think its hard to mentally process what logic like this is doing - would be good to abstract that as much as we can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd I replaced the registry.find() with pandas_dtype(), then I don't need to handle the case of str here specifically because pandas_dtype() already does that.

In general pandas_dtype() should be our general preprocessing function. I only needed to suppress any warning or error here, because in this case if pandas_dtype fails then is_extension_array_dtype should return False.
(potentially that could be factored out into something like pandas_dtype_safe() if that would be needed in multiple places)

@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string):
[
(SparseDtype(int, 0), float, SparseDtype(float, 0.0)),
(SparseDtype(int, 1), float, SparseDtype(float, 1.0)),
(SparseDtype(int, 1), str, SparseDtype(object, "1")),
(SparseDtype(int, 1), np.str_, SparseDtype(object, "1")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this test case? I'm not sure what expectations we have around the np.str_ data type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, can also leave out this test case entirely. I am not entirely sure yet what dtype=np.str_ should mean, I was planning to open an issue about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time

pandas/tests/frame/test_constructors.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants