-
-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: map builtin str alias to StringDtype #59685
base: main
Are you sure you want to change the base?
String dtype: map builtin str alias to StringDtype #59685
Conversation
if ( | ||
hasattr(arr, "dtype") | ||
and arr.dtype.kind in "mM" | ||
and not hasattr(arr, "_pa_array") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit ugly, but essentially this is avoiding that astype(str)
for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will use ensure_string_array
for that implementation, causing a recursion error.
An alternative would be a better check than arr.dtype.kind in "mM"
that restricts itself to DatetimeLikeArrayMixin
(essentially I need a isinstance(arr, DatetimeLikeArrayMixin)
, I think). Could potentially do that with some ABC check.
Another alternative is to handle astype(str)
specifically in ArrowExtensionArray (currently astype
is not implemented there, and it falls back entirely on the base class implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think eventually it would be good to implement ArrowExtensionArray.astype
so I would support eventually handing str
there. Could you add a TODO
comment here describing when this could be removed when astype
is implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO comment
…erring StringDtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment otherwise looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great - some minor things
pandas/core/dtypes/common.py
Outdated
@@ -1448,6 +1450,9 @@ def is_extension_array_dtype(arr_or_dtype) -> bool: | |||
elif isinstance(dtype, np.dtype): | |||
return False | |||
else: | |||
# TODO ugly -> move into registry find()? Or make this work with pandas_dtype? | |||
if dtype is str and using_string_dtype(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other places where maybe we should just have a function like is_builtin_string_dtype
instead of this branch? I think its hard to mentally process what logic like this is doing - would be good to abstract that as much as we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd I replaced the registry.find()
with pandas_dtype()
, then I don't need to handle the case of str
here specifically because pandas_dtype()
already does that.
In general pandas_dtype()
should be our general preprocessing function. I only needed to suppress any warning or error here, because in this case if pandas_dtype
fails then is_extension_array_dtype
should return False
.
(potentially that could be factored out into something like pandas_dtype_safe()
if that would be needed in multiple places)
@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string): | |||
[ | |||
(SparseDtype(int, 0), float, SparseDtype(float, 0.0)), | |||
(SparseDtype(int, 1), float, SparseDtype(float, 1.0)), | |||
(SparseDtype(int, 1), str, SparseDtype(object, "1")), | |||
(SparseDtype(int, 1), np.str_, SparseDtype(object, "1")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need this test case? I'm not sure what expectations we have around the np.str_
data type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, can also leave out this test case entirely. I am not entirely sure yet what dtype=np.str_
should mean, I was planning to open an issue about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time
This ensures that the builtin type
str
is also mapped to the future default string dtype (ifinfer_string
is enabled), i.e. to make.astype(str)
behave the same as.astype('str")
.In general we rely on numpy for mapping builtin types to dtypes. For example
int
is mapped tonp.dtype("int64")
by passing that value tonp.dtype(..)
inpandas_dtype()
.But now that we have a default dtype that is mapped to one of those builtin types, we can no longer rely on numpy doing that. Implementation wise, for
pandas_dtype(str)
I currently added a special case checking explicitly forstr
and then importing the dtype lazily and returning it.An alternative could be to handle this through the ExtensionDtype
Registry.find()
. However, that currently relies on usingExtensionDtype.construct_from_string()
(i.e. we would have to expand the typing there to allow type objects as well next to strings, which is of course also possible). And maybe it is better to have control centrally about builtin aliases, instead of allowing external dtypes to map to those as well.xref #54792