BUG: `any()` and `all()` raise with extension strings
#51939
Comments
FWIW over in |
For pyarrow, this kernel is simply not implemented for strings, only for actual bools (and I don't think we would plan to add such a variant; users are responsible for converting to boolean as they see fit first). So pandas could easily work around this if we wanted to. Generally, we have been very "liberal" in implementing any/all for all data types, but IMO it's fine to be stricter and limit ourselves to data types that have a clear true/false interpretation (like numeric dtypes, where you can say that 0 is false). I know we recently had a similar discussion for datetime (1970-01-01 being False is a bit of a leaky implementation detail; I can't find the relevant issue right now).
you can try converting the
|
I think it's pretty standard to treat empty strings as false and non-empty strings as true. Are there other conventions you've seen commonly used? (I could very well be missing something)
Yeah, maybe that's what I'm interested in. I think there's a user expectation about |
In the context of Python's concept of "truthy" and "falsey", for sure. But for example, also numpy only implements The fact that
I don't remember if we intentionally omitted the reduction from the StringDtype implementation. I don't see much discussion in the original issues about this. For example, the original PR #27949 explicitly didn't implement any reduction at the time, and only discussed that "sum" should be added (which we actually still haven't added; only "min" and "max" were added as reductions).
It would be good to know the intended behavior for |
We've added this for the StringDtype that is based on the NumPy semantics; I think we should move forward with this for all StringDtype implementations. Some discussion is here: #54591 (comment)
+1 for implementing any/all for other strings
It could also be an option to deprecate+remove For example for string data, you can use it to check for missing values (but |
You can also check for empty strings, which is definitely useful and is consistent with Python semantics |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When using `object` dtypes for string data, `.any()` and `.all()` reductions both work. When using `string[python]` or `string[pyarrow]`, both of these operations raise a `TypeError`.
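The behavior described above can be sketched roughly as follows. This is not the reporter's original example: the sample values and variable names are illustrative assumptions. The extension-dtype call is wrapped in `try`/`except` because whether it raises depends on the installed pandas version. The last step shows the explicit boolean conversion suggested in the comments, here assumed to be a length check.

```python
import pandas as pd

values = ["a", "", "b"]  # illustrative data, not taken from the report

# object dtype: the reductions work and follow Python truthiness.
obj = pd.Series(values, dtype=object)
print("object:", bool(obj.any()), bool(obj.all()))

# Extension string dtype: reported to raise TypeError in this issue.
# Newer pandas versions may support the reduction, so the call is guarded.
ext = pd.Series(values, dtype="string")
try:
    print("string:", bool(ext.any()), bool(ext.all()))
except TypeError as exc:
    print("string: TypeError:", exc)

# Workaround: convert to booleans explicitly before reducing.
nonempty = ext.str.len() > 0
print("explicit:", bool(nonempty.any()), bool(nonempty.all()))
```

Converting first makes the true/false interpretation explicit, so the result does not depend on which string backend is in use.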
`string[python]` traceback:
`string[pyarrow]` traceback:
Expected Behavior
I'd expect `string[pyarrow]` and `string[python]` to also treat empty strings as falsey and non-empty strings as truthy when computing `.any()` and `.all()`.
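As a rough illustration of the semantics requested here, the following pure-Python sketch treats empty strings as falsey and non-empty strings as truthy. The helper names `any_nonempty` and `all_nonempty` are hypothetical and not part of pandas. Missing values are represented by `None` and skipped, which is an assumption modeled on the default `skipna=True` of pandas reductions.

```python
from typing import Iterable, Optional


def any_nonempty(values: Iterable[Optional[str]]) -> bool:
    """Return True if at least one non-missing string is non-empty."""
    # Hypothetical helper: None stands in for a missing value and is skipped.
    return any(bool(v) for v in values if v is not None)


def all_nonempty(values: Iterable[Optional[str]]) -> bool:
    """Return True if every non-missing string is non-empty."""
    # Hypothetical helper: None stands in for a missing value and is skipped.
    return all(bool(v) for v in values if v is not None)


print(any_nonempty(["", "a"]), all_nonempty(["", "a"]))
print(any_nonempty(["", None]), all_nonempty(["a", None]))
```

As with Python's built-in `any` and `all`, an empty input gives `False` for `any_nonempty` and `True` for `all_nonempty`.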
cc @phofl for visibility
Installed Versions