BUG: `any()` and `all()` raise with extension strings #51939

jrbourbeau · 2023-03-13T21:34:46Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(["foo", "bar", ""])
print(f"{s.any() = }")
print(f"{s.astype('string[python]').any() = }")

# Also fails with pyarrow strings
# print(f"{s.astype('string[pyarrow]').any() = }")

# Similar behavior with `.all()`

Issue Description

When using object dtype for string data, the .any() and .all() reductions both work. When using string[python] or string[pyarrow], both operations raise a TypeError.

`string[python]` traceback:

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test-any.py", line 5, in <module>
    print(f"{s.astype('string[python]').any() = }")
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11368, in any
    return NDFrame.any(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11056, in any
    return self._logical_func(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11040, in _logical_func
    return self._reduce(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/series.py", line 4522, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/arrays/string_.py", line 474, in _reduce
    raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
TypeError: Cannot perform reduction 'any' with string dtype

`string[pyarrow]` traceback:

Traceback (most recent call last):
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py", line 1281, in _reduce
    result = pyarrow_meth(data_to_reduce, skip_nulls=skipna, **kwargs)
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pyarrow/compute.py", line 256, in wrapper
    return func.call(args, options, memory_pool)
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'any' has no kernel matching input types (string)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test-any.py", line 5, in <module>
    print(f"{s.astype('string[pyarrow]').any() = }")
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11368, in any
    return NDFrame.any(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11056, in any
    return self._logical_func(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/generic.py", line 11040, in _logical_func
    return self._reduce(
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/series.py", line 4522, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "/Users/james/mambaforge/envs/dask/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py", line 1289, in _reduce
    raise TypeError(msg) from err
TypeError: 'ArrowStringArray' with dtype string does not support reduction 'any' with pyarrow version 11.0.0. 'any' may be supported by upgrading pyarrow.

Expected Behavior

I'd expect string[pyarrow] and string[python] to also treat empty strings as falsey and non-empty strings as truthy when computing .any() and .all().
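For reference, the semantics described above mirror Python's built-in truthiness rules. A minimal sketch using plain Python and an object-dtype Series (the variable names are illustrative only, not part of the original report):

```python
import pandas as pd

values = ["foo", "bar", ""]

# Python's built-ins treat "" as falsey and any non-empty string as truthy.
builtin_any = any(values)  # at least one non-empty string
builtin_all = all(values)  # not all strings are non-empty

# An object-dtype Series follows the same rule. bool() guards against
# older pandas versions that could return the element itself.
s = pd.Series(values, dtype=object)
object_any = bool(s.any())
object_all = bool(s.all())
```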

cc @phofl for visibility

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 6169cba72dbe8c7e9c7f17ab38af15a256f083da
python           : 3.10.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.3.0
Version          : Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.1.0.dev0+186.g6169cba72d
numpy            : 1.25.0.dev0+882.gea0b1708b
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.0.4
Cython           : None
pytest           : 7.1.3
hypothesis       : None
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.2.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : 2023.2.0
fsspec           : 2023.1.0+5.g012816b
gcsfs            : None
matplotlib       : 3.5.1
numba            : None
numexpr          : 2.8.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.10.0
scipy            : 1.9.0
snappy           :
sqlalchemy       : 1.4.35
tables           : 3.7.0
tabulate         : None
xarray           : 2022.3.0
xlrd             : None
zstandard        : None
tzdata           : None
qtpy             : None
pyqt5            : None


jrbourbeau · 2023-03-13T21:38:01Z

FWIW over in dask we also ran into similar errors for <string-series>.sum()

jorisvandenbossche · 2023-03-13T23:06:05Z

For pyarrow, this kernel is simply not implemented for strings, only for actual bools (and I don't think we would plan to add such a variant; users are responsible for converting to boolean first, as they see fit). But so pandas could easily work around this if we would want.

Generally, we have been very "liberal" in implementing any/all for all data types, but IMO it's fine to be stricter to limit ourselves to data types that have a clear true/false interpretation (like for numeric dtypes, where you can say that 0 is false). I know for datetime we recently had a similar discussion (1970-01-01 being False is a bit of a leaky implementation detail; I can't find the relevant issue right now).
For strings, I would say that it can be clearer code to explicitly check for empty strings instead of relying on this aspect of any/all implicitly.
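A minimal sketch of what such an explicit check might look like, assuming a `string`-dtype Series without missing values (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["foo", "bar", ""], dtype="string")

# Build an explicit boolean mask instead of relying on string truthiness.
non_empty = s.str.len() > 0

has_non_empty = bool(non_empty.any())  # is at least one string non-empty?
all_non_empty = bool(non_empty.all())  # are all strings non-empty?
```

Reducing over the boolean mask sidesteps the question of whether the string array itself supports any/all.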

Prashantkhobragade · 2023-03-15T20:44:22Z

You can try converting the string dtype to object using the .astype() method, as the object dtype supports the any() and all() methods.

print(f"{s.astype('object').any() = }")

jrbourbeau · 2023-03-16T21:22:54Z

> Generally, we have been very "liberal" in implementing any/all for all data types, but IMO it's fine to be stricter to limit ourselves to data types that have a clear true/false interpretation (like for numeric dtypes, where you can say that 0 is false).

I think it's pretty standard to treat empty strings as false and non-empty strings as true. Are there other conventions you've seen commonly used? (I could very well be missing something)

> But so pandas could easily work around this if we would want.

Yeah, maybe that's what I'm interested in. I think there's a user expectation about any() / all() / etc. with strings, but what I ultimately care about here is dask - pandas compatibility. If erroring in this case is an intentional decision, then I'll update dask to have the same behavior. I guess I'm just looking for clarity on if this is intentional.

jorisvandenbossche · 2023-03-17T08:33:00Z

> I think it's pretty standard to treat empty strings as false and non-empty strings as true

In the context of Python's concept of "truthy" and "falsey", for sure. But numpy, for example, also only implements np.any for actual bools and will fail when passed strings ("ufunc 'logical_or' did not contain a loop with signature matching types .. str ..").

The fact that any/all actually worked in pandas in the past is because we have been using object dtype (and not numpy's fixed-width strings). But this also gives several weird results that do not return an actual boolean value, because of how numpy implements this (just using Python's "and"/"or"), although this has recently been fixed on the pandas side to guarantee bool output: see #12863 and the many linked issues.
(Now, the fact that there were so many issues about this strange behaviour is certainly an indication that people do call any/all on object/string dtype data.)
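To illustrate why chaining Python's "and"/"or" can yield non-boolean results, here is a small plain-Python sketch. It only demonstrates the operator semantics and is not NumPy's or pandas' actual code path:

```python
from functools import reduce

values = ["foo", "bar", ""]

# Python's `or` / `and` return one of their operands, not a bool.
or_result = reduce(lambda a, b: a or b, values)
and_result = reduce(lambda a, b: a and b, values)

# Wrapping the result in bool() gives the boolean a caller would expect.
any_result = bool(or_result)
all_result = bool(and_result)
```

Here `or_result` is the first truthy string and `and_result` is the first falsey one, which is why an explicit conversion to bool is needed to guarantee boolean output.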

> If erroring in this case is an intentional decision, then I'll update dask to have the same behavior. I guess I'm just looking for clarity on if this is intentional

I don't remember if we intentionally omitted the reduction from the StringDtype implementation. I don't see much discussion in the original issues about this, and for example the original PR #27949 explicitly didn't implement any reduction at the time, but only discussed that "sum" should be added (which we actually still didn't add, only "min" and "max" were added as reductions)

j-bennet · 2023-03-17T16:36:51Z

> I don't remember if we intentionally omitted the reduction from the StringDtype implementation. I don't see much discussion in the original issues about this, and for example the original PR #27949 explicitly didn't implement any reduction at the time, but only discussed that "sum" should be added (which we actually still didn't add, only "min" and "max" were added as reductions)

It would be good to know the intended behavior for sum reduction with ArrowStringArray, as well. From #27949, it's not completely clear, and I can't seem to find more recent discussions on the topic.

phofl · 2023-08-28T21:12:56Z

We've added this for the StringDtype that is based on the NumPy semantics. I think we should move forward with this for all StringDtype implementations. Some discussion is here: #54591 (comment)

cc @jbrockmendel

jbrockmendel · 2023-08-31T17:33:36Z

+1 for implementing any/all for other strings

jorisvandenbossche · 2023-08-31T21:34:10Z

It could also be an option to deprecate+remove any/all for all non-numeric data types.

For example for string data, you can use it to check for missing values (but isna/notna + any/all is more explicit for that) or for empty strings (but .str.len() > 0 or == "" is more explicit for that). And for datetime data it relies on an implementation detail that 1970 is considered as false and everything else as true ...
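The more explicit alternatives mentioned above could look roughly like the following sketch, assuming a `string`-dtype Series that contains both a missing value and an empty string (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["foo", None, ""], dtype="string")

# Missing values: ask about them directly.
has_missing = bool(s.isna().any())
all_present = bool(s.notna().all())

# Empty strings: compare explicitly; the missing entry does not count as a match.
has_empty = bool((s == "").any())
```

Spelling out the two checks keeps "missing" and "empty" distinct, which a single any/all on the strings themselves would not.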

phofl · 2023-08-31T21:46:43Z

You can also check for empty strings, which is definitely useful and is consistent with Python semantics

jrbourbeau added the Bug and Needs Triage labels Mar 13, 2023

jorisvandenbossche added the Arrow label and removed the Needs Triage label Mar 13, 2023

jorisvandenbossche added the Strings label Mar 13, 2023

jorisvandenbossche mentioned this issue Jul 30, 2024

String dtype: overview of breaking behaviour changes #59328

Open
