String dtype: implement sum reduction #59853

jorisvandenbossche · 2024-09-20T22:23:56Z

Based on the feedback in #59328, this implements sum() for the string dtype.

rhshadrach · 2024-09-21T15:07:36Z

pandas/core/arrays/arrow/array.py

+                if pa.types.is_large_string(data.type):
+                    # binary_join only supports string, not large_string
+                    data = data.cast(pa.string())


Not too familiar here, can this cause unexpected results? If so, should it be documented?

Yes, it can cause an overflow error if a single chunk doesn't fit into the string type. I expect this to be very rare: because we are summing here, it would mean that the single scalar string produced by the sum is bigger than 2GB (I am not sure how well Python handles such a large str object).

We can indeed document it.
In theory it could also be circumvented by splitting the chunk into multiple chunks (although I have to verify that pyarrow then does not actually concatenate that again under the hood in the binary_join implementation).

Added a note about this in the issue listing the behavioral changes -> #59328
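To make the size limit in this thread concrete: Arrow's `string` type stores 32-bit offsets, so a single array can hold at most 2**31 - 1 bytes of string data, whereas `large_string` uses 64-bit offsets. The following is a minimal pure-Python sketch of that check and of the chunk-splitting idea mentioned above. It does not call pyarrow, and the helper names (`utf8_size`, `fits_in_string_type`, `split_to_fit`) are hypothetical, not part of pandas or pyarrow.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit Arrow `string` array can store


def utf8_size(values):
    """Total number of UTF-8 bytes in the non-null values."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def fits_in_string_type(values, limit=INT32_MAX):
    """True if the values' data would fit behind 32-bit offsets."""
    return utf8_size(values) <= limit


def split_to_fit(values, limit=INT32_MAX):
    """Greedily split values into consecutive chunks of at most `limit` bytes."""
    chunks, current, size = [], [], 0
    for v in values:
        n = 0 if v is None else len(v.encode("utf-8"))
        if n > limit:
            raise OverflowError("a single value exceeds the limit")
        if current and size + n > limit:
            # Adding this value would overflow the current chunk: start a new one.
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += n
    if current:
        chunks.append(current)
    return chunks


# Demo with a tiny artificial limit so it runs instantly.
demo = ["ab", "cde", None, "f", "ghij"]
print(fits_in_string_type(demo))    # True
print(split_to_fit(demo, limit=5))  # [['ab', 'cde', None], ['f', 'ghij']]
```

Splitting only keeps each input chunk within the offset range. Whether pyarrow re-concatenates the chunks internally in binary_join is the open question raised above, and this sketch does not address it.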

WillAyd · 2024-09-22T14:01:28Z

Hmm, wouldn't it be better to just make users cast to object if they want the legacy behavior on this? I think it was intentional to not implement sum on the string dtype; while there may be some niche use cases, more often than not I think it's a mistake (and huge performance hit) to have string types summed up
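For reference, the cast-to-object route suggested here might look like the sketch below. It assumes pandas is installed; the series contents are made-up example data.

```python
import pandas as pd

# Hypothetical example data; any column of Python strings would do.
ser = pd.Series(["a", "b", "c"], dtype="string")

# Casting to object dtype makes sum() fall back to Python's `+` on the
# individual str objects, i.e. plain concatenation.
legacy_result = ser.astype(object).sum()
print(legacy_result)  # abc
```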

rhshadrach · 2024-09-22T14:08:01Z

> while there may be some niche use cases, more often than not I think it's a mistake (and huge performance hit)

This is not my personal experience.

jorisvandenbossche · 2024-09-22T19:04:25Z

@WillAyd it was indeed intentional in the past to not implement it, but see the thread in #59328, and if you want to discuss this, let's do that there ;) (I personally don't care too much about this specifically, but I can follow some of the arguments given there, and I am just implementing what has been decided for now)

jorisvandenbossche · 2024-10-30T20:56:00Z

I am planning to merge this tomorrow (given it is one of the last remaining bigger changes for the string dtype, and including a bunch of test updates / blocking other test updates), but some more review is certainly still welcome

lumberbot-app · 2024-10-31T10:16:30Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 2fdb16b347fc34f78213868a8a973447ac79ab2d

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #59853: String dtype: implement sum reduction'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-59853-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59853 on branch 2.3.x (String dtype: implement sum reduction)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

(cherry picked from commit 2fdb16b)

jorisvandenbossche · 2024-10-31T10:46:18Z

Manual backport -> #60157

String dtype: implement sum reduction (#59853) (cherry picked from commit 2fdb16b)

String dtype: implemen sum reduction

593653a

jorisvandenbossche added the Strings (String extension data type and string data) and Reduction Operations (sum, mean, min, max, etc.) labels Sep 20, 2024

jorisvandenbossche added this to the 2.3 milestone Sep 20, 2024

jorisvandenbossche changed the title ~~String dtype: implemen sum reduction~~ String dtype: implement sum reduction Sep 20, 2024

jorisvandenbossche added 4 commits September 21, 2024 10:16

fix object-dtype implementation + update tests

ab30d87

remove xfails

3bb6ea9

add comments

90c4672

ignore typing of pyarrow_meth

fb4f99f

jorisvandenbossche mentioned this pull request Sep 21, 2024

String dtype: overview of breaking behaviour changes #59328

Open

jorisvandenbossche added 3 commits September 21, 2024 14:07

remove xfails + update expected error msgs

42b77db

fixup test for python storage

18a6b0d

add whatsnew note

2cf06ad

jorisvandenbossche marked this pull request as ready for review September 21, 2024 13:51

rhshadrach reviewed Sep 21, 2024


Merge remote-tracking branch 'upstream/main' into string-dtype-sum

191fdd4

jorisvandenbossche requested a review from jbrockmendel October 30, 2024 20:55

Merge remote-tracking branch 'upstream/main' into string-dtype-sum

eda1edc

jorisvandenbossche merged commit 2fdb16b into pandas-dev:main Oct 31, 2024
51 checks passed

jorisvandenbossche deleted the string-dtype-sum branch October 31, 2024 10:16

lumberbot-app bot added the Still Needs Manual Backport label Oct 31, 2024

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 31, 2024

String dtype: implement sum reduction (pandas-dev#59853)

a79bec6

(cherry picked from commit 2fdb16b)

jorisvandenbossche mentioned this pull request Oct 31, 2024

[backport 2.3.x] String dtype: implement sum reduction (#59853) #60157

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Oct 31, 2024

jorisvandenbossche added a commit that referenced this pull request Oct 31, 2024

[backport 2.3.x] String dtype: implement sum reduction (#59853) (#60157)

4f189a4

String dtype: implement sum reduction (#59853) (cherry picked from commit 2fdb16b)

jorisvandenbossche mentioned this pull request Nov 7, 2024

BUG/API: sum of a string column with all-NaN or empty #60229

Open


rhshadrach Sep 21, 2024

jorisvandenbossche Oct 9, 2024

jorisvandenbossche Oct 31, 2024
