Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: implement sum reduction #59853

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 20, 2024

Based on the feedback in #59328, implementing sum() for the string dtype

xref #54792

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. labels Sep 20, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 20, 2024
@jorisvandenbossche jorisvandenbossche changed the title String dtype: implemen sum reduction String dtype: implement sum reduction Sep 20, 2024
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review September 21, 2024 13:51
Comment on lines +1732 to +1734
if pa.types.is_large_string(data.type):
# binary_join only supports string, not large_string
data = data.cast(pa.string())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar here, can this cause unexpected results? If so, should it be documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can cause overflow error if a single chunk doesn't fit into the string dtype. I suppose this will be very rare, because we are summing here, and that would mean that the single scalar string as result of the sum would be bigger than 2GB (I am not fully sure how well Python will handle such a large str object).

We can indeed document it.
In theory it could also be circumvented by splitting the chunk into multiple chunks (although I have to verify that pyarrow then does not actually concatenate that again under the hood in the binary_join implementation).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note about this in the issue listing the behavioral changes -> #59328

@WillAyd
Copy link
Member

WillAyd commented Sep 22, 2024

Hmm wouldn't it be better to just make users cast to object if they want the legacy behavior on this? I think it was intentional to not implement sum on the string dtype; while there may be some niche use cases, more often than not I think its a mistake (and huge performance hit) to have string types summed up

@rhshadrach
Copy link
Member

while there may be some niche use cases, more often than not I think its a mistake (and huge performance hit)

This is not my personal experience.

@jorisvandenbossche
Copy link
Member Author

@WillAyd it was indeed intentional in the past to not implement it, but see the thread in #59328, and if you want to discuss this, let's do that there ;) (I personally don't care too much about this specifically, but can follow some of the arguments given there, and just implementing what has been decided for now)

@jorisvandenbossche
Copy link
Member Author

I am planning to merge this tomorrow (given it is one of the last remaining bigger changes for the string dtype, and including a bunch of test updates / blocking other test updates), but some more review is certainly still welcome

@jorisvandenbossche jorisvandenbossche merged commit 2fdb16b into pandas-dev:main Oct 31, 2024
51 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-sum branch October 31, 2024 10:16
Copy link

lumberbot-app bot commented Oct 31, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 2fdb16b347fc34f78213868a8a973447ac79ab2d
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #59853: String dtype: implement sum reduction'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-59853-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59853 on branch 2.3.x (String dtype: implement sum reduction)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member Author

Manual backport -> #60157

jorisvandenbossche added a commit that referenced this pull request Oct 31, 2024
String dtype: implement sum reduction (#59853)

(cherry picked from commit 2fdb16b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants