-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: implement sum reduction #59853
String dtype: implement sum reduction #59853
Conversation
if pa.types.is_large_string(data.type): | ||
# binary_join only supports string, not large_string | ||
data = data.cast(pa.string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too familiar here, can this cause unexpected results? If so, should it be documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can cause overflow error if a single chunk doesn't fit into the string dtype. I suppose this will be very rare, because we are summing here, and that would mean that the single scalar string as result of the sum would be bigger than 2GB (I am not fully sure how well Python will handle such a large str
object).
We can indeed document it.
In theory it could also be circumvented by splitting the chunk into multiple chunks (although I have to verify that pyarrow then does not actually concatenate that again under the hood in the binary_join
implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note about this in the issue listing the behavioral changes -> #59328
Hmm wouldn't it be better to just make users cast to object if they want the legacy behavior on this? I think it was intentional to not implement sum on the string dtype; while there may be some niche use cases, more often than not I think its a mistake (and huge performance hit) to have string types summed up |
This is not my personal experience. |
@WillAyd it was indeed intentional in the past to not implement it, but see the thread in #59328, and if you want to discuss this, let's do that there ;) (I personally don't care too much about this specifically, but can follow some of the arguments given there, and just implementing what has been decided for now) |
I am planning to merge this tomorrow (given it is one of the last remaining bigger changes for the string dtype, and including a bunch of test updates / blocking other test updates), but some more review is certainly still welcome |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
(cherry picked from commit 2fdb16b)
Manual backport -> #60157 |
Based on the feedback in #59328, implementing
sum()
for the string dtypexref #54792