-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate hash aggregation output in smaller record batches #3461
Conversation
Thank you @milenkovicm -- I plan to review this more carefully tomorrow morning. cc @Dandandan and @yjshen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @milenkovicm -- this change makes sense to me.
Note there is an almost similar copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for non row format, which likely needs the same treatment (though we could do it as a follow on PR)
I think the only thing this PR needs is to use the configured batch size rather than a hard coded value.
is it line 442 which is "unbounded" ? https://github.com/apache/arrow-datafusion/blob/84bee899958aaf70372ef84811c6787f53fa25eb/datafusion/core/src/physical_plan/aggregates/hash.rs#L442 |
Yes that looks correct |
may I suggest merging this one, and I'll try to patch that one in due course. One question before hand, which will save me some time, which aggregation operators will end up using that hash? |
I think it is based on the type of the aggregate and if it supports a special "row format" added by @yjshen This ticket describes the reason (and the potential challenges) with having multiple hash aggregate operators: #2723 |
this change would prevent of cloning of whole state, doubling memory needed for aggregation. this PR relates to #1570
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me -- thank you @milenkovicm
@yjshen do you have time to review these changes?
Benchmark runs are scheduled for baseline = 011bcf4 and contender = 0b90a8a. 0b90a8a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks again @milenkovicm |
this change would prevent of cloning of whole state, doubling memory needed for aggregation.
relates to #1570
Which issue does this PR close?
Closes #3460.
Rationale for this change
What changes are included in this PR?
update
poll_next
method to return multiple aggregation state batches rather than a single one.Are there any user-facing changes?
No