Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate hash aggregation output in smaller record batches #3461

Merged
merged 3 commits into from
Oct 15, 2022
Merged

Generate hash aggregation output in smaller record batches #3461

merged 3 commits into from
Oct 15, 2022

Conversation

milenkovicm
Copy link
Contributor

this change would prevent of cloning of whole state, doubling memory needed for aggregation.

relates to #1570

Which issue does this PR close?

Closes #3460.

Rationale for this change

What changes are included in this PR?

update poll_next method to return multiple aggregation state batches rather than a single one.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Sep 13, 2022
@milenkovicm milenkovicm changed the title change how final aggregation row group is created ... change how final aggregation record batch is created ... Sep 13, 2022
@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

Thank you @milenkovicm -- I plan to review this more carefully tomorrow morning.

cc @Dandandan and @yjshen

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milenkovicm -- this change makes sense to me.

Note there is an almost similar copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for non row format, which likely needs the same treatment (though we could do it as a follow on PR)

I think the only thing this PR needs is to use the configured batch size rather than a hard coded value.

@alamb alamb changed the title change how final aggregation record batch is created ... Generate hash aggregation output in smaller record batches Sep 14, 2022
@milenkovicm
Copy link
Contributor Author

Note there is an almost similar copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for non row format, which likely needs the same treatment (though we could do it as a follow on PR)

is it line 442 which is "unbounded" ? https://github.com/apache/arrow-datafusion/blob/84bee899958aaf70372ef84811c6787f53fa25eb/datafusion/core/src/physical_plan/aggregates/hash.rs#L442

@alamb
Copy link
Contributor

alamb commented Sep 15, 2022

is it line 442 which is "unbounded" ?

Yes that looks correct

@milenkovicm
Copy link
Contributor Author

is it line 442 which is "unbounded" ?

Yes that looks correct

may I suggest merging this one, and I'll try to patch that one in due course.

One question before hand, which will save me some time, which aggregation operators will end up using that hash?

@alamb
Copy link
Contributor

alamb commented Sep 15, 2022

One question before hand, which will save me some time, which aggregation operators will end up using that hash?

I think it is based on the type of the aggregate and if it supports a special "row format" added by @yjshen

This ticket describes the reason (and the potential challenges) with having multiple hash aggregate operators: #2723

this change would prevent of cloning of whole state, doubling memory
needed for aggregation.

this PR relates to #1570
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me -- thank you @milenkovicm

@yjshen do you have time to review these changes?

@alamb alamb requested a review from yjshen October 12, 2022 17:39
@alamb alamb merged commit 0b90a8a into apache:master Oct 15, 2022
@ursabot
Copy link

ursabot commented Oct 15, 2022

Benchmark runs are scheduled for baseline = 011bcf4 and contender = 0b90a8a. 0b90a8a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Copy link
Contributor

alamb commented Oct 15, 2022

Thanks again @milenkovicm

@milenkovicm milenkovicm deleted the create_batch_fix branch October 18, 2022 08:46
@milenkovicm milenkovicm restored the create_batch_fix branch October 26, 2022 15:52
@milenkovicm milenkovicm deleted the create_batch_fix branch October 26, 2022 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Row Hash loads whole aggregation state to memory before sending
3 participants