Generate hash aggregation output in smaller record batches #3461

milenkovicm · 2022-09-13T10:31:54Z

This change avoids cloning the whole aggregation state, which doubled the memory needed for aggregation.

Relates to #1570.

Which issue does this PR close?

Closes #3460.

Rationale for this change

What changes are included in this PR?

Update the poll_next method to return multiple aggregation state batches rather than a single one.
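As a rough illustration of the idea (not the code in this PR), the sketch below emits finished aggregation results in slices of at most `batch_size` rows instead of materializing one large output. The `AggregationOutput` type, its fields, and the use of `Vec<u64>` in place of Arrow record batches are all hypothetical simplifications. A real implementation would build each batch from the row range directly so that the state as a whole is never duplicated.

```rust
/// Hypothetical stand-in for the accumulated aggregation state:
/// one finished value per group. Real code would hold accumulators
/// and produce Arrow record batches instead of `Vec<u64>`.
struct AggregationOutput {
    group_values: Vec<u64>,
    /// Maximum number of rows per emitted batch (assumed to come from config).
    batch_size: usize,
    /// Index of the first group that has not been emitted yet.
    offset: usize,
}

impl AggregationOutput {
    fn new(group_values: Vec<u64>, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self { group_values, batch_size, offset: 0 }
    }
}

impl Iterator for AggregationOutput {
    type Item = Vec<u64>;

    /// Returns the next slice of at most `batch_size` groups,
    /// or `None` once every group has been emitted.
    fn next(&mut self) -> Option<Vec<u64>> {
        if self.offset >= self.group_values.len() {
            return None;
        }
        let end = (self.offset + self.batch_size).min(self.group_values.len());
        // Only the rows in `offset..end` are copied into the output batch.
        let batch = self.group_values[self.offset..end].to_vec();
        self.offset = end;
        Some(batch)
    }
}
```

With ten groups and a batch size of four, this iterator yields three batches, the last one holding the remaining two groups.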

Are there any user-facing changes?

No

alamb · 2022-09-14T20:55:35Z

Thank you @milenkovicm -- I plan to review this more carefully tomorrow morning.

cc @Dandandan and @yjshen

alamb

Thanks @milenkovicm -- this change makes sense to me.

Note there is a nearly identical copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for the non-row format, which likely needs the same treatment (though we could do it as a follow-on PR).

I think the only thing this PR needs is to use the configured batch size rather than a hard-coded value.

datafusion/core/src/physical_plan/aggregates/row_hash.rs

milenkovicm · 2022-09-15T08:51:13Z

> Note there is a nearly identical copy of the code in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/hash.rs for the non-row format, which likely needs the same treatment (though we could do it as a follow-on PR).

Is it line 442 that is "unbounded"? https://github.com/apache/arrow-datafusion/blob/84bee899958aaf70372ef84811c6787f53fa25eb/datafusion/core/src/physical_plan/aggregates/hash.rs#L442

alamb · 2022-09-15T12:51:12Z

> Is it line 442 that is "unbounded"?

Yes, that looks correct.

milenkovicm · 2022-09-15T12:58:06Z

> > Is it line 442 that is "unbounded"?
>
> Yes, that looks correct.

May I suggest merging this one? I'll try to patch the other one in due course.

One question beforehand, which will save me some time: which aggregation operators will end up using that hash?

alamb · 2022-09-15T13:23:39Z

> One question beforehand, which will save me some time: which aggregation operators will end up using that hash?

I think it is based on the type of the aggregate and whether it supports the special "row format" added by @yjshen.

This ticket describes the reason for (and the potential challenges with) having multiple hash aggregate operators: #2723

alamb

Looks ok to me -- thank you @milenkovicm

@yjshen do you have time to review these changes?

ursabot · 2022-10-15T11:22:11Z

Benchmark runs are scheduled for baseline = 011bcf4 and contender = 0b90a8a. 0b90a8a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb · 2022-10-15T11:24:15Z

Thanks again @milenkovicm

github-actions bot added the core Core DataFusion crate label Sep 13, 2022

milenkovicm changed the title ~~change how final aggregation row group is created ...~~ change how final aggregation record batch is created ... Sep 13, 2022

alamb reviewed Sep 14, 2022

datafusion/core/src/physical_plan/aggregates/row_hash.rs

alamb changed the title ~~change how final aggregation record batch is created ...~~ Generate hash aggregation output in smaller record batches Sep 14, 2022

milenkovicm added 3 commits September 28, 2022 11:03

change how final aggregation row group is created ...

e9d62a0

this change would prevent of cloning of whole state, doubling memory needed for aggregation. this PR relates to #1570

Fix clippy issues

7aeac97

read batch size from session_config

20aef78

alamb approved these changes Oct 12, 2022

alamb requested a review from yjshen October 12, 2022 17:39

alamb merged commit 0b90a8a into apache:master Oct 15, 2022

milenkovicm deleted the create_batch_fix branch October 18, 2022 08:46

milenkovicm restored the create_batch_fix branch October 26, 2022 15:52

milenkovicm deleted the create_batch_fix branch October 26, 2022 15:54
