Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for bounded execution when window frame involves UNBOUNDED PRECEDING #5003

Merged
merged 17 commits into from
Jan 23, 2023
Merged

Support for bounded execution when window frame involves UNBOUNDED PRECEDING #5003

merged 17 commits into from
Jan 23, 2023

Conversation

mustafasrepo
Copy link
Contributor

Which issue does this PR close?

Closes #4978

Rationale for this change

Currently, queries that contain UNBOUNDED PRECEDING as in their window frame bounds, like the one below

SELECT
    SUM(c1) OVER (ORDER BY c3 RANGE BETWEEN UNBOUNDED PRECEDING AND 11 FOLLOWING) as a1
    FROM aggregate_test_100

run with WindowAggExec. However, many aggregators do not require the whole range in memory to calculate their results -- the above query can actually run with BoundedWindowAggExec.

What changes are included in this PR?

This PR adds support for bounded-memory execution of suitable window functions even when the start bound is UNBOUNDED PRECEDING.

Are these changes tested?

We added new tests that verify the updated (i.e. optimized) physical plan. We also added fuzzy window tests to generate window frame bounds with UNBOUNDED PRECEDING. Fuzzy tests can now generate window frame bounds in the form RANGE BETWEEN N PRECEDING AND M PRECEDING or RANGE BETWEEN M FOLLOWING AND N FOLLOWING, which increases effective coverage.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Jan 20, 2023
@ozankabak
Copy link
Contributor

A quick summary to help reviews: If all you are doing is something like a running sum, you can get the job done with bounded memory even if your frame is ever-growing. This PR paves the way for Datafusion to support these kinds of use cases with low memory usage and without breaking the pipeline.

@alamb
Copy link
Contributor

alamb commented Jan 20, 2023

The idea makes sense to me. I plan to review this carefully in the next day or two

@alamb
Copy link
Contributor

alamb commented Jan 22, 2023

I plan to review this carefully first thing tomorrow (monday) morning US Eastern time

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great improvement to me 👍 I had some minor suggestions, but nothing that I think is required

datafusion/core/tests/sql/window.rs Show resolved Hide resolved
datafusion/core/tests/window_fuzz.rs Show resolved Hide resolved

/// A window expr that takes the form of an aggregate function
#[derive(Debug)]
pub struct AggregateWindowExpr {
pub struct PlainAggregateWindowExpr {
Copy link
Contributor

@alamb alamb Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help here to explain what is the alternative to a PlainAggregateWindowExpr (e.g. SlidingAggregateWindowExpr I think) in the doc comments

Also, since this would be a non trivial API change -- anyone who has a AggregateWindowExpr in their code would need to change to PlainAggregateWindowExpr I wonder how important do you view the new name?

Maybe we could leave the struct called AggregateWindowExpr

// A non-sliding aggregation only processes new data, it never
// deals with expiring data as its starting point is always the
// same point (i.e. the beginning of the table/frame). Hence, we
// do not call `retract_batch`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is worth an assertion to verify the invariant that cur_range is always a superset of last_range

datafusion/physical-expr/src/window/built_in.rs Outdated Show resolved Hide resolved
@alamb alamb merged commit 624f02d into apache:master Jan 23, 2023
@alamb
Copy link
Contributor

alamb commented Jan 23, 2023

Thanks again @mustafasrepo

@ursabot
Copy link

ursabot commented Jan 23, 2023

Benchmark runs are scheduled for baseline = 930c8de and contender = 624f02d. 624f02d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@mustafasrepo mustafasrepo deleted the feature/window_unbounded_preceding branch February 10, 2023 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for bounded execution when window query involves UNBOUNDED PRECEDING
4 participants