-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for bounded execution when window frame involves UNBOUNDED PRECEDING #5003
Support for bounded execution when window frame involves UNBOUNDED PRECEDING #5003
Conversation
A quick summary to help reviews: If all you are doing is something like a running sum, you can get the job done with bounded memory even if your frame is ever-growing. This PR paves the way for Datafusion to support these kinds of use cases with low memory usage and without breaking the pipeline. |
The idea makes sense to me. I plan to review this carefully in the next day or two |
I plan to review this carefully first thing tomorrow (monday) morning US Eastern time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great improvement to me 👍 I had some minor suggestions, but nothing that I think is required
|
||
/// A window expr that takes the form of an aggregate function | ||
#[derive(Debug)] | ||
pub struct AggregateWindowExpr { | ||
pub struct PlainAggregateWindowExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help here to explain what is the alternative to a PlainAggregateWindowExpr
(e.g. SlidingAggregateWindowExpr
I think) in the doc comments
Also, since this would be a non trivial API change -- anyone who has a AggregateWindowExpr
in their code would need to change to PlainAggregateWindowExpr
I wonder how important do you view the new name?
Maybe we could leave the struct called AggregateWindowExpr
// A non-sliding aggregation only processes new data, it never | ||
// deals with expiring data as its starting point is always the | ||
// same point (i.e. the beginning of the table/frame). Hence, we | ||
// do not call `retract_batch`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it is worth an assertion to verify the invariant that cur_range is always a superset of last_range
Thanks again @mustafasrepo |
Benchmark runs are scheduled for baseline = 930c8de and contender = 624f02d. 624f02d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4978
Rationale for this change
Currently, queries that contain
UNBOUNDED PRECEDING
as in their window frame bounds, like the one belowrun with
WindowAggExec
. However, many aggregators do not require the whole range in memory to calculate their results -- the above query can actually run withBoundedWindowAggExec
.What changes are included in this PR?
This PR adds support for bounded-memory execution of suitable window functions even when the start bound is
UNBOUNDED PRECEDING
.Are these changes tested?
We added new tests that verify the updated (i.e. optimized) physical plan. We also added fuzzy window tests to generate window frame bounds with
UNBOUNDED PRECEDING
. Fuzzy tests can now generate window frame bounds in the formRANGE BETWEEN N PRECEDING AND M PRECEDING
orRANGE BETWEEN M FOLLOWING AND N FOLLOWING
, which increases effective coverage.Are there any user-facing changes?
No.