Skip partial aggregation based on the cardinality of hash value instead of group values #12697
```diff
@@ -325,11 +325,7 @@ config_namespace! {
     /// Aggregation ratio (number of distinct groups / number of input rows)
     /// threshold for skipping partial aggregation. If the value is greater
     /// then partial aggregation will skip aggregation for further input
-    pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.8
-
-    /// Number of input rows partial aggregation partition should process, before
-    /// aggregation ratio check and trying to switch to skipping aggregation mode
-    pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000
+    pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.1

     /// Should DataFusion use row number estimates at the input to decide
     /// whether increasing parallelism is beneficial or not. By default,
```

Review comment: That's interesting, why does the 0.1 value make things faster 🤔 Is aggregation for simple cases (e.g. a single integer) slower than repartitioning 10x the rows?

Reply: Given the benchmark results, I think so. Especially if we have millions of distinct integers.
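To make the threshold semantics concrete, here is a minimal sketch (not the actual DataFusion implementation; the function name and batch sizes are illustrative) of the ratio check the config option controls: if the fraction of distinct groups seen per input rows exceeds the threshold, partial aggregation stops aggregating and passes rows through.

```rust
/// Hypothetical helper: returns true when partial aggregation should be
/// skipped for further input, based on the ratio of distinct groups to
/// input rows and the configured
/// `skip_partial_aggregation_probe_ratio_threshold`.
fn should_skip_partial_aggregation(
    num_groups: usize,
    num_rows: usize,
    ratio_threshold: f64,
) -> bool {
    if num_rows == 0 {
        // No input yet: nothing to base a decision on.
        return false;
    }
    (num_groups as f64 / num_rows as f64) > ratio_threshold
}

fn main() {
    // With the new default of 0.1: a high-cardinality batch (4000 distinct
    // groups out of 8192 rows, ratio ~0.49) triggers skipping, because
    // pre-aggregating would barely reduce the row count.
    assert!(should_skip_partial_aggregation(4000, 8192, 0.1));
    // A low-cardinality batch (ratio ~0.01) keeps aggregating.
    assert!(!should_skip_partial_aggregation(100, 8192, 0.1));
    println!("ok");
}
```

Lowering the default from 0.8 to 0.1 makes skipping kick in much earlier: pass-through plus final aggregation wins whenever the partial stage would forward nearly as many rows as it receives.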
```diff
@@ -0,0 +1,17 @@
+c1,c2
+1,'a'
+2,'b'
+3,'c'
+4,'d'
+1,'a'
+2,'b'
+3,'c'
+4,'d'
+4,'d'
+3,'c'
+3,'c'
+5,'e'
+6,'f'
+7,'g'
+8,'a'
+9,'b'
```
Review comment: Why has this setting been deleted? Checking whether aggregation should be skipped right at the beginning of the execution may lead to a skipping decision made based on an insufficient amount of data.

Reply: The reason is that in my strategy I make the decision per batch, since I assume the data is distributed evenly.
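The per-batch strategy from the reply can be sketched as follows. This is an illustrative reconstruction, not DataFusion's code: it estimates cardinality from the raw hash values of the group keys (as the PR title describes) rather than from materialized group values, treating each batch as representative of the whole stream.

```rust
use std::collections::HashSet;

/// Hypothetical probe: fraction of rows in one batch whose group-key hash
/// is distinct. Under the even-distribution assumption, one batch's ratio
/// approximates the stream-wide groups/rows ratio.
fn batch_cardinality_ratio(hashes: &[u64]) -> f64 {
    if hashes.is_empty() {
        return 0.0;
    }
    let distinct: HashSet<u64> = hashes.iter().copied().collect();
    distinct.len() as f64 / hashes.len() as f64
}

fn main() {
    // Low cardinality: 2 distinct hashes over 8 rows, ratio 0.25.
    let low_cardinality = vec![1u64, 1, 2, 2, 1, 2, 1, 2];
    // High cardinality: every row hashes to a distinct value, ratio 1.0.
    let high_cardinality: Vec<u64> = (0..8u64).collect();
    assert!(batch_cardinality_ratio(&low_cardinality) <= 0.5);
    assert!(batch_cardinality_ratio(&high_cardinality) > 0.9);
    println!("ok");
}
```

Deciding per batch removes the need for a separate rows-based probe threshold, which is why the `skip_partial_aggregation_probe_rows_threshold` setting could be dropped; the trade-off raised in the comment is that a single batch may not represent skewed input.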
Benchmarked configurations: ratio threshold 0.1 with rows threshold 100_000, and ratio threshold 0.1 with rows threshold 0.