[WIP] Dynamic evaluation of GroupBy Initial Capacity #14001
base: master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14001 +/- ##
=============================================
- Coverage 61.75% 27.94% -33.82%
- Complexity 207 213 +6
=============================================
Files 2436 2613 +177
Lines 133233 143288 +10055
Branches 20636 21998 +1362
=============================================
- Hits 82274 40037 -42237
- Misses 44911 100174 +55263
+ Partials 6048 3077 -2971
@@ -110,7 +117,11 @@ public DefaultGroupByExecutor(QueryContext queryContext, AggregationFunction[] a

    // Initialize result holders
    int maxNumResults = _groupKeyGenerator.getGlobalGroupKeyUpperBound();
    Integer optimalGroupByResultHolderCapacity = getGroupByResultHolderCapacityBasedOnFilterPredicate(queryContext);
Can we wire the logic of determining the filter-based result holder limit into maxInitialResultHolderCapacity()?
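One way to read this suggestion, as a minimal sketch: treat the filter-derived bound as one more upper limit when the initial capacity is computed, so the executor has a single place that decides the size. The names `InitialCapacity` and `resolve` are hypothetical stand-ins and not Pinot's API; in the real executor the inputs would come from the key generator and the query context.

```java
import java.util.OptionalInt;

/** Hypothetical helper, not part of Pinot: combines the upper bounds on the initial result holder capacity. */
public final class InitialCapacity {
  private InitialCapacity() {
  }

  /**
   * @param maxNumResults upper bound on group keys reported by the key generator
   * @param maxInitialCapacity configured cap on the initial capacity
   * @param filterBound bound derived from the filter, empty when no bound can be proven
   */
  public static int resolve(int maxNumResults, int maxInitialCapacity, OptionalInt filterBound) {
    int capacity = Math.min(maxNumResults, maxInitialCapacity);
    if (filterBound.isPresent()) {
      // The filter can only shrink the capacity; keep at least one slot.
      capacity = Math.min(capacity, Math.max(1, filterBound.getAsInt()));
    }
    return capacity;
  }
}
```

With this shape the filter bound never raises the capacity above the existing limits, so a wrong or missing estimate falls back to today's behavior.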
@@ -183,4 +194,26 @@ public GroupKeyGenerator getGroupKeyGenerator() {
  public GroupByResultHolder[] getGroupByResultHolders() {
    return _groupByResultHolders;
  }

  private Integer getGroupByResultHolderCapacityBasedOnFilterPredicate(QueryContext queryContext) {
    if (queryContext.getFilter() == null || queryContext.getGroupByExpressions() == null) {
We wouldn't reach this class unless there are groupByExpressions. If you want to guard this case at all, an assert would be better.
I think a better check here is whether the groupByExpressions contain only a single column (assuming that is the case you are optimizing).
    Map<ExpressionContext, InPredicate> predicateMap = filterContexts.stream()
        .map(FilterContext::getPredicate)
        .filter(predicate -> predicate.getType() == Predicate.Type.IN)
Should we also do this for EQ predicates?
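As a rough sketch of what covering EQ could look like: an equality predicate behaves like an IN list with one entry, so both can feed the same per-column count. The `PredicateBound` class and its `Type` enum below are a hypothetical model written for illustration; they do not use Pinot's own `Predicate` classes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical model of a filter predicate; Pinot's own Predicate classes are not used here. */
public final class PredicateBound {
  public enum Type { EQ, IN, RANGE }

  private PredicateBound() {
  }

  /** Returns how many distinct values the predicate allows for its column, or empty if unbounded. */
  public static OptionalInt allowedValues(Type type, List<String> values) {
    switch (type) {
      case EQ:
        // An equality predicate admits exactly one value.
        return OptionalInt.of(1);
      case IN:
        // Duplicates in the IN list do not create extra groups.
        return OptionalInt.of(new HashSet<>(values).size());
      default:
        // Range and other predicates give no usable bound in this sketch.
        return OptionalInt.empty();
    }
  }
}
```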
        .filter(predicate -> predicate.getType() == Predicate.Type.IN)
        .collect(Collectors.toMap(Predicate::getLhs, predicate -> (InPredicate) predicate));

    OptionalInt result = queryContext.getGroupByExpressions().stream()
If there is a single groupBy expression with a filter on the groupBy, this is correct.
However, consider the case where there are 2 groupByExpressions but only a filter on 1 column. In this case this optimization will not work.
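To make the two-column case concrete, here is a minimal sketch of a bound that only applies when every group-by column is constrained. It assumes the predicates are combined with AND at the top level of the filter (an OR would not bound the groups), and `GroupBound` and `estimate` are hypothetical names, not code from this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper, not Pinot code: bounds the number of groups from per-column value counts. */
public final class GroupBound {
  private GroupBound() {
  }

  /**
   * @param groupByColumns columns in the GROUP BY clause
   * @param valuesPerColumn number of values each filtered column may take (from IN/EQ predicates)
   * @return product of the per-column counts, or empty if any group-by column is unconstrained
   */
  public static OptionalInt estimate(List<String> groupByColumns, Map<String, Integer> valuesPerColumn) {
    long product = 1;
    for (String column : groupByColumns) {
      Integer count = valuesPerColumn.get(column);
      if (count == null) {
        // One unfiltered column is enough to make the filter-based bound meaningless.
        return OptionalInt.empty();
      }
      product *= count;
      if (product > Integer.MAX_VALUE) {
        // Too large to help; fall back to the default sizing.
        return OptionalInt.empty();
      }
    }
    return OptionalInt.of((int) product);
  }
}
```

Returning an empty result instead of a partial product keeps the caller on the existing default whenever the bound cannot be proven.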
Summary
For GroupBy queries, the default size of the GroupByResultHolder is set to 10K, which can lead to inefficient resource usage in cases where fewer group-by keys are expected, such as in queries with highly selective filters.
select column1, sum(column2) from testTable where column1 in ('123') group by column1 limit 20000
Description
This update dynamically adjusts the initial capacity of the GroupByResultHolder based on the filter predicates for such queries. By aligning the result holder size with the filter, we aim to optimize resource allocation and improve performance for filtered group-by queries.
Testing
TODO: Functional tests
Performance evaluation is also required to assess the trade-offs of the introduced overhead vs. resource optimization.