[WIP]Dynamic evaluation of GroupBy Initial Capacity #14001

Draft: praveenc7 wants to merge 2 commits into master

Conversation

praveenc7
Contributor

Summary

For GroupBy queries, the initial capacity of the GroupByResultHolder defaults to 10K, which can waste resources when far fewer group-by keys are expected, e.g. for queries with highly selective filters such as:

select column1, sum(column2) from testTable where column1 in ('123') group by column1 limit 20000

Description

This change dynamically derives the initial capacity of the GroupByResultHolder from the filter predicates of such queries. By sizing the result holder to the number of group keys the filter can actually produce, we aim to reduce over-allocation and improve performance for filtered group-by queries.
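
As a rough illustration of the idea, here is a simplified sketch (not the code in this PR): it assumes a single group-by column and a single top-level IN predicate, uses the QueryContext/FilterContext/InPredicate types referenced in the diff below, and the class, method, and constant names are made up for the example.

// Illustrative sketch only, not the code in this PR: derive a tighter initial capacity
// when the single group-by column is constrained by an IN filter predicate.
import java.util.List;
import org.apache.pinot.common.request.context.ExpressionContext;
import org.apache.pinot.common.request.context.FilterContext;
import org.apache.pinot.common.request.context.predicate.InPredicate;
import org.apache.pinot.common.request.context.predicate.Predicate;
import org.apache.pinot.core.query.request.context.QueryContext;

public class FilterAwareCapacitySketch {
  private static final int DEFAULT_INITIAL_CAPACITY = 10_000;

  // Returns a smaller capacity derived from the filter, or null to fall back to the default.
  static Integer capacityFromFilter(QueryContext queryContext) {
    List<ExpressionContext> groupBys = queryContext.getGroupByExpressions();
    FilterContext filter = queryContext.getFilter();
    if (filter == null || groupBys == null || groupBys.size() != 1) {
      return null;
    }
    // Simplification: only a single top-level predicate is considered in this sketch.
    Predicate predicate = filter.getPredicate();
    if (predicate == null || predicate.getType() != Predicate.Type.IN
        || !predicate.getLhs().equals(groupBys.get(0))) {
      return null;
    }
    // At most one group per value in the IN list, capped at the current default.
    return Math.min(((InPredicate) predicate).getValues().size(), DEFAULT_INITIAL_CAPACITY);
  }
}

The diff discussed below takes a more general route, collecting IN predicates into a map keyed by the filter column before matching them against the group-by expressions.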

Testing

TODO: Functional tests
Performance evaluation is also required to assess the trade-off between the added per-query overhead and the resource savings.

@codecov-commenter

codecov-commenter commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.

Project coverage is 27.94%. Comparing base (59551e4) to head (d6e0cae).
Report is 1036 commits behind head on master.

Files with missing lines                                Patch %   Lines
...ry/aggregation/groupby/DefaultGroupByExecutor.java   0.00%     18 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (59551e4) and HEAD (d6e0cae). Click for more details.

HEAD has 43 fewer uploads than BASE
Flag                     BASE (59551e4)   HEAD (d6e0cae)
integration                    7                0
integration2                   3                0
temurin                       12                3
java-21                        7                2
skip-bytebuffers-true          3                1
skip-bytebuffers-false         7                2
unittests                      5                3
unittests1                     2                0
java-11                        5                1
integration1                   2                0
custom-integration1            2                0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #14001       +/-   ##
=============================================
- Coverage     61.75%   27.94%   -33.82%     
- Complexity      207      213        +6     
=============================================
  Files          2436     2613      +177     
  Lines        133233   143288    +10055     
  Branches      20636    21998     +1362     
=============================================
- Hits          82274    40037    -42237     
- Misses        44911   100174    +55263     
+ Partials       6048     3077     -2971     
Flag                     Coverage Δ
custom-integration1      ?
integration              ?
integration1             ?
integration2             ?
java-11                  27.93% <0.00%> (-33.78%) ⬇️
java-21                  27.93% <0.00%> (-33.69%) ⬇️
skip-bytebuffers-false   27.94% <0.00%> (-33.81%) ⬇️
skip-bytebuffers-true    27.93% <0.00%> (+0.20%) ⬆️
temurin                  27.94% <0.00%> (-33.82%) ⬇️
unittests                27.94% <0.00%> (-33.81%) ⬇️
unittests1               ?
unittests2               27.94% <0.00%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@praveenc7 praveenc7 changed the title Dynamic evaluation of GroupBy Initial Capacity [WIP]Dynamic evaluation of GroupBy Initial Capacity Sep 15, 2024
@@ -110,7 +117,11 @@ public DefaultGroupByExecutor(QueryContext queryContext, AggregationFunction[] a

    // Initialize result holders
    int maxNumResults = _groupKeyGenerator.getGlobalGroupKeyUpperBound();
    Integer optimalGroupByResultHolderCapacity = getGroupByResultHolderCapacityBasedOnFilterPredicate(queryContext);
Contributor

Can we wire the logic of determining the filter-based result holder limit into maxInitialResultHolderCapacity()?
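
For illustration, such wiring could boil down to clamping the configured capacity by the filter-derived bound. This is only a sketch with a hypothetical helper name, not Pinot's actual maxInitialResultHolderCapacity() API:

  // Sketch only (hypothetical helper): a null bound means the filter analysis could not
  // derive anything, so the configured initial capacity is used as-is.
  private static int resolveInitialCapacity(int configuredInitialCapacity, Integer filterDerivedBound) {
    return filterDerivedBound == null
        ? configuredInitialCapacity
        : Math.min(configuredInitialCapacity, filterDerivedBound);
  }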

@@ -183,4 +194,26 @@ public GroupKeyGenerator getGroupKeyGenerator() {
  public GroupByResultHolder[] getGroupByResultHolders() {
    return _groupByResultHolders;
  }

  private Integer getGroupByResultHolderCapacityBasedOnFilterPredicate(QueryContext queryContext) {
    if (queryContext.getFilter() == null || queryContext.getGroupByExpressions() == null) {
Contributor

Unless there are groupByExpressions, we wouldn't come into this class. If you want to guard against that case at all, it's better to make it an assert.

I think a better check here is whether the groupByExpressions contain only a single column (assuming that is the case you are optimizing for).
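
For illustration, a guard along those lines might look like the fragment below. It is a sketch assuming the enclosing getGroupByResultHolderCapacityBasedOnFilterPredicate(queryContext) method from the diff and assuming ExpressionContext exposes its type as IDENTIFIER for plain column references; it is not code from the PR.

    // Sketch: bail out unless there is exactly one group-by expression and it is a plain column.
    List<ExpressionContext> groupByExpressions = queryContext.getGroupByExpressions();
    assert groupByExpressions != null && !groupByExpressions.isEmpty();
    if (groupByExpressions.size() != 1
        || groupByExpressions.get(0).getType() != ExpressionContext.Type.IDENTIFIER) {
      return null;  // fall back to the default initial capacity
    }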


    Map<ExpressionContext, InPredicate> predicateMap = filterContexts.stream()
        .map(FilterContext::getPredicate)
        .filter(predicate -> predicate.getType() == Predicate.Type.IN)
Contributor

Should we also do this for the EQ predicate? (A possible variant is sketched after this snippet.)

        .collect(Collectors.toMap(Predicate::getLhs, predicate -> (InPredicate) predicate));
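
One way to fold in the EQ suggestion, as a rough sketch rather than the PR's code: map each filtered column to the number of values it may take, treating an EQ predicate as a one-value IN (the variable name valueCountByColumn is illustrative).

    // Sketch only: count candidate values per filtered column, covering both IN and EQ.
    Map<ExpressionContext, Integer> valueCountByColumn = filterContexts.stream()
        .map(FilterContext::getPredicate)
        .filter(p -> p != null
            && (p.getType() == Predicate.Type.IN || p.getType() == Predicate.Type.EQ))
        .collect(Collectors.toMap(Predicate::getLhs,
            p -> p.getType() == Predicate.Type.EQ ? 1 : ((InPredicate) p).getValues().size(),
            Math::min));

The per-column counts could then be combined (and capped at the default capacity) only when every group-by column is covered by such a predicate.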

    OptionalInt result = queryContext.getGroupByExpressions().stream()
Contributor

If there is a single groupBy expression with a filter on that group-by column, this is correct.

However, consider the case where there are two groupBy expressions but a filter on only one of the columns; in that case this optimization will not work.
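
For example (an illustrative query, not from the PR), a filter on column1 alone bounds column1 to one value but says nothing about the number of distinct column3 values, so the group count cannot be bounded by the size of the IN list:

select column1, column3, sum(column2) from testTable where column1 in ('123') group by column1, column3 limit 20000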
