[WIP]Dynamic evaluation of GroupBy Initial Capacity #14001

Draft: praveenc7 wants to merge 2 commits into master

Conversation

praveenc7
Contributor

Summary

For GroupBy queries, the initial capacity of the GroupByResultHolder defaults to 10K, which can waste resources when far fewer group-by keys are expected, e.g. for queries with highly selective filters such as:

select column1, sum(column2) from testTable where column1 in ('123') group by column1 limit 20000

Description

This change dynamically derives the initial capacity of the GroupByResultHolder from the filter predicates of such queries. By sizing the result holder to the number of group keys the filter can actually produce, we aim to reduce over-allocation and improve performance for filtered group-by queries.
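
As a rough illustration of the idea, here is a simplified sketch (not the code in this PR): it assumes a single group-by column and a single top-level IN predicate, uses the QueryContext/FilterContext/InPredicate types referenced in the diff below, and the class, method, and constant names are made up for the example.

// Illustrative sketch only, not the code in this PR: derive a tighter initial capacity
// when the single group-by column is constrained by an IN filter predicate.
import java.util.List;
import org.apache.pinot.common.request.context.ExpressionContext;
import org.apache.pinot.common.request.context.FilterContext;
import org.apache.pinot.common.request.context.predicate.InPredicate;
import org.apache.pinot.common.request.context.predicate.Predicate;
import org.apache.pinot.core.query.request.context.QueryContext;

public class FilterAwareCapacitySketch {
  private static final int DEFAULT_INITIAL_CAPACITY = 10_000;

  // Returns a smaller capacity derived from the filter, or null to fall back to the default.
  static Integer capacityFromFilter(QueryContext queryContext) {
    List<ExpressionContext> groupBys = queryContext.getGroupByExpressions();
    FilterContext filter = queryContext.getFilter();
    if (filter == null || groupBys == null || groupBys.size() != 1) {
      return null;
    }
    // Simplification: only a single top-level predicate is considered in this sketch.
    Predicate predicate = filter.getPredicate();
    if (predicate == null || predicate.getType() != Predicate.Type.IN
        || !predicate.getLhs().equals(groupBys.get(0))) {
      return null;
    }
    // At most one group per value in the IN list, capped at the current default.
    return Math.min(((InPredicate) predicate).getValues().size(), DEFAULT_INITIAL_CAPACITY);
  }
}

The diff discussed below takes a more general route, collecting IN predicates into a map keyed by the filter column before matching them against the group-by expressions.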

Testing

TODO: Functional tests
Performance evaluation is also required to assess the trade-off between the added per-query overhead and the resource savings.

@codecov-commenter

codecov-commenter commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.

Project coverage is 27.94%. Comparing base (59551e4) to head (d6e0cae).
Report is 1036 commits behind head on master.

Files with missing lines                                Patch %   Lines
...ry/aggregation/groupby/DefaultGroupByExecutor.java   0.00%     18 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (59551e4) and HEAD (d6e0cae). Click for more details.

HEAD has 43 fewer uploads than BASE
Flag                     BASE (59551e4)   HEAD (d6e0cae)
integration                    7                0
integration2                   3                0
temurin                       12                3
java-21                        7                2
skip-bytebuffers-true          3                1
skip-bytebuffers-false         7                2
unittests                      5                3
unittests1                     2                0
java-11                        5                1
integration1                   2                0
custom-integration1            2                0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #14001       +/-   ##
=============================================
- Coverage     61.75%   27.94%   -33.82%     
- Complexity      207      213        +6     
=============================================
  Files          2436     2613      +177     
  Lines        133233   143288    +10055     
  Branches      20636    21998     +1362     
=============================================
- Hits          82274    40037    -42237     
- Misses        44911   100174    +55263     
+ Partials       6048     3077     -2971     
Flag                     Coverage Δ
custom-integration1      ?
integration              ?
integration1             ?
integration2             ?
java-11                  27.93% <0.00%> (-33.78%) ⬇️
java-21                  27.93% <0.00%> (-33.69%) ⬇️
skip-bytebuffers-false   27.94% <0.00%> (-33.81%) ⬇️
skip-bytebuffers-true    27.93% <0.00%> (+0.20%) ⬆️
temurin                  27.94% <0.00%> (-33.82%) ⬇️
unittests                27.94% <0.00%> (-33.81%) ⬇️
unittests1               ?
unittests2               27.94% <0.00%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@praveenc7 praveenc7 changed the title Dynamic evaluation of GroupBy Initial Capacity [WIP]Dynamic evaluation of GroupBy Initial Capacity Sep 15, 2024
@@ -110,7 +117,11 @@ public DefaultGroupByExecutor(QueryContext queryContext, AggregationFunction[] a

    // Initialize result holders
    int maxNumResults = _groupKeyGenerator.getGlobalGroupKeyUpperBound();
    Integer optimalGroupByResultHolderCapacity = getGroupByResultHolderCapacityBasedOnFilterPredicate(queryContext);
Contributor

Can we wire the logic of determining the filter-based result holder limit into maxInitialResultHolderCapacity()?
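
For illustration, such wiring could boil down to clamping the configured capacity by the filter-derived bound. This is only a sketch with a hypothetical helper name, not Pinot's actual maxInitialResultHolderCapacity() API:

  // Sketch only (hypothetical helper): a null bound means the filter analysis could not
  // derive anything, so the configured initial capacity is used as-is.
  private static int resolveInitialCapacity(int configuredInitialCapacity, Integer filterDerivedBound) {
    return filterDerivedBound == null
        ? configuredInitialCapacity
        : Math.min(configuredInitialCapacity, filterDerivedBound);
  }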

@@ -183,4 +194,26 @@ public GroupKeyGenerator getGroupKeyGenerator() {
  public GroupByResultHolder[] getGroupByResultHolders() {
    return _groupByResultHolders;
  }

  private Integer getGroupByResultHolderCapacityBasedOnFilterPredicate(QueryContext queryContext) {
    if (queryContext.getFilter() == null || queryContext.getGroupByExpressions() == null) {
Contributor

Unless there are groupByExpressions, we wouldn't come into this class. If you want to guard against that case at all, it's better to make it an assert.

I think a better check here is whether the groupByExpressions contain only a single column (assuming that is the case you are optimizing for).
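
For illustration, a guard along those lines might look like the fragment below. It is a sketch assuming the enclosing getGroupByResultHolderCapacityBasedOnFilterPredicate(queryContext) method from the diff and assuming ExpressionContext exposes its type as IDENTIFIER for plain column references; it is not code from the PR.

    // Sketch: bail out unless there is exactly one group-by expression and it is a plain column.
    List<ExpressionContext> groupByExpressions = queryContext.getGroupByExpressions();
    assert groupByExpressions != null && !groupByExpressions.isEmpty();
    if (groupByExpressions.size() != 1
        || groupByExpressions.get(0).getType() != ExpressionContext.Type.IDENTIFIER) {
      return null;  // fall back to the default initial capacity
    }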


    Map<ExpressionContext, InPredicate> predicateMap = filterContexts.stream()
        .map(FilterContext::getPredicate)
        .filter(predicate -> predicate.getType() == Predicate.Type.IN)
Contributor

Should we also do this for the EQ predicate? (A possible variant is sketched after this snippet.)

        .collect(Collectors.toMap(Predicate::getLhs, predicate -> (InPredicate) predicate));
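
One way to fold in the EQ suggestion, as a rough sketch rather than the PR's code: map each filtered column to the number of values it may take, treating an EQ predicate as a one-value IN (the variable name valueCountByColumn is illustrative).

    // Sketch only: count candidate values per filtered column, covering both IN and EQ.
    Map<ExpressionContext, Integer> valueCountByColumn = filterContexts.stream()
        .map(FilterContext::getPredicate)
        .filter(p -> p != null
            && (p.getType() == Predicate.Type.IN || p.getType() == Predicate.Type.EQ))
        .collect(Collectors.toMap(Predicate::getLhs,
            p -> p.getType() == Predicate.Type.EQ ? 1 : ((InPredicate) p).getValues().size(),
            Math::min));

The per-column counts could then be combined (and capped at the default capacity) only when every group-by column is covered by such a predicate.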

    OptionalInt result = queryContext.getGroupByExpressions().stream()
Contributor

If there is a single groupBy expression with a filter on that group-by column, this is correct.

However, consider the case where there are two groupBy expressions but a filter on only one of the columns; in that case this optimization will not work.
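
For example (an illustrative query, not from the PR), a filter on column1 alone bounds column1 to one value but says nothing about the number of distinct column3 values, so the group count cannot be bounded by the size of the IN list:

select column1, column3, sum(column2) from testTable where column1 in ('123') group by column1, column3 limit 20000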
