Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip partial aggregation based on the cardinality of hash value instead of group values #12697

Closed
wants to merge 18 commits into from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Oct 1, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

The current skip aggregation logic calculates the ratio between the number of group values and the total batch size. However, what we actually need is the ratio of unique group values compared to the total number of group values.

To achieve this, we don't need to append group values to the GroupValueBuilder. Instead, we can simply check the hash value.

Benchmarking also shows an improvement for high-cardinality queries. Although low-cardinality cases, like TPCH Q1, slow down due to the additional HashSet computation, I believe the tradeoff is worthwhile.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

TODO

  • Remove the existing skipping logic based on group values and see if there is any slowdown

Benchmark

Comparing main and single-mode-v4
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.41ms │         0.44ms │  1.09x slower │
│ QQuery 1     │    42.61ms │        42.14ms │     no change │
│ QQuery 2     │    75.89ms │        75.09ms │     no change │
│ QQuery 3     │    72.24ms │        63.22ms │ +1.14x faster │
│ QQuery 4     │   399.04ms │       376.35ms │ +1.06x faster │
│ QQuery 5     │   663.20ms │       634.01ms │     no change │
│ QQuery 6     │    37.91ms │        37.54ms │     no change │
│ QQuery 7     │    42.71ms │        41.72ms │     no change │
│ QQuery 8     │   638.09ms │       619.10ms │     no change │
│ QQuery 9     │   678.82ms │       704.86ms │     no change │
│ QQuery 10    │   200.65ms │       192.01ms │     no change │
│ QQuery 11    │   233.85ms │       221.90ms │ +1.05x faster │
│ QQuery 12    │   718.91ms │       638.75ms │ +1.13x faster │
│ QQuery 13    │   926.86ms │       889.40ms │     no change │
│ QQuery 14    │   873.60ms │       716.54ms │ +1.22x faster │
│ QQuery 15    │   498.70ms │       507.27ms │     no change │
│ QQuery 16    │  1305.01ms │      1285.15ms │     no change │
│ QQuery 17    │  1190.26ms │      1138.84ms │     no change │
│ QQuery 18    │  3347.18ms │      2005.19ms │ +1.67x faster │
│ QQuery 19    │    55.77ms │        57.10ms │     no change │
│ QQuery 20    │   943.13ms │       940.61ms │     no change │
│ QQuery 21    │  1206.45ms │      1225.04ms │     no change │
│ QQuery 22    │  3230.71ms │      3221.00ms │     no change │
│ QQuery 23    │  8186.73ms │      7956.04ms │     no change │
│ QQuery 24    │   500.91ms │       479.24ms │     no change │
│ QQuery 25    │   502.13ms │       482.03ms │     no change │
│ QQuery 26    │   561.53ms │       536.60ms │     no change │
│ QQuery 27    │  1338.06ms │      1383.75ms │     no change │
│ QQuery 28    │ 10507.08ms │     11563.16ms │  1.10x slower │
│ QQuery 29    │   398.16ms │       395.54ms │     no change │
│ QQuery 30    │   760.72ms │       669.24ms │ +1.14x faster │
│ QQuery 31    │   712.91ms │       684.49ms │     no change │
│ QQuery 32    │  3567.36ms │      3222.74ms │ +1.11x faster │
│ QQuery 33    │  4863.25ms │      3308.70ms │ +1.47x faster │
│ QQuery 34    │  4115.97ms │      3256.70ms │ +1.26x faster │
│ QQuery 35    │   998.40ms │      1031.65ms │     no change │
│ QQuery 36    │   147.99ms │       130.10ms │ +1.14x faster │
│ QQuery 37    │   101.82ms │       108.67ms │  1.07x slower │
│ QQuery 38    │   108.77ms │       105.60ms │     no change │
│ QQuery 39    │   327.29ms │       259.94ms │ +1.26x faster │
│ QQuery 40    │    33.56ms │        32.70ms │     no change │
│ QQuery 41    │    34.09ms │        37.00ms │  1.09x slower │
│ QQuery 42    │    39.32ms │        42.22ms │  1.07x slower │
└──────────────┴────────────┴────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)             │ 55188.04ms │
│ Total Time (single-mode-v4)   │ 51319.38ms │
│ Average Time (main)           │  1283.44ms │
│ Average Time (single-mode-v4) │  1193.47ms │
│ Queries Faster                │         12 │
│ Queries Slower                │          5 │
│ Queries with No Change        │         26 │
└───────────────────────────────┴────────────┘
Note: Skipping /Users/bytedance/arrow-datafusion/benchmarks/results/main/clickbench_partitioned.json as /Users/bytedance/arrow-datafusion/benchmarks/results/single-mode-v4/clickbench_partitioned.json does not exist
Note: Skipping /Users/bytedance/arrow-datafusion/benchmarks/results/main/tpch_mem_sf10.json as /Users/bytedance/arrow-datafusion/benchmarks/results/single-mode-v4/tpch_mem_sf10.json does not exist
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  94.58ms │       101.79ms │  1.08x slower │
│ QQuery 2     │  29.05ms │        27.99ms │     no change │
│ QQuery 3     │  45.17ms │        43.31ms │     no change │
│ QQuery 4     │  30.32ms │        30.31ms │     no change │
│ QQuery 5     │  62.54ms │        64.16ms │     no change │
│ QQuery 6     │  20.73ms │        20.86ms │     no change │
│ QQuery 7     │  77.07ms │        80.34ms │     no change │
│ QQuery 8     │  64.10ms │        61.17ms │     no change │
│ QQuery 9     │  75.10ms │        76.16ms │     no change │
│ QQuery 10    │  68.77ms │        75.01ms │  1.09x slower │
│ QQuery 11    │  17.59ms │        18.58ms │  1.06x slower │
│ QQuery 12    │  49.99ms │        52.32ms │     no change │
│ QQuery 13    │  41.30ms │        41.81ms │     no change │
│ QQuery 14    │  33.14ms │        36.12ms │  1.09x slower │
│ QQuery 15    │  48.04ms │        49.63ms │     no change │
│ QQuery 16    │  19.68ms │        20.69ms │  1.05x slower │
│ QQuery 17    │  87.01ms │        71.20ms │ +1.22x faster │
│ QQuery 18    │ 119.28ms │       123.64ms │     no change │
│ QQuery 19    │  61.27ms │        57.87ms │ +1.06x faster │
│ QQuery 20    │  46.45ms │        48.23ms │     no change │
│ QQuery 21    │  92.86ms │        92.10ms │     no change │
│ QQuery 22    │  18.37ms │        19.59ms │  1.07x slower │
└──────────────┴──────────┴────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary             ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)             │ 1202.43ms │
│ Total Time (single-mode-v4)   │ 1212.87ms │
│ Average Time (main)           │   54.66ms │
│ Average Time (single-mode-v4) │   55.13ms │
│ Queries Faster                │         2 │
│ Queries Slower                │         6 │
│ Queries with No Change        │        14 │
└───────────────────────────────┴───────────┘

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 1, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
if group_values.is_empty() {
return internal_err!("group_values expected to have at least one element");
}
let mut output = group_values.swap_remove(0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result further improve with removing clone 😆 , sounds like cheating

@Rachelint
Copy link
Contributor

Rachelint commented Oct 1, 2024

🤔 As I understand, seems the main difference with the original partial skipping like to be threshold?

Is it possible that improvement comes from tirggering partial skipping earlier?

I guess maybe we can modify the params in partial skipping in main, and see if we can get the similar performance.

  • threshold here:
            let ratio = self.unique_hashes_count.len() as f32
                / self.accumulated_batch_size as f32;
            // TODO: Configure the threshold
            self.skip_partial_with_hash_count = ratio > 0.1;
  • original threshold:
        // `probe_rows_threshold` default to 100_000
       // `probe_ratio_threshold` default to 0.8
        if self.input_rows >= self.probe_rows_threshold {
            self.should_skip = self.num_groups as f64 / self.input_rows as f64
                >= self.probe_ratio_threshold;
            self.is_locked = true;
        }

Actually, I think if partial is really necessary in standalone, although it is ensured to provide a performance improve in distributed query (push down the partial to remote datasource)?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added common Related to common crate execution Related to the execution crate labels Oct 2, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Oct 2, 2024

Is it possible that improvement comes from tirggering partial skipping earlier?

If we only compute hash, we are able to skip early. Otherwise, the original approach is only able to skip after the group values are calculated

Threshold might be one of the reason, but skip based on hash is even better

I think partial is only helpful for low cardinality query where we can reduce the output rows at the first stage, so the repartition cost is low.

In distributed context, I guess we are still able to scale it with repartition + final?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 2, 2024
@Rachelint
Copy link
Contributor

Rachelint commented Oct 2, 2024

If we only compute hash, we are able to skip early. Otherwise, the original approach is only able to skip after the group values are calculated

I am a bit confused about this, it seems only the calculation of last batch before skipping will can be avoided? It may run like this. Maybe I misunderstand, I am not so sure?

-----batch 1-----
- compute hash
- check if can trigger skip, and found no
- call `group_aggregate_batch` to update intermediate states

-----batch 2-----
- compute hash
- check if can trigger skip, and found no
- call `group_aggregate_batch` to update intermediate states

...

-----last batch before skipping-----
- compute hash
- check if can trigger skip, and found yes
- avoid `group_aggregate_batch` here, and switch to partial skipping, and run the skip logic after

I think partial is only helpful for low cardinality query where we can reduce the output rows at the first stage, so the repartition cost is low.

But it introduced cost of partial aggr, I still have no idea how to compare the cost of them...

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Oct 2, 2024

I am a bit confused about this, it seems only the calculation of last batch before skipping will can be avoided? It may run like this. Maybe I misunderstand, I am not so sure?

I think in partial group by stage, all the batch is computed independently, so only the batch that has low cardinality is not skipped, others are skipped. For high cardinality case, we skip almost all the batch and move on to repartition quickly

But it introduced cost of partial aggr, I still have no idea how to compare the cost of them...

For low cardinality case, the cost of repartition + final is almost negligible. Partial aggregation is the only part takes time.

@Rachelint
Copy link
Contributor

Rachelint commented Oct 2, 2024

I think in partial group by stage, all the batch is computed independently, so only the batch that has low cardinality is not skipped, others are skipped. For high cardinality case, we skip almost all the batch and move on to repartition quickly

Yes, I am confused about that:

  • Assume we have 3 batches, name them batch 1, batch 2, batch 3

  • And we will statistic the total rows + unique groups(through hashes or actual group values) of them.

  • Now, after we statistic batch 1, batch2, we found the threashold is still not reached (e.g. unique gourps/ total rows < 0.1), so we still need to call group_aggregate_batch in partial aggr for them.

  • Finally when batch 3 comes, we found the threashold is reached.
    In current way, we can switch to skip mode without calling group_aggregate_batch for batch 3.

  • But in old way, we still need to call group_aggregate_batch for batch 3, because we can only update the skip mode probe after this.

So it seems it is actually the group_aggregate_batch calling for batch 3 is avoided?

I modify the threshold in main for quick test, and found the similar improvement in my local,
Maybe threshold is one of improtant reasons?

      /// Aggregation ratio (number of distinct groups / number of input rows)
      /// threshold for skipping partial aggregation. If the value is greater
      /// then partial aggregation will skip aggregation for further input
      pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.1

      /// Number of input rows partial aggregation partition should process, before
      /// aggregation ratio check and trying to switch to skipping aggregation mode
      pub skip_partial_aggregation_probe_rows_threshold: usize, default = 0
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ main-skip-test ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.69ms │         0.66ms │     no change │
│ QQuery 1     │    66.84ms │        67.50ms │     no change │
│ QQuery 2     │   162.50ms │       164.91ms │     no change │
│ QQuery 3     │   179.63ms │       181.60ms │     no change │
│ QQuery 4     │  1573.29ms │      1358.25ms │ +1.16x faster │
│ QQuery 5     │  1562.44ms │      1654.18ms │  1.06x slower │
│ QQuery 6     │    59.98ms │        61.11ms │     no change │
│ QQuery 7     │    75.05ms │        80.01ms │  1.07x slower │
│ QQuery 8     │  1992.41ms │      1980.59ms │     no change │
│ QQuery 9     │  1917.94ms │      1923.21ms │     no change │
│ QQuery 10    │   519.81ms │       524.39ms │     no change │
│ QQuery 11    │   575.01ms │       596.81ms │     no change │
│ QQuery 12    │  1840.24ms │      1702.10ms │ +1.08x faster │
│ QQuery 13    │  2956.88ms │      2606.70ms │ +1.13x faster │
│ QQuery 14    │  2087.64ms │      1882.93ms │ +1.11x faster │
│ QQuery 15    │  1876.95ms │      1793.42ms │     no change │
│ QQuery 16    │  4067.66ms │      4034.91ms │     no change │
│ QQuery 17    │  3625.24ms │      3466.89ms │     no change │
│ QQuery 18    │  8239.67ms │      5998.16ms │ +1.37x faster │
│ QQuery 19    │   143.83ms │       146.26ms │     no change │
│ QQuery 20    │  3241.98ms │      3241.14ms │     no change │
│ QQuery 21    │  3960.47ms │      3979.09ms │     no change │
│ QQuery 22    │  9482.59ms │      9642.21ms │     no change │
│ QQuery 23    │ 24274.55ms │     24145.17ms │     no change │
│ QQuery 24    │  1160.40ms │      1160.34ms │     no change │
│ QQuery 25    │  1039.97ms │      1045.92ms │     no change │
│ QQuery 26    │  1342.15ms │      1342.59ms │     no change │
│ QQuery 27    │  4738.06ms │      4727.14ms │     no change │
│ QQuery 28    │ 25119.74ms │     25924.65ms │     no change │
│ QQuery 29    │   912.90ms │       911.02ms │     no change │
│ QQuery 30    │  1845.66ms │      1688.06ms │ +1.09x faster │
│ QQuery 31    │  2048.51ms │      2070.58ms │     no change │
│ QQuery 32    │  7394.42ms │      7402.67ms │     no change │
│ QQuery 33    │  9638.77ms │     10011.16ms │     no change │
│ QQuery 34    │  9680.47ms │      9964.31ms │     no change │
│ QQuery 35    │  2765.56ms │      2898.23ms │     no change │
│ QQuery 36    │   247.37ms │       224.61ms │ +1.10x faster │
│ QQuery 37    │   163.31ms │       165.14ms │     no change │
│ QQuery 38    │   153.40ms │       155.48ms │     no change │
│ QQuery 39    │   639.96ms │       480.78ms │ +1.33x faster │
│ QQuery 40    │    57.80ms │        58.41ms │     no change │
│ QQuery 41    │    54.26ms │        54.73ms │     no change │
│ QQuery 42    │    66.15ms │        72.01ms │  1.09x slower │
└──────────────┴────────────┴────────────────┴───────────────┘

@jayzhan211
Copy link
Contributor Author

Not only the threshold matters but also the group values computation matters, so I'm not surprised that changing the threshold speeds up the code.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review October 3, 2024 08:22
@alamb alamb requested a review from korowa October 3, 2024 14:50
@korowa
Copy link
Contributor

korowa commented Oct 3, 2024

Planning to review it during the next couple of days.

@alamb
Copy link
Contributor

alamb commented Oct 4, 2024

Running some benchmarks

@alamb
Copy link
Contributor

alamb commented Oct 4, 2024

To achieve this, we don't need to append group values to the GroupValueBuilder. Instead, we can simply check the hash value.

I haven't fully reivewed this PR yet. One thing I wonder is if the benfits in performance could potentially be had by adjusting the default aggregation ratio. Like for example, what if we changed the default ratio datafusion.execution.skip_partial_aggregation_probe_ratio_threshold to be more like 50 (as in if the first phase isn't cutting the cardinality down by at least a factor of 2, stop doing it) 🤔

@Rachelint
Copy link
Contributor

Rachelint commented Oct 4, 2024

I found maybe we can get similar benefits through(detail can see #12765 ):

  • adjusting parameters of skip_partial_aggregation_probe_rows_threshold and skip_partial_aggregation_probe_ratio_threshold

  • remove the is_locked in SkipAggregationProbe

This is the number in my local:

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main-skip-test ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │         0.66ms │         0.70ms │  1.06x slower │
│ QQuery 1     │        66.12ms │        67.44ms │     no change │
│ QQuery 2     │       157.78ms │       156.85ms │     no change │
│ QQuery 3     │       182.37ms │       182.58ms │     no change │
│ QQuery 4     │      1166.76ms │      1208.11ms │     no change │
│ QQuery 5     │      1596.94ms │      1693.68ms │  1.06x slower │
│ QQuery 6     │        56.18ms │        56.64ms │     no change │
│ QQuery 7     │        75.18ms │        78.03ms │     no change │
│ QQuery 8     │      1769.26ms │      1792.14ms │     no change │
│ QQuery 9     │      1895.46ms │      2050.89ms │  1.08x slower │
│ QQuery 10    │       528.17ms │       544.64ms │     no change │
│ QQuery 11    │       590.00ms │       579.76ms │     no change │
│ QQuery 12    │      1671.23ms │      1729.08ms │     no change │
│ QQuery 13    │      2603.57ms │      2645.23ms │     no change │
│ QQuery 14    │      1898.81ms │      1867.07ms │     no change │
│ QQuery 15    │      1586.71ms │      1631.27ms │     no change │
│ QQuery 16    │      3571.19ms │      3571.63ms │     no change │
│ QQuery 17    │      3118.61ms │      3151.88ms │     no change │
│ QQuery 18    │      5961.07ms │      6039.86ms │     no change │
│ QQuery 19    │       146.76ms │       147.11ms │     no change │
│ QQuery 20    │      3223.87ms │      3218.55ms │     no change │
│ QQuery 21    │      3964.39ms │      3988.32ms │     no change │
│ QQuery 22    │      9464.66ms │      9598.59ms │     no change │
│ QQuery 23    │     23931.99ms │     23687.13ms │     no change │
│ QQuery 24    │      1177.19ms │      1179.07ms │     no change │
│ QQuery 25    │      1043.12ms │      1032.91ms │     no change │
│ QQuery 26    │      1356.44ms │      1348.37ms │     no change │
│ QQuery 27    │      4729.99ms │      4870.28ms │     no change │
│ QQuery 28    │     29261.56ms │     28726.86ms │     no change │
│ QQuery 29    │       899.79ms │       892.08ms │     no change │
│ QQuery 30    │      1686.06ms │      1679.16ms │     no change │
│ QQuery 31    │      2026.30ms │      2024.04ms │     no change │
│ QQuery 32    │      7345.30ms │      7510.59ms │     no change │
│ QQuery 33    │     10065.96ms │     10362.04ms │     no change │
│ QQuery 34    │     10039.00ms │     10334.93ms │     no change │
│ QQuery 35    │      2888.42ms │      2873.64ms │     no change │
│ QQuery 36    │       219.44ms │       228.90ms │     no change │
│ QQuery 37    │       160.72ms │       172.34ms │  1.07x slower │
│ QQuery 38    │       153.41ms │       154.50ms │     no change │
│ QQuery 39    │       473.11ms │       469.37ms │     no change │
│ QQuery 40    │        61.92ms │        58.07ms │ +1.07x faster │
│ QQuery 41    │        54.73ms │        53.30ms │     no change │
│ QQuery 42    │        74.19ms │        73.35ms │     no change │
└──────────────┴────────────────┴────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary             ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main-skip-test)   │ 142944.39ms │
│ Total Time (single-mode-v4)   │ 143730.97ms │
│ Average Time (main-skip-test) │   3324.29ms │
│ Average Time (single-mode-v4) │   3342.58ms │
│ Queries Faster                │           1 │
│ Queries Slower                │           4 │
│ Queries with No Change        │          38 │
└───────────────────────────────┴─────────────┘

@jayzhan211
Copy link
Contributor Author

I think the threshold here is the main reason of the performance improvement but I think checking with hash value helps although the benchmark doesn't show that.

@jayzhan211 jayzhan211 marked this pull request as draft October 5, 2024 02:44
@jayzhan211
Copy link
Contributor Author

query like SELECT "UserID", "SearchPhrase", CONCAT("SearchPhrase", REPEAT('DataFusion', 32)) as R, COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase", R LIMIT 10; improve significantly, but other what I thought to be improved is not. Hmm.. 🤔

SELECT "UserID", "SearchPhrase", CONCAT("SearchPhrase", REPEAT('DataFusion', 32)) as R, COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase", R LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", CONCAT("SearchPhrase", REPEAT('DataFusion', 32)) as R, COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase", R ORDER BY COUNT(*) DESC LIMIT 10;
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main-skip-test ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │      6053.73ms │      4446.50ms │ +1.36x faster │
│ QQuery 1     │      1993.84ms │      2016.60ms │     no change │
│ QQuery 2     │     12862.52ms │     12575.49ms │     no change │
└──────────────┴────────────────┴────────────────┴───────────────┘

Comment on lines +593 to +595
if self.skip_partial_aggregation {
let states = self.transform_to_states(batch)?;
self.exec_state = ExecutionState::ProducingOutput(states);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, some batches maybe have been updated into intermediate states(GroupValues and GroupAccumulators).
Maybe we should emit them first like what is done in switch_to_skip_aggregation?

I guess the triggering process may be like:

### batch 1
- not reach the skip threshold
- update into `GroupValues` and `GroupAccumulator`s

### batch 2
- not reach the skip threshold
- update into `GroupValues` and `GroupAccumulator`s

...

### batch n
- finally reach the skip threshold
- we need to `emit` things in `GroupValues` and `GroupAccumulator`s
- then we convert to the skip mode

🤔 And it seems that just the last batch (batch n in the exmaple)'s updating can be avoided, comparing to the original partial skipping logic. And the cost of updating one batch may be not so high?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it seems that just the last batch (batch n in the exmaple)'s updating can be avoided, comparing to the original partial skipping logic. And the cost of updating one batch may be not so high?

Why only the last batch is not computed? If batch m reach the skip threshold, I think m to n are all skipped, therefore n-m computation are saved.

Copy link
Contributor

@Rachelint Rachelint Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it seems that just the last batch (batch n in the exmaple)'s updating can be avoided, comparing to the original partial skipping logic. And the cost of updating one batch may be not so high?

Why only the last batch is not computed? If batch m reach the skip threshold, I think m to n are all skipped, therefore n-m computation are saved.

Oh, I think maybe I got it? Do you mean the loop here?

fn group_aggregate_batch_with_skipping_partial() {
        for (index, group_values) in group_by_values.iter().enumerate() {
              ...
              // reach the threshold, and we break the loop and skip earlier
              ...
        }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, for simple group by case, group_by_values is mostly single, I think only grouping sets query have more than one group by values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any batch reach the threshold limit, we will switch to ProducingOutput

if self.skip_partial_aggregation {
      let states = self.transform_to_states(batch)?;
      self.exec_state = ExecutionState::ProducingOutput(states);
      // make sure the exec_state just set is not overwritten below
      break 'reading_input;
  }

I think the skipping logic has no difference from the main branch, the only difference is that we compute the hash, and skip based on the number of unique value in hash table

Copy link
Contributor

@Rachelint Rachelint Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, for simple group by case, group_by_values is mostly single, I think only grouping sets query have more than one group by values.

🤔 Yes, I am just a bit confused about saving n - m computation in #12697 (comment) if it is said single group by case.

Assume the threshold is reached until batch n comes:

  • In main, we will firstly update batch n into GroupValues and GroupAccumulators, and found threshold reached after, so we skipped batch n+1 ~ batch last
// `batch n` comes, update it into `GroupValues` and `GroupAccumulator`s.
extract_ok!(self.group_aggregate_batch(batch));

// Then we found threshold reached here after updating.
self.update_skip_aggregation_probe(input_rows);

// Switch to skip mode, and we skip `batch n+1 ~ batch last`
extract_ok!(self.switch_to_skip_aggregation());
  • And currently, we can know threshold reached before updating batch n into GroupValues and GroupAccumulators, so we can skip batch n ~ batch last
fn group_aggregate_batch_with_skipping_partial(
    &mut self,
    batch: &RecordBatch,
) -> Result<()> {
   for (index, group_values) in group_by_values.iter().enumerate() {
         // We calculate hashes and check skip threshold here
         // When `batch n` comes, we found threshold reached, 
         // just return and not update `batch n`.
         
         // And it will switch to skip mode for `batch n ~ batch last` after return    
   } 
    
   // We will still update `batch 0 ~ batch n-1`, because threshold is only
   // reached until `batch n` comes.
   for (index, group_values) in group_by_values.iter().enumerate() {
          // Update `batch` into `GroupValues` and `GroupAccumulator`s
    }
  

Seems just updating of batch n is avoided?

😄 Sorry for maybe asking too many questions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. It tells why the improvement is not as high as expected 🤔

I think the attempt for this change failed 😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢

@jayzhan211
Copy link
Contributor Author

Result if we set skip_partial_aggregation_probe_ratio_threshold to 0.5 and skip_partial_aggregation_probe_rows_threshold to 0 from main.

I think playing around with the threshold number is tricky and we may get overfitting on the benchmark we target on.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃  threshold ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.41ms │     0.42ms │     no change │
│ QQuery 1     │    42.61ms │    46.05ms │  1.08x slower │
│ QQuery 2     │    75.89ms │    75.21ms │     no change │
│ QQuery 3     │    72.24ms │    62.42ms │ +1.16x faster │
│ QQuery 4     │   399.04ms │   385.47ms │     no change │
│ QQuery 5     │   663.20ms │   656.64ms │     no change │
│ QQuery 6     │    37.91ms │    39.05ms │     no change │
│ QQuery 7     │    42.71ms │    47.49ms │  1.11x slower │
│ QQuery 8     │   638.09ms │   600.01ms │ +1.06x faster │
│ QQuery 9     │   678.82ms │   660.36ms │     no change │
│ QQuery 10    │   200.65ms │   208.42ms │     no change │
│ QQuery 11    │   233.85ms │   225.67ms │     no change │
│ QQuery 12    │   718.91ms │   636.78ms │ +1.13x faster │
│ QQuery 13    │   926.86ms │   860.75ms │ +1.08x faster │
│ QQuery 14    │   873.60ms │   682.60ms │ +1.28x faster │
│ QQuery 15    │   498.70ms │   484.49ms │     no change │
│ QQuery 16    │  1305.01ms │  1250.18ms │     no change │
│ QQuery 17    │  1190.26ms │  1164.31ms │     no change │
│ QQuery 18    │  3347.18ms │  2090.57ms │ +1.60x faster │
│ QQuery 19    │    55.77ms │    57.33ms │     no change │
│ QQuery 20    │   943.13ms │   936.20ms │     no change │
│ QQuery 21    │  1206.45ms │  1228.93ms │     no change │
│ QQuery 22    │  3230.71ms │  3093.60ms │     no change │
│ QQuery 23    │  8186.73ms │  7946.95ms │     no change │
│ QQuery 24    │   500.91ms │   479.18ms │     no change │
│ QQuery 25    │   502.13ms │   480.47ms │     no change │
│ QQuery 26    │   561.53ms │   531.43ms │ +1.06x faster │
│ QQuery 27    │  1338.06ms │  1387.29ms │     no change │
│ QQuery 28    │ 10507.08ms │ 10471.02ms │     no change │
│ QQuery 29    │   398.16ms │   407.05ms │     no change │
│ QQuery 30    │   760.72ms │   658.42ms │ +1.16x faster │
│ QQuery 31    │   712.91ms │   669.57ms │ +1.06x faster │
│ QQuery 32    │  3567.36ms │  3453.34ms │     no change │
│ QQuery 33    │  4863.25ms │  3885.84ms │ +1.25x faster │
│ QQuery 34    │  4115.97ms │  3442.65ms │ +1.20x faster │
│ QQuery 35    │   998.40ms │  1005.24ms │     no change │
│ QQuery 36    │   147.99ms │   142.11ms │     no change │
│ QQuery 37    │   101.82ms │   103.37ms │     no change │
│ QQuery 38    │   108.77ms │   107.39ms │     no change │
│ QQuery 39    │   327.29ms │   260.03ms │ +1.26x faster │
│ QQuery 40    │    33.56ms │    33.73ms │     no change │
│ QQuery 41    │    34.09ms │    31.52ms │ +1.08x faster │
│ QQuery 42    │    39.32ms │    39.95ms │     no change │
└──────────────┴────────────┴────────────┴───────────────┘

@jayzhan211
Copy link
Contributor Author

I think the largest challenge of this change is that for low cardinality we have additional hash table computation time, although we don't see the slowdown on the benchmark, but we could still create extreme query that has significant impact on it

@alamb
Copy link
Contributor

alamb commented Oct 5, 2024

Here are the numbers I got on this branch:

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.19ms │         2.18ms │     no change │
│ QQuery 1     │    38.81ms │        38.17ms │     no change │
│ QQuery 2     │    96.62ms │        92.46ms │     no change │
│ QQuery 3     │    97.06ms │        98.65ms │     no change │
│ QQuery 4     │   923.91ms │       724.87ms │ +1.27x faster │
│ QQuery 5     │   974.44ms │      1305.93ms │  1.34x slower │
│ QQuery 6     │    33.67ms │        37.66ms │  1.12x slower │
│ QQuery 7     │    43.21ms │        45.22ms │     no change │
│ QQuery 8     │  1353.51ms │      1215.72ms │ +1.11x faster │
│ QQuery 9     │  1346.26ms │      1430.30ms │  1.06x slower │
│ QQuery 10    │   344.35ms │       332.55ms │     no change │
│ QQuery 11    │   391.74ms │       382.66ms │     no change │
│ QQuery 12    │  1103.72ms │       862.90ms │ +1.28x faster │
│ QQuery 13    │  1735.33ms │      1461.53ms │ +1.19x faster │
│ QQuery 14    │  1240.47ms │       935.56ms │ +1.33x faster │
│ QQuery 15    │  1077.26ms │       949.83ms │ +1.13x faster │
│ QQuery 16    │  2495.53ms │      2180.97ms │ +1.14x faster │
│ QQuery 17    │  2276.83ms │      1947.58ms │ +1.17x faster │
│ QQuery 18    │  5030.06ms │      3684.12ms │ +1.37x faster │
│ QQuery 19    │    91.93ms │        95.77ms │     no change │
│ QQuery 20    │  1699.33ms │      1742.71ms │     no change │
│ QQuery 21    │  1982.68ms │      2040.31ms │     no change │
│ QQuery 22    │  4682.32ms │      5206.59ms │  1.11x slower │
│ QQuery 23    │ 10495.72ms │     10443.17ms │     no change │
│ QQuery 24    │   575.86ms │       587.24ms │     no change │
│ QQuery 25    │   484.52ms │       494.27ms │     no change │
│ QQuery 26    │   648.80ms │       656.98ms │     no change │
│ QQuery 27    │  2561.01ms │      2685.02ms │     no change │
│ QQuery 28    │ 15425.32ms │     18180.15ms │  1.18x slower │
│ QQuery 29    │   519.32ms │       529.97ms │     no change │
│ QQuery 30    │  1052.78ms │       903.01ms │ +1.17x faster │
│ QQuery 31    │  1116.38ms │      1082.78ms │     no change │
│ QQuery 32    │  4284.98ms │      4283.87ms │     no change │
│ QQuery 33    │  5183.56ms │      4947.87ms │     no change │
│ QQuery 34    │  5231.35ms │      5001.83ms │     no change │
│ QQuery 35    │  1916.17ms │      1822.67ms │     no change │
│ QQuery 36    │   260.43ms │       233.69ms │ +1.11x faster │
│ QQuery 37    │   121.10ms │       131.40ms │  1.09x slower │
│ QQuery 38    │   141.15ms │       140.35ms │     no change │
│ QQuery 39    │   757.69ms │       516.94ms │ +1.47x faster │
│ QQuery 40    │    58.24ms │        56.67ms │     no change │
│ QQuery 41    │    47.61ms │        48.35ms │     no change │
│ QQuery 42    │    64.04ms │        71.09ms │  1.11x slower │
└──────────────┴────────────┴────────────────┴───────────────┘
└──────────────┴────────────┴────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)        │ 80007.25ms │
│ Total Time (single-mode-v4)   │ 79631.55ms │
│ Average Time (main_base)      │  1860.63ms │
│ Average Time (single-mode-v4) │  1851.90ms │
│ Queries Faster                │         12 │
│ Queries Slower                │          7 │
│ Queries with No Change        │         24 │
└───────────────────────────────┴────────────┘
Note: Skipping /home/alamb/arrow-datafusion/benchmarks/results/main_base/tpch_mem_sf1.json as /home/alamb/arrow-\
datafusion/benchmarks/results/single-mode-v4/tpch_mem_sf1.json does not exist
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ single-mode-v4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  202.68ms │       186.09ms │ +1.09x faster │
│ QQuery 2     │  123.07ms │       104.42ms │ +1.18x faster │
│ QQuery 3     │  130.29ms │       112.85ms │ +1.15x faster │
│ QQuery 4     │   85.75ms │        86.53ms │     no change │
│ QQuery 5     │  156.10ms │       158.66ms │     no change │
│ QQuery 6     │   40.63ms │        40.36ms │     no change │
│ QQuery 7     │  192.20ms │       193.36ms │     no change │
│ QQuery 8     │  155.99ms │       158.01ms │     no change │
│ QQuery 9     │  246.56ms │       228.72ms │ +1.08x faster │
│ QQuery 10    │  215.62ms │       219.22ms │     no change │
│ QQuery 11    │   91.33ms │        91.33ms │     no change │
│ QQuery 12    │  129.13ms │       128.18ms │     no change │
│ QQuery 13    │  216.71ms │       211.03ms │     no change │
│ QQuery 14    │   72.68ms │        86.45ms │  1.19x slower │
│ QQuery 15    │  110.85ms │        98.26ms │ +1.13x faster │
│ QQuery 16    │   78.57ms │        64.20ms │ +1.22x faster │
│ QQuery 17    │  219.63ms │       147.18ms │ +1.49x faster │
│ QQuery 18    │  312.29ms │       288.30ms │ +1.08x faster │
│ QQuery 19    │  137.91ms │       139.33ms │     no change │
│ QQuery 20    │  129.86ms │       112.25ms │ +1.16x faster │
│ QQuery 21    │  262.13ms │       266.61ms │     no change │
│ QQuery 22    │   61.46ms │        64.42ms │     no change │
└──────────────┴───────────┴────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary             ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)        │ 3371.45ms │
│ Total Time (single-mode-v4)   │ 3185.76ms │
│ Average Time (main_base)      │  153.25ms │
│ Average Time (single-mode-v4) │  144.81ms │
│ Queries Faster                │         9 │
│ Queries Slower                │         1 │
│ Queries with No Change        │        12 │
└───────────────────────────────┴───────────┘


/// Number of input rows partial aggregation partition should process, before
/// aggregation ratio check and trying to switch to skipping aggregation mode
pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this setting been deleted? Checking if aggregation should be skipped right in the beginning of the execution may lead to skipping decision made based on insufficient amount of data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that in my strategy I make the decision per batch since I assume the data is distributed evenly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config with 0.1 and 100_000

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃  threshold ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.41ms │     0.45ms │  1.10x slower │
│ QQuery 1     │    42.61ms │    42.55ms │     no change │
│ QQuery 2     │    75.89ms │    74.88ms │     no change │
│ QQuery 3     │    72.24ms │    61.81ms │ +1.17x faster │
│ QQuery 4     │   399.04ms │   359.82ms │ +1.11x faster │
│ QQuery 5     │   663.20ms │   699.03ms │  1.05x slower │
│ QQuery 6     │    37.91ms │    38.01ms │     no change │
│ QQuery 7     │    42.71ms │    40.05ms │ +1.07x faster │
│ QQuery 8     │   638.09ms │   699.07ms │  1.10x slower │
│ QQuery 9     │   678.82ms │   742.56ms │  1.09x slower │
│ QQuery 10    │   200.65ms │   209.83ms │     no change │
│ QQuery 11    │   233.85ms │   220.56ms │ +1.06x faster │
│ QQuery 12    │   718.91ms │   660.93ms │ +1.09x faster │
│ QQuery 13    │   926.86ms │   937.19ms │     no change │
│ QQuery 14    │   873.60ms │   753.51ms │ +1.16x faster │
│ QQuery 15    │   498.70ms │   483.92ms │     no change │
│ QQuery 16    │  1305.01ms │  1238.38ms │ +1.05x faster │
│ QQuery 17    │  1190.26ms │  1111.67ms │ +1.07x faster │
│ QQuery 18    │  3347.18ms │  2903.17ms │ +1.15x faster │
│ QQuery 19    │    55.77ms │    61.39ms │  1.10x slower │
│ QQuery 20    │   943.13ms │   919.37ms │     no change │
│ QQuery 21    │  1206.45ms │  1230.95ms │     no change │
│ QQuery 22    │  3230.71ms │  3479.20ms │  1.08x slower │
│ QQuery 23    │  8186.73ms │  7989.35ms │     no change │
│ QQuery 24    │   500.91ms │   507.13ms │     no change │
│ QQuery 25    │   502.13ms │   551.57ms │  1.10x slower │
│ QQuery 26    │   561.53ms │   568.74ms │     no change │
│ QQuery 27    │  1338.06ms │  1382.16ms │     no change │
│ QQuery 28    │ 10507.08ms │ 11253.66ms │  1.07x slower │
│ QQuery 29    │   398.16ms │   429.90ms │  1.08x slower │
│ QQuery 30    │   760.72ms │   664.19ms │ +1.15x faster │
│ QQuery 31    │   712.91ms │   772.65ms │  1.08x slower │
│ QQuery 32    │  3567.36ms │  4167.37ms │  1.17x slower │
│ QQuery 33    │  4863.25ms │  4311.27ms │ +1.13x faster │
│ QQuery 34    │  4115.97ms │  3564.32ms │ +1.15x faster │
│ QQuery 35    │   998.40ms │   979.18ms │     no change │
│ QQuery 36    │   147.99ms │   136.97ms │ +1.08x faster │
│ QQuery 37    │   101.82ms │   102.79ms │     no change │
│ QQuery 38    │   108.77ms │   107.66ms │     no change │
│ QQuery 39    │   327.29ms │   272.91ms │ +1.20x faster │
│ QQuery 40    │    33.56ms │    34.88ms │     no change │
│ QQuery 41    │    34.09ms │    32.18ms │ +1.06x faster │
│ QQuery 42    │    39.32ms │    39.75ms │     no change │
└──────────────┴────────────┴────────────┴───────────────┘

config with 0.1 and 0

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃  threshold ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.41ms │     0.49ms │  1.21x slower │
│ QQuery 1     │    42.61ms │    42.34ms │     no change │
│ QQuery 2     │    75.89ms │    81.91ms │  1.08x slower │
│ QQuery 3     │    72.24ms │    69.94ms │     no change │
│ QQuery 4     │   399.04ms │   365.30ms │ +1.09x faster │
│ QQuery 5     │   663.20ms │   656.21ms │     no change │
│ QQuery 6     │    37.91ms │    38.24ms │     no change │
│ QQuery 7     │    42.71ms │    40.27ms │ +1.06x faster │
│ QQuery 8     │   638.09ms │   590.64ms │ +1.08x faster │
│ QQuery 9     │   678.82ms │   666.57ms │     no change │
│ QQuery 10    │   200.65ms │   196.94ms │     no change │
│ QQuery 11    │   233.85ms │   220.90ms │ +1.06x faster │
│ QQuery 12    │   718.91ms │   652.58ms │ +1.10x faster │
│ QQuery 13    │   926.86ms │   941.22ms │     no change │
│ QQuery 14    │   873.60ms │   749.50ms │ +1.17x faster │
│ QQuery 15    │   498.70ms │   495.09ms │     no change │
│ QQuery 16    │  1305.01ms │  1247.99ms │     no change │
│ QQuery 17    │  1190.26ms │  1158.71ms │     no change │
│ QQuery 18    │  3347.18ms │  2008.84ms │ +1.67x faster │
│ QQuery 19    │    55.77ms │    54.94ms │     no change │
│ QQuery 20    │   943.13ms │   913.04ms │     no change │
│ QQuery 21    │  1206.45ms │  1183.49ms │     no change │
│ QQuery 22    │  3230.71ms │  3181.47ms │     no change │
│ QQuery 23    │  8186.73ms │  7900.35ms │     no change │
│ QQuery 24    │   500.91ms │   501.30ms │     no change │
│ QQuery 25    │   502.13ms │   481.08ms │     no change │
│ QQuery 26    │   561.53ms │   571.92ms │     no change │
│ QQuery 27    │  1338.06ms │  1451.63ms │  1.08x slower │
│ QQuery 28    │ 10507.08ms │ 11396.72ms │  1.08x slower │
│ QQuery 29    │   398.16ms │   392.46ms │     no change │
│ QQuery 30    │   760.72ms │   658.78ms │ +1.15x faster │
│ QQuery 31    │   712.91ms │   702.45ms │     no change │
│ QQuery 32    │  3567.36ms │  3520.95ms │     no change │
│ QQuery 33    │  4863.25ms │  3527.36ms │ +1.38x faster │
│ QQuery 34    │  4115.97ms │  3731.86ms │ +1.10x faster │
│ QQuery 35    │   998.40ms │   964.65ms │     no change │
│ QQuery 36    │   147.99ms │   136.33ms │ +1.09x faster │
│ QQuery 37    │   101.82ms │   102.70ms │     no change │
│ QQuery 38    │   108.77ms │   105.97ms │     no change │
│ QQuery 39    │   327.29ms │   267.66ms │ +1.22x faster │
│ QQuery 40    │    33.56ms │    34.77ms │     no change │
│ QQuery 41    │    34.09ms │    31.76ms │ +1.07x faster │
│ QQuery 42    │    39.32ms │    40.56ms │     no change │
└──────────────┴────────────┴────────────┴───────────────┘


/// Record the number of rows that were output directly without aggregation
fn record_skipped(&mut self, batch: &RecordBatch) {
self.skipped_aggregation_rows.add(batch.num_rows());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this metric is not required -- it still may be helpful for debugging purposes or just providing information about how aggregation has been executed to end user (via explain analyze)

/// Number of input rows partial aggregation partition should process, before
/// aggregation ratio check and trying to switch to skipping aggregation mode
pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000
pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting why 0.1 value makes things faster 🤔

Is aggregation for simple cases (e.g. single integer) slower than repartitioning x10 rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is aggregation for simple cases (e.g. single integer) slower than repartitioning x10 rows?

Given the benchmark result, I think so. Especially if we have millions of distinct integer.

/// without aggregation
skipped_aggregation_rows: metrics::Count,
/// Number of unique hash, which represents the cardinality of the group values
unique_hashes_count: HashSet<u64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the main case when the group_values.len() is not enough and actual hashes needed?

And if they are really required, this HashSet should be accounted in operators memory reservation.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the main case when the group_values.len() is not enough and actual hashes needed?

Even if the existing group is low, we still need to accumulate previous hashes, so we can calculate later on when the group is high enough.

unique_hashes_count.len() has the role similar to group_values.len()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation execution Related to the execution crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants