Skip partial aggregation based on the cardinality of hash value instead of group values #12697
Conversation
```rust
if group_values.is_empty() {
    return internal_err!("group_values expected to have at least one element");
}
let mut output = group_values.swap_remove(0);
```
The result further improves after removing the clone 😆, sounds like cheating.
🤔 As I understand, it seems the main difference with the original is the threshold. Is it possible that the improvement comes from triggering the skip earlier? I guess maybe we can modify the params in the original probe instead. This PR checks:
```rust
let ratio = self.unique_hashes_count.len() as f32
    / self.accumulated_batch_size as f32;
// TODO: Configure the threshold
self.skip_partial_with_hash_count = ratio > 0.1;
```
while the original probe in main checks:

```rust
// `probe_rows_threshold` defaults to 100_000
// `probe_ratio_threshold` defaults to 0.8
if self.input_rows >= self.probe_rows_threshold {
    self.should_skip = self.num_groups as f64 / self.input_rows as f64
        >= self.probe_ratio_threshold;
    self.is_locked = true;
}
```

Actually, I think if …
If we only compute the hash, we are able to skip early. Otherwise, the original approach can only skip after the group values are calculated. The threshold might be one of the reasons, but skipping based on the hash is even better. I think partial aggregation is only helpful for low-cardinality queries, where we can reduce the output rows at the first stage so the repartition cost is low. In a distributed context, I guess we are still able to scale it with repartition + final?
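To make the per-batch, hash-based skip concrete, here is a minimal self-contained sketch of the idea (plain std rather than the actual DataFusion types; `HashCardinalityProbe` and its fields are illustrative names, and the 0.1 ratio mirrors the threshold discussed above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Tracks unique group-key hashes across batches and decides whether
/// partial aggregation should be skipped for further input.
struct HashCardinalityProbe {
    unique_hashes: HashSet<u64>,
    accumulated_rows: usize,
    ratio_threshold: f32,
}

impl HashCardinalityProbe {
    /// Hash the group keys of one batch *before* building group values.
    /// Returns true once the observed cardinality is high enough to skip.
    fn observe_batch<T: Hash>(&mut self, group_keys: &[T]) -> bool {
        for key in group_keys {
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            self.unique_hashes.insert(hasher.finish());
        }
        self.accumulated_rows += group_keys.len();
        let ratio = self.unique_hashes.len() as f32 / self.accumulated_rows as f32;
        ratio > self.ratio_threshold
    }
}

fn main() {
    let mut probe = HashCardinalityProbe {
        unique_hashes: HashSet::new(),
        accumulated_rows: 0,
        ratio_threshold: 0.1,
    };
    // A high-cardinality batch: nearly every key is distinct, so the
    // probe signals "skip partial aggregation" right away.
    let batch: Vec<u64> = (0..8192).collect();
    assert!(probe.observe_batch(&batch));
}
```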
I am a bit confused about this; it seems only the calculation of the last batch before skipping can be avoided? It may run like this. Maybe I misunderstand, I am not so sure?
But it introduces the cost of the additional `HashSet` computation.
I think in the partial group-by stage, each batch is evaluated independently, so only the batches with low cardinality are not skipped; the others are skipped. For the high-cardinality case, we skip almost all the batches and move on to repartitioning quickly.

For the low-cardinality case, the cost of repartition + final is almost negligible; partial aggregation is the only part that takes time.
Yes, I am confused about that:
So it seems it is actually the threshold that matters. I modified the threshold in main for a quick test, and found a similar improvement locally:

```rust
/// Aggregation ratio (number of distinct groups / number of input rows)
/// threshold for skipping partial aggregation. If the value is greater
/// then partial aggregation will skip aggregation for further input
pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.1

/// Number of input rows partial aggregation partition should process, before
/// aggregation ratio check and trying to switch to skipping aggregation mode
pub skip_partial_aggregation_probe_rows_threshold: usize, default = 0
```
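For anyone who wants to reproduce the quick test, these options can also be set programmatically. A minimal sketch, assuming DataFusion's `SessionConfig::options_mut()` and `SessionContext::new_with_config()` (the field names follow the config shown above):

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn session_with_aggressive_skip() -> SessionContext {
    let mut config = SessionConfig::new();
    // Mirror the quick test above: ratio 0.1, probe from the first rows.
    config.options_mut().execution.skip_partial_aggregation_probe_ratio_threshold = 0.1;
    config.options_mut().execution.skip_partial_aggregation_probe_rows_threshold = 0;
    SessionContext::new_with_config(config)
}
```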
Not only does the threshold matter but also the group-values computation, so I'm not surprised that changing the threshold speeds up the code.
Planning to review it during the next couple of days.
Running some benchmarks.
I haven't fully reviewed this PR yet. One thing I wonder is if the benefits in performance could potentially be had by adjusting the default aggregation ratio. Like, for example, what if we changed the default ratio?
I found that maybe we can get similar benefits through the following (details in #12765):
These are the numbers on my local machine:
I think the threshold here is the main reason for the performance improvement, but I think checking with the hash value helps, although the benchmark doesn't show that.
A query like …
```rust
if self.skip_partial_aggregation {
    let states = self.transform_to_states(batch)?;
    self.exec_state = ExecutionState::ProducingOutput(states);
    // ...
}
```
As I understand, some batches may have been updated into intermediate states (`GroupValues` and `GroupAccumulator`s). Maybe we should emit them first, like what is done in `switch_to_skip_aggregation`?

I guess the triggering process may be like:
### batch 1
- not reach the skip threshold
- update into `GroupValues` and `GroupAccumulator`s
### batch 2
- not reach the skip threshold
- update into `GroupValues` and `GroupAccumulator`s
...
### batch n
- finally reach the skip threshold
- we need to `emit` things in `GroupValues` and `GroupAccumulator`s
- then we convert to the skip mode
🤔 And it seems that just the last batch's (batch n in the example) updating can be avoided, compared to the original partial-skipping logic. And the cost of updating one batch may not be so high?
> And it seems that just the last batch's (batch n in the example) updating can be avoided, compared to the original partial-skipping logic. And the cost of updating one batch may not be so high?

Why is only the last batch not computed? If batch m reaches the skip threshold, I think m to n are all skipped, therefore n - m computations are saved.
> And it seems that just the last batch's (batch n in the example) updating can be avoided, compared to the original partial-skipping logic. And the cost of updating one batch may not be so high?
>
> Why is only the last batch not computed? If batch m reaches the skip threshold, I think m to n are all skipped, therefore n - m computations are saved.
Oh, I think maybe I got it? Do you mean the loop here?

```rust
fn group_aggregate_batch_with_skipping_partial() {
    for (index, group_values) in group_by_values.iter().enumerate() {
        // ...
        // reach the threshold, and we break the loop and skip earlier
        // ...
    }
}
```
Not really; for the simple group-by case, `group_by_values` mostly has a single element. I think only grouping-sets queries have more than one.
If any batch reaches the threshold limit, we will switch to `ProducingOutput`:

```rust
if self.skip_partial_aggregation {
    let states = self.transform_to_states(batch)?;
    self.exec_state = ExecutionState::ProducingOutput(states);
    // make sure the exec_state just set is not overwritten below
    break 'reading_input;
}
```

I think the skipping logic is no different from the main branch; the only difference is that we compute the hash and skip based on the number of unique values in the hash table.
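To make that control flow easier to follow, here is a simplified, self-contained sketch of the loop (illustrative only, not the actual DataFusion source; a `String` stands in for the `RecordBatch` of partial states):

```rust
enum ExecutionState {
    ReadingInput,
    ProducingOutput(String), // stand-in for a RecordBatch of partial states
}

fn main() {
    let batches = ["batch 0", "batch 1", "batch 2", "batch 3"];
    let mut exec_state = ExecutionState::ReadingInput;

    'reading_input: for (i, batch) in batches.iter().enumerate() {
        // Pretend the per-batch probe trips on the second batch.
        let skip_partial_aggregation = i >= 1;
        if skip_partial_aggregation {
            // transform_to_states(batch) stand-in: pass the batch through.
            exec_state = ExecutionState::ProducingOutput(batch.to_string());
            // Make sure the exec_state just set is not overwritten below.
            break 'reading_input;
        }
        // ...otherwise update GroupValues / GroupAccumulators here...
    }

    if let ExecutionState::ProducingOutput(states) = exec_state {
        println!("switched to skip mode at: {states}");
    }
}
```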
> Not really; for the simple group-by case, `group_by_values` mostly has a single element. I think only grouping-sets queries have more than one.
🤔 Yes, I am just a bit confused about saving `n - m` computations in #12697 (comment), if we are talking about the single group-by case. Assume the threshold is first reached when `batch n` comes:

- In main, we will first update `batch n` into `GroupValues` and `GroupAccumulator`s, and only find the threshold reached afterwards, so we skip `batch n+1 ~ batch last`:
```rust
// `batch n` comes, update it into `GroupValues` and `GroupAccumulator`s.
extract_ok!(self.group_aggregate_batch(batch));
// Then we find the threshold reached here, after updating.
self.update_skip_aggregation_probe(input_rows);
// Switch to skip mode, and we skip `batch n+1 ~ batch last`.
extract_ok!(self.switch_to_skip_aggregation());
```
- And currently, we can know the threshold is reached before updating `batch n` into `GroupValues` and `GroupAccumulator`s, so we can skip `batch n ~ batch last`:
```rust
fn group_aggregate_batch_with_skipping_partial(
    &mut self,
    batch: &RecordBatch,
) -> Result<()> {
    for (index, group_values) in group_by_values.iter().enumerate() {
        // We calculate hashes and check the skip threshold here.
        // When `batch n` comes, we find the threshold reached, so we
        // just return and do not update `batch n`. It will switch to
        // skip mode for `batch n ~ batch last` after the return.
    }
    // We will still update `batch 0 ~ batch n-1`, because the threshold
    // is only reached when `batch n` comes.
    for (index, group_values) in group_by_values.iter().enumerate() {
        // Update `batch` into `GroupValues` and `GroupAccumulator`s.
    }
    Ok(())
}
```
So it seems just the updating of `batch n` is avoided?

😄 Sorry for maybe asking too many questions.
Oh, I see. That explains why the improvement is not as high as expected 🤔

I think the attempt for this change failed 😢
😢
Result if we set …

I think playing around with the threshold number is tricky and we may end up overfitting on the benchmark we target.
I think the largest challenge of this change is that for low cardinality we pay additional hash-table computation time. Although we don't see the slowdown in the benchmark, we could still construct an extreme query where it has a significant impact.
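A tiny self-contained illustration of that concern (plain std, purely illustrative): with only a handful of distinct keys the skip never triggers, yet every row still pays a hash plus a `HashSet` insert.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

fn main() {
    // One million rows but only 4 distinct group keys: the ratio stays
    // near zero, so partial aggregation is never skipped, and the
    // per-row hashing was pure overhead.
    let mut unique_hashes = HashSet::new();
    let mut rows = 0usize;
    for i in 0..1_000_000u64 {
        let key = i % 4;
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        unique_hashes.insert(hasher.finish());
        rows += 1;
    }
    let ratio = unique_hashes.len() as f32 / rows as f32;
    assert!(ratio < 0.1); // never reaches the skip threshold
}
```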
Here are the numbers I got on this branch:
```rust
/// Number of input rows partial aggregation partition should process, before
/// aggregation ratio check and trying to switch to skipping aggregation mode
pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000
```
Why has this setting been deleted? Checking whether aggregation should be skipped right at the beginning of the execution may lead to a skipping decision based on an insufficient amount of data.
The reason is that in my strategy I make the decision per batch, since I assume the data is distributed evenly.
Config with ratio threshold 0.1 and rows threshold 100_000:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ threshold ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.41ms │ 0.45ms │ 1.10x slower │
│ QQuery 1 │ 42.61ms │ 42.55ms │ no change │
│ QQuery 2 │ 75.89ms │ 74.88ms │ no change │
│ QQuery 3 │ 72.24ms │ 61.81ms │ +1.17x faster │
│ QQuery 4 │ 399.04ms │ 359.82ms │ +1.11x faster │
│ QQuery 5 │ 663.20ms │ 699.03ms │ 1.05x slower │
│ QQuery 6 │ 37.91ms │ 38.01ms │ no change │
│ QQuery 7 │ 42.71ms │ 40.05ms │ +1.07x faster │
│ QQuery 8 │ 638.09ms │ 699.07ms │ 1.10x slower │
│ QQuery 9 │ 678.82ms │ 742.56ms │ 1.09x slower │
│ QQuery 10 │ 200.65ms │ 209.83ms │ no change │
│ QQuery 11 │ 233.85ms │ 220.56ms │ +1.06x faster │
│ QQuery 12 │ 718.91ms │ 660.93ms │ +1.09x faster │
│ QQuery 13 │ 926.86ms │ 937.19ms │ no change │
│ QQuery 14 │ 873.60ms │ 753.51ms │ +1.16x faster │
│ QQuery 15 │ 498.70ms │ 483.92ms │ no change │
│ QQuery 16 │ 1305.01ms │ 1238.38ms │ +1.05x faster │
│ QQuery 17 │ 1190.26ms │ 1111.67ms │ +1.07x faster │
│ QQuery 18 │ 3347.18ms │ 2903.17ms │ +1.15x faster │
│ QQuery 19 │ 55.77ms │ 61.39ms │ 1.10x slower │
│ QQuery 20 │ 943.13ms │ 919.37ms │ no change │
│ QQuery 21 │ 1206.45ms │ 1230.95ms │ no change │
│ QQuery 22 │ 3230.71ms │ 3479.20ms │ 1.08x slower │
│ QQuery 23 │ 8186.73ms │ 7989.35ms │ no change │
│ QQuery 24 │ 500.91ms │ 507.13ms │ no change │
│ QQuery 25 │ 502.13ms │ 551.57ms │ 1.10x slower │
│ QQuery 26 │ 561.53ms │ 568.74ms │ no change │
│ QQuery 27 │ 1338.06ms │ 1382.16ms │ no change │
│ QQuery 28 │ 10507.08ms │ 11253.66ms │ 1.07x slower │
│ QQuery 29 │ 398.16ms │ 429.90ms │ 1.08x slower │
│ QQuery 30 │ 760.72ms │ 664.19ms │ +1.15x faster │
│ QQuery 31 │ 712.91ms │ 772.65ms │ 1.08x slower │
│ QQuery 32 │ 3567.36ms │ 4167.37ms │ 1.17x slower │
│ QQuery 33 │ 4863.25ms │ 4311.27ms │ +1.13x faster │
│ QQuery 34 │ 4115.97ms │ 3564.32ms │ +1.15x faster │
│ QQuery 35 │ 998.40ms │ 979.18ms │ no change │
│ QQuery 36 │ 147.99ms │ 136.97ms │ +1.08x faster │
│ QQuery 37 │ 101.82ms │ 102.79ms │ no change │
│ QQuery 38 │ 108.77ms │ 107.66ms │ no change │
│ QQuery 39 │ 327.29ms │ 272.91ms │ +1.20x faster │
│ QQuery 40 │ 33.56ms │ 34.88ms │ no change │
│ QQuery 41 │ 34.09ms │ 32.18ms │ +1.06x faster │
│ QQuery 42 │ 39.32ms │ 39.75ms │ no change │
└──────────────┴────────────┴────────────┴───────────────┘
Config with ratio threshold 0.1 and rows threshold 0:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ threshold ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.41ms │ 0.49ms │ 1.21x slower │
│ QQuery 1 │ 42.61ms │ 42.34ms │ no change │
│ QQuery 2 │ 75.89ms │ 81.91ms │ 1.08x slower │
│ QQuery 3 │ 72.24ms │ 69.94ms │ no change │
│ QQuery 4 │ 399.04ms │ 365.30ms │ +1.09x faster │
│ QQuery 5 │ 663.20ms │ 656.21ms │ no change │
│ QQuery 6 │ 37.91ms │ 38.24ms │ no change │
│ QQuery 7 │ 42.71ms │ 40.27ms │ +1.06x faster │
│ QQuery 8 │ 638.09ms │ 590.64ms │ +1.08x faster │
│ QQuery 9 │ 678.82ms │ 666.57ms │ no change │
│ QQuery 10 │ 200.65ms │ 196.94ms │ no change │
│ QQuery 11 │ 233.85ms │ 220.90ms │ +1.06x faster │
│ QQuery 12 │ 718.91ms │ 652.58ms │ +1.10x faster │
│ QQuery 13 │ 926.86ms │ 941.22ms │ no change │
│ QQuery 14 │ 873.60ms │ 749.50ms │ +1.17x faster │
│ QQuery 15 │ 498.70ms │ 495.09ms │ no change │
│ QQuery 16 │ 1305.01ms │ 1247.99ms │ no change │
│ QQuery 17 │ 1190.26ms │ 1158.71ms │ no change │
│ QQuery 18 │ 3347.18ms │ 2008.84ms │ +1.67x faster │
│ QQuery 19 │ 55.77ms │ 54.94ms │ no change │
│ QQuery 20 │ 943.13ms │ 913.04ms │ no change │
│ QQuery 21 │ 1206.45ms │ 1183.49ms │ no change │
│ QQuery 22 │ 3230.71ms │ 3181.47ms │ no change │
│ QQuery 23 │ 8186.73ms │ 7900.35ms │ no change │
│ QQuery 24 │ 500.91ms │ 501.30ms │ no change │
│ QQuery 25 │ 502.13ms │ 481.08ms │ no change │
│ QQuery 26 │ 561.53ms │ 571.92ms │ no change │
│ QQuery 27 │ 1338.06ms │ 1451.63ms │ 1.08x slower │
│ QQuery 28 │ 10507.08ms │ 11396.72ms │ 1.08x slower │
│ QQuery 29 │ 398.16ms │ 392.46ms │ no change │
│ QQuery 30 │ 760.72ms │ 658.78ms │ +1.15x faster │
│ QQuery 31 │ 712.91ms │ 702.45ms │ no change │
│ QQuery 32 │ 3567.36ms │ 3520.95ms │ no change │
│ QQuery 33 │ 4863.25ms │ 3527.36ms │ +1.38x faster │
│ QQuery 34 │ 4115.97ms │ 3731.86ms │ +1.10x faster │
│ QQuery 35 │ 998.40ms │ 964.65ms │ no change │
│ QQuery 36 │ 147.99ms │ 136.33ms │ +1.09x faster │
│ QQuery 37 │ 101.82ms │ 102.70ms │ no change │
│ QQuery 38 │ 108.77ms │ 105.97ms │ no change │
│ QQuery 39 │ 327.29ms │ 267.66ms │ +1.22x faster │
│ QQuery 40 │ 33.56ms │ 34.77ms │ no change │
│ QQuery 41 │ 34.09ms │ 31.76ms │ +1.07x faster │
│ QQuery 42 │ 39.32ms │ 40.56ms │ no change │
└──────────────┴────────────┴────────────┴───────────────┘
```rust
/// Record the number of rows that were output directly without aggregation
fn record_skipped(&mut self, batch: &RecordBatch) {
    self.skipped_aggregation_rows.add(batch.num_rows());
}
```
Not sure this metric is really not required -- it may still be helpful for debugging purposes, or for providing information to the end user about how the aggregation was executed (via `explain analyze`).
```rust
/// Number of input rows partial aggregation partition should process, before
/// aggregation ratio check and trying to switch to skipping aggregation mode
pub skip_partial_aggregation_probe_rows_threshold: usize, default = 100_000
pub skip_partial_aggregation_probe_ratio_threshold: f64, default = 0.1
```
That's interesting; why does the 0.1 value make things faster 🤔

Is aggregation for simple cases (e.g. a single integer) slower than repartitioning 10x the rows?
> Is aggregation for simple cases (e.g. a single integer) slower than repartitioning 10x the rows?

Given the benchmark result, I think so, especially if we have millions of distinct integers.
```rust
/// without aggregation
skipped_aggregation_rows: metrics::Count,
/// Number of unique hashes, which represents the cardinality of the group values
unique_hashes_count: HashSet<u64>,
```
What's the main case where `group_values.len()` is not enough and actual hashes are needed?

And if they are really required, this `HashSet` should be accounted for in the operator's memory reservation.
> What's the main case where `group_values.len()` is not enough and actual hashes are needed?

Even if the number of existing groups is low, we still need to accumulate the previous hashes so we can calculate the ratio later on, when the group count is high enough. `unique_hashes_count.len()` has a role similar to `group_values.len()`.
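A small illustration of why the hashes are accumulated across batches instead of being reset per batch (illustrative values; the keys are used directly in place of their hashes):

```rust
use std::collections::HashSet;

fn main() {
    // Batch 1 alone looks low-cardinality (2 groups / 4 rows), but the
    // accumulated view reveals the stream is actually high-cardinality.
    let batch1: Vec<u64> = vec![1, 1, 2, 2];
    let batch2: Vec<u64> = (100..200).collect(); // 100 new distinct keys

    let mut unique_hashes: HashSet<u64> = HashSet::new();
    let mut rows = 0;
    for batch in [&batch1[..], &batch2[..]] {
        unique_hashes.extend(batch.iter().copied());
        rows += batch.len();
    }
    let ratio = unique_hashes.len() as f64 / rows as f64;
    assert!(ratio > 0.8); // 102 unique keys over 104 rows
}
```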
Which issue does this PR close?
Closes #.
Rationale for this change
The current skip-aggregation logic calculates the ratio between the number of group values and the total batch size. However, what we actually need is the ratio of unique group values to the total number of group values.

To achieve this, we don't need to append group values to the `GroupValueBuilder`. Instead, we can simply check the hash values.

Benchmarking also shows an improvement for high-cardinality queries. Although low-cardinality cases, like TPCH Q1, slow down due to the additional `HashSet` computation, I believe the tradeoff is worthwhile.
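To make the rationale concrete, here is a hedged sketch of estimating a batch's unique-group ratio directly from hash values. It assumes DataFusion's `datafusion_common::hash_utils::create_hashes` helper and `ahash::RandomState`; it is a sketch of the idea, not the exact code in this PR:

```rust
use std::collections::HashSet;
use std::sync::Arc;

use ahash::RandomState;
use arrow::array::{Array, ArrayRef, Int64Array};
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::Result;

/// Estimate the unique-group ratio of one batch's group columns from raw
/// hash values, without appending anything to a group-values builder.
fn unique_ratio(group_columns: &[ArrayRef], random_state: &RandomState) -> Result<f64> {
    let num_rows = group_columns[0].len();
    let mut hashes_buffer = vec![0u64; num_rows];
    create_hashes(group_columns, random_state, &mut hashes_buffer)?;
    let unique: HashSet<u64> = hashes_buffer.iter().copied().collect();
    Ok(unique.len() as f64 / num_rows as f64)
}

fn main() -> Result<()> {
    // 1024 distinct integers: the ratio is ~1.0, a candidate for skipping.
    let col: ArrayRef = Arc::new(Int64Array::from((0..1024).collect::<Vec<i64>>()));
    let ratio = unique_ratio(&[col], &RandomState::with_seeds(0, 0, 0, 0))?;
    assert!(ratio > 0.99);
    Ok(())
}
```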
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
TODO
Benchmark