Provide groupByKey shortcuts for groupBy.as #213

EnricoMi · 2023-12-08T15:13:40Z

This PR adds column-based groupByKey methods as shortcuts for groupBy(...).as[...].

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

When the dataset is already partitioned and ordered by the grouping columns, Dataset.groupByKey(...) will repartition and order the entire dataset again.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key, while ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column id.
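The difference can be illustrated with a minimal sketch. It assumes Spark 3.0 or later (where RelationalGroupedDataset.as[K, T] is available), a local SparkSession, and a hypothetical case class Val that is not part of this PR. Both variants return the same counts; only the query plans printed by explain() differ.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical value type used only for this illustration.
case class Val(id: Int, value: String)

object GroupByAsSketch {
  // The key is computed by an opaque lambda: Catalyst cannot see that it is column `id`.
  def countsByKey(ds: Dataset[Val]): Dataset[(Int, Int)] = {
    import ds.sparkSession.implicits._
    ds.groupByKey(_.id).mapGroups((id, rows) => (id, rows.size))
  }

  // The key is declared as column `id`: Catalyst knows the grouping column.
  def countsByColumn(ds: Dataset[Val]): Dataset[(Int, Int)] = {
    import ds.sparkSession.implicits._
    ds.groupBy($"id").as[Int, Val].mapGroups((id, rows) => (id, rows.size))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("groupBy.as sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Val(1, "a"), Val(1, "b"), Val(2, "c")).toDS()

    // Print both physical plans to compare how the grouping key is expressed.
    countsByKey(ds).explain()
    countsByColumn(ds).explain()

    spark.stop()
  }
}
```

Running the same comparison on a dataset that was first repartitioned and sorted by id is one way to check whether the column-based variant avoids the extra shuffle and sort.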

The new column-based groupByKey methods make column-based grouping easier to discover: a user browsing the Dataset API finds a groupByKey that accepts Column arguments. The existing groupBy method returns a RelationalGroupedDataset, whose as[K, V] method provides the same semantics but is difficult to find.

Furthermore, the new column-based groupByKey methods do not require the user to specify the value type V of the original Dataset[V], because groupByKey has access to the Dataset's type and encoder:

ds.groupBy($"id").as[Int, V]

vs.

ds.groupByKey[Int]($"id")
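One way such a shortcut could be written is as an implicit extension of Dataset[V] that delegates to groupBy(...).as[K, V] and passes the Dataset's own encoder for V. This is a sketch under assumptions, not necessarily the implementation in this PR: the names GroupByKeySketch, ColumnGroupByKey and Item are hypothetical, and it relies on Dataset.encoder, which Spark marks as a developer API.

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder, KeyValueGroupedDataset, SparkSession}

// Hypothetical value type used only for this illustration.
case class Item(id: Int, value: String)

object GroupByKeySketch {
  // Hypothetical extension class; name and signature are assumptions.
  implicit class ColumnGroupByKey[V](ds: Dataset[V]) {
    // Only the key type K must be given; the encoder for V comes from the Dataset.
    def groupByKey[K: Encoder](column: Column, columns: Column*): KeyValueGroupedDataset[K, V] =
      ds.groupBy(column +: columns: _*).as[K, V](implicitly[Encoder[K]], ds.encoder)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("groupByKey shortcut sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Item(1, "a"), Item(1, "b"), Item(2, "c")).toDS()

    // The value type Item is inferred from ds; only the key type Int is spelled out.
    val grouped: KeyValueGroupedDataset[Int, Item] = ds.groupByKey[Int]($"id")
    grouped.count().show()

    spark.stop()
  }
}
```

Because the extension method takes Column arguments, it does not clash with the built-in function-based groupByKey; the compiler falls back to the implicit class when the built-in overloads do not apply.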

github-actions · 2023-12-08T16:12:41Z

Test Results

566 files ±0, 566 suites ±0, 1h 29m 12s ⏱️ +11s
536 tests +2: 536 ✔️ +2, 0 💤 ±0, 0 ❌ ±0
16 828 runs +72: 16 826 ✔️ +72, 2 💤 ±0, 0 ❌ ±0

Results for commit 411afe8. ± Comparison against base commit 8314de4.

This pull request removes 28 and adds 30 tests. Note that renamed tests count towards both.

uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

♻️ This comment has been updated with latest results.

Provide groupByKey shortcuts for groupBy.as

1668470

EnricoMi force-pushed the groupbykey branch from 41c5c69 to 1668470 on December 8, 2023 18:44

EnricoMi added 2 commits December 9, 2023 17:16

Add to README.md

d9df9fc

Improve wording in README.md

411afe8

EnricoMi merged commit 119c854 into master Dec 9, 2023
87 checks passed

EnricoMi deleted the groupbykey branch December 9, 2023 21:51
