Provide groupByKey shortcuts for groupBy.as #213

Merged
merged 3 commits into from
Dec 9, 2023

@EnricoMi EnricoMi commented Dec 8, 2023

This provides shortcuts for groupBy(...).as[...] that make it easier to use column-based groupByKey.

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

When the dataset is already partitioned and ordered by the grouping columns, Dataset.groupByKey(...) will repartition and order the entire dataset again.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key, while ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column id.
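
A minimal sketch of the two calls in plain Spark, assuming Spark 3.x and a local session. The case class Val and the object GroupByAsSketch are hypothetical names used only for illustration. Both calls return a KeyValueGroupedDataset[Int, Val] with the same groups; they differ only in what Catalyst knows about the key.

```scala
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

// Hypothetical value type standing in for V; not part of this pull request.
case class Val(id: Int, value: String)

object GroupByAsSketch {
  // Groups the same data both ways and returns the row count per key for each.
  def groupCounts(spark: SparkSession): (Map[Int, Int], Map[Int, Int]) = {
    import spark.implicits._

    val ds: Dataset[Val] = Seq(Val(1, "a"), Val(1, "b"), Val(2, "c")).toDS()

    // Function-based key: Catalyst only sees an opaque lambda.
    val byFunction: KeyValueGroupedDataset[Int, Val] = ds.groupByKey(_.id)

    // Column-based key: Catalyst knows that column `id` is the grouping key.
    val byColumn: KeyValueGroupedDataset[Int, Val] = ds.groupBy($"id").as[Int, Val]

    def count(grouped: KeyValueGroupedDataset[Int, Val]): Map[Int, Int] =
      grouped
        .mapGroups((id: Int, rows: Iterator[Val]) => (id, rows.size))
        .collect()
        .toMap

    (count(byFunction), count(byColumn))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("groupBy.as sketch")
      .getOrCreate()
    try println(groupCounts(spark)) finally spark.stop()
  }
}
```

Calling explain() on a Dataset derived from each grouping is one way to compare the physical plans that Catalyst produces for the two variants.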

The new column-based groupByKey methods make it easier for users to discover how to group by column expressions: looking at the Dataset API, the user finds groupByKey with Column arguments. The existing groupBy method returns a RelationalGroupedDataset, whose as[K, V] method provides the same semantics but is difficult to find.

The new column-based groupByKey methods also do not require the user to specify the type V of the original Dataset[V], because groupByKey has access to that type and its encoder:

ds.groupBy($"id").as[Int, V]

vs.

ds.groupByKey[Int]($"id")
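
To illustrate why V can be dropped, here is a rough sketch of how such a shortcut could be layered on top of groupBy(...).as[K, V]. This is an assumption about the general shape, not the code merged in this pull request: the names ColumnGroupByKeySketch, ColumnGrouping, groupByKeyOn and Item are hypothetical, and the method is deliberately not called groupByKey. The wrapper captures the Encoder[V] of the Dataset, so the caller only supplies K.

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder, KeyValueGroupedDataset, SparkSession}

// Hypothetical value type standing in for V; not part of this pull request.
case class Item(id: Int, value: String)

object ColumnGroupByKeySketch {
  // Hypothetical extension; the name groupByKeyOn is illustrative only.
  implicit class ColumnGrouping[V: Encoder](ds: Dataset[V]) {
    // V and its encoder come from the wrapped Dataset, so only K is specified.
    def groupByKeyOn[K: Encoder](column: Column, columns: Column*): KeyValueGroupedDataset[K, V] =
      ds.groupBy(column +: columns: _*).as[K, V]
  }

  // Returns the distinct grouping keys of a small example Dataset.
  def keys(spark: SparkSession): Set[Int] = {
    import spark.implicits._
    val ds: Dataset[Item] = Seq(Item(1, "a"), Item(1, "b"), Item(2, "c")).toDS()
    ds.groupByKeyOn[Int]($"id").keys.collect().toSet
  }
}
```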


github-actions bot commented Dec 8, 2023

Test Results

566 files ±0, 566 suites ±0, 1h 29m 12s ⏱️ +11s
536 tests +2: 536 ✔️ +2, 0 💤 ±0, 0 ±0
16 828 runs +72: 16 826 ✔️ +72, 2 💤 ±0, 0 ±0

Results for commit 411afe8. ± Comparison against base commit 8314de4.

This pull request removes 28 and adds 30 tests. Note that renamed tests count towards both.
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

♻️ This comment has been updated with latest results.

@EnricoMi EnricoMi merged commit 119c854 into master Dec 9, 2023
87 checks passed
@EnricoMi EnricoMi deleted the groupbykey branch December 9, 2023 21:51