Implement window functions with partition_by clause #558

Merged (1 commit), Jun 21, 2021

jimexist
Member

@jimexist jimexist commented Jun 14, 2021

Which issue does this PR close?

Closes #299

Rationale for this change

With ORDER BY implemented, we can now add PARTITION BY support.

What changes are included in this PR?

Are there any user-facing changes?

@jimexist jimexist changed the title from "Impl window partition by" to "Implement window functions with partition_by clause" Jun 14, 2021
@codecov-commenter

codecov-commenter commented Jun 14, 2021

Codecov Report

Merging #558 (4a7a499) into master (e3e7e29) will decrease coverage by 0.03%.
The diff coverage is 81.01%.

@@            Coverage Diff             @@
##           master     #558      +/-   ##
==========================================
- Coverage   76.12%   76.08%   -0.04%     
==========================================
  Files         156      156              
  Lines       27074    27121      +47     
==========================================
+ Hits        20609    20635      +26     
- Misses       6465     6486      +21     
| Impacted Files | Coverage Δ | |
|---|---|---|
| datafusion/src/physical_plan/window_functions.rs | 86.42% <ø> (+0.71%) | ⬆️ |
| datafusion/src/sql/planner.rs | 84.75% <ø> (ø) | |
| datafusion/src/physical_plan/planner.rs | 79.84% <33.33%> (+2.30%) | ⬆️ |
| datafusion/src/physical_plan/mod.rs | 80.00% <70.96%> (+0.90%) | ⬆️ |
| datafusion/src/physical_plan/windows.rs | 82.59% <75.00%> (-3.88%) | ⬇️ |
| ...afusion/src/physical_plan/expressions/nth_value.rs | 79.41% <75.67%> (-11.07%) | ⬇️ |
| datafusion/src/execution/context.rs | 92.13% <100.00%> (+0.13%) | ⬆️ |
| ...fusion/src/physical_plan/expressions/row_number.rs | 94.28% <100.00%> (+13.03%) | ⬆️ |
| datafusion/src/physical_plan/hash_aggregate.rs | 86.54% <100.00%> (ø) | |
| datafusion/src/scalar.rs | 56.19% <100.00%> (ø) | |
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jimexist jimexist force-pushed the impl-window-partition-by branch 2 times, most recently from 0c6f31f to c3c0ef5 Compare June 15, 2021 00:16
@jimexist jimexist marked this pull request as ready for review June 15, 2021 00:40
@jimexist
Member Author

@Dandandan and @alamb this is ready now

@jimexist
Member Author

After this pull request, I'll rebase and merge #564 so that we have a benchmark for future iterations.

    new_null_array(value.data_type(), num_rows)
} else {
    let value = ScalarValue::try_from_array(value, index)?;
    value.to_array_of_size(num_rows)
Contributor

The same applies to normal aggregations, and it probably happens here too: if the PARTITION BY creates a lot of groups, we will create many individual Arrow arrays, which is slow and memory-consuming.

In the long run it would probably be better to store the values in one contiguous array, with the offsets to each group's values in another, and extend / update those instead.
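
To make the offsets-plus-values idea concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion or Arrow code: `GroupedValues` and its methods are hypothetical names chosen for illustration, and `i64` stands in for whatever value type the real arrays would hold.

```rust
/// Hypothetical illustration: per-group values kept in one flat buffer
/// plus an offsets array, instead of one allocation per group.
#[derive(Debug)]
struct GroupedValues {
    /// Values of all groups, stored back to back.
    values: Vec<i64>,
    /// `offsets[i]..offsets[i + 1]` is the slice of `values` for group `i`.
    offsets: Vec<usize>,
}

impl GroupedValues {
    fn new() -> Self {
        // A single leading 0 means "no groups yet".
        Self { values: Vec::new(), offsets: vec![0] }
    }

    /// Append one group by extending the shared buffer.
    fn push_group(&mut self, group: &[i64]) {
        self.values.extend_from_slice(group);
        self.offsets.push(self.values.len());
    }

    fn num_groups(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow the values of group `i` without copying.
    fn group(&self, i: usize) -> &[i64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

fn main() {
    let mut grouped = GroupedValues::new();
    grouped.push_group(&[1, 2, 3]);
    grouped.push_group(&[]);
    grouped.push_group(&[4, 5]);
    for i in 0..grouped.num_groups() {
        println!("group {}: {:?}", i, grouped.group(i));
    }
}
```

With this layout, the number of heap allocations stays roughly constant however many groups the PARTITION BY produces, because only the two vectors grow.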

Contributor

Not needed for this PR btw, but just noting there are similar needs/performance issues in both aggregation and window functions.

Member Author

I agree. Not in this pull request, but I believe this warrants a dedicated compute kernel in Arrow for batched array slice transformation followed by concatenation.

Member Author

@Dandandan, I've since started working on this issue in a follow-up:

https://github.com/apache/arrow-datafusion/pull/579/files#diff-8b6b5ea3976c91229244e4e7a31a7026422b1374d1683e44b41af67a6bd43187R246-R254

-        let results = partition_points
-            .iter()
-            .map(|partition_range| {
-                let sort_partition_points =
-                    find_ranges_in_range(partition_range, &sort_partition_points);
-                let mut window_accumulators = self.create_accumulator()?;
-                sort_partition_points
-                    .iter()
-                    .map(|range| window_accumulators.scan_peers(&values, range))
-                    .collect::<Result<Vec<_>>>()
-            })
-            .collect::<Result<Vec<Vec<ArrayRef>>>>()?
-            .into_iter()
-            .flatten()
-            .collect::<Vec<ArrayRef>>();
-        let results = results.iter().map(|i| i.as_ref()).collect::<Vec<_>>();
-        concat(&results).map_err(DataFusionError::ArrowError)
+        let mut result = Vec::with_capacity(num_rows);
+        for partition_range in partition_points {
+            let sort_partition_points =
+                find_ranges_in_range(&partition_range, &sort_partition_points);
+            let mut window_accumulators = self.create_accumulator()?;
+            for range in sort_partition_points {
+                result.extend(window_accumulators.scan_peers(&values, range)?);
+            }
+        }
+        ScalarValue::iter_to_array(result.into_iter())
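
For readers unfamiliar with the partition-range / peer-range structure in the diff above, the following standalone sketch shows the same loop shape using only the standard library. It is an illustration under assumptions, not the DataFusion implementation: `equal_key_ranges`, `ranges_within`, and `rank` are hypothetical helpers (the second is a simplified stand-in for what `find_ranges_in_range` appears to do), the input is assumed to be already sorted by partition key and then by order key, and a plain `Vec<u64>` replaces the `ScalarValue` results.

```rust
use std::ops::Range;

/// Split a slice that is already sorted by key into ranges of equal keys.
fn equal_key_ranges<T: PartialEq>(keys: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=keys.len() {
        if i == keys.len() || keys[i] != keys[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}

/// Keep only the ranges that lie fully inside `outer`.
fn ranges_within(outer: &Range<usize>, ranges: &[Range<usize>]) -> Vec<Range<usize>> {
    ranges
        .iter()
        .filter(|r| r.start >= outer.start && r.end <= outer.end)
        .cloned()
        .collect()
}

/// RANK() OVER (PARTITION BY p ORDER BY o), assuming rows are pre-sorted by (p, o).
fn rank(partition_keys: &[&str], order_keys: &[i32]) -> Vec<u64> {
    assert_eq!(partition_keys.len(), order_keys.len());
    let num_rows = partition_keys.len();
    let partition_ranges = equal_key_ranges(partition_keys);
    // Peers are rows that share both the partition key and the order key.
    let combined: Vec<(&str, i32)> = partition_keys
        .iter()
        .copied()
        .zip(order_keys.iter().copied())
        .collect();
    let peer_ranges = equal_key_ranges(&combined);

    // One flat result buffer, extended per peer group; no per-group arrays.
    let mut result = Vec::with_capacity(num_rows);
    for partition in &partition_ranges {
        for peers in ranges_within(partition, &peer_ranges) {
            // Rank is 1 plus the number of rows preceding the peer group
            // within its partition.
            let value = (peers.start - partition.start + 1) as u64;
            result.extend(std::iter::repeat(value).take(peers.len()));
        }
    }
    result
}

fn main() {
    let ranks = rank(&["a", "a", "a", "b", "b"], &[1, 1, 2, 5, 7]);
    println!("{:?}", ranks); // prints [1, 1, 3, 1, 2]
}
```

The point of the sketch is the accumulation pattern: every peer group extends a single pre-sized buffer, so the final conversion happens once rather than after concatenating many small arrays.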

Contributor

Yes - that should probably already be quite an improvement 👍

@jimexist jimexist force-pushed the impl-window-partition-by branch 2 times, most recently from 0ab7340 to 4f98195 Compare June 19, 2021 06:15
@jimexist
Member Author

@Dandandan this is fixed now

Contributor

@Dandandan Dandandan left a comment

Looks great again! - 2 comments about tests for being a bit more future proof

@jimexist
Member Author

> Looks great again! - 2 comments about tests for being a bit more future proof

Fixed. As for repartitioning, I'll handle that in #569, but so far I'm seeing performance regressions.

@alamb
Contributor

alamb commented Jun 21, 2021

Thanks @jimexist

@alamb alamb merged commit 05d5f01 into apache:master Jun 21, 2021
@houqp houqp added the datafusion (Changes in the datafusion crate) and enhancement (New feature or request) labels Jul 30, 2021
Successfully merging this pull request may close these issues.

Support window functions with PARTITION BY clause
5 participants