add `first_value`, `last_value`, and `nth_value` built-in window functions #403

jimexist · 2021-05-24T01:05:59Z

Which issue does this PR close?

add first_value, last_value, and nth_value

Closes #298

Based on #375 so review that first

Rationale for this change

adding three window functions

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2021-05-24T01:25:05Z

Codecov Report

Merging #403 (59663eb) into master (174226c) will increase coverage by 0.48%.
The diff coverage is 73.23%.

@@            Coverage Diff             @@
##           master     #403      +/-   ##
==========================================
+ Coverage   74.94%   75.42%   +0.48%     
==========================================
  Files         146      148       +2     
  Lines       24314    24595     +281     
==========================================
+ Hits        18223    18552     +329     
+ Misses       6091     6043      -48

Impacted Files	Coverage Δ
datafusion/src/physical_plan/expressions/mod.rs	`71.42% <ø> (ø)`
...fusion/src/physical_plan/expressions/row_number.rs	`0.00% <0.00%> (ø)`
datafusion/src/physical_plan/hash_aggregate.rs	`85.21% <ø> (ø)`
datafusion/src/physical_plan/sort.rs	`92.07% <ø> (ø)`
datafusion/src/physical_plan/window_functions.rs	`86.95% <64.70%> (-1.77%)`	⬇️
datafusion/src/physical_plan/mod.rs	`80.58% <68.75%> (-2.18%)`	⬇️
...afusion/src/physical_plan/expressions/nth_value.rs	`70.76% <70.76%> (ø)`
datafusion/src/physical_plan/windows.rs	`82.12% <88.09%> (+82.12%)`	⬆️
datafusion/src/execution/context.rs	`92.07% <100.00%> (+0.02%)`	⬆️
datafusion/src/physical_plan/planner.rs	`80.45% <100.00%> (+3.94%)`	⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 174226c...59663eb. Read the comment docs.

datafusion/src/physical_plan/expressions/nth_value.rs

alamb · 2021-05-26T20:21:55Z

@jimexist I plan to review this PR later today or tomorrow

alamb

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended

alamb · 2021-05-27T17:44:09Z

datafusion/src/physical_plan/expressions/nth_value.rs

+
+#[derive(Debug)]
+struct NthValueAccumulator {
+    // n the target nth_value, however we'll reuse it for last_value acc, so when n == 0 it specifically


I wonder if a more idomatic Rust way of doing this would be an enum like

enum NthValue { First, Last, Nth(u32) }

And then in NthValueAccumulator::scan you would have something like

match self.n { NthValue::First|NthValue::Nth(1) => {...} Nth::Last => {..} }

There is nothing wrong with the special value approach either -- I just figured I would point it out

Yeah I agree that sounds like a more idiomatic way.

to address later: #448

alamb · 2021-05-27T17:45:41Z

datafusion/src/physical_plan/expressions/nth_value.rs

+use std::convert::TryFrom;
+use std::sync::Arc;
+
+/// first_value expression


Given how similar FirstValue, LastValue and NthValue are, would it possible to just use the NthValue struct rather than have three mostly repeated structs?

alamb · 2021-05-27T17:46:50Z

datafusion/src/physical_plan/expressions/nth_value.rs

+}
+
+impl WindowAccumulator for NthValueAccumulator {
+    fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {


Are we guaranteed that scan() will see the entire window in a single call? Or would it be possible to see scan() called for two different slices?

Future performance improvement idea: rather than converting each row to ScalarValue and passing values: &[ScalarValue] we should strive to slice into the original arrays (i.e. don't do a lot of work / don't copy data) and produce arrays based on offsets and then use take to build new arrays out of all of the indices (in this case 1, but in case of smaller windows / partitions etc. this could grow by a lot).

alamb · 2021-05-27T17:47:42Z

datafusion/src/physical_plan/expressions/nth_value.rs

+    fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {
+        if self.n == SPECIAL_SIZE_VALUE_FOR_LAST {
+            // for last_value function
+            self.value = values[0].clone();


Should this be values.last() rather than the first (0th) value?

here the values array is per row

alamb · 2021-05-27T17:48:54Z

datafusion/src/physical_plan/windows.rs

+
+    #[tokio::test]
+    async fn window_function() -> Result<()> {
+        let (input, schema) = create_test_schema(1)?;


I think it would be valuable to create a test with more than one partition (so that the data may not arrive to the WindowFunction as a single RecordBatch)

alamb · 2021-05-27T18:01:03Z

datafusion/tests/sql.rs

-        vec!["1", "781", "7.81", "100", "125", "-117"],
-        vec!["1", "781", "7.81", "100", "125", "-117"],
-        vec!["1", "781", "7.81", "100", "125", "-117"],
+        vec!["1", "781", "7.81", "100", "125", "-117", "1", "30", "-40"],


I am not sure about these results. "first_value", "last_value" and "nth_value" over an unsorted window (as this query is is using) seems undefined to me.

For example, Using datafusion-cli:

CREATE EXTERNAL TABLE aggregate_test_100 ( c1 VARCHAR NOT NULL, c2 INT NOT NULL, c3 SMALLINT NOT NULL, c4 SMALLINT NOT NULL, c5 INT NOT NULL, c6 BIGINT NOT NULL, c7 SMALLINT NOT NULL, c8 INT NOT NULL, c9 BIGINT NOT NULL, c10 VARCHAR NOT NULL, c11 FLOAT NOT NULL, c12 DOUBLE NOT NULL, c13 VARCHAR NOT NULL ) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Software/arrow-datafusion/testing/data/csv/';

You can see there are many values for c3, and the values of first_value, last_value and nth_value seem to be picking some arbitrary rows:

> select c2, c3 from aggregate_test_100 order by c2; +----+------+ | c2 | c3 | +----+------+ | 1 | 29 | | 1 | -85 | | 1 | 38 | | 1 | 57 | | 1 | 54 | | 1 | 103 | | 1 | -98 | | 1 | -99 | | 1 | -25 | | 1 | 36 | | 1 | 41 | | 1 | -8 | | 1 | -24 | | 1 | 125 | | 1 | 70 | | 1 | -72 | | 1 | 71 | | 1 | -56 | | 1 | -5 | | 1 | 12 | | 1 | 83 | | 1 | 120 | | 2 | 1 | | 2 | 113 | | 2 | 49 | | 2 | 97 | | 2 | -29 | | 2 | 45 | | 2 | -60 | | 2 | 93 | | 2 | 63 | | 2 | 52 | | 2 | 31 | | 2 | -106 | | 2 | -60 | | 2 | 68 | | 2 | -61 | | 2 | 122 | | 2 | -48 | | 2 | 52 | | 2 | -117 | | 2 | 29 | | 2 | -107 | | 2 | -43 | | 3 | 104 | | 3 | 13 | | 3 | 112 | | 3 | 77 | | 3 | 17 | | 3 | 13 | | 3 | 73 | | 3 | -2 | | 3 | 22 | | 3 | 17 | | 3 | -76 | | 3 | 71 | | 3 | 14 | | 3 | -12 | | 3 | -72 | | 3 | 97 | | 3 | -101 | | 3 | -95 | | 3 | 123 | | 4 | -111 | | 4 | -38 | | 4 | -54 | | 4 | -56 | | 4 | -53 | | 4 | 123 | | 4 | 97 | | 4 | 102 | | 4 | 65 | | 4 | 17 | | 4 | 55 | | 4 | 73 | | 4 | -117 | | 4 | -101 | | 4 | -79 | | 4 | 74 | | 4 | 96 | | 4 | -90 | | 4 | -59 | | 4 | 3 | | 4 | 5 | | 4 | 47 | | 4 | 30 | | 5 | -40 | | 5 | -82 | | 5 | 36 | | 5 | -31 | | 5 | -5 | | 5 | 68 | | 5 | -59 | | 5 | 62 | | 5 | -94 | | 5 | 64 | | 5 | -86 | | 5 | 118 | | 5 | -101 | | 5 | -44 | +----+------+

It is not clear that 1, 30 and -40 are the "right" answers (there is no good answer for this dataset)

jimexist · 2021-05-28T01:35:34Z

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

jimexist · 2021-05-28T01:39:16Z

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

See also #429

Both can be independently merged first because with the sort clause implemented these two logic shall stay unchanged as sorting happens as a separate physical plan that precedes these and feeds immediately to these.

Regarding the submodule change it's not intended - will revert.

Dandandan · 2021-05-28T06:50:02Z

Thank you @jimexist -- the code structure looks clean to me . Nice work
However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).
I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output
I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

I think the implementation might take the order as is, however the order as given by the underlying plan might give different result. For example, a table scan might give the results in a different order each time it runs.

alamb · 2021-05-28T10:56:07Z

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Sounds good -- makes sense. As @Dandandan says what this means to me is that there is no well defined "correct result" for the query (and thus the test is ensuring the code doesn't crash, but doesn't really ensure it is getting the correct values)

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

👍

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

I think that makes sense.

alamb

I think we should merge this PR as is. Rationale:

while I have concerns about the "correctness" of this implementation, it really represents an incremental step towards full window function support
I also trust that @jimexist has a plan for verifying correctness subsequently and that we aren't going to leave this in a half done state

So I think we should merge this one in and let @jimexist keep on coding 👍

Dandandan · 2021-05-28T11:42:07Z

@jimexist there is some change in parquet-testing can you revert that change?
Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

jimexist · 2021-05-28T11:57:44Z

@jimexist there is some change in parquet-testing can you revert that change?

Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

Thanks I'll make sure to figure out a way to consistently test these and also try to stay on par with psql.

I'm AFK until Monday so we can either way until then or feel free to help me revert that submodule change and merge

commit 7fb3640 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 16:38:25 2021 +0800 row number done commit 1723926 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 16:05:50 2021 +0800 add row number commit bf5b8a5 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 15:04:49 2021 +0800 save commit d2ce852 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:53:05 2021 +0800 add streams commit 0a861a7 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 22:28:34 2021 +0800 save stream commit a9121af Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 22:01:51 2021 +0800 update unit test commit 2af2a27 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:25:12 2021 +0800 fix unit test commit bb57c76 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:23:34 2021 +0800 use upper case commit 5d96e52 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:16:16 2021 +0800 fix unit test commit 1ecae8f Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 12:27:26 2021 +0800 fix unit test commit bc2271d Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 10:04:29 2021 +0800 fix error commit 880b94f Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 08:24:00 2021 +0800 fix unit test commit 4e792e1 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 08:05:17 2021 +0800 fix test commit c36c04a Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 00:07:54 2021 +0800 add more tests commit f5e64de Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:41:36 2021 +0800 update commit a1eae86 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:36:15 2021 +0800 enrich unit test commit 0d2a214 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:25:43 2021 +0800 adding filter by todo commit 8b486d5 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:17:22 2021 +0800 adding more built-in functions commit abf08cd Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:36:27 2021 +0800 Update datafusion/src/physical_plan/window_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 0cbca53 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:34:57 2021 +0800 Update datafusion/src/physical_plan/window_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 831c069 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:34:04 2021 +0800 Update datafusion/src/logical_plan/builder.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit f70c739 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:33:04 2021 +0800 Update datafusion/src/logical_plan/builder.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 3ee87aa Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:55:08 2021 +0800 fix unit test commit 5c4d92d Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:48:26 2021 +0800 fix clippy commit a0b7526 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:46:38 2021 +0800 fix unused imports commit 1d3b076 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 13 18:51:14 2021 +0800 add window expr

jimexist · 2021-05-30T17:13:48Z

@jimexist there is some change in parquet-testing can you revert that change?
Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

@Dandandan this is reverted

jimexist marked this pull request as draft May 24, 2021 01:07

nevi-me reviewed May 24, 2021

View reviewed changes

datafusion/src/physical_plan/expressions/nth_value.rs Show resolved Hide resolved

jimexist force-pushed the add-row-number-exec branch 2 times, most recently from a0d43c6 to 2355533 Compare May 24, 2021 14:36

jimexist marked this pull request as ready for review May 24, 2021 14:56

jimexist mentioned this pull request May 24, 2021

finish implementation of window accumulation and func row_number() and add sql.rs test #413

Closed

jimexist force-pushed the add-row-number-exec branch from 2355533 to c7cc798 Compare May 25, 2021 01:46

jimexist marked this pull request as draft May 25, 2021 07:04

jimexist force-pushed the add-row-number-exec branch from c7cc798 to d1f53ce Compare May 26, 2021 05:25

jimexist marked this pull request as ready for review May 26, 2021 05:29

This was referenced May 26, 2021

implement lead and lag built-in window function #429

Merged

Support window functions with empty OVER clause #298

Closed

jimexist force-pushed the add-row-number-exec branch 2 times, most recently from 1e543bb to 8daa8b7 Compare May 26, 2021 20:47

alamb reviewed May 27, 2021

View reviewed changes

alamb approved these changes May 28, 2021

View reviewed changes

jimexist mentioned this pull request May 28, 2021

create a test with more than one partition for window functions #435

Closed

jimexist force-pushed the add-row-number-exec branch from 8daa8b7 to 0c5efac Compare May 30, 2021 17:06

jimexist mentioned this pull request May 30, 2021

use a more rust idiomatic way of handling nth_value #448

Closed

alamb merged commit 2b5b009 into apache:master May 31, 2021

jimexist deleted the add-row-number-exec branch May 31, 2021 14:29

houqp added datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add `first_value`, `last_value`, and `nth_value` built-in window functions #403

add `first_value`, `last_value`, and `nth_value` built-in window functions #403

jimexist commented May 24, 2021 •

edited

Loading

codecov-commenter commented May 24, 2021

alamb commented May 26, 2021

alamb left a comment

alamb May 27, 2021

Dandandan May 28, 2021

jimexist May 30, 2021

alamb May 27, 2021

alamb May 27, 2021

Dandandan May 28, 2021 •

edited

Loading

alamb May 27, 2021

jimexist May 30, 2021

alamb May 27, 2021

alamb May 27, 2021

jimexist commented May 28, 2021

jimexist commented May 28, 2021

Dandandan commented May 28, 2021

alamb commented May 28, 2021

alamb left a comment

Dandandan commented May 28, 2021

jimexist commented May 28, 2021

jimexist commented May 30, 2021

add first_value, last_value, and nth_value built-in window functions #403

add first_value, last_value, and nth_value built-in window functions #403

Conversation

jimexist commented May 24, 2021 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter commented May 24, 2021

Codecov Report

alamb commented May 26, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan May 28, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jimexist commented May 28, 2021

jimexist commented May 28, 2021

Dandandan commented May 28, 2021

alamb commented May 28, 2021

alamb left a comment

Choose a reason for hiding this comment

Dandandan commented May 28, 2021

jimexist commented May 28, 2021

jimexist commented May 30, 2021

add `first_value`, `last_value`, and `nth_value` built-in window functions #403

add `first_value`, `last_value`, and `nth_value` built-in window functions #403

jimexist commented May 24, 2021 •

edited

Loading

Dandandan May 28, 2021 •

edited

Loading