ARROW-11349: [Rust] Add from_iter_values to create arrays from (non null) values #9293

Dandandan · 2021-01-22T17:07:14Z

The idea of this PR is to have a function from_iter_values that (just like from_iter) creates an array based on an iterator, but from T instead of Option<T>.

I have seen some places in DataFusion (especially to_array_of_size) where an Array is generated from a Vec of items, which could be replaced by this.
The other iterator-based constructors have some memory and time overhead, both in creating and manipulating the null buffer and, in the case of `Vec`, in allocating and dropping the `Vec`.
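
As a rough illustration of where that overhead comes from, here is a minimal sketch using only the standard library. `ToyArray`, `from_iter_opt`, and this `from_iter_values` are hypothetical stand-ins written for this example; they are not the arrow crate's types or its implementation.

```rust
/// Toy stand-in for a primitive array: values plus an optional validity mask.
/// (Hypothetical type for illustration only; a real Arrow array packs
/// validity into a bitmap rather than a Vec<bool>.)
#[derive(Debug, PartialEq)]
pub struct ToyArray<T> {
    pub values: Vec<T>,
    /// `None` means "no nulls", so no validity buffer is allocated at all.
    pub validity: Option<Vec<bool>>,
}

/// Build from `Option<T>`: every item needs a validity entry, and null
/// slots still need a placeholder value.
pub fn from_iter_opt<T: Default, I: IntoIterator<Item = Option<T>>>(iter: I) -> ToyArray<T> {
    let iter = iter.into_iter();
    let (lower, _) = iter.size_hint();
    let mut values = Vec::with_capacity(lower);
    let mut validity = Vec::with_capacity(lower);
    for item in iter {
        validity.push(item.is_some());
        values.push(item.unwrap_or_default());
    }
    ToyArray { values, validity: Some(validity) }
}

/// Build from plain `T`: only the values are written, with no null bookkeeping.
pub fn from_iter_values<T, I: IntoIterator<Item = T>>(iter: I) -> ToyArray<T> {
    ToyArray { values: iter.into_iter().collect(), validity: None }
}
```

In this toy model the non-null path does one allocation and one write per item, while the `Option<T>` path does two of each.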

github-actions · 2021-01-22T17:07:35Z

https://issues.apache.org/jira/browse/ARROW-11349

jorgecarleitao · 2021-01-22T18:30:54Z

@Dandandan, thanks a lot for this.

Looking at the use case, wouldn't it make more sense to offer a method that creates a constant non-null array and a constant null array?

I am asking this because the common use case I see that justifies an iterator (instead of a constant) is to perform an infallible operation over the values while keeping the null buffer untouched (e.g. the typical unary operator).
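
For readers unfamiliar with that pattern, a minimal sketch of such a unary operator follows. `NullableArray` and `unary` are hypothetical names over plain `Vec`s, assumed here only for illustration; they do not reproduce arrow's kernels.

```rust
/// Toy nullable array (hypothetical; not arrow's `PrimitiveArray`).
#[derive(Debug, Clone, PartialEq)]
pub struct NullableArray<T> {
    pub values: Vec<T>,
    /// One flag per slot; `true` means the slot holds a valid value.
    pub validity: Vec<bool>,
}

/// Apply an infallible function to every slot and reuse the validity as-is.
/// Null slots are transformed too, which is harmless because their values
/// are never read as valid data.
pub fn unary<T: Copy, U, F: Fn(T) -> U>(input: &NullableArray<T>, op: F) -> NullableArray<U> {
    NullableArray {
        // The values come from an iterator of plain `T`, not `Option<T>`.
        values: input.values.iter().map(|&v| op(v)).collect(),
        // The null information is copied over without being inspected.
        validity: input.validity.clone(),
    }
}
```

Because the operation cannot fail and never looks at validity, the values side is exactly an iterator over non-null `T`.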

Dandandan · 2021-01-22T18:55:46Z

@jorgecarleitao yes, that would make more sense for the particular use case I mentioned, and probably would be more performant as well (it could even use memset / slice.fill when that's stabilized).

But I think this would cover other use cases as well?

I'm happy to contribute the "from constant" method as well in a different PR for the use case.
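
A minimal sketch of the "from constant" idea, using only the standard library: `constant_values` and `fill_constant` are hypothetical helper names over plain `Vec`s and slices, not arrow functions. `slice::fill` has since been stabilized in the standard library.

```rust
/// Build `len` copies of `value` in one step (toy sketch over a plain `Vec`).
pub fn constant_values<T: Clone>(value: T, len: usize) -> Vec<T> {
    vec![value; len]
}

/// Overwrite an existing slice with a constant using `slice::fill`.
pub fn fill_constant<T: Clone>(buf: &mut [T], value: T) {
    buf.fill(value);
}
```

Neither helper drives an iterator per element, which is why a dedicated constant constructor can beat a general iterator-based one for this use case.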

jorgecarleitao · 2021-01-22T19:15:35Z

Sorry for the noise, you are of course right. Let me just review it :)

jorgecarleitao · 2021-01-22T19:18:16Z

rust/arrow/src/array/array_primitive.rs

+        let mut val_buf = MutableBuffer::new(
+            data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(),
+        );
+
+        iter.for_each(|item| {
+            val_buf.push(item);
+        });


Could we wait for #9235? It introduces a method to extend a MutableBuffer from an iterator of Native types. It will also make it possible to drop the .expect("Iterator must be sized"), since the buffer can then be extended from unsized iterators.
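
To make the point about unsized iterators concrete, here is a minimal sketch under stated assumptions: `ByteBuffer` and `extend_i32` are hypothetical stand-ins built on `Vec<u8>`, not the `MutableBuffer` API from #9235. The size hint is used only to pre-reserve capacity, so an iterator without an exact length still works.

```rust
use std::mem::size_of;

/// Toy byte buffer (hypothetical stand-in for a growable Arrow-style buffer).
#[derive(Default)]
pub struct ByteBuffer {
    pub bytes: Vec<u8>,
}

impl ByteBuffer {
    /// Append every `i32` from `iter` as native-endian bytes.
    /// `size_hint` is only a reservation hint; the iterator need not be sized.
    pub fn extend_i32<I: IntoIterator<Item = i32>>(&mut self, iter: I) {
        let iter = iter.into_iter();
        let (lower, _) = iter.size_hint();
        self.bytes.reserve(lower * size_of::<i32>());
        for v in iter {
            // The Vec grows on demand if the hint was too small.
            self.bytes.extend_from_slice(&v.to_ne_bytes());
        }
    }
}
```

A filtered iterator reports a lower bound of zero, so a design that required an exact size up front would have to reject it, while this one simply grows as items arrive.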

Yes, sure 👍 sounds like a good idea

#9235 is merged

@jorgecarleitao is the new usage with extend what you meant?

We also have to know the final count for creating the array. Are you thinking of calculating that with val_buf.len() / mem::size_of::<<T as ArrowPrimitiveType>::Native>(), or is there another way?

Yes, that is what I meant: it should now be possible to do

let values = vec![0i32, 1];
let values = values.iter().map(|x| x + 1);
let buffer: Buffer = values.collect();

With respect to the length: that is a good point that I had not thought about. 👍

One option is to do what you wrote. Another could be (I haven't tried to compile this, as Fn could become FnMut):

let mut count = 0;
let iter = iter.map(|x| { count += 1; x });

I would probably have taken your idea, though :)
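
The two options discussed in this thread can be sketched side by side with the standard library alone. The function names below are hypothetical and a `Vec<u8>` stands in for the buffer. On the `Fn` versus `FnMut` concern: `Iterator::map` accepts an `FnMut` closure, so mutating a counter inside it compiles, but the counter can only be read after the mapped iterator has been consumed and dropped.

```rust
use std::mem::size_of;

/// Option 1: derive the element count from the byte length after collecting.
pub fn collect_then_divide<I: Iterator<Item = i32>>(iter: I) -> (Vec<u8>, usize) {
    let mut bytes = Vec::new();
    for v in iter {
        bytes.extend_from_slice(&v.to_ne_bytes());
    }
    let len = bytes.len() / size_of::<i32>();
    (bytes, len)
}

/// Option 2: count items as they stream past, via a closure that mutates `count`.
pub fn count_while_collecting<I: Iterator<Item = i32>>(iter: I) -> (Vec<u8>, usize) {
    let mut count = 0usize;
    let mut bytes = Vec::new();
    // The closure borrows `count` mutably for as long as the mapped iterator lives.
    for v in iter.map(|x| {
        count += 1;
        x
    }) {
        bytes.extend_from_slice(&v.to_ne_bytes());
    }
    // The mapped iterator was dropped with the loop, so `count` is readable again.
    (bytes, count)
}
```

Both return the same bytes and the same length. The division approach needs no extra state, which may be why it was the preferred choice here.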

Ah, with just Buffer and collect this is even cleaner. I adapted the code to the new API.

alamb

FWIW this feature (building a PrimitiveArray from an iterator of T rather than Option<T>) would have been helpful for me in a few instances too.

In our case we want to do something like:

fn make_it_an_array(v: Vec<u32>) -> ArrayRef {
...
}

Thanks @Dandandan

alamb · 2021-01-23T11:52:40Z

#9235 is merged

codecov-io · 2021-01-23T12:48:32Z

Codecov Report

Merging #9293 (a37941c) into master (67d0c2e) will increase coverage by 0.04%.
The diff coverage is 94.87%.

@@            Coverage Diff             @@
##           master    #9293      +/-   ##
==========================================
+ Coverage   81.84%   81.89%   +0.04%     
==========================================
  Files         215      215              
  Lines       52949    52988      +39     
==========================================
+ Hits        43336    43392      +56     
+ Misses       9613     9596      -17

Impacted Files	Coverage Δ
rust/arrow/src/array/array_primitive.rs	`94.48% <93.33%> (-0.05%)`	⬇️
rust/arrow/src/array/array_string.rs	`94.11% <95.83%> (+4.11%)`	⬆️
rust/arrow/src/buffer.rs	`96.21% <0.00%> (+2.52%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao

LGTM. Thanks a lot, and thanks for also fixing the un-ran test. 👍

…ull) values

The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`.

I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this. The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec)

Closes #9293 from Dandandan/array_iter_non_null

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

…to improve performance

This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries. This uses the PR #9293

The case of converting an int32 to an array improved by ~5x according to the microbenchmark:

```
to_array_of_size 100000 time:   [55.501 us 55.627 us 55.809 us]
                        change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```

And on TPC-H query 1 (SF=1, 16 partitions).

PR:

```
Query 1 iteration 0 took 90.8 ms
Query 1 iteration 1 took 106.6 ms
Query 1 iteration 2 took 101.1 ms
Query 1 iteration 3 took 101.5 ms
Query 1 iteration 4 took 96.9 ms
Query 1 iteration 5 took 100.3 ms
Query 1 iteration 6 took 99.6 ms
Query 1 iteration 7 took 100.4 ms
Query 1 iteration 8 took 104.2 ms
Query 1 iteration 9 took 100.3 ms
Query 1 avg time: 100.18 ms
```

Master:

```
Query 1 iteration 0 took 121.1 ms
Query 1 iteration 1 took 123.4 ms
Query 1 iteration 2 took 121.0 ms
Query 1 iteration 3 took 121.0 ms
Query 1 iteration 4 took 123.0 ms
Query 1 iteration 5 took 121.7 ms
Query 1 iteration 6 took 121.7 ms
Query 1 iteration 7 took 120.2 ms
Query 1 iteration 8 took 119.7 ms
Query 1 iteration 9 took 121.4 ms
Query 1 avg time: 121.43 ms
```

Closes #9305 from Dandandan/to_array_of_size_perf

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
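
As a quick check on the figures quoted in that commit message, the arithmetic can be written out as a small sketch. The helper names are hypothetical; the inputs are the numbers reported above.

```rust
/// Convert a relative time change (e.g. -0.82384 for "-82.384%") into a speedup factor.
pub fn speedup_from_change(change: f64) -> f64 {
    1.0 / (1.0 + change)
}

/// Speedup implied by two average timings.
pub fn speedup_from_times(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}
```

With the reported values, `speedup_from_change(-0.82384)` is roughly 5.7, in line with the "~5x" claim for the microbenchmark, and `speedup_from_times(121.43, 100.18)` is roughly 1.21 for the query 1 averages.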


Dandandan added 3 commits January 22, 2021 17:52

Add from_iter_values to create arrays from (non null) values

37c70fb

Remove borrow (they are primitive types anyway)

8cd118d

Fix comment

3a63974

github-actions bot added the Component: Rust label Jan 22, 2021

jorgecarleitao reviewed Jan 22, 2021

jorgecarleitao changed the title ~~ARROW-11349: [Rust[ Add from_iter_values to create arrays from (non null) values~~ ARROW-11349: [Rust] Add from_iter_values to create arrays from (non null) values Jan 22, 2021

alamb approved these changes Jan 23, 2021

Dandandan mentioned this pull request Jan 23, 2021

ARROW-11356: [Rust] Add method to PrimitiveArray::from_value to efficiently create an Array from a value #9300

Closed

Dandandan added 2 commits January 23, 2021 13:08

Merge branch 'master' of github.com:apache/arrow into array_iter_non_null

79c92aa

Use extend

941ee5d

Use .collect() api

a37941c

jorgecarleitao approved these changes Jan 23, 2021

Dandandan mentioned this pull request Jan 23, 2021

ARROW-11362:[Rust][DataFusion] Use iterator APIs in to_array_of_size to improve performance #9305

Closed

jorgecarleitao closed this in cf7638f Jan 25, 2021

asfimport mentioned this pull request Jan 26, 2021

[Rust] Add from_iter_values to create arrays from T instead of Option<T> #27245

Closed
