Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11349: [Rust] Add from_iter_values to create arrays from (non null) values #9293

Closed
wants to merge 6 commits into from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jan 22, 2021

The idea of this PR is to have a function from_iter_values that (just like from_iter) creates an array based on an iterator, but from T instead of Option<T>.

I have seen some places in DataFusion (especially to_array_of_size) where an Array is generated from a Vec of items, which could be replaced by this.
The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of Vec for allocating / dropping the Vec)

@github-actions
Copy link

@jorgecarleitao
Copy link
Member

@Dandandan , thanks a lot for this.

Looking at the use-case, couldn't it make more sense to offer a method that creates a constant non-null array and a constant null array?

I am asking this because the common use-case I see that justifies an iterator (instead of a constant) is to perform an infalible operation over the values while keeping the null buffer untouched (e.g. the typical unary operator).

@Dandandan
Copy link
Contributor Author

@jorgecarleitao yes, that would make more sense for the particular use case I mentioned, and probably would be more performant as well (it could even use memset / slice.fill when that's stabilized).

But I think this would cover other use cases as well?

I'm happy to contribute the "from constant" method as well in a different PR for the use case.

@jorgecarleitao
Copy link
Member

Sorry for the noise, you are of course right. Let me just review it :)

Comment on lines 104 to 110
let mut val_buf = MutableBuffer::new(
data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(),
);

iter.for_each(|item| {
val_buf.push(item);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we wait for #9235? It introduces a method to extend a MutableBuffer out of an iterator of Native types. It will also allow to drop the .expect("Iterator must be sized"), since it will be possible to extend it out of unsized iterators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure 👍 sounds like a good idea

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9235 is merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao is the new usage with extend what you meant?

We also have to know the final count for creating the array, are you thinking of calculating that with val_buf.len() / mem::size_of::<<T as ArrowPrimitiveType>::Native>() or is there an other way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what I meant: it should now be possible to do

let values = vec![0i32, 1];

let values = values.iter().map(|x| x + 1);

let buffer: Buffer: values.collect();

wrt to the len: that is a good point that I have had not thought about. 👍

One option is to do what you wrote. Another could be (I haven't tried to compile this, as Fn could become FnMut):

let mut count = 0;
let iter = iter.map(|x| {
   count += 1;
x
});

I would probably have taken your idea, though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah with just Buffer and collect this is even more clean. I adapted the code with the new API.

@jorgecarleitao jorgecarleitao changed the title ARROW-11349: [Rust[ Add from_iter_values to create arrays from (non null) values ARROW-11349: [Rust] Add from_iter_values to create arrays from (non null) values Jan 22, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this feature (building a PrimitiveArray from an iterator of T rather than Option<T> would have been helpful for me in a few instances too).

In our case we want ot do something like:

fn make_it_an_array(v: Vec<u32>) -> ArrayRef {
...
}

Thanks @Dandandan

Comment on lines 104 to 110
let mut val_buf = MutableBuffer::new(
data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(),
);

iter.for_each(|item| {
val_buf.push(item);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9235 is merged

@codecov-io
Copy link

codecov-io commented Jan 23, 2021

Codecov Report

Merging #9293 (a37941c) into master (67d0c2e) will increase coverage by 0.04%.
The diff coverage is 94.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9293      +/-   ##
==========================================
+ Coverage   81.84%   81.89%   +0.04%     
==========================================
  Files         215      215              
  Lines       52949    52988      +39     
==========================================
+ Hits        43336    43392      +56     
+ Misses       9613     9596      -17     
Impacted Files Coverage Δ
rust/arrow/src/array/array_primitive.rs 94.48% <93.33%> (-0.05%) ⬇️
rust/arrow/src/array/array_string.rs 94.11% <95.83%> (+4.11%) ⬆️
rust/arrow/src/buffer.rs 96.21% <0.00%> (+2.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67d0c2e...a37941c. Read the comment docs.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, and thanks for also fixing the un-ran test. 👍

kszucs pushed a commit that referenced this pull request Jan 25, 2021
…ull) values

The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`.

I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this.
The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec)

Closes #9293 from Dandandan/array_iter_non_null

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
jorgecarleitao added a commit that referenced this pull request Jan 29, 2021
…to improve performance

This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries.

This uses the PR #9293

The case of converting an int32 to an array improved by ~5x according to the microbenchmark:

```
to_array_of_size 100000 time:   [55.501 us 55.627 us 55.809 us]
                        change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```

And on TCPH query 1 (SF=1, 16 partitions).

PR:

```
Query 1 iteration 0 took 90.8 ms
Query 1 iteration 1 took 106.6 ms
Query 1 iteration 2 took 101.1 ms
Query 1 iteration 3 took 101.5 ms
Query 1 iteration 4 took 96.9 ms
Query 1 iteration 5 took 100.3 ms
Query 1 iteration 6 took 99.6 ms
Query 1 iteration 7 took 100.4 ms
Query 1 iteration 8 took 104.2 ms
Query 1 iteration 9 took 100.3 ms
Query 1 avg time: 100.18 ms
```
Master:
```
Query 1 iteration 0 took 121.1 ms
Query 1 iteration 1 took 123.4 ms
Query 1 iteration 2 took 121.0 ms
Query 1 iteration 3 took 121.0 ms
Query 1 iteration 4 took 123.0 ms
Query 1 iteration 5 took 121.7 ms
Query 1 iteration 6 took 121.7 ms
Query 1 iteration 7 took 120.2 ms
Query 1 iteration 8 took 119.7 ms
Query 1 iteration 9 took 121.4 ms
Query 1 avg time: 121.43 ms
```

Closes #9305 from Dandandan/to_array_of_size_perf

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ull) values

The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`.

I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this.
The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec)

Closes apache#9293 from Dandandan/array_iter_non_null

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…to improve performance

This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries.

This uses the PR apache#9293

The case of converting an int32 to an array improved by ~5x according to the microbenchmark:

```
to_array_of_size 100000 time:   [55.501 us 55.627 us 55.809 us]
                        change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```

And on TCPH query 1 (SF=1, 16 partitions).

PR:

```
Query 1 iteration 0 took 90.8 ms
Query 1 iteration 1 took 106.6 ms
Query 1 iteration 2 took 101.1 ms
Query 1 iteration 3 took 101.5 ms
Query 1 iteration 4 took 96.9 ms
Query 1 iteration 5 took 100.3 ms
Query 1 iteration 6 took 99.6 ms
Query 1 iteration 7 took 100.4 ms
Query 1 iteration 8 took 104.2 ms
Query 1 iteration 9 took 100.3 ms
Query 1 avg time: 100.18 ms
```
Master:
```
Query 1 iteration 0 took 121.1 ms
Query 1 iteration 1 took 123.4 ms
Query 1 iteration 2 took 121.0 ms
Query 1 iteration 3 took 121.0 ms
Query 1 iteration 4 took 123.0 ms
Query 1 iteration 5 took 121.7 ms
Query 1 iteration 6 took 121.7 ms
Query 1 iteration 7 took 120.2 ms
Query 1 iteration 8 took 119.7 ms
Query 1 iteration 9 took 121.4 ms
Query 1 avg time: 121.43 ms
```

Closes apache#9305 from Dandandan/to_array_of_size_perf

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…ull) values

The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`.

I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this.
The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec)

Closes apache#9293 from Dandandan/array_iter_non_null

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…to improve performance

This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries.

This uses the PR apache#9293

The case of converting an int32 to an array improved by ~5x according to the microbenchmark:

```
to_array_of_size 100000 time:   [55.501 us 55.627 us 55.809 us]
                        change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```

And on TCPH query 1 (SF=1, 16 partitions).

PR:

```
Query 1 iteration 0 took 90.8 ms
Query 1 iteration 1 took 106.6 ms
Query 1 iteration 2 took 101.1 ms
Query 1 iteration 3 took 101.5 ms
Query 1 iteration 4 took 96.9 ms
Query 1 iteration 5 took 100.3 ms
Query 1 iteration 6 took 99.6 ms
Query 1 iteration 7 took 100.4 ms
Query 1 iteration 8 took 104.2 ms
Query 1 iteration 9 took 100.3 ms
Query 1 avg time: 100.18 ms
```
Master:
```
Query 1 iteration 0 took 121.1 ms
Query 1 iteration 1 took 123.4 ms
Query 1 iteration 2 took 121.0 ms
Query 1 iteration 3 took 121.0 ms
Query 1 iteration 4 took 123.0 ms
Query 1 iteration 5 took 121.7 ms
Query 1 iteration 6 took 121.7 ms
Query 1 iteration 7 took 120.2 ms
Query 1 iteration 8 took 119.7 ms
Query 1 iteration 9 took 121.4 ms
Query 1 avg time: 121.43 ms
```

Closes apache#9305 from Dandandan/to_array_of_size_perf

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants