-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11349: [Rust] Add from_iter_values to create arrays from (non null) values #9293
Conversation
@Dandandan , thanks a lot for this. Looking at the use-case, couldn't it make more sense to offer a method that creates a constant non-null array and a constant null array? I am asking this because the common use-case I see that justifies an iterator (instead of a constant) is to perform an infalible operation over the values while keeping the null buffer untouched (e.g. the typical unary operator). |
@jorgecarleitao yes, that would make more sense for the particular use case I mentioned, and probably would be more performant as well (it could even use But I think this would cover other use cases as well? I'm happy to contribute the "from constant" method as well in a different PR for the use case. |
Sorry for the noise, you are of course right. Let me just review it :) |
let mut val_buf = MutableBuffer::new( | ||
data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(), | ||
); | ||
|
||
iter.for_each(|item| { | ||
val_buf.push(item); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we wait for #9235? It introduces a method to extend a MutableBuffer
out of an iterator of Native types. It will also allow to drop the .expect("Iterator must be sized")
, since it will be possible to extend it out of unsized iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure 👍 sounds like a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9235 is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao is the new usage with extend
what you meant?
We also have to know the final count for creating the array, are you thinking of calculating that with val_buf.len() / mem::size_of::<<T as ArrowPrimitiveType>::Native>()
or is there an other way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is what I meant: it should now be possible to do
let values = vec![0i32, 1];
let values = values.iter().map(|x| x + 1);
let buffer: Buffer: values.collect();
wrt to the len
: that is a good point that I have had not thought about. 👍
One option is to do what you wrote. Another could be (I haven't tried to compile this, as Fn
could become FnMut
):
let mut count = 0;
let iter = iter.map(|x| {
count += 1;
x
});
I would probably have taken your idea, though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah with just Buffer
and collect
this is even more clean. I adapted the code with the new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this feature (building a PrimitiveArray
from an iterator of T
rather than Option<T>
would have been helpful for me in a few instances too).
In our case we want ot do something like:
fn make_it_an_array(v: Vec<u32>) -> ArrayRef {
...
}
Thanks @Dandandan
let mut val_buf = MutableBuffer::new( | ||
data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(), | ||
); | ||
|
||
iter.for_each(|item| { | ||
val_buf.push(item); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9235 is merged
Codecov Report
@@ Coverage Diff @@
## master #9293 +/- ##
==========================================
+ Coverage 81.84% 81.89% +0.04%
==========================================
Files 215 215
Lines 52949 52988 +39
==========================================
+ Hits 43336 43392 +56
+ Misses 9613 9596 -17
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot, and thanks for also fixing the un-ran test. 👍
…ull) values The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`. I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this. The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec) Closes #9293 from Dandandan/array_iter_non_null Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…to improve performance This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries. This uses the PR #9293 The case of converting an int32 to an array improved by ~5x according to the microbenchmark: ``` to_array_of_size 100000 time: [55.501 us 55.627 us 55.809 us] change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe ``` And on TCPH query 1 (SF=1, 16 partitions). PR: ``` Query 1 iteration 0 took 90.8 ms Query 1 iteration 1 took 106.6 ms Query 1 iteration 2 took 101.1 ms Query 1 iteration 3 took 101.5 ms Query 1 iteration 4 took 96.9 ms Query 1 iteration 5 took 100.3 ms Query 1 iteration 6 took 99.6 ms Query 1 iteration 7 took 100.4 ms Query 1 iteration 8 took 104.2 ms Query 1 iteration 9 took 100.3 ms Query 1 avg time: 100.18 ms ``` Master: ``` Query 1 iteration 0 took 121.1 ms Query 1 iteration 1 took 123.4 ms Query 1 iteration 2 took 121.0 ms Query 1 iteration 3 took 121.0 ms Query 1 iteration 4 took 123.0 ms Query 1 iteration 5 took 121.7 ms Query 1 iteration 6 took 121.7 ms Query 1 iteration 7 took 120.2 ms Query 1 iteration 8 took 119.7 ms Query 1 iteration 9 took 121.4 ms Query 1 avg time: 121.43 ms ``` Closes #9305 from Dandandan/to_array_of_size_perf Lead-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…ull) values The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`. I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this. The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec) Closes apache#9293 from Dandandan/array_iter_non_null Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…to improve performance This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries. This uses the PR apache#9293 The case of converting an int32 to an array improved by ~5x according to the microbenchmark: ``` to_array_of_size 100000 time: [55.501 us 55.627 us 55.809 us] change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe ``` And on TCPH query 1 (SF=1, 16 partitions). PR: ``` Query 1 iteration 0 took 90.8 ms Query 1 iteration 1 took 106.6 ms Query 1 iteration 2 took 101.1 ms Query 1 iteration 3 took 101.5 ms Query 1 iteration 4 took 96.9 ms Query 1 iteration 5 took 100.3 ms Query 1 iteration 6 took 99.6 ms Query 1 iteration 7 took 100.4 ms Query 1 iteration 8 took 104.2 ms Query 1 iteration 9 took 100.3 ms Query 1 avg time: 100.18 ms ``` Master: ``` Query 1 iteration 0 took 121.1 ms Query 1 iteration 1 took 123.4 ms Query 1 iteration 2 took 121.0 ms Query 1 iteration 3 took 121.0 ms Query 1 iteration 4 took 123.0 ms Query 1 iteration 5 took 121.7 ms Query 1 iteration 6 took 121.7 ms Query 1 iteration 7 took 120.2 ms Query 1 iteration 8 took 119.7 ms Query 1 iteration 9 took 121.4 ms Query 1 avg time: 121.43 ms ``` Closes apache#9305 from Dandandan/to_array_of_size_perf Lead-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…ull) values The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`. I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this. The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec) Closes apache#9293 from Dandandan/array_iter_non_null Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…to improve performance This function `to_array_of_size` is about 8.3% of total instructions in the db-benchmark (aggregation) queries. This uses the PR apache#9293 The case of converting an int32 to an array improved by ~5x according to the microbenchmark: ``` to_array_of_size 100000 time: [55.501 us 55.627 us 55.809 us] change: [-82.457% -82.384% -82.299%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe ``` And on TCPH query 1 (SF=1, 16 partitions). PR: ``` Query 1 iteration 0 took 90.8 ms Query 1 iteration 1 took 106.6 ms Query 1 iteration 2 took 101.1 ms Query 1 iteration 3 took 101.5 ms Query 1 iteration 4 took 96.9 ms Query 1 iteration 5 took 100.3 ms Query 1 iteration 6 took 99.6 ms Query 1 iteration 7 took 100.4 ms Query 1 iteration 8 took 104.2 ms Query 1 iteration 9 took 100.3 ms Query 1 avg time: 100.18 ms ``` Master: ``` Query 1 iteration 0 took 121.1 ms Query 1 iteration 1 took 123.4 ms Query 1 iteration 2 took 121.0 ms Query 1 iteration 3 took 121.0 ms Query 1 iteration 4 took 123.0 ms Query 1 iteration 5 took 121.7 ms Query 1 iteration 6 took 121.7 ms Query 1 iteration 7 took 120.2 ms Query 1 iteration 8 took 119.7 ms Query 1 iteration 9 took 121.4 ms Query 1 avg time: 121.43 ms ``` Closes apache#9305 from Dandandan/to_array_of_size_perf Lead-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
The idea of this PR is to have a function
from_iter_values
that (just likefrom_iter
) creates an array based on an iterator, but fromT
instead ofOption<T>
.I have seen some places in DataFusion (especially
to_array_of_size
) where anArray
is generated from aVec
of items, which could be replaced by this.The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of
Vec
for allocating / dropping the Vec)