ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length) #9235
Conversation
```diff
-let lengths: Vec<OffsetSize> = slice
-    .windows(2)
-    .map(|offset| offset[1] - offset[0])
-    .collect();
+let lengths = slice.windows(2).map(|offset| offset[1] - offset[0]);
+
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(lengths);
```
This is the main difference in this PR for the length: we extend from the iterator instead of extending a vector and then memcopying the data.
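A minimal sketch of that difference, using plain `Vec`s as a stand-in for `MutableBuffer` (an illustration only, not the PR's API):

```rust
fn main() {
    let slice: &[i32] = &[0, 3, 7, 12];

    // Before: materialize an intermediate Vec of lengths, then copy its
    // contents into the buffer (two passes, one extra allocation).
    let lengths: Vec<i32> = slice.windows(2).map(|w| w[1] - w[0]).collect();
    let mut before: Vec<i32> = Vec::new();
    before.extend_from_slice(&lengths); // the memcpy step

    // After: write each length straight from the iterator into the buffer
    // (one pass, no intermediate allocation).
    let mut after: Vec<i32> = Vec::new();
    after.extend(slice.windows(2).map(|w| w[1] - w[0]));

    assert_eq!(before, after);
}
```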
```diff
-    .map(|(l, r)| op(*l, *r))
-    .collect::<Vec<T::Native>>();
+    .map(|(l, r)| op(*l, *r));
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(values);
```
This is the main difference in this PR for the arithmetic kernels: we extend from the iterator instead of extending a vector and then memcopying the data.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #9235      +/-   ##
==========================================
- Coverage   81.61%   81.61%   -0.01%
==========================================
  Files         215      215
  Lines       52508    52589      +81
==========================================
+ Hits        42854    42918      +64
- Misses       9654     9671      +17
```

Continue to review full report at Codecov.
I pulled this PR and took a brief look at all the changes, and filed four trivial comments. Basically I can understand the changes but can't argue deeply about the fundamental benefit vs. risk. What looks good to me: the core changes are more concise than before, and the newly added functions are well documented.
Among the relevant benches, the `buffer_create` bench that I ran shows the following average improvements compared to the merge base 08cccd688 (ARROW-11168: [Rust] [Doc] Fix cargo doc warnings):
- mutable: -11.748%
- mutable prepared: -17.702%
- from_slice: +6.3866%
- from_slice prepared: -9.9773%
BTW, the size of the file list shocked me at first glance. Perhaps it's better to add/test new functions before refactoring existing code. Anyway, the overall LOC change is not large, so LGTM as far as change size goes. Thanks @jorgecarleitao.
rust/arrow/src/buffer.rs (outdated)
```rust
if self.len >= self.capacity() {
    let (lower, _) = iterator.size_hint();
    let additional = (lower + 1) * size;
    self.reserve_at_least(self.len + additional);
```
I am sorry, I did not understand your question.

This is being called here because iterators do not have a reliable `size_hint` (thus the `+ 1`). This code is spiritually equal to `extend_desugared` in the standard library.
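As an illustration of why the hint alone cannot be trusted to size the allocation exactly (a standalone sketch, not PR code):

```rust
fn main() {
    // A range knows its exact length: the hint is precise.
    assert_eq!((0..10u32).size_hint(), (10, Some(10)));

    // After `filter`, the lower bound collapses to 0 and only an upper
    // bound remains, so reserving `lower * size` bytes up front could
    // reserve nothing at all; the loop must still re-check capacity.
    let filtered = (0..10u32).filter(|x| x % 2 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(10)));
}
```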
On my computer this is now faster than SIMD in master 🤣
I am drafting this because there is room for improvement in the API and soundness; details here: https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/15
Looks like there is an error in the SIMD tests: https://github.com/apache/arrow/pull/9235/checks?check_run_id=1733044845
rust/arrow/src/buffer.rs (outdated)
```rust
let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };

while let Some(item) = iterator.next() {
    if len >= capacity {
```
@jorgecarleitao Line 1011: the test condition may cause an overflow; perhaps `len` should be increased by `size` before the comparison. The following test shows the problem:
```rust
#[test]
fn mutable_extend_from_iter_write_overflow() {
    let mut buf = MutableBuffer::new(0);
    buf.extend(vec![1_u8; 63]);
    assert_eq!(63, buf.len());
    assert_eq!(64, buf.capacity());

    buf.extend(vec![1_u16; 1]);
    assert_eq!(65, buf.len());
    assert_eq!(64, buf.capacity());
}
```
I ran this test several times and it always passed, but you can see that `len` ends up bigger than `capacity`.
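A standalone sketch of the fix suggested above (hypothetical helper, not the PR's code): account for the width of the item about to be written before comparing against capacity.

```rust
// Hypothetical helper illustrating the corrected bound check.
fn has_room(len: usize, size: usize, capacity: usize) -> bool {
    // Compare `len + size`, not `len`, so a multi-byte write can never
    // start inside the allocation but end past it.
    len + size <= capacity
}

fn main() {
    // 63 of 64 bytes used: a 2-byte u16 must NOT be written in place...
    assert!(!has_room(63, 2, 64));
    // ...while a 1-byte u8 still fits.
    assert!(has_room(63, 1, 64));
    // The old check (`len >= capacity`) would have let the u16 through,
    // because 63 < 64.
}
```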
rust/arrow/src/buffer.rs (outdated)
```rust
let (lower, _) = iterator.size_hint();
let additional = (lower + 1) * size;
let (new_ptr, new_capacity) =
    unsafe { reallocate(ptr, capacity, len + additional) };
```
This is unsound if unwinding is enabled, because `Iterator::next` may panic on a future iteration of the loop, causing `self` to be dropped while its `ptr` and `capacity` are incorrect. This is why a drop guard like `SetLenOnDrop` is needed.
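For reference, a minimal sketch of such a guard, modeled on the `SetLenOnDrop` in the standard library's `Vec` internals (an assumed shape, not necessarily this PR's final code):

```rust
/// Writes a locally tracked length back to the owner's `len` on drop —
/// including the drop that happens while unwinding from a panic — so the
/// buffer's `len` never points past the bytes actually initialized.
struct SetLenOnDrop<'a> {
    len: &'a mut usize,
    local_len: usize,
}

impl<'a> SetLenOnDrop<'a> {
    fn new(len: &'a mut usize) -> Self {
        let local_len = *len;
        SetLenOnDrop { len, local_len }
    }
}

impl Drop for SetLenOnDrop<'_> {
    fn drop(&mut self) {
        *self.len = self.local_len;
    }
}
```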
Thanks a lot. I do agree that this is unsound atm.

Note that arrow does not support complex structs on its buffers (i.e. we only support `u8`–`u64`, `i8`–`i64`, `f32` and `f64`), which means that we never need to call `drop` on the elements. Under this, do we still need a valid `len`? I understood that the `len` was only needed because we could have to call drop on the elements up to `len`.

With `SetLenOnDrop`, we borrow a mutable reference to `self.len`, which won't allow us to call `self.reserve` inside the `if`. Could this be the reason why `SetLenOnDrop` is not being used in `extend_desugared` (which, in my understanding, this part of the code is reproducing)?

(I am working on a fix)
> Note that arrow does not support complex structs on its buffers (i.e. we only support `u8`–`u64`, `i8`–`i64`, `f32` and `f64`), which means that we never need to call `drop` on the elements. Under this, do we still need a valid `len`?

No. Dropping before `self.len` is updated should be fine; `SetLenOnDrop` is not actually necessary for soundness or correctness (as long as the pointer and capacity are still valid).

However, in my benchmarking I still found that using `SetLenOnDrop` provided a small performance benefit compared to just updating `self.len` after the loop. I'm not sure why. Maybe the mutable borrow provides some additional hints to the optimizer.
I was able to speed up the fast path of:

```rust
#[inline]
fn extend_from_iter<T: ArrowNativeType, I: Iterator<Item = T>>(
    &mut self,
    mut iterator: I,
) {
    let size = std::mem::size_of::<T>();
    let (lower, _) = iterator.size_hint();
    let additional = lower * size;
    self.reserve(additional);

    // this is necessary because of https://github.com/rust-lang/rust/issues/32155
    let mut len = SetLenOnDrop::new(&mut self.len);
    let mut dst = unsafe { self.data.as_ptr().add(len.local_len) as *mut T };
    let capacity = self.capacity;

    while len.local_len + size <= capacity {
        if let Some(item) = iterator.next() {
            unsafe {
                std::ptr::write(dst, item);
                dst = dst.add(1);
            }
            len.local_len += size;
        } else {
            break;
        }
    }
    drop(len);

    iterator.for_each(|item| self.push(item));
}
```

I haven't benchmarked the slow path (the second loop), which I implemented using safe code. It's possible that it could be made faster with unsafe code, but I don't expect large gains there. My complete code including benchmarks is on this branch: master...mbrubeck:length_faster
I validated @mbrubeck's change. @mbrubeck, thank you so much for taking your time in addressing this; I have been really struggling with the compiler on this one, and your implementation is exactly what I was trying to achieve in this PR: an efficient extend.

When we commit to master we squash all commits. Would you be willing to take credit for this whole PR plus your changes? IMO the most relevant part of this whole thing was your change. Alternatively, we could merge this PR as is, and then apply your changes in a separate one. Let me know what you prefer and I will act accordingly (@alamb, let me know if you see any problem here).
I think it would be more appropriate to keep your name as the author, because you are more likely to be able to answer future questions about it, etc. But feel free to credit me in the commit message, if you like. Thanks!
I made a mistake in benchmarking and have to retract my statement. @mbrubeck, even though there is a small difference between the two extends in the individual benches, the difference shows up when they are used in a non-trivial operation.

What I am using for the final benches:

```bash
git checkout 7bb2f3c13f0e652b01e3fecbee0ae32ff094f3e6
cargo bench --bench arithmetic_kernels -- "add 512"
git checkout cdbbc6f9d81d37216dc158177405529988fa0279
cargo bench --bench arithmetic_kernels -- "add 512"
```

The difference in code is

```rust
let values = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r));
let mut buffer = MutableBuffer::new(0);
unsafe { buffer.extend_from_trusted_len_iter(values) };
```

to

```rust
let buffer = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r))
    .collect();
```

IMO this is interesting by itself.
I updated the description and this is now ready for a review.
I think this looks good @jorgecarleitao -- thank you. Nice work.
Given the unsafe nature of this PR, I would feel better about using `from_trusted_len_iter` if it could detect when the invariant had been violated and `panic!` -- perhaps by tracking how many items it actually wrote (I had a suggestion on how to do so).
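A hedged sketch of that suggestion (plain `Vec` and hypothetical names, not the PR's implementation): count what the iterator actually yields and panic if it breaks the trusted-length contract.

```rust
/// Hypothetical checked variant: extends `dst` from `iter` and panics if
/// the iterator yields a different number of items than its size_hint
/// lower bound promised.
fn extend_from_trusted_len_iter_checked<T, I: Iterator<Item = T>>(
    dst: &mut Vec<T>,
    iter: I,
) {
    let (expected, _) = iter.size_hint();
    let mut written = 0usize;
    for item in iter {
        dst.push(item);
        written += 1; // track how many items were actually written
    }
    assert_eq!(
        written, expected,
        "trusted-len violated: got {} items, expected {}",
        written, expected
    );
}

fn main() {
    let mut v = Vec::new();
    extend_from_trusted_len_iter_checked(&mut v, 0..5u32); // ok: yields 5
    assert_eq!(v, vec![0, 1, 2, 3, 4]);
}
```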
```diff
@@ -1003,6 +1156,28 @@ impl PartialEq for MutableBuffer {
 unsafe impl Sync for MutableBuffer {}
 unsafe impl Send for MutableBuffer {}
 
+struct SetLenOnDrop<'a> {
```
this construct is fascinating, but I will take your word that it was necessary for speedup
I gave this one another quick review and it is looking good. 👍
ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length)

# Rationale

Rust forbids safely accessing uninitialized memory because it is undefined behavior. However, when building `Buffer`s, it is important to be able to _write_ to uninitialized memory regions, thereby avoiding the need to write _something_ to it before using it.

Currently, all our initializations are zeroed, which is expensive. #9076 modifies our allocator to allocate uninitialized regions. However, by itself, this is not useful if we do not offer any methods to write to those (uninitialized) regions.

# This PR

This PR is built on top of #9076 and introduces methods to extend a `MutableBuffer` from an iterator, thereby offering an API to efficiently grow `MutableBuffer` without having to initialize memory regions with zeros (i.e. without `with_bitset` and the like).

This PR also introduces methods to create a `Buffer` from an iterator and a `trusted_len` iterator.

The design is inspired by `Vec`, with the catch that we use stable Rust (i.e. no trait specialization, no `TrustedLen`), and thus have to expose a few more methods than what `Vec` exposes. This means that we can't use that (nicer) API for trusted-len iterators based on `collect()`.

Note that, as before, there are still `unsafe` uses of the `MutableBuffer` derived from the fact that it is not generic over a type `T` (and thus people can mix types and grow the buffer in unsound ways).

Special thanks to @mbrubeck for all the help on this, originally discussed [here](https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/6).

```bash
git checkout master
cargo bench --bench arithmetic_kernels
git checkout length_faster
cargo bench --bench arithmetic_kernels

git checkout 16bc720
cargo bench --bench length_kernel
git checkout length_faster
```

```
Switched to branch 'length_faster'
(use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 02s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-ec2cc20ce07d9b83
Gnuplot not found, using plotters backend
add 512                 time:   [522.24 ns 523.67 ns 525.26 ns]
                        change: [-21.738% -20.960% -20.233%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

subtract 512            time:   [503.18 ns 504.93 ns 506.81 ns]
                        change: [-21.741% -21.075% -20.308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

multiply 512            time:   [508.25 ns 512.04 ns 516.06 ns]
                        change: [-22.569% -21.946% -21.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

divide 512              time:   [1.4711 us 1.4753 us 1.4799 us]
                        change: [-24.783% -23.305% -22.176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

limit 512, 512          time:   [373.47 ns 377.76 ns 382.21 ns]
                        change: [+3.3055% +4.4193% +5.5923%] (p = 0.00 < 0.05)
                        Performance has regressed.

add_nulls_512           time:   [502.94 ns 504.51 ns 506.28 ns]
                        change: [-24.876% -24.299% -23.709%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

divide_nulls_512        time:   [1.4843 us 1.4931 us 1.5053 us]
                        change: [-22.968% -22.243% -21.420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  15 (15.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe
```

Length (against the commit that fixes the bench, `16bc7200f3baa6e526aea7135c60dcc949c9b592`, not master):

```
length                  time:   [1.5379 us 1.5408 us 1.5437 us]
                        change: [-97.311% -97.295% -97.278%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
```

Closes #9235 from jorgecarleitao/length_faster

Lead-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR improves the performance of certain time / date casts by using the brand new API proposed in #9235. That API allows for very fast execution of unary and infallible operations on primitive arrays, and this PR leverages that for cast operations that require a simple mathematical operation.

```
Switched to branch 'fast_unary'
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 06s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/cast_kernels-25ee76597a8b997b
Gnuplot not found, using plotters backend
cast date64 to date32 512
                        time:   [1.1668 us 1.1706 us 1.1749 us]
                        change: [-83.347% -83.248% -83.144%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast date32 to date64 512
                        time:   [899.73 ns 930.58 ns 971.56 ns]
                        change: [-86.799% -86.520% -86.190%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

cast time32s to time32ms 512
                        time:   [728.73 ns 732.33 ns 735.90 ns]
                        change: [-54.503% -54.201% -53.917%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast time64ns to time32s 512
                        time:   [4.8374 us 4.8447 us 4.8526 us]
                        change: [-57.022% -56.791% -56.587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

Closes #9297 from jorgecarleitao/fast_unary

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>