
ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length) #9235

Closed · wants to merge 11 commits
Conversation

@jorgecarleitao (Member) commented Jan 17, 2021

Rationale

Rust forbids safely accessing uninitialized memory because it is undefined behavior. However, when building Buffers, it is important to be able to write to uninitialized memory regions, thereby avoiding the need to write something to it before using it.

Currently, all our initializations are zeroed, which is expensive. #9076 modifies our allocator to allocate uninitialized regions. However, by itself, this is not useful if we do not offer any methods to write to those (uninitialized) regions.

This PR

This PR is built on top of #9076 and introduces methods to extend a MutableBuffer from an iterator, thereby offering an API to efficiently grow MutableBuffer without having to initialize memory regions with zeros (i.e. without with_bitset and the like).

This PR also introduces methods to create a Buffer from an iterator and a trusted_len iterator.

The design is inspired by Vec, with the catch that we use stable Rust (i.e. no trait specialization, no TrustedLen), and thus have to expose a few more methods than Vec does. This means that we can't use the (nicer) collect()-based API for trusted-len iterators.

Note that, as before, there are still unsafe uses of MutableBuffer, derived from the fact that it is not generic over a type T (and thus people can mix types and grow the buffer in unsound ways).
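A minimal usage sketch of the new API (adapted from the doctest discussed further down; the safety contract is that the iterator reports an exact length):

```rust
use arrow::buffer::MutableBuffer;

// Build a buffer from a trusted-len iterator, with no zero-initialization.
let v = vec![1u32, 2, 3];
let iter = v.iter().map(|x| x * 2);
let mut buffer = MutableBuffer::new(0);
// Safety: iterators over slices report an exact `size_hint`.
unsafe { buffer.extend_from_trusted_len_iter(iter) };
```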

Special thanks to @mbrubeck for all the help on this, originally discussed here.

```bash
git checkout master
cargo bench --bench arithmetic_kernels
git checkout length_faster
cargo bench --bench arithmetic_kernels

git checkout 16bc7200f3baa6e526aea7135c60dcc949c9b592
cargo bench --bench length_kernel
git checkout length_faster
```

```
Switched to branch 'length_faster'
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 02s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-ec2cc20ce07d9b83
Gnuplot not found, using plotters backend
add 512                 time:   [522.24 ns 523.67 ns 525.26 ns]
                        change: [-21.738% -20.960% -20.233%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

subtract 512            time:   [503.18 ns 504.93 ns 506.81 ns]
                        change: [-21.741% -21.075% -20.308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

multiply 512            time:   [508.25 ns 512.04 ns 516.06 ns]
                        change: [-22.569% -21.946% -21.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

divide 512              time:   [1.4711 us 1.4753 us 1.4799 us]
                        change: [-24.783% -23.305% -22.176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

limit 512, 512          time:   [373.47 ns 377.76 ns 382.21 ns]
                        change: [+3.3055% +4.4193% +5.5923%] (p = 0.00 < 0.05)
                        Performance has regressed.

add_nulls_512           time:   [502.94 ns 504.51 ns 506.28 ns]
                        change: [-24.876% -24.299% -23.709%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

divide_nulls_512        time:   [1.4843 us 1.4931 us 1.5053 us]
                        change: [-22.968% -22.243% -21.420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  15 (15.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe
```

Length (against the commit that fixes the bench, `16bc7200f3baa6e526aea7135c60dcc949c9b592`, not master):

```
length                  time:   [1.5379 us 1.5408 us 1.5437 us]
                        change: [-97.311% -97.295% -97.278%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
```


Comment on lines 39 to 46:

```diff
-let lengths: Vec<OffsetSize> = slice
-    .windows(2)
-    .map(|offset| offset[1] - offset[0])
-    .collect();
+let lengths = slice.windows(2).map(|offset| offset[1] - offset[0]);
+
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(lengths);
```

@jorgecarleitao (Member, Author):

This is the main difference in this PR for the length kernel: we extend from the iterator directly instead of collecting into a vector and then memcopying the data.

Comment on lines 150 to 148:

```diff
-    .map(|(l, r)| op(*l, *r))
-    .collect::<Vec<T::Native>>();
+    .map(|(l, r)| op(*l, *r));
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(values);
```
@jorgecarleitao (Member, Author):

This is the main difference in this PR for the arithmetic kernels: extend from the iterator directly instead of collecting into a vector and then memcopying the data.

@codecov-io commented Jan 17, 2021

Codecov Report

Merging #9235 (38b131d) into master (c413566) will decrease coverage by 0.00%.
The diff coverage is 84.48%.


```diff
@@            Coverage Diff             @@
##           master    #9235      +/-   ##
==========================================
- Coverage   81.61%   81.61%   -0.01%
==========================================
  Files         215      215
  Lines       52508    52589      +81
==========================================
+ Hits        42854    42918      +64
- Misses       9654     9671      +17
```
| Impacted Files | Coverage Δ |
|---|---|
| rust/arrow/src/bytes.rs | 53.12% <ø> (ø) |
| rust/arrow/src/array/transform/fixed_binary.rs | 78.94% <50.00%> (ø) |
| rust/arrow/src/buffer.rs | 93.43% <81.17%> (-2.77%) ⬇️ |
| rust/arrow/src/compute/kernels/arithmetic.rs | 89.55% <95.83%> (-0.28%) ⬇️ |
| rust/arrow/src/array/transform/primitive.rs | 100.00% <100.00%> (ø) |
| rust/arrow/src/compute/kernels/length.rs | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c413566...38b131d.

@mqy (Contributor) left a comment:

I pulled this PR and took a brief look at all the changes, and filed four trivial comments. I can follow the changes, but I can't argue deeply about the fundamental benefit vs. risk. What looks good to me: the core changes are more concise than before, and the newly added functions are well documented.

Among the relevant benches, the buffer_create bench that I ran shows these average changes relative to the merge base 08cccd688 (ARROW-11168: [Rust] [Doc] Fix cargo doc warnings):

  • mutable: -11.748%
  • mutable prepared: -17.702%
  • from_slice: +6.3866%
  • from_slice prepared: -9.9773%

BTW, the size of the file list shocked me at first glance. Perhaps it would be better to add and test new functions before refactoring existing code. Anyway, the overall LOC change is not large, so LGTM on change size. Thanks @jorgecarleitao.

```rust
if self.len >= self.capacity() {
    let (lower, _) = iterator.size_hint();
    let additional = (lower + 1) * size;
    self.reserve_at_least(self.len + additional);
```

@jorgecarleitao (Member, Author):

I am sorry, I did not understand your question.

This is called here because iterators do not have a reliable size_hint (hence the + 1). This code is spiritually equivalent to extend_desugared in the standard library.
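For comparison, a simplified sketch of the shape of the standard library's Vec::extend_desugared (reconstructed from memory, not this PR's code):

```rust
fn extend_desugared<T>(vec: &mut Vec<T>, mut iterator: impl Iterator<Item = T>) {
    while let Some(element) = iterator.next() {
        let len = vec.len();
        if len == vec.capacity() {
            // `lower` is only a hint, so reserve at least one extra slot.
            let (lower, _) = iterator.size_hint();
            vec.reserve(lower.saturating_add(1));
        }
        unsafe {
            std::ptr::write(vec.as_mut_ptr().add(len), element);
            // `next()` runs user code that can panic, so bump the length
            // after every write.
            vec.set_len(len + 1);
        }
    }
}
```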

@jorgecarleitao (Member, Author):

On my computer this is now faster than simd in master 🤣

@jorgecarleitao jorgecarleitao marked this pull request as draft January 20, 2021 04:02
@jorgecarleitao (Member, Author):

I am converting this to a draft because there is room for improvement in the API and its soundness; details here: https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/15

@jorgecarleitao jorgecarleitao marked this pull request as ready for review January 20, 2021 06:55
@alamb (Contributor) commented Jan 20, 2021

Looks like there is an error in the SIMD tests https://github.com/apache/arrow/pull/9235/checks?check_run_id=1733044845


```
failures:

---- src/buffer.rs - buffer::MutableBuffer::extend_from_trusted_len_iter (line 1036) stdout ----
error[E0716]: temporary value dropped while borrowed
 --> src/buffer.rs:1038:12
  |
5 | let iter = vec![1u32].iter().map(|x| x * 2);
  |            ^^^^^^^^^^                      - temporary value is freed at the end of this statement
  |            |
  |            creates a temporary which is freed while still in use
6 | let mut buffer = MutableBuffer::new(0);
7 | unsafe { buffer.extend_from_trusted_len_iter(iter) };
  |                                              ---- borrow later used here
  |
  = note: consider using a `let` binding to create a longer lived value
  = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error
```
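A likely fix, per the compiler's hint (a sketch, not necessarily the exact change that was made): bind the Vec so it outlives the iterator that borrows from it.

```rust
// The named binding keeps the Vec alive while `iter` borrows from it.
let v = vec![1u32];
let iter = v.iter().map(|x| x * 2);
let mut buffer = MutableBuffer::new(0);
// Safety: slice iterators report an exact length.
unsafe { buffer.extend_from_trusted_len_iter(iter) };
```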

```rust
let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };

while let Some(item) = iterator.next() {
    if len >= capacity {
```
@mqy (Contributor) commented Jan 20, 2021:

@jorgecarleitao Line 1011: the test condition may let a write overflow the buffer; perhaps len should be increased by size before the comparison. The following test shows the problem:

```rust
#[test]
fn mutable_extend_from_iter_write_overflow() {
    let mut buf = MutableBuffer::new(0);

    buf.extend(vec![1_u8; 63]);
    assert_eq!(63, buf.len());
    assert_eq!(64, buf.capacity());

    buf.extend(vec![1_u16; 1]);
    assert_eq!(65, buf.len());
    assert_eq!(64, buf.capacity());
}
```

I ran this test several times and it always passed, but you can see that len ends up bigger than capacity.

@mqy (Contributor) commented Jan 20, 2021

> Looks like there is an error in the SIMD tests https://github.com/apache/arrow/pull/9235/checks?

`cargo test --doc --package arrow -- buffer::MutableBuffer::extend_from_trusted_len_iter --nocapture` reproduces this error.

```rust
let (lower, _) = iterator.size_hint();
let additional = (lower + 1) * size;
let (new_ptr, new_capacity) =
    unsafe { reallocate(ptr, capacity, len + additional) };
```
A contributor commented:

This is unsound if unwinding is enabled, because Iterator::next may panic on a future iteration of the loop, causing self to be dropped while its ptr and capacity are incorrect. This is why a drop guard like SetLenOnDrop is needed.
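For reference, a minimal sketch of such a guard, modeled on the standard library's SetLenOnDrop (the PR's own version appears further down): the length lives in a local and is written back even if the iterator panics mid-loop.

```rust
struct SetLenOnDrop<'a> {
    len: &'a mut usize,
    local_len: usize,
}

impl<'a> SetLenOnDrop<'a> {
    fn new(len: &'a mut usize) -> Self {
        SetLenOnDrop { local_len: *len, len }
    }
}

impl Drop for SetLenOnDrop<'_> {
    fn drop(&mut self) {
        // Runs on both normal exit and unwind, keeping `len` consistent.
        *self.len = self.local_len;
    }
}
```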

@jorgecarleitao (Member, Author) commented Jan 20, 2021:

Thanks a lot. I do agree that this is unsound atm.

Note that arrow does not support complex structs in its buffers (i.e. we only support u8-u64, i8-i64, f32 and f64), which means that we never need to call drop on the elements. Given this, do we still need a valid len? I understood that len was only needed because we might have to call drop on the elements up to len.

With SetLenOnDrop, we borrow a mutable reference to self.len, which won't allow us to call self.reserve inside the if. Could this be the reason why SetLenOnDrop is not used in extend_desugared (which, in my understanding, this part of the code reproduces)?

(I am working on a fix)

@mbrubeck (Contributor) commented Jan 20, 2021:

> Note that arrow does not support complex structs in its buffers (i.e. we only support u8-u64, i8-i64, f32 and f64), which means that we never need to call drop on the elements. Given this, do we still need a valid len?

No. Dropping before self.len is updated should be fine; SetLenOnDrop is not actually necessary for soundness or correctness (as long as the pointer and capacity are still valid).

However, in my benchmarking I still found that using SetLenOnDrop provided a small performance benefit compared to just updating self.len after the loop. I'm not sure why. Maybe the mutable borrow provides some additional hints to the optimizer.

@mbrubeck (Contributor) commented Jan 20, 2021

I was able to speed up the fast path of extend_from_iter by a few percent, by splitting the loop into two parts. This version is only about 8% slower than extend_from_trusted_len_iter. (Previously it was about 12% slower.) Both are still much slower than extend_from_slice.

```rust
#[inline]
fn extend_from_iter<T: ArrowNativeType, I: Iterator<Item = T>>(
    &mut self,
    mut iterator: I,
) {
    let size = std::mem::size_of::<T>();
    let (lower, _) = iterator.size_hint();
    let additional = lower * size;
    self.reserve(additional);

    // this is necessary because of https://github.com/rust-lang/rust/issues/32155
    let mut len = SetLenOnDrop::new(&mut self.len);
    let mut dst = unsafe { self.data.as_ptr().add(len.local_len) as *mut T };
    let capacity = self.capacity;

    while len.local_len + size <= capacity {
        if let Some(item) = iterator.next() {
            unsafe {
                std::ptr::write(dst, item);
                dst = dst.add(1);
            }
            len.local_len += size;
        } else {
            break;
        }
    }
    drop(len);

    iterator.for_each(|item| self.push(item));
}
```

I haven't benchmarked the slow path (the second loop), which I implemented using safe code. It's possible that it could be made faster with unsafe code, but I don't expect large gains there.

My complete code including benchmarks is on this branch: master...mbrubeck:length_faster

@jorgecarleitao (Member, Author):

I validated that with @mbrubeck's change, we can just use collect and drop the trusted_len and unsafe hack, with the same performance on the arithmetics 🎉🎉🎉

@mbrubeck, thank you so much for taking the time to address this; I have been really struggling with the compiler on this one, and your implementation is exactly what I was trying to achieve in this PR: an efficient, safe, and expressive API for our users to build a MutableBuffer.

When we commit to master we squash all commits. Would you be willing to take credit for this whole PR plus your changes? IMO the most relevant part of this whole thing was your change, which enables collect.

Alternatively, we could merge this PR as is and then apply your changes in a separate one. Let me know what you prefer and I will act accordingly (@alamb, let me know if you see any problem here).

@mbrubeck (Contributor):

> When we commit to master we squash all commits. Would you be willing to take credit for this whole PR + your changes?

I think it would be more appropriate to keep your name as the author, because you are more likely to be able to answer future questions about it, etc. But feel free to credit me in the commit message, if you like. Thanks!

@jorgecarleitao jorgecarleitao marked this pull request as draft January 20, 2021 19:58
@jorgecarleitao (Member, Author) commented Jan 20, 2021:

I made a mistake in benchmarking and have to retract my statement regarding collect. I am really sorry for the noise:

@mbrubeck, even though there is a small difference between the two extends in the individual benches, when used in a non-trivial operation (arithmetic), collect is about 65% slower than using extend_from_trusted_len_iter.

This hints that extend_from_trusted_len_iter does have a significant benefit, if nothing else in helping the compiler optimize away some code.

What I am using for the final benches:

```bash
git checkout 7bb2f3c13f0e652b01e3fecbee0ae32ff094f3e6
cargo bench --bench arithmetic_kernels -- "add 512"
git checkout cdbbc6f9d81d37216dc158177405529988fa0279
cargo bench --bench arithmetic_kernels -- "add 512"
```

The difference in code is from

```rust
let values = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r));
let mut buffer = MutableBuffer::new(0);
unsafe { buffer.extend_from_trusted_len_iter(values) };
```

to

```rust
let buffer = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r))
    .collect();
```
IMO this is interesting by itself.

@jorgecarleitao jorgecarleitao changed the title ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length) ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length) Jan 21, 2021
@jorgecarleitao jorgecarleitao marked this pull request as ready for review January 21, 2021 21:11
@jorgecarleitao (Member, Author):

I updated the description and this is now ready for a review.

@alamb (Contributor) left a comment:

I think this looks good @jorgecarleitao -- thank you. Nice work.

Given the unsafe nature of this PR, I would feel better about using from_trusted_len_iter if it could detect when the invariant had been violated and panic! -- perhaps by tracking how many items it actually wrote (I had a suggestion on how to do so).
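A sketch of the kind of check being suggested (hypothetical, just illustrating the idea of counting writes against the promised length):

```rust
/// Debug-check the trusted-len contract: the iterator must yield exactly
/// as many items as its `size_hint` upper bound promises.
fn check_trusted_len<T>(iterator: impl Iterator<Item = T>) {
    let (_, upper) = iterator.size_hint();
    let upper = upper.expect("trusted-len iterators report an exact upper bound");
    let mut written = 0usize;
    for _item in iterator {
        // in the real code, `_item` would be written into the buffer here
        written += 1;
    }
    debug_assert_eq!(written, upper, "trusted_len invariant violated");
}
```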

```diff
@@ -1003,6 +1156,28 @@ impl PartialEq for MutableBuffer {
 unsafe impl Sync for MutableBuffer {}
 unsafe impl Send for MutableBuffer {}

+struct SetLenOnDrop<'a> {
```
A contributor commented:

This construct is fascinating, but I will take your word that it was necessary for the speedup.

@alamb (Contributor) left a comment:

I gave this one another quick review and it is looking good. 👍

@alamb alamb closed this in 13e2134 Jan 23, 2021
kszucs pushed a commit that referenced this pull request Jan 25, 2021

ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length)

Closes #9235 from jorgecarleitao/length_faster

Lead-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
nevi-me pushed a commit that referenced this pull request Jan 28, 2021
This PR improves the performance of certain time/date casts by using the brand-new API proposed in #9235.

That API allows very fast execution of unary, infallible operations on primitive arrays, and this PR leverages it for cast operations that amount to a simple mathematical operation.
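A sketch of the shape of such a unary kernel (names and trait bounds are illustrative, not the PR's exact code; the real API constrains `T` to Arrow native types):

```rust
use arrow::buffer::MutableBuffer;

// Apply an infallible function over primitive values, collecting straight
// into a MutableBuffer with no zero-initialization and no intermediate Vec.
fn unary_sketch<T: Copy, F: Fn(T) -> T>(values: &[T], op: F) -> MutableBuffer {
    let iter = values.iter().map(|v| op(*v));
    let mut buffer = MutableBuffer::new(0);
    // Safety: slice iterators report an exact length.
    unsafe { buffer.extend_from_trusted_len_iter(iter) };
    buffer
}
```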

```
Switched to branch 'fast_unary'
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 06s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/cast_kernels-25ee76597a8b997b
Gnuplot not found, using plotters backend

cast date64 to date32 512
                        time:   [1.1668 us 1.1706 us 1.1749 us]
                        change: [-83.347% -83.248% -83.144%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast date32 to date64 512
                        time:   [899.73 ns 930.58 ns 971.56 ns]
                        change: [-86.799% -86.520% -86.190%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

cast time32s to time32ms 512
                        time:   [728.73 ns 732.33 ns 735.90 ns]
                        change: [-54.503% -54.201% -53.917%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast time64ns to time32s 512
                        time:   [4.8374 us 4.8447 us 4.8526 us]
                        change: [-57.022% -56.791% -56.587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

Closes #9297 from jorgecarleitao/fast_unary

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021