
ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length) #9235

Closed · wants to merge 11 commits
Conversation

@jorgecarleitao (Member) commented Jan 17, 2021

Rationale

Rust forbids safely accessing uninitialized memory because it is undefined behavior. However, when building Buffers, it is important to be able to write to uninitialized memory regions, thereby avoiding the need to write something to it before using it.

Currently, all our initializations are zeroed, which is expensive. #9076 modifies our allocator to allocate uninitialized regions. However, by itself, this is not useful if we do not offer any methods to write to those (uninitialized) regions.

This PR

This PR is built on top of #9076 and introduces methods to extend a MutableBuffer from an iterator, thereby offering an API to efficiently grow MutableBuffer without having to initialize memory regions with zeros (i.e. without with_bitset and the like).

This PR also introduces methods to create a Buffer from an iterator and a trusted_len iterator.

The design is inspired by Vec, with the catch that we use stable Rust (i.e. no trait specialization, no TrustedLen), and thus have to expose a few more methods than Vec does. This means that we can't use the (nicer) collect()-based API for trusted-len iterators.

Note that, as before, there are still unsafe uses of MutableBuffer, derived from the fact that it is not generic over a type T (and thus people can mix types and grow the buffer in unsound ways).
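A minimal usage sketch of the new API (adapted from the doctest discussed further down; the safety contract is that the iterator reports an exact length):

```rust
use arrow::buffer::MutableBuffer;

// Build a buffer from a trusted-len iterator, with no zero-initialization.
let v = vec![1u32, 2, 3];
let iter = v.iter().map(|x| x * 2);
let mut buffer = MutableBuffer::new(0);
// Safety: iterators over slices report an exact `size_hint`.
unsafe { buffer.extend_from_trusted_len_iter(iter) };
```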

Special thanks to @mbrubeck for all the help on this, originally discussed here.

```bash
git checkout master
cargo bench --bench arithmetic_kernels
git checkout length_faster
cargo bench --bench arithmetic_kernels

git checkout 16bc7200f3baa6e526aea7135c60dcc949c9b592
cargo bench --bench length_kernel
git checkout length_faster
```

```
Switched to branch 'length_faster'
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 02s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-ec2cc20ce07d9b83
Gnuplot not found, using plotters backend
add 512                 time:   [522.24 ns 523.67 ns 525.26 ns]
                        change: [-21.738% -20.960% -20.233%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

subtract 512            time:   [503.18 ns 504.93 ns 506.81 ns]
                        change: [-21.741% -21.075% -20.308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

multiply 512            time:   [508.25 ns 512.04 ns 516.06 ns]
                        change: [-22.569% -21.946% -21.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

divide 512              time:   [1.4711 us 1.4753 us 1.4799 us]
                        change: [-24.783% -23.305% -22.176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

limit 512, 512          time:   [373.47 ns 377.76 ns 382.21 ns]
                        change: [+3.3055% +4.4193% +5.5923%] (p = 0.00 < 0.05)
                        Performance has regressed.

add_nulls_512           time:   [502.94 ns 504.51 ns 506.28 ns]
                        change: [-24.876% -24.299% -23.709%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

divide_nulls_512        time:   [1.4843 us 1.4931 us 1.5053 us]
                        change: [-22.968% -22.243% -21.420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  15 (15.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe
```

Length (against the commit that fixes the bench, `16bc7200f3baa6e526aea7135c60dcc949c9b592`, not master):

```
length                  time:   [1.5379 us 1.5408 us 1.5437 us]
                        change: [-97.311% -97.295% -97.278%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
```


Comment on lines 39 to 46:

```diff
-let lengths: Vec<OffsetSize> = slice
-    .windows(2)
-    .map(|offset| offset[1] - offset[0])
-    .collect();
+let lengths = slice.windows(2).map(|offset| offset[1] - offset[0]);
+
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(lengths);
```

@jorgecarleitao (Member, Author):

This is the main difference in this PR for the length kernel: we extend from the iterator directly instead of collecting into a vector and then memcopying the data.

Comment on lines 150 to 148:

```diff
-    .map(|(l, r)| op(*l, *r))
-    .collect::<Vec<T::Native>>();
+    .map(|(l, r)| op(*l, *r));
+let mut buffer = MutableBuffer::new(0);
+buffer.extend_from_exact_iter(values);
```
@jorgecarleitao (Member, Author):

This is the main difference in this PR for the arithmetic kernels: extend from the iterator directly instead of collecting into a vector and then memcopying the data.

@codecov-io commented Jan 17, 2021

Codecov Report

Merging #9235 (38b131d) into master (c413566) will decrease coverage by 0.00%.
The diff coverage is 84.48%.


```diff
@@            Coverage Diff             @@
##           master    #9235      +/-   ##
==========================================
- Coverage   81.61%   81.61%   -0.01%
==========================================
  Files         215      215
  Lines       52508    52589      +81
==========================================
+ Hits        42854    42918      +64
- Misses       9654     9671      +17
```
| Impacted Files | Coverage Δ |
|---|---|
| rust/arrow/src/bytes.rs | 53.12% <ø> (ø) |
| rust/arrow/src/array/transform/fixed_binary.rs | 78.94% <50.00%> (ø) |
| rust/arrow/src/buffer.rs | 93.43% <81.17%> (-2.77%) ⬇️ |
| rust/arrow/src/compute/kernels/arithmetic.rs | 89.55% <95.83%> (-0.28%) ⬇️ |
| rust/arrow/src/array/transform/primitive.rs | 100.00% <100.00%> (ø) |
| rust/arrow/src/compute/kernels/length.rs | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c413566...38b131d.

@mqy (Contributor) left a comment:

I pulled this PR and took a brief look at all the changes, and filed four trivial comments. I can follow the changes, but I can't argue deeply about the fundamental benefit vs. risk. What looks good to me: the core changes are more concise than before, and the newly added functions are well documented.

Among the relevant benches, the buffer_create bench that I ran shows these average changes relative to the merge base 08cccd688 (ARROW-11168: [Rust] [Doc] Fix cargo doc warnings):

  • mutable: -11.748%
  • mutable prepared: -17.702%
  • from_slice: +6.3866%
  • from_slice prepared: -9.9773%

BTW, the size of the file list shocked me at first glance. Perhaps it would be better to add and test new functions before refactoring existing code. Anyway, the overall LOC change is not large, so LGTM on change size. Thanks @jorgecarleitao.

```rust
if self.len >= self.capacity() {
    let (lower, _) = iterator.size_hint();
    let additional = (lower + 1) * size;
    self.reserve_at_least(self.len + additional);
```

@jorgecarleitao (Member, Author):

I am sorry, I did not understand your question.

This is called here because iterators do not have a reliable size_hint (hence the + 1). This code is spiritually equivalent to extend_desugared in the standard library.
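For comparison, a simplified sketch of the shape of the standard library's Vec::extend_desugared (reconstructed from memory, not this PR's code):

```rust
fn extend_desugared<T>(vec: &mut Vec<T>, mut iterator: impl Iterator<Item = T>) {
    while let Some(element) = iterator.next() {
        let len = vec.len();
        if len == vec.capacity() {
            // `lower` is only a hint, so reserve at least one extra slot.
            let (lower, _) = iterator.size_hint();
            vec.reserve(lower.saturating_add(1));
        }
        unsafe {
            std::ptr::write(vec.as_mut_ptr().add(len), element);
            // `next()` runs user code that can panic, so bump the length
            // after every write.
            vec.set_len(len + 1);
        }
    }
}
```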

@jorgecarleitao (Member, Author):

On my computer this is now faster than simd in master 🤣

@jorgecarleitao jorgecarleitao marked this pull request as draft January 20, 2021 04:02
@jorgecarleitao (Member, Author):

I am converting this to a draft because there is room for improvement in the API and its soundness; details here: https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/15

@jorgecarleitao jorgecarleitao marked this pull request as ready for review January 20, 2021 06:55
@alamb (Contributor) commented Jan 20, 2021

Looks like there is an error in the SIMD tests https://github.com/apache/arrow/pull/9235/checks?check_run_id=1733044845


```
failures:

---- src/buffer.rs - buffer::MutableBuffer::extend_from_trusted_len_iter (line 1036) stdout ----
error[E0716]: temporary value dropped while borrowed
 --> src/buffer.rs:1038:12
  |
5 | let iter = vec![1u32].iter().map(|x| x * 2);
  |            ^^^^^^^^^^                      - temporary value is freed at the end of this statement
  |            |
  |            creates a temporary which is freed while still in use
6 | let mut buffer = MutableBuffer::new(0);
7 | unsafe { buffer.extend_from_trusted_len_iter(iter) };
  |                                              ---- borrow later used here
  |
  = note: consider using a `let` binding to create a longer lived value
  = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error
```
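A likely fix, per the compiler's hint (a sketch, not necessarily the exact change that was made): bind the Vec so it outlives the iterator that borrows from it.

```rust
// The named binding keeps the Vec alive while `iter` borrows from it.
let v = vec![1u32];
let iter = v.iter().map(|x| x * 2);
let mut buffer = MutableBuffer::new(0);
// Safety: slice iterators report an exact length.
unsafe { buffer.extend_from_trusted_len_iter(iter) };
```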

```rust
let mut dst = unsafe { ptr.as_ptr().add(len) as *mut T };

while let Some(item) = iterator.next() {
    if len >= capacity {
```
@mqy (Contributor) commented Jan 20, 2021:

@jorgecarleitao Line 1011: the test condition may let a write overflow the buffer; perhaps len should be increased by size before the comparison. The following test shows the problem:

```rust
#[test]
fn mutable_extend_from_iter_write_overflow() {
    let mut buf = MutableBuffer::new(0);

    buf.extend(vec![1_u8; 63]);
    assert_eq!(63, buf.len());
    assert_eq!(64, buf.capacity());

    buf.extend(vec![1_u16; 1]);
    assert_eq!(65, buf.len());
    assert_eq!(64, buf.capacity());
}
```

I ran this test several times and it always passed, but you can see that len ends up bigger than capacity.

@mqy (Contributor) commented Jan 20, 2021

> Looks like there is an error in the SIMD tests https://github.com/apache/arrow/pull/9235/checks?

`cargo test --doc --package arrow -- buffer::MutableBuffer::extend_from_trusted_len_iter --nocapture` reproduces this error.

```rust
let (lower, _) = iterator.size_hint();
let additional = (lower + 1) * size;
let (new_ptr, new_capacity) =
    unsafe { reallocate(ptr, capacity, len + additional) };
```
A contributor commented:

This is unsound if unwinding is enabled, because Iterator::next may panic on a future iteration of the loop, causing self to be dropped while its ptr and capacity are incorrect. This is why a drop guard like SetLenOnDrop is needed.
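For reference, a minimal sketch of such a guard, modeled on the standard library's SetLenOnDrop (the PR's own version appears further down): the length lives in a local and is written back even if the iterator panics mid-loop.

```rust
struct SetLenOnDrop<'a> {
    len: &'a mut usize,
    local_len: usize,
}

impl<'a> SetLenOnDrop<'a> {
    fn new(len: &'a mut usize) -> Self {
        SetLenOnDrop { local_len: *len, len }
    }
}

impl Drop for SetLenOnDrop<'_> {
    fn drop(&mut self) {
        // Runs on both normal exit and unwind, keeping `len` consistent.
        *self.len = self.local_len;
    }
}
```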

@jorgecarleitao (Member, Author) commented Jan 20, 2021:

Thanks a lot. I do agree that this is unsound atm.

Note that arrow does not support complex structs in its buffers (i.e. we only support u8-u64, i8-i64, f32 and f64), which means that we never need to call drop on the elements. Given this, do we still need a valid len? I understood that len was only needed because we might have to call drop on the elements up to len.

With SetLenOnDrop, we borrow a mutable reference to self.len, which won't allow us to call self.reserve inside the if. Could this be the reason why SetLenOnDrop is not used in extend_desugared (which, in my understanding, this part of the code reproduces)?

(I am working on a fix)

@mbrubeck (Contributor) commented Jan 20, 2021:

> Note that arrow does not support complex structs in its buffers (i.e. we only support u8-u64, i8-i64, f32 and f64), which means that we never need to call drop on the elements. Given this, do we still need a valid len?

No. Dropping before self.len is updated should be fine; SetLenOnDrop is not actually necessary for soundness or correctness (as long as the pointer and capacity are still valid).

However, in my benchmarking I still found that using SetLenOnDrop provided a small performance benefit compared to just updating self.len after the loop. I'm not sure why. Maybe the mutable borrow provides some additional hints to the optimizer.

@mbrubeck (Contributor) commented Jan 20, 2021

I was able to speed up the fast path of extend_from_iter by a few percent, by splitting the loop into two parts. This version is only about 8% slower than extend_from_trusted_len_iter. (Previously it was about 12% slower.) Both are still much slower than extend_from_slice.

```rust
#[inline]
fn extend_from_iter<T: ArrowNativeType, I: Iterator<Item = T>>(
    &mut self,
    mut iterator: I,
) {
    let size = std::mem::size_of::<T>();
    let (lower, _) = iterator.size_hint();
    let additional = lower * size;
    self.reserve(additional);

    // this is necessary because of https://github.com/rust-lang/rust/issues/32155
    let mut len = SetLenOnDrop::new(&mut self.len);
    let mut dst = unsafe { self.data.as_ptr().add(len.local_len) as *mut T };
    let capacity = self.capacity;

    while len.local_len + size <= capacity {
        if let Some(item) = iterator.next() {
            unsafe {
                std::ptr::write(dst, item);
                dst = dst.add(1);
            }
            len.local_len += size;
        } else {
            break;
        }
    }
    drop(len);

    iterator.for_each(|item| self.push(item));
}
```

I haven't benchmarked the slow path (the second loop), which I implemented using safe code. It's possible that it could be made faster with unsafe code, but I don't expect large gains there.

My complete code including benchmarks is on this branch: master...mbrubeck:length_faster

@jorgecarleitao (Member, Author):

I validated that with @mbrubeck's change, we can just use collect and drop the trusted_len and unsafe hack, with the same performance on the arithmetics 🎉🎉🎉

@mbrubeck, thank you so much for taking the time to address this; I have been really struggling with the compiler on this one, and your implementation is exactly what I was trying to achieve in this PR: an efficient, safe, and expressive API for our users to build a MutableBuffer.

When we commit to master we squash all commits. Would you be willing to take credit for this whole PR plus your changes? IMO the most relevant part of this whole thing was your change, which enables collect.

Alternatively, we could merge this PR as is and then apply your changes in a separate one. Let me know what you prefer and I will act accordingly (@alamb, let me know if you see any problem here).

@mbrubeck (Contributor):

> When we commit to master we squash all commits. Would you be willing to take credit for this whole PR + your changes?

I think it would be more appropriate to keep your name as the author, because you are more likely to be able to answer future questions about it, etc. But feel free to credit me in the commit message, if you like. Thanks!

@jorgecarleitao jorgecarleitao marked this pull request as draft January 20, 2021 19:58
@jorgecarleitao (Member, Author) commented Jan 20, 2021:

I made a mistake in benchmarking and have to retract my statement regarding collect. I am really sorry for the noise:

@mbrubeck, even though there is a small difference between the two extends in the individual benches, when used in a non-trivial operation (arithmetic), collect is about 65% slower than using extend_from_trusted_len_iter.

This hints that extend_from_trusted_len_iter does have a significant benefit, if nothing else in helping the compiler optimize away some code.

What I am using for the final benches:

```bash
git checkout 7bb2f3c13f0e652b01e3fecbee0ae32ff094f3e6
cargo bench --bench arithmetic_kernels -- "add 512"
git checkout cdbbc6f9d81d37216dc158177405529988fa0279
cargo bench --bench arithmetic_kernels -- "add 512"
```

The difference in code is from

```rust
let values = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r));
let mut buffer = MutableBuffer::new(0);
unsafe { buffer.extend_from_trusted_len_iter(values) };
```

to

```rust
let buffer = left
    .values()
    .iter()
    .zip(right.values().iter())
    .map(|(l, r)| op(*l, *r))
    .collect();
```
IMO this is interesting by itself.

@jorgecarleitao jorgecarleitao changed the title ARROW-11291: [Rust] Add extend to MutableBuffer (-25% for aritmetic, -97% for length) ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length) Jan 21, 2021
@jorgecarleitao jorgecarleitao marked this pull request as ready for review January 21, 2021 21:11
@jorgecarleitao (Member, Author):

I updated the description and this is now ready for a review.

@alamb (Contributor) left a comment:

I think this looks good @jorgecarleitao -- thank you. Nice work.

Given the unsafe nature of this PR, I would feel better about using from_trusted_len_iter if it could detect when the invariant had been violated and panic! -- perhaps by tracking how many items it actually wrote (I had a suggestion on how to do so).
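A sketch of the kind of check being suggested (hypothetical, just illustrating the idea of counting writes against the promised length):

```rust
/// Debug-check the trusted-len contract: the iterator must yield exactly
/// as many items as its `size_hint` upper bound promises.
fn check_trusted_len<T>(iterator: impl Iterator<Item = T>) {
    let (_, upper) = iterator.size_hint();
    let upper = upper.expect("trusted-len iterators report an exact upper bound");
    let mut written = 0usize;
    for _item in iterator {
        // in the real code, `_item` would be written into the buffer here
        written += 1;
    }
    debug_assert_eq!(written, upper, "trusted_len invariant violated");
}
```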

```diff
@@ -1003,6 +1156,28 @@ impl PartialEq for MutableBuffer {
 unsafe impl Sync for MutableBuffer {}
 unsafe impl Send for MutableBuffer {}

+struct SetLenOnDrop<'a> {
```
A contributor commented:

This construct is fascinating, but I will take your word that it was necessary for the speedup.

@alamb (Contributor) left a comment:

I gave this one another quick review and it is looking good. 👍

@alamb alamb closed this in 13e2134 Jan 23, 2021
kszucs pushed a commit that referenced this pull request Jan 25, 2021

ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length)

Closes #9235 from jorgecarleitao/length_faster

Lead-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
nevi-me pushed a commit that referenced this pull request Jan 28, 2021
This PR improves the performance of certain time/date casts by using the brand-new API proposed in #9235.

That API allows very fast execution of unary, infallible operations on primitive arrays, and this PR leverages it for cast operations that amount to a simple mathematical operation.
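A sketch of the shape of such a unary kernel (names and trait bounds are illustrative, not the PR's exact code; the real API constrains `T` to Arrow native types):

```rust
use arrow::buffer::MutableBuffer;

// Apply an infallible function over primitive values, collecting straight
// into a MutableBuffer with no zero-initialization and no intermediate Vec.
fn unary_sketch<T: Copy, F: Fn(T) -> T>(values: &[T], op: F) -> MutableBuffer {
    let iter = values.iter().map(|v| op(*v));
    let mut buffer = MutableBuffer::new(0);
    // Safety: slice iterators report an exact length.
    unsafe { buffer.extend_from_trusted_len_iter(iter) };
    buffer
}
```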

```
Switched to branch 'fast_unary'
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 06s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/cast_kernels-25ee76597a8b997b
Gnuplot not found, using plotters backend

cast date64 to date32 512
                        time:   [1.1668 us 1.1706 us 1.1749 us]
                        change: [-83.347% -83.248% -83.144%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast date32 to date64 512
                        time:   [899.73 ns 930.58 ns 971.56 ns]
                        change: [-86.799% -86.520% -86.190%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

cast time32s to time32ms 512
                        time:   [728.73 ns 732.33 ns 735.90 ns]
                        change: [-54.503% -54.201% -53.917%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast time64ns to time32s 512
                        time:   [4.8374 us 4.8447 us 4.8526 us]
                        change: [-57.022% -56.791% -56.587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

Closes #9297 from jorgecarleitao/fast_unary

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021