
ARROW-11045: [Rust] Fix performance issues of allocator #9027

Closed
wants to merge 6 commits into from
Conversation

jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Dec 28, 2020

This PR addresses a performance issue in how we allocate and reallocate the `MutableBuffer`, by migrating the relevant parts of Rust's `std::alloc` code into this crate.

Problem

The following is the result of 4 runs:

```
mutable                 time:   [929.26 us 931.88 us 935.42 us]
mutable prepared        time:   [1.0682 ms 1.0693 ms 1.0709 ms]
from_slice              time:   [4.4857 ms 4.5043 ms 4.5247 ms]
from_slice prepared     time:   [1.4358 ms 1.4406 ms 1.4467 ms]
```

  1. `mutable`: start with an empty `MutableBuffer` and grow it (realloc + memcopy)
  2. `mutable prepared`: start with a `MutableBuffer` of the correct capacity and grow it (i.e. memcopy only)
  3. `from_slice`: same as 1, but with a `Vec<u8>` (realloc + memcopy), converted at the end with `Buffer::from` (a memcopy)
  4. `from_slice prepared`: same as 2, but with a `Vec<u8>`, converted at the end with `Buffer::from` (memcopy to vec + memcopy to `Buffer`)

The fact that there is essentially no difference between 1 and 2, while 3 is roughly 3x slower than 4, shows that we are doing something wrong: if reallocation had its usual cost, 1 should be noticeably slower than 2, just as 3 is slower than 4.
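Benches 3 and 4 above can be sketched with plain `Vec<u8>` as follows (function names are illustrative, not the actual benchmark code):

```rust
// Bench 3: start empty, so growth triggers realloc + memcopy.
fn from_slice(n: usize) -> Vec<u8> {
    let mut v = Vec::new();
    for i in 0..n {
        v.extend_from_slice(&(i as u32).to_le_bytes());
    }
    v
}

// Bench 4: reserve the exact capacity up front, so growth is memcopy only.
fn from_slice_prepared(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n * std::mem::size_of::<u32>());
    for i in 0..n {
        v.extend_from_slice(&(i as u32).to_le_bytes());
    }
    v
}

fn main() {
    let a = from_slice(1024);
    let b = from_slice_prepared(1024);
    assert_eq!(a, b);                // same bytes, different allocation pattern
    assert!(b.capacity() >= 4096);   // never reallocated after the initial reserve
}
```

The prepared variant is faster only because it avoids reallocations; the final `Buffer::from` memcopy is the same in both.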

This PR

This PR rewrites our current allocator code to code very close to that used by the std allocator. The core reason we do this is that we benefit from cache-line-aligned allocated buffers (ref), but Rust's custom allocator API is unstable (and thus only available on nightly).

The code in this PR is not very complex, and I assume it was already well thought through by Rust's std team. I made the necessary modifications for our use case:

  • always allocate aligned
  • always allocate in chunks of 64 bytes
  • always allocate initialized to zero (std::alloc::alloc_zeroed)

The main benefit of this is that we can use MutableBuffer, BooleanBufferBuilder and BufferBuilder in the same way as we would use Vec<u8>, Vec<bool> and Vec<T: Primitive> respectively without having to rely on unsafe code to efficiently build buffers.
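The three bullet points above can be sketched as follows (a minimal illustration of the strategy, not the actual crate code; `ALIGNMENT` and `rounded_capacity` are hypothetical names):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Cache-line alignment, per the bullets above.
const ALIGNMENT: usize = 64;

// Round a requested size up to the next multiple of 64 bytes.
fn rounded_capacity(size: usize) -> usize {
    (size + (ALIGNMENT - 1)) & !(ALIGNMENT - 1)
}

fn main() {
    let capacity = rounded_capacity(100); // 128: allocate in chunks of 64 bytes
    let layout = Layout::from_size_align(capacity, ALIGNMENT).unwrap();
    unsafe {
        // Aligned AND zero-initialized, so the buffer is immediately usable.
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        assert_eq!(ptr as usize % ALIGNMENT, 0);
        let bytes = std::slice::from_raw_parts(ptr, capacity);
        assert!(bytes.iter().all(|&b| b == 0));
        dealloc(ptr, layout);
    }
}
```

Zero-initialization is what lets callers treat spare capacity as valid bytes without `unsafe` bookkeeping on their side.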

The performance difference is mostly present in variable-sized buffers, such as strings.

Benchmarks for `take`:

```
git checkout master
cargo bench --bench take_kernels --features simd
git checkout alloc
cargo bench --bench take_kernels --features simd
```
| benchmark | variation (%) |
| --- | --- |
| take i32 nulls 1024 | 5.9 |
| take i32 1024 | 4.4 |
| take i32 512 | 2.1 |
| take str 512 | 2.1 |
| take i32 nulls 512 | 1.6 |
| take str 1024 | 1.0 |
| take bool nulls 1024 | -1.5 |
| take bool nulls 512 | -2.4 |
| take bool 512 | -6.8 |
| take bool 1024 | -8.3 |
| take str null values 1024 | -10.6 |
| take str null values null indices 1024 | -18.9 |

Builder benches:

```
bench_primitive         time:   [967.71 us 969.06 us 970.62 us]
                        thrpt:  [4.0245 GiB/s 4.0310 GiB/s 4.0366 GiB/s]
                 change:
                        time:   [-4.3022% -3.2064% -2.2317%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2826% +3.3126% +4.4956%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

bench_bool              time:   [2.0547 ms 2.0823 ms 2.1163 ms]
                        thrpt:  [236.27 MiB/s 240.11 MiB/s 243.35 MiB/s]
                 change:
                        time:   [-25.302% -24.178% -22.857%] (p = 0.00 < 0.05)
                        thrpt:  [+29.629% +31.887% +33.873%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
```

jorgecarleitao added a commit that referenced this pull request Jan 19, 2021
This PR refactors `MutableBuffer::extend_from_slice` to remove the need to call `to_byte_slice` on every call, thereby removing a level of indirection that prevented the compiler from optimizing out some code.

This is the second performance improvement originally presented in #8796 and, together with #9027, brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.

Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code.

This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers hold `u8` (i.e. à la `Vec<u8>`), I made the signatures

```rust
pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
pub fn push<T: ToByteSlice>(&mut self, item: &T)
```

i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).
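The internal conversion can be sketched like this (a simplified stand-in for arrow's `ToByteSlice` trait, for illustration only):

```rust
// Reinterpret a slice of plain-old-data values as its raw bytes.
// `T: Copy` approximates the real trait's bound; the length in bytes
// is known statically from `size_of::<T>()`, which is what lets the
// compiler optimize the copy.
fn to_byte_slice<T: Copy>(items: &[T]) -> &[u8] {
    let len = items.len() * std::mem::size_of::<T>();
    unsafe { std::slice::from_raw_parts(items.as_ptr() as *const u8, len) }
}

fn main() {
    let values: [u32; 3] = [1, 2, 3];
    let bytes = to_byte_slice(&values);
    assert_eq!(bytes.len(), 12); // 3 values * 4 bytes each
    // On little-endian targets the first four bytes encode 1u32.
    #[cfg(target_endian = "little")]
    assert_eq!(&bytes[..4], &[1, 0, 0, 0]);
}
```

Doing this conversion inside `extend_from_slice`, rather than at every call site, is what keeps the type-size information visible to the optimizer.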

Credits for the root cause analysis that led to this PR go to @Dandandan, [originally raised here](#9016 (comment)).

> [...] current conversion to a byte slice may add some overhead? - @Dandandan

Benches (against master, i.e. including both this PR and #9044):

```
Switched to branch 'perf_buffer'
Your branch and 'origin/perf_buffer' have diverged,
and have 6 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
Gnuplot not found, using plotters backend
mutable                 time:   [463.11 us 463.57 us 464.07 us]
                        change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                        change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

Benchmarking from_slice: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                        change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                        change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```

Closes #9076 from jorgecarleitao/perf_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
kszucs pushed a commit that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021