
ARROW-11045: [Rust] Fix performance issues of allocator #9027

Closed
wants to merge 6 commits into from
Conversation

jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Dec 28, 2020

This PR addresses a performance issue in how we allocate and reallocate the `MutableBuffer`, by migrating the relevant parts of Rust's `std::alloc` code into this crate.

Problem

The following is the result of 4 runs:

```
mutable                 time:   [929.26 us 931.88 us 935.42 us]
mutable prepared        time:   [1.0682 ms 1.0693 ms 1.0709 ms]
from_slice              time:   [4.4857 ms 4.5043 ms 4.5247 ms]
from_slice prepared     time:   [1.4358 ms 1.4406 ms 1.4467 ms]
```

  1. `mutable`: start with an empty `MutableBuffer` and grow it (realloc + memcopy)
  2. `mutable prepared`: start with a `MutableBuffer` of the correct capacity and grow it (i.e. memcopy only)
  3. `from_slice`: same as 1, but with a `Vec<u8>` (realloc + memcopy), converted at the end with `Buffer::from` (a memcopy)
  4. `from_slice prepared`: same as 2, but with a `Vec<u8>`, converted at the end with `Buffer::from` (memcopy to vec + memcopy to `Buffer`)

The fact that there is essentially no difference between 1 and 2, while 3 is roughly 3x slower than 4, shows that we are doing something wrong: if reallocation had its usual cost, 1 should be noticeably slower than 2, just as 3 is slower than 4.
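Benches 3 and 4 above can be sketched with plain `Vec<u8>` as follows (function names are illustrative, not the actual benchmark code):

```rust
// Bench 3: start empty, so growth triggers realloc + memcopy.
fn from_slice(n: usize) -> Vec<u8> {
    let mut v = Vec::new();
    for i in 0..n {
        v.extend_from_slice(&(i as u32).to_le_bytes());
    }
    v
}

// Bench 4: reserve the exact capacity up front, so growth is memcopy only.
fn from_slice_prepared(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n * std::mem::size_of::<u32>());
    for i in 0..n {
        v.extend_from_slice(&(i as u32).to_le_bytes());
    }
    v
}

fn main() {
    let a = from_slice(1024);
    let b = from_slice_prepared(1024);
    assert_eq!(a, b);                // same bytes, different allocation pattern
    assert!(b.capacity() >= 4096);   // never reallocated after the initial reserve
}
```

The prepared variant is faster only because it avoids reallocations; the final `Buffer::from` memcopy is the same in both.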

This PR

This PR rewrites our current allocator code to code very close to that used by the std allocator. The core reason we do this is that we benefit from cache-line-aligned allocated buffers (ref), but Rust's custom allocator API is unstable (and thus only available on nightly).

The code in this PR is not very complex, and I assume it was already well thought through by Rust's std team. I made the necessary modifications for our use case:

  • always allocate aligned
  • always allocate in chunks of 64 bytes
  • always allocate initialized to zero (std::alloc::alloc_zeroed)

The main benefit of this is that we can use MutableBuffer, BooleanBufferBuilder and BufferBuilder in the same way as we would use Vec<u8>, Vec<bool> and Vec<T: Primitive> respectively without having to rely on unsafe code to efficiently build buffers.
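The three bullet points above can be sketched as follows (a minimal illustration of the strategy, not the actual crate code; `ALIGNMENT` and `rounded_capacity` are hypothetical names):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Cache-line alignment, per the bullets above.
const ALIGNMENT: usize = 64;

// Round a requested size up to the next multiple of 64 bytes.
fn rounded_capacity(size: usize) -> usize {
    (size + (ALIGNMENT - 1)) & !(ALIGNMENT - 1)
}

fn main() {
    let capacity = rounded_capacity(100); // 128: allocate in chunks of 64 bytes
    let layout = Layout::from_size_align(capacity, ALIGNMENT).unwrap();
    unsafe {
        // Aligned AND zero-initialized, so the buffer is immediately usable.
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        assert_eq!(ptr as usize % ALIGNMENT, 0);
        let bytes = std::slice::from_raw_parts(ptr, capacity);
        assert!(bytes.iter().all(|&b| b == 0));
        dealloc(ptr, layout);
    }
}
```

Zero-initialization is what lets callers treat spare capacity as valid bytes without `unsafe` bookkeeping on their side.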

The performance difference is mostly present in variable-sized buffers, such as strings.

Benchmarks for `take`:

```
git checkout master
cargo bench --bench take_kernels --features simd
git checkout alloc
cargo bench --bench take_kernels --features simd
```
| benchmark | variation (%) |
| --- | --- |
| take i32 nulls 1024 | 5.9 |
| take i32 1024 | 4.4 |
| take i32 512 | 2.1 |
| take str 512 | 2.1 |
| take i32 nulls 512 | 1.6 |
| take str 1024 | 1.0 |
| take bool nulls 1024 | -1.5 |
| take bool nulls 512 | -2.4 |
| take bool 512 | -6.8 |
| take bool 1024 | -8.3 |
| take str null values 1024 | -10.6 |
| take str null values null indices 1024 | -18.9 |

Builder benches:

```
bench_primitive         time:   [967.71 us 969.06 us 970.62 us]
                        thrpt:  [4.0245 GiB/s 4.0310 GiB/s 4.0366 GiB/s]
                 change:
                        time:   [-4.3022% -3.2064% -2.2317%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2826% +3.3126% +4.4956%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

bench_bool              time:   [2.0547 ms 2.0823 ms 2.1163 ms]
                        thrpt:  [236.27 MiB/s 240.11 MiB/s 243.35 MiB/s]
                 change:
                        time:   [-25.302% -24.178% -22.857%] (p = 0.00 < 0.05)
                        thrpt:  [+29.629% +31.887% +33.873%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
```

jorgecarleitao added a commit that referenced this pull request Jan 19, 2021
This PR refactors `MutableBuffer::extend_from_slice` to remove the need to call `to_byte_slice` on every call, thereby removing a level of indirection that prevented the compiler from optimizing out some code.

This is the second performance improvement originally presented in #8796 and, together with #9027, brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.

Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code.

This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers hold `u8` (i.e. à la `Vec<u8>`), I made the signatures

```rust
pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
pub fn push<T: ToByteSlice>(&mut self, item: &T)
```

i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).
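The internal conversion can be sketched like this (a simplified stand-in for arrow's `ToByteSlice` trait, for illustration only):

```rust
// Reinterpret a slice of plain-old-data values as its raw bytes.
// `T: Copy` approximates the real trait's bound; the length in bytes
// is known statically from `size_of::<T>()`, which is what lets the
// compiler optimize the copy.
fn to_byte_slice<T: Copy>(items: &[T]) -> &[u8] {
    let len = items.len() * std::mem::size_of::<T>();
    unsafe { std::slice::from_raw_parts(items.as_ptr() as *const u8, len) }
}

fn main() {
    let values: [u32; 3] = [1, 2, 3];
    let bytes = to_byte_slice(&values);
    assert_eq!(bytes.len(), 12); // 3 values * 4 bytes each
    // On little-endian targets the first four bytes encode 1u32.
    #[cfg(target_endian = "little")]
    assert_eq!(&bytes[..4], &[1, 0, 0, 0]);
}
```

Doing this conversion inside `extend_from_slice`, rather than at every call site, is what keeps the type-size information visible to the optimizer.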

Credits for the root cause analysis that led to this PR go to @Dandandan, [originally raised here](#9016 (comment)).

> [...] current conversion to a byte slice may add some overhead? - @Dandandan

Benches (against master, i.e. including both this PR and #9044):

```
Switched to branch 'perf_buffer'
Your branch and 'origin/perf_buffer' have diverged,
and have 6 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
Gnuplot not found, using plotters backend
mutable                 time:   [463.11 us 463.57 us 464.07 us]
                        change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                        change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

Benchmarking from_slice: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                        change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                        change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```

Closes #9076 from jorgecarleitao/perf_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
kszucs pushed a commit that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021