
[Rust] [Experiment] Vec<u8> vs current allocations #8796

Closed
wants to merge 3 commits into from

Conversation

jorgecarleitao (Member) commented Nov 28, 2020

@nevi-me, @alamb, @jhorstmann, I have been playing around with the buffers in the arrow crate and, just for fun, tried to replace all our memory logic by a simple Vec<u8>. Perhaps unsurprisingly to you, but a bit surprisingly to me, this leads to a significant improvement over almost all benches. That is, even though memory alignment is good for some kernels, overall our allocations and memory handling seem to be much worse than Vec.

I am not proposing that we drop the alignment over cache lines, as it is theoretically more sound. However, practically (and based on our microbenchmarks alone), there seems to be a good case here, especially given that we drop a large amount of unsafe code. Maybe this behavior is different if we use the simd feature gate?
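For readers skimming the thread, a minimal sketch (illustrative names, not the code in this PR) of what "replace all our memory logic by a simple Vec<u8>" amounts to: the buffer just owns a Vec<u8> and delegates allocation, growth, and deallocation to the standard library.

```rust
// Illustrative sketch only -- not the actual patch.
#[derive(Debug, Default, Clone)]
pub struct VecBuffer {
    data: Vec<u8>,
}

impl VecBuffer {
    /// Allocation is delegated to Vec; no manual alignment or unsafe code.
    pub fn with_capacity(capacity: usize) -> Self {
        Self { data: Vec::with_capacity(capacity) }
    }

    /// Growth and reallocation are handled by Vec's amortized strategy.
    pub fn extend_from_slice(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}
```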

Here are the results, ordered from worst to best (results that are not significant are not shown):

benchmark variation (%)
nlike_utf8 scalar ends with 15.8
sum nulls 512 13.5
struct_array_from_vec 1024 8.0
array_slice 512 6.9
nlike_utf8 scalar contains 5.7
cast timestamp_ns to timestamp_s 512 5.1
record_batches_to_csv 3.9
sort nulls 2^12 3.5
sort nulls 2^10 3.0
min string 512 3.0
cast timestamp_ms to i64 512 2.3
nlike_utf8 scalar equals 2.1
struct_array_from_vec 512 1.8
take str 1024 1.4
nlike_utf8 scalar complex 1.3
array_slice 128 0.9
like_utf8 scalar complex 0.5
like_utf8 scalar contains 0.5
min 512 -1.0
sort 2^12 -1.1
sort 2^10 -1.1
like_utf8 scalar equals -1.3
like_utf8 scalar starts with -1.4
limit 512, 512 -1.4
cast time32s to time32ms 512 -1.9
subtract 512 -2.2
filter context f32 high selectivity -2.2
add 512 -2.7
struct_array_from_vec 256 -2.7
divide_nulls_512 -2.9
sum 512 -3.0
add_nulls_512 -3.1
take str nulls 512 -3.5
multiply 512 -3.6
filter context u8 very low selectivity -3.9
array_slice 2048 -4.3
cast date64 to date32 512 -4.5
take i32 nulls 1024 -4.8
min nulls string 512 -5.1
take i32 1024 -5.3
array_string_from_vec 256 -5.5
array_string_from_vec 128 -5.7
filter context u8 w NULLs very low selectivity -6.4
filter context u8 low selectivity -6.6
filter u8 high selectivity -7.1
filter context u8 w NULLs high selectivity -7.2
filter u8 very low selectivity -7.4
struct_array_from_vec 128 -7.4
cast int64 to int32 512 -7.8
cast date32 to date64 512 -8.2
take i32 nulls 512 -8.3
equal_string_nulls_512 -8.4
take i32 512 -8.5
buffer_bit_ops and -9.4
equal_512 -9.5
take str 512 -9.6
cast time64ns to time32s 512 -9.7
take bool 1024 -10.2
filter context u8 high selectivity -11.2
filter u8 low selectivity -11.2
equal_string_512 -12.2
array_from_vec 256 -12.5
take bool 512 -12.8
cast time32s to time64us 512 -15.6
buffer_bit_ops or -17.0
eq scalar Float32 -17.6
lt_eq scalar Float32 -17.9
lt scalar Float32 -18.2
array_from_vec 512 -19.4
gt_eq scalar Float32 -19.5
take bool nulls 1024 -19.7
lt_eq Float32 -19.8
eq Float32 -19.9
gt_eq Float32 -20.2
filter context u8 w NULLs low selectivity -20.4
neq scalar Float32 -21.1
gt scalar Float32 -21.5
and -21.8
or -22.1
not -22.6
take bool nulls 512 -22.7
cast int32 to int64 512 -23.0
min nulls 512 -23.2
array_from_vec 128 -23.4
cast float64 to uint64 512 -24.3
neq Float32 -24.8
lt Float32 -24.9
gt Float32 -25.6
cast float64 to float32 512 -25.9
cast int32 to float64 512 -27.6
equal_nulls_512 -28.0
cast int32 to uint32 512 -30.4
cast int32 to float32 512 -33.3
cast float32 to int32 512 -35.0

github-actions (bot) commented:
Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}


nevi-me (Contributor) commented Nov 29, 2020

> Perhaps unsurprisingly to you, but a bit surprisingly to me, this leads to a significant improvement over almost all benches

You can consider me surprised too.

This is very interesting. Can one then say that the alignment requirements from arrow::memory are the main cause of the difference? If we only enforce alignment at boundaries like IPC and FFI, could we still use Vec<u8> internally? I don't think it would be much of an issue for Parquet either, as we currently materialise Arrow data into the primitives that Parquet supports, due to the way arrays are indexed for definition levels.

jhorstmann (Contributor) commented:

I'm surprised too. It might just be the removed assertions, but I did not expect them to have measurable overhead. If you want to investigate further, you could try removing only those or replacing them with debug_assert. I can't easily reproduce it on my notebook since the variation between runs is too high; I would need to spin up another EC2 instance to run stable benchmarks.
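As a concrete illustration of the "replace with debug_assert" option (a sketch, not the crate's code): the check still runs in debug builds but is compiled out of release builds, where the caller upholds the invariant.

```rust
/// Sketch: the bounds check runs in debug builds and disappears in release.
///
/// # Safety
/// The caller must guarantee `index < buf.len()`.
unsafe fn set_unchecked(buf: &mut [u8], index: usize, value: u8) {
    debug_assert!(index < buf.len(), "index {} out of bounds", index);
    *buf.get_unchecked_mut(index) = value;
}

fn main() {
    let mut buf = vec![0u8; 4];
    // Safe because 2 < 4.
    unsafe { set_unchecked(&mut buf, 2, 7) };
    assert_eq!(buf[2], 7);
}
```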

Reading the columnar specification again, the alignment is only a recommendation, and only required when serialized. I'm not familiar with that part of the code, but I assume it already needs to ensure the required padding.

The only problem I could see would be with shared memory or FFI, if the other side relies on the padding. I think it already can't rely on 64-byte alignment, because arrays can be arbitrary slices of the underlying buffers. But relying on padding could happen when accessing data using vector instructions.

It seems Rust will soon get support for custom allocators for Vec; that way we could get both a simplified internal API and still ensure padded allocations using a custom allocator.

TimDiekmann commented Nov 29, 2020

> It seems Rust will soon get support for custom allocators for Vec; that way we could get both a simplified internal API and still ensure padded allocations using a custom allocator.

Let me hook in here for a moment. Although it is true that custom allocator support for Vec is now implemented, be aware that this is not yet stable and the API may undergo some changes. Without #![feature(allocator_api)] it's not even possible to declare Vec<T, _> or Vec<T, Global>.
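For context, a rough sketch of what a padded/aligned Vec could look like on nightly with the (still unstable) allocator_api; the trait surface may change, so treat this purely as an illustration.

```rust
#![feature(allocator_api)] // nightly only; the API is subject to change

use std::alloc::{AllocError, Allocator, Global, Layout};
use std::ptr::NonNull;

/// Forwards to the global allocator, but bumps every request to 64-byte alignment.
struct AlignedAlloc;

unsafe impl Allocator for AlignedAlloc {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let layout = layout.align_to(64).map_err(|_| AllocError)?;
        Global.allocate(layout)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        // Must mirror the layout adjustment made in `allocate`.
        let layout = layout.align_to(64).expect("valid layout");
        Global.deallocate(ptr, layout)
    }
}

fn main() {
    // The usual Vec API, backed by the custom allocator.
    let mut buffer: Vec<u8, AlignedAlloc> = Vec::new_in(AlignedAlloc);
    buffer.extend_from_slice(&[1, 2, 3, 4]);
    assert_eq!(buffer.as_ptr() as usize % 64, 0);
}
```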

ritchie46 (Contributor) commented Nov 29, 2020

Do these benchmarks also include writing to the buffers, or only allocating/deallocating?

I ask because I have already used some sort of wrapper around a Vec. It is just a wrapper around a Rust Vec that overrides reserve to use Arrow's memory-aligned allocation.

However, for writing and indexing it uses the default Vec methods. I found this to be a lot faster for creating Arrow arrays. I use it when I know I don't have null values.
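To make the pattern concrete, here is a much simplified stand-in (not the wrapper described above, which uses Arrow's aligned allocation): reserve is intercepted to round the requested capacity up to 64-byte multiples, while reads and writes go straight to the inner Vec.

```rust
// Simplified stand-in: only the capacity is rounded; the base pointer is
// whatever the global allocator returns, unlike a truly aligned allocation.
const ALIGNMENT: usize = 64;

pub struct PaddedVec {
    inner: Vec<u8>,
}

impl PaddedVec {
    pub fn new() -> Self {
        Self { inner: Vec::new() }
    }

    /// Intercepted reserve: request capacity in 64-byte multiples.
    pub fn reserve(&mut self, additional: usize) {
        let wanted = self.inner.len() + additional;
        let rounded = (wanted + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
        self.inner.reserve(rounded - self.inner.len());
    }

    /// Writing and indexing defer to Vec.
    pub fn extend_from_slice(&mut self, bytes: &[u8]) {
        self.inner.extend_from_slice(bytes);
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.inner
    }
}
```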

Dandandan (Contributor) commented:

I did some profiling with Valgrind yesterday. I don't currently have very good knowledge of the memory internals, but there I also saw that some of the lines related to MutableBuffer::extend_from_slice stand out in terms of instruction cycles, as they are called a lot in the code / kernels. I think a few "micro-optimizations" are possible there.

It makes sense to me that switching to Vec<u8> helps, since Vec is already extremely well optimized and benchmarked.

jorgecarleitao (Member, Author) commented:

Thanks a lot for all the comments so far.

@jhorstmann really good points. Unfortunately, I do not think it is just the asserts, because Vec performs the same assertions, as it is also safe code.

With regard to FFI, @jhorstmann and @nevi-me: I don't think FFI can rely on it. As @jhorstmann mentions, since this is only a recommendation, implementations must be able to handle non-aligned buffers. The Rust implementation is even funnier here, because the C data interface has no API to export Buffer::offset (only Array::offset). This implies that we need to offset the pointer by Buffer::offset when we export to the C data interface (details on #8401). I think that this makes the receiving end unable to determine whether the allocated region is aligned or not.

I think that this Buffer::offset may also destroy the benefit of alignment in our own implementation, as ArrayData::data will output a non-aligned byte slice whenever Buffer::offset is not 0. To use the aligned memory, I think we would need to use the data without the offset, perform the SIMD operation in chunks of 64 bytes starting at the beginning of the buffer, and then pass the offset to the new buffer.
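As a toy illustration of that point (hypothetical offset, not code from the crate): whatever alignment the allocation has, the pointer seen after applying Buffer::offset is shifted by the offset, so neither our kernels nor an FFI consumer can assume 64-byte alignment.

```rust
fn main() {
    // Pretend this allocation starts on a 64-byte boundary.
    let buffer = vec![0u8; 256];
    let base = buffer.as_ptr() as usize;

    // The array is a slice starting 3 bytes into the buffer.
    let offset = 3usize;
    let exported = base + offset;

    // Whatever `base % 64` is, `exported % 64` differs by the offset, so the
    // receiver cannot assume the exported pointer is 64-byte aligned.
    println!("base % 64 = {}, exported % 64 = {}", base % 64, exported % 64);
}
```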

@ritchie46 good question. The benchmarks include allocations and mutations, as they cover a wide range of situations.

@Dandandan that is also my current hypothesis: the implementation is competing with some of the brightest minds when we try to re-invent a Vec, and the benefits of 64-byte aligned memory do not overcome the benefits of a highly optimized container (Vec).

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020
houqp (Member) commented Nov 30, 2020

Great simplification indeed 👍 I have seen conflicting assertions about misaligned access in SIMD online, ranging from minor overhead to significant performance impact. I am now very curious what the benchmark results will look like with the simd feature gate turned on.

@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020
alamb (Contributor) left a comment:

@jorgecarleitao -- in general I personally would advocate for the approach of "Use a simpler, Vec based implementation, unless there are compelling performance benchmarks that show the more complicated implementation is going faster"

Like the other reviewers have mentioned, I think we are at the point in this project where ad-hoc execution of micro-benches on a variety of hardware (our various dev machines) is likely insufficiently precise to drive decisions such as this.

As part of https://github.com/influxdata/influxdb_iox we are actively planning on getting a regular benchmarking story in place -- I will see if I can figure out a good way to get the arrow micro benches included

But all in all, really nice @jorgecarleitao

Dandandan (Contributor) commented Dec 1, 2020

@jorgecarleitao
Maybe I'm saying something weird/impossible, but would it also be possible/beneficial to store the (mutable) buffer in a Vec<T>?
This way it could simplify mutation of the buffer for the different types, while also relying less on unsafe code / code that could segfault or lead to other errors when used incorrectly. In profiling/benchmarks I saw that there are major inefficiencies related to writing values as individual bytes instead of being able to store them directly in the builder API (e.g. in the append function).
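A hedged sketch of the inefficiency being described (illustrative code, not the crate's builder): appending a primitive one byte at a time versus copying it in a single typed write.

```rust
use std::mem::size_of;

/// Slow path: push the value byte by byte (what "writing values as
/// individual bytes" looks like in a builder hot loop).
fn append_bytewise(buf: &mut Vec<u8>, value: i64) {
    for b in value.to_le_bytes() {
        buf.push(b);
    }
}

/// Fast path: one bulk copy for the whole value.
fn append_typed(buf: &mut Vec<u8>, value: i64) {
    buf.extend_from_slice(&value.to_le_bytes());
}

fn main() {
    let mut a = Vec::new();
    let mut b = Vec::new();
    append_bytewise(&mut a, 42);
    append_typed(&mut b, 42);
    assert_eq!(a, b);
    assert_eq!(a.len(), size_of::<i64>());
}
```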

For the rest, I think it really makes sense to push this idea forward, as the current implementation is much more complicated without a good reason. I think that with Vec it will actually be easier to optimize for performance. I agree with @alamb that getting rid of the other code is beneficial; the benchmarks at least don't show clear regressions.

Really look forward to those benchmarks too @alamb !

jorgecarleitao (Member, Author) commented:

@alamb I agree with the simplest but no simpler rule.

I also agree with your concerns about the benches being run on ad-hoc hardware. It makes it more difficult to reproduce and draw conclusions.

@Dandandan , I do not think so, but you may have better ideas than me:

The way I currently see it, ArrayData is 'array-type'-independent. If we make buffers generic over T, we need to find a way to write ArrayData. We could make it logic-dependent, but then we lose the flexibility of a non-generic ArrayData, particularly on composite types such as ListArray, which have children of generic types.

One way out would be to make ArrayData dynamically typed, so that it can hold arbitrary children, but I think that at some point we will need to downcast them, as we will need to extract which type T their buffers contain. This is just my analysis and in no way a definitive answer about this, though ^_^
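A tiny sketch of the downcasting concern (hypothetical types, not the actual ArrayData): once the children are held behind a type-erased interface, every use site has to recover the concrete Vec<T> again.

```rust
use std::any::Any;

// Hypothetical, type-erased ArrayData: children/buffers behind `dyn Any`.
struct ArrayData {
    buffers: Vec<Box<dyn Any>>,
}

// Every consumer has to downcast back to the concrete Vec<T> it expects.
fn sum_i32_buffers(data: &ArrayData) -> i32 {
    data.buffers
        .iter()
        .filter_map(|b| b.downcast_ref::<Vec<i32>>())
        .flat_map(|v| v.iter().copied())
        .sum()
}

fn main() {
    let data = ArrayData {
        buffers: vec![
            Box::new(vec![1i32, 2, 3]) as Box<dyn Any>,
            Box::new(vec![0u8; 4]) as Box<dyn Any>,
        ],
    };
    assert_eq!(sum_i32_buffers(&data), 6);
}
```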

jorgecarleitao (Member, Author) commented:

@houqp I will benchmark against simd and will post the results on the PR's table ASAP.

ritchie46 (Contributor) commented:

> ArrayData is 'array-type'-independent. If we make buffers generic over T, we need to find a way to write ArrayData

Would it perhaps be possible to have a hybrid solution? The buffer remains a typeless Vec<u8>, but the public API exposes generic typed methods like Buffer::push::<T>(). Then some of the code regarding type conversion to bytes, alignment, etc. could be abstracted.
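Something like the following is what such a hybrid could look like (a sketch with assumed names, not the actual Buffer API): bytes inside, typed methods outside.

```rust
// Stand-in for the crate's ToByteSlice-style trait (name and impls assumed).
pub trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

macro_rules! impl_as_bytes {
    ($($t:ty),*) => {$(
        impl AsBytes for $t {
            fn as_bytes(&self) -> &[u8] {
                // View the value as its native-endian bytes.
                unsafe {
                    std::slice::from_raw_parts(
                        self as *const $t as *const u8,
                        std::mem::size_of::<$t>(),
                    )
                }
            }
        }
    )*};
}

impl_as_bytes!(i32, i64, f32, f64);

/// Storage stays an untyped Vec<u8>; only the public API is typed.
#[derive(Default)]
pub struct Buffer {
    data: Vec<u8>,
}

impl Buffer {
    pub fn push<T: AsBytes>(&mut self, item: T) {
        self.data.extend_from_slice(item.as_bytes());
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let mut buf = Buffer::default();
    buf.push(1i32);
    buf.push(2.5f64);
    assert_eq!(buf.len(), 4 + 8);
}
```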

@github-actions github-actions bot added needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 3, 2020
@jorgecarleitao jorgecarleitao deleted the buffer2 branch December 4, 2020 07:41
@jorgecarleitao jorgecarleitao restored the buffer2 branch December 4, 2020 07:41
@jorgecarleitao jorgecarleitao reopened this Dec 4, 2020
@github-actions github-actions bot added Component: Parquet needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 4, 2020
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 11, 2020
jorgecarleitao (Member, Author) commented:

So, for a quick update:

I was trying to finalize this before 3.0, but that now seems unlikely, as I can't figure out a way to make the CI on #8829 green.

@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 13, 2020
codecov-io commented Dec 13, 2020

Codecov Report

Merging #8796 (8a1c52c) into master (091df20) will decrease coverage by 0.12%.
The diff coverage is 83.60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8796      +/-   ##
==========================================
- Coverage   83.22%   83.10%   -0.13%     
==========================================
  Files         196      195       -1     
  Lines       48232    47977     -255     
==========================================
- Hits        40142    39869     -273     
- Misses       8090     8108      +18     
Impacted Files Coverage Δ
rust/arrow/src/array/array_primitive.rs 90.79% <ø> (-1.48%) ⬇️
rust/arrow/src/array/array_union.rs 87.55% <ø> (-2.24%) ⬇️
rust/arrow/src/array/data.rs 93.40% <0.00%> (-3.85%) ⬇️
rust/arrow/src/array/raw_pointer.rs 100.00% <ø> (ø)
rust/arrow/src/bitmap.rs 84.74% <0.00%> (-6.78%) ⬇️
rust/arrow/src/compute/kernels/comparison.rs 96.28% <ø> (ø)
rust/arrow/src/bytes.rs 41.37% <50.00%> (-12.68%) ⬇️
rust/arrow/src/array/array_binary.rs 90.73% <100.00%> (ø)
rust/arrow/src/array/array_boolean.rs 86.50% <100.00%> (-0.22%) ⬇️
rust/arrow/src/array/array_list.rs 92.74% <100.00%> (-0.38%) ⬇️
... and 11 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 091df20...8a1c52c.

jorgecarleitao (Member, Author) commented:

Small update: this is blocked by segfaults coming from the SIMD implementation, as you can see from the logs of the SIMD feature test.

I think that they come from the sum.

I thought that the SIMD implementation would not make assumptions about memory alignment or minimum buffer size. This needs some investigation.

jhorstmann (Contributor) commented:

@jorgecarleitao I had a quick look at the failing test, that one actually uses the addition kernel to prepare test data and I think that is where the problem is. I'll try to find time to have a deeper look at it today.

jorgecarleitao (Member, Author) commented:

> @jorgecarleitao I had a quick look at the failing test, that one actually uses the addition kernel to prepare test data and I think that is where the problem is. I'll try to find time to have a deeper look at it today.

Thank you so much, @jhorstmann , really appreciated.

alamb (Contributor) commented Dec 15, 2020

Possibly related to #8929

jorgecarleitao (Member, Author) commented Dec 19, 2020

I have now rebased this against master. After @jhorstmann's fix to the out-of-bounds issue in #8954, it now runs correctly.

Here are the results:

no SIMD

git checkout master
cargo bench --benches
git checkout buffer2
cargo bench --benches
benchmark variation (%)
cast date64 to date32 512 63.2
cast int32 to float64 512 52.4
cast float64 to float32 512 43.0
cast time64ns to time32s 512 42.9
filter context f32 very low selectivity 40.5
cast date32 to date64 512 37.0
cast int32 to float32 512 35.5
struct_array_from_vec 1024 32.6
lt scalar Float32 32.4
concat str 1024 31.6
cast int32 to int64 512 29.8
cast float64 to uint64 512 27.7
filter context u8 very low selectivity 26.8
take str 1024 26.6
concat str nulls 1024 26.3
take str null indices 1024 25.2
take str null values 1024 25.1
struct_array_from_vec 512 24.8
cast float32 to int32 512 20.3
filter u8 very low selectivity 19.7
take str null indices 512 19.2
take str 512 18.7
cast time32s to time64us 512 17.3
nlike_utf8 scalar equals 16.1
struct_array_from_vec 256 15.7
nlike_utf8 scalar ends with 13.4
take i32 nulls 512 13.0
take i32 512 11.7
take i32 nulls 1024 11.1
take str null values null indices 1024 10.3
take i32 1024 10.2
filter context u8 low selectivity 10.0
cast int32 to uint32 512 9.7
filter u8 low selectivity 9.4
filter context u8 w NULLs low selectivity 7.4
min string 512 6.9
like_utf8 scalar equals 6.5
like_utf8 scalar complex 6.0
filter context f32 high selectivity 5.6
min nulls 512 5.3
like_utf8 scalar contains 5.1
divide 512 4.8
nlike_utf8 scalar contains 4.8
concat i32 1024 4.5
divide_nulls_512 4.4
struct_array_from_vec 128 4.2
take bool nulls 512 2.2
equal_512 2.0
take bool 512 1.4
min nulls string 512 1.4
array_string_from_vec 512 1.1
nlike_utf8 scalar complex 0.9
gt_eq scalar Float32 0.9
eq scalar Float32 0.7
cast int64 to int32 512 0.7
cast int32 to int32 512 0.5
neq scalar Float32 0.3
cast timestamp_ns to timestamp_s 512 0.2
gt Float32 -0.3
min 512 -0.3
lt Float32 -0.4
sort nulls 2^12 -0.4
sort 2^12 -0.6
not -0.8
array_string_from_vec 256 -0.9
sort nulls 2^10 -1.1
nlike_utf8 scalar starts with -1.5
max nulls 512 -1.7
array_slice 512 -1.7
filter context u8 w NULLs very low selectivity -1.8
cast timestamp_ms to i64 512 -1.8
length -2.0
and -2.3
or -2.6
add 512 -2.6
like_utf8 scalar starts with -2.9
limit 512, 512 -3.4
multiply 512 -3.5
cast time32s to time32ms 512 -3.6
array_from_vec 512 -3.7
subtract 512 -3.9
take bool nulls 1024 -4.3
add_nulls_512 -4.9
cast timestamp_ms to timestamp_ns 512 -5.2
sum 512 -5.3
like_utf8 scalar ends with -6.1
array_string_from_vec 128 -6.2
filter context f32 low selectivity -8.2
buffer_bit_ops or -9.0
equal_string_nulls_512 -9.2
array_from_vec 256 -9.3
array_from_vec 128 -10.5
equal_string_512 -12.1
filter context u8 high selectivity -12.2
filter u8 high selectivity -12.8
buffer_bit_ops and -13.0
take bool 1024 -13.3
filter context u8 w NULLs high selectivity -13.3
sum nulls 512 -14.6

SIMD

git checkout master
cargo bench --benches --features simd
git checkout buffer2
cargo bench --benches --features simd
benchmark variation (%)
cast date64 to date32 512 64.0
cast date32 to date64 512 49.5
cast float64 to float32 512 46.1
cast time64ns to time32s 512 44.2
filter context f32 very low selectivity 43.9
lt scalar Float32 39.1
lt_eq Float32 38.7
cast int32 to int64 512 35.7
struct_array_from_vec 1024 35.4
lt_eq scalar Float32 35.4
eq scalar Float32 34.2
neq Float32 32.1
cast int32 to float64 512 31.6
concat str 1024 31.4
gt Float32 30.3
neq scalar Float32 29.8
filter context u8 very low selectivity 28.2
equal_nulls_512 27.3
eq Float32 27.1
struct_array_from_vec 512 26.1
cast float64 to uint64 512 25.9
lt Float32 24.8
filter context u8 low selectivity 24.5
filter u8 low selectivity 24.2
gt_eq Float32 23.6
cast time32s to time64us 512 23.6
cast float32 to int32 512 22.5
multiply 512 21.2
buffer_bit_ops and 20.3
gt_eq scalar Float32 20.1
take str 1024 19.5
subtract 512 19.5
cast int32 to float32 512 19.0
take str null indices 1024 19.0
take str null values 1024 17.4
and 17.4
struct_array_from_vec 256 16.8
or 16.1
not 15.8
take str 512 15.1
cast int32 to uint32 512 14.8
add_nulls_512 14.2
add 512 14.0
take str null indices 512 13.6
filter u8 very low selectivity 12.9
gt scalar Float32 12.5
take i32 512 12.5
filter context u8 w NULLs low selectivity 12.4
concat i32 nulls 1024 10.5
concat str nulls 1024 10.1
min string 512 9.5
equal_string_nulls_512 9.2
take i32 1024 8.3
take i32 nulls 1024 8.0
array_slice 2048 7.6
take i32 nulls 512 7.5
divide_nulls_512 7.1
take str null values null indices 1024 6.8
cast time32s to time32ms 512 6.0
divide 512 5.5
array_string_from_vec 512 4.6
min 512 4.6
array_slice 512 4.4
concat i32 1024 4.0
cast timestamp_ms to timestamp_ns 512 3.2
struct_array_from_vec 128 2.8
array_string_from_vec 256 2.7
filter context f32 high selectivity 2.6
array_slice 128 2.5
nlike_utf8 scalar complex 2.2
cast timestamp_ms to i64 512 1.9
like_utf8 scalar complex 1.8
equal_string_512 1.6
sort nulls 2^10 1.6
equal_512 1.2
take bool nulls 512 1.1
limit 512, 512 0.8
max nulls 512 0.6
sum nulls 512 0.4
min nulls 512 0.3
max 512 0.3
cast timestamp_ns to timestamp_s 512 -0.2
sort 2^12 -0.3
filter context f32 low selectivity -1.0
array_string_from_vec 128 -1.4
sort 2^10 -1.4
cast int64 to int32 512 -2.6
take bool nulls 1024 -2.9
buffer_bit_ops or -3.2
take bool 512 -4.0
filter context u8 high selectivity -4.5
filter context u8 w NULLs very low selectivity -4.6
length -5.1
min nulls string 512 -5.3
nlike_utf8 scalar contains -5.6
take bool 1024 -6.1
filter u8 high selectivity -6.2
array_from_vec 256 -6.3
array_from_vec 512 -7.0
like_utf8 scalar contains -7.1
filter context u8 w NULLs high selectivity -8.2
array_from_vec 128 -9.3
nlike_utf8 scalar starts with -15.6
nlike_utf8 scalar equals -18.5
nlike_utf8 scalar ends with -20.1
like_utf8 scalar ends with -23.5
like_utf8 scalar starts with -27.8
like_utf8 scalar equals -43.8
record_batches_to_csv -52.9

memory::memcpy(buffer, slice.as_ptr(), len);
Buffer::build_with_arguments(buffer, len, Deallocation::Native(capacity))
}
let bytes = unsafe { Bytes::new(p.as_ref().to_vec(), Deallocation::Native) };
Review comment (Contributor):

There could be potential for further optimization here: to_vec has to copy the slice contents, a separate implementation of From<Vec<u8>> or From<Vec<ArrowPrimitiveType>> could avoid that copy and speed up several kernels involving primitives or list offsets.

As a From implementation that would give a "conflicting implementations" error, an explicit from_vec method could work. I'd suggest trying it in a separate PR as it could change a bunch of code not directly related to the refactoring in this PR.
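A hedged sketch of the suggested from_vec (names and types are placeholders, not the crate's actual Bytes/Buffer definitions): ownership of the Vec is moved in, so no copy of the contents happens.

```rust
use std::sync::Arc;

// Placeholder types standing in for the crate's Bytes/Buffer.
struct Bytes {
    data: Vec<u8>,
}

pub struct Buffer {
    bytes: Arc<Bytes>,
}

impl Buffer {
    /// Takes ownership of the Vec: no `to_vec()` copy of the contents.
    pub fn from_vec(data: Vec<u8>) -> Self {
        Self { bytes: Arc::new(Bytes { data }) }
    }

    pub fn len(&self) -> usize {
        self.bytes.data.len()
    }
}

fn main() {
    let buf = Buffer::from_vec(vec![1, 2, 3, 4]);
    assert_eq!(buf.len(), 4);
}
```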

jorgecarleitao (Member, Author) commented:

I am closing this in favor of #9076 where the performance is fixed.

The gist is that there were two reasons for the performance issues:

  1. we were using std::alloc::alloc_zeroed instead of std::alloc::alloc
  2. we were converting everything to a byte slice instead of writing directly to the buffer

That PR addresses them both and makes MutableBuffer faster than Vec.
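For reference, a minimal illustration of point 1 (standard-library calls only; the actual fix lives in #9076): alloc_zeroed pays for initializing every byte, while alloc leaves the memory uninitialized for the caller to fill.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, handle_alloc_error, Layout};

fn main() {
    let layout = Layout::from_size_align(1024, 64).unwrap();
    unsafe {
        // Zeroed allocation: every byte is written up front, even if the
        // caller immediately overwrites the buffer anyway.
        let zeroed = alloc_zeroed(layout);
        if zeroed.is_null() {
            handle_alloc_error(layout);
        }
        dealloc(zeroed, layout);

        // Plain allocation: no initialization cost; the caller must write
        // before reading.
        let raw = alloc(layout);
        if raw.is_null() {
            handle_alloc_error(layout);
        }
        dealloc(raw, layout);
    }
}
```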

@jorgecarleitao jorgecarleitao deleted the buffer2 branch January 5, 2021 19:18
jorgecarleitao added a commit that referenced this pull request Jan 19, 2021
This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing a level of indirection that prevented the compiler from optimizing out some code.

This is the second performance improvement originally presented in #8796 and, together with #9027 , brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.

Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code.

This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature

```
pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
pub fn push<T: ToByteSlice>(&mut self, item: &T)
```

i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).

Credits for the root cause analysis that led to this PR go to @Dandandan, [originally fielded here](#9016 (comment)).

> [...] current conversion to a byte slice may add some overhead? - @Dandandan

Benches (against master, so, both this PR and #9044 ):

```
Switched to branch 'perf_buffer'
Your branch and 'origin/perf_buffer' have diverged,
and have 6 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
Gnuplot not found, using plotters backend
mutable                 time:   [463.11 us 463.57 us 464.07 us]
                        change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                        change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

Benchmarking from_slice: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                        change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                        change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```

Closes #9076 from jorgecarleitao/perf_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
kszucs pushed a commit that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Labels: Component: Parquet, Component: Rust, needs-rebase (A PR that needs to be rebased by the author)