[Rust] [Experiment] Vec<u8> vs current allocations #8796

jorgecarleitao · 2020-11-28T21:15:40Z

@nevi-me , @alamb @jhorstmann , I have been playing around with the buffers on the arrow crate, and just for the fun, tried to replace all our memory logic by a simple Vec<u8>. Perhaps unsurprisingly to you, but a bit to me, this leads to a significant improvement over almost all benches. I.e. even though memory alignment is good for some kernels, overall our allocations and memory handling seems to be much worse than Vec.

I am not proposing that we drop the alignment over cache lines as it is theoretically more sound. However, practically (and based on our microbenchmarks alone), there seems to be a good case here, specially the fact that we drop a large amount of unsafe code. Maybe this behavior is different if we use simd feature gate?

Here are the results ordered from worse to best (results not significant are not shown):

benchmark	variation (%)
nlike_utf8 scalar ends with	15.8
sum nulls 512	13.5
struct_array_from_vec 1024	8.0
array_slice 512	6.9
nlike_utf8 scalar contains	5.7
cast timestamp_ns to timestamp_s 512	5.1
record_batches_to_csv	3.9
sort nulls 2^12	3.5
sort nulls 2^10	3.0
min string 512	3.0
cast timestamp_ms to i64 512	2.3
nlike_utf8 scalar equals	2.1
struct_array_from_vec 512	1.8
take str 1024	1.4
nlike_utf8 scalar complex	1.3
array_slice 128	0.9
like_utf8 scalar complex	0.5
like_utf8 scalar contains	0.5
min 512	-1.0
sort 2^12	-1.1
sort 2^10	-1.1
like_utf8 scalar equals	-1.3
like_utf8 scalar starts with	-1.4
limit 512, 512	-1.4
cast time32s to time32ms 512	-1.9
subtract 512	-2.2
filter context f32 high selectivity	-2.2
add 512	-2.7
struct_array_from_vec 256	-2.7
divide_nulls_512	-2.9
sum 512	-3.0
add_nulls_512	-3.1
take str nulls 512	-3.5
multiply 512	-3.6
filter context u8 very low selectivity	-3.9
array_slice 2048	-4.3
cast date64 to date32 512	-4.5
take i32 nulls 1024	-4.8
min nulls string 512	-5.1
take i32 1024	-5.3
array_string_from_vec 256	-5.5
array_string_from_vec 128	-5.7
filter context u8 w NULLs very low selectivity	-6.4
filter context u8 low selectivity	-6.6
filter u8 high selectivity	-7.1
filter context u8 w NULLs high selectivity	-7.2
filter u8 very low selectivity	-7.4
struct_array_from_vec 128	-7.4
cast int64 to int32 512	-7.8
cast date32 to date64 512	-8.2
take i32 nulls 512	-8.3
equal_string_nulls_512	-8.4
take i32 512	-8.5
buffer_bit_ops and	-9.4
equal_512	-9.5
take str 512	-9.6
cast time64ns to time32s 512	-9.7
take bool 1024	-10.2
filter context u8 high selectivity	-11.2
filter u8 low selectivity	-11.2
equal_string_512	-12.2
array_from_vec 256	-12.5
take bool 512	-12.8
cast time32s to time64us 512	-15.6
buffer_bit_ops or	-17.0
eq scalar Float32	-17.6
lt_eq scalar Float32	-17.9
lt scalar Float32	-18.2
array_from_vec 512	-19.4
gt_eq scalar Float32	-19.5
take bool nulls 1024	-19.7
lt_eq Float32	-19.8
eq Float32	-19.9
gt_eq Float32	-20.2
filter context u8 w NULLs low selectivity	-20.4
neq scalar Float32	-21.1
gt scalar Float32	-21.5
and	-21.8
or	-22.1
not	-22.6
take bool nulls 512	-22.7
cast int32 to int64 512	-23.0
min nulls 512	-23.2
array_from_vec 128	-23.4
cast float64 to uint64 512	-24.3
neq Float32	-24.8
lt Float32	-24.9
gt Float32	-25.6
cast float64 to float32 512	-25.9
cast int32 to float64 512	-27.6
equal_nulls_512	-28.0
cast int32 to uint32 512	-30.4
cast int32 to float32 512	-33.3
cast float32 to int32 512	-35.0

github-actions · 2020-11-28T21:21:28Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

nevi-me · 2020-11-29T08:49:44Z

Perhaps unsurprisingly to you, but a bit to me, this leads to a significant improvement over almost all benches

You can consider me surprised too.

This is very interesting, can one then say that the alignment requirements from arrow::memory are the main cause of the difference? If we only enforce alignment at boundaries like IPC and FFI, could we still be able to use Vec<u8> internally? I don't think it should also be much of an issue for Parquet, as we currently materialise Arrow data into the primitives that Parquet supports, due to the way arrays are indexed for definition levels.

jhorstmann · 2020-11-29T09:27:39Z

I'm surprised too. It might just be the removed assertions, but I did not expect them to have measurable overhead. If you want to investigate further you could try only removing those or replacing with debug_assert. I can't easily reproduce it on my notebook since the variations per run are too high, would need to spin up another ec2 instance to run stable benchmarks.

Reading the columnnar specification again, the alignment is only a recommendation, and only required when serialized. I'm not familiar with that part of the code, but I assume it already needs to ensure the required padding.

The only problem I could see would be with shared memory or FFI if the other side relies on the padding. I think it already can't rely on 64byte alignment, because arrays can be arbitrary slices of the underlying buffers. But relying on padding could happen when accessing data using vector instructions.

It seems rust will soon get support for custom allocators for Vec, that way we could get both a simplified internal api and still ensure padded allocations using a custom allocator.

TimDiekmann · 2020-11-29T09:56:04Z

It seems rust will soon get support for custom allocators for Vec, that way we could get both a simplified internal api and still ensure padded allocations using a custom allocator.

Let me hook in here for a moment. Although it is true that custom allocators for Vec is now implemented, be aware that this is not yet stable and the API may undergo some changes. Without #[feature(allocator_api)] it's not even possible to declare Vec<T, _> or Vec<T, Global>.

ritchie46 · 2020-11-29T11:31:13Z

Do these benchmarks also include writing to the buffers, or only allocating/deallocating?

I ask because I already used some sort of wrapper around a Vec. This is just a wrapper around a rust Vec that has overwritten reserve to use Arrows memory aligned allocation.

However, for writing and indexing it uses the methods default to Vec. I found this to be a lot faster for creating Arrow arrays. I use it when I know I don't have null values.

Dandandan · 2020-11-29T11:37:00Z

I did some profiling with Valgrind yesterday. I don't have a very good knowledge currently of the workings memory but there I saw as well that some of the lines in the code related to MutableBuffer::extend_from_slice in terms of instruction cycles as they are called a lot in the code / kernels. I think a few "micro optimizations" are possible there.

It makes sense to me that switching to Vec<u8> benefits from it being optimized/benchmarked already extremely well.

jorgecarleitao · 2020-11-29T12:07:56Z

Thanks a lot for all the comments so far.

@jhorstmann really good points. Unfortunately, I do not think it is just the asserts, because Vec performs the same asserts as it is also safe code.

wrt to FFI, @jhorstmann and @nevi-me : I don't think FFI can rely on it: as @jhorstmann mentions, since this only a recommendation, implementations must be able to handle non-aligned buffers. The Rust implementation is even funnier here, because the C data interface has no API to export Buffer::offset (only Array::offset). This implies that we need to offset pointer by Buffer::offset when we export to the C data interface (details on #8401). I think that this makes the receiving end unable to determine whether the allocated region is aligned or not.

I think that this Buffer::offset may also destroy the benefit of alignment on our own implementation as ArrayData::data will output a non-aligned bytes slice whenever Buffer::offset is not 0. To use the aligned memory, I think we would need to use the data without the offset, perform the SIMD operation in chunks of 64 bytes starting at the beginning of the buffer, and then pass the offset to the new buffer.

@ritchie46 good question. The benchmarks include allocations and mutations, as they cover a wide range of situations.

@Dandandan that is also my current hypothesis: the implementation is competing with some of the brightest minds when we try to re-invent a Vec, and the benefits of 64-byte aligned memory do not overcome the benefits of a highly optimized container (Vec).

houqp · 2020-11-30T07:30:54Z

Great simplification indeed 👍 I have seen conflicting assertions about misaligned access in simd online, from minor overhead to significant performance impact. I am now very curious what the benchmark result will look like with simd feature gate turned on.

alamb

@jorgecarleitao -- in general I personally would advocate for the approach of "Use a simpler, Vec based implementation, unless there are compelling performance benchmarks that show the more complicated implementation is going faster"

Like the other reviewers have mentioned, I think there we are at the point of this project where ad-hoc execution micro benches on a variety of hardware (our various dev machines) is likely insufficiently precise to drive decisions such as this.

As part of https://github.com/influxdata/influxdb_iox we are actively planning on getting a regular benchmarking story in place -- I will see if I can figure out a good way to get the arrow micro benches included

But all in all, really nice @jorgecarleitao

Dandandan · 2020-12-01T07:35:59Z

@jorgecarleitao
Maybe I'm saying something weird/impossible, but would it also be possible/ beneficial to store the (mutable) buffer in a Vec<T>?
This way it could simplify mutation of the buffer for the different types, while also relying less on unsafe code / code that could segfault or lead to other errors when using it wrong. In profiling/benchmarks I saw there are mayor inefficiencies related to writing values as individual bytes / instead of being able to store them directly in the builder API (e.g. in the append function).

For the rest, I think it really makes sense to push this idea forward as the current implementation is much more complicated without a good reason. I think using Vec it will be actually easier to optimize for performance. I agree with @alamb that getting rid of the other code is beneficial, the benchmarks at least don't show clear regressions.

Really look forward to those benchmarks too @alamb !

jorgecarleitao · 2020-12-02T16:19:08Z

@alamb I agree with the simplest but no simpler rule.

I also agree with your concerns about the benches being run on ad-hoc hardware. It makes it more difficult to reproduce and draw conclusions.

@Dandandan , I do not think so, but you may have better ideas than me:

The way I currently see it, ArrayData is 'array-type'-independent. If we make buffers generic over T, we need to find a way to write ArrayData. We could make it logic-dependent, but then we lose the flexibility of a non-generic ArrayData, particularly on composite types such as ListArray, which have childs of generic types.

One way out would be to make ArrayData's dynamically typed, so that it can hold arbitrary childs, but I think that at some point we will need to downcast them as we will need to extract which type T their buffers contain. This is just my analysis and in no way a definitive answer about this, though ^_^

jorgecarleitao · 2020-12-02T16:46:08Z

@houqp I will benchmark against simd and will post the results on the PR's table ASAP.

ritchie46 · 2020-12-02T20:30:29Z

ArrayData is 'array-type'-independent. If we make buffers generic over T, we need to find a way to write ArrayData

Would it perhaps be possible to have a hybrid solution? The buffer remains typeless Vec<u8>, but the public API exposes generic typed methods like Buffer::push::<T>(). Then some code regarding type conversion to bytes, alignment etc. could be abstracted.

jorgecarleitao · 2020-12-12T05:07:23Z

So, you for a quick update:

I haven't run the benches for SIMD yet. This requires a nightly run on my computer and I have been forgetting
A green CI on this is dependent on removing an unsafe struct on the parquet crate that I have so far been unable to, see ARROW-10804: [Rust] Removed some unsafe code from the parquet crate #8829

I was trying to finalize this before the 3.0 but this seems now unlikely as I can't figure out a way to make #8829 CI green.

codecov-io · 2020-12-13T07:35:49Z

Codecov Report

Merging #8796 (8a1c52c) into master (091df20) will decrease coverage by 0.12%.
The diff coverage is 83.60%.

@@            Coverage Diff             @@
##           master    #8796      +/-   ##
==========================================
- Coverage   83.22%   83.10%   -0.13%     
==========================================
  Files         196      195       -1     
  Lines       48232    47977     -255     
==========================================
- Hits        40142    39869     -273     
- Misses       8090     8108      +18

Impacted Files	Coverage Δ
rust/arrow/src/array/array_primitive.rs	`90.79% <ø> (-1.48%)`	⬇️
rust/arrow/src/array/array_union.rs	`87.55% <ø> (-2.24%)`	⬇️
rust/arrow/src/array/data.rs	`93.40% <0.00%> (-3.85%)`	⬇️
rust/arrow/src/array/raw_pointer.rs	`100.00% <ø> (ø)`
rust/arrow/src/bitmap.rs	`84.74% <0.00%> (-6.78%)`	⬇️
rust/arrow/src/compute/kernels/comparison.rs	`96.28% <ø> (ø)`
rust/arrow/src/bytes.rs	`41.37% <50.00%> (-12.68%)`	⬇️
rust/arrow/src/array/array_binary.rs	`90.73% <100.00%> (ø)`
rust/arrow/src/array/array_boolean.rs	`86.50% <100.00%> (-0.22%)`	⬇️
rust/arrow/src/array/array_list.rs	`92.74% <100.00%> (-0.38%)`	⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 091df20...8a1c52c. Read the comment docs.

jorgecarleitao · 2020-12-15T08:07:06Z

Small update: this is blocked by segfaults coming from the SIMD implementation, as you can see from the logs on the test of the SIMD feature.

I think that they come from the sum.

I though that the SIMD implementation would not make assumptions about memory alignment or minimum buffer size. Need some investigation.

jhorstmann · 2020-12-15T09:10:05Z

@jorgecarleitao I had a quick look at the failing test, that one actually uses the addition kernel to prepare test data and I think that is where the problem is. I'll try to find time to have a deeper look at it today.

jorgecarleitao · 2020-12-15T10:06:17Z

@jorgecarleitao I had a quick look at the failing test, that one actually uses the addition kernel to prepare test data and I think that is where the problem is. I'll try to find time to have a deeper look at it today.

Thank you so much, @jhorstmann , really appreciated.

alamb · 2020-12-15T21:38:08Z

Possibly related to #8929

jorgecarleitao · 2020-12-19T12:47:54Z

I have now rebased this against master. After @jhorstmann fix to the out of bounds on #8954, it now runs correctly.

Here are the results:

no SIMD

git checkout master
cargo bench --benches
git checkout buffer2
cargo bench --benches

benchmark	variation (%)
cast date64 to date32 512	63.2
cast int32 to float64 512	52.4
cast float64 to float32 512	43.0
cast time64ns to time32s 512	42.9
filter context f32 very low selectivity	40.5
cast date32 to date64 512	37.0
cast int32 to float32 512	35.5
struct_array_from_vec 1024	32.6
lt scalar Float32	32.4
concat str 1024	31.6
cast int32 to int64 512	29.8
cast float64 to uint64 512	27.7
filter context u8 very low selectivity	26.8
take str 1024	26.6
concat str nulls 1024	26.3
take str null indices 1024	25.2
take str null values 1024	25.1
struct_array_from_vec 512	24.8
cast float32 to int32 512	20.3
filter u8 very low selectivity	19.7
take str null indices 512	19.2
take str 512	18.7
cast time32s to time64us 512	17.3
nlike_utf8 scalar equals	16.1
struct_array_from_vec 256	15.7
nlike_utf8 scalar ends with	13.4
take i32 nulls 512	13.0
take i32 512	11.7
take i32 nulls 1024	11.1
take str null values null indices 1024	10.3
take i32 1024	10.2
filter context u8 low selectivity	10.0
cast int32 to uint32 512	9.7
filter u8 low selectivity	9.4
filter context u8 w NULLs low selectivity	7.4
min string 512	6.9
like_utf8 scalar equals	6.5
like_utf8 scalar complex	6.0
filter context f32 high selectivity	5.6
min nulls 512	5.3
like_utf8 scalar contains	5.1
divide 512	4.8
nlike_utf8 scalar contains	4.8
concat i32 1024	4.5
divide_nulls_512	4.4
struct_array_from_vec 128	4.2
take bool nulls 512	2.2
equal_512	2.0
take bool 512	1.4
min nulls string 512	1.4
array_string_from_vec 512	1.1
nlike_utf8 scalar complex	0.9
gt_eq scalar Float32	0.9
eq scalar Float32	0.7
cast int64 to int32 512	0.7
cast int32 to int32 512	0.5
neq scalar Float32	0.3
cast timestamp_ns to timestamp_s 512	0.2
gt Float32	-0.3
min 512	-0.3
lt Float32	-0.4
sort nulls 2^12	-0.4
sort 2^12	-0.6
not	-0.8
array_string_from_vec 256	-0.9
sort nulls 2^10	-1.1
nlike_utf8 scalar starts with	-1.5
max nulls 512	-1.7
array_slice 512	-1.7
filter context u8 w NULLs very low selectivity	-1.8
cast timestamp_ms to i64 512	-1.8
length	-2.0
and	-2.3
or	-2.6
add 512	-2.6
like_utf8 scalar starts with	-2.9
limit 512, 512	-3.4
multiply 512	-3.5
cast time32s to time32ms 512	-3.6
array_from_vec 512	-3.7
subtract 512	-3.9
take bool nulls 1024	-4.3
add_nulls_512	-4.9
cast timestamp_ms to timestamp_ns 512	-5.2
sum 512	-5.3
like_utf8 scalar ends with	-6.1
array_string_from_vec 128	-6.2
filter context f32 low selectivity	-8.2
buffer_bit_ops or	-9.0
equal_string_nulls_512	-9.2
array_from_vec 256	-9.3
array_from_vec 128	-10.5
equal_string_512	-12.1
filter context u8 high selectivity	-12.2
filter u8 high selectivity	-12.8
buffer_bit_ops and	-13.0
take bool 1024	-13.3
filter context u8 w NULLs high selectivity	-13.3
sum nulls 512	-14.6

SIMD

git checkout master
cargo bench --benches --features simd
git checkout buffer2
cargo bench --benches --features simd

benchmark	variation (%)
cast date64 to date32 512	64.0
cast date32 to date64 512	49.5
cast float64 to float32 512	46.1
cast time64ns to time32s 512	44.2
filter context f32 very low selectivity	43.9
lt scalar Float32	39.1
lt_eq Float32	38.7
cast int32 to int64 512	35.7
struct_array_from_vec 1024	35.4
lt_eq scalar Float32	35.4
eq scalar Float32	34.2
neq Float32	32.1
cast int32 to float64 512	31.6
concat str 1024	31.4
gt Float32	30.3
neq scalar Float32	29.8
filter context u8 very low selectivity	28.2
equal_nulls_512	27.3
eq Float32	27.1
struct_array_from_vec 512	26.1
cast float64 to uint64 512	25.9
lt Float32	24.8
filter context u8 low selectivity	24.5
filter u8 low selectivity	24.2
gt_eq Float32	23.6
cast time32s to time64us 512	23.6
cast float32 to int32 512	22.5
multiply 512	21.2
buffer_bit_ops and	20.3
gt_eq scalar Float32	20.1
take str 1024	19.5
subtract 512	19.5
cast int32 to float32 512	19.0
take str null indices 1024	19.0
take str null values 1024	17.4
and	17.4
struct_array_from_vec 256	16.8
or	16.1
not	15.8
take str 512	15.1
cast int32 to uint32 512	14.8
add_nulls_512	14.2
add 512	14.0
take str null indices 512	13.6
filter u8 very low selectivity	12.9
gt scalar Float32	12.5
take i32 512	12.5
filter context u8 w NULLs low selectivity	12.4
concat i32 nulls 1024	10.5
concat str nulls 1024	10.1
min string 512	9.5
equal_string_nulls_512	9.2
take i32 1024	8.3
take i32 nulls 1024	8.0
array_slice 2048	7.6
take i32 nulls 512	7.5
divide_nulls_512	7.1
take str null values null indices 1024	6.8
cast time32s to time32ms 512	6.0
divide 512	5.5
array_string_from_vec 512	4.6
min 512	4.6
array_slice 512	4.4
concat i32 1024	4.0
cast timestamp_ms to timestamp_ns 512	3.2
struct_array_from_vec 128	2.8
array_string_from_vec 256	2.7
filter context f32 high selectivity	2.6
array_slice 128	2.5
nlike_utf8 scalar complex	2.2
cast timestamp_ms to i64 512	1.9
like_utf8 scalar complex	1.8
equal_string_512	1.6
sort nulls 2^10	1.6
equal_512	1.2
take bool nulls 512	1.1
limit 512, 512	0.8
max nulls 512	0.6
sum nulls 512	0.4
min nulls 512	0.3
max 512	0.3
cast timestamp_ns to timestamp_s 512	-0.2
sort 2^12	-0.3
filter context f32 low selectivity	-1.0
array_string_from_vec 128	-1.4
sort 2^10	-1.4
cast int64 to int32 512	-2.6
take bool nulls 1024	-2.9
buffer_bit_ops or	-3.2
take bool 512	-4.0
filter context u8 high selectivity	-4.5
filter context u8 w NULLs very low selectivity	-4.6
length	-5.1
min nulls string 512	-5.3
nlike_utf8 scalar contains	-5.6
take bool 1024	-6.1
filter u8 high selectivity	-6.2
array_from_vec 256	-6.3
array_from_vec 512	-7.0
like_utf8 scalar contains	-7.1
filter context u8 w NULLs high selectivity	-8.2
array_from_vec 128	-9.3
nlike_utf8 scalar starts with	-15.6
nlike_utf8 scalar equals	-18.5
nlike_utf8 scalar ends with	-20.1
like_utf8 scalar ends with	-23.5
like_utf8 scalar starts with	-27.8
like_utf8 scalar equals	-43.8
record_batches_to_csv	-52.9

jhorstmann · 2020-12-20T11:19:01Z

rust/arrow/src/buffer.rs

-            memory::memcpy(buffer, slice.as_ptr(), len);
-            Buffer::build_with_arguments(buffer, len, Deallocation::Native(capacity))
-        }
+        let bytes = unsafe { Bytes::new(p.as_ref().to_vec(), Deallocation::Native) };


There could be potential for further optimization here: to_vec has to copy the slice contents, a separate implementation of From<Vec<u8>> or From<Vec<ArrowPrimitiveType>> could avoid that copy and speed up several kernels involving primitives or list offsets.

As a From implementation that would give a "conflicting implementations" error, an explicit from_vec method could work. I'd suggest trying it in a separate PR as it could change a bunch of code not directly related to the refactoring in this PR.

jorgecarleitao · 2021-01-05T19:18:24Z

I am closing this in favor of #9076 where the performance is fixed.

The gist is that there were two reasons for the performance issues:

we were using std::alloc::alloc_zero instead of std::alloc::alloc
we were converting everything to a byte slice instead of writing directly to the buffer

That PR addresses them both and brings the MutableBuffer to be faster than Vec

@Dandandan

This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing its level of indirection, that does not allow the compiler to optimize out some code. This is the second performance improvement originally presented in #8796 and, together with #9027 , brings the performance of "MutableBuffer" to the same level as `Vec<u8>`, in particular to building buffers on the fly. Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code. This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature ``` pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T]) pub fn push<T: ToByteSlice>(&mut self, item: &T) ``` i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing). Credits for the root cause analysis that lead to this PR go to @Dandandan, [originally fielded here](#9016 (comment)). > [...] current conversion to a byte slice may add some overhead? - @Dandandan Benches (against master, so, both this PR and #9044 ): ``` Switched to branch 'perf_buffer' Your branch and 'origin/perf_buffer' have diverged, and have 6 and 1 different commits each, respectively. (use "git pull" to merge the remote branch into yours) Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow) Finished bench [optimized] target(s) in 1m 00s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471 Gnuplot not found, using plotters backend mutable time: [463.11 us 463.57 us 464.07 us] change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) high mild 9 (9.00%) high severe mutable prepared time: [527.84 us 528.46 us 529.14 us] change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe Benchmarking from_slice: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60. from_slice time: [1.1968 ms 1.1979 ms 1.1991 ms] change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) high mild 7 (7.00%) high severe from_slice prepared time: [917.49 us 918.89 us 920.60 us] change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe ``` Closes #9076 from jorgecarleitao/perf_buffer Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

@Dandandan

This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing its level of indirection, that does not allow the compiler to optimize out some code. This is the second performance improvement originally presented in #8796 and, together with #9027 , brings the performance of "MutableBuffer" to the same level as `Vec<u8>`, in particular to building buffers on the fly. Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code. This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature ``` pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T]) pub fn push<T: ToByteSlice>(&mut self, item: &T) ``` i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing). Credits for the root cause analysis that lead to this PR go to @Dandandan, [originally fielded here](#9016 (comment)). > [...] current conversion to a byte slice may add some overhead? - @Dandandan Benches (against master, so, both this PR and #9044 ): ``` Switched to branch 'perf_buffer' Your branch and 'origin/perf_buffer' have diverged, and have 6 and 1 different commits each, respectively. (use "git pull" to merge the remote branch into yours) Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow) Finished bench [optimized] target(s) in 1m 00s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471 Gnuplot not found, using plotters backend mutable time: [463.11 us 463.57 us 464.07 us] change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) high mild 9 (9.00%) high severe mutable prepared time: [527.84 us 528.46 us 529.14 us] change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe Benchmarking from_slice: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60. from_slice time: [1.1968 ms 1.1979 ms 1.1991 ms] change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) high mild 7 (7.00%) high severe from_slice prepared time: [917.49 us 918.89 us 920.60 us] change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe ``` Closes #9076 from jorgecarleitao/perf_buffer Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

@Dandandan

This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing its level of indirection, that does not allow the compiler to optimize out some code. This is the second performance improvement originally presented in apache#8796 and, together with apache#9027 , brings the performance of "MutableBuffer" to the same level as `Vec<u8>`, in particular to building buffers on the fly. Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code. This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature ``` pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T]) pub fn push<T: ToByteSlice>(&mut self, item: &T) ``` i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing). Credits for the root cause analysis that lead to this PR go to @Dandandan, [originally fielded here](apache#9016 (comment)). > [...] current conversion to a byte slice may add some overhead? - @Dandandan Benches (against master, so, both this PR and apache#9044 ): ``` Switched to branch 'perf_buffer' Your branch and 'origin/perf_buffer' have diverged, and have 6 and 1 different commits each, respectively. (use "git pull" to merge the remote branch into yours) Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow) Finished bench [optimized] target(s) in 1m 00s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471 Gnuplot not found, using plotters backend mutable time: [463.11 us 463.57 us 464.07 us] change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) high mild 9 (9.00%) high severe mutable prepared time: [527.84 us 528.46 us 529.14 us] change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe Benchmarking from_slice: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60. from_slice time: [1.1968 ms 1.1979 ms 1.1991 ms] change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) high mild 7 (7.00%) high severe from_slice prepared time: [917.49 us 918.89 us 920.60 us] change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe ``` Closes apache#9076 from jorgecarleitao/perf_buffer Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

@Dandandan

This PR refactors `MutableBuffer::extend_from_slice` to remove the need to use `to_byte_slice` on every call, thereby removing its level of indirection, that does not allow the compiler to optimize out some code. This is the second performance improvement originally presented in apache#8796 and, together with apache#9027 , brings the performance of "MutableBuffer" to the same level as `Vec<u8>`, in particular to building buffers on the fly. Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code. This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature ``` pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T]) pub fn push<T: ToByteSlice>(&mut self, item: &T) ``` i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing). Credits for the root cause analysis that lead to this PR go to @Dandandan, [originally fielded here](apache#9016 (comment)). > [...] current conversion to a byte slice may add some overhead? - @Dandandan Benches (against master, so, both this PR and apache#9044 ): ``` Switched to branch 'perf_buffer' Your branch and 'origin/perf_buffer' have diverged, and have 6 and 1 different commits each, respectively. (use "git pull" to merge the remote branch into yours) Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow) Finished bench [optimized] target(s) in 1m 00s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471 Gnuplot not found, using plotters backend mutable time: [463.11 us 463.57 us 464.07 us] change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) high mild 9 (9.00%) high severe mutable prepared time: [527.84 us 528.46 us 529.14 us] change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe Benchmarking from_slice: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60. from_slice time: [1.1968 ms 1.1979 ms 1.1991 ms] change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 3 (3.00%) high mild 7 (7.00%) high severe from_slice prepared time: [917.49 us 918.89 us 920.60 us] change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe ``` Closes apache#9076 from jorgecarleitao/perf_buffer Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

github-actions bot added the Component: Rust label Nov 28, 2020

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020

alamb reviewed Nov 30, 2020

View reviewed changes

github-actions bot added needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 3, 2020

jorgecarleitao closed this Dec 4, 2020

jorgecarleitao deleted the buffer2 branch December 4, 2020 07:41

jorgecarleitao restored the buffer2 branch December 4, 2020 07:41

jorgecarleitao reopened this Dec 4, 2020

github-actions bot added Component: Parquet needs-rebase A PR that needs to be rebased by the author and removed needs-rebase A PR that needs to be rebased by the author labels Dec 4, 2020

jorgecarleitao mentioned this pull request Dec 7, 2020

ARROW-10804: [Rust] Removed some unsafe code from the parquet crate #8829

Closed

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 11, 2020

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 13, 2020

jorgecarleitao added 3 commits December 19, 2020 08:56

Removed lots of unsafe code.

afabc5e

Simplified API for buffer.

666ac94

Replaced deprecated API.

8a1c52c

jhorstmann reviewed Dec 20, 2020

View reviewed changes

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 24, 2020

This was referenced Dec 28, 2020

ARROW-11045: [Rust] Fix performance issues of allocator #9027

Closed

ARROW-11108: [Rust] Fixed performance issue in mutableBuffer. #9076

Closed

jorgecarleitao closed this Jan 5, 2021

jorgecarleitao deleted the buffer2 branch January 5, 2021 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Rust] [Experiment] Vec<u8> vs current allocations #8796

[Rust] [Experiment] Vec<u8> vs current allocations #8796

jorgecarleitao commented Nov 28, 2020 •

edited

Loading

github-actions bot commented Nov 28, 2020

nevi-me commented Nov 29, 2020

jhorstmann commented Nov 29, 2020

TimDiekmann commented Nov 29, 2020 •

edited

Loading

ritchie46 commented Nov 29, 2020 •

edited

Loading

Dandandan commented Nov 29, 2020

jorgecarleitao commented Nov 29, 2020

houqp commented Nov 30, 2020

alamb left a comment

Dandandan commented Dec 1, 2020 •

edited

Loading

jorgecarleitao commented Dec 2, 2020

jorgecarleitao commented Dec 2, 2020

ritchie46 commented Dec 2, 2020

jorgecarleitao commented Dec 12, 2020

codecov-io commented Dec 13, 2020 •

edited

Loading

jorgecarleitao commented Dec 15, 2020

jhorstmann commented Dec 15, 2020

jorgecarleitao commented Dec 15, 2020

alamb commented Dec 15, 2020

jorgecarleitao commented Dec 19, 2020 •

edited

Loading

jhorstmann Dec 20, 2020

jorgecarleitao commented Jan 5, 2021

[Rust] [Experiment] Vec<u8> vs current allocations #8796

[Rust] [Experiment] Vec<u8> vs current allocations #8796

Conversation

jorgecarleitao commented Nov 28, 2020 • edited Loading

github-actions bot commented Nov 28, 2020

nevi-me commented Nov 29, 2020

jhorstmann commented Nov 29, 2020

TimDiekmann commented Nov 29, 2020 • edited Loading

ritchie46 commented Nov 29, 2020 • edited Loading

Dandandan commented Nov 29, 2020

jorgecarleitao commented Nov 29, 2020

houqp commented Nov 30, 2020

alamb left a comment

Choose a reason for hiding this comment

Dandandan commented Dec 1, 2020 • edited Loading

jorgecarleitao commented Dec 2, 2020

jorgecarleitao commented Dec 2, 2020

ritchie46 commented Dec 2, 2020

jorgecarleitao commented Dec 12, 2020

codecov-io commented Dec 13, 2020 • edited Loading

Codecov Report

jorgecarleitao commented Dec 15, 2020

jhorstmann commented Dec 15, 2020

jorgecarleitao commented Dec 15, 2020

alamb commented Dec 15, 2020

jorgecarleitao commented Dec 19, 2020 • edited Loading

no SIMD

SIMD

jhorstmann Dec 20, 2020

Choose a reason for hiding this comment

jorgecarleitao commented Jan 5, 2021

jorgecarleitao commented Nov 28, 2020 •

edited

Loading

TimDiekmann commented Nov 29, 2020 •

edited

Loading

ritchie46 commented Nov 29, 2020 •

edited

Loading

Dandandan commented Dec 1, 2020 •

edited

Loading

codecov-io commented Dec 13, 2020 •

edited

Loading

jorgecarleitao commented Dec 19, 2020 •

edited

Loading