GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

felipecrv · 2024-09-04T15:04:29Z

Rationale for this change

Improve tests and add benchmarks. I wrote the tests and benchmarks while trying to improve the performance of ResolveMany and failing at it.

What changes are included in this PR?

Tests, benchmarks, and changes that don't really affect performance but might unlock more optimization opportunities in the future.

Are these changes tested?

Yes.

GitHub Issue: [C++] No micro-benchmarks for ChunkResolver::ResolveMany() #43953

github-actions · 2024-09-04T15:05:55Z

⚠️ GitHub issue #43953 has been automatically assigned in GitHub to PR creator.

felipecrv · 2024-09-04T15:09:22Z

The chunk_hint stuff really works. I forgot to plot it, but without it the throughput gains don't exist when indices are sorted.

pitrou · 2024-09-04T15:31:22Z

Results here (AMD Zen 2 CPU, gcc):

ResolveManyUInt16Random/65535/10000/508         7072 ns         7071 ns        98917 items_per_second=71.8471M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Random/65535/100/508           4098 ns         4097 ns       170804 items_per_second=123.986M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Random/65535/10000/32764    1812677 ns      1812178 ns          393 items_per_second=18.0799M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Random/65535/100/32764       723697 ns       723335 ns          876 items_per_second=45.2958M/s logical_len=65.535k num_chunks=100

ResolveManyUInt32Random/65535/10000/508         7796 ns         7795 ns        88954 items_per_second=65.1731M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Random/65535/100/508           4397 ns         4396 ns       157489 items_per_second=115.566M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Random/65535/10000/32764    1848926 ns      1848181 ns          379 items_per_second=17.7277M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Random/65535/100/32764       752273 ns       752045 ns          917 items_per_second=43.5665M/s logical_len=65.535k num_chunks=100

ResolveManyUInt64Random/65535/10000/508         6731 ns         6730 ns       103505 items_per_second=75.4878M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Random/65535/100/508           4042 ns         4041 ns       172940 items_per_second=125.717M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Random/65535/10000/32764    1809563 ns      1809068 ns          387 items_per_second=18.111M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Random/65535/100/32764       731165 ns       730968 ns          953 items_per_second=44.8227M/s logical_len=65.535k num_chunks=100

ResolveManyUInt16Sorted/65535/10000/508         6233 ns         6231 ns       111713 items_per_second=81.5236M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Sorted/65535/100/508            955 ns          955 ns       728588 items_per_second=532.031M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Sorted/65535/10000/32764     314016 ns       313931 ns         2221 items_per_second=104.367M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Sorted/65535/100/32764        41365 ns        41356 ns        17068 items_per_second=792.247M/s logical_len=65.535k num_chunks=100

ResolveManyUInt32Sorted/65535/10000/508         7176 ns         7174 ns        91879 items_per_second=70.8084M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Sorted/65535/100/508            928 ns          928 ns       749739 items_per_second=547.568M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Sorted/65535/10000/32764     326595 ns       326507 ns         2145 items_per_second=100.347M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Sorted/65535/100/32764        34162 ns        34154 ns        20518 items_per_second=959.309M/s logical_len=65.535k num_chunks=100

ResolveManyUInt64Sorted/65535/10000/508         6057 ns         6055 ns       113029 items_per_second=83.8968M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Sorted/65535/100/508            994 ns          993 ns       700153 items_per_second=511.41M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Sorted/65535/10000/32764     316107 ns       316023 ns         2212 items_per_second=103.676M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Sorted/65535/100/32764        38441 ns        38427 ns        20599 items_per_second=852.64M/s logical_len=65.535k num_chunks=100

pitrou · 2024-09-04T15:32:47Z

Ironically, uint32 seems slightly slower than both uint16 and uint64. Not sure that's due to the compiler or to the CPU.

pitrou · 2024-09-04T15:33:35Z

cpp/src/arrow/chunk_resolver_benchmark.cc

+
+template <typename IndexType>
+void ResolveManySetArgs(benchmark::internal::Benchmark* bench) {
+  constexpr int32_t kNonAligned = 3;


Can you explain what this is for?

The optimizations I was experimenting with involved some unrolling so I didn't want input values to neatly align to powers of 2.

pitrou · 2024-09-04T15:35:46Z

cpp/src/arrow/chunk_resolver_benchmark.cc

+    case 2:
+    case 4:
+    case 8:
+      bench->Args({kChunkedArrayLength, /*num_chunks*/ 10000, kNumIndicesFew});


10000 chunks is really a lot and I'm not sure it's really useful to test different numbers of chunks. By accumulating different combinations of parameters we make the benchmark results less immediately readable.

The huge number is necessary (even though it's a bit unrealistic) to measure how effective the binary search is at reducing the search space (proportional to the number of chunks).

pitrou · 2024-09-04T15:37:01Z

cpp/src/arrow/chunked_array.cc

@@ -55,7 +55,7 @@ ChunkedArray::ChunkedArray(ArrayVector chunks, std::shared_ptr<DataType> type)
        << "cannot construct ChunkedArray from empty vector and omitted type";
    type_ = chunks_[0]->type();
  }
-
+  ARROW_CHECK_LE(chunks.size(), std::numeric_limits<int>::max());


Is this limit useful if it ends up not making performance better anyway?

This is useful in ensuring we are not generating more chunks than can be addressed by the index type.

cpp/src/arrow/chunked_array_test.cc

felipecrv · 2024-09-04T16:56:37Z

Ironically, uint32 seems slightly slower than both uint16 and uint64. Not sure that's due to the compiler or to the CPU.

Same on 12th Gen Intel(R) Core(TM) i7-12800H. This is the reason I changed lo and hi to uint32_t and nothing changed in the benchmarks.

Things are saner on the Apple M1 Pro with 32-bit being the fastest.

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 14.23, 6.75, 4.28
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
ResolveManyUInt16Random/65535/100/508       2769 ns         2747 ns       252543 items_per_second=184.959M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Random/65535/100/508       2715 ns         2689 ns       257431 items_per_second=188.93M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Random/65535/100/508       2928 ns         2757 ns       252188 items_per_second=184.281M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Sorted/65535/100/508       1023 ns         1012 ns       693694 items_per_second=501.751M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Sorted/65535/100/508        929 ns          920 ns       761052 items_per_second=551.881M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Sorted/65535/100/508        933 ns          926 ns       759878 items_per_second=548.86M/s logical_len=65.535k num_chunks=100

NOTE: I have a lot of stuff running on the M1 during this benchmark :)

pitrou · 2024-09-04T17:14:34Z

There are a number of CI failures that need fixing.

felipecrv · 2024-09-04T17:24:55Z

There are a number of CI failures that need fixing.

TIL uniform_int_distribution can't take (u)int8_t as type parameter on MSVC builds.

pitrou · 2024-09-12T13:26:15Z

The chunk_hint stuff really works. I forgot to plot it, but without it the throughput gains don't exist when indices are sorted.

Yes, it's also quite dramatic for the actual Sort implementation :-)

pitrou · 2024-09-12T13:49:54Z

There are still a couple CI failures that need fixing.

…ution

for consistency with the codebase style

pitrou · 2024-09-17T13:51:05Z

There are still a couple compilation errors on CI it seems :-)

felipecrv · 2024-09-23T23:04:37Z

@pitrou now it's all green. The only failure is unrelated.

pitrou · 2024-09-24T13:21:41Z

Just a question: do you mean to keep the 64-bit to 32-bit chunk index change?

pitrou

LGTM after a minor push. I'll merge if CI is green.

felipecrv · 2024-09-24T17:01:41Z

Just a question: do you mean to keep the 64-bit to 32-bit chunk index change?

Yes I do.

conbench-apache-arrow · 2024-09-24T22:21:14Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 83f35de.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added Component: C++ awaiting committer review Awaiting committer review labels Sep 4, 2024

felipecrv changed the title ~~[C++]~~ GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany Sep 4, 2024

apache deleted a comment from github-actions bot Sep 4, 2024

felipecrv requested a review from pitrou September 4, 2024 15:05

pitrou reviewed Sep 4, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Sep 4, 2024

felipecrv requested a review from pitrou September 11, 2024 17:27

felipecrv added 8 commits September 12, 2024 15:37

ChunkResolver: Implement tests based on large and random input

b703351

ChunkResolver: Add benchmarks to ResolveMany()

bfd69f8

ChunkResolver: Extract ResolveOneInline

a116ad8

ChunkedArray: CHECK the assumption that num_chunks() fits in 31 bits

ddbc621

ChunkResolver: Leverage the assumption that num_chunks() fits on 31 bits

45bafa9

ChunkResolver: Fix the order of declarations

5ef5ff8

chunked_array_test.cpp: Use uint64_t instead of size_t

2f1db85

resolve issues on 32-bit builds

462be72

felipecrv force-pushed the chunk_resolver_bench branch from 7aeaf3f to 462be72 Compare September 12, 2024 18:37

MSVC doesn't like byte-sized integers

b697d64

felipecrv added 4 commits September 12, 2024 21:39

chunk_resolver.h: Fiz sign-compare with a cast

b5d5d8e

chunked_array_test.cc: MSVC can't take int8_t for uniform_int_distrib…

6a2c590

…ution

Use int64_t instead of size_t as distribution parameter

5ea2a1e

for consistency with the codebase style

another one

ab4b8dd

and anothe one

c72d9f8

Set benchmark arg names instead of custom counters

43e7904

pitrou approved these changes Sep 24, 2024

View reviewed changes

pitrou merged commit 83f35de into apache:main Sep 24, 2024
36 of 40 checks passed

pitrou removed the awaiting change review Awaiting change review label Sep 24, 2024

pitrou mentioned this pull request Sep 24, 2024

[C++] No micro-benchmarks for ChunkResolver::ResolveMany() #43953

Closed

felipecrv deleted the chunk_resolver_bench branch September 24, 2024 16:59

felipecrv mentioned this pull request Oct 12, 2024

GH-34535: [C++] Move ChunkResolver to the public API #44357

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

felipecrv commented Sep 4, 2024 •

edited by github-actions bot

Loading

github-actions bot commented Sep 4, 2024

felipecrv commented Sep 4, 2024

pitrou commented Sep 4, 2024

pitrou commented Sep 4, 2024

pitrou Sep 4, 2024

felipecrv Sep 4, 2024

pitrou Sep 4, 2024

felipecrv Sep 4, 2024

pitrou Sep 4, 2024

felipecrv Sep 4, 2024

felipecrv commented Sep 4, 2024 •

edited

Loading

pitrou commented Sep 4, 2024

felipecrv commented Sep 4, 2024

pitrou commented Sep 12, 2024

pitrou commented Sep 12, 2024

pitrou commented Sep 17, 2024

felipecrv commented Sep 23, 2024

pitrou commented Sep 24, 2024

pitrou left a comment

felipecrv commented Sep 24, 2024

conbench-apache-arrow bot commented Sep 24, 2024

GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

Conversation

felipecrv commented Sep 4, 2024 • edited by github-actions bot Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

github-actions bot commented Sep 4, 2024

felipecrv commented Sep 4, 2024

pitrou commented Sep 4, 2024

pitrou commented Sep 4, 2024

pitrou Sep 4, 2024

Choose a reason for hiding this comment

felipecrv Sep 4, 2024

Choose a reason for hiding this comment

pitrou Sep 4, 2024

Choose a reason for hiding this comment

felipecrv Sep 4, 2024

Choose a reason for hiding this comment

pitrou Sep 4, 2024

Choose a reason for hiding this comment

felipecrv Sep 4, 2024

Choose a reason for hiding this comment

felipecrv commented Sep 4, 2024 • edited Loading

pitrou commented Sep 4, 2024

felipecrv commented Sep 4, 2024

pitrou commented Sep 12, 2024

pitrou commented Sep 12, 2024

pitrou commented Sep 17, 2024

felipecrv commented Sep 23, 2024

pitrou commented Sep 24, 2024

pitrou left a comment

Choose a reason for hiding this comment

felipecrv commented Sep 24, 2024

conbench-apache-arrow bot commented Sep 24, 2024

felipecrv commented Sep 4, 2024 •

edited by github-actions bot

Loading

felipecrv commented Sep 4, 2024 •

edited

Loading