
Refactor hash join with cuCollections multimap #8934

Merged (75 commits into rapidsai:branch-21.12, Nov 2, 2021)

Conversation

@PointKernel (Member) commented Aug 3, 2021

This PR refactors the existing hash join implementation to use the cuCollections `static_multimap`. Related functions now invoke the `cuco::static_multimap` host-bulk APIs, so join-specific kernels are no longer needed. Compared to the current cudf multimap, `cuco::static_multimap` applies several optimizations to improve overall performance:

  • Using CUDA Cooperative Groups & parallel retrieval to leverage multiple threads for a single hash map operation
  • Double hashing probe method instead of linear probing to reduce the chance of collisions
  • Relaxed atomic operations instead of expensive sequentially consistent atomics
  • Using vector loads for "packable" pairs (no larger than 4B/4B) to coarsen the workload granularity for each thread

The inner_join benchmarks are used to evaluate the new hash join's performance, with the following configuration:

  • multiplicity = 1
  • matching rate = 0.3
  • occupancy = 0.5
  • double hashing
  • CG size = 2
  • 75% nulls when nulls are present
inner_join_32bit

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 0 | 100000 | 100000 | 135.391 us | 15.35% | 268.328 us | 18.26% | 132.937 us | 98.19% | PASS |
| I32 | I32 | 0 | 100000 | 400000 | 248.937 us | 5.55% | 364.678 us | 10.10% | 115.740 us | 46.49% | PASS |
| I32 | I32 | 0 | 10000000 | 10000000 | 12.635 ms | 0.10% | 13.580 ms | 0.13% | 945.092 us | 7.48% | PASS |
| I32 | I32 | 0 | 10000000 | 40000000 | 37.934 ms | 0.05% | 43.083 ms | 0.08% | 5.148 ms | 13.57% | PASS |
| I32 | I32 | 0 | 10000000 | 100000000 | 88.486 ms | 0.20% | 108.750 ms | 0.05% | 20.264 ms | 22.90% | PASS |
| I32 | I32 | 0 | 80000000 | 100000000 | 124.207 ms | 0.04% | 130.181 ms | 0.04% | 5.974 ms | 4.81% | PASS |
| I32 | I32 | 0 | 100000000 | 100000000 | 134.720 ms | 0.12% | 139.181 ms | 0.03% | 4.462 ms | 3.31% | PASS |
| I32 | I32 | 0 | 10000000 | 240000000 | 218.672 ms | 0.04% | 295.274 ms | 0.03% | 76.602 ms | 35.03% | PASS |
| I32 | I32 | 0 | 80000000 | 240000000 | 242.677 ms | 0.03% | 266.421 ms | 0.04% | 23.744 ms | 9.78% | PASS |
| I32 | I32 | 0 | 100000000 | 240000000 | 253.245 ms | 0.04% | 273.783 ms | 0.04% | 20.538 ms | 8.11% | PASS |

inner_join_64bit

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I64 | I64 | 0 | 40000000 | 50000000 | 62.728 ms | 0.04% | 65.638 ms | 0.07% | 2.910 ms | 4.64% | PASS |
| I64 | I64 | 0 | 50000000 | 50000000 | 68.142 ms | 0.05% | 70.164 ms | 0.05% | 2.022 ms | 2.97% | PASS |
| I64 | I64 | 0 | 40000000 | 120000000 | 122.424 ms | 0.04% | 134.553 ms | 0.05% | 12.129 ms | 9.91% | PASS |
| I64 | I64 | 0 | 50000000 | 120000000 | 127.979 ms | 0.04% | 138.379 ms | 0.04% | 10.400 ms | 8.13% | PASS |

inner_join_32bit_nulls

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 1 | 100000 | 100000 | 118.400 us | 10.77% | 237.164 us | 10.76% | 118.764 us | 100.31% | PASS |
| I32 | I32 | 1 | 100000 | 400000 | 167.385 us | 7.46% | 275.823 us | 6.00% | 108.438 us | 64.78% | PASS |
| I32 | I32 | 1 | 10000000 | 10000000 | 3.699 ms | 0.28% | 3.053 ms | 0.24% | -646.276 us | -17.47% | PASS |
| I32 | I32 | 1 | 10000000 | 40000000 | 10.574 ms | 0.15% | 8.634 ms | 0.08% | -1939.859 us | -18.35% | PASS |
| I32 | I32 | 1 | 10000000 | 100000000 | 24.461 ms | 0.11% | 19.744 ms | 0.05% | -4716.875 us | -19.28% | PASS |
| I32 | I32 | 1 | 80000000 | 100000000 | 35.013 ms | 0.11% | 27.576 ms | 0.05% | -7437.653 us | -21.24% | PASS |
| I32 | I32 | 1 | 100000000 | 100000000 | 37.870 ms | 0.42% | 29.800 ms | 0.06% | -8069.778 us | -21.31% | PASS |
| I32 | I32 | 1 | 10000000 | 240000000 | 58.529 ms | 0.19% | 46.316 ms | 0.02% | -12213.277 us | -20.87% | PASS |
| I32 | I32 | 1 | 80000000 | 240000000 | 67.443 ms | 0.12% | 53.721 ms | 0.49% | -13722.214 us | -20.35% | PASS |
| I32 | I32 | 1 | 100000000 | 240000000 | 70.809 ms | 0.50% | 55.938 ms | 0.03% | -14870.980 us | -21.00% | PASS |

inner_join_64bit_nulls

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I64 | I64 | 1 | 40000000 | 50000000 | 17.560 ms | 0.21% | 13.926 ms | 0.09% | -3634.422 us | -20.70% | PASS |
| I64 | I64 | 1 | 50000000 | 50000000 | 19.070 ms | 0.14% | 15.082 ms | 0.07% | -3987.592 us | -20.91% | PASS |
| I64 | I64 | 1 | 40000000 | 120000000 | 34.253 ms | 0.13% | 27.170 ms | 0.05% | -7082.957 us | -20.68% | PASS |
| I64 | I64 | 1 | 50000000 | 120000000 | 35.807 ms | 0.19% | 28.340 ms | 0.03% | -7467.177 us | -20.85% | PASS |

When nulls are not present, the map has an actual occupancy of 0.5. The new hash join implementation (Cmp) outperforms the existing one (Ref) by 20% (achievable by tuning the CG size) up to 100%. When nulls are present and treated as unequal, however, the actual occupancy drops to 0.125 and Cmp is consistently about 20% slower than Ref. Building the hash map from the number of valid rows (#9176) could resolve this performance issue. Note that the above results show the minimum speedups from the new hash join, since multiplicity and load factor are relatively low in these benchmarks.

Also, this PR simplifies and improves the current join benchmark by adding multiplicity control. It fixes bugs in the join C++ tests and pytests where outputs were not sorted before comparison.

@PointKernel added the 2 - In Progress (currently a work in progress), libcudf (affects libcudf C++/CUDA code), and CMake (CMake build issue) labels Aug 3, 2021
@devavret devavret self-requested a review September 20, 2021 11:30
@PointKernel (Member, Author):

rerun tests

@shwina (Contributor) commented Sep 24, 2021

@PointKernel Could you please re-target this PR to 21.12? Thanks.

@PointKernel PointKernel changed the base branch from branch-21.10 to branch-21.12 September 24, 2021 19:04
@jrhemstad added the 5 - Ready to Merge (testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (ready for review by team) label Nov 1, 2021
@jrhemstad (Contributor):

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0e76035 into rapidsai:branch-21.12 Nov 2, 2021
@PointKernel PointKernel deleted the cuco-integration branch November 4, 2021 18:34
rapids-bot bot pushed a commit that referenced this pull request Apr 12, 2022
The `concurrent_unordered_multimap` is no longer used in libcudf. It has been replaced by `cuco::static_multimap`. The majority of the refactoring was done in PRs #8934 and #9704. A similar effort is in progress for `concurrent_unordered_map` and `cuco::static_map` in #9666 (and may depend on porting some optimizations from libcudf to cuco -- need to look into this before doing a direct replacement).

This partially resolves issue #10401.

cc: @PointKernel @vyasr

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10642
rapids-bot bot pushed a commit that referenced this pull request May 2, 2022
When working on #8934, we observed a performance regression when nulls are unequal. One major reason is that the new hash map uses a CG-based double hashing algorithm, which is dedicated to improving hash collision handling. The existing implementation determines the hash map size from the number of rows in the build table, regardless of how many rows are valid. When nulls are unequal, the actual map occupancy is therefore lower than the default 50%, resulting in fewer hash collisions. The old scalar linear probing is more efficient in this case: there is less CG-related overhead, and probes mostly terminate at the first slot.

To improve this situation, the original idea of this PR was to construct the hash map based on the number of valid rows. There are supposed to be two benefits:

1. Increases map occupancy to benefit more from CG-based double hashing thus improving runtime efficiency
2. Reduces peak memory usage: for 1'000 elements with 75% nulls, the new capacity would be 500 (1000 * 0.25 * 2) as opposed to 2000 (1000 * 2)

During this work, however, we noticed the first assumption is improper since it didn't consider the performance degradation along with reduced capacity (see #10248 (comment)). Though this effort will reduce peak memory usage, it seems Python/Spark workflows would never benefit from it since they tend to drop nulls before any join operations.

Finally, all changes related to map size reduction were discarded. This PR only adds `_composite_bitmask` as a `detail::hash_join` member, which is a preparation step for #9151.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10248
Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • CMake: CMake build issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
  • Python: Affects Python cuDF API