
Refactor hash join with cuCollections multimap #8934

Merged (75 commits into rapidsai:branch-21.12, Nov 2, 2021)

Conversation

@PointKernel (Member) commented Aug 3, 2021

This PR refactors the existing hash join implementation to use the cuCollections `static_multimap`. Related functions now invoke the `cuco::static_multimap` host-bulk APIs, so join-specific kernels are no longer needed. Compared to the current cudf multimap, `cuco::static_multimap` applies several optimizations to improve overall performance:

  • Using CUDA Cooperative Groups & parallel retrieval to leverage multiple threads for a single hash map operation
  • Double hashing probe method instead of linear probing to reduce the chance of collisions
  • Relaxed atomic operations instead of expensive sequentially consistent atomics
  • Using vector loads for "packable" pairs (no larger than 4B/4B) to coarsen the workload granularity for each thread

The inner_join benchmarks are used to evaluate the new hash join's performance, with the following configuration:

  • multiplicity = 1
  • matching rate = 0.3
  • occupancy = 0.5
  • double hashing
  • CG size = 2
  • 75% nulls when nulls are present
inner_join_32bit

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 0 | 100000 | 100000 | 135.391 us | 15.35% | 268.328 us | 18.26% | 132.937 us | 98.19% | PASS |
| I32 | I32 | 0 | 100000 | 400000 | 248.937 us | 5.55% | 364.678 us | 10.10% | 115.740 us | 46.49% | PASS |
| I32 | I32 | 0 | 10000000 | 10000000 | 12.635 ms | 0.10% | 13.580 ms | 0.13% | 945.092 us | 7.48% | PASS |
| I32 | I32 | 0 | 10000000 | 40000000 | 37.934 ms | 0.05% | 43.083 ms | 0.08% | 5.148 ms | 13.57% | PASS |
| I32 | I32 | 0 | 10000000 | 100000000 | 88.486 ms | 0.20% | 108.750 ms | 0.05% | 20.264 ms | 22.90% | PASS |
| I32 | I32 | 0 | 80000000 | 100000000 | 124.207 ms | 0.04% | 130.181 ms | 0.04% | 5.974 ms | 4.81% | PASS |
| I32 | I32 | 0 | 100000000 | 100000000 | 134.720 ms | 0.12% | 139.181 ms | 0.03% | 4.462 ms | 3.31% | PASS |
| I32 | I32 | 0 | 10000000 | 240000000 | 218.672 ms | 0.04% | 295.274 ms | 0.03% | 76.602 ms | 35.03% | PASS |
| I32 | I32 | 0 | 80000000 | 240000000 | 242.677 ms | 0.03% | 266.421 ms | 0.04% | 23.744 ms | 9.78% | PASS |
| I32 | I32 | 0 | 100000000 | 240000000 | 253.245 ms | 0.04% | 273.783 ms | 0.04% | 20.538 ms | 8.11% | PASS |

inner_join_64bit

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I64 | I64 | 0 | 40000000 | 50000000 | 62.728 ms | 0.04% | 65.638 ms | 0.07% | 2.910 ms | 4.64% | PASS |
| I64 | I64 | 0 | 50000000 | 50000000 | 68.142 ms | 0.05% | 70.164 ms | 0.05% | 2.022 ms | 2.97% | PASS |
| I64 | I64 | 0 | 40000000 | 120000000 | 122.424 ms | 0.04% | 134.553 ms | 0.05% | 12.129 ms | 9.91% | PASS |
| I64 | I64 | 0 | 50000000 | 120000000 | 127.979 ms | 0.04% | 138.379 ms | 0.04% | 10.400 ms | 8.13% | PASS |

inner_join_32bit_nulls

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 1 | 100000 | 100000 | 118.400 us | 10.77% | 237.164 us | 10.76% | 118.764 us | 100.31% | PASS |
| I32 | I32 | 1 | 100000 | 400000 | 167.385 us | 7.46% | 275.823 us | 6.00% | 108.438 us | 64.78% | PASS |
| I32 | I32 | 1 | 10000000 | 10000000 | 3.699 ms | 0.28% | 3.053 ms | 0.24% | -646.276 us | -17.47% | PASS |
| I32 | I32 | 1 | 10000000 | 40000000 | 10.574 ms | 0.15% | 8.634 ms | 0.08% | -1939.859 us | -18.35% | PASS |
| I32 | I32 | 1 | 10000000 | 100000000 | 24.461 ms | 0.11% | 19.744 ms | 0.05% | -4716.875 us | -19.28% | PASS |
| I32 | I32 | 1 | 80000000 | 100000000 | 35.013 ms | 0.11% | 27.576 ms | 0.05% | -7437.653 us | -21.24% | PASS |
| I32 | I32 | 1 | 100000000 | 100000000 | 37.870 ms | 0.42% | 29.800 ms | 0.06% | -8069.778 us | -21.31% | PASS |
| I32 | I32 | 1 | 10000000 | 240000000 | 58.529 ms | 0.19% | 46.316 ms | 0.02% | -12213.277 us | -20.87% | PASS |
| I32 | I32 | 1 | 80000000 | 240000000 | 67.443 ms | 0.12% | 53.721 ms | 0.49% | -13722.214 us | -20.35% | PASS |
| I32 | I32 | 1 | 100000000 | 240000000 | 70.809 ms | 0.50% | 55.938 ms | 0.03% | -14870.980 us | -21.00% | PASS |

inner_join_64bit_nulls

[0] Tesla V100-SXM2-32GB

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I64 | I64 | 1 | 40000000 | 50000000 | 17.560 ms | 0.21% | 13.926 ms | 0.09% | -3634.422 us | -20.70% | PASS |
| I64 | I64 | 1 | 50000000 | 50000000 | 19.070 ms | 0.14% | 15.082 ms | 0.07% | -3987.592 us | -20.91% | PASS |
| I64 | I64 | 1 | 40000000 | 120000000 | 34.253 ms | 0.13% | 27.170 ms | 0.05% | -7082.957 us | -20.68% | PASS |
| I64 | I64 | 1 | 50000000 | 120000000 | 35.807 ms | 0.19% | 28.340 ms | 0.03% | -7467.177 us | -20.85% | PASS |

When nulls are not present, the map has an actual occupancy of 0.5. The new hash join implementation (Cmp) outperforms the existing one (Ref) by 20% (achievable by tuning the CG size) up to 100%. When nulls are present and treated as unequal, however, the actual occupancy drops to 0.125 and Cmp is consistently about 20% slower than Ref. Building the hash map from the number of valid rows (#9176) could resolve this performance issue. Note that the above results show the minimum speedups from the new hash join, since multiplicity and load factor are relatively low in these benchmarks.

Also, this PR simplifies and improves the current join benchmark by adding multiplicity control. It fixes bugs in the join C++ tests and pytests where outputs were not sorted before comparison.

@PointKernel added the 2 - In Progress (currently a work in progress), libcudf (affects libcudf C++/CUDA code), and CMake (CMake build issue) labels Aug 3, 2021
@devavret devavret self-requested a review September 20, 2021 11:30
@PointKernel (Member, Author):

rerun tests

@shwina (Contributor) commented Sep 24, 2021

@PointKernel Could you please re-target this PR to 21.12? Thanks.

@PointKernel PointKernel changed the base branch from branch-21.10 to branch-21.12 September 24, 2021 19:04
@jrhemstad added the 5 - Ready to Merge (testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (ready for review by team) label Nov 1, 2021
@jrhemstad (Contributor):

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0e76035 into rapidsai:branch-21.12 Nov 2, 2021
@PointKernel PointKernel deleted the cuco-integration branch November 4, 2021 18:34
rapids-bot bot pushed a commit that referenced this pull request Apr 12, 2022
The `concurrent_unordered_multimap` is no longer used in libcudf. It has been replaced by `cuco::static_multimap`. The majority of the refactoring was done in PRs #8934 and #9704. A similar effort is in progress for `concurrent_unordered_map` and `cuco::static_map` in #9666 (and may depend on porting some optimizations from libcudf to cuco -- need to look into this before doing a direct replacement).

This partially resolves issue #10401.

cc: @PointKernel @vyasr

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10642
rapids-bot bot pushed a commit that referenced this pull request May 2, 2022
When working on #8934, we observed a performance regression when nulls are unequal. One major reason is that the new hash map uses a CG-based double hashing algorithm, which is dedicated to improving hash collision handling. The existing implementation determines the hash map size from the number of rows in the build table, regardless of how many rows are valid. When nulls are unequal, the actual map occupancy is therefore lower than the default 50%, resulting in fewer hash collisions. The old scalar linear probing is more efficient in this case: there is less CG-related overhead, and probes mostly terminate at the first slot.

To improve this situation, the original idea of this PR was to construct the hash map based on the number of valid rows. There are supposed to be two benefits:

1. Increases map occupancy to benefit more from CG-based double hashing thus improving runtime efficiency
2. Reduces peak memory usage: for 1'000 elements with 75% nulls, the new capacity would be 500 (1000 * 0.25 * 2) as opposed to 2000 (1000 * 2)

During this work, however, we noticed the first assumption is improper since it didn't consider the performance degradation along with reduced capacity (see #10248 (comment)). Though this effort will reduce peak memory usage, it seems Python/Spark workflows would never benefit from it since they tend to drop nulls before any join operations.

Finally, all changes related to map size reduction were discarded. This PR only adds `_composite_bitmask` as a `detail::hash_join` member, which is a preparation step for #9151.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10248
Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • CMake: CMake build issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
  • Python: Affects Python cuDF API