Accelerate conditional inner joins with larger right tables #9523

Merged

Conversation


@vyasr vyasr commented Oct 25, 2021

This PR reconfigures the conditional join kernels to launch one thread for each row in the larger of the two tables, rather than always launching one thread for each row in the left table. Swapping the table that determines the kernel's thread ordering improves performance when the right table is significantly larger than the left table.

This PR is related to #9461 but does not completely address it, since this change only works for inner joins. Other join kinds are less straightforward because they require keeping track of rows for which no matches were found. That is easy when all the work for a given row is performed on the same thread (the current approach), but with the table ordering swapped it will require a significantly modified or entirely new kernel that incorporates the necessary inter-thread communication.
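To make the idea concrete, here is a minimal sequential C++ sketch of the swapped iteration for an inner join. It is a CPU analogue written for illustration, not the libcudf kernel: the name `conditional_inner_join_sketch`, the `index_pair` alias, and the use of plain `int` columns are all assumptions. Each outer-loop iteration stands in for one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical alias: a (left row index, right row index) match.
using index_pair = std::pair<std::size_t, std::size_t>;

// Hypothetical helper, not a libcudf API: a sequential analogue of a
// conditional inner join whose outer loop runs over the larger table.
std::vector<index_pair> conditional_inner_join_sketch(
  const std::vector<int>& left,
  const std::vector<int>& right,
  const std::function<bool(int, int)>& predicate)
{
  // Drive the outer loop with the larger table. On the GPU, each outer
  // iteration would correspond to one thread.
  const bool swap_tables = right.size() > left.size();
  const auto& outer      = swap_tables ? right : left;
  const auto& inner      = swap_tables ? left : right;
  assert(outer.size() >= inner.size());

  std::vector<index_pair> matches;
  for (std::size_t o = 0; o < outer.size(); ++o) {
    for (std::size_t i = 0; i < inner.size(); ++i) {
      // The predicate always receives (left value, right value), no matter
      // which table drives the outer loop.
      const int l = swap_tables ? inner[i] : outer[o];
      const int r = swap_tables ? outer[o] : inner[i];
      if (predicate(l, r)) {
        // Always emit indices in (left, right) order. Swapping changes only
        // the order in which pairs are produced, not the set of pairs.
        matches.emplace_back(swap_tables ? i : o, swap_tables ? o : i);
      }
    }
  }
  return matches;
}
```

An inner join only emits matched pairs, so no iteration needs to know whether some other iteration found a match for the same row. Left, full, semi, and anti joins do need that information, which is why this swap does not carry over to them directly.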

@vyasr vyasr added the labels 3 - Ready for Review, libcudf, Performance, improvement, and non-breaking Oct 25, 2021
@vyasr vyasr added this to the Conditional Joins milestone Oct 25, 2021
@vyasr vyasr requested review from abellina and jrhemstad October 25, 2021 23:46
@vyasr vyasr self-assigned this Oct 25, 2021
@vyasr vyasr requested a review from a team as a code owner October 25, 2021 23:46
@vyasr vyasr requested a review from devavret October 25, 2021 23:46

vyasr commented Oct 25, 2021

Benchmarks

Before:

------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time             6312 ms         6312 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/1000000/100000/manual_time             6897 ms         6897 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time             6628 ms         6627 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/1000000/100000/manual_time             7180 ms         7180 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time      14781 ms        14780 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/1000000/100000/manual_time      16295 ms        16295 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time      15865 ms        15865 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/1000000/100000/manual_time      17484 ms        17483 ms            1

After:

------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time             6168 ms         6167 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/1000000/100000/manual_time             6236 ms         6234 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time             6482 ms         6482 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/1000000/100000/manual_time             6536 ms         6536 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time      14292 ms        14292 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/1000000/100000/manual_time      14394 ms        14393 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time      15565 ms        15565 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/1000000/100000/manual_time      15649 ms        15648 ms            1
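The before and after timings can be compared with a small helper like the one below. This is only an arithmetic sketch; the function name `percent_reduction` is hypothetical and is not part of the benchmark suite.

```cpp
#include <cassert>

// Hypothetical helper: percentage reduction in runtime going from
// `before_ms` to `after_ms`.
double percent_reduction(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return 100.0 * (before_ms - after_ms) / before_ms;
}
```

For the 32-bit case without nulls, `percent_reduction(6897.0, 6236.0)` is roughly 9.6 for the 1000000/100000 configuration, while `percent_reduction(6312.0, 6168.0)` is roughly 2.3 for the 100000/1000000 configuration. After the change, the two configurations take nearly the same time.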


codecov bot commented Oct 26, 2021

Codecov Report

Merging #9523 (addf2c7) into branch-21.12 (ab4bfaa) will decrease coverage by 0.11%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-21.12    #9523      +/-   ##
================================================
- Coverage         10.79%   10.67%   -0.12%     
================================================
  Files               116      117       +1     
  Lines             18869    19716     +847     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17612     +779     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca40e18...addf2c7. Read the comment docs.


@bdice bdice left a comment


Nice work. I decided to pick up this review for @devavret to give him a break and keep myself up to date on AST code. 👍

cpp/src/join/conditional_join.cu (outdated, resolved)
cpp/src/join/conditional_join.cu (outdated, resolved)
cpp/src/join/conditional_join_kernels.cuh (outdated, resolved)
cpp/src/join/conditional_join_kernels.cuh (resolved)
cpp/src/join/conditional_join_kernels.cuh (outdated, resolved)
@vyasr vyasr removed the request for review from devavret October 26, 2021 20:23
@abellina

@vyasr I ran with this patch locally and it does what we need. Before the patch:

compute_conditional_join_output_size
Begins: 39.1965s
Ends: 39.8928s (+696.253 ms)
grid:  <<<40, 1, 1>>>
block: <<<128, 1, 1>>>

After the patch:

compute_conditional_join_output_size
Begins: 56.9164s
Ends: 56.9942s (+77.828 ms)
grid:  <<<7813, 1, 1>>>
block: <<<128, 1, 1>>>

With this patch I can reproduce fast times similar to those of the Spark-side prototype I discussed in #9461. Thanks!
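The change in grid size reported above follows from which table determines the thread count. The sketch below shows the arithmetic under stated assumptions: the helper names are hypothetical, and the row counts of 5,000 (left) and 1,000,000 (right) are not given in this thread. They are chosen only because they are consistent with the reported grids of 40 and 7813 blocks at 128 threads per block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks needed so that every row gets one
// thread, computed by ceiling division.
std::size_t blocks_for_rows(std::size_t num_rows, std::size_t block_size)
{
  assert(block_size > 0);
  return (num_rows + block_size - 1) / block_size;
}

// Before the change: the grid is always sized by the left table.
std::size_t grid_size_before(std::size_t left_rows, std::size_t /*right_rows*/,
                             std::size_t block_size)
{
  return blocks_for_rows(left_rows, block_size);
}

// After the change (inner joins only): the grid is sized by the larger table.
std::size_t grid_size_after(std::size_t left_rows, std::size_t right_rows,
                            std::size_t block_size)
{
  return blocks_for_rows(std::max(left_rows, right_rows), block_size);
}
```

With the assumed sizes, `grid_size_before(5000, 1000000, 128)` gives 40 blocks and `grid_size_after(5000, 1000000, 128)` gives 7813 blocks. In the first case each of about 5,000 threads scans a million right-table rows. In the second, a million threads each scan about 5,000 rows, which exposes far more parallelism to the GPU.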


@abellina abellina left a comment


From a perf standpoint, I got very good times with this patch. I basically stopped our join without the patch because it takes 30 minutes to run, whereas with the patch it takes ~1.4 minutes.


@bdice bdice left a comment


@vyasr LGTM aside from a couple missed replacements. Please fix those, otherwise approved!

cpp/src/join/conditional_join.cu (outdated, resolved)
cpp/src/join/conditional_join.cu (outdated, resolved)

vyasr commented Oct 29, 2021

> @vyasr LGTM aside from a couple missed replacements. Please fix those, otherwise approved!

Thank you for finding all my find-and-replace and copy-paste errors!


vyasr commented Oct 29, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c12b691 into rapidsai:branch-21.12 Oct 29, 2021
@vyasr vyasr deleted the perf/conditional_join_swap_tables branch January 14, 2022 18:01