Refactor hash join with cuCollections multimap #8934
Merged
rapids-bot merged 75 commits into rapidsai:branch-21.12 from PointKernel:cuco-integration on Nov 2, 2021
Conversation
PointKernel added the 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code.), and CMake (CMake build issue) labels on Aug 3, 2021
devavret suggested changes on Sep 20, 2021

rerun tests
devavret approved these changes on Sep 22, 2021

@PointKernel Could you please re-target this PR to 21.12? Thanks.
jrhemstad added the 5 - Ready to Merge label (Testing and reviews complete, ready to merge) and removed the 3 - Ready for Review label (Ready for review by team) on Nov 1, 2021
@gpucibot merge
rapids-bot pushed a commit that referenced this pull request on Apr 12, 2022:
The `concurrent_unordered_multimap` is no longer used in libcudf. It has been replaced by `cuco::static_multimap`. The majority of the refactoring was done in PRs #8934 and #9704. A similar effort is in progress for `concurrent_unordered_map` and `cuco::static_map` in #9666 (and may depend on porting some optimizations from libcudf to cuco -- need to look into this before doing a direct replacement). This partially resolves issue #10401. cc: @PointKernel @vyasr

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #10642
rapids-bot pushed a commit that referenced this pull request on May 2, 2022:
When working on #8934, we observed a performance regression when nulls are unequal. One major reason is that the new hash map uses a CG-based double hashing algorithm. This algorithm is dedicated to improving hash collision handling. The existing implementation determines hash map size by the number of rows in the build table regardless of how many rows are valid. In the case of nulls being unequal, the actual map occupancy is, therefore, lower than the default 50%, thus resulting in fewer hash collisions. The old scalar linear probing is more efficient in this case due to less CG-related overhead, and the probe will mostly end at the first probe slot.

To improve this situation, the original idea of this PR was to construct the hash map based on the number of valid rows. There were supposed to be two benefits:

1. Increases map occupancy to benefit more from CG-based double hashing, thus improving runtime efficiency
2. Reduces peak memory usage: for 1'000 elements with 75% nulls, the new capacity would be 500 (1000 * 0.25 * 2) as opposed to 2000 (1000 * 2)

During this work, however, we noticed the first assumption is improper since it didn't consider the performance degradation that comes along with reduced capacity (see #10248 (comment)). Though this effort would reduce peak memory usage, it seems Python/Spark workflows would never benefit from it since they tend to drop nulls before any join operations. Finally, all changes related to map size reduction were discarded. This PR only adds `_composite_bitmask` as a `detail::hash_join` member, which is a preparation step for #9151.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10248
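For readers following the numbers in the commit message above, a minimal sketch of the sizing arithmetic is shown below. It is not libcudf code; the 1,000-row / 75%-null figures come from the commit message, and the 2x sizing factor (50% target occupancy) is the default sizing discussed in this PR.

```cpp
// Sketch of the hash-map sizing arithmetic discussed above (not libcudf code).
#include <cstddef>
#include <iostream>

int main()
{
  std::size_t const num_rows       = 1000;  // build-table rows
  double const      valid_fraction = 0.25;  // 75% nulls, as in the example above

  // Existing sizing: capacity = 2 * number of build rows (50% target occupancy).
  std::size_t const default_capacity = 2 * num_rows;  // 2000

  // Originally proposed (later discarded) sizing: 2 * number of *valid* rows.
  auto const valid_rows       = static_cast<std::size_t>(num_rows * valid_fraction);  // 250
  auto const reduced_capacity = 2 * valid_rows;                                       // 500

  // With the default sizing and nulls treated as unequal, only the valid rows
  // effectively occupy slots, so occupancy falls well below the 50% target.
  double const actual_occupancy =
    static_cast<double>(valid_rows) / static_cast<double>(default_capacity);  // 0.125

  std::cout << default_capacity << ' ' << reduced_capacity << ' ' << actual_occupancy << '\n';
  return 0;
}
```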
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
CMake: CMake build issue
improvement: Improvement / enhancement to an existing function
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Performance: Performance related issue
Python: Affects Python cuDF API.
This PR refactors the existing hash join implementation by using the cuCollections `static_multimap`. Related functions now invoke `cuco::static_multimap` host-bulk APIs, so join-specific kernels are no longer needed. Compared to the current cudf multimap, `cuco::static_multimap` applies a list of optimizations to improve overall performance.
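For context, here is a minimal sketch of what a join-style lookup via the `cuco::static_multimap` host-bulk APIs can look like. This is not the libcudf implementation: constructor arguments, the pair type, and exact member signatures differ between cuCollections releases, and the data and sentinel values below are made up for illustration.

```cpp
// Illustrative only: join-style lookup via cuco::static_multimap host-bulk APIs.
// Exact signatures vary across cuCollections versions.
#include <cuco/static_multimap.cuh>

#include <thrust/device_vector.h>

#include <cstdint>
#include <vector>

int main()
{
  using key_type   = int32_t;
  using value_type = int32_t;  // build-table row index
  using pair_type  = cuco::pair<key_type, value_type>;

  // (key, build-row-index) pairs for the build side and probe-side keys (made-up data).
  std::vector<pair_type> h_build{{0, 0}, {1, 1}, {1, 2}, {2, 3}, {3, 4}};
  std::vector<key_type>  h_probe{1, 2, 5};
  thrust::device_vector<pair_type> build_pairs(h_build.begin(), h_build.end());
  thrust::device_vector<key_type>  probe_keys(h_probe.begin(), h_probe.end());

  // Size the map at ~2x the build size, i.e. a 50% target occupancy.
  std::size_t const capacity = 2 * build_pairs.size();
  cuco::static_multimap<key_type, value_type> map{
    capacity, cuco::empty_key<key_type>{-1}, cuco::empty_value<value_type>{-1}};

  // Host-bulk insert of the build side: no hand-written join kernel required.
  map.insert(build_pairs.begin(), build_pairs.end());

  // Host-bulk count, then retrieve all matching (key, build-row-index) pairs.
  auto const num_matches = map.count(probe_keys.begin(), probe_keys.end());
  thrust::device_vector<pair_type> matches(num_matches);
  map.retrieve(probe_keys.begin(), probe_keys.end(), matches.begin());

  return 0;
}
```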
The `inner_join` benchmarks are used to evaluate the new hash join performance. Results were collected for `inner_join_32bit`, `inner_join_64bit`, `inner_join_32bit_nulls`, and `inner_join_64bit_nulls` on a Tesla V100-SXM2-32GB (benchmark result tables omitted).
When nulls are not present, the map has an actual occupancy of 0.5. The new hash join implementation (`Cmp`) outperforms the existing one (`Ref`) by 20% (achievable by tuning the CG size) to 100%. When nulls are present and treated as unequal, however, the actual occupancy is 0.125 and `Cmp` is always about 20% slower than `Ref`. Building the hash map from the number of valid rows (#9176) can resolve this performance issue. Note that the above results show the minimum speedups brought by the new hash join, since multiplicity and load factor are relatively low in this case.

This PR also simplifies and improves the current join benchmark by adding multiplicity control, and it fixes bugs in the join C++ tests and pytests where outputs were not sorted before being compared.
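The sort-before-compare fix mentioned above boils down to the pattern sketched here (a generic illustration, not the actual libcudf test code): hash join output order is unspecified, so expected and actual results must both be sorted before an element-wise comparison.

```cpp
// Generic sort-before-compare pattern for order-unspecified join results
// (illustration only, not libcudf test code).
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

int main()
{
  using index_pair = std::pair<int, int>;  // (left row index, right row index)

  // Two correct inner-join results that differ only in row order.
  std::vector<index_pair> expected{{0, 2}, {1, 0}, {1, 1}};
  std::vector<index_pair> actual{{1, 1}, {0, 2}, {1, 0}};

  // Sort both sides, then compare element-wise.
  std::sort(expected.begin(), expected.end());
  std::sort(actual.begin(), actual.end());
  assert(expected == actual);

  return 0;
}
```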