Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reapply mixed_semi_join refactoring and bug fixes #16859

Merged

Conversation

mhaseeb123
Copy link
Member

Description

This PR reapplies changes from #16230 and adds bug fixes and performance improvements for mixed_semi_join.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 20, 2024
@mhaseeb123 mhaseeb123 self-assigned this Sep 20, 2024
left_table_keep_mask[outer_row_index] =
hash_table_view.contains(outer_row_index, hash_probe, equality);
// Find all the rows in the left table that are in the hash table.
for (auto outer_row_index = cudf::detail::grid_1d::global_thread_id<block_size>() / cg_size;
Copy link
Member Author

@mhaseeb123 mhaseeb123 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug fix (part 1): This was a simple if before so only querying nelems/cg_size items instead of all nelems

@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 20, 2024
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Sep 20, 2024

Performance

Benchmark

JOIN_NVBENCH -d 0 --benchmark mixed_left_semi_join --timeout 100

Hardware

GPU: NVIDIA RTX 5880 Ada Generation
SM Version: 890 (PTX Version: 860)
Number of SMs: 110
SM Default Clock Rate: 18446744071874 MHz
Global Memory: 23879 MiB Free / 48632 MiB Total
Global Memory Bus Peak: 960 GB/sec (384-bit DDR @10001MHz)
Max Shared Memory: 100 KiB/SM, 48 KiB/Block
L2 Cache Size: 98304 KiB
Maximum Active Blocks: 24/SM
Maximum Active Threads: 1536/SM, 1024/Block
Available Registers: 65536/SM, 65536/Block
ECC Enabled: No

Comparison Table

Arthimetic mean of all % improvements: +0.79%
Geometic mean of positive % improvements: 6.7%
Geometic mean of negative % improvements: -14%

| Key | Nullable | left_size | right_size | GPU_Time_old | GPU_Time_new | pct_change |
|-----|----------|-----------|------------|--------------|--------------|------------|
| I32 |        0 |      1000 |       1000 |    94.260 us |    94.927 us |     -0.708 |
| I32 |        0 |    100000 |       1000 |   132.938 us |   146.014 us |     -9.836 |
| I32 |        0 |  10000000 |       1000 |     1.502 ms |     1.806 ms |     -20.24 |
| I32 |        0 |    100000 |     100000 |   157.627 us |   141.074 us |     10.501 |
| I32 |        0 |  10000000 |     100000 |     1.479 ms |     1.692 ms |    -14.402 |
| I32 |        0 |  10000000 |   10000000 |     5.867 ms |     4.150 ms |     29.265 |
| I32 |        1 |      1000 |       1000 |   108.833 us |   100.220 us |      7.914 |
| I32 |        1 |    100000 |       1000 |   143.874 us |   144.758 us |     -0.614 |
| I32 |        1 |  10000000 |       1000 |     1.283 ms |     1.416 ms |    -10.366 |
| I32 |        1 |    100000 |     100000 |   159.061 us |   145.933 us |      8.253 |
| I32 |        1 |  10000000 |     100000 |     1.156 ms |     1.418 ms |    -22.664 |
| I32 |        1 |  10000000 |   10000000 |     2.951 ms |     1.920 ms |     34.937 |
| I64 |        0 |      1000 |       1000 |    94.945 us |    87.359 us |       7.99 |
| I64 |        0 |    100000 |       1000 |   134.796 us |   138.466 us |     -2.723 |
| I64 |        0 |  10000000 |       1000 |     1.466 ms |     1.692 ms |    -15.416 |
| I64 |        0 |    100000 |     100000 |   159.470 us |   145.605 us |      8.694 |
| I64 |        0 |  10000000 |     100000 |     1.510 ms |     1.738 ms |    -15.099 |
| I64 |        0 |  10000000 |   10000000 |     6.862 ms |     5.223 ms |     23.885 |
| I64 |        1 |      1000 |       1000 |   106.791 us |    96.834 us |      9.324 |
| I64 |        1 |    100000 |       1000 |   143.220 us |   142.791 us |        0.3 |
| I64 |        1 |  10000000 |       1000 |     1.118 ms |     1.482 ms |    -32.558 |
| I64 |        1 |    100000 |     100000 |   159.526 us |   147.077 us |      7.804 |
| I64 |        1 |  10000000 |     100000 |     1.178 ms |     1.422 ms |    -20.713 |
| I64 |        1 |  10000000 |   10000000 |     2.980 ms |     1.992 ms |     33.154 |

@mhaseeb123 mhaseeb123 changed the title Reapply mixed_semi_join refactoring and bug fixes [WIP] Reapply mixed_semi_join refactoring and bug fixes Sep 20, 2024

// skip rows that are null here.
if ((compare_nulls == null_equality::EQUAL) or (not nullable(build))) {
hash_table.insert(iter, iter + right_num_rows, hash_build, equality_build, stream.value());
row_set.insert_async(iter, iter + right_num_rows, stream.value());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some perf improvement through use of _async APIs.


detail::grid_1d const config(outer_num_rows, DEFAULT_JOIN_BLOCK_SIZE);
auto const shmem_size_per_block = parser.shmem_per_thread * config.num_threads_per_block;
detail::grid_1d const config(outer_num_rows * hash_set_type::cg_size, DEFAULT_JOIN_BLOCK_SIZE);
Copy link
Member Author

@mhaseeb123 mhaseeb123 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug fix (part 2): We need to launch outer_num_rows * hash_set_type::cg_size threads now. Even without this the loop in part 1 will take care of everything.

@mhaseeb123 mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 20, 2024 23:38
@mhaseeb123 mhaseeb123 changed the base branch from branch-24.12 to branch-24.10 September 20, 2024 23:39
@mhaseeb123
Copy link
Member Author

Performance results from Spark (credits: @zpuller).

spark2a

Regression alerts
-----------------
--------------------------------------------------------------------
Name = query85
Means = 2845.4, 2253.0
Time diff = 592.4000000000001
Speedup = 1.2629383044829117
T-Test (test statistic, p value, df) = 2.43033039217987, 0.041181012901009915, 8.0
T-Test Confidence Interval = 30.304881537668848, 1154.4951184623314
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query90
Means = 1421.4, 863.4
Time diff = 558.0000000000001
Speedup = 1.6462821403752608
T-Test (test statistic, p value, df) = 2.6788276864789964, 0.02797678796082813, 8.0
T-Test Confidence Interval = 77.6591725781339, 1038.3408274218664
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query93
Means = 10904.4, 11764.0
Time diff = -859.6000000000004
Speedup = 0.9269296157769465
T-Test (test statistic, p value, df) = -2.9660190047403745, 0.017980126109303492, 8.0
T-Test Confidence Interval = -1527.9170780272916, -191.282921972709
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed

spark-h

====================
Results for query16
====================
Regression alerts
-----------------

Speedup results
-----------------
query16: Previous (6218.6 ms) vs Current (6250.6 ms) Diff -32 E2E 0.99x
====================
Results for query94
====================
Regression alerts
-----------------

Speedup results
-----------------
query94: Previous (5255.2 ms) vs Current (5044.4 ms) Diff 210 E2E 1.04x

@mhaseeb123 mhaseeb123 changed the title [WIP] Reapply mixed_semi_join refactoring and bug fixes Reapply mixed_semi_join refactoring and bug fixes Sep 23, 2024
@mhaseeb123 mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 23, 2024 21:40
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 24, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review September 24, 2024 18:39
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner September 24, 2024 18:39
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 🚀

cpp/tests/join/mixed_join_tests.cu Show resolved Hide resolved
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an enlightening read. Very clean code, too. Thanks, @mhaseeb123.

LGTM.

@mhaseeb123
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit b1e1c9c into rapidsai:branch-24.12 Sep 26, 2024
100 checks passed
@mhaseeb123 mhaseeb123 deleted the make-mixed-semi-join-great-again branch September 26, 2024 18:00
copy-pr-bot bot pushed a commit that referenced this pull request Sep 28, 2024
This PR reapplies changes from #16230 and adds bug fixes and performance improvements for mixed_semi_join.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #16859
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants