Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for cudf::strings::count_characters for long strings #12779

Merged
merged 22 commits into from
Feb 27, 2023

Conversation

davidwendt
Copy link
Contributor

Description

Adds more efficient counting algorithm specifically for columns with long strings--greater than 64 bytes on average.
The internal detail method will be used to help improve performance in other strings functions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 15, 2023
@davidwendt davidwendt self-assigned this Feb 15, 2023
@github-actions github-actions bot added the CMake CMake build issue label Feb 15, 2023
@davidwendt
Copy link
Contributor Author

Benchmark results show most improvement for string widths greater 64 bytes on average.

|   rows   | width|   Ref Time |    Cmp Time |          Diff |   %Diff |
|----------|------|------------|-------------|---------------|---------|
|     4096 |   32 |  37.670 us |   37.532 us |     -0.138 us |  -0.37% |
|    32768 |   32 |  38.356 us |   38.225 us |     -0.131 us |  -0.34% |
|   262144 |   32 |  49.001 us |   48.659 us |     -0.342 us |  -0.70% |
|  2097152 |   32 | 332.821 us |  333.535 us |      0.714 us |   0.21% |
| 16777216 |   32 |   2.418 ms |    2.434 ms |     15.780 us |   0.65% |
|     4096 |   64 |  38.041 us |   38.462 us |      0.421 us |   1.11% |
|    32768 |   64 |  39.032 us |   39.507 us |      0.475 us |   1.22% |
|   262144 |   64 |  56.326 us |   56.501 us |      0.175 us |   0.31% |
|  2097152 |   64 | 381.756 us |  380.597 us |     -1.159 us |  -0.30% |
| 16777216 |   64 |   2.808 ms |    2.789 ms |    -19.355 us |  -0.69% |
|     4096 |  128 |  41.175 us |   41.297 us |      0.122 us |   0.30% |
|    32768 |  128 |  45.130 us |   45.286 us |      0.156 us |   0.35% |
|   262144 |  128 |  77.200 us |   77.078 us |     -0.122 us |  -0.16% |
|  2097152 |  128 | 497.075 us |  492.847 us |     -4.228 us |  -0.85% |
|     4096 |  256 |  49.684 us |   37.710 us |    -11.974 us | -24.10% |
|    32768 |  256 |  53.379 us |   49.197 us |     -4.182 us |  -7.83% |
|   262144 |  256 | 152.252 us |  139.544 us |    -12.709 us |  -8.35% |
|  2097152 |  256 | 963.613 us |    1.090 ms |    126.674 us |  13.15% |
|     4096 |  512 |  69.779 us |   41.161 us |    -28.618 us | -41.01% |
|    32768 |  512 |  75.668 us |   55.073 us |    -20.595 us | -27.22% |
|   262144 |  512 | 405.729 us |  166.213 us |   -239.516 us | -59.03% |
|  2097152 |  512 |   2.551 ms |    1.297 ms |  -1254.632 us | -49.18% |
|     4096 | 1024 | 106.060 us |   42.641 us |    -63.418 us | -59.79% |
|    32768 | 1024 | 127.396 us |   68.891 us |    -58.505 us | -45.92% |
|   262144 | 1024 |   1.223 ms |  237.241 us |   -986.090 us | -80.61% |
|     4096 | 2048 | 172.035 us |   46.194 us |   -125.841 us | -73.15% |
|    32768 | 2048 | 228.790 us |   94.118 us |   -134.671 us | -58.86% |
|   262144 | 2048 |   6.446 ms |  422.672 us |  -6023.192 us | -93.44% |
|     4096 | 4096 | 309.562 us |   52.891 us |   -256.671 us | -82.91% |
|    32768 | 4096 | 425.600 us |  138.327 us |   -287.273 us | -67.50% |
|   262144 | 4096 |  14.901 ms |  797.709 us | -14103.759 us | -94.65% |

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 17, 2023
@davidwendt davidwendt marked this pull request as ready for review February 17, 2023 00:04
@davidwendt davidwendt requested a review from a team as a code owner February 17, 2023 00:04
@ttnghia
Copy link
Contributor

ttnghia commented Feb 17, 2023

Will the performance with strings shorter than 64 bytes be impacted?

@davidwendt
Copy link
Contributor Author

Will the performance with strings shorter than 64 bytes be impacted?

No. The optimized code did not perform as well as the original code under 64 bytes so the solution implemented here is to include both code paths.

@davidwendt davidwendt marked this pull request as draft February 19, 2023 00:39
@davidwendt davidwendt marked this pull request as ready for review February 22, 2023 00:10
cpp/src/strings/attributes.cu Outdated Show resolved Hide resolved
cpp/src/strings/attributes.cu Show resolved Hide resolved
auto strings_view = cudf::strings_column_view(strings);

auto results = cudf::strings::count_characters(strings_view);
std::vector<int32_t> h_expected(h_strings.size(), 64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expected that these tests should all be using cudf::size_type for the counts, but I also see that count_characters is defined to return an INT32 in its docstring. That seems undesirable, right? Shouldn't we be using cudf::size_type instead of hardcoding int32_t for the function return value and test comparisons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no SIZE_TYPE column type, I think it is more correct to use the appropriate data type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If changing the definition of size type would affect the correctness or consistency of this code, we should use size type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree but this seems out of scope for this PR.
There are at least 6 APIs that do this today.
Perhaps you can create a separate issue if you want to reopen the discussion.

We document what data-type is used for the output of the column of these APIs.
I'm not comfortable saying the column type is whatever size-type equates to at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed a bit offline. The latest documented status is #3958 (comment) but from conversation since that was posted, @GregoryKimball seemed interested in moving towards option 3. I won't hold this PR up on that account.

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2025783 into rapidsai:branch-23.04 Feb 27, 2023
@davidwendt davidwendt deleted the count-string-lengths branch February 27, 2023 18:46
rapids-bot bot pushed a commit that referenced this pull request Mar 15, 2023
Adds section to developer guide about `cudf::size_type` and adds links to it from other relevant parts of the document.
The fundamental nature of this type seems important enough to mention in the developer guide since it is the basis for how much of the code is designed and implemented.
Also updates some doxygen for public APIs that are return `size_type` column values but had cited `INT32` specifically.

Reference: #12779 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Lawrence Mitchell (https://github.com/wence-)
  - Yunsong Wang (https://github.com/PointKernel)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12904
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants