Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Columnar to Columnar transfers are very slow #4393

Closed
revans2 opened this issue Dec 20, 2021 · 5 comments
Closed

[BUG] Columnar to Columnar transfers are very slow #4393

revans2 opened this issue Dec 20, 2021 · 5 comments
Assignees
Labels
performance A performance related task/issue

Comments

@revans2
Copy link
Collaborator

revans2 commented Dec 20, 2021

Describe the bug
When exploring #4164 I tested what the performance impact is of staying on the CPU for parquet parsing. I found that for many of the queries in question (like NDS query 9) the CPU was also spending all of it's time parsing parquet so it was not a win at all, especially if we use less cores compared to the CPU. But while doing that I noticed that we were wasting a lot of time on columnar to columnar data transfers. When profiling I found that we were spending a lot of time in the cudf HostBuffer.Builder code checking if we needed to grow the buffers. This was for buffers where we know up front exactly the maximum amount of memory needed to hold the data. There were also some issues with the plugin code itself that was not using ideal loops because scala does not make that simple.

I was able to speed up the transfers massively by hacking up the Builder to avoid checking at all. This is not good, but gives us a proof of concept. We need to.

  1. Cache in the builder the maximum number of rows/bytes before we have to grow a buffer instead of re-computing it each time. (this might be enough on the CUDF side, but we need to check)
  2. Don't call a JNI function to get the size of the validity buffer we need to allocate. We can do the same computation in pure java and it saves a lot of time.
  3. Update all of the for loops in the columnar to columnar transfer code to be while loops because they are much faster.
  4. Reduce the number of method calls in the CUDF data path as much as possible. (inline things if we can)
  5. This is a bit separate but I noticed that in the columnar to columnar transfer code we compute an approximate batch size at the beginning, but then when we have finished a batch we calculate a better approximation bases off of the data we actually cached. Then we stomp over the top of it for the very next batch. This is not great, although we rarely get more than one batch so it might not matter that much.

We really need some kind of micro benchmarks that we can use to measure how fast this and other operations are that we can run regularly and see if we are regressing. Also once we have these changes made we should do some more profiling to see if there is more that we can do to speed this up.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify performance A performance related task/issue labels Dec 20, 2021
@sameerz sameerz removed bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 21, 2021
@sperlingxx
Copy link
Collaborator

I would like to try this optimization work if no one takes.

@revans2
Copy link
Collaborator Author

revans2 commented Jan 10, 2022

Sure go ahead

@sperlingxx
Copy link
Collaborator

#4565

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jan 31, 2022
…10025)

According to NVIDIA/spark-rapids#4393, current PR takes several measures to speed up the buffer growing during the build of `HostColumnVector`:
1. Introduce `rowCapacity` to cache the maximum number of rows/bytes 
2. Introduce pura Java method `byteSizeOfNullMask` to get the size of the validity buffer
3. Reorganize the code structure to reduce the number of method calls

I have tested this PR with the spark-rapids tests locally.
BTW, shall we clean up the `HostColumnVector.Builder` and replace all the usages of `Builder` with `ColumnBuilder`?

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #10025
@revans2
Copy link
Collaborator Author

revans2 commented Mar 10, 2022

Great work @sperlingxx do you have any benchmark results about how much this helped? If not please don't spend any time creating them. I am just curious.

@sperlingxx
Copy link
Collaborator

Great work @sperlingxx do you have any benchmark results about how much this helped? If not please don't spend any time creating them. I am just curious.

I didn't run the benchmark to measure the overall improvement of cuDF side and plugin side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

No branches or pull requests

3 participants