[BUG] Columnar to Columnar transfers are very slow #4393

revans2 · 2021-12-20T15:09:03Z

Describe the bug
When exploring #4164 I tested what the performance impact is of staying on the CPU for parquet parsing. I found that for many of the queries in question (like NDS query 9) the CPU was also spending all of it's time parsing parquet so it was not a win at all, especially if we use less cores compared to the CPU. But while doing that I noticed that we were wasting a lot of time on columnar to columnar data transfers. When profiling I found that we were spending a lot of time in the cudf HostBuffer.Builder code checking if we needed to grow the buffers. This was for buffers where we know up front exactly the maximum amount of memory needed to hold the data. There were also some issues with the plugin code itself that was not using ideal loops because scala does not make that simple.

I was able to speed up the transfers massively by hacking up the Builder to avoid checking at all. This is not good, but gives us a proof of concept. We need to.

Cache in the builder the maximum number of rows/bytes before we have to grow a buffer instead of re-computing it each time. (this might be enough on the CUDF side, but we need to check)
Don't call a JNI function to get the size of the validity buffer we need to allocate. We can do the same computation in pure java and it saves a lot of time.
Update all of the for loops in the columnar to columnar transfer code to be while loops because they are much faster.
Reduce the number of method calls in the CUDF data path as much as possible. (inline things if we can)
This is a bit separate but I noticed that in the columnar to columnar transfer code we compute an approximate batch size at the beginning, but then when we have finished a batch we calculate a better approximation bases off of the data we actually cached. Then we stomp over the top of it for the very next batch. This is not great, although we rarely get more than one batch so it might not matter that much.

We really need some kind of micro benchmarks that we can use to measure how fast this and other operations are that we can run regularly and see if we are regressing. Also once we have these changes made we should do some more profiling to see if there is more that we can do to speed this up.

The text was updated successfully, but these errors were encountered:

sperlingxx · 2022-01-10T10:23:11Z

I would like to try this optimization work if no one takes.

revans2 · 2022-01-10T16:27:17Z

Sure go ahead

sperlingxx · 2022-01-19T10:13:35Z

#4565

…10025) According to NVIDIA/spark-rapids#4393, current PR takes several measures to speed up the buffer growing during the build of `HostColumnVector`: 1. Introduce `rowCapacity` to cache the maximum number of rows/bytes 2. Introduce pura Java method `byteSizeOfNullMask` to get the size of the validity buffer 3. Reorganize the code structure to reduce the number of method calls I have tested this PR with the spark-rapids tests locally. BTW, shall we clean up the `HostColumnVector.Builder` and replace all the usages of `Builder` with `ColumnBuilder`? Authors: - Alfred Xu (https://github.com/sperlingxx) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #10025

revans2 · 2022-03-10T20:49:33Z

Great work @sperlingxx do you have any benchmark results about how much this helped? If not please don't spend any time creating them. I am just curious.

sperlingxx · 2022-03-14T03:17:47Z

Great work @sperlingxx do you have any benchmark results about how much this helped? If not please don't spend any time creating them. I am just curious.

I didn't run the benchmark to measure the overall improvement of cuDF side and plugin side.

revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify performance A performance related task/issue labels Dec 20, 2021

sameerz removed bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 21, 2021

sperlingxx self-assigned this Jan 11, 2022

sperlingxx mentioned this issue Jan 12, 2022

JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder rapidsai/cudf#10025

Merged

sperlingxx mentioned this issue Feb 14, 2022

Improve columnarCopy for HostColumnarToGpu #4770

Merged

sperlingxx mentioned this issue Feb 21, 2022

[FEA] Improve copying decimal data from CPU columnar data #4784

Closed

sperlingxx closed this as completed Mar 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Columnar to Columnar transfers are very slow #4393

[BUG] Columnar to Columnar transfers are very slow #4393

revans2 commented Dec 20, 2021

sperlingxx commented Jan 10, 2022

revans2 commented Jan 10, 2022

sperlingxx commented Jan 19, 2022

revans2 commented Mar 10, 2022

sperlingxx commented Mar 14, 2022

[BUG] Columnar to Columnar transfers are very slow #4393

[BUG] Columnar to Columnar transfers are very slow #4393

Comments

revans2 commented Dec 20, 2021

sperlingxx commented Jan 10, 2022

revans2 commented Jan 10, 2022

sperlingxx commented Jan 19, 2022

revans2 commented Mar 10, 2022

sperlingxx commented Mar 14, 2022