[BUG] Columnar to Columnar transfers are very slow #4393
I would like to try this optimization work if no one has taken it.
Sure, go ahead.
According to NVIDIA/spark-rapids#4393, the current PR takes several measures to speed up buffer growing during the build of `HostColumnVector`:
1. Introduce `rowCapacity` to cache the maximum number of rows/bytes
2. Introduce a pure Java method `byteSizeOfNullMask` to get the size of the validity buffer
3. Reorganize the code structure to reduce the number of method calls
I have tested this PR with the spark-rapids tests locally. BTW, shall we clean up `HostColumnVector.Builder` and replace all usages of `Builder` with `ColumnBuilder`?
Authors:
- Alfred Xu (https://github.com/sperlingxx)
Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
URL: #10025
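For readers unfamiliar with the pattern, here is a minimal, self-contained Java sketch of the first two measures above: cache the row capacity so the hot append path is a single comparison, and compute the validity-buffer size in plain Java. The class name, field layout, and 64-byte padding rule are illustrative assumptions, not the actual cudf code:

```java
// Minimal sketch (not the actual cudf code) of the caching pattern: remember
// the current row capacity so each append does one cheap comparison instead
// of recomputing buffer sizes through several method calls.
final class GrowableBuilder {
    private long[] data;        // stand-in for the native host buffer
    private long rowCapacity;   // cached maximum number of rows the buffer can hold
    private long rows;

    GrowableBuilder(long initialRows) {
        this.data = new long[(int) initialRows];
        this.rowCapacity = initialRows;
    }

    // Pure Java analogue of the byteSizeOfNullMask idea: one validity bit per
    // row, padded to a 64-byte boundary (padding rule assumed for illustration).
    static long byteSizeOfNullMask(long rows) {
        long bytes = (rows + 7) / 8;
        long padding = 64;
        return ((bytes + padding - 1) / padding) * padding;
    }

    void append(long value) {
        if (rows == rowCapacity) {            // single cached check on the hot path
            grow(Math.max(1, rowCapacity * 2));
        }
        data[(int) rows++] = value;
    }

    private void grow(long newRows) {
        data = java.util.Arrays.copyOf(data, (int) newRows);
        rowCapacity = newRows;                // refresh the cache after growing
    }
}
```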
Great work @sperlingxx! Do you have any benchmark results about how much this helped? If not, please don't spend any time creating them; I am just curious.
I didn't run benchmarks to measure the overall improvement on the cuDF side and the plugin side.
Describe the bug
When exploring #4164 I tested the performance impact of staying on the CPU for Parquet parsing. I found that for many of the queries in question (like NDS query 9) the CPU was also spending all of its time parsing Parquet, so it was not a win at all, especially since we use fewer cores than the CPU does. But while doing that I noticed that we were wasting a lot of time on columnar to columnar data transfers. When profiling I found that we were spending a lot of time in the cudf `HostBuffer.Builder` code checking whether we needed to grow the buffers. This was for buffers where we know up front exactly the maximum amount of memory needed to hold the data. There were also some issues with the plugin code itself, which was not using ideal loops because Scala does not make that simple. I was able to speed up the transfers massively by hacking up the Builder to avoid checking at all. This is not good as it stands, but it gives us a proof of concept; we need to do this properly.
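As a rough illustration of the proof of concept, here is a minimal Java sketch of the idea: when the maximum size is known before the transfer starts, allocate once and keep the per-row loop free of any growth checks. The `ColumnTransfer` class and its use of `ByteBuffer` are hypothetical stand-ins, not the plugin or cudf code:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the exact upper bound on the destination size is known
// before the copy starts, so the buffer is sized once and the per-row hot loop
// never has to ask whether it needs to grow.
final class ColumnTransfer {
    static ByteBuffer transfer(int[] sourceColumn) {
        int maxBytes = sourceColumn.length * Integer.BYTES; // exact bound, known up front
        ByteBuffer dest = ByteBuffer.allocate(maxBytes);    // allocate once, never grow
        for (int row = 0; row < sourceColumn.length; row++) {
            dest.putInt(sourceColumn[row]);                 // no growth check per row
        }
        dest.flip();
        return dest;
    }
}
```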
We really need some kind of micro-benchmarks that measure how fast this and other operations are, which we can run regularly to see if we are regressing; a sketch of what such a benchmark could look like follows. Also, once we have made these changes, we should do some more profiling to see if there is more we can do to speed this up.
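Here is a sketch of such a micro-benchmark using JMH, comparing a pre-sized copy against a growing one; the class name, parameters, and workload are illustrative assumptions rather than an existing spark-rapids benchmark:

```java
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;

// Illustrative JMH micro-benchmark: measures the cost of per-append growth
// checks versus allocating the destination to its known maximum size up front.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ColumnarTransferBench {
    @Param({"1000", "1000000"})
    int rows;

    int[] source;

    @Setup(Level.Trial)
    public void setup() {
        source = new int[rows];
        for (int i = 0; i < rows; i++) {
            source[i] = i;
        }
    }

    @Benchmark
    public void presizedCopy(Blackhole bh) {
        int[] dest = new int[rows];            // sized up front, no growth logic
        for (int i = 0; i < rows; i++) {
            dest[i] = source[i];
        }
        bh.consume(dest);
    }

    @Benchmark
    public void growingCopy(Blackhole bh) {
        int[] dest = new int[16];
        int size = 0;
        for (int i = 0; i < rows; i++) {
            if (size == dest.length) {         // growth check on every append
                dest = java.util.Arrays.copyOf(dest, dest.length * 2);
            }
            dest[size++] = source[i];
        }
        bh.consume(dest);
    }
}
```

Run regularly (for example from CI), the two scores would give a baseline for spotting regressions in this hot path.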