Batch memcpy the last offsets for output buffers of str and list cols in PQ reader #16905

@mhaseeb123 mhaseeb123 commented Sep 25, 2024

Description

This PR adds the ability to batch the memcpys that write the last offsets into the output buffers of string and list columns in the Parquet reader. This reduces the overhead of issuing many cudaMemcpyAsync calls when reading wide tables with string and/or list columns. @vuule found this optimization and contributed the ORC changes. See this comment for performance improvement data and discussion.
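
To illustrate the batching idea described above, here is a minimal host-only sketch. It is not the code from this PR: `batched_memcpy_host` and `write_final_offsets` are hypothetical names, plain `std::vector` buffers stand in for device buffers, and `std::memcpy` stands in for the device copy (the performance comment in this thread names `cub::DeviceMemcpy::Batched()` as the GPU primitive). The point is only the shape of the change: gather one (source, destination, size) triple per column, then make a single batched call instead of one copy call per column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a batched copy: a single call that
// performs many small, independent copies described by parallel arrays.
void batched_memcpy_host(std::vector<void const*> const& srcs,
                         std::vector<void*> const& dsts,
                         std::vector<std::size_t> const& sizes)
{
  assert(srcs.size() == dsts.size() && srcs.size() == sizes.size());
  for (std::size_t i = 0; i < srcs.size(); ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
}

// Writes each column's final offset into the last slot of its offsets buffer.
// All copies are collected first and issued through one batched call.
void write_final_offsets(std::vector<std::int32_t> const& final_offsets,
                         std::vector<std::vector<std::int32_t>>& offset_buffers)
{
  assert(final_offsets.size() == offset_buffers.size());

  std::vector<void const*> srcs;
  std::vector<void*> dsts;
  std::vector<std::size_t> sizes;

  for (std::size_t col = 0; col < offset_buffers.size(); ++col) {
    if (offset_buffers[col].empty()) { continue; }  // nothing to write
    srcs.push_back(&final_offsets[col]);
    dsts.push_back(&offset_buffers[col].back());
    sizes.push_back(sizeof(std::int32_t));  // each copy is 4 bytes
  }

  batched_memcpy_host(srcs, dsts, sizes);
}
```

On the GPU the per-call launch overhead is what dominates for 4-byte copies, so replacing N calls with one batched operation is where the saving would come from; this CPU sketch shows the data layout only and says nothing about timing.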

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 self-assigned this Sep 25, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 25, 2024
@mhaseeb123 mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 25, 2024 03:25
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuIO cuIO issue labels Sep 25, 2024
@mhaseeb123 mhaseeb123 changed the title Batch memcpy the last offsets for output buffers of str and list cols in PQ reader 🚧 Batch memcpy the last offsets for output buffers of str and list cols in PQ reader Sep 25, 2024
@vuule vuule self-requested a review September 25, 2024 18:34
@github-actions github-actions bot added the CMake CMake build issue label Sep 26, 2024

mhaseeb123 commented Sep 26, 2024

Performance Improvements

Summary

The time to write the final offsets for STRING and LIST columns in reader::impl::decode_page_data, previously done via 4-byte H2D memcpys, is significantly reduced. The effect is dramatic for wide tables with many string and list columns.

Benchmark Setup

Benchmark name: PARQUET_READER_NVBENCH -b parquet_read_wide_tables
Table properties: data_size = 1GB across num_cols = 1024 (all STRING type), cardinality=0, run_len=16
GPU: NVIDIA RTX 5880 Ada Generation

Profiles (Before = top, After = bottom)

Before: 1024 x cudaMemcpyAsync = 2.393 ms
After: 9 µs for 2 x cudaMemcpyAsync (the two small red bars just before the highlighted region) + 214 µs for cub::DeviceMemcpy::Batched(), about 223 µs in total (roughly 10.7x less time)

[Screenshot: profiler timelines, before (top) and after (bottom)]

CC @GregoryKimball

@mhaseeb123 mhaseeb123 changed the title 🚧 Batch memcpy the last offsets for output buffers of str and list cols in PQ reader Batch memcpy the last offsets for output buffers of str and list cols in PQ reader Sep 26, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review September 26, 2024 01:02
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner September 26, 2024 01:02
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 26, 2024
@vuule vuule left a comment

cool stuff!
Bunch of small suggestions, mostly to polish the new functions

cpp/include/cudf/io/detail/batched_memcpy.hpp
cpp/include/cudf/io/detail/batched_memcpy.hpp
cpp/include/cudf/io/detail/batched_memcpy.hpp
cpp/src/io/parquet/parquet_gpu.hpp
cpp/src/io/parquet/parquet_gpu.hpp
cpp/src/io/parquet/parquet_gpu.hpp
cpp/src/io/parquet/page_data.cu
cpp/src/io/parquet/page_data.cu
cpp/src/io/parquet/page_data.cu
cpp/include/cudf/io/detail/batched_memcpy.hpp
cpp/src/io/orc/stripe_enc.cu
copy-pr-bot bot commented Sep 30, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mhaseeb123 mhaseeb123 requested a review from a team as a code owner October 1, 2024 01:46
@mhaseeb123 mhaseeb123 removed the request for review from a team October 1, 2024 01:51
@mhaseeb123

/ok to test


@vyasr vyasr left a comment

This is nice!

cpp/include/cudf/detail/utilities/batched_memcpy.hpp
cpp/src/io/parquet/page_data.cu
@mhaseeb123

/ok to test

@mhaseeb123

/merge

@rapids-bot rapids-bot bot merged commit 7ae5360 into rapidsai:branch-24.12 Oct 3, 2024
99 checks passed
@mhaseeb123 mhaseeb123 deleted the fea-batch-memcpy-list-str-output-buff-offsets branch October 3, 2024 03:26