Batch memcpy the last offsets for output buffers of str and list cols in PQ reader #16905

mhaseeb123 · 2024-09-25T03:25:02Z

Description

This PR adds the capability to batch memcpy the last offsets for the output buffers of string and list columns in the Parquet (PQ) reader. This reduces the overhead of issuing many cudaMemcpyAsync calls when reading wide tables with string and/or list columns. @vuule found this optimization and contributed the ORC changes. See this comment for performance improvement data and discussion.
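The batching pattern described above can be sketched on the host as follows. This is only an illustration, not the PR's code: `batched_memcpy_host` and `write_final_offsets_batched` are hypothetical names, and `std::memcpy` in a loop stands in for a device-side batched copy such as `cub::DeviceMemcpy::Batched`. The point is the shape of the interface: the per-column loop only collects source pointers, destination pointers, and sizes, and a single call performs all the copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a batched copy primitive: it takes
// parallel arrays of sources, destinations, and byte sizes and performs
// every copy in one call. On a GPU this would be one batched operation;
// here it is a plain loop over std::memcpy.
inline void batched_memcpy_host(std::vector<void const*> const& srcs,
                                std::vector<void*> const& dsts,
                                std::vector<std::size_t> const& sizes)
{
  assert(srcs.size() == dsts.size() && srcs.size() == sizes.size());
  for (std::size_t i = 0; i < srcs.size(); ++i) {
    std::memcpy(dsts[i], srcs[i], sizes[i]);
  }
}

// Hypothetical helper: write one final offset per column.
// offsets[i] is the (non-empty) offsets buffer of column i and
// final_values[i] is the value destined for its last slot.
inline void write_final_offsets_batched(std::vector<std::vector<std::int32_t>>& offsets,
                                        std::vector<std::int32_t> const& final_values)
{
  assert(offsets.size() == final_values.size());
  std::vector<void const*> srcs;
  std::vector<void*> dsts;
  std::vector<std::size_t> sizes;
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    // Collect only; no copy is issued inside the per-column loop.
    srcs.push_back(&final_values[i]);
    dsts.push_back(&offsets[i].back());
    sizes.push_back(sizeof(std::int32_t));
  }
  // One batched call replaces offsets.size() individual 4-byte copies.
  batched_memcpy_host(srcs, dsts, sizes);
}
```

With three columns and final values {7, 4, 9}, one call to `write_final_offsets_batched` sets the last element of each offsets buffer and leaves the other elements untouched.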

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mhaseeb123 · 2024-09-26T01:01:30Z

Performance Improvements

Summary

The time to write final offsets for STRING and LIST columns via 4-byte H2D memcpys in reader::impl::decode_page_data is significantly reduced. The effect is dramatic for wide tables with many string and list columns.

Benchmark Setup

Benchmark name: PARQUET_READER_NVBENCH -b parquet_read_wide_tables
Table properties: data_size = 1GB across num_cols = 1024 (all STRING type), cardinality=0, run_len=16
GPU: NVIDIA RTX 5880 Ada Generation

Profiles (Before = top, After = bottom)

Before: 1024 x cudaMemcpyAsync = 2.393 ms
After: 9 µs for 2 x cudaMemcpyAsync (the two small red bars just before the highlighted region) + 214 µs for cub::DeviceMemcpy::Batched()
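As a rough worked calculation from the figures quoted above, assuming the two "After" durations simply add up (the profile may not show them strictly back to back), the sketch below computes the combined time, the resulting speedup, and the average cost per individual copy in the "Before" case. All identifiers are made up for this illustration.

```cpp
#include <cassert>

// Timings quoted in the profiles, in microseconds.
constexpr double before_us        = 2393.0;  // 1024 x cudaMemcpyAsync
constexpr double after_copies_us  = 9.0;     // 2 x cudaMemcpyAsync
constexpr double after_batched_us = 214.0;   // cub::DeviceMemcpy::Batched()
constexpr int num_cols            = 1024;

// Combined "After" time, assuming the two parts are summed.
constexpr double after_total_us() { return after_copies_us + after_batched_us; }

// Ratio of "Before" to "After" time for writing the final offsets.
constexpr double speedup() { return before_us / after_total_us(); }

// Average cost of one 4-byte copy in the "Before" profile.
constexpr double per_copy_before_us() { return before_us / num_cols; }
```

Under that assumption the final-offset writes take 223 µs in total, roughly 10.7 times faster than before, and each of the original 1024 copies cost about 2.3 µs on average.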

CC @GregoryKimball

cpp/include/cudf/io/detail/batched_memcpy.hpp

vuule

cool stuff!
Bunch of small suggestions, mostly to polish the new functions

copy-pr-bot · 2024-09-30T21:43:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cpp/include/cudf/detail/utilities/batched_memcpy.hpp

cpp/src/io/orc/stripe_enc.cu

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

mhaseeb123 · 2024-10-02T17:22:16Z

/ok to test

vyasr

This is nice!

cpp/include/cudf/detail/utilities/batched_memcpy.hpp

cpp/src/io/parquet/page_data.cu

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>

mhaseeb123 · 2024-10-03T00:59:34Z

/ok to test

mhaseeb123 · 2024-10-03T00:59:48Z

/merge

Add capability to batch memcpy the last offsets to str and list out_bufs

74ee6ae

mhaseeb123 self-assigned this Sep 25, 2024

github-actions bot added the libcudf label Sep 25, 2024

mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 25, 2024 03:25

mhaseeb123 added the 2 - In Progress, improvement, non-breaking, and cuIO labels Sep 25, 2024

mhaseeb123 changed the title ~~Batch memcpy the last offsets for output buffers of str and list cols in PQ reader~~ 🚧 Batch memcpy the last offsets for output buffers of str and list cols in PQ reader Sep 25, 2024

vuule self-requested a review September 25, 2024 18:34

mhaseeb123 added 5 commits September 25, 2024 18:51

Move WriteFinalOffsetsBatched out of the for loop

cab885d

Generalize the API and ORC changes by @vuule

b15e3d3

Use make_zeroed_device_uvector_async instead

50dcd71

Merge branch 'branch-24.12' into fea-batch-memcpy-list-str-output-buff-offsets

bd44ca0

Add gtest for batched_memcpy

800b271

github-actions bot added the CMake label Sep 26, 2024

mhaseeb123 changed the title ~~🚧 Batch memcpy the last offsets for output buffers of str and list cols in PQ reader~~ Batch memcpy the last offsets for output buffers of str and list cols in PQ reader Sep 26, 2024

mhaseeb123 marked this pull request as ready for review September 26, 2024 01:02

mhaseeb123 requested a review from a team as a code owner September 26, 2024 01:02

mhaseeb123 requested a review from vyasr September 26, 2024 01:02

mhaseeb123 added the 3 - Ready for Review label and removed the 2 - In Progress label Sep 26, 2024

mhaseeb123 commented Sep 26, 2024

cpp/include/cudf/io/detail/batched_memcpy.hpp (outdated, resolved)

cpp/include/cudf/io/detail/batched_memcpy.hpp (outdated, resolved)

mhaseeb123 added 3 commits September 25, 2024 18:20

Update cpp/include/cudf/io/detail/batched_memcpy.hpp

31a755b

Update cpp/include/cudf/io/detail/batched_memcpy.hpp

b29329b

Comments update

4efb989

vuule requested changes Sep 26, 2024

mhaseeb123 added 2 commits September 27, 2024 01:03

Address reviewer comments

cc2829f

Style fix

78d68a8

vuule reviewed Sep 30, 2024

cpp/src/io/orc/stripe_enc.cu (outdated, resolved)

mhaseeb123 added 2 commits September 30, 2024 21:41

Minor updates

2372fbb

Merge branch 'fea-batch-memcpy-list-str-output-buff-offsets' of https://github.com/mhaseeb123/cudf into fea-batch-memcpy-list-str-output-buff-offsets

6100c94

mhaseeb123 requested a review from vuule September 30, 2024 23:50

vuule approved these changes Oct 1, 2024

mhaseeb123 added 3 commits October 1, 2024 00:53

Minor comment update

4ea0930

Minor comment update

3eea6e2

Style fix and add to CI.

6d078c2

mhaseeb123 requested a review from a team as a code owner October 1, 2024 01:46

Revert erroneous commit

1cc4e1f

mhaseeb123 removed the request for review from a team October 1, 2024 01:51

ttnghia reviewed Oct 2, 2024

cpp/include/cudf/detail/utilities/batched_memcpy.hpp (outdated, resolved)

ttnghia reviewed Oct 2, 2024

cpp/include/cudf/detail/utilities/batched_memcpy.hpp (outdated, resolved)

ttnghia reviewed Oct 2, 2024

cpp/src/io/orc/stripe_enc.cu (outdated, resolved)

mhaseeb123 and others added 5 commits October 2, 2024 09:57

Update cpp/include/cudf/detail/utilities/batched_memcpy.hpp

042cfc0

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

Apply suggestions from review

eee6f6d

Minor updates from review

828e0ac

Minor

ecc4252

Merge branch 'branch-24.12' into fea-batch-memcpy-list-str-output-buff-offsets

4bd83db

mhaseeb123 requested a review from ttnghia October 2, 2024 17:21

vyasr approved these changes Oct 2, 2024

cpp/include/cudf/detail/utilities/batched_memcpy.hpp (outdated, resolved)

cpp/src/io/parquet/page_data.cu (outdated, resolved)

mhaseeb123 and others added 3 commits October 2, 2024 17:48

Update cpp/src/io/parquet/page_data.cu

871854b

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>

Comments update.

3e30777

Merge branch 'branch-24.12' into fea-batch-memcpy-list-str-output-buff-offsets

16540a1

rapids-bot bot merged commit 7ae5360 into rapidsai:branch-24.12 Oct 3, 2024
99 checks passed

mhaseeb123 deleted the fea-batch-memcpy-list-str-output-buff-offsets branch October 3, 2024 03:26
