Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a chunked_pack API #13260

Merged
merged 45 commits into from
May 18, 2023
Merged

Conversation

abellina
Copy link
Contributor

@abellina abellina commented May 1, 2023

Description

This PR introduces a "chunked pack" API built on top of contiguous_split. This API is used when we want to copy a cuDF table_view over the wire or to the host in a contiguous layout (aka contiguous_split), but without user provided memory. As a result this API does not allocate any buffers for GPU data and instead it uses a provided user buffer to perform the contiguous_split in chunks.

Luckily, contiguous_split already had a subdivision of work that we are now calling "batches". Each batch is up to 1MB of data from the source table. As such, one can think of this function as copying as many batches as will fit in a user buffer (or "chunking the batches"). The API follows other chunked interfaces in cuDF with a has_next and a next, with the difference that in this case next takes a device_span, and the user can provide any device_span as long as the size is the same as the size specified during construction.

When thrust and scratch space is required on the GPU, this PR makes use of the memory resource passed, even to the point of using the second argument of exec_policy which is the memory resource. I found this while testing that exec_policy defaults to the per-device resource, and in this case I really wanted to pass a pooled memory resource, outside of our normal async memory resource to set aside this memory ahead of time.

Most of the changes are about moving things around. You'll see 3 structs with the name "packed" (e.g. packed_split_indices_and_src_buf_info). These are here to group together state that contiguous_split needs to work, but now because of chunked_pack we need to keep around as well for the subsequent calls to next. These structs are also packed in memory, which is an optimization contiguous_split had already done to reduce the number of d2h/h2d copies. This PR did need to add a state object (contiguous_split_state) that now contiguous_split leverages. It also makes use of the metadata_builder which we added in a prior PR.

This PR does not include the JNI changes needed for this to work on the java side, I'll post that separately. I figured this was too big already (and if people have suggestions on "chunking" this PR up, I am happy to do that).

@nvdbaranec spent a great deal of time documenting contiguous_split for me and he suggested a path to get this done that I just followed (thank you!!)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Co-authored-by: Dave Baranec <dbaranec@nvidia.com>

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina abellina requested a review from a team as a code owner May 1, 2023 21:40
@abellina abellina requested review from vyasr and davidwendt May 1, 2023 21:40
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 1, 2023
@abellina abellina changed the title Chunked pack implementation Implement a chunked_pack API May 1, 2023
@abellina abellina added feature request New feature or request Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels May 1, 2023
@abellina abellina mentioned this pull request May 1, 2023
4 tasks
@abellina
Copy link
Contributor Author

abellina commented May 1, 2023

I don't see major differences when running CONTIGUOUS_SPLIT_BENCH benchmark. Here are both new and old results:

Benchmark results for this PR, with the `chunked_pack` ones at the end
----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              57.8 ms         57.9 ms            8 bytes_per_second=207.691G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                57.3 ms         57.3 ms            8 bytes_per_second=216.064G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                27.9 ms         28.0 ms            8 bytes_per_second=429.477G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  29.2 ms         29.2 ms            8 bytes_per_second=423.943G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              44.5 ms         44.5 ms            8 bytes_per_second=179.893G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                47.4 ms         47.5 ms            8 bytes_per_second=173.998G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                19.5 ms         19.5 ms            8 bytes_per_second=410.685G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  19.9 ms         19.9 ms            8 bytes_per_second=414.661G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.40 ms         4.42 ms            8 bytes_per_second=468.835G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.38 ms         4.40 ms            8 bytes_per_second=470.987G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              32.6 ms         32.6 ms            8 bytes_per_second=61.4436G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                35.8 ms         35.8 ms            8 bytes_per_second=57.6267G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                5.70 ms         5.72 ms            8 bytes_per_second=351.129G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.90 ms         5.92 ms            8 bytes_per_second=349.677G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.40 ms         4.42 ms            8 bytes_per_second=469.279G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.39 ms         4.41 ms            8 bytes_per_second=470.14G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time        158 ms          158 ms            8 bytes_per_second=50.6568G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time          146 ms          146 ms            8 bytes_per_second=55.5446G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         30.6 ms         31.6 ms            8 bytes_per_second=274.595G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           21.0 ms         21.0 ms            8 bytes_per_second=406.154G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.71 ms         7.36 ms            8 bytes_per_second=335.519G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.35 ms         4.37 ms            8 bytes_per_second=524.718G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time        133 ms          133 ms            8 bytes_per_second=15.0416G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time          134 ms          134 ms            8 bytes_per_second=15.1684G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         9.99 ms         10.3 ms            8 bytes_per_second=210.255G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           7.71 ms         7.74 ms            8 bytes_per_second=276.398G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.67 ms         9.23 ms            8 bytes_per_second=449.574G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.35 ms         4.37 ms            8 bytes_per_second=696.746G/s
ChunkedPack/6Gb512ColsNoValidity/6442450944/512/0/0/iterations:8/manual_time                    23.9 ms         23.9 ms            8 bytes_per_second=501.71G/s
ChunkedPack/6Gb512ColsValidity/6442450944/512/0/1/iterations:8/manual_time                      24.7 ms         24.7 ms            8 bytes_per_second=500.825G/s
ChunkedPack/6Gb10ColsNoValidity/6442450944/10/0/0/iterations:8/manual_time                      23.8 ms         23.9 ms            8 bytes_per_second=503.552G/s
ChunkedPack/6Gb10ColsValidity/6442450944/10/0/1/iterations:8/manual_time                        26.7 ms         26.8 ms            8 bytes_per_second=463.174G/s
ChunkedPack/4Gb512ColsNoValidity/4294967296/512/0/0/iterations:8/manual_time                    16.0 ms         16.0 ms            8 bytes_per_second=500.369G/s
ChunkedPack/4Gb512ColsValidity/4294967296/512/0/1/iterations:8/manual_time                      17.9 ms         17.9 ms            8 bytes_per_second=461.314G/s
ChunkedPack/4Gb10ColsNoValidity/4294967296/10/0/0/iterations:8/manual_time                      17.7 ms         17.7 ms            8 bytes_per_second=451.413G/s
ChunkedPack/4Gb10ColsValidity/4294967296/10/0/1/iterations:8/manual_time                        16.4 ms         16.5 ms            8 bytes_per_second=501.532G/s
ChunkedPack/4Gb4ColsValidity/1073741824/4/0/1/iterations:8/manual_time                          4.22 ms         4.24 ms            8 bytes_per_second=489.245G/s
ChunkedPack/1Gb512ColsNoValidity/1073741824/512/0/0/iterations:8/manual_time                    4.86 ms         4.90 ms            8 bytes_per_second=411.202G/s
ChunkedPack/1Gb512ColsValidity/1073741824/512/0/1/iterations:8/manual_time                      4.66 ms         4.68 ms            8 bytes_per_second=442.465G/s
ChunkedPack/1Gb10ColsNoValidity/1073741824/10/0/0/iterations:8/manual_time                      4.18 ms         4.20 ms            8 bytes_per_second=478.992G/s
ChunkedPack/1Gb10ColsValidity/1073741824/10/0/1/iterations:8/manual_time                        4.25 ms         4.27 ms            8 bytes_per_second=485.494G/s
ChunkedPack/1Gb1ColValidity/1073741824/1/0/1/iterations:8/manual_time                           4.22 ms         4.24 ms            8 bytes_per_second=488.715G/s
ChunkedPackStrings/4Gb512ColsNoValidity/4294967296/512/0/0/iterations:8/manual_time             24.9 ms         25.0 ms            8 bytes_per_second=321.246G/s
ChunkedPackStrings/4Gb512ColsValidity/4294967296/512/0/1/iterations:8/manual_time               16.4 ms         16.5 ms            8 bytes_per_second=494.505G/s
ChunkedPackStrings/4Gb10ColsNoValidity/4294967296/10/0/0/iterations:8/manual_time               26.0 ms         27.1 ms            8 bytes_per_second=322.617G/s
ChunkedPackStrings/4Gb10ColsValidity/4294967296/10/0/1/iterations:8/manual_time                 16.8 ms         16.9 ms            8 bytes_per_second=506.475G/s
ChunkedPackStrings/4Gb4ColsValidity/1073741824/4/0/1/iterations:8/manual_time                   4.42 ms         4.44 ms            8 bytes_per_second=516.444G/s
ChunkedPackStrings/1Gb512ColsNoValidity/1073741824/512/0/0/iterations:8/manual_time             6.72 ms         6.74 ms            8 bytes_per_second=298.103G/s
ChunkedPackStrings/1Gb512ColsValidity/1073741824/512/0/1/iterations:8/manual_time               4.75 ms         4.77 ms            8 bytes_per_second=427.759G/s
ChunkedPackStrings/1Gb10ColsNoValidity/1073741824/10/0/0/iterations:8/manual_time               6.15 ms         6.43 ms            8 bytes_per_second=341.407G/s
ChunkedPackStrings/1Gb10ColsValidity/1073741824/10/0/1/iterations:8/manual_time                 4.31 ms         4.35 ms            8 bytes_per_second=494.412G/s
ChunkedPackStrings/1Gb1ColValidity/1073741824/1/0/1/iterations:8/manual_time                    4.15 ms         4.18 ms            8 bytes_per_second=730.982G/s
Benchmark results before this PR
----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              57.4 ms         57.4 ms            8 bytes_per_second=209.192G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                59.3 ms         59.4 ms            8 bytes_per_second=208.633G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                29.7 ms         29.7 ms            8 bytes_per_second=404.623G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  29.1 ms         29.1 ms            8 bytes_per_second=425.948G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              47.5 ms         47.6 ms            8 bytes_per_second=168.256G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                50.2 ms         50.3 ms            8 bytes_per_second=164.325G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                21.2 ms         21.3 ms            8 bytes_per_second=376.781G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  21.8 ms         21.9 ms            8 bytes_per_second=378.616G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.73 ms         4.81 ms            8 bytes_per_second=436.355G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.69 ms         4.75 ms            8 bytes_per_second=440.121G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              32.7 ms         32.8 ms            8 bytes_per_second=61.1839G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                35.1 ms         35.2 ms            8 bytes_per_second=58.6784G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                5.65 ms         5.68 ms            8 bytes_per_second=353.678G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.93 ms         5.96 ms            8 bytes_per_second=347.818G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.39 ms         4.42 ms            8 bytes_per_second=469.332G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.39 ms         4.41 ms            8 bytes_per_second=469.621G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time        153 ms          153 ms            8 bytes_per_second=52.3405G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time          145 ms          145 ms            8 bytes_per_second=56.2481G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         33.2 ms         34.3 ms            8 bytes_per_second=252.873G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           22.8 ms         22.9 ms            8 bytes_per_second=373.402G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.66 ms         7.32 ms            8 bytes_per_second=337.598G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.35 ms         4.38 ms            8 bytes_per_second=523.833G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time        138 ms          138 ms            8 bytes_per_second=14.4633G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time          131 ms          131 ms            8 bytes_per_second=15.5614G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         10.1 ms         10.4 ms            8 bytes_per_second=207.945G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           8.31 ms         8.36 ms            8 bytes_per_second=256.333G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.96 ms         9.71 ms            8 bytes_per_second=430.994G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.40 ms         4.43 ms            8 bytes_per_second=688.462G/s

@abellina abellina mentioned this pull request May 2, 2023
3 tasks
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First wave of comments.

cpp/include/cudf/contiguous_split.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/contiguous_split.hpp Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
@abellina
Copy link
Contributor Author

abellina commented May 5, 2023

Thanks @nvdbaranec @ttnghia, will address this afternoon :)

@abellina
Copy link
Contributor Author

abellina commented May 9, 2023

@ttnghia @nvdbaranec @davidwendt let me know if you have further comments. Thanks!

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass of review, mostly high-level comments and minor editorial fixes. Will take another pass tomorrow to fully wrap my head around the algorithm.

[[nodiscard]] bool has_next() const;

/**
* @brief Packs the next chunk into `user_buffer`. This should be call as long as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Packs the next chunk into `user_buffer`. This should be call as long as
* @brief Packs the next chunk into `user_buffer`. This should be called as long as

* The intent of this operation is to be used in a streamed fashion at times of GPU
* out-of-memory, where we want to minimize the number of small cudaMemcpy calls and
* tracking of all the metadata associated with cudf tables. Because of the memory constraints,
* all thrust and scratch memory allocations are using the passed-in memory resource exclusively,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with documentation, I think this behavior is unexpected given the general approach that libcudf takes. I understand why we might need to support this, but I would prefer it be communicated very clearly in the API and not rely solely on documentation. Can we modify the struct to have two memory resources, the usual mr and an extra temp_mr or so? They can both default to the per-device resource. That way the caller gets a very clear indication that the memory resource used for temp allocations is also controllable in this particular instance.

If all allocations made by this function are temporary because it's just splitting an existing buffer, then maybe just renaming the mr makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunked_pack never allocates result buffers. So I am going to replace mr in this api with temp_mr to make it clear, as you suggest

rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Destructor that will be implemented as default, required because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Destructor that will be implemented as default, required because
* @brief Destructor that will be implemented as default. Declared with definition here because

Comment on lines 2022 to 2023
CUDF_EXPECTS(user_buffer_size >= desired_batch_size,
"The output buffer size must be at least 1MB in size");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this error in the constructor?

* return true when there is work to be done, and false otherwise.
*
* contiguous_split_state::contiguous_split() performs a single-pass contiguous_split
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only is redundant with "iff".

Suggested change
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.
* and is valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.

Unless "iff" is a typo, in which case use

Suggested change
* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.
* and is only valid if contiguous_split_state is instantiated with 0 for the user_buffer_size.

Comment on lines 1323 to 1324
current_iteration(0),
starting_batch(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit

Suggested change
current_iteration(0),
starting_batch(0),
current_iteration{0},
starting_batch{0},

Comment on lines +1341 to +1342
CUDF_EXPECTS(current_iteration < num_iterations,
"current_iteration cannot exceed num_iterations");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can an external caller ever trigger this exception, or would this be purely an internal error? If the latter, maybe not something to throw an exception for, but more like an assert?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I guess this could happen if someone tried to advance the iteration on the state object without first checking if it has any chunks left?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, that's what it is protecting against.

return std::make_pair(starting_batch, count_for_current);
}

std::size_t advance_iteration()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add docs for these functions (this one and those below)? Don't need full doxygen, a simple description of inputs/output.

auto keys = cudf::detail::make_counting_transform_iterator(
0, split_key_functor{static_cast<int>(num_src_bufs)});
auto values =
cudf::detail::make_counting_transform_iterator(0, buf_size_functor{d_dst_buf_info});

thrust::reduce_by_key(rmm::exec_policy(stream),
thrust::reduce_by_key(rmm::exec_policy(stream, mr),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made exactly the same comment independently in the header file. Adding a separate temp_mr was my proposal.

std::vector<std::size_t> const h_size_of_buffs_per_iteration;
};

std::unique_ptr<chunk_iteration_state> make_chunk_iteration_state(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternate solution here that would satisfy both requests is to define the function out of band, i.e.

class chunk_iteration_state {
...
static auto create(...);
...
}
static auto chunk_iteration_state::create(...) {...}

WDYT?

@abellina
Copy link
Contributor Author

@vyasr I took care of all the comments with: dae5ceb

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know some of this code is just moved around so these are just suggestions for improvement.

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
@abellina
Copy link
Contributor Author

I think I have addressed all issues known to me @vyasr and @davidwendt let me know if you have other things you want me to tackle.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for improvement and a couple of questions. I didn't appreciate at first how much of the code is simply moved around rather than new, so I won't push for too many further changes here. I think we can merge this and iterate further later if desired.

I personally find some of the organization confusing. For instance, it's not clear when constructors do a lot of work, when they just set up pointers to preexisting allocations, or when they do nearly nothing. It's also not immediately obvious when things are implemented as methods or not, and which pieces of state are getting passed around (there are lots of struct members being accessed directly then handed off to some other algorithm). I believe I've made sense of it after a longer read-through, but I'm sure I'm missing lots of details. Would be good to come back and try to clarify more of the structure.

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
std::unique_ptr<chunk_iteration_state> chunk_iter_state;

// Two modes are allowed:
// - user provided buffer: as the name implies, the user has provided a buffer that must be at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context a user is not a libcudf user, but rather libcudf code making use of this struct, right? In particular, contiguous_split vs chunked_pack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. I will clarify this comment, it was written before we started calling this chunked_pack.. before this it was chunked_contiguous_split which was all kinds of confusing :)

return result;
}

cudf::table_view const input;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be documented using the standard doxygen style for members, see the example in our dev docs. For the members with very long descriptions maybe use a doxygen-style short description and a separate comment for the longer explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to chunked_iteration_state and to contiguous_split_state. I didn't add for the other structs in the anonymous namespace for now, but can go back to it if all member should get this treatment.

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved
max(std::size_t{1}, util::round_up_unsafe(bytes, desired_chunk_size) / desired_chunk_size);
// packed block of memory 1: split indices and src_buf_info structs
struct packed_split_indices_and_src_buf_info {
explicit packed_split_indices_and_src_buf_info(cudf::table_view const& input,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this constructor is marked explicit while none of the other structs are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. I think this was me not understanding explicit.

What are the rules for explicit in cuDF?

As far as I know, in general, the idea is to make sure the constructor is invoked, well, explicitly, rather than via a conversion. With multiple-argument constructors, I think the only conversion one would "protect" against is that via the initializer list. But I think initializer lists are preferred, so by marking constructors explicit I am probably going against that, forcing the constructor to be invoked directly. Is my understanding correct?

Copy link
Contributor Author

@abellina abellina May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the explicit here to make them consistent, but I am curious on the above for my own edification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're understanding seems correct to me: https://en.cppreference.com/w/cpp/language/explicit
Given that this an internal class I don't think there will be much chance of ambiguity between the ctor and an arbitrary init list.

@abellina
Copy link
Contributor Author

Thanks @nvdbaranec @ttnghia @davidwendt @vyasr for all the reviews!

I'll be able to revisit this code base soon with #13332, so I am going to continue to iterate to handle the feedback that was not addressed in this PR.

@abellina
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c601b83 into rapidsai:branch-23.06 May 18, 2023
rapids-bot bot pushed a commit that referenced this pull request May 18, 2023
This PR is the standalone JNI side of #13260. Therefore it doesn't build in as is, but I am putting this up as draft for the Java reviewers to start taking a look.

This implements a `ChunkedPack` java class mirroring the interface of `cudf::chunked_pack`. This is an iterator-like class that can be used to invoke `cudf::pack` (aka `cudf::contigous_split` without splits) over several iterations against a bounce buffer.

In order to create a `ChunkedPack`, the user calls `makeChunkedPack` from a `Table` instance. During this call the user can also pass an `RmmDeviceMemoryResource` to be used internally by `cudf::chunked_pack` exclusively for scratch/temporary data.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13278
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants