Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally return compression statistics from ORC and Parquet writers #13294

Merged
merged 25 commits into from
May 16, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented May 4, 2023

Description

Partially address #13252

Adds an option to ORC and Parquet writers to take a shared pointer to a writer_compression_statistics objects. This object is then updated with the compression statistics from the written file.
In the case of chunked writing, statistics object is updated after each file. If a writing operation fails, statistics are not modified.
One statistics object can be reused in multiple write calls, but it should not be used concurrently in multiple threads.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels May 4, 2023
@vuule vuule self-assigned this May 4, 2023
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels May 4, 2023
@vuule vuule marked this pull request as ready for review May 10, 2023 06:24
@vuule vuule requested review from a team as code owners May 10, 2023 06:24
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

*
* @param comp_stats Pointer to compression statistics to be updated after writing
*/
void set_compression_statistics(std::shared_ptr<writer_compression_statistics> const& comp_stats)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Please remind me what the right call is here for parameters: passing by value vs const reference. The methods above pass by value and std::move(). We're paying for 1 copy anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got tripped up by the shared_ptr, but, yes, we should be able to pass by value + move here as well.

Copy link
Contributor

@ttnghia ttnghia May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the statistics is generated internally by the writer, isn't it? If so, why do we need this function, and also why do we need to have the shared_ptr of that object in the input options to writer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statistics object is created by the user (see tests). That's why we have this shared lifetime management. Basically users registers an object for the writer to write the statistics into.
FWIW, this also means that a statistics object is reusable :)

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super. I just have a couple of nitpicks, and some clarifying questions.

cpp/include/cudf/io/types.hpp Show resolved Hide resolved
cpp/src/io/comp/statistics.cu Show resolved Hide resolved
cpp/src/io/comp/statistics.cu Outdated Show resolved Hide resolved
cpp/src/io/comp/statistics.cu Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/tests/io/orc_test.cpp Outdated Show resolved Hide resolved
*/
[[nodiscard]] auto compression_ratio() const noexcept
{
return static_cast<double>(num_compressed_bytes()) / _num_compressed_output_bytes;
Copy link
Contributor

@ttnghia ttnghia May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling for div by zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We let C++ handle it and return a nan. I added a note about this behavior in the docs of compression_ratio.
The alternative would be an optional, but I think nan fits better here.


namespace cudf::io {

writer_compression_statistics collect_compression_statistics(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not shared_ptr anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function creates a writer_compression_statistics object, no need to know how its lifetime might be managed after it's created. If we think of this as a factory, we could return a unique_ptr, but I don't think it's beneficial here.

cpp/src/io/orc/orc_gpu.hpp Outdated Show resolved Hide resolved
* @param single_table_stats Statistics data from the current table
* @param compression_stats Compression statistics from the current table
*/
void update_statistics(size_type num_rows,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add this declaration here? Can we make this a free function?

*
* @param compression_stats Optional compression statistics from the current table
*/
void update_compression_statistics(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question, can we make this a free function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It changes the internal state of the impl object, I think it fits better as a member function.

@vuule vuule requested a review from ttnghia May 15, 2023 22:43
@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 16, 2023
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

@@ -1136,7 +1136,7 @@ __global__ void __launch_bounds__(256)
device_span<device_span<uint8_t const>> inputs,
device_span<device_span<uint8_t>> outputs,
device_span<compression_result> results,
uint8_t* compressed_bfr,
device_span<uint8_t> compressed_bfr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@vuule
Copy link
Contributor Author

vuule commented May 16, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3e5f00e into rapidsai:branch-23.06 May 16, 2023
@vuule vuule deleted the fea-compression-stats branch May 16, 2023 21:28
rapids-bot bot pushed a commit that referenced this pull request May 18, 2023
This adds JNI to expose compression statistics for ORC and Parquet writer, which was implemented in #13294.

For doing so, `TableWriter` has been changed from an interface into an abstract class. Related derived writer classes are also updated accordingly.

Closes #13252.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #13376
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants