Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add write_parquet to pylibcudf #17263

Merged
merged 30 commits into from
Nov 22, 2024

Conversation

mroeschke
Copy link
Contributor

Description

Broken off from #17252 since also replacing cudf Python's write_parquet usage would have made the PR fairly large.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package labels Nov 7, 2024
@mroeschke mroeschke self-assigned this Nov 7, 2024
@mroeschke mroeschke requested a review from a team as a code owner November 7, 2024 02:17
@mroeschke mroeschke requested review from wence- and bdice November 7, 2024 02:17
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 7, 2024
@pytest.mark.parametrize("max_page_size_bytes", [None, 100])
@pytest.mark.parametrize("max_page_size_rows", [None, 1])
@pytest.mark.parametrize("max_dictionary_size", [None, 100])
def test_write_parquet(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This test already generates 32k tests. Thoughts on if that's OK and if not which parameters are most important to exercise
  2. So far I'm just asserting that assert isinstance(result, plc.io.parquet.BufferArrayFromVector). I'm hoping that cudf Python can assert more specifics about the metadata result (given that I'm testing a lot of params here) but happy to assert something stronger if desired

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like too many. I think we have to trust that the combinations of parameters are implemented correctly by libcudf, so we can test all of these parameters in groups that make "logical" sense together.

We could also just check that round-tripping through set/get works individually for each setter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like some kind of reduction here before approving.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. I reduced the parameters to run ~1000 tests that takes 9ish seconds on a DGX

Comment on lines 542 to 572
cdef class BufferArrayFromVector:
@staticmethod
cdef BufferArrayFromVector from_unique_ptr(
unique_ptr[vector[uint8_t]] in_vec
):
cdef BufferArrayFromVector buf = BufferArrayFromVector()
buf.in_vec = move(in_vec)
buf.length = dereference(buf.in_vec).size()
return buf

def __getbuffer__(self, Py_buffer *buffer, int flags):
cdef Py_ssize_t itemsize = sizeof(uint8_t)

self.shape[0] = self.length
self.strides[0] = 1

buffer.buf = dereference(self.in_vec).data()

buffer.format = NULL # byte
buffer.internal = NULL
buffer.itemsize = itemsize
buffer.len = self.length * itemsize # product(shape) * itemsize
buffer.ndim = 1
buffer.obj = self
buffer.readonly = 0
buffer.shape = self.shape
buffer.strides = self.strides
buffer.suboffsets = NULL

def __releasebuffer__(self, Py_buffer *buffer):
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now implemented twice (it is called HostBuffer in contiguous_split.pyx). Can we rationalise the implementations into a utilities submodule?

unique_ptr[vector[uint8_t]] in_vec
):
cdef BufferArrayFromVector buf = BufferArrayFromVector()
buf.in_vec = move(in_vec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: conventionally we just call these things c_obj.

cdef ParquetWriterOptionsBuilder bldr = ParquetWriterOptionsBuilder.__new__(
ParquetWriterOptionsBuilder
)
bldr.builder = parquet_writer_options.builder(sink.c_obj, table.view())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: call builder c_obj?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One lifetime issue. Please also add type stubs for these new objects.

python/pylibcudf/pylibcudf/io/parquet.pyx Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved
@wence- wence- mentioned this pull request Nov 15, 2024
3 tasks
@Matt711 Matt711 mentioned this pull request Nov 18, 2024
3 tasks
Now that we use HostBuffer to take ownership of metadata from
write_parquet, we must handle the case of a null pointer as input.
Seems to induce a bug in libcudf for now.
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this now looks good, but I have written part of it...

python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/types.pyi Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions about data types.

python/pylibcudf/pylibcudf/io/parquet.pxd Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyi Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/types.pyx Outdated Show resolved Hide resolved
@vyasr vyasr requested a review from bdice November 21, 2024 01:33
python/pylibcudf/pylibcudf/io/parquet.pxd Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyi Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/io/parquet.pyx Outdated Show resolved Hide resolved

cpdef ParquetWriterOptionsBuilder int96_timestamps(self, bool enabled):
"""
Sets whether int96 timestamps are written or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit unclear. If false, this still writes the data, just as a different type. Look at the C++ code to see what this should say.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the Builder method docstring says this too (@brief Sets whether int96 timestamps are written or not.), but was able to lift a description from the an associated enable_int96_timestamps method


cpdef ParquetWriterOptionsBuilder write_v2_headers(self, bool enabled):
"""
Set to true if V2 page headers are to be written.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same kind of ambiguity here, should we clarify that false writes v1 headers?

Parameters
----------
req : bool
True = use int96 physical type. False = use int64 physical type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of info I wanted above about int96 timestamps.

@pytest.mark.parametrize("max_page_size_bytes", [None, 100])
@pytest.mark.parametrize("max_page_size_rows", [None, 1])
@pytest.mark.parametrize("max_dictionary_size", [None, 100])
def test_write_parquet(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like some kind of reduction here before approving.

mroeschke and others added 10 commits November 21, 2024 12:06
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my last round of feedback. Are we still targeting 24.12 with this?

@vyasr vyasr changed the base branch from branch-24.12 to branch-25.02 November 22, 2024 18:21
@vyasr
Copy link
Contributor

vyasr commented Nov 22, 2024

Bumped to 25.02

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 53e4525 into rapidsai:branch-25.02 Nov 22, 2024
112 checks passed
@mroeschke mroeschke deleted the plc/io/parquet_writer_only branch November 22, 2024 19:01
rapids-bot bot pushed a commit that referenced this pull request Dec 6, 2024
Follow up of #17263, this PR adds the parquet reader options classes to pylibcudf and plumbs the changes through cudf python.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)

URL: #17464
rapids-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Removes unused IO utilities from cuDF Python. Depends on #17163 #16042 #17252 #17263

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #17374
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants