Add write_parquet to pylibcudf #17263
Conversation
@pytest.mark.parametrize("max_page_size_bytes", [None, 100]) | ||
@pytest.mark.parametrize("max_page_size_rows", [None, 1]) | ||
@pytest.mark.parametrize("max_dictionary_size", [None, 100]) | ||
def test_write_parquet( |
- This test already generates 32k test cases. Thoughts on whether that's OK and, if not, which parameters are most important to exercise?
- So far I'm just asserting `assert isinstance(result, plc.io.parquet.BufferArrayFromVector)`. I'm hoping that cudf Python can assert more specifics about the metadata result (given that I'm testing a lot of params here), but I'm happy to assert something stronger if desired.
That seems like too many. I think we have to trust that the combinations of parameters are implemented correctly by libcudf, so we can test all of these parameters in groups that make "logical" sense together.
We could also just check that round-tripping through set/get works individually for each setter.
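For illustration, a minimal sketch of the grouping idea, assuming the same parameter names as the test above (the test name and body are hypothetical):

```python
import pytest

# Group logically related knobs into a single parametrize so the test count
# grows with the number of groups rather than with the full cross product.
PAGE_SETTINGS = [
    # (max_page_size_bytes, max_page_size_rows, max_dictionary_size)
    (None, None, None),  # library defaults
    (100, 1, 100),       # small limits exercised together
]


@pytest.mark.parametrize(
    "max_page_size_bytes,max_page_size_rows,max_dictionary_size",
    PAGE_SETTINGS,
)
def test_write_parquet_page_settings(
    max_page_size_bytes, max_page_size_rows, max_dictionary_size
):
    # Hypothetical body: build writer options with these settings, call
    # write_parquet, and assert a buffer-like result, as in the original test.
    ...
```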
I would like some kind of reduction here before approving.
Sure thing. I reduced the parameters to run ~1000 tests, which take about 9 seconds on a DGX.
cdef class BufferArrayFromVector:
    @staticmethod
    cdef BufferArrayFromVector from_unique_ptr(
        unique_ptr[vector[uint8_t]] in_vec
    ):
        cdef BufferArrayFromVector buf = BufferArrayFromVector()
        buf.in_vec = move(in_vec)
        buf.length = dereference(buf.in_vec).size()
        return buf

    def __getbuffer__(self, Py_buffer *buffer, int flags):
        cdef Py_ssize_t itemsize = sizeof(uint8_t)

        self.shape[0] = self.length
        self.strides[0] = 1

        buffer.buf = dereference(self.in_vec).data()

        buffer.format = NULL  # byte
        buffer.internal = NULL
        buffer.itemsize = itemsize
        buffer.len = self.length * itemsize  # product(shape) * itemsize
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = 0
        buffer.shape = self.shape
        buffer.strides = self.strides
        buffer.suboffsets = NULL

    def __releasebuffer__(self, Py_buffer *buffer):
        pass
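Because `BufferArrayFromVector` implements the buffer protocol, the metadata blob returned by the pylibcudf `write_parquet` call can be consumed with standard Python tools. A minimal sketch, assuming `result` is the object returned by the writer as in the test above:

```python
# Zero-copy view over the serialized metadata held by the C++ vector.
view = memoryview(result)
assert view.ndim == 1 and view.itemsize == 1

# bytes() copies the data out into a Python-owned object.
metadata_bytes = bytes(view)
```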
This is now implemented twice (it is called `HostBuffer` in `contiguous_split.pyx`). Can we rationalise the implementations into a utilities submodule?
    unique_ptr[vector[uint8_t]] in_vec
):
    cdef BufferArrayFromVector buf = BufferArrayFromVector()
    buf.in_vec = move(in_vec)
nit: conventionally we just call these things `c_obj`.
cdef ParquetWriterOptionsBuilder bldr = ParquetWriterOptionsBuilder.__new__(
    ParquetWriterOptionsBuilder
)
bldr.builder = parquet_writer_options.builder(sink.c_obj, table.view())
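For context, a hedged sketch of how this builder might be driven from Python; the exact names (`SinkInfo`, `ParquetWriterOptions.builder`, `build`, `write_parquet`) and signatures are assumptions pieced together from the snippets in this PR, not the final API:

```python
import io

import pylibcudf as plc

# Hypothetical usage of the options builder added in this PR; `table` is
# assumed to be an existing plc.Table and the method names are assumed.
sink = plc.io.SinkInfo([io.BytesIO()])
options = (
    plc.io.parquet.ParquetWriterOptions.builder(sink, table)
    .int96_timestamps(False)  # write timestamps with the int64 physical type
    .write_v2_headers(False)  # write V1 page headers
    .build()
)
metadata = plc.io.parquet.write_parquet(options)  # buffer-protocol result
```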
nit: call `builder` `c_obj`?
One lifetime issue. Please also add type stubs for these new objects.
Now that we use `HostBuffer` to take ownership of metadata from `write_parquet`, we must handle the case of a null pointer as input. For now, this seems to trigger a bug in libcudf.
I think this now looks good, but I have written part of it...
Some questions about data types.
cpdef ParquetWriterOptionsBuilder int96_timestamps(self, bool enabled):
    """
    Sets whether int96 timestamps are written or not.
This is a bit unclear. If false, this still writes the data, just as a different type. Look at the C++ code to see what this should say.
FWIW the Builder method docstring says this too (`@brief Sets whether int96 timestamps are written or not.`), but I was able to lift a description from the associated `enable_int96_timestamps` method.
cpdef ParquetWriterOptionsBuilder write_v2_headers(self, bool enabled):
    """
    Set to true if V2 page headers are to be written.
Same kind of ambiguity here; should we clarify that false writes V1 page headers?
Parameters
----------
req : bool
    True = use int96 physical type. False = use int64 physical type.
This is the kind of info I wanted above about int96 timestamps.
@pytest.mark.parametrize("max_page_size_bytes", [None, 100]) | ||
@pytest.mark.parametrize("max_page_size_rows", [None, 1]) | ||
@pytest.mark.parametrize("max_dictionary_size", [None, 100]) | ||
def test_write_parquet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like some kind of reduction here before approving.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Thanks for addressing my last round of feedback. Are we still targeting 24.12 with this?
Bumped to 25.02
/merge
Follow-up of #17263, this PR adds the parquet reader options classes to pylibcudf and plumbs the changes through cudf Python.
Authors:
- Matthew Murray (https://github.com/Matt711)
Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
- MithunR (https://github.com/mythrocks)
URL: #17464
Removes unused IO utilities from cuDF Python. Depends on #17163 #16042 #17252 #17263
Authors:
- Matthew Murray (https://github.com/Matt711)
Approvers:
- Bradley Dice (https://github.com/bdice)
URL: #17374
Description
Broken off from #17252 since also replacing cudf Python's `write_parquet` usage would have made the PR fairly large.
Checklist