Add write_parquet to pylibcudf #17263
Conversation
@pytest.mark.parametrize("max_page_size_bytes", [None, 100]) | ||
@pytest.mark.parametrize("max_page_size_rows", [None, 1]) | ||
@pytest.mark.parametrize("max_dictionary_size", [None, 100]) | ||
def test_write_parquet( |
- This test already generates 32k test cases. Thoughts on whether that's OK and, if not, which parameters are most important to exercise?
- So far I'm just asserting `assert isinstance(result, plc.io.parquet.BufferArrayFromVector)`. I'm hoping that cudf Python can assert more specifics about the metadata result (given that I'm testing a lot of params here), but I'm happy to assert something stronger if desired.
That seems like too many. I think we have to trust that the combinations of parameters are implemented correctly by libcudf, so we can test all of these parameters in groups that make "logical" sense together.
We could also just check that round-tripping through set/get works individually for each setter.
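For illustration, a minimal sketch of the grouping idea, assuming the same parameter names as the test above (the test name and body are hypothetical):

```python
import pytest

# Group logically related knobs into a single parametrize so the test count
# grows with the number of groups rather than with the full cross product.
PAGE_SETTINGS = [
    # (max_page_size_bytes, max_page_size_rows, max_dictionary_size)
    (None, None, None),  # library defaults
    (100, 1, 100),       # small limits exercised together
]


@pytest.mark.parametrize(
    "max_page_size_bytes,max_page_size_rows,max_dictionary_size",
    PAGE_SETTINGS,
)
def test_write_parquet_page_settings(
    max_page_size_bytes, max_page_size_rows, max_dictionary_size
):
    # Hypothetical body: build writer options with these settings, call
    # write_parquet, and assert a buffer-like result, as in the original test.
    ...
```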
I would like some kind of reduction here before approving.
Sure thing. I reduced the parameters to run ~1000 tests, which take about 9 seconds on a DGX.
cdef class BufferArrayFromVector:
    @staticmethod
    cdef BufferArrayFromVector from_unique_ptr(
        unique_ptr[vector[uint8_t]] in_vec
    ):
        cdef BufferArrayFromVector buf = BufferArrayFromVector()
        buf.in_vec = move(in_vec)
        buf.length = dereference(buf.in_vec).size()
        return buf

    def __getbuffer__(self, Py_buffer *buffer, int flags):
        cdef Py_ssize_t itemsize = sizeof(uint8_t)

        self.shape[0] = self.length
        self.strides[0] = 1

        buffer.buf = dereference(self.in_vec).data()

        buffer.format = NULL  # byte
        buffer.internal = NULL
        buffer.itemsize = itemsize
        buffer.len = self.length * itemsize  # product(shape) * itemsize
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = 0
        buffer.shape = self.shape
        buffer.strides = self.strides
        buffer.suboffsets = NULL

    def __releasebuffer__(self, Py_buffer *buffer):
        pass
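Because `BufferArrayFromVector` implements the buffer protocol, the metadata blob returned by the pylibcudf `write_parquet` call can be consumed with standard Python tools. A minimal sketch, assuming `result` is the object returned by the writer as in the test above:

```python
# Zero-copy view over the serialized metadata held by the C++ vector.
view = memoryview(result)
assert view.ndim == 1 and view.itemsize == 1

# bytes() copies the data out into a Python-owned object.
metadata_bytes = bytes(view)
```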
This is now implemented twice (it is called `HostBuffer` in `contiguous_split.pyx`). Can we rationalise the implementations into a utilities submodule?
    unique_ptr[vector[uint8_t]] in_vec
):
    cdef BufferArrayFromVector buf = BufferArrayFromVector()
    buf.in_vec = move(in_vec)
nit: conventionally we just call these things `c_obj`.
cdef ParquetWriterOptionsBuilder bldr = ParquetWriterOptionsBuilder.__new__(
    ParquetWriterOptionsBuilder
)
bldr.builder = parquet_writer_options.builder(sink.c_obj, table.view())
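For context, a hedged sketch of how this builder might be driven from Python; the exact names (`SinkInfo`, `ParquetWriterOptions.builder`, `build`, `write_parquet`) and signatures are assumptions pieced together from the snippets in this PR, not the final API:

```python
import io

import pylibcudf as plc

# Hypothetical usage of the options builder added in this PR; `table` is
# assumed to be an existing plc.Table and the method names are assumed.
sink = plc.io.SinkInfo([io.BytesIO()])
options = (
    plc.io.parquet.ParquetWriterOptions.builder(sink, table)
    .int96_timestamps(False)  # write timestamps with the int64 physical type
    .write_v2_headers(False)  # write V1 page headers
    .build()
)
metadata = plc.io.parquet.write_parquet(options)  # buffer-protocol result
```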
nit: call `builder` `c_obj`?
One lifetime issue. Please also add type stubs for these new objects.
Now that we use `HostBuffer` to take ownership of metadata from `write_parquet`, we must handle the case of a null pointer as input. For now, this seems to trigger a bug in libcudf.
I think this now looks good, but I have written part of it...
Some questions about data types.
cpdef ParquetWriterOptionsBuilder int96_timestamps(self, bool enabled):
    """
    Sets whether int96 timestamps are written or not.
This is a bit unclear. If false, this still writes the data, just as a different type. Look at the C++ code to see what this should say.
FWIW the Builder method docstring says this too (`@brief Sets whether int96 timestamps are written or not.`), but I was able to lift a description from the associated `enable_int96_timestamps` method.
cpdef ParquetWriterOptionsBuilder write_v2_headers(self, bool enabled):
    """
    Set to true if V2 page headers are to be written.
Same kind of ambiguity here; should we clarify that false writes V1 page headers?
Parameters
----------
req : bool
    True = use int96 physical type. False = use int64 physical type.
This is the kind of info I wanted above about int96 timestamps.
@pytest.mark.parametrize("max_page_size_bytes", [None, 100]) | ||
@pytest.mark.parametrize("max_page_size_rows", [None, 1]) | ||
@pytest.mark.parametrize("max_dictionary_size", [None, 100]) | ||
def test_write_parquet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like some kind of reduction here before approving.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Thanks for addressing my last round of feedback. Are we still targeting 24.12 with this?
Bumped to 25.02
/merge
Follow-up of #17263, this PR adds the parquet reader options classes to pylibcudf and plumbs the changes through cudf Python.
Authors:
- Matthew Murray (https://github.com/Matt711)
Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
- MithunR (https://github.com/mythrocks)
URL: #17464
Removes unused IO utilities from cuDF Python. Depends on #17163 #16042 #17252 #17263
Authors:
- Matthew Murray (https://github.com/Matt711)
Approvers:
- Bradley Dice (https://github.com/bdice)
URL: #17374
Description
Broken off from #17252 since also replacing cudf Python's `write_parquet` usage would have made the PR fairly large.
Checklist