Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44066: [Python] Add Python wrapper for JsonExtensionType #44070

Merged
merged 20 commits into from
Oct 22, 2024

Conversation

rok
Copy link
Member

@rok rok commented Sep 11, 2024

Rationale for this change

We added canonical JsonExtensionType and we should make it usable from Python.

What changes are included in this PR?

Python wrapper for JsonExtensionType and JsonArray are added on Python side as well as JsonArray on c++ side.

Are these changes tested?

Python tests for the extension type and array are included.

Are there any user-facing changes?

This adds a json canonical extension type to pyarrow.

Copy link

⚠️ GitHub issue #44066 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Added some inline comments. One other comment:

  • Can you add the new classes to test_extension_type_constructor_errors in test_misc.py

Are you planning to work on enabling the parquet integration in a later PR?

python/pyarrow/array.pxi Show resolved Hide resolved
python/pyarrow/lib.pxd Show resolved Hide resolved
cpp/src/arrow/extension/json.h Outdated Show resolved Hide resolved
python/pyarrow/types.pxi Outdated Show resolved Hide resolved
python/pyarrow/tests/test_extension_type.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_extension_type.py Show resolved Hide resolved
python/pyarrow/tests/test_extension_type.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 12, 2024
@rok
Copy link
Member Author

rok commented Sep 12, 2024

Thanks for working on this!

Thanks for the review!

  • Can you add the new classes to test_extension_type_constructor_errors in test_misc.py

Added.

Are you planning to work on enabling the parquet integration in a later PR?

Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 12, 2024
@jorisvandenbossche
Copy link
Member

Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

I would be surprised if it works locally out of the box because IIRC the option was disabled by default in C++ for now? (so what we would need on the python side is expose that new arrow_extensions_enabled options so that the user can enable it)
Although I suppose that you wrote the file with pyarrow, and at that point we give priority to the stored schema, I suppose, even if the option is not enabled.

Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

There might somehow we some conflict with the stdlib module?

@jorisvandenbossche
Copy link
Member

Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

There might somehow we some conflict with the stdlib module?

Actually not that, but we have a pyarrow.json module already. So we will have to call this function differently anyway. For others with a conflict with added a trailing underscore, so could do that here as well.

@pytest.mark.parametrize("storage_type", (
pa.utf8(), pa.large_utf8(), pa.string(), pa.large_string()))
@pytest.mark.parquet
def test_parquet_json(tmpdir, storage_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move those tests to parquet subdir? (e.g. python/pyarrow/tests/parquet/test_data_types.py)
I know we already have parquet related tests in this file, but those are for custom extension type support, while this will be a built-in extension type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 12, 2024
@rok rok force-pushed the python_json_extension_type_wrapper branch from 4393536 to cb6820b Compare September 12, 2024 14:01
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 12, 2024
@rok
Copy link
Member Author

rok commented Sep 12, 2024

Do we need to do something Python-side? C++ side was covered by #13901. I've added a basic parquet test here and it works for me locally.

I would be surprised if it works locally out of the box because IIRC the option was disabled by default in C++ for now? (so what we would need on the python side is expose that new arrow_extensions_enabled options so that the user can enable it) Although I suppose that you wrote the file with pyarrow, and at that point we give priority to the stored schema, I suppose, even if the option is not enabled.

I see. I assumed arrow_extensions_enabled was on by default. Do we want to expose a global setter? E.g.: pyarrow.parquet.set_arrow_extensions_enabled? Or would it be preferred to add a parameter for pq.write_table?

Question: any idea why pytest is saying pa.json is not callable in CI? Locally this works as expected.

There might somehow we some conflict with the stdlib module?
[..]
Actually not that, but we have a pyarrow.json module already. So we will have to call this function differently anyway. For > others with a conflict with added a trailing underscore, so could do that here as well.

It seems this was the case, changed extension type name to pa.json_.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 12, 2024
@rok rok force-pushed the python_json_extension_type_wrapper branch from 8ddc682 to 514985a Compare October 8, 2024 11:13
@rok
Copy link
Member Author

rok commented Oct 8, 2024

Another rebase on master.

@rok rok force-pushed the python_json_extension_type_wrapper branch from 514985a to bf25e3a Compare October 8, 2024 17:37
@rok
Copy link
Member Author

rok commented Oct 14, 2024

@pitrou could you do a quick pass here in case anything stands out please?

Create an extension array

>>> arr = [None, '{ "id":30, "values":["a", "b"] }']
>>> storage = pa.array(arr, pa.large_utf8())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: it would be nice if one could write json_array = pa.array(arr, json_type).
Perhaps open a feature request?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a nice feature. #44406

python/pyarrow/types.pxi Show resolved Hide resolved

for storage_type in (pa.int32(), pa.large_binary(), pa.float32()):
with pytest.raises(
pa.ArrowInvalid,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity this doesn't raise TypeError.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could catch and raise it but it's probably not a good idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. TypeError would have to be raised at the C++ level instead. Anyway, this is out of scope for this PR.

python/pyarrow/tests/test_extension_type.py Outdated Show resolved Hide resolved
@rok
Copy link
Member Author

rok commented Oct 14, 2024

@pitrou do you think this merits a merge or should we wait for another review?

@pitrou
Copy link
Member

pitrou commented Oct 15, 2024

What's the plan for the Parquet arrow_extensions_enabled option?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 22, 2024
@rok
Copy link
Member Author

rok commented Oct 22, 2024

@pitrou

What's the plan for the Parquet arrow_extensions_enabled option?

Perhaps we should open another issue for it? Current implementation seems to roundtrip to parquet ok.
I'd propose something like this:

diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd
index d6aebd8284..32e2618ecf 100644
--- a/python/pyarrow/_parquet.pxd
+++ b/python/pyarrow/_parquet.pxd
@@ -405,6 +405,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil:
         CCacheOptions cache_options() const
         void set_coerce_int96_timestamp_unit(TimeUnit unit)
         TimeUnit coerce_int96_timestamp_unit() const
+        void set_arrow_extensions_enabled(c_bool enabled)
 
     ArrowReaderProperties default_arrow_reader_properties()
 
diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index 254bfe3b09..6ae1726c71 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -1441,7 +1441,8 @@ cdef class ParquetReader(_Weakrefable):
              FileDecryptionProperties decryption_properties=None,
              thrift_string_size_limit=None,
              thrift_container_size_limit=None,
-             page_checksum_verification=False):
+             page_checksum_verification=False,
+             arrow_extensions_enabled=False):
         """
         Open a parquet file for reading.
 
@@ -1458,6 +1459,7 @@ cdef class ParquetReader(_Weakrefable):
         thrift_string_size_limit : int, optional
         thrift_container_size_limit : int, optional
         page_checksum_verification : bool, default False
+        arrow_extensions_enabled: bool, default False
         """
         cdef:
             shared_ptr[CFileMetaData] c_metadata
@@ -1522,6 +1524,9 @@ cdef class ParquetReader(_Weakrefable):
         if read_dictionary is not None:
             self._set_read_dictionary(read_dictionary, &arrow_props)
 
+        if arrow_extensions_enabled:
+            arrow_props.set_arrow_extensions_enabled(<c_bool>True)
+
         with nogil:
             check_status(builder.memory_pool(self.pool)
                          .properties(arrow_props)

@pitrou
Copy link
Member

pitrou commented Oct 22, 2024

@rok, yes, we should open a new issue for it

@pitrou pitrou merged commit bcb4653 into apache:main Oct 22, 2024
15 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Oct 22, 2024
@rok
Copy link
Member Author

rok commented Oct 22, 2024

Opened an issue for the arrow_extensions_enabled #44500 problem.

Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit bcb4653.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants