
Python: PyArrow HDFS support #7997

Merged 2 commits on Jul 26, 2023

Conversation

@LuigiCerone (Contributor) commented Jul 5, 2023

Closes #7974.

@github-actions github-actions bot added the python label Jul 5, 2023
@Fokko (Contributor) left a comment


Thanks for picking this up @LuigiCerone. Looking good. Could you also update the documentation?

Review thread on python/pyiceberg/io/__init__.py (outdated, resolved)
@Fokko Fokko added this to the PyIceberg 0.4.1 milestone Jul 6, 2023
@Fokko Fokko changed the title Add pyarrow hdfs support Python: Add pyarrow hdfs support Jul 10, 2023
@LuigiCerone (Contributor, Author) commented Jul 25, 2023

I tested this locally in a Docker environment; the metadata JSON file is one generated by the docker-spark-iceberg quickstart.

The HDFS setup was created with this repo:

root@579a35817003:/# hdfs dfs -ls /user/luigi
Found 1 items
-rw-r--r--   3 root supergroup       5153 2023-07-25 09:58 /user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json

In another container (after setting the env vars according to the pyarrow docs):

root@341c4b757a72:/opt/spark/work-dir# pip install "git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]"
root@341c4b757a72:/opt/spark/work-dir# export HADOOP_HOME=/usr/local/hadoop
root@341c4b757a72:/opt/spark/work-dir# export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
root@341c4b757a72:/opt/spark/work-dir# python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.table import StaticTable
>>> table = StaticTable.from_metadata("hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json", properties={"hdfs.host": "namenode", "hdfs.port": 9000, "hdfs.user": "luigi"})
>>> table.metadata
TableMetadataV1(location='s3://warehouse/nyc/taxis', table_uuid=UUID('a4f97eac-3793-4f17-99ad-7079b6ce408a'), last_updated_ms=1690277912736, last_column_id=19, schemas=[Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), required=False), schema_id=0, identifier_field_ids=[])], current_schema_id=0, partition_specs=[PartitionSpec(PartitionField(source_id=2, field_id=1000, 
transform=DayTransform(), name='tpep_pickup_datetime_day'), spec_id=0)], default_spec_id=0, last_partition_id=1000, properties={'owner': 'root'}, current_snapshot_id=None, snapshots=[], snapshot_log=[], metadata_log=[], sort_orders=[SortOrder(order_id=0)], default_sort_order_id=0, refs={}, format_version=1, schema_=Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), 
required=False), schema_id=0, identifier_field_ids=[]), partition_spec=[{'name': 'tpep_pickup_datetime_day', 'transform': 'day', 'source-id': 2, 'field-id': 1000}])
>>> table.metadata_location
'hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json'
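For reference, the `hdfs.host`, `hdfs.port`, and `hdfs.user` properties passed to `StaticTable.from_metadata` above correspond to the connection arguments that `pyarrow.fs.HadoopFileSystem` accepts. The helper below is a hypothetical illustration of that mapping (it is not part of the PyIceberg API):

```python
# Hypothetical helper illustrating how PyIceberg-style "hdfs.*" properties
# map onto the keyword arguments of pyarrow.fs.HadoopFileSystem.
PROPERTY_TO_KWARG = {
    "hdfs.host": "host",
    "hdfs.port": "port",
    "hdfs.user": "user",
}

def hdfs_kwargs(properties: dict) -> dict:
    kwargs = {}
    for prop, kwarg in PROPERTY_TO_KWARG.items():
        if prop in properties:
            value = properties[prop]
            # "hdfs.port" may arrive as a string; HadoopFileSystem wants an int.
            kwargs[kwarg] = int(value) if kwarg == "port" else value
    return kwargs

props = {"hdfs.host": "namenode", "hdfs.port": 9000, "hdfs.user": "luigi"}
print(hdfs_kwargs(props))  # {'host': 'namenode', 'port': 9000, 'user': 'luigi'}
```

Note that the actual filesystem construction still requires `HADOOP_HOME` and `CLASSPATH` to be set, as in the container session above, because pyarrow loads libhdfs at runtime.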

@LuigiCerone LuigiCerone marked this pull request as ready for review July 25, 2023 20:03
@Fokko (Contributor) left a comment


Thanks for the PR and testing using HDFS, appreciate it @LuigiCerone. Let's get this in!

@Fokko Fokko merged commit d28f899 into apache:master Jul 26, 2023
@atifiu commented Aug 11, 2023

@LuigiCerone I am trying to use PyIceberg with HDFS, and @Fokko suggested that you got this working. I was trying to install it using the command below:
pip install "git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]"

But it throws an error. My Python version is 3.6; is it not supported? Or could you please suggest how to fix this?

Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  ERROR: Command errored out with exit status 1:
   command: /bin/python3.6 /home/matif/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /tmp/tmprow0kpai
       cwd: /tmp/pip-install-4ss5kzl4/pyiceberg_8eb97131d88a454394d3c3ef3b770300/python
  Complete output (26 lines):
  Traceback (most recent call last):
    File "/home/matif/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/home/matif/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/matif/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 164, in prepare_metadata_for_build_wheel
      return hook(metadata_directory, config_settings)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/masonry/api.py", line 43, in prepare_metadata_for_build_wheel
      poetry = Factory().create_poetry(Path(".").resolve(), with_dev=False)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/factory.py", line 34, in create_poetry
      local_config = PyProjectTOML(path=poetry_file).poetry_config
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/pyproject/toml.py", line 54, in poetry_config
      self._poetry_config = self.data.get("tool", {}).get("poetry")
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/container.py", line 541, in get
      return self[key]
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/container.py", line 582, in __getitem__
      return OutOfOrderTableProxy(self, idx)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/container.py", line 711, in __init__
      self._internal_container.append(k, v)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/container.py", line 170, in append
      current.append(k, v)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/items.py", line 921, in append
      self._value.append(key, _item)
    File "/tmp/pip-build-env-oudfpfs2/overlay/lib/python3.6/site-packages/poetry/core/_vendor/tomlkit/container.py", line 176, in append
      raise KeyAlreadyPresent(key)
  tomlkit.exceptions.KeyAlreadyPresent: Key "extras" already exists.
  ----------------------------------------
WARNING: Discarding git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]. Command errored out with exit status 1: /bin/python3.6 /home/matif/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /tmp/tmprow0kpai Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement pyiceberg (unavailable) (from versions: 0.0.1, 0.0.2)
ERROR: No matching distribution found for pyiceberg (unavailable)

@Fokko (Contributor) commented Aug 11, 2023

Hey @atifiu, thanks for jumping in here. PyIceberg supports Python 3.8 onwards, and please keep in mind that this feature isn't released yet (it will be part of PyIceberg 0.5.0). Of course, you can install it locally from the repository.
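Since the opaque poetry/tomlkit traceback above ultimately comes down to running an unsupported interpreter, a small guard like the following makes the failure explicit before any install is attempted. This is a generic sketch, not part of PyIceberg:

```python
import sys

# PyIceberg requires Python 3.8 or newer; fail fast with a clear message
# instead of an opaque build error from an old interpreter.
MIN_VERSION = (3, 8)

def is_supported(version_info=sys.version_info) -> bool:
    # Compare only (major, minor); micro versions don't matter here.
    return tuple(version_info[:2]) >= MIN_VERSION

if not is_supported():
    sys.exit(f"Python {sys.version.split()[0]} is too old; need {MIN_VERSION[0]}.{MIN_VERSION[1]}+")
```

On Python 3.6 this exits with an explanatory message rather than surfacing a `KeyAlreadyPresent` error from a vendored TOML parser during the build.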

@Fokko Fokko changed the title Python: Add pyarrow hdfs support Python: PyArrow HDFS support Aug 15, 2023

Successfully merging this pull request may close these issues: Python: Add HDFS support for PyArrow
3 participants