Python: PyArrow HDFS support #7997
Thanks for picking this up @LuigiCerone. Looking good. Could you also update the documentation?
I tested this locally in a Docker environment; the metadata JSON file is one generated by the docker-spark-iceberg quickstart. The HDFS setup was created with this repo:
root@579a35817003:/# hdfs dfs -ls /user/luigi
Found 1 items
-rw-r--r-- 3 root supergroup 5153 2023-07-25 09:58 /user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json
In another container (after setting up the environment variables according to the PyArrow docs):
root@341c4b757a72:/opt/spark/work-dir# pip install "git+https://github.com/LuigiCerone/iceberg.git@feat/hdfs_support#subdirectory=python&egg=pyiceberg[pyarrow]"
root@341c4b757a72:/opt/spark/work-dir# export HADOOP_HOME=/usr/local/hadoop
root@341c4b757a72:/opt/spark/work-dir# export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
root@341c4b757a72:/opt/spark/work-dir# python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyiceberg.table import StaticTable
>>> table = StaticTable.from_metadata("hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json", properties={"hdfs.host": "namenode", "hdfs.port": 9000, "hdfs.user": "luigi"})
>>> table.metadata
TableMetadataV1(location='s3://warehouse/nyc/taxis', table_uuid=UUID('a4f97eac-3793-4f17-99ad-7079b6ce408a'), last_updated_ms=1690277912736, last_column_id=19, schemas=[Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), required=False), schema_id=0, identifier_field_ids=[])], current_schema_id=0, partition_specs=[PartitionSpec(PartitionField(source_id=2, field_id=1000, 
transform=DayTransform(), name='tpep_pickup_datetime_day'), spec_id=0)], default_spec_id=0, last_partition_id=1000, properties={'owner': 'root'}, current_snapshot_id=None, snapshots=[], snapshot_log=[], metadata_log=[], sort_orders=[SortOrder(order_id=0)], default_sort_order_id=0, refs={}, format_version=1, schema_=Schema(NestedField(field_id=1, name='VendorID', field_type=LongType(), required=False), NestedField(field_id=2, name='tpep_pickup_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=3, name='tpep_dropoff_datetime', field_type=TimestamptzType(), required=False), NestedField(field_id=4, name='passenger_count', field_type=DoubleType(), required=False), NestedField(field_id=5, name='trip_distance', field_type=DoubleType(), required=False), NestedField(field_id=6, name='RatecodeID', field_type=DoubleType(), required=False), NestedField(field_id=7, name='store_and_fwd_flag', field_type=StringType(), required=False), NestedField(field_id=8, name='PULocationID', field_type=LongType(), required=False), NestedField(field_id=9, name='DOLocationID', field_type=LongType(), required=False), NestedField(field_id=10, name='payment_type', field_type=LongType(), required=False), NestedField(field_id=11, name='fare_amount', field_type=DoubleType(), required=False), NestedField(field_id=12, name='extra', field_type=DoubleType(), required=False), NestedField(field_id=13, name='mta_tax', field_type=DoubleType(), required=False), NestedField(field_id=14, name='tip_amount', field_type=DoubleType(), required=False), NestedField(field_id=15, name='tolls_amount', field_type=DoubleType(), required=False), NestedField(field_id=16, name='improvement_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=17, name='total_amount', field_type=DoubleType(), required=False), NestedField(field_id=18, name='congestion_surcharge', field_type=DoubleType(), required=False), NestedField(field_id=19, name='airport_fee', field_type=DoubleType(), 
required=False), schema_id=0, identifier_field_ids=[]), partition_spec=[{'name': 'tpep_pickup_datetime_day', 'transform': 'day', 'source-id': 2, 'field-id': 1000}])
>>> table.metadata_location
'hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json'
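For anyone who prefers a script over the interactive session, the same steps can be sketched roughly as below. This is only a sketch under assumptions: `build_hdfs_properties` and `load_static_table` are hypothetical helper names introduced for illustration, the property keys (`hdfs.host`, `hdfs.port`, `hdfs.user`) are the ones used in the session above, and `HADOOP_HOME` and `CLASSPATH` are assumed to be exported before Python starts.

```python
from typing import Dict, Union


def build_hdfs_properties(host: str, port: int, user: str) -> Dict[str, Union[str, int]]:
    """Collect the HDFS connection settings under the property keys used in the session above."""
    return {"hdfs.host": host, "hdfs.port": port, "hdfs.user": user}


def load_static_table(metadata_location: str, properties: dict):
    """Load a read-only table directly from a metadata JSON file.

    The import is deferred so that this module can be imported even where
    pyiceberg (with the pyarrow extra) is not installed.
    """
    from pyiceberg.table import StaticTable

    return StaticTable.from_metadata(metadata_location, properties=properties)


# Example usage (needs a reachable namenode plus HADOOP_HOME and CLASSPATH set):
# props = build_hdfs_properties("namenode", 9000, "luigi")
# table = load_static_table(
#     "hdfs:///user/luigi/00000-3ee52bcd-ba94-4a7e-a0c0-60ce14d5397b.metadata.json",
#     props,
# )
# print(table.metadata_location)
```

Keeping the properties in a small helper makes it easy to point the same script at a different namenode or user without touching the load call.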
Thanks for the PR and testing using HDFS, appreciate it @LuigiCerone. Let's get this in!
@LuigiCerone I am trying to use PyIceberg with HDFS, and @Fokko mentioned that you have got this working. I tried to install it using the command below, but it throws an error. My Python version is 3.6; is that not supported? Could you please suggest how to fix this?
Hey @atifiu, thanks for jumping in here. PyIceberg supports Python 3.8 onwards, and please keep in mind that this feature isn't released yet (it will be part of PyIceberg 0.5.0). Of course, you can install it locally from the repository.
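A quick way to check the interpreter before attempting the install is sketched below. The `meets_minimum` helper is a hypothetical name used only for illustration; the 3.8 floor is the one stated in the comment above.

```python
import sys


def meets_minimum(version_info, minimum=(3, 8)):
    """Return True when the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum(sys.version_info):
    # Only the major and minor components are compared.
    print(
        "Python %d.%d is older than 3.8; upgrade the interpreter before installing PyIceberg."
        % (sys.version_info[0], sys.version_info[1])
    )
```

On Python 3.6 this prints the warning; on 3.8 or newer it prints nothing.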
Closes #7974.