-
There's already similar work done in Daft that creates Daft objects but uses the same logic under the hood via the arrow2 crate. I think we can save ourselves a lot of work by reusing some of their already-implemented logic with the arrow crate instead, and crediting them for their implementation. Credit to @jaychia for sharing this information.
-
For those who are interested, we typically discuss top-level project design like this on the dev list; here is the thread for this discussion: https://lists.apache.org/thread/33c0nkc3k6646lvro1lv22pvhwlp50ss
-
This is something I've been mulling over for a while, and I thought this would be the right forum to discuss the topic as a follow-up to a similar one: #513
As soon as we released 0.7.0, which supports writes into tables with TimeTransform partitions, our prospective users started asking for support for Bucket Transform partitions.
Iceberg has a rather custom set of logic for bucket partitions. I took a look into the Java code, and the core of it boils down to hashing the value with Murmur3 and mapping the non-negative hash into N buckets, roughly like the sketch below.
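Here is a minimal sketch of that logic in Rust, following the Iceberg spec's definition `bucket(N, v) = (murmur3_x86_32_hash(v) & Integer.MAX_VALUE) % N`, assuming the `murmur3` crate (its `murmur3_32` is the x86 32-bit variant); `bucket_i64` is a name made up for illustration:

```rust
use std::io::Cursor;

// Sketch of Iceberg's bucket transform for a single i64 value.
// The spec hashes integer types as 8-byte little-endian buffers with seed 0.
fn bucket_i64(value: i64, num_buckets: i32) -> i32 {
    let hash = murmur3::murmur3_32(&mut Cursor::new(value.to_le_bytes()), 0)
        .expect("hashing an in-memory buffer cannot fail") as i32;
    // Mask the sign bit so the hash is non-negative, then map into N buckets.
    (hash & i32::MAX) % num_buckets
}

fn main() {
    // e.g. assign the value 34 to one of 16 buckets
    println!("bucket = {}", bucket_i64(34, 16));
}
```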
Unfortunately, there is no existing PyArrow compute function that does this, so I'd like to propose that we write a function in Rust that takes an Arrow array and the bucket count as input, and returns an Arrow array with the bucket of each input value. When iceberg-rust becomes more mature, I believe the same function can be reused for transforms within this repository; in the interim, we could support writes into Bucket-partitioned tables in PyIceberg by exposing this function as a Python binding that we import into PyIceberg.
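To make the shape of the proposal concrete, here is a rough sketch of what the array-level function could look like, assuming the `arrow` and `murmur3` crates; `bucket_transform` is a hypothetical name, not an existing API:

```rust
use std::io::Cursor;

use arrow::array::{Int32Array, Int64Array};

// Hypothetical array-at-a-time version of the single-value logic above:
// maps each element of an Int64 Arrow array to its Iceberg bucket,
// preserving nulls.
fn bucket_transform(input: &Int64Array, num_buckets: i32) -> Int32Array {
    input
        .iter()
        .map(|maybe_value| {
            maybe_value.map(|v| {
                let hash = murmur3::murmur3_32(&mut Cursor::new(v.to_le_bytes()), 0)
                    .expect("hashing an in-memory buffer cannot fail")
                    as i32;
                (hash & i32::MAX) % num_buckets
            })
        })
        .collect()
}
```

A Python binding around something like this (plus per-type variants, since the spec defines a distinct byte representation to hash for each primitive type) is what would then be imported into PyIceberg.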
I'd love to hear how folks feel about this idea!