-
There's already similar work done in Daft that creates Daft objects but uses the same logic under the hood via the arrow2 crate. I think we can save ourselves a lot of work by reusing some of their already-implemented logic with the arrow crate instead, and crediting them for their implementation. Credit to @jaychia for sharing this information.
-
For those who are interested, we typically discuss top-level project design like this on the dev list; here is the thread for this discussion: https://lists.apache.org/thread/33c0nkc3k6646lvro1lv22pvhwlp50ss
-
This is something I've been mulling over for a while, and I thought this would be the right forum to discuss the topic as a follow-up to a similar one: #513
As soon as we released 0.7.0, which supports writes into tables with TimeTransform partitions, our prospective users started asking for support for Bucket Transform partitions.
Iceberg has a rather custom set of logic for bucket partitions. I took a look into the Java code, and the core of it boils down to hashing the value with Murmur3 and mapping the non-negative hash into N buckets, roughly like the sketch below.
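Here is a minimal sketch of that logic in Rust, following the Iceberg spec's definition `bucket(N, v) = (murmur3_x86_32_hash(v) & Integer.MAX_VALUE) % N`, assuming the `murmur3` crate (its `murmur3_32` is the x86 32-bit variant); `bucket_i64` is a name made up for illustration:

```rust
use std::io::Cursor;

// Sketch of Iceberg's bucket transform for a single i64 value.
// The spec hashes integer types as 8-byte little-endian buffers with seed 0.
fn bucket_i64(value: i64, num_buckets: i32) -> i32 {
    let hash = murmur3::murmur3_32(&mut Cursor::new(value.to_le_bytes()), 0)
        .expect("hashing an in-memory buffer cannot fail") as i32;
    // Mask the sign bit so the hash is non-negative, then map into N buckets.
    (hash & i32::MAX) % num_buckets
}

fn main() {
    // e.g. assign the value 34 to one of 16 buckets
    println!("bucket = {}", bucket_i64(34, 16));
}
```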
Unfortunately, there is no existing PyArrow compute function that does this, so I'd like to propose that we write a function in Rust that takes an Arrow array and the bucket count as input, and returns an Arrow array with the bucket of each input value. When iceberg-rust becomes more mature, I believe the same function can be reused for transforms within this repository; in the interim, we could support writes into Bucket-partitioned tables in PyIceberg by exposing this function as a Python binding that we import into PyIceberg.
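To make the shape of the proposal concrete, here is a rough sketch of what the array-level function could look like, assuming the `arrow` and `murmur3` crates; `bucket_transform` is a hypothetical name, not an existing API:

```rust
use std::io::Cursor;

use arrow::array::{Int32Array, Int64Array};

// Hypothetical array-at-a-time version of the single-value logic above:
// maps each element of an Int64 Arrow array to its Iceberg bucket,
// preserving nulls.
fn bucket_transform(input: &Int64Array, num_buckets: i32) -> Int32Array {
    input
        .iter()
        .map(|maybe_value| {
            maybe_value.map(|v| {
                let hash = murmur3::murmur3_32(&mut Cursor::new(v.to_le_bytes()), 0)
                    .expect("hashing an in-memory buffer cannot fail")
                    as i32;
                (hash & i32::MAX) % num_buckets
            })
        })
        .collect()
}
```

A Python binding around something like this (plus per-type variants, since the spec defines a distinct byte representation to hash for each primitive type) is what would then be imported into PyIceberg.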
I'd love to hear how folks feel about this idea!