feat: support `arrow_struct_to_iceberg_struct` #731

ZENOTME · 2024-11-28T11:43:55Z

This PR introduces the function to convert arrow struct to iceberg struct. This function is needed when we add fanout partition writer: In this writer, we need to compute the partition value using record batch and convert them into struct value finally and set into data file.

ZENOTME · 2024-11-28T12:47:02Z

cc @liurenjie1024 @Fokko @Xuanwo

liurenjie1024 · 2024-12-04T02:45:37Z

Hi, @ZENOTME Thanks for this pr! I'm thinking that instead of array transformation, should we consider transforming arrow record batch to/from array of iceberg datum? It maybe also worthy to have a visitor pattern.

ZENOTME · 2024-12-04T07:44:32Z

Hi, @ZENOTME Thanks for this pr! I'm thinking that instead of array transformation, should we consider transforming arrow record batch to/from array of iceberg datum?

For now, this function is mainly used in partition writer. And we store partition value as Struct so we need to transform it to literal now.

It maybe also worthy to have a visitor pattern.

Good point. The benefit of visitor patterns is to make it more convenient to convert between different types, e.g. datum or literal. I try to use visitor patterns for this PR so that we can add more type conversions in the future. But I'm not sure whether it's a good design. Feel free to let me know if there are some other API designs.

liurenjie1024

Thanks @ZENOTME for this pr, I have some suggestions about it.

liurenjie1024 · 2024-12-06T03:25:53Z

crates/iceberg/src/arrow/value.rs

+        &self,
+        array: &NullArray,
+        arrow_type: &DataType,
+        iceberg_type: &PrimitiveType,


I'm hesitating to add an iceberg_type here.

I also don't think we need an arrow_type, arrow array has data type associated with it: https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.data_type

liurenjie1024 · 2024-12-06T03:29:41Z

crates/iceberg/src/arrow/value.rs

+use crate::spec::{Literal, PrimitiveType, Struct, StructType, Type};
+use crate::{Error, ErrorKind, Result};
+
+trait ArrowArrayVistor {


We need have clear documentation about what kind of visitor it is. Is it an post order visitor?

In Iceberg, everything is post-order, except assigning field-IDs. I think in this case post-order makes sense, since you would first convert the fields of the struct, and then move up to nested-structs, lists, maps, structs, all the way up to the schema struct.

liurenjie1024 · 2024-12-06T03:37:34Z

crates/iceberg/src/arrow/value.rs

+        array: &StructArray,
+        columns: Vec<Vec<Self::T>>,
+    ) -> Result<Vec<Self::T>>;
+    fn r#struct(


The implementation of this method is somehow weird to me, since it's actually implementing the visiting scheduler logic, a typical post order visitor definition should look like this:

fn r#sturct(&self, array: &StructArray, children: Vec<Vec<Self::>>) -> Result<Vec<Self::T>>;

And we should move this logic into a standalone function like this:

fn visit(array: &Array, visitor: V) -> Result<V::T> { ... }

Fokko · 2024-12-06T08:38:10Z

crates/iceberg/src/arrow/value.rs

+
+        let mut columns = Vec::with_capacity(array.columns().len());
+
+        for ((array, arrow_type), iceberg_field) in array


Looping back @liurenjie1024 question here:

We need have clear documentation about what kind of visitor it is.

First of all, thanks for converting this into a visitor, I think that's the right approach here. One question regarding the lookup. It looks like we're expecting the Arrow data to be in the same position of the Iceberg schema. Instead, what would be more Iceberg-esque, is performing the lookup by field-ID.

For example:

schema = Schema( NestedField(1, "city", StringType(), required=False), NestedField(2, "lat", DoubleType(), required=False), NestedField(3, "long", DoubleType(), required=False), ) table { 1: city: optional string 2: lat: optional double 3: long: optional double } table = catalog.create_table("default.locations", schema) with table.update_schema() as update: update.move_after("lat", "long") table { 1: city: optional string 3: lat: optional double 2: long: optional double }

Now the fields are switched. Data that's already written needs to be projected, and new data can be read as is. Both of the files will be read by this visitor without raising an error, but for the old files, the data the fields will be switched.

This way, we could also do smart things, like fill in nullable fields:

let struct_array = StructArray::new_null(); let iceberg_struct_type = StructType::new(vec![Arc::new(NestedField::optional( 0, "bool_field", Type::Primitive(PrimitiveType::Boolean), ))]); let result = arrow_struct_to_literal(&struct_array, iceberg_struct_type).unwrap(); assert_eq!(result, vec![None; 3]);

And in a step after that, #737. You can find the Python equivalent here.

Use field-ID require user store the file id metadata for date type in struct type. I'm not sure whether is this always set up. How about providing two visit function? One would assume the schema of arrow will be same as iceberg schema, it will return error if convert fail. And the other one will look up the column using the field id.

ZENOTME force-pushed the arrow_value_convert branch 2 times, most recently from 1ffc802 to 0830aae Compare November 28, 2024 12:00

ZENOTME added 2 commits December 4, 2024 15:40

add arrow_struct_to_iceberg_struct

52912c0

use visitor pattern

cc09bff

ZENOTME force-pushed the arrow_value_convert branch from 0830aae to cc09bff Compare December 4, 2024 07:40

liurenjie1024 reviewed Dec 6, 2024

View reviewed changes

Fokko changed the title ~~feat: support arrow_struct_to_iceberg_struct~~ feat: support arrow_struct_to_iceberg_struct Dec 6, 2024

Fokko reviewed Dec 6, 2024

View reviewed changes

ZENOTME added 4 commits December 11, 2024 12:21

refine visitor design

ecce70a

add todo comment

85050d9

fix typos

8c46815

use field nullable instead of array nullable

1d19c78

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support `arrow_struct_to_iceberg_struct` #731

feat: support `arrow_struct_to_iceberg_struct` #731

ZENOTME commented Nov 28, 2024

ZENOTME commented Nov 28, 2024

liurenjie1024 commented Dec 4, 2024

ZENOTME commented Dec 4, 2024

liurenjie1024 left a comment

liurenjie1024 Dec 6, 2024

liurenjie1024 Dec 6, 2024

liurenjie1024 Dec 6, 2024

Fokko Dec 6, 2024 •

edited

Loading

liurenjie1024 Dec 6, 2024

Fokko Dec 6, 2024

ZENOTME Dec 10, 2024


		let mut columns = Vec::with_capacity(array.columns().len());

		for ((array, arrow_type), iceberg_field) in array

feat: support arrow_struct_to_iceberg_struct #731

Are you sure you want to change the base?

feat: support arrow_struct_to_iceberg_struct #731

Conversation

ZENOTME commented Nov 28, 2024

ZENOTME commented Nov 28, 2024

liurenjie1024 commented Dec 4, 2024

ZENOTME commented Dec 4, 2024

liurenjie1024 left a comment

Choose a reason for hiding this comment

liurenjie1024 Dec 6, 2024

Choose a reason for hiding this comment

liurenjie1024 Dec 6, 2024

Choose a reason for hiding this comment

liurenjie1024 Dec 6, 2024

Choose a reason for hiding this comment

Fokko Dec 6, 2024 • edited Loading

Choose a reason for hiding this comment

liurenjie1024 Dec 6, 2024

Choose a reason for hiding this comment

Fokko Dec 6, 2024

Choose a reason for hiding this comment

ZENOTME Dec 10, 2024

Choose a reason for hiding this comment

feat: support `arrow_struct_to_iceberg_struct` #731

feat: support `arrow_struct_to_iceberg_struct` #731

Fokko Dec 6, 2024 •

edited

Loading