feat: Introduce conversion between iceberg schema and avro schema #40

liurenjie1024 · 2023-08-25T03:29:49Z

Introduce conversion between avro schema and iceberg schema.

Currently limitation, due to apache_avro's limit, we can' store attributes in array schema and map schema, this means we will lose element-id for list type, and key-id, value-id for map type. We will fix it later.

Close #37.

liurenjie1024 · 2023-08-25T14:12:42Z

cc @Xuanwo @JanKaul @nastra @Fokko PTAL

ZENOTME

Thanks! LGTM

crates/iceberg/src/lib.rs

crates/iceberg/src/avro/schema.rs

JanKaul

LGTM, thank you. I have one non-technical comment regarding idiomatic rust programming.

JanKaul · 2023-08-29T09:21:37Z

crates/iceberg/src/avro/schema.rs

+pub trait AvroSchemaVisitor {
+    type T;
+
+    fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>;
+
+    fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>;
+
+    fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>;
+    fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>;
+
+    fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>;
+}
+
+/// Visit avro schema in post order visitor.
+pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> {


My personal opinion: I find a recursive function with pattern matching more idiomatic rust than using the visitor pattern for this. I feel like this comes from Java where you don't have enums.

Or is there anything here that couldn't be achieved with pattern matching?

Yeah, this is inspired by java's implementation which visits avro schema in post order. I'm not quite sure about what you mean? The arguments of different pattern is different. Could you write an example to make it more clear?

I wonder what the benefit of the visitor pattern is compared to using pattern matching as in the following code:

/// Converts avro schema to iceberg schema. pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> { if let AvroSchema::Record(_) = avro_schema { let mut next_field_id = 0; let typ = avro_type_to_type(avro_schema, &mut next_field_id)?; if let Some(Type::Struct(s)) = typ { Schema::builder() .with_fields(s.fields().iter().cloned()) .build() } else { Err(Error::new( ErrorKind::Unexpected, format!("Expected to convert avro record schema to struct type, but {typ}"), )) } } else { Err(Error::new( ErrorKind::DataInvalid, "Can't convert non record avro schema to iceberg schema: {avro_schema}", )) } } fn avro_type_to_type(avro: &AvroSchema, next_field_id: &mut i32) -> Result<Option<Type>> { let typ = match avro { AvroSchema::Decimal(decimal) => { Type::decimal(decimal.precision as u32, decimal.scale as u32)? } AvroSchema::Date => Type::Primitive(PrimitiveType::Date), AvroSchema::TimeMicros => Type::Primitive(PrimitiveType::Time), AvroSchema::TimestampMicros => Type::Primitive(PrimitiveType::Timestamp), AvroSchema::Uuid => Type::Primitive(PrimitiveType::Uuid), AvroSchema::Boolean => Type::Primitive(PrimitiveType::Boolean), AvroSchema::Int => Type::Primitive(PrimitiveType::Int), AvroSchema::Long => Type::Primitive(PrimitiveType::Long), AvroSchema::Float => Type::Primitive(PrimitiveType::Float), AvroSchema::Double => Type::Primitive(PrimitiveType::Double), AvroSchema::String | AvroSchema::Enum(_) => Type::Primitive(PrimitiveType::String), AvroSchema::Fixed(fixed) => Type::Primitive(PrimitiveType::Fixed(fixed.size as u64)), AvroSchema::Bytes => Type::Primitive(PrimitiveType::Binary), AvroSchema::Null => return Ok(None), AvroSchema::Record(record) => { let field_types = record .fields .iter() .map(|x| avro_type_to_type(&x.schema, next_field_id)) .collect::<Result<Vec<Option<Type>>>>()?; let mut fields = Vec::with_capacity(field_types.len()); for (avro_field, typ) in record.fields.iter().zip_eq(field_types) { let field_id = avro_field .custom_attributes .get(FILED_ID_PROP) .and_then(Value::as_i64) .ok_or_else(|| { Error::new( ErrorKind::DataInvalid, format!("Can't convert field, missing field id: {avro_field:?}"), ) })?; let optional = is_avro_optional(&avro_field.schema); let mut field = if optional { NestedField::optional(field_id as i32, &avro_field.name, typ.unwrap()) } else { NestedField::required(field_id as i32, &avro_field.name, typ.unwrap()) }; if let Some(doc) = &avro_field.doc { field = field.with_doc(doc); } fields.push(field.into()); } Ok(Some(Type::Struct(StructType::new(fields)))) } AvroSchema::Array(array) => { let item = avro_type_to_type(array, next_field_id)?; if let AvroSchema::Array(item_schema) = array.as_ref() { *next_field_id += 1; let element_field = NestedField::list_element( *next_field_id, item.unwrap(), !is_avro_optional(item_schema), ) .into(); Ok(Some(Type::List(ListType { element_field }))) } else { Err(Error::new( ErrorKind::Unexpected, "Expected avro array schema, but {array}", )) } } }; Ok(Some(typ)) }

For me this is much clearer what happens.

Oh, I see. I use visitor since it's a design pattern which decouples business logic from visiting logic, and both java and Python implementation uses this pattern. But it maybe a little too heavy for this simple case. CC @Fokko @Xuanwo Any comments?

IMO the benefit of the vistor is to reuse it for coverting or collect info in schema (They represent different business logic). If it only used in avro schema to schema, maybe recursive function is enough.

The problem is whether we will need to process avro schema in other places. If not, I think using recursive + pattern matching is good enough.

We can also use the visitor pattern here. I'm just generally trying to be cautious when porting OO code to rust not to introduce complexity that is not necessary in Rust.

I feel like this comes from Java where you don't have enums.

You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html

I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:

Visiting the structure in the right order for all the types (Jan's example is missing the Map)

Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

Let's go with the visitor pattern. I just want to make a couple of points for completeness sake.

From wikipedia:

"Programming languages with sum types and pattern matching obviate many of the benefits of the visitor pattern, as the visitor class is able to both easily branch on the type of the object and generate a compiler error if a new object type is defined which the visitor does not yet handle."

The benefits of the visitor pattern don't apply to our case:

Visitor allows same algorithm over different types

We are always visiting the same type, even with a read schema and a write schema.

Visitor allows dynamic double dispatch

Without any default implementations on the visitor trait, all trait methods have to be implemented for a new type which is equal to pattern matching. There would be a benefit if you have many methods with default implementations because then you could save implementation effort for new types.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

This is also my concern. I prefer to keep visitor pattern to see if we have more features to implement on avro schema.

Fokko · 2023-08-30T12:00:48Z

crates/iceberg/src/avro/schema.rs

+    Ok(AvroSchema::Decimal(DecimalSchema {
+        precision,
+        scale,
+        inner: Box::new(avro_fixed_schema(DECIMAL_LENGTH)?),


The fixed length is based on the precision of the decimal:

Based on the precision the decimal needs a set number of bytes to store the information: https://github.com/apache/iceberg/blob/master/python/pyiceberg/utils/decimal.py#L111-L127

➜ Desktop python3 Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import math >>> MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24)) >>> REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40)) >>> MAX_PRECISION (-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55) >>> MAX_PRECISION (-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55)

Fokko · 2023-08-30T12:05:31Z

crates/iceberg/src/spec/datatypes.rs

+    /// Creates  decimal type.
+    #[inline(always)]
+    pub fn decimal(precision: u32, scale: u32) -> Result<Self> {
+        ensure_data_valid!(precision <= MAX_DECIMAL_PRECISION, "Decimals with precision larger than {MAX_DECIMAL_PRECISION} are not supported: {precision}",);


Should we also check > 0?

Fokko · 2023-08-30T12:19:42Z

crates/iceberg/src/avro/schema.rs

+pub trait AvroSchemaVisitor {
+    type T;
+
+    fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>;
+
+    fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>;
+
+    fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>;
+    fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>;
+
+    fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>;
+}
+
+/// Visit avro schema in post order visitor.
+pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> {


I feel like this comes from Java where you don't have enums.

You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html

I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:

Visiting the structure in the right order for all the types (Jan's example is missing the Map)

Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

Fokko

Deriving the correct byte length for reading the decimal should be fixed, apart from that this looks great @liurenjie1024 👍🏻

Fokko

Thanks @liurenjie1024 for working on this 👍🏻

liurenjie1024 requested review from JanKaul, Xuanwo, Fokko and nastra August 25, 2023 03:29

ZENOTME approved these changes Aug 28, 2023

View reviewed changes

Xuanwo reviewed Aug 28, 2023

View reviewed changes

crates/iceberg/src/lib.rs Outdated Show resolved Hide resolved

crates/iceberg/src/lib.rs Show resolved Hide resolved

crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved

crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved

liurenjie1024 force-pushed the renjie/issue-37 branch from 3430bf8 to a47c96e Compare August 29, 2023 09:36

JanKaul reviewed Aug 29, 2023

View reviewed changes

Fokko reviewed Aug 30, 2023

View reviewed changes

liurenjie1024 added 5 commits August 31, 2023 10:03

feat: Introduce conversion between iceberg schema and avro schema

a5ec801

Fix typo

1607112

Fix comments

26fcfe5

Fix clippy

3af3119

Calculate decimal bytes

4b41231

liurenjie1024 force-pushed the renjie/issue-37 branch from a47c96e to 4b41231 Compare August 31, 2023 03:47

Fokko approved these changes Aug 31, 2023

View reviewed changes

Fokko merged commit 10cc7c9 into apache:main Aug 31, 2023

liurenjie1024 deleted the renjie/issue-37 branch August 31, 2023 11:38

xxchan mentioned this pull request Jul 21, 2024

fix: use avro fixed to represent decimal #472

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Introduce conversion between iceberg schema and avro schema #40

feat: Introduce conversion between iceberg schema and avro schema #40

liurenjie1024 commented Aug 25, 2023

liurenjie1024 commented Aug 25, 2023

ZENOTME left a comment

JanKaul left a comment

JanKaul Aug 29, 2023

liurenjie1024 Aug 29, 2023

JanKaul Aug 29, 2023 •

edited

Loading

liurenjie1024 Aug 29, 2023

ZENOTME Aug 29, 2023

liurenjie1024 Aug 30, 2023

JanKaul Aug 30, 2023

Fokko Aug 30, 2023

JanKaul Aug 30, 2023 •

edited

Loading

liurenjie1024 Aug 31, 2023

Fokko Aug 30, 2023

liurenjie1024 Aug 31, 2023

Fokko Aug 30, 2023

liurenjie1024 Aug 31, 2023

Fokko Aug 30, 2023

Fokko left a comment

Fokko left a comment

feat: Introduce conversion between iceberg schema and avro schema #40

feat: Introduce conversion between iceberg schema and avro schema #40

Conversation

liurenjie1024 commented Aug 25, 2023

liurenjie1024 commented Aug 25, 2023

ZENOTME left a comment

Choose a reason for hiding this comment

JanKaul left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JanKaul Aug 29, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JanKaul Aug 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

Fokko left a comment

Choose a reason for hiding this comment

JanKaul Aug 29, 2023 •

edited

Loading

JanKaul Aug 30, 2023 •

edited

Loading