Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Introduce conversion between iceberg schema and avro schema #40

Merged
merged 5 commits into from
Aug 31, 2023

Conversation

liurenjie1024
Copy link
Contributor

Introduce conversion between avro schema and iceberg schema.

Currently limitation, due to apache_avro's limit, we can' store attributes in array schema and map schema, this means we will lose element-id for list type, and key-id, value-id for map type. We will fix it later.

Close #37.

@liurenjie1024
Copy link
Contributor Author

cc @Xuanwo @JanKaul @nastra @Fokko PTAL

Copy link
Contributor

@ZENOTME ZENOTME left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

crates/iceberg/src/lib.rs Outdated Show resolved Hide resolved
crates/iceberg/src/lib.rs Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@JanKaul JanKaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you. I have one non-technical comment regarding idiomatic rust programming.

Comment on lines 265 to 279
pub trait AvroSchemaVisitor {
type T;

fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>;

fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>;

fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>;
fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>;

fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>;
}

/// Visit avro schema in post order visitor.
pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal opinion: I find a recursive function with pattern matching more idiomatic rust than using the visitor pattern for this. I feel like this comes from Java where you don't have enums.

Or is there anything here that couldn't be achieved with pattern matching?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is inspired by java's implementation which visits avro schema in post order. I'm not quite sure about what you mean? The arguments of different pattern is different. Could you write an example to make it more clear?

Copy link
Collaborator

@JanKaul JanKaul Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the benefit of the visitor pattern is compared to using pattern matching as in the following code:

/// Converts avro schema to iceberg schema.
pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {
    if let AvroSchema::Record(_) = avro_schema {
        let mut next_field_id = 0;
        let typ = avro_type_to_type(avro_schema, &mut next_field_id)?;
        if let Some(Type::Struct(s)) = typ {
            Schema::builder()
                .with_fields(s.fields().iter().cloned())
                .build()
        } else {
            Err(Error::new(
                ErrorKind::Unexpected,
                format!("Expected to convert avro record schema to struct type, but {typ}"),
            ))
        }
    } else {
        Err(Error::new(
            ErrorKind::DataInvalid,
            "Can't convert non record avro schema to iceberg schema: {avro_schema}",
        ))
    }
}

fn avro_type_to_type(avro: &AvroSchema, next_field_id: &mut i32) -> Result<Option<Type>> {
    let typ = match avro {
        AvroSchema::Decimal(decimal) => {
            Type::decimal(decimal.precision as u32, decimal.scale as u32)?
        }
        AvroSchema::Date => Type::Primitive(PrimitiveType::Date),
        AvroSchema::TimeMicros => Type::Primitive(PrimitiveType::Time),
        AvroSchema::TimestampMicros => Type::Primitive(PrimitiveType::Timestamp),
        AvroSchema::Uuid => Type::Primitive(PrimitiveType::Uuid),
        AvroSchema::Boolean => Type::Primitive(PrimitiveType::Boolean),
        AvroSchema::Int => Type::Primitive(PrimitiveType::Int),
        AvroSchema::Long => Type::Primitive(PrimitiveType::Long),
        AvroSchema::Float => Type::Primitive(PrimitiveType::Float),
        AvroSchema::Double => Type::Primitive(PrimitiveType::Double),
        AvroSchema::String | AvroSchema::Enum(_) => Type::Primitive(PrimitiveType::String),
        AvroSchema::Fixed(fixed) => Type::Primitive(PrimitiveType::Fixed(fixed.size as u64)),
        AvroSchema::Bytes => Type::Primitive(PrimitiveType::Binary),
        AvroSchema::Null => return Ok(None),
        AvroSchema::Record(record) => {
            let field_types = record
                .fields
                .iter()
                .map(|x| avro_type_to_type(&x.schema, next_field_id))
                .collect::<Result<Vec<Option<Type>>>>()?;
            let mut fields = Vec::with_capacity(field_types.len());
            for (avro_field, typ) in record.fields.iter().zip_eq(field_types) {
                let field_id = avro_field
                    .custom_attributes
                    .get(FILED_ID_PROP)
                    .and_then(Value::as_i64)
                    .ok_or_else(|| {
                        Error::new(
                            ErrorKind::DataInvalid,
                            format!("Can't convert field, missing field id: {avro_field:?}"),
                        )
                    })?;

                let optional = is_avro_optional(&avro_field.schema);

                let mut field = if optional {
                    NestedField::optional(field_id as i32, &avro_field.name, typ.unwrap())
                } else {
                    NestedField::required(field_id as i32, &avro_field.name, typ.unwrap())
                };

                if let Some(doc) = &avro_field.doc {
                    field = field.with_doc(doc);
                }

                fields.push(field.into());
            }

            Ok(Some(Type::Struct(StructType::new(fields))))
        }
        AvroSchema::Array(array) => {
            let item = avro_type_to_type(array, next_field_id)?;
            if let AvroSchema::Array(item_schema) = array.as_ref() {
                *next_field_id += 1;
                let element_field = NestedField::list_element(
                    *next_field_id,
                    item.unwrap(),
                    !is_avro_optional(item_schema),
                )
                .into();
                Ok(Some(Type::List(ListType { element_field })))
            } else {
                Err(Error::new(
                    ErrorKind::Unexpected,
                    "Expected avro array schema, but {array}",
                ))
            }
        }
    };
    Ok(Some(typ))
}

For me this is much clearer what happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I use visitor since it's a design pattern which decouples business logic from visiting logic, and both java and Python implementation uses this pattern. But it maybe a little too heavy for this simple case. CC @Fokko @Xuanwo Any comments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the benefit of the vistor is to reuse it for coverting or collect info in schema (They represent different business logic). If it only used in avro schema to schema, maybe recursive function is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is whether we will need to process avro schema in other places. If not, I think using recursive + pattern matching is good enough.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also use the visitor pattern here. I'm just generally trying to be cautious when porting OO code to rust not to introduce complexity that is not necessary in Rust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this comes from Java where you don't have enums.

You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html

I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:

  • Visiting the structure in the right order for all the types (Jan's example is missing the Map)
  • Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

Copy link
Collaborator

@JanKaul JanKaul Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with the visitor pattern. I just want to make a couple of points for completeness sake.

From wikipedia:

"Programming languages with sum types and pattern matching obviate many of the benefits of the visitor pattern, as the visitor class is able to both easily branch on the type of the object and generate a compiler error if a new object type is defined which the visitor does not yet handle."

The benefits of the visitor pattern don't apply to our case:

  1. Visitor allows same algorithm over different types

We are always visiting the same type, even with a read schema and a write schema.

  1. Visitor allows dynamic double dispatch

Without any default implementations on the visitor trait, all trait methods have to be implemented for a new type which is equal to pattern matching. There would be a benefit if you have many methods with default implementations because then you could save implementation effort for new types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

This is also my concern. I prefer to keep visitor pattern to see if we have more features to implement on avro schema.

Ok(AvroSchema::Decimal(DecimalSchema {
precision,
scale,
inner: Box::new(avro_fixed_schema(DECIMAL_LENGTH)?),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixed length is based on the precision of the decimal:
image

Based on the precision the decimal needs a set number of bytes to store the information: https://github.com/apache/iceberg/blob/master/python/pyiceberg/utils/decimal.py#L111-L127

Desktop python3
Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24))
>>> REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40))
>>> MAX_PRECISION
(-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55)
>>> MAX_PRECISION
(-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

/// Creates decimal type.
#[inline(always)]
pub fn decimal(precision: u32, scale: u32) -> Result<Self> {
ensure_data_valid!(precision <= MAX_DECIMAL_PRECISION, "Decimals with precision larger than {MAX_DECIMAL_PRECISION} are not supported: {precision}",);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check > 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 265 to 279
pub trait AvroSchemaVisitor {
type T;

fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>;

fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>;

fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>;
fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>;

fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>;
}

/// Visit avro schema in post order visitor.
pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this comes from Java where you don't have enums.

You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html

I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:

  • Visiting the structure in the right order for all the types (Jan's example is missing the Map)
  • Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.

Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deriving the correct byte length for reading the decimal should be fixed, apart from that this looks great @liurenjie1024 👍🏻

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 for working on this 👍🏻

@Fokko Fokko merged commit 10cc7c9 into apache:main Aug 31, 2023
@liurenjie1024 liurenjie1024 deleted the renjie/issue-37 branch August 31, 2023 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schema: Converting iceberg schema to avro schema.
5 participants