-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Introduce conversion between iceberg schema and avro schema #40
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM
3430bf8
to
a47c96e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you. I have one non-technical comment regarding idiomatic rust programming.
crates/iceberg/src/avro/schema.rs
Outdated
pub trait AvroSchemaVisitor { | ||
type T; | ||
|
||
fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>; | ||
|
||
fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>; | ||
|
||
fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>; | ||
fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>; | ||
|
||
fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>; | ||
} | ||
|
||
/// Visit avro schema in post order visitor. | ||
pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My personal opinion: I find a recursive function with pattern matching more idiomatic rust than using the visitor pattern for this. I feel like this comes from Java where you don't have enum
s.
Or is there anything here that couldn't be achieved with pattern matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is inspired by java's implementation which visits avro schema in post order. I'm not quite sure about what you mean? The arguments of different pattern is different. Could you write an example to make it more clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what the benefit of the visitor pattern is compared to using pattern matching as in the following code:
/// Converts avro schema to iceberg schema.
pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {
if let AvroSchema::Record(_) = avro_schema {
let mut next_field_id = 0;
let typ = avro_type_to_type(avro_schema, &mut next_field_id)?;
if let Some(Type::Struct(s)) = typ {
Schema::builder()
.with_fields(s.fields().iter().cloned())
.build()
} else {
Err(Error::new(
ErrorKind::Unexpected,
format!("Expected to convert avro record schema to struct type, but {typ}"),
))
}
} else {
Err(Error::new(
ErrorKind::DataInvalid,
"Can't convert non record avro schema to iceberg schema: {avro_schema}",
))
}
}
fn avro_type_to_type(avro: &AvroSchema, next_field_id: &mut i32) -> Result<Option<Type>> {
let typ = match avro {
AvroSchema::Decimal(decimal) => {
Type::decimal(decimal.precision as u32, decimal.scale as u32)?
}
AvroSchema::Date => Type::Primitive(PrimitiveType::Date),
AvroSchema::TimeMicros => Type::Primitive(PrimitiveType::Time),
AvroSchema::TimestampMicros => Type::Primitive(PrimitiveType::Timestamp),
AvroSchema::Uuid => Type::Primitive(PrimitiveType::Uuid),
AvroSchema::Boolean => Type::Primitive(PrimitiveType::Boolean),
AvroSchema::Int => Type::Primitive(PrimitiveType::Int),
AvroSchema::Long => Type::Primitive(PrimitiveType::Long),
AvroSchema::Float => Type::Primitive(PrimitiveType::Float),
AvroSchema::Double => Type::Primitive(PrimitiveType::Double),
AvroSchema::String | AvroSchema::Enum(_) => Type::Primitive(PrimitiveType::String),
AvroSchema::Fixed(fixed) => Type::Primitive(PrimitiveType::Fixed(fixed.size as u64)),
AvroSchema::Bytes => Type::Primitive(PrimitiveType::Binary),
AvroSchema::Null => return Ok(None),
AvroSchema::Record(record) => {
let field_types = record
.fields
.iter()
.map(|x| avro_type_to_type(&x.schema, next_field_id))
.collect::<Result<Vec<Option<Type>>>>()?;
let mut fields = Vec::with_capacity(field_types.len());
for (avro_field, typ) in record.fields.iter().zip_eq(field_types) {
let field_id = avro_field
.custom_attributes
.get(FILED_ID_PROP)
.and_then(Value::as_i64)
.ok_or_else(|| {
Error::new(
ErrorKind::DataInvalid,
format!("Can't convert field, missing field id: {avro_field:?}"),
)
})?;
let optional = is_avro_optional(&avro_field.schema);
let mut field = if optional {
NestedField::optional(field_id as i32, &avro_field.name, typ.unwrap())
} else {
NestedField::required(field_id as i32, &avro_field.name, typ.unwrap())
};
if let Some(doc) = &avro_field.doc {
field = field.with_doc(doc);
}
fields.push(field.into());
}
Ok(Some(Type::Struct(StructType::new(fields))))
}
AvroSchema::Array(array) => {
let item = avro_type_to_type(array, next_field_id)?;
if let AvroSchema::Array(item_schema) = array.as_ref() {
*next_field_id += 1;
let element_field = NestedField::list_element(
*next_field_id,
item.unwrap(),
!is_avro_optional(item_schema),
)
.into();
Ok(Some(Type::List(ListType { element_field })))
} else {
Err(Error::new(
ErrorKind::Unexpected,
"Expected avro array schema, but {array}",
))
}
}
};
Ok(Some(typ))
}
For me this is much clearer what happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the benefit of the vistor is to reuse it for coverting or collect info in schema (They represent different business logic). If it only used in avro schema to schema
, maybe recursive function is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is whether we will need to process avro schema in other places. If not, I think using recursive + pattern matching is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also use the visitor pattern here. I'm just generally trying to be cautious when porting OO code to rust not to introduce complexity that is not necessary in Rust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this comes from Java where you don't have enums.
You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html
I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:
- Visiting the structure in the right order for all the types (Jan's example is missing the
Map
) - Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.
Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with the visitor pattern. I just want to make a couple of points for completeness sake.
From wikipedia:
"Programming languages with sum types and pattern matching obviate many of the benefits of the visitor pattern, as the visitor class is able to both easily branch on the type of the object and generate a compiler error if a new object type is defined which the visitor does not yet handle."
The benefits of the visitor pattern don't apply to our case:
- Visitor allows same algorithm over different types
We are always visiting the same type, even with a read schema and a write schema.
- Visitor allows dynamic double dispatch
Without any default implementations on the visitor trait, all trait methods have to be implemented for a new type which is equal to pattern matching. There would be a benefit if you have many methods with default implementations because then you could save implementation effort for new types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.
This is also my concern. I prefer to keep visitor pattern to see if we have more features to implement on avro schema.
crates/iceberg/src/avro/schema.rs
Outdated
Ok(AvroSchema::Decimal(DecimalSchema { | ||
precision, | ||
scale, | ||
inner: Box::new(avro_fixed_schema(DECIMAL_LENGTH)?), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fixed length is based on the precision of the decimal:
Based on the precision the decimal needs a set number of bytes to store the information: https://github.com/apache/iceberg/blob/master/python/pyiceberg/utils/decimal.py#L111-L127
➜ Desktop python3
Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24))
>>> REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40))
>>> MAX_PRECISION
(-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55)
>>> MAX_PRECISION
(-1, 2, 4, 6, 9, 11, 14, 16, 18, 21, 23, 26, 28, 31, 33, 35, 38, 40, 43, 45, 47, 50, 52, 55)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
crates/iceberg/src/spec/datatypes.rs
Outdated
/// Creates decimal type. | ||
#[inline(always)] | ||
pub fn decimal(precision: u32, scale: u32) -> Result<Self> { | ||
ensure_data_valid!(precision <= MAX_DECIMAL_PRECISION, "Decimals with precision larger than {MAX_DECIMAL_PRECISION} are not supported: {precision}",); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check > 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
crates/iceberg/src/avro/schema.rs
Outdated
pub trait AvroSchemaVisitor { | ||
type T; | ||
|
||
fn record(&mut self, record: &RecordSchema, fields: Vec<Self::T>) -> Result<Self::T>; | ||
|
||
fn union(&mut self, union: &UnionSchema, options: Vec<Self::T>) -> Result<Self::T>; | ||
|
||
fn array(&mut self, array: &AvroSchema, item: Self::T) -> Result<Self::T>; | ||
fn map(&mut self, map: &AvroSchema, value: Self::T) -> Result<Self::T>; | ||
|
||
fn primitive(&mut self, schema: &AvroSchema) -> Result<Self::T>; | ||
} | ||
|
||
/// Visit avro schema in post order visitor. | ||
pub fn visit<V: AvroSchemaVisitor>(schema: &AvroSchema, visitor: &mut V) -> Result<V::T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this comes from Java where you don't have enums.
You do: https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html
I think the missing piece in Java is that you don't have pattern matching. I had the same experience when I started with PyIceberg that it might be too much, but now I'm very happy that I did use them because they add much-needed structure to the code. I think without the visitor, two kinds of errors are very likely:
- Visiting the structure in the right order for all the types (Jan's example is missing the
Map
) - Make sure that you cover all the types. For example, in the pattern matching that goes over the primitives, it is easy to forget a primitive there. That's why we also have a visitor in Python that visits all of the primitives.
Converting the schema might be a bit simple for the visitor, but later on, when you start pruning schema (or for Avro, having a read schema that differs from the write schema), it also gives this concept of divide and conquer where you can easily isolate the different cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deriving the correct byte length for reading the decimal should be fixed, apart from that this looks great @liurenjie1024 👍🏻
a47c96e
to
4b41231
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 for working on this 👍🏻
Introduce conversion between avro schema and iceberg schema.
Currently limitation, due to
apache_avro
's limit, we can' store attributes in array schema and map schema, this means we will loseelement-id
for list type, andkey-id
,value-id
for map type. We will fix it later.Close #37.