feat: support ser/deser of value #82
Thanks @ZENOTME Left some comments.
Thanks, left some small comments.
crates/iceberg/src/spec/values.rs
Outdated
}

impl Iterator for StructIntoIter {
    type Item = (i32, Option<Literal>, String);
I think the item should be Option<Literal>. According to this discussion, we will remove struct types in Struct, so we can't return the field name then.
Should it be (i32, Option<Literal>)? 🤔 It seems we still need to store the field id in Struct.
If we remove struct types in Struct, we also have no field_id.
Generally LGTM, thanks!
crates/iceberg/src/avro/schema.rs
Outdated
)
})?;
match logical_type {
    "uuid" => Type::Primitive(PrimitiveType::Uuid),
"uuid" => Type::Primitive(PrimitiveType::Uuid),
UUID_LOGICAL_TYPE => Type::Primitive(PrimitiveType::Uuid),
crates/iceberg/src/spec/values.rs
Outdated
}

#[derive(Clone)]
struct Record {
I think Record should contain a StructType:
struct Record<'a> {
r#type: &'a StructType,
values: Vec<Option<RawLiteralEnum>>
}
This way we can avoid copying field names every time. But we can leave it as an optimization.
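A minimal sketch of the borrowed-schema layout suggested above, to show how field names can be read without cloning them per record. `StructType`, its `field_names` member, the `RawLiteralEnum` variants, and the `fields` method are simplified, hypothetical stand-ins, not the crate's actual definitions.

```rust
// Hypothetical, simplified stand-in for the crate's StructType.
#[derive(Debug)]
struct StructType {
    field_names: Vec<String>,
}

// Hypothetical, simplified stand-in for RawLiteralEnum.
#[derive(Debug, Clone, PartialEq)]
enum RawLiteralEnum {
    Int(i32),
    String(String),
}

// The record borrows its schema, so field names are stored once
// in the StructType and never copied into each record.
struct Record<'a> {
    r#type: &'a StructType,
    values: Vec<Option<RawLiteralEnum>>,
}

impl<'a> Record<'a> {
    // Pair each borrowed field name with its (optional) value.
    fn fields(&self) -> impl Iterator<Item = (&str, Option<&RawLiteralEnum>)> + '_ {
        self.r#type
            .field_names
            .iter()
            .map(String::as_str)
            .zip(self.values.iter().map(Option::as_ref))
    }
}
```

The trade-off is the lifetime parameter on `Record`, which is presumably why the reviewer is happy to defer this as an optimization.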
crates/iceberg/src/spec/values.rs
Outdated
let mut key = None;
let mut value = None;
required.into_iter().for_each(|(k, v)| {
    if k == "key" {
if k == "key" {
if k == MAP_KEY_FIELD_NAME {
crates/iceberg/src/spec/values.rs
Outdated
required.into_iter().for_each(|(k, v)| {
    if k == "key" {
        key = Some(v);
    } else if k == "value" {
} else if k == "value" {
} else if k == MAP_VALUE_FIELD_NAME {
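The two suggestions above replace string literals with named constants. A small sketch of that pattern follows; the constant values are taken from the literals being replaced, while the `split_entry` helper and its signature are hypothetical and only illustrate the lookup.

```rust
// Names follow the review suggestion; values match the replaced literals.
const MAP_KEY_FIELD_NAME: &str = "key";
const MAP_VALUE_FIELD_NAME: &str = "value";

/// Hypothetical helper: split the (name, value) pairs of one map entry
/// into its key and value. Returns None if either field is missing.
fn split_entry<V>(fields: Vec<(String, V)>) -> Option<(V, V)> {
    let mut key = None;
    let mut value = None;
    for (name, v) in fields {
        match name.as_str() {
            MAP_KEY_FIELD_NAME => key = Some(v),
            MAP_VALUE_FIELD_NAME => value = Some(v),
            // Unknown field names are ignored in this sketch.
            _ => {}
        }
    }
    Some((key?, value?))
}
```

With constants, a misspelled field name becomes a compile error instead of a silently failing comparison.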
crates/iceberg/src/spec/values.rs
Outdated
.ok_or_else(|| invalid_err("list"))?;
let value = v.try_into(value_ty)?;
if map_ty.value_field.required && value.is_none() {
    return Err(invalid_err("list"));
I think we should make the error here clearer.
crates/iceberg/src/spec/values.rs
Outdated
// - binary
// - fixed
// - decimal
// - uuid
I think we can still test them? They are encoded in bytes when converting from literal to raw literal?
Do you mean we can test serialization?
},
// # TODO
// rust avro don't support deserialize any bytes representation now.
RawLiteralEnum::Bytes(_) => Err(invalid_err_with_reason(
Sorry, I don't quite get your point. I think avro has a bytes type: https://docs.rs/apache-avro/0.16.0/apache_avro/types/enum.Value.html#variant.Bytes
https://github.com/apache/avro/blob/2b1955947ab446ad437f152ec2f3310ea399a015/lang/rust/avro/src/de.rs#L279 But it doesn't support deserializing bytes.
It also doesn't support serializing the fixed type for now. https://issues.apache.org/jira/browse/AVRO-3892?filter=-2
I will try to send a PR to fix them later.
OK, I see. I'll create an issue to track this.
Others LGTM
Mostly LGTM, only some suggestions.
const UUID_BYTES: usize = 16;
const UUID_LOGICAL_TYPE: &str = "uuid";
// # TODO
// This const may better to maintain in avro-rs.
We can create an issue for avro-rs and comment the issue link here.
I have added #86 to track them.
}

// # TODO
// rust avro don't support deserialize any bytes representation now:
How about creating an issue and adding the link in a comment here?
2. modify representation of Decimal to i128
2. fix to pass test
My bad, I have fixed the typos now. 🥵
@Fokko Hi, we can run the check again.
BTreeMap::from([(
    LOGICAL_TYPE.to_string(),
    Value::String(logical_type.to_string()),
)])
Just out of curiosity: is BTreeMap the default in Rust? Trees tend to have many pointers and therefore have faster lookups in exchange for a larger memory footprint (compared to a HashMap).
No, it's not the default. Here it's just because avro::FixedSchema uses BTreeMap. And indeed HashMap can save more memory, but it seems that avro prefers BTreeMap. (I'm not sure; maybe they need sorted order when iterating over the attributes during serialization?)
This sometimes comes up if you want the value to be hashable. BTreeMap is hashable while HashMap is not hashable. I think it has something to do with requiring an order to be hashable.
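To illustrate the hashability point, here is a short sketch using only the standard library. In `std`, `BTreeMap<K, V>` implements `Hash` when `K` and `V` do, whereas `HashMap` has no `Hash` implementation. The `hash_of` and `attributes` helpers are names made up for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hash any `Hash` value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Build an attribute map from string pairs (hypothetical helper).
fn attributes(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Maps built in different insertion orders iterate in the same sorted
/// order, so they compare equal and feed the hasher identically.
fn same_hash_regardless_of_insertion_order() -> bool {
    let a = attributes(&[("logicalType", "uuid"), ("doc", "id")]);
    let b = attributes(&[("doc", "id"), ("logicalType", "uuid")]);
    // Note: `hash_of(&some_hash_map)` would not compile, because
    // std's HashMap does not implement Hash.
    a == b && hash_of(&a) == hash_of(&b)
}
```

This is one plausible reason for a schema type to hold its attributes in a `BTreeMap`: the containing type can then derive `Hash`.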
})?;
match logical_type {
    UUID_LOGICAL_TYPE => Type::Primitive(PrimitiveType::Uuid),
    ty => {
For another PR, logical_type could also be a decimal: https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/schema_conversion.py#L571-L579
Thanks! We should add more checks here I think. I will do it later.
PrimitiveLiteral::Decimal(_) => Type::Primitive(PrimitiveType::Decimal {
    precision: MAX_DECIMAL_PRECISION,
    scale: 0,
Are we ignoring the scale?
How about making PrimitiveLiteral::Decimal a (i128, i64), storing the value in the first element and the scale in the second? 🤔
When will this be used? I think inferring type from literal is not feasible?
I think inferring type from literal is not feasible?
Sounds reasonable. In most cases we can get the type from the corresponding schema. cc @JanKaul
Initially I thought it would be useful to directly get the type from the value. But in case of the decimal you would need to store the Decimal with the scale like @ZENOTME suggested. It might make more sense to delete this method entirely and use the schema to get the types.
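For reference, the value-plus-scale idea raised earlier in this thread could look roughly like the sketch below. `DecimalLiteral`, its field names, and the `u32` scale are assumptions made for illustration (the thread proposed an `(i128, i64)` pair), and this is not what the PR implements; the thread leans toward taking the scale from the schema instead.

```rust
/// Hypothetical decimal literal: the represented value is
/// `unscaled / 10^scale`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DecimalLiteral {
    unscaled: i128,
    scale: u32,
}

impl DecimalLiteral {
    /// Render the value as a plain decimal string.
    fn to_plain_string(self) -> String {
        if self.scale == 0 {
            return self.unscaled.to_string();
        }
        let sign = if self.unscaled < 0 { "-" } else { "" };
        let digits = self.unscaled.unsigned_abs().to_string();
        let scale = self.scale as usize;
        // Left-pad with zeros so at least one digit precedes the point.
        let padded = format!("{:0>width$}", digits, width = scale + 1);
        let (int_part, frac_part) = padded.split_at(padded.len() - scale);
        format!("{sign}{int_part}.{frac_part}")
    }
}
```

Without the scale, the same unscaled `i128` is ambiguous (12345 could mean 123.45 or 12.345), which is why inferring a `Decimal` type from the literal alone ends up with a hard-coded `scale: 0`.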
I'll move this forward since it has been pending for a while (Sorry for that!). I have one comment about ignoring the scale, which looks incorrect to me. Thanks @ZENOTME for working on this, and @liurenjie1024 and @Xuanwo for the review 🙌
This PR:
It makes it easier to process Value::Decimal and avoids the type-consistency problems we would have if we used Decimal.