Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support ser/deser of value #82

Merged
merged 9 commits into from
Nov 13, 2023
Merged

feat: support ser/deser of value #82

merged 9 commits into from
Nov 13, 2023

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Oct 18, 2023

This PR:

  1. modify to store i128 for Value::Decimal
    It make more easy to process Value::Decimal and avoid type consisent problem if we use Decimal.
  2. support support ser/deser of value
  3. add related test and fix some wrong in schema

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Oct 18, 2023

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME Left some comments.

crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Show resolved Hide resolved
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left some small comments.

crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/avro/schema.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
}

impl Iterator for StructIntoIter {
type Item = (i32, Option<Literal>, String);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the item should be Option<Literal>. According to this discussion, we will remove struct types in Struct, and we can't return field name then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be (i32,Option<Literal>)?🤔 Seems we still need to store field id in Struct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove struct types in Struct, we also have no filed_id.

crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Show resolved Hide resolved
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, thanks!

)
})?;
match logical_type {
"uuid" => Type::Primitive(PrimitiveType::Uuid),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"uuid" => Type::Primitive(PrimitiveType::Uuid),
UUID_LOGICAL_TYPE => Type::Primitive(PrimitiveType::Uuid),

}

impl Iterator for StructIntoIter {
type Item = (i32, Option<Literal>, String);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove struct types in Struct, we also have no filed_id.

}

#[derive(Clone)]
struct Record {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Record should contains a StructType:

struct Record<'a> {
  r#type: &'a StructType,
  values: Vec<Option<RawLiteralEnum>>
}

This way we can avoid copying field names every time. But we can leave it as an optimization.

let mut key = None;
let mut value = None;
required.into_iter().for_each(|(k, v)| {
if k == "key" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if k == "key" {
if k == MAP_KEY_FIELD_NAME {

required.into_iter().for_each(|(k, v)| {
if k == "key" {
key = Some(v);
} else if k == "value" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if k == "value" {
} else if k == MAP_VALUE_FIELD_NAME {

.ok_or_else(|| invalid_err("list"))?;
let value = v.try_into(value_ty)?;
if map_ty.value_field.required && value.is_none() {
return Err(invalid_err("list"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the error here more clear.

crates/iceberg/src/spec/values.rs Show resolved Hide resolved
Comment on lines 2179 to 2182
// - binary
// - fixed
// - decimal
// - uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still test them? They are encoded in bytes when converting from literal to raw literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you means we can test serialize?

},
// # TODO
// rust avro don't support deserialize any bytes representation now.
RawLiteralEnum::Bytes(_) => Err(invalid_err_with_reason(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't quite get your points, I think avro has byte type: https://docs.rs/apache-avro/0.16.0/apache_avro/types/enum.Value.html#variant.Bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/avro/blob/2b1955947ab446ad437f152ec2f3310ea399a015/lang/rust/avro/src/de.rs#L279 But it can't support to deserialize bytes.

And it also can't support to serialize fixed type now. https://issues.apache.org/jira/browse/AVRO-3892?filter=-2

I will try to send a PR to fix them later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. I'll create an issue to track this.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others LGTM

@liurenjie1024
Copy link
Contributor

cc @Fokko @Xuanwo @JanKaul PTAL

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, only some suggestions.

const UUID_BYTES: usize = 16;
const UUID_LOGICAL_TYPE: &str = "uuid";
// # TODO
// This const may better to maintain in avro-rs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can create an issue for avro-rs and comment the issue link here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added a #86 to track them

}

// # TODO
// rust avro don't support deserialize any bytes representation now:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about create an issue and comment the link here?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Nov 1, 2023

My bad, have fixed the typos now.🥵

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Nov 4, 2023

@Fokko Hi, we can run the check again

@liurenjie1024
Copy link
Contributor

cc @Xuanwo @Fokko Any other comments?

Comment on lines +244 to +247
BTreeMap::from([(
LOGICAL_TYPE.to_string(),
Value::String(logical_type.to_string()),
)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity. Is BTreeMap the default in Rust? Trees tend to have many pointers and, therefore have faster lookups in exchange for a larger memory footprint (compared to a HashMap).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not default. For here it's just because avro::FixedSchema use BTreeMap. And indeed HashMap can save more memory. But I find that seems avro prefer to use BTreeMap. (I'm not sure that maybe they need the sorted order when iterate it the attributes in serialization?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sometimes comes up if you want the value to be hashable. BTreeMap is hashable while HashMap is not hashable. I think it has something to do with requiring an order to be hashable.

})?;
match logical_type {
UUID_LOGICAL_TYPE => Type::Primitive(PrimitiveType::Uuid),
ty => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We should add more checks here I think. I will do it later.

Comment on lines +928 to +930
PrimitiveLiteral::Decimal(_) => Type::Primitive(PrimitiveType::Decimal {
precision: MAX_DECIMAL_PRECISION,
scale: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ignoring the scale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are making PrimitiveLiteral::Decimal as (i128,i64) to store the value in the first one and the scale in the later one? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will this be used? I think inferring type from literal is not feasible?

Copy link
Contributor Author

@ZENOTME ZENOTME Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think inferring type from literal is not feasible?

Sounds reasonable. In most of case we can get the type from the according schema. cc @JanKaul

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I thought it would be useful to directly get the type from the value. But in case of the decimal you would need to store the Decimal with the scale like @ZENOTME suggested. It might make more sense to delete this method entirely and use the schema to get the types.

@Fokko
Copy link
Contributor

Fokko commented Nov 13, 2023

I'll move this forward since it has been pending for a while (Sorry for that!). I have one comment about ignoring the scale, which looks incorrect to me. Thanks @ZENOTME for working on this, and @liurenjie1024 and @Xuanwo for the review 🙌

@Fokko Fokko merged commit 28d7006 into apache:main Nov 13, 2023
6 checks passed
@ZENOTME ZENOTME deleted the value_ser branch November 13, 2023 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants