Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: suport read/write Manifest #79

Merged
merged 7 commits into from
Dec 13, 2023
Merged

feat: suport read/write Manifest #79

merged 7 commits into from
Dec 13, 2023

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Oct 13, 2023

This PR prepare to support read/write Manifest.

related issue: #36

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Oct 13, 2023

For now, it still not been completed.

It only complete the basic design and I want to make sure whether the design well first.

If it looks well, I will complete it and add the test later.

cc @JanKaul @Fokko @Xuanwo @liurenjie1024

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZENOTME Thanks for this! I would suggest to move the part for ser/de to another pr and resolve them first, otherwise it would be too large to review.

I'm ok with the overall design.

crates/iceberg/src/spec/partition.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Oct 18, 2023

I will split the RawLiteral out into another PR.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Oct 18, 2023

I have send a new PR #82 for value part. Let's complete it first and then continue on working this PR later.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Nov 30, 2023

Sorry for being late. The #82 has closed. Let's move again. I have fixed the problem mentioned above and recommit.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Nov 30, 2023

Seem the binary file will make typos fail. Maybe we should fix #70 first. Other parts is ready to review.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks great! We are close to manipulating metadata in rust! Left some small comments to improve.

crates/iceberg/src/spec/manifest_list.rs Outdated Show resolved Hide resolved
@@ -126,17 +126,20 @@ pub enum Transform {
impl Transform {
/// Get the return type of transform given the input type.
/// Returns `None` if it can't be transformed.
pub fn result_type(&self, input_type: &Type) -> Option<Type> {
pub fn result_type(&self, input_type: &Type) -> Result<Type> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to change this signature? I think an Option is more appropriate here. I think we can have a method try_infer_return_type to convert Option to Result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The None case is the case which is not support in iceberg.🤔 Why not treat it as a error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have different meaning. When return Option, it acts more like a check to say if this transform can be applied, and whether it's an error is up to user. But their difference seems negligible. cc @Xuanwo @JanKaul @Fokko What do you think?

Copy link
Contributor

@Fokko Fokko Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a difference, let me explain! In Iceberg it is possible to have your custom transform, for example if you want to do something very specific, or you want to influence the distribution when it comes to partitioning. It can be that the transform is not available in the client, and then we cast it to an UnknownTransform: https://github.com/apache/iceberg-python/blob/8c8abb5c4c258e32941110a9ce0938e1328290b3/pyiceberg/transforms.py#L688

What this essentially means is that we don't know how it is transformed, and we cannot apply speedups. For example, bucket[16](34) = 3 then we know in which bucket the data is that we're looking for. But if there is some custom partitioning then we just have to look through all the buckets.

To this specific case of the type, in PyIceberg we return a StringType(), that's borrowed from Java where you just cast it to a string. I don't think that's used anywhere since we don't call this on the UnknownTransform. The Type cannot be None, so I think having Result is the right thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Contributor

I think we need to add configurations in typos.toml to skip the check:

https://github.com/crate-ci/typos/blob/master/docs/reference.md

@Xuanwo
Copy link
Member

Xuanwo commented Dec 13, 2023

I think we need to add configurations in typos.toml to skip the check:

https://github.com/crate-ci/typos/blob/master/docs/reference.md

Yep, we should skip check on testdata.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 13, 2023

I think we need to add configurations in typos.toml to skip the check:
https://github.com/crate-ci/typos/blob/master/docs/reference.md

Yep, we should skip check on testdata.

I have added .typo.toml to skip.

crates/iceberg/src/spec/partition.rs Show resolved Hide resolved
crates/iceberg/src/spec/manifest_list.rs Show resolved Hide resolved
crates/iceberg/src/spec/manifest_list.rs Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/manifest.rs Outdated Show resolved Hide resolved
.typos.toml Outdated Show resolved Hide resolved
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the effort!

sequence_number: 1,
min_sequence_number: 1,
added_snapshot_id: 377075049360453639,
sequence_number: UNASSIGNED_SEQUENCE_NUMBER,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the manifest list the snapshot should be assigned, maybe we can add a note here that it should be fixed once we get the snapshots generation in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry. I'm not sure what you mean.🤔 In here the snapshot id is assigned. What's the snapshots generation?

Comment on lines +1634 to +1635
assert!(entry.sequence_number.is_none());
assert!(entry.file_sequence_number.is_none());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

These should default to 0. You could implement this using the initial-value of Iceberg:

image

Suggested change
assert!(entry.sequence_number.is_none());
assert!(entry.file_sequence_number.is_none());
assert!(entry.sequence_number == Some(0));
assert!(entry.file_sequence_number == Some(0));

Comment on lines +1395 to +1397
assert!(entry.snapshot_id == Some(0));
assert!(entry.sequence_number == Some(1));
assert!(entry.file_sequence_number == Some(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some interesting test-data 🤔 The status is added, so this means a new file which typically doesn't set the snapshot-id, sequence-number, and file-sequence-number since they should be inherited. This way we don't have to rewrite the manifest after a failed conflict (because a new snapshot-id and sequence-number is assigned).

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for putting so much effort into this @ZENOTME 🙌 Thanks for the review @JanKaul and @liurenjie1024. I think there are still some loose ends with the snapshot-id and sequence-number inheritance, but we can tackle that separately.

@Fokko Fokko merged commit e0e2b1b into apache:main Dec 13, 2023
6 checks passed
@ZENOTME ZENOTME deleted the manifest branch December 14, 2023 03:26
shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* support read/write Manifest

* refine in memory struct

* fix seq num assign

* ignore typos testdata

* refine

* add license for typos.toml

* remove distinct_counts

---------

Co-authored-by: ZENOTME <st810918843@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants