-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: suport read/write Manifest #79
Conversation
For now, it still not been completed. It only complete the basic design and I want to make sure whether the design well first. If it looks well, I will complete it and add the test later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZENOTME Thanks for this! I would suggest to move the part for ser/de to another pr and resolve them first, otherwise it would be too large to review.
I'm ok with the overall design.
I will split the RawLiteral out into another PR. |
I have send a new PR #82 for value part. Let's complete it first and then continue on working this PR later. |
Sorry for being late. The #82 has closed. Let's move again. I have fixed the problem mentioned above and recommit. |
Seem the binary file will make typos fail. Maybe we should fix #70 first. Other parts is ready to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks great! We are close to manipulating metadata in rust! Left some small comments to improve.
@@ -126,17 +126,20 @@ pub enum Transform { | |||
impl Transform { | |||
/// Get the return type of transform given the input type. | |||
/// Returns `None` if it can't be transformed. | |||
pub fn result_type(&self, input_type: &Type) -> Option<Type> { | |||
pub fn result_type(&self, input_type: &Type) -> Result<Type> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to change this signature? I think an Option
is more appropriate here. I think we can have a method try_infer_return_type
to convert Option
to Result
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The None case is the case which is not support in iceberg.🤔 Why not treat it as a error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a difference, let me explain! In Iceberg it is possible to have your custom transform, for example if you want to do something very specific, or you want to influence the distribution when it comes to partitioning. It can be that the transform is not available in the client, and then we cast it to an UnknownTransform
: https://github.com/apache/iceberg-python/blob/8c8abb5c4c258e32941110a9ce0938e1328290b3/pyiceberg/transforms.py#L688
What this essentially means is that we don't know how it is transformed, and we cannot apply speedups. For example, bucket[16](34) = 3
then we know in which bucket the data is that we're looking for. But if there is some custom partitioning then we just have to look through all the buckets.
To this specific case of the type, in PyIceberg we return a StringType()
, that's borrowed from Java where you just cast it to a string. I don't think that's used anywhere since we don't call this on the UnknownTransform
. The Type cannot be None, so I think having Result
is the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I think we need to add configurations in https://github.com/crate-ci/typos/blob/master/docs/reference.md |
Yep, we should skip check on |
I have added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the effort!
sequence_number: 1, | ||
min_sequence_number: 1, | ||
added_snapshot_id: 377075049360453639, | ||
sequence_number: UNASSIGNED_SEQUENCE_NUMBER, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the manifest list the snapshot should be assigned, maybe we can add a note here that it should be fixed once we get the snapshots generation in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry. I'm not sure what you mean.🤔 In here the snapshot id
is assigned. What's the snapshots generation?
assert!(entry.sequence_number.is_none()); | ||
assert!(entry.file_sequence_number.is_none()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert!(entry.snapshot_id == Some(0)); | ||
assert!(entry.sequence_number == Some(1)); | ||
assert!(entry.file_sequence_number == Some(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is some interesting test-data 🤔 The status is added
, so this means a new file which typically doesn't set the snapshot-id, sequence-number, and file-sequence-number since they should be inherited. This way we don't have to rewrite the manifest after a failed conflict (because a new snapshot-id and sequence-number is assigned).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for putting so much effort into this @ZENOTME 🙌 Thanks for the review @JanKaul and @liurenjie1024. I think there are still some loose ends with the snapshot-id and sequence-number inheritance, but we can tackle that separately.
* support read/write Manifest * refine in memory struct * fix seq num assign * ignore typos testdata * refine * add license for typos.toml * remove distinct_counts --------- Co-authored-by: ZENOTME <st810918843@gmail.com>
This PR prepare to support read/write Manifest.
related issue: #36