feat: support read Manifest List #56

ZENOTME · 2023-09-02T17:51:08Z

related issue: #36

The unified view basically follows https://iceberg.apache.org/spec/#writer-requirements.

For a field that is optional in V1 but required in V2, and for which the spec doesn't state a default value, we make it optional in the unified view.
But for a field of this kind that does have a stated default value, e.g. the content field, we make it required in the unified view.
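The rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `EntryV1`, `EntryV2` and `Entry` are hypothetical names, and only two fields are shown. It assumes `added_files_count` is one of the fields that is optional in V1 and required in V2 without a stated default, and that `content` defaults to data for V1.

```rust
/// Content type of a manifest; all V1 manifests are treated as data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestContentType {
    Data = 0,
    Deletes = 1,
}

/// Hypothetical V1 shape: the count is optional and `content` does not exist.
pub struct EntryV1 {
    pub manifest_path: String,
    pub added_files_count: Option<i32>,
}

/// Hypothetical V2 shape: both fields are required.
pub struct EntryV2 {
    pub manifest_path: String,
    pub content: ManifestContentType,
    pub added_files_count: i32,
}

/// Hypothetical unified view shared by both versions.
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub manifest_path: String,
    /// Required: V1 has a known default (data).
    pub content: ManifestContentType,
    /// Optional: V1 may omit it and there is no default to fall back to.
    pub added_files_count: Option<i32>,
}

impl From<EntryV1> for Entry {
    fn from(v1: EntryV1) -> Self {
        Entry {
            manifest_path: v1.manifest_path,
            content: ManifestContentType::Data,
            added_files_count: v1.added_files_count,
        }
    }
}

impl From<EntryV2> for Entry {
    fn from(v2: EntryV2) -> Self {
        Entry {
            manifest_path: v2.manifest_path,
            content: v2.content,
            added_files_count: Some(v2.added_files_count),
        }
    }
}
```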

Writing may need a writer interface like the one in icelake, so we need to implement it after #49 is finished.

crates/iceberg/src/spec/snapshot.rs

crates/iceberg/src/spec/manifest_list.rs

ZENOTME · 2023-09-03T03:46:57Z

cc @liurenjie1024 @Xuanwo @JanKaul @nastra @Fokko PTAL

liurenjie1024

Thanks for the effort! Left some comments.

crates/iceberg/src/spec/manifest_list.rs

liurenjie1024 · 2023-09-04T02:09:38Z

crates/iceberg/src/spec/manifest_list.rs

+/// manifest.
+#[derive(Debug, Clone)]
+pub struct ManifestList {
+    version: FormatVersion,


Do we really need this?

When we serialize the ManifestList, we need to know which format version to serialize it to. 🤔

It should be controlled by the table format version; storing another variable for the manifest list is dangerous.

liurenjie1024 · 2023-09-04T02:12:13Z

crates/iceberg/src/spec/manifest_list.rs

+    /// A list of field summaries for each partition field in the spec. Each
+    /// field in the list corresponds to a field in the manifest file’s
+    /// partition spec.
+    partitions: Option<Vec<FieldSummary>>,


Remove the Option? I think an empty vec is good enough to represent None?

I think this would cause the manifest list to change implicitly:

read partitions: Some(vec![])

write partitions back: None

The same thing happens with key_metadata.

Would that be weird for users?

Good point. But since it's optional, we don't need to write None in the serialized format? We just need to skip serialization.
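To make the trade-off in this thread concrete, here is a small sketch. It assumes a hypothetical in-memory `Entry` that stores a plain `Vec`, and hypothetical `read`/`write` helpers that stand in for the real Avro decoding and encoding. It shows that an on-disk `Some(vec![])` comes back as `None` once empty lists are skipped on write.

```rust
/// Hypothetical, trimmed-down field summary.
#[derive(Debug, Clone, PartialEq)]
pub struct FieldSummary {
    pub contains_null: bool,
}

/// In-memory entry without the `Option` wrapper.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub partitions: Vec<FieldSummary>,
}

/// Stand-in for decoding: a missing list and an empty list both become `vec![]`.
pub fn read(on_disk: Option<Vec<FieldSummary>>) -> Entry {
    Entry {
        partitions: on_disk.unwrap_or_default(),
    }
}

/// Stand-in for encoding: an empty list is skipped, i.e. written as `None`.
pub fn write(entry: &Entry) -> Option<Vec<FieldSummary>> {
    if entry.partitions.is_empty() {
        None
    } else {
        Some(entry.partitions.clone())
    }
}
```

Non-empty lists round-trip unchanged; only the distinction between "absent" and "present but empty" is lost, which is the implicit change raised above.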

liurenjie1024 · 2023-09-04T02:12:22Z

crates/iceberg/src/spec/manifest_list.rs

+    /// field: 519
+    ///
+    /// Implementation-specific key metadata for encryption
+    key_metadata: Option<Vec<u8>>,


liurenjie1024 · 2023-09-04T02:20:06Z

crates/iceberg/src/spec/manifest_list.rs

+    /// Parse manifest list from bytes.
+    ///
+    /// QUESTION: Will we have more than one manifest list in a single file?
+    pub fn parse(


It would be better to follow our current approach:

ManifestListV1 and ManifestListV2

ManifestListEnum

ManifestList <-> ManifestListEnum

You can take schema as an example

I'm concerned that using ManifestEnum will have an extra performance cost. The generated code will try to deserialize ManifestListV2 first and then ManifestListV1. If we pass a version number, this cost can be avoided. 🤔

According to the Iceberg spec, a manifest list is an Avro file. The read/write process would be similar to reading/writing a manifest file, e.g. we need to define the Avro schema first. So we don't blindly try v1/v2.
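The layering discussed in this thread might look roughly like the sketch below. All types are simplified stand-ins with two fields, and `to_versioned` is a hypothetical method name. The point is only that the target version is passed in explicitly (e.g. from the table format version) instead of being guessed by trying V2 and then V1. Filling `sequence_number` with 0 for V1 input is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FormatVersion {
    V1,
    V2,
}

#[derive(Debug, Clone, PartialEq)]
pub struct EntryV1 {
    pub manifest_path: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct EntryV2 {
    pub manifest_path: String,
    pub sequence_number: i64,
}

/// Version-specific representation used only at the (de)serialization boundary.
#[derive(Debug, Clone, PartialEq)]
pub enum ManifestListEnum {
    V1(Vec<EntryV1>),
    V2(Vec<EntryV2>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub manifest_path: String,
    pub sequence_number: i64,
}

/// Unified in-memory view; it does not store a format version itself.
#[derive(Debug, Clone, PartialEq)]
pub struct ManifestList {
    pub entries: Vec<Entry>,
}

impl From<ManifestListEnum> for ManifestList {
    fn from(versioned: ManifestListEnum) -> Self {
        let entries = match versioned {
            // Assumption: V1 entries get sequence number 0.
            ManifestListEnum::V1(es) => es
                .into_iter()
                .map(|e| Entry { manifest_path: e.manifest_path, sequence_number: 0 })
                .collect(),
            ManifestListEnum::V2(es) => es
                .into_iter()
                .map(|e| Entry { manifest_path: e.manifest_path, sequence_number: e.sequence_number })
                .collect(),
        };
        ManifestList { entries }
    }
}

impl ManifestList {
    /// The caller supplies the version, so no trial-and-error is needed.
    pub fn to_versioned(&self, version: FormatVersion) -> ManifestListEnum {
        match version {
            FormatVersion::V1 => ManifestListEnum::V1(
                self.entries
                    .iter()
                    .map(|e| EntryV1 { manifest_path: e.manifest_path.clone() })
                    .collect(),
            ),
            FormatVersion::V2 => ManifestListEnum::V2(
                self.entries
                    .iter()
                    .map(|e| EntryV2 {
                        manifest_path: e.manifest_path.clone(),
                        sequence_number: e.sequence_number,
                    })
                    .collect(),
            ),
        }
    }
}
```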

liurenjie1024 · 2023-09-04T02:21:10Z

crates/iceberg/src/spec/manifest_list.rs

+    use super::ManifestListEntry;
+
+    #[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
+    pub(super) struct ManifestListEntryV1 {


We should have a similar approach for ManifestList.

ZENOTME · 2023-09-07T06:34:31Z

I also found some places that are inconsistent with the spec.

https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int
In practice, this field in Avro is added_data_files_count. The same thing applies to existing_files_count and deleted_files_count.

Optional fields, array elements, and map values must be wrapped in an Avro union with null. This is the only union type allowed in Iceberg data files.

manifest_list:
   partitions: `list<508: field_summary>`

Actually, this field_summary field is not an optional value.
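One way a reader could tolerate the naming mismatch described above is to accept either field name. The sketch below is purely illustrative: it assumes a decoded record is available as a `HashMap` of field name to integer value, which is not how the Avro library in this PR represents records, and `added_files_count` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Looks up the added-files count under the name seen in Avro files first,
/// then under the name listed in the spec table.
pub fn added_files_count(record: &HashMap<String, i64>) -> Option<i64> {
    ["added_data_files_count", "added_files_count"]
        .iter()
        .find_map(|name| record.get(*name).copied())
}
```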

JanKaul

Thank you for working on this. I have one comment but otherwise looks good to me.

JanKaul · 2023-09-08T06:27:45Z

crates/iceberg/src/spec/manifest_list.rs

+#[derive(Debug, PartialEq, Clone)]
+pub enum ManifestContentType {
+    /// The manifest content is data.
+    Data = 0,
+    /// The manifest content is deletes.
+    Deletes = 1,
+}


I think you have to distinguish between position deletes = 1 and equality deletes = 2. If we use the serde_repr crate we could directly serialize/deserialize it as follows:

/// The type of files tracked by the manifest, either data or delete files; Data(0) for all v1 manifests
#[derive(Debug, Serialize_repr, Deserialize_repr, PartialEq, Eq, Clone)]
#[repr(u8)]
pub enum ManifestContentType {
    /// Data.
    Data = 0,
    /// Deletes at position.
    PositionDeletes = 1,
    /// Delete by equality.
    EqualityDeletes = 2,
}

Thanks for the reminder! I have a question:

The data file spec distinguishes position deletes from equality deletes.
The manifest_list spec only distinguishes data from deletes.

So is that just an inconsistency in the spec?

Java's implementation also only has Data or Delete in manifest content type.

Sorry, my bad. I thought it was the same struct as in datafile. But it looks like they are different. Maybe the information is not required in ManifestList. Forget my comment then.
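Given the conclusion of this thread (only data and deletes at the manifest list level), the integer stored in the content field could be mapped to the enum without extra crates, for example with `TryFrom`. This is a sketch under that assumption; `parse_content` is a hypothetical helper and the `String` error type is a placeholder for the crate's own error type.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestContentType {
    /// The manifest content is data.
    Data = 0,
    /// The manifest content is deletes.
    Deletes = 1,
}

impl TryFrom<i32> for ManifestContentType {
    type Error = String;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(ManifestContentType::Data),
            1 => Ok(ManifestContentType::Deletes),
            other => Err(format!("unknown manifest content type: {}", other)),
        }
    }
}

/// Convenience wrapper so callers do not need `TryFrom` in scope.
pub fn parse_content(value: i32) -> Result<ManifestContentType, String> {
    ManifestContentType::try_from(value)
}
```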

liurenjie1024 · 2023-09-08T11:29:12Z

I also found some places that are inconsistent with the spec.

https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int In practice, this field in Avro is added_data_files_count. The same thing applies to existing_files_count and deleted_files_count.

Optional fields, array elements, and map values must be wrapped in an Avro union with null. This is the only union type allowed in Iceberg data files.
manifest_list:
   partitions: `list<508: field_summary>`
Actually, this field_summary field is not an optional value.

How about submitting a fix to iceberg-docs?

liurenjie1024

We are almost there, thanks!

liurenjie1024 · 2023-09-08T11:30:55Z

crates/iceberg/src/avro/mod.rs

@@ -18,3 +18,4 @@
 //! Avro related codes.
 #[allow(dead_code)]
 mod schema;
+pub use schema::*;


Suggested change

-pub use schema::*;
+pub(crate) use schema::*;

Avro schema is not intended for external users.
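As a small illustration of what the suggested visibility means, the sketch below re-exports a helper with `pub(crate)` so that other modules in the same crate can call it while it stays out of the public API. The module layout mirrors the diff, but the function body and `manifest_list_schema_json` are hypothetical stand-ins.

```rust
mod avro {
    mod schema {
        /// Hypothetical stand-in for the real schema conversion.
        pub(crate) fn schema_to_avro_schema(name: &str) -> String {
            format!("{{\"type\":\"record\",\"name\":\"{}\"}}", name)
        }
    }

    // Visible anywhere inside this crate, but not exported to downstream users.
    pub(crate) use self::schema::*;
}

/// Crate-internal callers can reach the helper through the re-export.
pub fn manifest_list_schema_json() -> String {
    avro::schema_to_avro_schema("manifest_file")
}
```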

liurenjie1024 · 2023-09-08T11:31:48Z

crates/iceberg/src/avro/schema.rs

@@ -215,7 +215,7 @@ impl SchemaVisitor for SchemaToAvroSchema {
 }

 /// Converting iceberg schema to avro schema.
-pub(crate) fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> {
+pub fn schema_to_avro_schema(name: impl ToString, schema: &Schema) -> Result<AvroSchema> {


liurenjie1024 · 2023-09-08T11:31:59Z

crates/iceberg/src/avro/schema.rs

@@ -454,7 +454,7 @@ impl AvroSchemaVisitor for AvroSchemaToSchema {
 }

 /// Converts avro schema to iceberg schema.
-pub(crate) fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {
+pub fn avro_schema_to_schema(avro_schema: &AvroSchema) -> Result<Schema> {


liurenjie1024 · 2023-09-08T11:40:34Z

crates/iceberg/src/spec/manifest_list.rs

+#[derive(Debug, PartialEq, Clone)]
+pub enum ManifestContentType {
+    /// The manifest content is data.
+    Data = 0,
+    /// The manifest content is deletes.
+    Deletes = 1,
+}


Java's implementation also only has Data or Delete in manifest content type.

liurenjie1024 · 2023-09-08T11:47:26Z

crates/iceberg/src/spec/manifest_list.rs

+
+    #[test]
+    fn test_parse_manifest_list_v1() {
+        let path = format!(


This unit test only checks deserialization. We should also check serialization.
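A round-trip check is one common way to cover both directions in a single test. The sketch below uses a toy length-prefixed byte format purely as a placeholder for the real Avro encoding; `Entry`, `to_bytes` and `from_bytes` are hypothetical and not part of this PR.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub manifest_path: String,
    pub manifest_length: i64,
}

/// Toy encoder: u32 path length, path bytes, then the i64 length (little endian).
pub fn to_bytes(entries: &[Entry]) -> Vec<u8> {
    let mut out = Vec::new();
    for e in entries {
        out.extend_from_slice(&(e.manifest_path.len() as u32).to_le_bytes());
        out.extend_from_slice(e.manifest_path.as_bytes());
        out.extend_from_slice(&e.manifest_length.to_le_bytes());
    }
    out
}

/// Toy decoder: returns `None` on truncated or non-UTF-8 input.
pub fn from_bytes(mut bytes: &[u8]) -> Option<Vec<Entry>> {
    let mut entries = Vec::new();
    while !bytes.is_empty() {
        if bytes.len() < 4 {
            return None;
        }
        let (len_bytes, rest) = bytes.split_at(4);
        let len = u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]) as usize;
        if rest.len() < len + 8 {
            return None;
        }
        let (path, rest) = rest.split_at(len);
        let (num, rest) = rest.split_at(8);
        let mut buf = [0u8; 8];
        buf.copy_from_slice(num);
        entries.push(Entry {
            manifest_path: String::from_utf8(path.to_vec()).ok()?,
            manifest_length: i64::from_le_bytes(buf),
        });
        bytes = rest;
    }
    Some(entries)
}
```

A test would then serialize a known value, parse the bytes back, and assert that the result equals the original.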

liurenjie1024 · 2023-09-11T02:24:28Z

crates/iceberg/src/spec/manifest_list.rs

            ))
        })
    };
-    const MANIFEST_LENGTH: Lazy<NestedFieldRef> = {
+    pub static MANIFEST_LENGTH: Lazy<NestedFieldRef> = {


Why did we remove const and use pub static instead? I think const is better here.

We can't evaluate a Lazy const at runtime. It seems we can only use static in this case. 🤔

I would suggest changing the modifier to pub(crate).
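The difference between `const` and `static` for lazily initialized values can be seen in a small experiment. The sketch below uses `std::sync::LazyLock` (Rust 1.80+) as a stand-in for the `Lazy` type in the diff, and an `i32` field ID instead of `NestedFieldRef`; the counters and function names are hypothetical. A `static` is one shared value that is initialized once, whereas a `const` is re-created at every use site, so its initializer runs again each time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counters that record how often each initializer runs.
static STATIC_INITS: AtomicUsize = AtomicUsize::new(0);
static CONST_INITS: AtomicUsize = AtomicUsize::new(0);

/// One shared instance for the whole program, initialized on first access.
pub(crate) static MANIFEST_LENGTH_ID: LazyLock<i32> = LazyLock::new(|| {
    STATIC_INITS.fetch_add(1, Ordering::SeqCst);
    501
});

/// A `const` is pasted into every use site, so each use builds a fresh `LazyLock`.
#[allow(clippy::declare_interior_mutable_const)]
const MANIFEST_LENGTH_ID_CONST: LazyLock<i32> = LazyLock::new(|| {
    CONST_INITS.fetch_add(1, Ordering::SeqCst);
    501
});

/// Reads the static twice and returns how many times its initializer has run.
pub fn static_inits_after_two_reads() -> usize {
    let a: i32 = *MANIFEST_LENGTH_ID;
    let b: i32 = *MANIFEST_LENGTH_ID;
    assert_eq!(a, b);
    STATIC_INITS.load(Ordering::SeqCst)
}

/// Reads the const twice and returns how many initializer runs that caused.
#[allow(clippy::borrow_interior_mutable_const)]
pub fn const_inits_during_two_reads() -> usize {
    let before = CONST_INITS.load(Ordering::SeqCst);
    let a: i32 = *MANIFEST_LENGTH_ID_CONST;
    let b: i32 = *MANIFEST_LENGTH_ID_CONST;
    assert_eq!(a, b);
    CONST_INITS.load(Ordering::SeqCst) - before
}
```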

liurenjie1024

LGTM. Thanks for your effort!

ZENOTME · 2023-09-22T02:19:08Z

Any other comments? cc @Fokko

crates/iceberg/src/avro/mod.rs

crates/iceberg/src/spec/manifest_list.rs

Xuanwo · 2023-09-22T03:37:04Z

crates/iceberg/testdata/simple_manifest_list_v1.avro

By default, ASF releases do not allow binary files. Should we generate these files or exclude them from the release?

cc @Fokko for comments as you are likely to be our first release manager.

You can add binary files to the repository, but we should not release them, so if you can exclude them from the artifact, then we're good. In general, we try to avoid adding binary files. In PyIceberg we use FastAvro to generate them.

I think it is worth the effort to generate the files. This way we can also generate the whole structure later on (manifest-list, manifest, datafile). Since Iceberg requires absolute paths, it is not easy to generate these files beforehand (looking at the Windows CI 😁)

+1. We can investigate how to do it in a follow-up PR.

Tracked by #70

liurenjie1024 · 2023-09-30T01:59:37Z

CC @Xuanwo @JanKaul Any comments?

JanKaul · 2023-10-02T10:41:15Z

LGTM, thank you all for your efforts.

Fokko · 2023-10-02T16:14:14Z

Thanks @ZENOTME for working on this! And @liurenjie1024, @Xuanwo and @JanKaul for the review 🙌

ZENOTME force-pushed the manifest_list branch from 22af280 to 3cd5dfe Compare September 3, 2023 03:37

ZENOTME marked this pull request as ready for review September 3, 2023 03:45

liurenjie1024 reviewed Sep 4, 2023

JanKaul reviewed Sep 8, 2023

liurenjie1024 reviewed Sep 8, 2023

liurenjie1024 reviewed Sep 11, 2023

ZENOTME force-pushed the manifest_list branch from 76582cf to c5036eb Compare September 11, 2023 10:58

liurenjie1024 approved these changes Sep 12, 2023

ZENOTME added 4 commits September 22, 2023 10:15

support read manifest-list

b8a206c

support to read with schema

5beda7d

* add test for serialize

8b217c5

* fix clippy

fix typos

a801d0a

ZENOTME force-pushed the manifest_list branch from f78e19c to a801d0a Compare September 22, 2023 02:17

Xuanwo reviewed Sep 22, 2023

rename _schema to _const_fields

b68196c

Fokko approved these changes Sep 27, 2023

liurenjie1024 mentioned this pull request Sep 27, 2023

test: Replace binary avro file by generating it on the fly. #70

Closed

Fokko merged commit 2585a2f into apache:main Oct 2, 2023
6 checks passed

ZENOTME deleted the manifest_list branch October 7, 2023 16:40
