Parse schema from table metadata #30

samansmink · 2023-11-27T12:35:35Z

This PR changes the way we scan Iceberg tables. Previously, we simply scanned them as a list of Parquet files, using the schema found in the first Parquet file scanned.

Now the Iceberg metadata is parsed to obtain the table schema, which is passed to parquet_scan in DuckDB through the schema parameter: duckdb/duckdb#9123

This effectively adds support for schema evolution and resolves issues where the parquet schema is different from the iceberg schema.
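To illustrate why a table-level schema helps with schema evolution, here is a minimal sketch. It assumes columns are matched between the Iceberg schema and each Parquet file by field id, as the Iceberg spec describes. The `IcebergColumn` struct and `ResolveColumns` function are hypothetical names made up for this example and are not the extension's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical type for illustration only; not a class from this extension.
struct IcebergColumn {
    int32_t field_id;  // stable id taken from the Iceberg table metadata
    std::string name;  // current column name in the table schema
    std::string type;  // Iceberg type string, e.g. "long" or "decimal(10, 2)"
};

// For each column of the table schema, look up the Parquet column that carries
// the same field id. A missing id yields std::nullopt, which a reader would
// surface as NULL (the file was written before the column was added).
std::vector<std::optional<std::size_t>> ResolveColumns(
    const std::vector<IcebergColumn> &table_schema,
    const std::map<int32_t, std::size_t> &parquet_field_ids) {
    std::vector<std::optional<std::size_t>> result;
    result.reserve(table_schema.size());
    for (const auto &col : table_schema) {
        auto it = parquet_field_ids.find(col.field_id);
        if (it == parquet_field_ids.end()) {
            result.push_back(std::nullopt);
        } else {
            result.push_back(it->second);
        }
    }
    return result;
}
```

Because the lookup goes through the field id rather than the column name or position, a renamed or reordered column still resolves to the right Parquet column, and a column added after a file was written resolves to no column at all.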

To test it, the Spark-based test data generation script was updated to include schema evolution. Note that it is not yet run from CI, but that will follow soon in conjunction with the REST catalog support and the testing infrastructure added in #27.

Fokko

Looking good @samansmink

Fokko · 2023-11-27T12:40:37Z

src/common/schema.cpp

+		auto digits = StringUtil::Split(raw_digits, ',');
+		D_ASSERT(digits.size() == 2);
+
+		auto width = std::stoi(digits[0]);


Checked if it strips whitespace and it looks like it does.

So this is good as is, right?
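The whitespace question above can be checked with a small standalone sketch. `ParseDecimalParams` is a hypothetical helper written for this example, not the code in `src/common/schema.cpp`, and it uses only the standard library rather than `StringUtil::Split`. It relies on the documented behavior of `std::stoi`, which discards leading whitespace before converting.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper for illustration: parses "decimal(P, S)" or
// "decimal(P,S)" into {width, scale}.
std::pair<int, int> ParseDecimalParams(const std::string &type_str) {
    const auto open = type_str.find('(');
    const auto close = type_str.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        throw std::invalid_argument("expected decimal(P, S): " + type_str);
    }
    const auto comma = type_str.find(',', open);
    if (comma == std::string::npos || comma > close) {
        throw std::invalid_argument("expected decimal(P, S): " + type_str);
    }
    // std::stoi skips leading whitespace, so " 2" parses as 2 without trimming.
    // Trailing whitespace is tolerated too, because conversion stops at the
    // first character that is not part of the number.
    const int width = std::stoi(type_str.substr(open + 1, comma - open - 1));
    const int scale = std::stoi(type_str.substr(comma + 1, close - comma - 1));
    return {width, scale};
}
```

With this sketch, `"decimal(10, 2)"` and `"decimal(10,2)"` both yield width 10 and scale 2, which matches the observation that no explicit trimming is needed before `std::stoi`.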

Tishj and others added 11 commits September 27, 2023 14:29

add parsing code for the json schema, this should be parsed and passed to the 'read_parquet' function as schema for the result

f33d7e2

wip: using iceberg schema instead of parquet metadata schema

19b8113

update submodule

a329ebf

wipwipwip

ddee75d

restore test

d831280

update duckdb

c40184b

Now reading correct schema from iceberg metadata

adda3f2

merge with main: add direct version scan

31b0ea1

cleanup

48f010f

Merge branch 'main' into parse_json_schema

42059c4

update vcpkg, bump duckdb to main

481f533

samansmink requested a review from Tishj November 27, 2023 12:35

Fokko approved these changes Nov 27, 2023

samansmink added 4 commits November 28, 2023 12:15

Merge branch 'main' into parse_json_schema

74eaeb0

bump vcpkg in REST workflow

242fc5b

fix pr comments

63f20db

no need to install and load: iceberg is linked statically

db09227

samansmink removed the request for review from Tishj November 28, 2023 13:35

Tishj reviewed Nov 29, 2023

src/common/iceberg.cpp (outdated, resolved)

Tishj reviewed Nov 29, 2023

src/include/iceberg_metadata.hpp (outdated, resolved)

samansmink added 2 commits November 29, 2023 12:51

make string params const ref

553a0a7

pass by ref to make ownership stay at caller

6d4dbe8

samansmink merged commit e16988b into duckdb:main Nov 30, 2023
7 checks passed

samansmink deleted the parse_json_schema branch November 30, 2023 10:22

samansmink mentioned this pull request Dec 4, 2023

Table schema evolution support #33

Closed

samansmink mentioned this pull request Feb 22, 2024

0.10.0 Regression - Cannot parse metadata #39

Closed

mike-luabase pushed a commit to definite-app/duckdb_iceberg that referenced this pull request Oct 27, 2024

Merge pull request duckdb#30 from samansmink/parse_json_schema

45d489d

Parse schema from table metadata
