Add support for `add_files` and `add_files_from_table` procedures in Iceberg #22751

ebyhr · 2024-07-22T08:08:16Z

Description

Spark supports adding files from tables or locations with the add_files procedure. This procedure is helpful for migrating specific Hive partitions or importing files that already exist in storage. It would be nice to support the same procedure in the Trino Iceberg connector.

This PR adds the add_files and add_files_from_table procedures.

In the add_files procedure, the `recursive_directory` argument is optional:

ALTER TABLE testdb.testdb EXECUTE add_files(
    location => 's3://my-bucket/a/path',
    format => 'ORC',
    [recursive_directory => 'true'])

In the add_files_from_table procedure, the partition_filter and recursive_directory arguments are optional:

ALTER TABLE testdb.testdb EXECUTE add_files_from_table(
    schema_name => 'testdb',
    table_name => 'hive_customer_orders',
    [partition_filter => map(ARRAY['region'], ARRAY['AMERICA'])], 
    [recursive_directory => 'true'])
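
To make the intent of partition_filter concrete, here is a minimal Java sketch of how a filter map such as `region => AMERICA` could select Hive-style partitions. It is an illustration only, not the connector's code: the class `PartitionFilterSketch`, both method names, and the `key=value/key=value` partition naming are assumptions, and real Hive partition names may additionally be escaped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; none of these names exist in Trino.
final class PartitionFilterSketch
{
    private PartitionFilterSketch() {}

    // Parses a Hive-style partition name such as "region=AMERICA/year=2024" into key/value pairs.
    static Map<String, String> parsePartitionName(String partitionName)
    {
        Map<String, String> values = new LinkedHashMap<>();
        for (String part : partitionName.split("/")) {
            int separator = part.indexOf('=');
            if (separator <= 0) {
                throw new IllegalArgumentException("Invalid partition segment: " + part);
            }
            values.put(part.substring(0, separator), part.substring(separator + 1));
        }
        return values;
    }

    // Keeps only the partitions whose values equal every entry of the filter.
    // An empty filter keeps all partitions.
    static List<String> selectPartitions(List<String> partitionNames, Map<String, String> filter)
    {
        return partitionNames.stream()
                .filter(name -> parsePartitionName(name).entrySet().containsAll(filter.entrySet()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> partitions = List.of("region=AMERICA/year=2024", "region=ASIA/year=2024");
        System.out.println(selectPartitions(partitions, Map.of("region", "AMERICA")));
    }
}
```

Matching on exact string equality keeps the sketch small; it says nothing about how typed partition values are compared in the actual procedure.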

The add_files procedure is disabled by default and must be enabled with the iceberg.add-files-procedure.enabled config property, because OSS Trino doesn't support location-based access control.
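
As a rough illustration of the opt-in behavior, the sketch below gates a call on the iceberg.add-files-procedure.enabled property and treats a missing value as disabled. It uses plain `java.util.Properties` for brevity; the class `AddFilesGateSketch` and its methods are hypothetical and do not reflect how the connector actually binds its configuration.

```java
import java.util.Properties;

// Hypothetical gate; the real connector reads this flag through its own config classes.
final class AddFilesGateSketch
{
    static final String PROPERTY = "iceberg.add-files-procedure.enabled";

    private AddFilesGateSketch() {}

    // A missing property is treated as "false", matching the disabled-by-default behavior.
    static boolean isAddFilesEnabled(Properties catalogProperties)
    {
        return Boolean.parseBoolean(catalogProperties.getProperty(PROPERTY, "false"));
    }

    // Rejects the call unless the property was explicitly set to true.
    static void checkAddFilesEnabled(Properties catalogProperties)
    {
        if (!isAddFilesEnabled(catalogProperties)) {
            throw new IllegalStateException("add_files procedure is disabled; set " + PROPERTY + "=true to enable it");
        }
    }

    public static void main(String[] args)
    {
        Properties properties = new Properties();
        System.out.println(isAddFilesEnabled(properties));
        properties.setProperty(PROPERTY, "true");
        System.out.println(isAddFilesEnabled(properties));
    }
}
```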

Fixes #11744

Release notes

# Iceberg
* Add support for `add_files` and `add_files_from_table` procedures. ({issue}`11744`)

alexjo2144 · 2024-07-23T19:19:14Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java

+            checkProcedureArgument(
+                    table.schemas().size() == sourceTable.getDataColumns().size(),
+                    "Data column count mismatch: %d vs %d", table.schemas().size(), sourceTable.getDataColumns().size());
+            for (Column sourceColumn : sourceTable.getDataColumns()) {
+                Types.NestedField targetColumn = schema.caseInsensitiveFindField(sourceColumn.getName());
+                if (targetColumn == null) {
+                    throw new TrinoException(COLUMN_NOT_FOUND, "Column '%s' does not exist".formatted(sourceColumn.getName()));
+                }
+                ColumnIdentity columnIdentity = createColumnIdentity(targetColumn);
+                org.apache.iceberg.types.Type sourceColumnType = toIcebergType(typeManager.getType(getTypeSignature(sourceColumn.getType(), DEFAULT_PRECISION)), columnIdentity);
+                if (!targetColumn.type().equals(sourceColumnType)) {
+                    throw new TrinoException(TYPE_MISMATCH, "Expected target '%s' type, but got source '%s' type".formatted(targetColumn.type(), sourceColumnType));
+                }
+            }


These checks may be a little strict: if Iceberg supports coercion from the source column type to the target type, the files should be usable as is.

Having more columns in the source than in the target also doesn't hurt; we'll just ignore them.

It is harder to mess up a call to the procedure this way, but I guess it's a question of how we expect people to use this: with perfectly matching schemas, or with ones that have been tweaked a little.

I think we should start with strict checks and tweak them based on user feedback.

The common use case looks like adding files from Hive partitions after the initial migration. For instance, migrating the entire table on July 24, adding files from the July 25 partition the next day, and so on until the end of the migration.
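
The day-by-day workflow could be scripted roughly as follows. This Java sketch only builds one add_files_from_table statement per day, using the argument names from the PR description. The class name, the date-valued partition column `ds`, and the table names in `main` are assumptions for illustration, and the inputs are assumed to be trusted because nothing is escaped.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical statement generator; it does not execute anything against Trino.
final class IncrementalAddFilesSketch
{
    private IncrementalAddFilesSketch() {}

    // Builds one statement per day in [firstDay, lastDay], filtering on a single partition column.
    static List<String> statementsFor(
            String targetTable,
            String sourceSchema,
            String sourceTable,
            String partitionColumn,
            LocalDate firstDay,
            LocalDate lastDay)
    {
        List<String> statements = new ArrayList<>();
        for (LocalDate day = firstDay; !day.isAfter(lastDay); day = day.plusDays(1)) {
            // LocalDate formats as ISO yyyy-MM-dd, which is assumed to match the partition values.
            statements.add(String.format(
                    "ALTER TABLE %s EXECUTE add_files_from_table(schema_name => '%s', table_name => '%s', partition_filter => map(ARRAY['%s'], ARRAY['%s']))",
                    targetTable, sourceSchema, sourceTable, partitionColumn, day));
        }
        return statements;
    }

    public static void main(String[] args)
    {
        statementsFor("testdb.testdb", "testdb", "hive_customer_orders", "ds", LocalDate.of(2024, 7, 25), LocalDate.of(2024, 7, 27))
                .forEach(System.out::println);
    }
}
```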

Does Spark's version of add_files enforce strict checks, or is it permissive?

I'm also in favor of defaulting to strict mode, but giving the user the ability to disable it with a non-strict boolean flag.
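
The trade-off discussed in this thread can be sketched in a few lines of Java. The strict branch mirrors the checks in the diff above (equal column count, case-insensitive name lookup, exact type match). The non-strict branch is purely hypothetical: it ignores extra source columns and allows target columns missing from the source, and it does not model type coercion. Column types are plain strings here, and `SchemaCheckSketch` and the `strict` flag are assumptions, not the merged API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of strict versus non-strict schema validation.
final class SchemaCheckSketch
{
    private SchemaCheckSketch() {}

    // Both maps go from column name to a type name such as "bigint".
    static void checkCompatible(Map<String, String> target, Map<String, String> source, boolean strict)
    {
        if (strict && target.size() != source.size()) {
            throw new IllegalArgumentException(
                    String.format("Data column count mismatch: %d vs %d", target.size(), source.size()));
        }
        // Index target columns by lower-cased name for case-insensitive lookup.
        Map<String, String> targetTypes = new HashMap<>();
        target.forEach((name, type) -> targetTypes.put(name.toLowerCase(Locale.ENGLISH), type));

        for (Map.Entry<String, String> column : source.entrySet()) {
            String targetType = targetTypes.get(column.getKey().toLowerCase(Locale.ENGLISH));
            if (targetType == null) {
                if (strict) {
                    throw new IllegalArgumentException("Column '" + column.getKey() + "' does not exist");
                }
                continue; // non-strict: an extra source column is ignored
            }
            // Types must match exactly in both modes; coercion is out of scope for this sketch.
            if (!targetType.equals(column.getValue())) {
                throw new IllegalArgumentException(
                        "Expected target '" + targetType + "' type, but got source '" + column.getValue() + "' type");
            }
        }
    }

    public static void main(String[] args)
    {
        Map<String, String> target = Map.of("id", "bigint", "name", "varchar");
        Map<String, String> source = Map.of("ID", "bigint", "name", "varchar", "extra", "date");
        checkCompatible(target, source, false);
        try {
            checkCompatible(target, source, true);
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```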

SemionPar

LGTM!

alonahmias · 2024-07-29T21:53:55Z

What about sorted by? Is there a way to also allow sorted by columns in this way?

raunaqmorarka · 2024-08-30T07:15:38Z

@ebyhr please update the description to match the updated syntax

martint · 2024-09-24T16:02:00Z

Let's split the function for adding files from a path and from another table into separate ones. Having overloads where most of the arguments are different is error-prone and confusing for users.

ebyhr · 2024-09-25T04:10:17Z

@martint Separated into add_files_from_location and add_files_from_table procedures. Let me know if you want to rename them.

martint · 2024-09-30T17:57:40Z

A couple of questions/comments:

The _from_location suffix is unnecessary. Since we're talking about files, it's natural to think they are in object storage, at a given location. The _from_table suffix is necessary, though: 1) to avoid the overload, and 2) because loading files from a table is a special case.
Do we expect to be able to add files from other sources (e.g., Delta Lake tables) at some point?

ebyhr · 2024-09-30T22:24:49Z

@martint Removed the _from_location suffix. There is no concrete plan to support other table formats, but it's a potential enhancement. In that case, I expect we would change the add_files_from_table behavior internally instead of adding a new procedure, so that users don't have to change the procedure name based on the source table type.

cla-bot bot added the cla-signed label Jul 22, 2024

github-actions bot added docs iceberg Iceberg connector labels Jul 22, 2024

ebyhr added the syntax-needs-review label Jul 22, 2024

ebyhr requested review from raunaqmorarka, alexjo2144, SemionPar and mayankvadariya July 23, 2024 10:59

mayankvadariya approved these changes Jul 23, 2024

alexjo2144 reviewed Jul 23, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from 627e63d to 46e9950 Compare July 24, 2024 06:19

ebyhr requested a review from findinpath July 24, 2024 13:50

SemionPar approved these changes Jul 25, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from 46e9950 to fe148d7 Compare July 25, 2024 23:45

findinpath requested a review from pajaks August 13, 2024 09:27

ebyhr force-pushed the ebi/iceberg-add-files branch from fe148d7 to ec6b98b Compare August 14, 2024 11:45

raunaqmorarka reviewed Aug 15, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from ec6b98b to 4b1f362 Compare August 16, 2024 02:18

pajaks reviewed Aug 16, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from 4b1f362 to 0462ad7 Compare August 30, 2024 01:16

raunaqmorarka approved these changes Aug 30, 2024

findinpath reviewed Aug 30, 2024

findinpath reviewed Aug 30, 2024

pajaks reviewed Sep 6, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from 0462ad7 to 7760d49 Compare September 22, 2024 03:22

ebyhr added 2 commits September 25, 2024 08:40

Move package of TestIcebergMigrateProcedure

eeaa7a7

Extract helper method from MigrateProcedure

6374c4f

Pass file length in MigrationUtils

96a9703

ebyhr force-pushed the ebi/iceberg-add-files branch from 7760d49 to b56c022 Compare September 25, 2024 04:07

ebyhr changed the title ~~Add support for add_files procedure in Iceberg~~ Add support for add_files_from_location and add_files_from_table procedures in Iceberg Sep 25, 2024

pajaks reviewed Sep 26, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from b56c022 to f5fda78 Compare September 26, 2024 11:49

pajaks approved these changes Sep 27, 2024

findinpath reviewed Sep 30, 2024

findinpath reviewed Sep 30, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from f5fda78 to 9e7efc6 Compare September 30, 2024 21:38

ebyhr changed the title ~~Add support for add_files_from_location and add_files_from_table procedures in Iceberg~~ Add support for add_files and add_files_from_table procedures in Iceberg Sep 30, 2024

ebyhr force-pushed the ebi/iceberg-add-files branch from 9e7efc6 to ef8dd05 Compare September 30, 2024 21:41

Add procedures to add files in Iceberg

c6cc6d1

Add add_files_from_table and add_files procedures in Iceberg connector. The add_files procedure is disabled by default because location-based access control is not supported in Trino.

ebyhr force-pushed the ebi/iceberg-add-files branch from ef8dd05 to c6cc6d1 Compare September 30, 2024 21:49

ebyhr requested a review from martint October 2, 2024 02:44

martint removed the syntax-needs-review label Oct 3, 2024

ebyhr merged commit 25b3c46 into trinodb:master Oct 4, 2024
47 checks passed

ebyhr deleted the ebi/iceberg-add-files branch October 4, 2024 09:10

github-actions bot added this to the 461 milestone Oct 4, 2024

ebyhr mentioned this pull request Oct 4, 2024

Add non-strict option to add_files_from_table procedure #23677

Open

This was referenced Oct 4, 2024

Add Trino 461 release notes #23669

Merged

Improve docs for add file procedures in Iceberg #23717

Merged

alexjo2144 Jul 23, 2024

ebyhr Jul 24, 2024

cwsteinbach Sep 25, 2024

ebyhr Oct 4, 2024