Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for add_files and add_files_from_table procedures in Iceberg #22751

Merged
merged 4 commits into from
Oct 4, 2024

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jul 22, 2024

Description

Spark supports adding files from tables or locations with add_files procedure. This procedure is helpful for migrating specific Hive partitions, or importing files on the storage. It would be nice to support the same procedure in Trino Iceberg connector.

This PR adds add_files and add_files_from_table procedures.

In add_files procedure, `recursive_directory argument is optional:

ALTER TABLE testdb.testdb EXECUTE add_files(
    location => 's3://my-bucket/a/path',
    format => 'ORC',
    [recursive_directory => 'true'])

In add_files_from_table procedure, partition_filter and recursive_directory arguments are optional:

ALTER TABLE testdb.testdb EXECUTE add_files_from_table(
    schema_name => 'testdb',
    table_name => 'hive_customer_orders',
    [partition_filter => map(ARRAY['region'], ARRAY['AMERICA'])], 
    [recursive_directory => 'true'])

The add_files procedure is disabled by default with iceberg.add-files-procedure.enabled config property because OSS Trino doesn't support location based access control.

Fixes #11744

Release notes

# Iceberg
* Add support for `add_files` and `add_files_from_table` procedures. ({issue}`11744`)

Comment on lines 257 to 269
checkProcedureArgument(
table.schemas().size() == sourceTable.getDataColumns().size(),
"Data column count mismatch: %d vs %d", table.schemas().size(), sourceTable.getDataColumns().size());
for (Column sourceColumn : sourceTable.getDataColumns()) {
Types.NestedField targetColumn = schema.caseInsensitiveFindField(sourceColumn.getName());
if (targetColumn == null) {
throw new TrinoException(COLUMN_NOT_FOUND, "Column '%s' does not exist".formatted(sourceColumn.getName()));
}
ColumnIdentity columnIdentity = createColumnIdentity(targetColumn);
org.apache.iceberg.types.Type sourceColumnType = toIcebergType(typeManager.getType(getTypeSignature(sourceColumn.getType(), DEFAULT_PRECISION)), columnIdentity);
if (!targetColumn.type().equals(sourceColumnType)) {
throw new TrinoException(TYPE_MISMATCH, "Expected target '%s' type, but got source '%s' type".formatted(targetColumn.type(), sourceColumnType));
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks may be a little strict, if Iceberg supports coercion from the source column to the target type it should be okay as is.

Having additional columns in the source than the target also doesn't hurt, we'll just ignore them.

It is harder to mess up a call to the procedure this way, but I guess it's a question of how we expect people to use this. With perfectly matching schemas, or ones that have been tweaked a little.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start with strict checks and tweak them based on user feedback.

The common use case looks adding files from Hive partitions after the initial migration. For instance, migrating the entire table on July 24, adding files from July 25 partition next day, .... until the end of migration.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Spark's version of add_files enforce strict checks, or is it permissive?

I'm also in favor of defaulting to strict mode, but giving the user the ability disable it with a non-strict boolean flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@SemionPar SemionPar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alonahmias
Copy link

alonahmias commented Jul 29, 2024

What about sorted by? Is there a way to also allow sorted by columns in this way?

@raunaqmorarka
Copy link
Member

@ebyhr please update description to match updated syntax

@martint
Copy link
Member

martint commented Sep 24, 2024

Let's split the function for adding files from a path and from another table into separate ones. Having overloads where most of the arguments are different is error prone and confusing for users.

@ebyhr ebyhr changed the title Add support for add_files procedure in Iceberg Add support for add_files_from_location and add_files_from_table procedures in Iceberg Sep 25, 2024
@ebyhr
Copy link
Member Author

ebyhr commented Sep 25, 2024

@martint Separated into add_files_from_location and add_files_from_table procedures. Let me know if you want to rename them.

docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
@martint
Copy link
Member

martint commented Sep 30, 2024

A couple of questions/comments:

  • The _from_location suffix is necessary. Since we're talking about files, it's natural to think they are in object storage, at a given location. The _from_table suffix is necessary, though 1) to avoid the overload 2) since loading files from a table is a special case.
  • Do we expect to be able to add files from other sources (e.g., Delta Lake tables) at some point?

@ebyhr ebyhr changed the title Add support for add_files_from_location and add_files_from_table procedures in Iceberg Add support for add_files and add_files_from_table procedures in Iceberg Sep 30, 2024
Add add_files_from_table and add_files procedures
in Iceberg connector. The add_files procedure is
disabled by deafult because location based access conrol
is not supported in Trino.
@ebyhr
Copy link
Member Author

ebyhr commented Sep 30, 2024

@martint Removed _from_location suffix. There is no concrete plan to support other table formats, but it's a potential enhancement. I expect changing add_files_from_table behavior internally instead of adding a new procedure so that users can avoid rewriting procedure name based on source table types.

@ebyhr ebyhr requested a review from martint October 2, 2024 02:44
@ebyhr ebyhr merged commit 25b3c46 into trinodb:master Oct 4, 2024
47 checks passed
@ebyhr ebyhr deleted the ebi/iceberg-add-files branch October 4, 2024 09:10
@github-actions github-actions bot added this to the 461 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Add add_files procedure in Iceberg connector
10 participants