Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Iceberg OPTIMIZE with WHERE casting timestamp_tz column to a date #12918

Merged
merged 7 commits into from
Jun 22, 2022

Conversation

findepi
Copy link
Member

@findepi findepi commented Jun 21, 2022

Follow-up to #12795

Among other things, that PR added support for

-- regardless of session zone
ALTER TABLR iceberg_table EXECUTE optimize WHERE c_timestamp_tz > with_timezone(a_date_constant, 'UTC');

-- only when session zone is UTC
ALTER TABLR iceberg_table EXECUTE optimize WHERE c_timestamp_tz > a_date_constant

This PR adds support for

-- regardless of session zone
ALTER TABLR iceberg_table EXECUTE optimize WHERE CAST(c_timestamp_tz AS date) > a_date_constant

Further enhances #7905
Fixes #12362

Comment on lines +98 to +99
* {@link TimestampWithTimeZoneType}. Such unwrap would not be monotonic. Within Iceberg, we know
* that {@link TimestampWithTimeZoneType} is always in UTC zone (point in time, with no time zone information),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future: I wonder if we can communicate a constraint on a type from connector to engine so standard optimizer can handle that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could do that as part of io.trino.spi.connector.ColumnMetadata.

Or, we could use a dedicate type: #2273
A type may be better because this limitation -- storing point in time only, without time zone -- isn't really specific to Iceberg.

@findepi findepi force-pushed the findepi/iceberg-subsume-with-cast branch from 9be3f52 to 412e967 Compare June 21, 2022 14:38
Map<ColumnHandle, Domain> enforcedDomains = enforcedDomainsBuilder.buildOrThrow();
checkArgument(
enforcedDomains.size() + unenforcedDomains.size() == predicateDomains.size(),
"Enforced tuple domain cannot be determined. Connector returned an unenforced TupleDomain %s that contains columns not in predicate %s.",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -437,9 +437,8 @@ public static TupleDomain<ColumnHandle> computeEnforced(TupleDomain<ColumnHandle
if (unenforcedDomains.containsKey(predicateColumnHandle)) {
Domain unenforcedDomain = unenforcedDomains.get(predicateColumnHandle);
checkArgument(
predicateDomain.equals(unenforcedDomain),
"Enforced tuple domain cannot be determined. The connector is expected to enforce the respective domain entirely on none, some, or all of the column. " +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi findepi force-pushed the findepi/iceberg-subsume-with-cast branch from 412e967 to 1ad18f7 Compare June 21, 2022 14:52
findepi added 7 commits June 21, 2022 22:53
When `IcebergMetadata.applyFilter` is invoked with `Constraint` that has
no useful summary (`TupleDomain`) and carries only expression or
functional predicate, we can short-circuit the method execution.
Also improve wording.
`unwrapTimestampToDateCast` assumes the target type is `DATE` (and is
invoked only when it is), so passing `targetType` is redundant.
The check dates back to the time when `TupleDomain` was the only
information passed to the connector in the
`ConnectorMetadata.applyFilter`. Its purpose was to ensure the connector
does not erroneously return some new column constraints (`Domain`
objects) in `ConstraintApplicationResult.remainingFilter`.

With expression-based pushdown, the check is no longer valid. A
connector may be able to translate `Constraint.expression` (or part
thereof) into a `Domain` / `TupleDomain` and then enforce it, or return
such simplified representation as a remaining `TupleDomain`
(`ConstraintApplicationResult.remainingFilter`).
@findepi findepi force-pushed the findepi/iceberg-subsume-with-cast branch from 1ad18f7 to d6f2da1 Compare June 21, 2022 20:55
@findepi
Copy link
Member Author

findepi commented Jun 21, 2022

(just rebased to resolve a conflict with #12911)

@erichwang
Copy link
Contributor

erichwang commented Jun 22, 2022

@findepi Just a quick question:

-- regardless of session zone
ALTER TABLR iceberg_table EXECUTE optimize WHERE CAST(c_timestamp_tz AS date) > a_date_constant

This wouldn't work for Iceberg time partitions since c_timestamp_tz could be in any time zone right? I thought we had discussed making it look like:

-- regardless of session zone
ALTER TABLR iceberg_table EXECUTE optimize WHERE CAST(c_timestamp_tz AT TIME ZONE 'UTC' AS date) > a_date_constant

To force the same timezone?

Side note: that being said, it looks like if we just do with_timezone(a_date_constant, 'UTC') in our query, that might already get us what we need, making this change more of a nice to have.

@findepi
Copy link
Member Author

findepi commented Jun 22, 2022

Just a quick question:

-- regardless of session zone
ALTER TABLR iceberg_table EXECUTE optimize WHERE CAST(c_timestamp_tz AS date) > a_date_constant

This wouldn't work for Iceberg time partitions since c_timestamp_tz could be in any time zone right?

in Iceberg, timestamp with time zone values are all in UTC. Iceberg stores point in time, but not the time zone.

Side note: that being said, it looks like if we just do with_timezone(a_date_constant, 'UTC') in our query, that might already get us what we need, making this change more of a nice to have.

Yep, this is already supported, but may be less intuitive to users.
CAST(c_timestamp_tz AS date) > a_date_constant looks simpler than
c_timestamp_tz > with_timezone(a_date_constant, 'UTC')

fortunately, it's or, not xor. People can choose whichever they prefer.

@findepi findepi merged commit 49be4c2 into trinodb:master Jun 22, 2022
@findepi findepi mentioned this pull request Jun 22, 2022
@github-actions github-actions bot added this to the 387 milestone Jun 22, 2022
@findepi findepi deleted the findepi/iceberg-subsume-with-cast branch June 22, 2022 11:23
@colebow
Copy link
Member

colebow commented Jun 22, 2022

I think we should document this so users know how to leverage this...

@erichwang
Copy link
Contributor

Yes, definitely prefer CAST(c_timestamp_tz AS date) > a_date_constant if possible. That's how I normally think about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Ability to OPTIMIZE a single time-based partition in Iceberg
4 participants