Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use session TimeZone for timestamp_tz column #8680

Closed
yanqinghe opened this issue Jul 27, 2021 · 5 comments
Closed

Use session TimeZone for timestamp_tz column #8680

yanqinghe opened this issue Jul 27, 2021 · 5 comments

Comments

@yanqinghe
Copy link

yanqinghe commented Jul 27, 2021

when query timestamp_tz column data from iceberg table, I found that Trino uses the UTC time zone to process these columns, which increases the cost of using queries.
For example:

select ts for table;

specify time zone

select ts from ts AT TIME ZONE 'Asia/Shanghai' from table;

Can we use session time zone to process these fields with time zone so that users do not need to make additional time zone conversion when querying?

There are places in the processing logic of Parquet/ORC file where the time zone can be modified. I don't know whether this is reasonable.

I hope to get better suggestions. Welcome to discuss

@findepi
Copy link
Member

findepi commented Jul 28, 2021

Iceberg's timestamptz is a point in time (without zone).

Mapping to session zone would have representability consequences as in #5488.

@yanqinghe
Copy link
Author

Although iceberg's timestamptz format can only represent one time point physically, iceberg connector stipulates that this time point is UTC time zone, that is, it includes the concept of time zone.

We think that there is no problem in processing time with time zone according to the local time zone. After all, if necessary, the time zone can be converted to UTC again.

In the Hive connector, we found that when processing the int96 format in parquet, the timezone in session is used when processing the column with the field type of timestampwithtimezonetype, but when processing the Int64 format in parquet, the UTC time zone is used by default for the column with time zone.

The above implementation can be seen in io.trino.parquet.reader.primitivecolumnreader.

Therefore, we think that we can use session timezone when processing iceberg's timestamptz format.

@findepi
Copy link
Member

findepi commented Aug 4, 2021

Although iceberg's timestamptz format can only represent one time point physically, iceberg connector stipulates that this time point is UTC time zone, that is, it includes the concept of time zone.

yes, because `timestamp with time zone is the only Trino type that can be used to represent a point in time.
we don't have an instant type, for example (#2273)

We think that there is no problem in processing time with time zone according to the local time zone.

depending on the local time zone and actual data, their indeed might be no problem, or there may be

In the Hive connector, we found that when processing the int96 format in parquet, the timezone in session is used when processing the column with the field type of timestampwithtimezonetype,

that looks like a bug in Hive connector, because some values can be misrepresented, or may hit the #5781 problem

@Poncirus
Copy link

Poncirus commented Aug 9, 2021

Although all problems could be resolved by converting to UTC again, problems related to DST are often ambiguous and inexplicit. That is the reason why you would like to use UTC as default time zone. Is my understanding correct?

Like for #5488, if user convert time zone to 'America/Los_Angeles', they will still get the same result.

SELECT INCREMENT,
       t AT TIME ZONE 'America/Los_Angeles'
FROM
  (SELECT INCREMENT,
          from_unixtime(base + INCREMENT * 3600, 'UTC') AS t
   FROM (
         VALUES(1572766200)) t(base)
   CROSS JOIN UNNEST(sequence(0, 3)) u(INCREMENT))

This conversion must be obvious, and should not be done by connector.

@findepi
Copy link
Member

findepi commented Sep 17, 2021

Note that:

trino> SELECT base, increment, from_unixtime(base + increment * 3600, 'America/Los_Angeles')
    -> FROM (VALUES to_unixtime(TIMESTAMP '2021-11-07 00:30:00 America/Los_Angeles')) t(base)
    -> CROSS JOIN UNNEST(sequence(0, 3)) u(increment);
    base     | increment |                    _col2
-------------+-----------+---------------------------------------------
 1.6362702E9 |         0 | 2021-11-07 00:30:00.000 America/Los_Angeles
 1.6362702E9 |         1 | 2021-11-07 01:30:00.000 America/Los_Angeles
 1.6362702E9 |         2 | 2021-11-07 01:30:00.000 America/Los_Angeles
 1.6362702E9 |         3 | 2021-11-07 02:30:00.000 America/Los_Angeles

two different points in time converted to timestamp with time zone in PT zone are then getting returned to the user with identical representation.

Also, the fact that engine discerns them (even though they look the same) is a bug tracked in #5781

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants