Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: include precision information in PrecisionTimestamp and PrecisionTimestampTZ literals #659

Merged

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jul 5, 2024

BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ

The PrecisionTimestamp and PrecisionTimestampTZ literals were introduced in #594, and there was some discussion about their contents in #594 (comment). In their current form, they only contain a i64 value, which I believe is meant to be a precision-dependent number of fractional seconds since epoch. However, the literals don't contain the precision itself, so it's impossible to interpret a PrecisionTimestamp or PrecisionTimestampTZ literal without other information. This is in contrast to e.g. varchar, whose literal does specify the length, or decimal, whose literal specifies scale and precision. @westonpace curious for your thoughts since you were part of that original discussion - am I missing something or is this a bug?

This PR changes the Literal types for PrecisionTimestamp and PrecisionTimestampTZ to contain a PrecisionTimestamp message instead of a i64. That message then contains the i64 value as well as the i32 type. This is a breaking change, but given in their current form these literals aren't usable (unless I'm missing something), would that be okay?

Currently I used the same message for both PrecisionTimestamp and PrecisionTimestampTZ, but I can also make a copy for PTTZ if that'd be better for forwards-compatibility.

…nTimestampTZ literals

BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ
@richtia
Copy link
Member

richtia commented Jul 5, 2024

The epoch i64 value itself should be enough to indicate the precision. Any additional digits after seconds will be for fractional seconds (and used to determine precision).

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch

Does that make the comment from Weston more clear?

@Blizzara
Copy link
Contributor Author

Blizzara commented Jul 5, 2024

The epoch i64 value itself should be enough to indicate the precision. Any additional digits after seconds will be for fractional seconds (and used to determine precision).

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch

Does that make the comment from Weston more clear?

Not really, could you ELI5 how you get the precision from the i64? 😅 ie how do you know which part of it indicates seconds?

@richtia
Copy link
Member

richtia commented Jul 5, 2024

Not really, could you ELI5 how you get the precision from the i64? 😅 ie how do you know which part of it indicates seconds?

For example

1720200424 -> Friday, July 5, 2024 5:27:04 PM
1720200424123 -> Friday, July 5, 2024 5:27:04.123 PM (3 - millisecond precision)
1720200424123456 -> Friday, July 5, 2024 5:27:04.123456 PM (6 - microsecond precision)
1720200424123456789 -> Friday, July 5, 2024 5:27:04.123456789 PM (9 - nanosecond precision)

We stuck with either microsecond or nanosecond because that seems to be what most engines are using.

westonpace
westonpace previously approved these changes Jul 5, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think I agree that there is no easy way to determine the precision from the literal alone. You could infer the precision by looking at other surrounding context (if I have ts > 123 I can look at the datatype of the ts expression) but that is a headache we don't enforce on other types (e.g. we don't enforce users infer the precision of an integer literal like SQL does). I would rather not introduce that.

So I agree a literal needs both precision AND value.

Comment on lines 862 to 864
int32 precision = 1;
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
int64 value = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int32 precision = 1;
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
int64 value = 2;
uint32 precision = 1;
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
uint64 value = 2;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I fixed the value to be uint64.

The precision is int32 in the type so I think it's better to keep it as such?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Keeping precision as int32 seems fine.

@richtia
Copy link
Member

richtia commented Jul 5, 2024

Makes sense. This should make it easier for users to understand. I imagine a few others would have eventually had confusion here as well

richtia
richtia previously approved these changes Jul 5, 2024
@Blizzara Blizzara dismissed stale reviews from richtia and westonpace via 49f1aca July 8, 2024 09:16
@Blizzara
Copy link
Contributor Author

Blizzara commented Jul 8, 2024

Thoughts on whether there should be one message (PrecisionTimestamp) shared across both, or two messages (PrecisionTimestamp and PrecisionTimestampTZ)? The types have separate messages, so maybe using separate messages here as well would be better to keep things aligned? But I dunno if it matters.

@westonpace
Copy link
Member

Thoughts on whether there should be one message (PrecisionTimestamp) shared across both, or two messages (PrecisionTimestamp and PrecisionTimestampTZ)?

I think one message is fine. E.g. we use i64 for int64 and time (and the legacy timestamp). We use bytes for binary, fixed_binary, and uuid. The reader / writer can always distinguish easily between timestamp and timestamp_tz using the message code (34/35)

westonpace
westonpace previously approved these changes Jul 8, 2024
@westonpace
Copy link
Member

This is a breaking change so I think we'll need 2 SMC approvals. @EpsilonPrime @vbarua @jacques-n

Blizzara added a commit to Blizzara/datafusion that referenced this pull request Jul 11, 2024
@@ -164,13 +164,15 @@ message Type {
}

message PrecisionTimestamp {
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microsecods, 9 is nanoseconds
// Defaults to 6
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to these changes, but how does Defaults to 6 work here given that default value if precision is not set will be 0, and we can't distinguish between

  • 0 was set to indicate seconds.
  • 0 was not set intentionally

Should we remove this entirely?
@westonpace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I did mean to remove that, as I don't think it works, like you say - 666aa60. Hope that's better!

(The commit did remove your approvals though @vbarua @westonpace )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I agree that a "default" can't exist here. The default of 6 is really "if you see a Timestamp or TimestampTz type then you can treat it as a PrecisionTimestamp or PrecisionTimestampTz type where the precision = 6.

vbarua
vbarua previously approved these changes Jul 11, 2024
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favour of making this more explicit.

@Blizzara Blizzara dismissed stale reviews from vbarua and westonpace via 666aa60 July 11, 2024 17:29
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one very minor change needed.

message PrecisionTimestamp {
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microseconds, 9 is nanoseconds
int32 precision = 1;
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp
// Time passed since 1970-01-01 00:00:00.000000 in UTC

Sorry, one small clarification. Timestamps are always in UTC, regardless of whether or not they are Timestamp or TimestampTZ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is that right? I think the docs for Timestamp and TimestampTZ disagree, also that’s just the difference between the TZ version and the non-TZ version. The TZ version doesn’t carry any timezone with it, but it counts time since utc epoch so it specifies a definite point in time - while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.

At least that’s the way I understand them, and it aligns with Spark and DataFusion timestamp types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.

My understanding is that you can still do certain things to a non-TZ timestamp:

  • Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
  • Assign a timezone to it to convert it to a -TZ timestamp

In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"

In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.

I checked Spark, DataFusion, Parquet and Postgres docs, and as far as I understand them they all say the non-TZ timestamp bears no notion of timezone at all (ie. it's time passed since some epoch, not UTC epoch), while the yes-TZ timestamp is specifically time passed since UTC epoch.

My understanding is that you can still do certain things to a non-TZ timestamp:

  • Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
  • Assign a timezone to it to convert it to a -TZ timestamp

Yes, the non-TZ timestamp does indicate wall clock time - but it will represent the same wall clock time of e.g. 10PM on Thursday, July 11th, 2024 no matter in which timezone you look at it. And you can convert it to a -TZ timestamp by providing information about which timezone's wall clock time it corresponds to.

In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"

Well yes, it's not just a random number, but time passed since 1.1.1970.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes, it's not just a random number, but time passed since 1.1.1970.

Ok. So if a producer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720735200 (seconds since 1.1.1970 in UTC) and my consumer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720760400 (seconds since 1.1.1970 in PT) then I think we are going to have a problem.

Often people say "seconds since the UNIX epoch" but the UNIX epoch is defined as 1.1.1970 UTC and not 1.1.1970 in some arbitrary time zone.

the non-TZ timestamp bears no notion of timezone at all

Correct. We are supposed to think of these values semantically as a wall clock time. However, we still need to agree on how we encode wall clock times into a proto literal. If we represent wall clock times as ISO-8601 strings (with no TZ part) then it would be fine. There would be no need to mention time zones. However, if we choose to encode wall clock times as a u64 then we must mention how that works. The way that works is by encoding "the wall clock time X" as "seconds since the epoch (UTC) where a wall clock in a UTC time zone would display X"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't necessitate specifying UTC. You can as well say "seconds since the epoch (in arbitrary timezone TZ) where a wall clock in the same timezone TZ would display X". That's what I assume is the point of the no-TZ timestamp :)

Very well then. You are a consumer. I have sent you a plan with a PrecisionTimestamp literal. The proto is:

{
  precision: 0,
  value: 1720794104
}

Can you please format this as a string of the form MONTH/DAY/YEAR HOURS:MINUTES:SECONDS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well then. You are a consumer. I have sent you a plan with a PrecisionTimestamp literal. The proto is:

{
  precision: 0,
  value: 1720794104
}

Can you please format this as a string of the form MONTH/DAY/YEAR HOURS:MINUTES:SECONDS?

I'm not sure I understand where you're getting with this, but I believe that'd be July 12, 2024 14:21:44 (well, with a slightly different format string).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that answer is wrong. The correct answer is July 12, 2024 7:21:44 AM.

I believe I followed the encoding rules correctly. 1720794104 is the number of seconds that have passed since 1.1.1970 PT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think I see my mistake. I'm thinking we are interpreting this as

"The time shown on a wall clock in UTC after X seconds have passed since Y" (in which case Y's timezone matters)

You are perhaps interpreting it as

"The time shown on a wall clock in timezone Z after X seconds have passed since Y" (in which case Y's time zone doesn't matter as long as Z matches it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right in how I interpret it 👍

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see you are mirroring the site docs. Let's tackle this in a future PR.

@westonpace westonpace merged commit f9e5f9c into substrait-io:main Jul 11, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants