-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: include precision information in PrecisionTimestamp and PrecisionTimestampTZ literals #659
fix: include precision information in PrecisionTimestamp and PrecisionTimestampTZ literals #659
Conversation
…nTimestampTZ literals BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ
The epoch i64 value itself should be enough to indicate the precision. Any additional digits after seconds will be for fractional seconds (and used to determine precision).
Does that make the comment from Weston more clear? |
Not really, could you ELI5 how you get the precision from the i64? 😅 ie how do you know which part of it indicates seconds? |
For example
We stuck with either microsecond or nanosecond because that seems to be what most engines are using. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think I agree that there is no easy way to determine the precision from the literal alone. You could infer the precision by looking at other surrounding context (if I have ts > 123 I can look at the datatype of the ts
expression) but that is a headache we don't enforce on other types (e.g. we don't enforce users infer the precision of an integer literal like SQL does). I would rather not introduce that.
So I agree a literal needs both precision AND value.
proto/substrait/algebra.proto
Outdated
int32 precision = 1; | ||
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp | ||
int64 value = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int32 precision = 1; | |
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp | |
int64 value = 2; | |
uint32 precision = 1; | |
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp | |
uint64 value = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I fixed the value to be uint64.
The precision is int32 in the type so I think it's better to keep it as such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Keeping precision as int32 seems fine.
Makes sense. This should make it easier for users to understand. I imagine a few others would have eventually had confusion here as well |
Thoughts on whether there should be one message (PrecisionTimestamp) shared across both, or two messages (PrecisionTimestamp and PrecisionTimestampTZ)? The types have separate messages, so maybe using separate messages here as well would be better to keep things aligned? But I dunno if it matters. |
I think one message is fine. E.g. we use |
This is a breaking change so I think we'll need 2 SMC approvals. @EpsilonPrime @vbarua @jacques-n |
to align with spark https://spark.apache.org/docs/3.4.2/api/java/org/apache/spark/sql/types/TimestampType.html until we have proper PrecisionTimestamp substrait-io/substrait#659
@@ -164,13 +164,15 @@ message Type { | |||
} | |||
|
|||
message PrecisionTimestamp { | |||
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microsecods, 9 is nanoseconds | |||
// Defaults to 6 | |||
int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to these changes, but how does Defaults to 6
work here given that default value if precision is not set will be 0, and we can't distinguish between
- 0 was set to indicate seconds.
- 0 was not set intentionally
Should we remove this entirely?
@westonpace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I did mean to remove that, as I don't think it works, like you say - 666aa60. Hope that's better!
(The commit did remove your approvals though @vbarua @westonpace )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I agree that a "default" can't exist here. The default of 6 is really "if you see a Timestamp
or TimestampTz
type then you can treat it as a PrecisionTimestamp
or PrecisionTimestampTz
type where the precision = 6
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very much in favour of making this more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one very minor change needed.
message PrecisionTimestamp { | ||
// Sub-second precision, 0 means the value given is in seconds, 3 is milliseconds, 6 microseconds, 9 is nanoseconds | ||
int32 precision = 1; | ||
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Time passed since 1970-01-01 00:00:00.000000 in UTC for PrecisionTimestampTZ and unspecified timezone for PrecisionTimestamp | |
// Time passed since 1970-01-01 00:00:00.000000 in UTC |
Sorry, one small clarification. Timestamps are always in UTC, regardless of whether or not they are Timestamp
or TimestampTZ
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm is that right? I think the docs for Timestamp and TimestampTZ disagree, also that’s just the difference between the TZ version and the non-TZ version. The TZ version doesn’t carry any timezone with it, but it counts time since utc epoch so it specifies a definite point in time - while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.
At least that’s the way I understand them, and it aligns with Spark and DataFusion timestamp types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while the non-TZ timestamp doesn’t, and is only useful relative to another no -TZ timestamp.
My understanding is that you can still do certain things to a non-TZ timestamp:
- Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
- Assign a timezone to it to convert it to a -TZ timestamp
In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"
In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pretty much all cases I'm aware of the non-TZ timstamps are still stored as UTC instants where the wall clock (in a UTC time zone) would display the desired value.
I checked Spark, DataFusion, Parquet and Postgres docs, and as far as I understand them they all say the non-TZ timestamp bears no notion of timezone at all (ie. it's time passed since some epoch, not UTC epoch), while the yes-TZ timestamp is specifically time passed since UTC epoch.
My understanding is that you can still do certain things to a non-TZ timestamp:
- Display the wall-clock time represented by the timestamp (e.g. 10PM on Thursday, July 11th, 2024)
- Assign a timezone to it to convert it to a -TZ timestamp
Yes, the non-TZ timestamp does indicate wall clock time - but it will represent the same wall clock time of e.g. 10PM on Thursday, July 11th, 2024 no matter in which timezone you look at it. And you can convert it to a -TZ timestamp by providing information about which timezone's wall clock time it corresponds to.
In order for a consumer to do either of those things there needs to be some meaning behind the literal beyond "just some random number that has appropriate relative distance from other values in this column"
Well yes, it's not just a random number, but time passed since 1.1.1970.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well yes, it's not just a random number, but time passed since 1.1.1970.
Ok. So if a producer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720735200 (seconds since 1.1.1970 in UTC) and my consumer encodes the literal "10PM on Thursday, July 11th, 2024" as 1720760400 (seconds since 1.1.1970 in PT) then I think we are going to have a problem.
Often people say "seconds since the UNIX epoch" but the UNIX epoch is defined as 1.1.1970 UTC and not 1.1.1970 in some arbitrary time zone.
the non-TZ timestamp bears no notion of timezone at all
Correct. We are supposed to think of these values semantically as a wall clock time. However, we still need to agree on how we encode wall clock times into a proto literal. If we represent wall clock times as ISO-8601 strings (with no TZ part) then it would be fine. There would be no need to mention time zones. However, if we choose to encode wall clock times as a u64 then we must mention how that works. The way that works is by encoding "the wall clock time X" as "seconds since the epoch (UTC) where a wall clock in a UTC time zone would display X"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't necessitate specifying UTC. You can as well say "seconds since the epoch (in arbitrary timezone TZ) where a wall clock in the same timezone TZ would display X". That's what I assume is the point of the no-TZ timestamp :)
Very well then. You are a consumer. I have sent you a plan with a PrecisionTimestamp
literal. The proto is:
{
precision: 0,
value: 1720794104
}
Can you please format this as a string of the form MONTH/DAY/YEAR HOURS:MINUTES:SECONDS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very well then. You are a consumer. I have sent you a plan with a
PrecisionTimestamp
literal. The proto is:{ precision: 0, value: 1720794104 }
Can you please format this as a string of the form
MONTH/DAY/YEAR HOURS:MINUTES:SECONDS
?
I'm not sure I understand where you're getting with this, but I believe that'd be July 12, 2024 14:21:44
(well, with a slightly different format string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that answer is wrong. The correct answer is July 12, 2024 7:21:44 AM
.
I believe I followed the encoding rules correctly. 1720794104 is the number of seconds that have passed since 1.1.1970 PT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think I see my mistake. I'm thinking we are interpreting this as
"The time shown on a wall clock in UTC after X seconds have passed since Y" (in which case Y's timezone matters)
You are perhaps interpreting it as
"The time shown on a wall clock in timezone Z after X seconds have passed since Y" (in which case Y's time zone doesn't matter as long as Z matches it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right in how I interpret it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see you are mirroring the site docs. Let's tackle this in a future PR.
to align with spark https://spark.apache.org/docs/3.4.2/api/java/org/apache/spark/sql/types/TimestampType.html until we have proper PrecisionTimestamp substrait-io/substrait#659
to align with spark https://spark.apache.org/docs/3.4.2/api/java/org/apache/spark/sql/types/TimestampType.html until we have proper PrecisionTimestamp substrait-io/substrait#659
BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ
The PrecisionTimestamp and PrecisionTimestampTZ literals were introduced in #594, and there was some discussion about their contents in #594 (comment). In their current form, they only contain a i64 value, which I believe is meant to be a precision-dependent number of fractional seconds since epoch. However, the literals don't contain the precision itself, so it's impossible to interpret a PrecisionTimestamp or PrecisionTimestampTZ literal without other information. This is in contrast to e.g. varchar, whose literal does specify the length, or decimal, whose literal specifies scale and precision. @westonpace curious for your thoughts since you were part of that original discussion - am I missing something or is this a bug?
This PR changes the Literal types for PrecisionTimestamp and PrecisionTimestampTZ to contain a PrecisionTimestamp message instead of a i64. That message then contains the i64 value as well as the i32 type. This is a breaking change, but given in their current form these literals aren't usable (unless I'm missing something), would that be okay?
Currently I used the same message for both PrecisionTimestamp and PrecisionTimestampTZ, but I can also make a copy for PTTZ if that'd be better for forwards-compatibility.