-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: include precision parameter in timestamp types #594
Conversation
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
Keeping this as a draft and only making the changes for |
c641f87
to
6dd360a
Compare
extensions/functions_datetime.yaml
Outdated
@@ -32,7 +32,8 @@ scalar_functions: | |||
* SECOND Return the second (0-59). | |||
* MILLISECOND Return number of milliseconds since the last full second. | |||
* MICROSECOND Return number of microseconds since the last full millisecond. | |||
* SUBSECOND Return number of microseconds since the last full second of the given timestamp. | |||
* NANOSECOND Return number of nanoseconds since the last full microsecond. | |||
* SUBSECOND Return number of nanoseconds since the last full second of the given timestamp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we changing the behavior of SUBSECOND here? Would it be safer to add a new option and deprecate SUBSECOND?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, deprecating might be a better. Should we have a process to be followed for deprecating an option value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the standard definition uses millisecond for SUBSECOND then we might need to leave this version permanently.
proto/substrait/algebra.proto
Outdated
@@ -796,7 +796,8 @@ message Expression { | |||
string string = 12; | |||
bytes binary = 13; | |||
// Timestamp in units of microseconds since the UNIX epoch. | |||
int64 timestamp = 14; | |||
// Deprecated in favor of `Timestamp timestamp` | |||
// int64 timestamp = 14 [deprecated = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically the number would become reserved to prevent reuse. But usually one goes through a deprecation period to allow for interoperability. That means we need to have both around for a while. We could say that consumers always should use matched sets but I don't think that's always possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means we need to have both around for a while. We could say that consumers always should use matched sets but I don't think that's always possible.
For migration purposes we need a period of time where both of these timestamps types coexist. To migrate this properly we need to be able to do something like:
- Update consumers to handle the new timestamp type.
- Update producers to generate plans with the new timestamp type and remove the old type.
- Remove code for handling the old timestamp type in consumers.
If we remove this entirely we have to do 1 and 2 simultaneously, which doesn't work super well if they're entirely different systems 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. So we'll need a different name for the new timestamp. Maybe something like ParametrizedTimestamp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since parameterized_types.proto already uses Parameterized
as a prefix in all the names, it'll be less confusing to use something like PrecisionTimestamp
instead
proto/substrait/algebra.proto
Outdated
@@ -845,6 +848,11 @@ message Expression { | |||
int32 scale = 3; | |||
} | |||
|
|||
message Timestamp { | |||
// the maximum precision is 9. | |||
int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also have a value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly sure why value would be necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a literal (the type is named ParameterizedPrecisionTimestamp
?. In other words, if my SQL filter is event_time < '1999-01-08 04:05:06'
and event_time
is a "precision timestamp" column then I need to make a Substrait plan that looks like...
less_than(input_field(0), ???)
where ???
is my timestamp literal. So we do need a value (and not a precision) here.
We could do something like:
// If the precision is 6 or less then this is the microseconds since the UNIX epoch
// If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
uint64 precision_timestamp = 34;
proto/substrait/algebra.proto
Outdated
@@ -845,6 +848,11 @@ message Expression { | |||
int32 scale = 3; | |||
} | |||
|
|||
message Timestamp { | |||
// the maximum precision is 9. | |||
int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default value is likely 6 for backwards compatibility (although this is a completely different message type so I'm not sure that matters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the type, there's no need, but it looked like this was replacing the value version of int timestamp.
site/docs/types/type_classes.md
Outdated
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T | ||
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| TIMESTAMP<P> | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should have both the old and new types listed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the proto it looks like we are deprecating the old type? Are you saying we should list the old type and mark it deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work. If you store int64 nanoseconds since the epoch then the maximum value you can get is 2262-04-12 01:47:16.854775807 +0200
(slightly larger if you store unsigned but still not large enough). However, the first sentence claims within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999]
.
So either the literal representation needs to be larger than 8 bytes or the allowable range needs to depend on the precision.
Note that making literals larger than 8 bytes is tempting, but it could lead to system incompatibility. For example, you could create a Substrait filter my_date < year_4000
but if you tried to evaluate that on the consumer it might try and convert that literal to its internal representation and return an error (e.g. this is what would happen if you tried to use pandas, datafusion, or pyarrow to evaluate that).
So I think we have to state that the allowable range depends on the precision.
7b97dce
to
990aa9b
Compare
proto/substrait/algebra.proto
Outdated
@@ -845,6 +848,11 @@ message Expression { | |||
int32 scale = 3; | |||
} | |||
|
|||
message Timestamp { | |||
// the maximum precision is 9. | |||
int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a literal (the type is named ParameterizedPrecisionTimestamp
?. In other words, if my SQL filter is event_time < '1999-01-08 04:05:06'
and event_time
is a "precision timestamp" column then I need to make a Substrait plan that looks like...
less_than(input_field(0), ???)
where ???
is my timestamp literal. So we do need a value (and not a precision) here.
We could do something like:
// If the precision is 6 or less then this is the microseconds since the UNIX epoch
// If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
uint64 precision_timestamp = 34;
site/docs/types/type_classes.md
Outdated
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T | ||
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| TIMESTAMP<P> | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the proto it looks like we are deprecating the old type? Are you saying we should list the old type and mark it deprecated?
site/docs/types/type_classes.md
Outdated
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T | ||
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| TIMESTAMP<P> | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work. If you store int64 nanoseconds since the epoch then the maximum value you can get is 2262-04-12 01:47:16.854775807 +0200
(slightly larger if you store unsigned but still not large enough). However, the first sentence claims within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999]
.
So either the literal representation needs to be larger than 8 bytes or the allowable range needs to depend on the precision.
Note that making literals larger than 8 bytes is tempting, but it could lead to system incompatibility. For example, you could create a Substrait filter my_date < year_4000
but if you tried to evaluate that on the consumer it might try and convert that literal to its internal representation and return an error (e.g. this is what would happen if you tried to use pandas, datafusion, or pyarrow to evaluate that).
So I think we have to state that the allowable range depends on the precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again! Some suggestions and cleanup.
proto/substrait/type.proto
Outdated
@@ -159,6 +163,18 @@ message Type { | |||
Nullability nullability = 4; | |||
} | |||
|
|||
message PrecisionTimestamp { | |||
optional int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we've used optional
much elsewhere. I seem to recall it being largely a no-op as far as protobuf is concerned. However, I don't see much harm in it. Maybe document the default (6 I assume) again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional is from the version 2 of the protobuf format. These days most of the outside world is using version 3 (as is this file) where optional is assumed.
site/docs/types/type_classes.md
Outdated
| PRECISIONTIMESTAMP<P> | A timestamp with fractional second precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `uint64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) | ||
| PRECISIONTIMESTAMPTZ<P> | A timezone-aware timestamp, with fractional second precision (P, number of digits) <= 9. Similar to aware datetime in Python. | `int64` microseconds since 1970-01-01 00:00:00.000000 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might go without saying but should we mention that P >= 0
as well?
site/docs/types/type_classes.md
Outdated
| NSTRUCT<N:T1,...,N:Tn> | **Pseudo-type**: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a | ||
| LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T | ||
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V | ||
| PRECISIONTIMESTAMP<P> | A timestamp with fractional second precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `uint64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like PRECISIONTIMESTAMP
has a literal representation of uint64 nanoseconds
and PRECISIONTIMESTAMPTZ
has a literal representation of uint64 microseconds
.
However, if I'm reading the proto correctly, they both have a literal representation of "either uint64 microseconds or nanoseconds since ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of this, do we need add signature names for the new types as well?
substrait/site/docs/extensions/index.md
Lines 73 to 74 in d9b9672
| timestamp | ts | | |
| timestamp_tz | tstz | |
Thanks for the catch! Added new signature names |
16c5cce
to
7c5da20
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small typo
proto/substrait/type.proto
Outdated
@@ -159,6 +163,18 @@ message Type { | |||
Nullability nullability = 4; | |||
} | |||
|
|||
message PrecisionTimestamp { | |||
optional int32 precision = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional is from the version 2 of the protobuf format. These days most of the outside world is using version 3 (as is this file) where optional is assumed.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
4d349d3
// If the precision is 6 or less then this is the microseconds since the UNIX epoch | ||
// If the precision is more than 6 then this is the nanoseconds since the UNIX epoch | ||
uint64 precision_timestamp = 34; | ||
uint64 precision_timestamp_tz = 35; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these have been custom types that'd include the precision? Like Decimal/VarChar below. Or how is the consumer mean to infer the precision? The uint64
presumably means the timestamp value itself
…nTimestampTZ literals (#659) BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ The PrecisionTimestamp and PrecisionTimestampTZ literals were introduced in #594, and there was some discussion about their contents in #594 (comment). In their current form, they only contain a i64 value, which I believe is meant to be a precision-dependent number of fractional seconds since epoch. However, the literals don't contain the precision itself, so it's impossible to interpret a PrecisionTimestamp or PrecisionTimestampTZ literal without other information. This is in contrast to e.g. varchar, whose literal does specify the length, or decimal, whose literal specifies scale and precision. @westonpace curious for your thoughts since you were part of that original discussion - am I missing something or is this a bug? This PR changes the Literal types for PrecisionTimestamp and PrecisionTimestampTZ to contain a PrecisionTimestamp message instead of a i64. That message then contains the i64 value as well as the i32 type. This is a breaking change, but given in their current form these literals aren't usable (unless I'm missing something), would that be okay? Currently I used the same message for both PrecisionTimestamp and PrecisionTimestampTZ, but I can also make a copy for PTTZ if that'd be better for forwards-compatibility.
This PR introduces a new
PrecisionTimestamp
type that accepts an optional precision parameter to specify fractional second precision.Closes #592