-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for an interval type covering both months and (sub)seconds #664
Comments
Should we make the subsecond precision configurable? Or we could just use nanoseconds (there is not a "range problem" since this only needs to store up to 1 day's worth). If we choose microseconds then we still aren't compatible with Arrow / DataFusion. We also need to define the range of the data type. The year/month interval has a range of -10,000 to 10,000 years so I think we could just reuse that (should be doable with a u32 month). |
I've looked into this previously. Posting a summary of what I found in terms of engine behaviours below. One of the big challenges I've found around this was the semantics of various arithmetic operations with interval types. Having a combined interval type more closely matches some engines, but then the actual permitted opertaions with those intervals and the return types can vary wildly. Intervals In the WildSQL SpecThe SQL 98 spec defines two categories of intervals: Year-month intervals are only comparable to other year-month intervals. Operations involving values of type interval (i.e. +) require that the interval types be mutually comparable. This means that per the spec you cannot add year-month intervals to a day-time intervals as they are not comparable. MySQLMySQL does not have an interval type, however specific operations allow for declarations of interval literals. For example DATE_ADD: SELECT DATE_ADD('2018-05-01', INTERVAL 1 DAY);
> 2018-05-02 sqlitesqlite does not have an interval type, however specific functions can specify interval-like modifiers: SELECT datetime('now', '+5 years', '-90 seconds'); TrinoTrino has two interval types: Which cannot be combined: SELECT INTERVAL '1' MONTH + INTERVAL '1' HOUR
> Cannot apply operator: interval year to month + interval day to second SnowflakeSnowflake does not have an interval column type but does allow constant intervals. These intervals can combine year-month and day-time values. Note though that in SELECT TO_DATE ('2019-02-28') + INTERVAL '1 day, 1 year';
> 2020-03-01 SELECT TO_DATE ('2019-02-28') + INTERVAL '1 year, 1 day';
> 2020-02-29 return different values, in this case due to leap year differences. PostgresPostgres has a single interval type which allows for combinations of year-month and day-time intervals. SELECT INTERVAL '1 MONTH' + INTERVAL '2 DAY' + INTERVAL '1 SECOND'
> {"months":1,"days":2,"seconds":1} This does not technically conform to the spec. DuckDBDuckDB has a single interval type which allows for combinations of year-month and day-time intervals. SELECT INTERVAL 1 YEAR + INTERVAL 1 DAY;
> 1 year 1 day This does not technically conform to the spec. ArrowArrow has 3 interval types: Which effectively maps to year-month intervals, day-time intervals and then a combined interval type. |
It looks like Postgres, DuckDb, and Datafusion all have this combined interval type. Are there semantic differences between the three of those? If not, then I think this might be worth pursuing. |
I was mostly comparing day-time and year-month operation differences to those allowed on a combined interval type, and not so much differences between engines on combined interval types. One difference I did find though was between Postgres and Datafusion, when adding intervals to dates. In Postgres: SELECT pg_typeof(DATE '2001-08-22' + INTERVAL '4' MONTH);
> timestamp without time zone In Datafusion (via datafusion-cli) SELECT arrow_typeof(DATE '2001-08-22' + INTERVAL '4' MONTH);
> Date32 |
That is subtle 😖 At the moment I'm still slightly in favor of adding the new type class but that might just be my Arrow bias speaking. |
The current two intervals can be combined to provide a more complete datetime interval but the omission of subsecond time is a notable omission. If the purpose is for a complete datetime interval the lack of years seems like it might be an issue (although one could combine the existing interval with year to get the complete picture). Ideally the intervals that are defined cover what time/date changes that are needed. This is unlikely to be a complete datetime interval as the precision of one year and one nanosecond is nonsensical especially given the impreciseness of a year. A month to nano seems too wide of a definition to me especially when the option to combine with a calendar interval is still available. It's probably too late given the landscape of implementations out there but an interval for time including the missing precision seems like the appropriate missing abstraction. |
That is a difference in the behavior of the The four systems I looked at (Spark, Arrow/DF, Postgres, DuckDB) all store these intervals as a combination of three numbers (months, days, some version of seconds). Actually parquet as well. There are differences in how the second field is treated (a precision of 6 decimal in spark, nanosecs i64 in Arrow/DF, milliseconds i32 in Parquet, microseconds i64 in DuckDB. There's also difference in comparability: some note that the intervals are not comparable, some normalize the values (using e.g. 1 month = 30 days rule) to allow comparison. Other than the comparisons, I'd say the type itself is quite well-defined. Functions using it might differ in behavior, but that's true for every other type as well. I started a PR to add the type in #665 |
When using substrait as a first-class entity, perhaps. But when using Substrait to represent existing systems, it doesn't work, since functions need a single return type. I guess one could return a struct combining the two intervals, but that would be quite complicated for both consumers and producers and isn't really a pattern that'd be used anywhere else.
FWIW, the existing IntervalDayToSecond includes subsecond (microsecond) time.
A year can always* be described as 12 months. Thus the month-day-nano interval can cover a range of +- 178 956 970 years which should be quite enough. It doesn't necessarily allow recreating the exact command a user typed (INTERVAL 1 YEAR 3 MONTHS) would be identical to (INTERVAL 15 MONTHS), but I guess that's not a necessary goal anyways.
While one can argue that using years and nanoseconds in the same interval is probably not very useful, the combined interval type also allows to say eg. "1 MONTH 1 DAY" which you cannot do with either of the existing interval types. Like you note, it is also just in practice the type that's being used out there for many of the common systems, so lacking it makes it impossible/very hard to support intervals correctly in those systems. (* with maybe the exception of times when calendar was switched hundreds of years ago) |
This was my assessment as well. There are systems that combine year-month intervals and day-time intervals into a single type (Postgres, DuckDB) so it makes sense to model these kinds of intervals as first-class entities.
You are correct. The thing that I've found though is that the combined interval type is extremely hard to reason about in terms of behaviour, which you've also noted:
In this case adding the type is easy, but the follow-up work of defining the "correct" behaviour will be challenging because different systems have different opinions about what correct means. The combined interval type is also awkward because while you can always go from a year-month or day-time interval to a combined interval type in a lossless manner (I think), going the other way is not so. Let's say that you're going from a system with a combined interval type SELECT DATE '2019-02-28' + INTERVAL '1 YEAR 1 DAY' to one that does not. Which of the following two is the equivalent query: SELECT DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY'
SELECT DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR' Fun fact, the order matter here. See the Snowflake docs for details and/or this datafusion issue. This is a long-winded way of saying that I think a combined interval type makes sense by virtue of it existing in the wild, but I am unhappy with it existing in general, it makes me sad to try and reason about the many possible behaviours and we have our work cut out for us defining semantics for it. Bonus Cursed ThingFractional intervals in Postgres SELECT INTERVAL '1' SECOND / 3;
> {"milliseconds":333.333} |
While that's true when designing a new system (which I think is what you meant), it sadly doesn't mean that existing things using combined interval could be represented through the year-month and day-time intervals. (Ie. there's no way to represent Spark's
That's fair. FWIW, I think I can solve our need using user-defined types, so I'm not blocked on this.
Works in Spark too :)
Interestingly, Spark's
DF refuses the division, but I guess it wouldn't be hard to implement it. |
I thought wrong 😅 |
NOTE: I've written several responses before posting this. As I thought more and more about it, I'd like to propose we approach this type differently given how different the behavior is between different systems: My recollection is the SQL spec actually spent time against this and realized that trying to figure out what I agree that systems have added other types and thus Substrait needs to support it somehow (propagate the mistake). I think the problem is that the semantics of those types are very poorly defined when a calculation is done. I've seen systems that will immediately take the interval and boil it down using standard conversions (e.g. 30 days = 1 month) and system that maintain full resolution.
This seems like a net positive that aligns with our changes to timestamp, etc.
I started marking up the attached PR but felt like it was better to talk about overall direction here. Note, I'm not demanding the changes above (and I know I'm late to the party). Just trying to think through how we can be supportive of different systems in a system-agnostic way that is also clearly specified. |
Another proposal...
It's a slight adjustment to what is proposed. Two main differences:
|
The more I think about this, the more I'm inclined to go with my second proposal for the reasons above. Makes precision clear for different engines and makes it clear that this is fundamentally a compound version of the other two.
@westonpace and @vbarua , you good with that suggestion? |
Would a microsecond precision system reject a plan with a nanosecond precision literal (even if the nanoseconds value was 0)? |
If we assume a consumer is trying to accurately execute a plan, it might be happy to do a cast of intervalcompound<9> to intervalcompound<6> or intervaldaysecond<9> to intervaldaysecond<6> when the nanoseconds are all zero. Just like a system might be happy to cast a precision_timestamp<3> literal to precision_timestamp<0> literal if there are no seconds. I'm not clear what makes precision_timestamp different from the interval types on this front. |
I'm onboard for encoding precision into intervals in some manner.
I think it should. If the consumer does not support the requested precision it should blow up to signal this to the producer. The producer can do the casts on its side if it's okay with truncation. Even if the value is all 0, operations on that literal that use values read in from data could behave differently at different precisions. It also messes with function lookup within the plan, because if the producer is applying operations on top of the literal it has (hopefully) sent functions with the correct signatures for the precision it expects. |
@westonpace and @EpsilonPrime , are either of you opposed to my proposal above? |
No further concerns. Adding precision to the existing and adding a compound interval both seem like improvements to me. How do you intend to add precision? By modifying the existing type or introducing a new type (as we did for timestamp)? |
I'm on board with the latest suggestion. |
I'm inclined to modify existing interval type. I don't think there is much use yet. @Blizzara , can you update the PR as discussed? Update interval daysecond to support 0-9 precision (and rename the second field/docs to something like fractional_seconds) Add new IntervalCompound type that includes all the fields from the other two types and is parameterized with 0-9 precision |
I’m currently on holidays, if someone wants to take over the PR in the meantime that’s fine by me, otherwise I’ll be happy to update it once I’m back! The proposal overall sounds good to me! |
Substrait currently defines two interval types, IntervalYearToMonth which contains a number of years and months, and IntervalDayToSecond.
Neither of these types is capable of expressing full intervals that combine years, months, days, minutes, seconds, etc. (For YearToMonth that's probably clear; for DayToSecond one could in theory use large numbers for days but given the number of days in a month/year is not unique, there is a semantic difference between an interval of "40 DAYS 1 MINUTE" and "1 MONTH 10 DAYS 1 MINUTE"
Internally, PostgreSQL stores interval values as months, days, and seconds. The months and days values are integers while the seconds field can have fractions
Would you be open to adding an third interval type that combines three numbers: months (i32), days (i32), (micro)seconds (i32 for micro or decimal for seconds)?
The text was updated successfully, but these errors were encountered: