-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support timestamp and interval arithmetic #5764
Support timestamp and interval arithmetic #5764
Conversation
add tests macro will replace matches inside evaluate ready for review
The former result is produced, you can reproduce it with this test. Existing |
This example has a timezone of |
Now I understand what you mean. Postgre gives the result as in the former. If no objection, I will fix it that way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you so much @berkaysynnada
I think this is a significant improvement to DataFusion -- while longer term I would prefer to see the interval arithmetic logic moved into arrow-rs, starting with an implementation in the DataFusion repo has worked well in the past and I think will work well here too.
Can you please respond to @tustvold 's comments? I think they are good questions, but then I think we could merge this PR and file a follow on tickets
- Move the arithmetic code into binary.rs (following the existing models, as a step towards getting them upstream in arrow).
- File a ticket about not handling timezones properly
@@ -142,14 +138,30 @@ impl PhysicalExpr for DateTimeIntervalExpr { | |||
return Err(DataFusionError::Internal(msg.to_string())); | |||
} | |||
}; | |||
// RHS is first checked. If it is a Scalar, there are 2 options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Longer term I think it would be good to move the date_time arithmetic into https://github.com/apache/arrow-datafusion/tree/main/datafusion/physical-expr/src/expressions/binary as these really are binary operations
That would also set us up so when the kernels are added to arrow-rs (aka part of apache/arrow-rs#3958) it would be easier to migrate.
I like how this PR followed the existing pattern in DateTimeIntervalExpr
even if that may not be our ideal end state
I am working on @tustvold 's comments, and when I finalize them I will commit. Thanks for the support of |
@tustvold I tried to fix the issues you mention, can you please take a quick look? |
.ok(); | ||
parsed_tz | ||
let parsed_tz: Tz = FromStr::from_str(tz).map_err(|_| { | ||
DataFusionError::Execution("cannot parse given timezone".to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the error contained the problematic timezone. Something like
let parsed_tz: Tz = FromStr::from_str(tz).map_err(|e| {
DataFusionError::Execution(format!("cannot parse '{tz}' as timezone: {e}".to_string())
@@ -348,63 +340,6 @@ pub fn evaluate_temporal_arrays( | |||
Ok(ColumnarValue::Array(ret)) | |||
} | |||
|
|||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
@@ -261,6 +261,110 @@ SELECT INTERVAL '8' MONTH + '2000-01-01T00:00:00'::timestamp; | |||
---- | |||
2000-09-01T00:00:00 | |||
|
|||
# Interval columns are created with timestamp subtraction in subquery since they are not supported yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with #5792 we can now write better tests here -- specifically we can create interval constants.
@@ -44,6 +44,7 @@ unicode_expressions = ["unicode-segmentation"] | |||
[dependencies] | |||
ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] } | |||
arrow = { workspace = true } | |||
arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of datafusion now uses arrow 36, but this uses arrow 34
arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] } | |
arrow-array = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #5794
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada -- given we are working on intervals in general and this PR pushes things along substantially I am going to merge it and we can clean things up with follow on PRs.
Thanks again
@alamb Thanks for the support. I add these issues to my to-do's and will open the PRs as I progress. |
Thanks @berkaysynnada -- can you be specific about which items you have added to the todo list? |
I meant #5803, which you have completed, and removing the arithmetic code to binary.rs, but I can spare time for the issues that you see as relevant in #5753 and #3958 |
Thank you @berkaysynnada 🙇 . I think this issue:
This is the most valuable part in my opinion as it pays down tech debt and sets us up for a easier migration / porting of the code upstream to arrow-rs -- and since you probably still have all the timestamp / kernel context in your head, you are probably likely to do it more quickly than someone who needs to get up to speed |
I am working on symmetric hash join with temporal type inputs, and hence I need to modify |
If possible, I would recommend a separate PR (that your sym hash join builds on) that moves the code -- this should speed up reviews as each will be smaller and more focused |
Which issue does this PR close?
Closes #5704
Closes #194
Rationale for this change
We can handle such queries now:
SELECT val, ts1 - ts2 AS ts_diff FROM table_a
SELECT val, interval1 - interval2 AS interval_diff FROM table_a
SELECT val, ts1 - interval1 AS ts_interval_diff FROM table_a
SELECT val, interval1 + ts1 AS interval_ts_sub FROM table_a
What changes are included in this PR?
Some match expressions in planner.rs, binary.rs, and datetime.rs are extended. Coerced types and allowable operations are shown in the table.
I try to use existing scalar value functions as much as possible to not duplicate. However, in arrow.rs, subtraction and addition functions are for numeric types, hence I need to add some functions to call with
binary
function.In datetime.rs, the
evaluate
function was written to accept only "Array + Scalar" or "Scalar + Scalar" values to evaluate. It is extended to accept "Array + Array", and 4 different variations of that case (Timestamp op Timestamp, Interval op Interval, Timestamp op Interval, Interval op Timestamp) are implemented. "Array + Scalar" evaluations are done withunary
function in arrow.rs, and I follow the similar pattern withtry_binary_op
function.try_binary_op
function is a modified version ofbinary
function in arrow-rs. The only difference is that it returnsResult
and creates the buffer withtry_from_trusted_len_iter
. Otherwise, we had to unwrap the op function sent to binary.Are these changes tested?
Yes, there are tests for each match in timestamp.slt. However, tables with intervals cannot be created like
INTERVAL '1 second'
, since some work is needed in arrow-rs for casting. Timestamp difference case with timezone is also left in timestamp.rs because of a similar reason.Are there any user-facing changes?