Support timestamp and interval arithmetic #5764

berkaysynnada · 2023-03-28T14:58:42Z

Which issue does this PR close?

Closes #5704
Closes #194

Rationale for this change

We can handle such queries now:
SELECT val, ts1 - ts2 AS ts_diff FROM table_a
SELECT val, interval1 - interval2 AS interval_diff FROM table_a
SELECT val, ts1 - interval1 AS ts_interval_diff FROM table_a
SELECT val, interval1 + ts1 AS interval_ts_sub FROM table_a

What changes are included in this PR?

| Operation | - | + |
|---|---|---|
| timestamp op timestamp (same type) | OK: second and millisecond types give results in daytime (day + millisecond); microsecond and nanosecond types give results in monthdaynano (month + day + nano, but the month field is not used) | N/A |
| timestamp op timestamp (different types) | N/A | N/A |
| interval op interval (same type) | OK: operations are done field by field, gives the same type | OK: operations are done field by field, gives the same type |
| interval op interval (different types) | OK: gives result in monthdaynano type | OK: gives result in monthdaynano type |
| timestamp op interval | OK: gives result in the type of the timestamp | OK: gives result in the type of the timestamp |
| interval op timestamp | N/A | OK: the same as timestamp + interval |

Some match expressions in planner.rs, binary.rs, and datetime.rs are extended. Coerced types and allowable operations are shown in the table.
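To make the table easier to check against, here is a minimal sketch of the same rules as a single function. The `Ty`, `Op`, and `result_type` names are hypothetical stand-ins written for this illustration; they are not the types or the coercion code used in planner.rs or binary.rs.

```rust
// Simplified stand-ins for the Arrow data types named in the table (assumed names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntervalUnit { YearMonth, DayTime, MonthDayNano }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty { Timestamp(TimeUnit), Interval(IntervalUnit) }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op { Plus, Minus }

/// Returns the result type of `lhs op rhs`, or `None` for the N/A cells of the table.
fn result_type(lhs: Ty, op: Op, rhs: Ty) -> Option<Ty> {
    use IntervalUnit::*;
    use TimeUnit::*;
    match (lhs, op, rhs) {
        // timestamp - timestamp is only defined when both sides share a unit
        (Ty::Timestamp(a), Op::Minus, Ty::Timestamp(b)) if a == b => Some(match a {
            Second | Millisecond => Ty::Interval(DayTime),
            Microsecond | Nanosecond => Ty::Interval(MonthDayNano),
        }),
        (Ty::Timestamp(_), _, Ty::Timestamp(_)) => None,
        // interval op interval: same unit is kept, mixed units widen to monthdaynano
        (Ty::Interval(a), _, Ty::Interval(b)) => {
            Some(Ty::Interval(if a == b { a } else { MonthDayNano }))
        }
        // timestamp op interval keeps the timestamp type
        (Ty::Timestamp(t), _, Ty::Interval(_)) => Some(Ty::Timestamp(t)),
        // interval + timestamp behaves like timestamp + interval; subtraction is N/A
        (Ty::Interval(_), Op::Plus, Ty::Timestamp(t)) => Some(Ty::Timestamp(t)),
        (Ty::Interval(_), Op::Minus, Ty::Timestamp(_)) => None,
    }
}
```

For example, `result_type(Ty::Timestamp(TimeUnit::Second), Op::Minus, Ty::Timestamp(TimeUnit::Second))` yields the daytime interval type, while subtracting a timestamp from an interval yields `None`.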

I tried to reuse the existing scalar value functions as much as possible to avoid duplication. However, the subtraction and addition functions in arrow.rs are for numeric types, so I had to add some functions that are called through the binary function.
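As an illustration of the kind of helper this refers to, the sketch below subtracts two second-resolution timestamps and splits the difference into the (day, millisecond) pair of a daytime interval. The function name `diff_seconds_as_day_time` and the exact splitting rule are assumptions made for this example, not the code added in the PR.

```rust
use std::convert::TryFrom;

const MILLIS_PER_DAY: i64 = 86_400_000;

/// Hypothetical helper: difference of two second-resolution timestamps,
/// returned as (days, milliseconds). Returns `None` on overflow.
fn diff_seconds_as_day_time(ts_lhs: i64, ts_rhs: i64) -> Option<(i32, i32)> {
    // Work in milliseconds so the remainder maps directly to the millisecond field.
    let diff_ms = ts_lhs.checked_sub(ts_rhs)?.checked_mul(1_000)?;
    // Integer division truncates toward zero, so both fields carry the sign of the difference.
    let days = i32::try_from(diff_ms / MILLIS_PER_DAY).ok()?;
    // The remainder is strictly smaller than one day in magnitude, so it fits in an i32.
    let millis = (diff_ms % MILLIS_PER_DAY) as i32;
    Some((days, millis))
}
```

A function of this shape takes two primitive values and returns one, which is what lets it be passed as the `op` argument of an element-wise binary kernel.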

In datetime.rs, the evaluate function was written to accept only "Array + Scalar" or "Scalar + Scalar" inputs. It is extended to accept "Array + Array", and four variations of that case are implemented (Timestamp op Timestamp, Interval op Interval, Timestamp op Interval, Interval op Timestamp). "Array + Scalar" evaluations are done with the unary function in arrow.rs, and I follow a similar pattern with the try_binary_op function. try_binary_op is a modified version of the binary function in arrow-rs. The only difference is that it returns a Result and creates the buffer with try_from_trusted_len_iter. Otherwise, we would have had to unwrap the result of the op function passed to binary.
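The idea behind a fallible binary kernel can be sketched over plain slices. This is only an illustration of the pattern: `try_binary_sketch` and `checked_sub_i64` are hypothetical names, the error type is simplified to `String`, and none of the Arrow buffer handling (such as try_from_trusted_len_iter or null masks) is reproduced here.

```rust
/// Applies a fallible `op` to each pair of elements and stops at the first error,
/// so the caller never has to unwrap inside `op`.
fn try_binary_sketch<A, B, O, F>(lhs: &[A], rhs: &[B], op: F) -> Result<Vec<O>, String>
where
    A: Copy,
    B: Copy,
    F: Fn(A, B) -> Result<O, String>,
{
    if lhs.len() != rhs.len() {
        return Err(format!("length mismatch: {} vs {}", lhs.len(), rhs.len()));
    }
    // Collecting an iterator of `Result`s short-circuits on the first `Err`.
    lhs.iter().zip(rhs).map(|(&l, &r)| op(l, r)).collect()
}

/// Example `op`: subtraction that reports overflow instead of panicking.
fn checked_sub_i64(l: i64, r: i64) -> Result<i64, String> {
    l.checked_sub(r)
        .ok_or_else(|| format!("overflow computing {l} - {r}"))
}
```

Calling `try_binary_sketch(&[5i64, 7], &[2i64, 3], checked_sub_i64)` returns the element-wise differences, and an overflow in any element surfaces as an `Err` for the whole operation.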

Are these changes tested?

Yes, there are tests for each match arm in timestamps.slt. However, tables with interval columns cannot yet be created with literals like INTERVAL '1 second', since some work is needed in arrow-rs for casting. The timestamp difference case with a timezone is also left in timestamp.rs for a similar reason.

Are there any user-facing changes?

berkaysynnada · 2023-03-29T10:11:28Z

If I add 1 month to 2023-03-01T00:00:00 +02, is the output 2023-04-01T00:00:00 +02 or 2023-03-29T00:00:00 +02?

I believe this PR implements the latter as it performs arithmetic with respect to the UTC epoch? Is that the desired behaviour?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=11112707c7217ecca3ef64ceb984beb6 contains an example showing the difference

#[tokio::test]
async fn interval_ts_add() -> Result<()> {
    let ctx = SessionContext::new();
    let table_a = {
        let schema = Arc::new(Schema::new(vec![
            Field::new("ts1", DataType::Timestamp(TimeUnit::Second, None), false),
            Field::new(
                "interval1",
                DataType::Interval(IntervalUnit::YearMonth),
                false,
            ),
        ]));
        let array1 = PrimitiveArray::<TimestampSecondType>::from_iter_values(vec![
            1_677_628_800i64, // 2023-03-01T00:00:00
        ]);
        let array2 =
            PrimitiveArray::<IntervalYearMonthType>::from_iter_values(vec![1i32]);
        let data = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(array1), Arc::new(array2)],
        )?;
        let table = MemTable::try_new(schema, vec![vec![data]])?;
        Arc::new(table)
    };

    ctx.register_table("table_a", table_a)?;
    let sql = "SELECT ts1, ts1 + interval1 from table_a";
    let actual = execute_to_batches(&ctx, sql).await;
    let expected = vec![
        "+---------------------+---------------------------------+",
        "| ts1                 | table_a.ts1 + table_a.interval1 |",
        "+---------------------+---------------------------------+",
        "| 2023-03-01T00:00:00 | 2023-04-01T00:00:00             |",
        "+---------------------+---------------------------------+",
    ];
    assert_batches_eq!(expected, &actual);
    Ok(())
}

The former result is produced; you can reproduce it with this test. The existing do_date_time_math functionality is used.

tustvold · 2023-03-29T10:22:54Z

The former result is produced

This example has a timezone of None, not +02:00 as is necessary to demonstrate the potential inconsistency?

berkaysynnada · 2023-03-29T11:58:53Z

Now I understand what you mean. PostgreSQL gives the former result. If there is no objection, I will fix it that way.
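The two candidate results in the question can be reproduced with a small standard-library sketch. It assumes a fixed UTC offset (no daylight saving rules), and all function names are made up for this illustration; the civil-date conversions follow Howard Hinnant's well-known days-from-civil algorithm rather than anything in DataFusion or chrono.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Inverse of `days_from_civil`: returns (year, month, day).
fn civil_from_days(z: i64) -> (i64, i64, i64) {
    let z = z + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    (y, m, d)
}

fn days_in_month(y: i64, m: i64) -> i64 {
    // First day of the following month minus first day of this month.
    days_from_civil(y + m / 12, m % 12 + 1, 1) - days_from_civil(y, m, 1)
}

/// Adds calendar months to an instant (seconds since the UTC epoch), doing the
/// calendar arithmetic in the wall-clock time of a fixed offset (seconds east of UTC).
fn add_months_at_offset(epoch_secs: i64, months: i64, offset_secs: i64) -> i64 {
    let local = epoch_secs + offset_secs;
    let (days, secs_of_day) = (local.div_euclid(86_400), local.rem_euclid(86_400));
    let (y, m, d) = civil_from_days(days);
    let total = y * 12 + (m - 1) + months;
    let (ny, nm) = (total.div_euclid(12), total.rem_euclid(12) + 1);
    let nd = d.min(days_in_month(ny, nm)); // clamp, e.g. Jan 31 + 1 month
    days_from_civil(ny, nm, nd) * 86_400 + secs_of_day - offset_secs
}
```

With the instant 2023-03-01T00:00:00 +02 (1_677_621_600 seconds since the epoch), `add_months_at_offset(1_677_621_600, 1, 7_200)` gives 1_680_300_000, which is 2023-04-01T00:00:00 +02. Passing an offset of 0 performs the arithmetic on the UTC wall clock (2023-02-28T22:00:00) and gives 1_680_040_800, which is 2023-03-29T00:00:00 +02.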

alamb

First of all, thank you so much @berkaysynnada

I think this is a significant improvement to DataFusion -- while longer term I would prefer to see the interval arithmetic logic moved into arrow-rs, starting with an implementation in the DataFusion repo has worked well in the past and I think will work well here too.

Can you please respond to @tustvold's comments? I think they are good questions, but then I think we could merge this PR and file follow-on tickets:

Move the arithmetic code into binary.rs (following the existing models, as a step towards getting them upstream in arrow).
File a ticket about not handling timezones properly

cc @waitingkuo @avantgardnerio @andygrove @liukun4515

alamb · 2023-03-29T18:41:53Z

datafusion/physical-expr/src/expressions/datetime.rs

@@ -142,14 +138,30 @@ impl PhysicalExpr for DateTimeIntervalExpr {
                return Err(DataFusionError::Internal(msg.to_string()));
            }
        };
+        // RHS is first checked. If it is a Scalar, there are 2 options:


Longer term I think it would be good to move the date_time arithmetic into https://github.com/apache/arrow-datafusion/tree/main/datafusion/physical-expr/src/expressions/binary as these really are binary operations

That would also set us up so when the kernels are added to arrow-rs (aka part of apache/arrow-rs#3958) it would be easier to migrate.

I like how this PR followed the existing pattern in DateTimeIntervalExpr even if that may not be our ideal end state

berkaysynnada · 2023-03-30T06:49:28Z

I am working on @tustvold 's comments, and when I finalize them I will commit. Thanks for the support of try_binary.

berkaysynnada · 2023-03-30T11:15:39Z

@tustvold I tried to fix the issues you mention, can you please take a quick look?

alamb · 2023-03-30T13:28:24Z

datafusion/common/src/scalar.rs

-            .ok();
-        parsed_tz
+        let parsed_tz: Tz = FromStr::from_str(tz).map_err(|_| {
+            DataFusionError::Execution("cannot parse given timezone".to_string())


It would be nice if the error contained the problematic timezone. Something like

let parsed_tz: Tz = FromStr::from_str(tz).map_err(|e| { DataFusionError::Execution(format!("cannot parse '{tz}' as timezone: {e}")) })
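The same pattern, an error message that carries both the offending input and the underlying cause, can be shown with a standard-library stand-in. The parser below handles only fixed offsets such as "+02:00" and is a hypothetical example; it is not how `Tz` parsing works in arrow-array.

```rust
/// Parses a fixed offset such as "+02:00" into seconds east of UTC.
/// Every error names the input that failed and the reason.
fn parse_fixed_offset(tz: &str) -> Result<i32, String> {
    let err = |reason: &str| format!("cannot parse '{tz}' as timezone: {reason}");
    let (sign, rest) = match tz.as_bytes().first() {
        Some(b'+') => (1, &tz[1..]),
        Some(b'-') => (-1, &tz[1..]),
        _ => return Err(err("expected a leading '+' or '-'")),
    };
    let (hh, mm) = rest.split_once(':').ok_or_else(|| err("expected HH:MM"))?;
    let two_digits = |s: &str| s.len() == 2 && s.bytes().all(|b| b.is_ascii_digit());
    if !two_digits(hh) || !two_digits(mm) {
        return Err(err("expected two-digit hours and minutes"));
    }
    // Forward the underlying parse error text instead of discarding it.
    let hours: i32 = hh
        .parse()
        .map_err(|e: std::num::ParseIntError| err(&e.to_string()))?;
    let minutes: i32 = mm
        .parse()
        .map_err(|e: std::num::ParseIntError| err(&e.to_string()))?;
    if hours > 23 || minutes > 59 {
        return Err(err("offset out of range"));
    }
    Ok(sign * (hours * 3_600 + minutes * 60))
}
```

A failing call such as `parse_fixed_offset("Mars/Olympus")` then produces a message that quotes 'Mars/Olympus', which is much easier to act on than a generic "cannot parse given timezone".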

alamb · 2023-03-30T13:28:49Z

datafusion/physical-expr/src/expressions/datetime.rs

@@ -348,63 +340,6 @@ pub fn evaluate_temporal_arrays(
    Ok(ColumnarValue::Array(ret))
 }

-#[inline]


alamb · 2023-03-30T14:41:40Z

datafusion/core/tests/sqllogictests/test_files/timestamps.slt

@@ -261,6 +261,110 @@ SELECT INTERVAL '8' MONTH + '2000-01-01T00:00:00'::timestamp;
 ----
 2000-09-01T00:00:00

+# Interval columns are created with timestamp subtraction in subquery since they are not supported yet


I think with #5792 we can now write better tests here -- specifically we can create interval constants.

alamb · 2023-03-30T14:42:33Z

datafusion/physical-expr/Cargo.toml

@@ -44,6 +44,7 @@ unicode_expressions = ["unicode-segmentation"]
 [dependencies]
 ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] }
 arrow = { workspace = true }
+arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] }


The rest of datafusion now uses arrow 36, but this uses arrow 34

Suggested change

-arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] }
+arrow-array = { workspace = true }

alamb

Thank you @berkaysynnada -- given we are working on intervals in general and this PR pushes things along substantially I am going to merge it and we can clean things up with follow on PRs.

Thanks again

berkaysynnada · 2023-03-30T16:23:35Z

@alamb Thanks for the support. I have added these issues to my to-do list and will open the PRs as I progress.

alamb · 2023-03-30T18:23:16Z

Thanks @berkaysynnada -- can you be specific about which items you have added to the todo list?

berkaysynnada · 2023-03-31T08:48:24Z

I meant #5803, which you have completed, and removing the arithmetic code to binary.rs, but I can spare time for the issues that you see as relevant in #5753 and #3958

alamb · 2023-03-31T14:19:26Z

Thank you @berkaysynnada 🙇 . I think this issue:

removing the arithmetic code to binary.rs

This is the most valuable part in my opinion, as it pays down tech debt and sets us up for an easier migration / porting of the code upstream to arrow-rs. Since you probably still have all the timestamp / kernel context in your head, you are likely to do it more quickly than someone who needs to get up to speed.

berkaysynnada · 2023-03-31T16:15:04Z

I am working on symmetric hash join with temporal type inputs, and hence I need to modify evaluate_array function in datetime.rs, where the evaluations of Array vs. Scalar values are done (newly added match arms here also use some of these arithmetic functions). I plan to insert that removal work to sym hash join PR, if it is not a problem.

alamb · 2023-03-31T16:32:35Z

I plan to insert that removal work to sym hash join PR, if it is not a problem.

If possible, I would recommend a separate PR (that your sym hash join builds on) that moves the code -- this should speed up reviews as each will be smaller and more focused

berkaysynnada and others added 30 commits March 7, 2023 10:37

first implementation and tests of timestamp subtraction

1869363

improvement after review

2f01278

postgre interval format option

806b4d3

random tests extended

708d717

corrections after review

c5bacbe

operator check

011933f

flag is removed

e475f58

clippy fix

423fb65

toml conflict

1291758

Merge branch 'main' into feature/time-interval-support

055ed81

minor changes

d7f3696

deterministic matches

8d5c8e3

simplifications (clippy error)

31577d9

test format changed

c274aef

minor test fix

968a682

Merge branch 'main' into feature/time-interval-support

49506ed

Update scalar.rs

ed63779

Refactoring and simplifications

68ea647

Make ScalarValue support interval comparison

ed04466

naming tests

3bf8fd6

macro renaming

0f8a7a7

renaming macro

cf892fe

Merge branch 'apache:main' into feature/timestamp-interval-arith-query

6b5484e

ok till arrow kernel ops

a078dbb

Merge branch 'main' into feature/timestamp-interval-arith-query

1c8fd69

Merge branch 'apache:main' into feature/timestamp-interval-arith-query

f27bdb7

Merge branch 'apache:main' into feature/timestamp-interval-arith-query

49727e1

macro will replace matches inside evaluate

bbfd9b1

add tests macro will replace matches inside evaluate ready for review

Code refactor

e14a16f

retract changes in scalar and datetime

9f82bbb

alamb changed the title ~~timestamp interval arithmetic query~~ Support timestamp and interval arithmetic Mar 29, 2023

alamb approved these changes Mar 29, 2023

berkaysynnada and others added 3 commits March 30, 2023 13:53

replace try_binary and as_datetime, and keep timezone for ts+interval op

ef1c194

Merge branch 'main' into feature/timestamp-interval-arith-query

f1e78f2

fix after merge

21e1df8

delete unused functions

b20eb77

alamb reviewed Mar 30, 2023

alamb mentioned this pull request Mar 30, 2023

Support INTERVAL SQL Type #5792

Merged

alamb reviewed Mar 30, 2023

alamb approved these changes Mar 30, 2023

alamb merged commit 04ef5c1 into apache:main Mar 30, 2023

alamb mentioned this pull request Mar 30, 2023

Minor: use workspace arrow-array rather than hard coded 34 #5794

Merged

alamb mentioned this pull request Mar 30, 2023

Minor: clean up timestamp arithmetic tests #5803

Merged

berkaysynnada deleted the feature/timestamp-interval-arith-query branch March 31, 2023 08:33

This was referenced Apr 1, 2023

Subtracting Timestamp from Timestamp should produce a Duration (not Timestamp) apache/arrow-rs#3964

Closed

Removal of arithmetic operations for temporal values to binary.rs #5846

Merged

berkaysynnada mentioned this pull request Apr 12, 2023

Temporal datatype support for interval arithmetic #5971

Merged
