-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Timestamp with timezone not considered join on
#8150
Conversation
Thank you @acking-you - can you please add a test for this behavior -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/joins.slt ? |
👌 I'm done testing. |
4ba7a8b
to
37564aa
Compare
@@ -2468,6 +2489,14 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta | |||
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 | |||
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 | |||
|
|||
# test timestamp with timezone join on nanos datatype | |||
query PPPPTPPPPT rowsort | |||
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.nanos = t2.nanos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.nanos = t2.nanos; | |
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.nanos = t2.nanos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original test on timestamp used as
, do we need to change it to AS
?
@@ -2476,6 +2505,14 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta | |||
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 | |||
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 | |||
|
|||
# test timestamp with timezone join on micros datatype | |||
query PPPPTPPPPT rowsort | |||
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.micros = t2.micros |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.micros = t2.micros | |
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.micros = t2.micros |
@@ -2484,6 +2521,46 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta | |||
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 | |||
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 | |||
|
|||
# test timestamp with timezone join on millis datatype | |||
query PPPPTPPPPT rowsort | |||
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.millis = t2.millis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.millis = t2.millis | |
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.millis = t2.millis |
join on
join on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR can be affected by #8193
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @acking-you -- thank you very much 🙏
@@ -901,7 +901,7 @@ pub fn can_hash(data_type: &DataType) -> bool { | |||
DataType::UInt64 => true, | |||
DataType::Float32 => true, | |||
DataType::Float64 => true, | |||
DataType::Timestamp(time_unit, None) => match time_unit { | |||
DataType::Timestamp(time_unit, _) => match time_unit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like there is no reason to check the timeunit either (as all branches are true)
However that is not something introduced in this PR
----TableScan: test_timestamps_tz_table projection=[nanos, micros, millis, secs, names] | ||
physical_plan | ||
CoalesceBatchesExec: target_batch_size=2 | ||
--HashJoinExec: mode=Partitioned, join_type=Inner, on=[(millis@2, millis@2)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that this test covers the code, and without the changes in this PR and it results in NestedLoopsJoin
👍
+ NestedLoopJoinExec: join_type=Inner, filter=millis@0 = millis@1
+ --RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
+ ----MemoryExec: partitions=1, partition_sizes=[1]
+ --MemoryExec: partitions=1, partition_sizes=[1]
Which issue does this PR close?
part of #8147
Rationale for this change
The join filters all use the ExtractEquijoinPredicate optimizer to make the eq expression the key for the join on, and then the presence of
join on
determines whether to useHashJoinExec
orNestedJoinExec
. In this process,can _hash
function determines whether an expression type can be hashed, but currently this function does not support timestamps with timezone.What changes are included in this PR?
Modified the
can_hash
function to support timestamps with timezone.Are these changes tested?
Are there any user-facing changes?
No