Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Timestamp with timezone not considered join on #8150

Merged
merged 2 commits into from
Nov 17, 2023

Conversation

acking-you
Copy link
Contributor

Which issue does this PR close?

part of #8147

Rationale for this change

The join filters all use the ExtractEquijoinPredicate optimizer to make the eq expression the key for the join on, and then the presence of join on determines whether to use HashJoinExec or NestedJoinExec. In this process, can _hash function determines whether an expression type can be hashed, but currently this function does not support timestamps with timezone.

What changes are included in this PR?

Modified the can_hash function to support timestamps with timezone.

Are these changes tested?

Are there any user-facing changes?

No

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Nov 13, 2023
@alamb
Copy link
Contributor

alamb commented Nov 13, 2023

Thank you @acking-you - can you please add a test for this behavior -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/joins.slt ?

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 14, 2023
@acking-you
Copy link
Contributor Author

Thank you @acking-you - can you please add a test for this behavior -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/joins.slt ?

👌 I'm done testing.

@@ -2468,6 +2489,14 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3

# test timestamp with timezone join on nanos datatype
query PPPPTPPPPT rowsort
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.nanos = t2.nanos;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.nanos = t2.nanos;
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.nanos = t2.nanos;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original test on timestamp used as, do we need to change it to AS?

@@ -2476,6 +2505,14 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3

# test timestamp with timezone join on micros datatype
query PPPPTPPPPT rowsort
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.micros = t2.micros
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.micros = t2.micros
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.micros = t2.micros

@@ -2484,6 +2521,46 @@ SELECT * FROM test_timestamps_table as t1 JOIN (SELECT * FROM test_timestamps_ta
2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0 2018-11-13T17:11:10.011375885 2018-11-13T17:11:10.011375 2018-11-13T17:11:10.011 2018-11-13T17:11:10 Row 0
2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10.432 2021-01-01T05:11:10 Row 3

# test timestamp with timezone join on millis datatype
query PPPPTPPPPT rowsort
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.millis = t2.millis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SELECT * FROM test_timestamps_tz_table as t1 JOIN (SELECT * FROM test_timestamps_tz_table ) as t2 ON t1.millis = t2.millis
SELECT * FROM test_timestamps_tz_table AS t1 JOIN test_timestamps_tz_table AS t2 ON t1.millis = t2.millis

@alamb alamb changed the title fix: Timestamp with timezone not considerd join on fix: Timestamp with timezone not considered join on Nov 14, 2023
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR can be affected by #8193

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @acking-you -- thank you very much 🙏

@@ -901,7 +901,7 @@ pub fn can_hash(data_type: &DataType) -> bool {
DataType::UInt64 => true,
DataType::Float32 => true,
DataType::Float64 => true,
DataType::Timestamp(time_unit, None) => match time_unit {
DataType::Timestamp(time_unit, _) => match time_unit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like there is no reason to check the timeunit either (as all branches are true)

However that is not something introduced in this PR

----TableScan: test_timestamps_tz_table projection=[nanos, micros, millis, secs, names]
physical_plan
CoalesceBatchesExec: target_batch_size=2
--HashJoinExec: mode=Partitioned, join_type=Inner, on=[(millis@2, millis@2)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this test covers the code, and without the changes in this PR and it results in NestedLoopsJoin 👍

+   NestedLoopJoinExec: join_type=Inner, filter=millis@0 = millis@1
+   --RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
+   ----MemoryExec: partitions=1, partition_sizes=[1]
+   --MemoryExec: partitions=1, partition_sizes=[1]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants