Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better join order resolution logic #797

Merged
merged 3 commits into from
Aug 5, 2021
Merged

Better join order resolution logic #797

merged 3 commits into from
Aug 5, 2021

Conversation

seddonm1
Copy link
Contributor

Which issue does this PR close?

Reopens #777.

Rationale for this change

The previous resolution strategy ignored the qualifications on the columns if they were present so it required unique column names for resolution. This one is smarter and allows out-of-order qualified fields in the join condition.

What changes are included in this PR?

This change makes better use of the information provided to the join logical plan builder to better resolve column relations. More work could be done to deal with other 'error' conditions (like referring to the same 'left' relation twice) but I wanted to get feedback first.

There are still outstanding issues unrelated that i am working through.

Are there any user-facing changes?

More queries will behave like the SQL standard.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jul 31, 2021
@seddonm1 seddonm1 marked this pull request as draft July 31, 2021 02:37
@seddonm1 seddonm1 marked this pull request as draft July 31, 2021 02:37
@alamb
Copy link
Contributor

alamb commented Aug 1, 2021

Thanks @seddonm1 -- I'll plan to check this one out more carefully tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @seddonm1 -- thank you. I had a few comments, but this looks like it is heading in the right direction to me. 👍

(true, _, _, true) => (Ok(l), Ok(r)),
(_, true, true, _) => (Ok(r), Ok(l)),
(_, _, _, _) => (
Err(DataFusionError::Plan(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't this error also happen if the column was found in both the left and right input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was leaving this until an initial review to understand if this was completely in the wrong direction. I can add further conditions and appropriate messages.

(Some(lr), Some(rr)) => {
let l_is_left =
self.plan.all_schemas().iter().any(|schema| {
schema.fields().iter().any(|field| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool if we could avoid the repetition here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will see what we can do. I was trying to avoid heavy work for non-qualified tables but will revisit.

let mut ctx = create_join_context_qualified()?;
let equivalent_sql = [
"SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON t1.a = t2.a ORDER BY t1.a",
"SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON t2.a = t1.a ORDER BY t1.a",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like

        "SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON a = t1.a ORDER BY t1.a",

? I would expect it to given an error about ambiguous relation name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I will add more tests.

let l_is_left =
self.plan.all_schemas().iter().any(|schema| {
schema.fields().iter().any(|field| {
field.qualifier().unwrap() == &lr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we ensuring that field.qualifier() is not None (aka how do we know that unwrap() won't panic?

Maybe this would be safer if expressed something like

field.qualifier().map(|q| q == &lr).unwrap_or(false)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks. Let me go through this more carefully.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Aug 3, 2021

Thank you for the review @alamb.

I have refactored to reduce repetition (by reusing some existing methods) but still have the ambiguous column problem outstanding.

I would like to know your thoughts on merging this one first and then I will raise another issue and make a PR to address the ambiguity (which is an issue on master already).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good @seddonm1 👍 Addressing ambiguity as a follow on PR sounds like a great idea to me

@seddonm1
Copy link
Contributor Author

seddonm1 commented Aug 4, 2021

@alamb I have done a minor cleaning pass (removing unnecessary clone) and standardising the match. I think this is good to go.

FYI I am finding quite a few bugs in the current Datafusion JOIN behaviour that I will try to work through.

@seddonm1 seddonm1 marked this pull request as ready for review August 4, 2021 21:35
Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @seddonm1 !

@alamb alamb merged commit db214bc into apache:master Aug 5, 2021
@alamb
Copy link
Contributor

alamb commented Aug 5, 2021

Thanks again @seddonm1

@houqp houqp added sql SQL Planner bug Something isn't working labels Aug 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants