-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better join order resolution logic #797
Conversation
Thanks @seddonm1 -- I'll plan to check this one out more carefully tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good @seddonm1 -- thank you. I had a few comments, but this looks like it is heading in the right direction to me. 👍
(true, _, _, true) => (Ok(l), Ok(r)), | ||
(_, true, true, _) => (Ok(r), Ok(l)), | ||
(_, _, _, _) => ( | ||
Err(DataFusionError::Plan(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't this error also happen if the column was found in both the left and right input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was leaving this until an initial review to understand if this was completely in the wrong direction. I can add further conditions and appropriate messages.
(Some(lr), Some(rr)) => { | ||
let l_is_left = | ||
self.plan.all_schemas().iter().any(|schema| { | ||
schema.fields().iter().any(|field| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool if we could avoid the repetition here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I will see what we can do. I was trying to avoid heavy work for non-qualified tables but will revisit.
let mut ctx = create_join_context_qualified()?; | ||
let equivalent_sql = [ | ||
"SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON t1.a = t2.a ORDER BY t1.a", | ||
"SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON t2.a = t1.a ORDER BY t1.a", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like
"SELECT t1.a, t2.b FROM t1 INNER JOIN t2 ON a = t1.a ORDER BY t1.a",
? I would expect it to given an error about ambiguous relation name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I will add more tests.
let l_is_left = | ||
self.plan.all_schemas().iter().any(|schema| { | ||
schema.fields().iter().any(|field| { | ||
field.qualifier().unwrap() == &lr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we ensuring that field.qualifier()
is not None
(aka how do we know that unwrap()
won't panic?
Maybe this would be safer if expressed something like
field.qualifier().map(|q| q == &lr).unwrap_or(false)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks. Let me go through this more carefully.
Thank you for the review @alamb. I have refactored to reduce repetition (by reusing some existing methods) but still have the ambiguous column problem outstanding. I would like to know your thoughts on merging this one first and then I will raise another issue and make a PR to address the ambiguity (which is an issue on master already). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good @seddonm1 👍 Addressing ambiguity as a follow on PR sounds like a great idea to me
@alamb I have done a minor cleaning pass (removing unnecessary clone) and standardising the FYI I am finding quite a few bugs in the current Datafusion JOIN behaviour that I will try to work through. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @seddonm1 !
Thanks again @seddonm1 |
Which issue does this PR close?
Reopens #777.
Rationale for this change
The previous resolution strategy ignored the qualifications on the columns if they were present so it required unique column names for resolution. This one is smarter and allows out-of-order qualified fields in the join condition.
What changes are included in this PR?
This change makes better use of the information provided to the join logical plan builder to better resolve column relations. More work could be done to deal with other 'error' conditions (like referring to the same 'left' relation twice) but I wanted to get feedback first.
There are still outstanding issues unrelated that i am working through.
Are there any user-facing changes?
More queries will behave like the SQL standard.