-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance simplifier by adding Canonicalize #8780
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're thinking in the right direction, but there'll need to be more work done to achieve the intended functionality. I've left some pointers to hopefully help you adjust the code as needed, but feel free to ask if you need further clarification or help
right: right.clone(), | ||
}; | ||
let mut switch_op: Operator = op.clone(); | ||
if left.try_into_col().is_ok() && right.try_into_col().is_ok() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check is better off done via a match, e.g.
match (left.as_ref(), right.as_ref()) {
(Expr::Column(a), Expr::Column(b)) => todo!(),
_ => todo!(),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it!
Thank you very much
if let Some(swap_op) = op.swap() { | ||
switch_op = swap_op; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're on the right track here with swapping the Operator, but one very important note is that for this functionality, we should only swap certain operators that it makes sense to.
For example, we cannot swap A - B
to B - A
, as that changes the expression's meaning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
Here swap
would return None if operator makes no sense, e.g.: -
, see: https://github.com/apache/arrow-datafusion/blob/cc4289484c33e478242bf5d2b59f695fdb427ab9/datafusion/expr/src/operator.rs#L172C5-L202C6
I think in this case we will not change the order of the Expr, and and would remain the same order as before.
// Case 2, <literal> <op> <col> | ||
else if left.try_into_col().is_err() && right.try_into_col().is_ok() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One advantage of using match
to destructure the left
/right
Expr
s is that you would avoid cases like this, because a pitfall here is that just because an Expr
is not a Expr::Column
variant, doesn't mean it is a Expr::Literal
variant, so failing to cast left
to a Column
could mean its another variant than Expr::Literal
Changed the code according to the review. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking better! 👍
I think the logic is good, but will see if CI passes to ensure no regression in behaviour.
I've left some more comments, mainly around the structure of the code and how to make it more ergonomic in a Rust way (in my opinion, at least)
(Expr::Column(_a), Expr::Column(_b)) => { | ||
let left_name = left.canonical_name(); | ||
let right_name = right.canonical_name(); | ||
if left_name < right_name { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be the other way around, if you want to be consistent with the docs
/// <col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b)
/// Canonicalize any BinaryExprs that are not in canonical form | ||
/// <literal> <op> <col> is rewritten to <col> <op> <literal> (remember to switch the operator) | ||
/// <col> <op> <literal> is canonical | ||
/// <col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to upgrade the docs here, for example remove the remember to switch the operator
line (or change it to clarify that operator would be switched)
op: op.clone(), | ||
right: right.clone(), | ||
}; | ||
match (left.as_ref(), right.as_ref()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could even push the op.swap()
into this match, to remove the nested if blocks, e.g.
match (left.as_ref(), right.as_ref(), op.swap()) {
(Expr::Column(_), Expr::Column(_), Some(swapped_op)) => {
todo!(); // swap based on name, directly using the swapped_op
}
(Expr::Literal(_), Expr::Column(_), Some(swapped_op)) => {
todo!(); // swap to have column on left
}
_ => todo!(),
}
You could then even add the name check of left vs right as an if condition to the branch
let mut new_expr = BinaryExpr { | ||
left: left.clone(), | ||
op: op.clone(), | ||
right: right.clone(), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a cleaner way to do this, without having to clone here (since this will always clone for Expr;:BinaryExpr
even if we don't end up swapping)
Could try an approach without relying on mutability of new_expr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use return
in if
statement to avoid new_expr
in the new changes :)
Thank you for your detailed review @Jefffrey ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let expected = col("c2").gt(col("c1")); | ||
assert_eq!(simplify(expr), expected); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add some additional slightly more complex tests that also use equality and constants (an important use case) such as the ones listed on #8724 (comment) ?
A=1 AND 1 = A AND A = 3 --> A = 1 AND A = 3
A=B AND A > 5 AND B=A --> A=B AND A > 5
(A=1 AND (B> 3 OR 3 < B)) --> (A = 1 AND B > 3)
Also, it would be great to add some negative tests as well like
A < 5 AND A >= 5 --> same (I realize this expression can never be true, but I don't think the simplifier knows how to handle that yet
A > B AND A > B --> same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, in the test case, for A < 5 AND A >= 5 --> same
, it will make no change :)
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr { | ||
match (left.as_ref(), right.as_ref(), op.swap()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do something like this to reduce the indent of this function if you want (not required, I just figured I would point it out given you say you are looking to improve your rust skills
if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr { | |
match (left.as_ref(), right.as_ref(), op.swap()) { | |
let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr else { | |
return Ok(expr) | |
}; | |
match (left.as_ref(), right.as_ref(), op.swap()) { |
(Expr::Column(_), Expr::Column(_), Some(swapped_op)) => { | ||
if right.canonical_name() > left.canonical_name() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Column implements PartialOrd
you don't have to use canonical_name
here (which allocates a string)
(Expr::Column(_), Expr::Column(_), Some(swapped_op)) => { | |
if right.canonical_name() > left.canonical_name() { | |
(Expr::Column(left_col), Expr::Column(right_col), Some(swapped_op)) => { | |
if left_col > right_col { |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thank you for the review @alamb ! |
I kicked off the CI checks! |
There's a |
I figured it as a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try take a look at the failing test and see if I can help explain/debug 👍
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
So there's quite a few failing tests, I chose the first I saw: Pretty simple join test. Running it fails:
So lets change the test to explain the output and see what the plan looks like: #[tokio::test]
async fn join() -> Result<()> {
let left = test_table().await?.select_columns(&["c1", "c2"])?;
let right = test_table_with_name("c2")
.await?
.select_columns(&["c1", "c3"])?;
let join = left.join(right, JoinType::Inner, &["c1"], &["c1"], None)?;
join.explain(true, false)?.show().await?; // HERE
Ok(())
} Gives this output (truncated only to the interesting parts):
What is interesting here, is how Indeed, after some more sleuthing to find the cause (enabling Specifically line 1061.
We can see here that it relies on the assumption that given a join on condition of I hope this helps in explaining the error @yyy1000 (and the process to debug it) As for how to fix it, I'm not sure. Either we respect the current behaviour of relying on the order of columns in a join on expression to be correct with respect to its children, or we try to fix that to not rely on that. Going with the former might suggest having to ensure that this new canonicalize step won't reorder expressions in a Join on clause. Going with the latter, I'm not so sure, would need some input from those more familiar with the Join logic in Datafusion. Just throwing in my 2 cents |
Woo, much appreciated that @Jefffrey ❤️ |
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
As I looked into the code today, |
Woops, I found implement the second way would be more easy. |
Retriggered CI |
Ah, I believe the CI fails just because the test cases need to be changed due to |
Well, I tried to fix the CI-error, but it's harder than I thought. :(
indicated that here is the reason, https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/physical-plan/src/joins/utils.rs#L688-L697 |
🤔 so this PR has turned out to be far more complicated than we imagined What is the current state of this PR? Do we think the code is ok, and we just need to update the tests? Or do we think something more fundamental is wrong with it? Thank you for this work @yyy1000 -- I am sorry it was not an easy first issue as I had thought |
From what I understand, it seems this effort will be blocked as either have to:
|
Yeah, what @Jefffrey said is right. |
No worries -- thank you for your perseverence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, here is what I tried the second method, now the Join
would not call canonicalize
.
}).collect::<Result<Vec<_>>>()? | ||
}, | ||
_ => { | ||
plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I tried for the second method, I define a new function called canonicalize
in simplifier
.
And the rule SimplifyExpressions
will call this if the plan
is not Join
.
I didn't change the current simplify
function because I think add a param to judge whether the Expr
is from a Join
plan would break a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Could you also add a comment to the code here explaining this? Otherwise the reason is not immediately obvious for the near duplication of code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
Ah, I see the CI failed still due to some test cases need to be changed. |
Update: following that helps me fix the wrong test cases. :) |
@alamb I think this PR is close to be merged. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good now, just left a minor doc suggestion, as well as a question on the join utils changes 👍
datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs
Outdated
Show resolved
Hide resolved
let on_left_reverse = &on.iter().map(|on| on.1.clone()).collect::<HashSet<_>>(); | ||
let left_missing_reverse = | ||
on_left_reverse.difference(left).collect::<HashSet<_>>(); | ||
let on_right_reverse = &on.iter().map(|on| on.0.clone()).collect::<HashSet<_>>(); | ||
let right_missing_reverse = | ||
on_right_reverse.difference(right).collect::<HashSet<_>>(); | ||
if !left_missing_reverse.is_empty() | !right_missing_reverse.is_empty() { | ||
return plan_err!( | ||
"The left or right side of the join does not have all columns on \"on\": \nMissing on the left: {left_missing:?}\nMissing on the right: {right_missing:?}" | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change still required if the order of the on
clause in Joins should not be reordered, now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. I should remove this.
I retriggered CI and will try and check it out later today |
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests passing, nice work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged this branch up from main to make sure all the CI passes, and if it does I plan to merge it in! |
While reviewing this PR, I had a thought for a slightly more consistent API which I have proposed in #8983 |
Which issue does this PR close?
Closes #8724
Rationale for this change
introduce new optimizer rule and optimizer can be better.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No