Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance simplifier by adding Canonicalize #8780

Merged
merged 18 commits into from
Jan 24, 2024
Merged

Conversation

yyy1000
Copy link
Contributor

@yyy1000 yyy1000 commented Jan 7, 2024

Which issue does this PR close?

Closes #8724

Rationale for this change

introduce new optimizer rule and optimizer can be better.

What changes are included in this PR?

  1. Reorder all BinaryExpr to or (the name of col1 is larger than col2)
  2. two unit test cases

Are these changes tested?

Yes

Are there any user-facing changes?

No

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're thinking in the right direction, but there'll need to be more work done to achieve the intended functionality. I've left some pointers to hopefully help you adjust the code as needed, but feel free to ask if you need further clarification or help

right: right.clone(),
};
let mut switch_op: Operator = op.clone();
if left.try_into_col().is_ok() && right.try_into_col().is_ok() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is better off done via a match, e.g.

match (left.as_ref(), right.as_ref()) {
    (Expr::Column(a), Expr::Column(b)) => todo!(),
    _ => todo!(),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it!
Thank you very much

Comment on lines 258 to 260
if let Some(swap_op) = op.swap() {
switch_op = swap_op;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're on the right track here with swapping the Operator, but one very important note is that for this functionality, we should only swap certain operators that it makes sense to.

For example, we cannot swap A - B to B - A, as that changes the expression's meaning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.
Here swap would return None if operator makes no sense, e.g.: -, see: https://github.com/apache/arrow-datafusion/blob/cc4289484c33e478242bf5d2b59f695fdb427ab9/datafusion/expr/src/operator.rs#L172C5-L202C6

I think in this case we will not change the order of the Expr, and and would remain the same order as before.

Comment on lines 268 to 269
// Case 2, <literal> <op> <col>
else if left.try_into_col().is_err() && right.try_into_col().is_ok() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One advantage of using match to destructure the left/right Exprs is that you would avoid cases like this, because a pitfall here is that just because an Expr is not a Expr::Column variant, doesn't mean it is a Expr::Literal variant, so failing to cast left to a Column could mean its another variant than Expr::Literal

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 8, 2024

Changed the code according to the review. :)

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better! 👍

I think the logic is good, but will see if CI passes to ensure no regression in behaviour.

I've left some more comments, mainly around the structure of the code and how to make it more ergonomic in a Rust way (in my opinion, at least)

(Expr::Column(_a), Expr::Column(_b)) => {
let left_name = left.canonical_name();
let right_name = right.canonical_name();
if left_name < right_name {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be the other way around, if you want to be consistent with the docs

/// <col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b)

Comment on lines 230 to 233
/// Canonicalize any BinaryExprs that are not in canonical form
/// <literal> <op> <col> is rewritten to <col> <op> <literal> (remember to switch the operator)
/// <col> <op> <literal> is canonical
/// <col1> <op> <col2> is rewritten so that the name of col1 sorts higher than col2 (b > a would be canonicalized to a < b)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to upgrade the docs here, for example remove the remember to switch the operator line (or change it to clarify that operator would be switched)

op: op.clone(),
right: right.clone(),
};
match (left.as_ref(), right.as_ref()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could even push the op.swap() into this match, to remove the nested if blocks, e.g.

match (left.as_ref(), right.as_ref(), op.swap()) {
    (Expr::Column(_), Expr::Column(_), Some(swapped_op)) => {
        todo!(); // swap based on name, directly using the swapped_op
    }
    (Expr::Literal(_), Expr::Column(_), Some(swapped_op)) => {
        todo!(); // swap to have column on left
    }
    _ => todo!(),
}

You could then even add the name check of left vs right as an if condition to the branch

Comment on lines 248 to 252
let mut new_expr = BinaryExpr {
left: left.clone(),
op: op.clone(),
right: right.clone(),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a cleaner way to do this, without having to clone here (since this will always clone for Expr;:BinaryExpr even if we don't end up swapping)

Could try an approach without relying on mutability of new_expr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use return in if statement to avoid new_expr in the new changes :)

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 8, 2024

Thank you for your detailed review @Jefffrey !
I changed the code, and now it looks better :)
I'm new to Rust so the code may not looks good at first, thanks again for your guidance.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yyy1000 and @Jefffrey -- this is looking like it is coming along nicely! There are definitely some CI failures to look into and I left some additional comments, but I think we are getting close

let expected = col("c2").gt(col("c1"));
assert_eq!(simplify(expr), expected);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add some additional slightly more complex tests that also use equality and constants (an important use case) such as the ones listed on #8724 (comment) ?

A=1 AND 1 = A AND A = 3 --> A = 1 AND A = 3
A=B AND A > 5 AND B=A --> A=B AND A > 5
(A=1 AND (B> 3 OR 3 < B)) --> (A = 1 AND B > 3)

Also, it would be great to add some negative tests as well like

A < 5 AND A >= 5 --> same (I realize this expression can never be true, but I don't think the simplifier knows how to handle that yet
A > B AND A > B --> same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, in the test case, for A < 5 AND A >= 5 --> same, it will make no change :)

Comment on lines 245 to 246
if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr {
match (left.as_ref(), right.as_ref(), op.swap()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do something like this to reduce the indent of this function if you want (not required, I just figured I would point it out given you say you are looking to improve your rust skills

Suggested change
if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr {
match (left.as_ref(), right.as_ref(), op.swap()) {
let Expr::BinaryExpr(BinaryExpr { left, op, right }) = expr else {
return Ok(expr)
};
match (left.as_ref(), right.as_ref(), op.swap()) {

Comment on lines 248 to 249
(Expr::Column(_), Expr::Column(_), Some(swapped_op)) => {
if right.canonical_name() > left.canonical_name() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Column implements PartialOrd you don't have to use canonical_name here (which allocates a string)

Suggested change
(Expr::Column(_), Expr::Column(_), Some(swapped_op)) => {
if right.canonical_name() > left.canonical_name() {
(Expr::Column(left_col), Expr::Column(right_col), Some(swapped_op)) => {
if left_col > right_col {

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 9, 2024

Thank you for the review @alamb !
I pushed the change to the PR and I will look into the CI failure if it's there :)

@alamb
Copy link
Contributor

alamb commented Jan 10, 2024

I kicked off the CI checks!

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 10, 2024

There's a FieldNotFound error in tests introduced by the PR. 🧐
I will dive into it and find what's wrong with it. :)

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 11, 2024

I figured it as a SchemaError :: FieldNotFound error when using Join and it seems a little complicated for me :(
Any guidance would be helpful.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try take a look at the failing test and see if I can help explain/debug 👍

@Jefffrey
Copy link
Contributor

So there's quite a few failing tests, I chose the first I saw: dataframe::tests::join:

https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/dataframe/mod.rs#L1892-L1906

Pretty simple join test.

Running it fails:

Error: SchemaError(FieldNotFound { field: Column { relation: Some(Bare { table: "c2" }), name: "c1" }, valid_fields: [Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c1" }, Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c2" }] }, Some(""))

So lets change the test to explain the output and see what the plan looks like:

    #[tokio::test]
    async fn join() -> Result<()> {
        let left = test_table().await?.select_columns(&["c1", "c2"])?;
        let right = test_table_with_name("c2")
            .await?
            .select_columns(&["c1", "c3"])?;
        let join = left.join(right, JoinType::Inner, &["c1"], &["c1"], None)?;
        join.explain(true, false)?.show().await?; // HERE
        Ok(())
    }

Gives this output (truncated only to the interesting parts):

+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                               |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Inner Join: aggregate_test_100.c1 = c2.c1                                                          |
|                                                            |   Projection: aggregate_test_100.c1, aggregate_test_100.c2                                         |
|                                                            |     TableScan: aggregate_test_100                                                                  |
|                                                            |   Projection: c2.c1, c2.c3                                                                         |
|                                                            |     TableScan: c2                                                                                  |
~~~
| logical_plan after simplify_expressions                    | Inner Join: c2.c1 = aggregate_test_100.c1                                                          |
|                                                            |   Projection: aggregate_test_100.c1, aggregate_test_100.c2                                         |
|                                                            |     TableScan: aggregate_test_100                                                                  |
|                                                            |   Projection: c2.c1, c2.c3                                                                         |
|                                                            |     TableScan: c2                                                                                  |
~~~
| logical_plan after eliminate_cross_join                    | Inner Join: aggregate_test_100.c1 = c2.c1                                                          |
|                                                            |   Projection: aggregate_test_100.c1, aggregate_test_100.c2                                         |
|                                                            |     TableScan: aggregate_test_100                                                                  |
|                                                            |   Projection: c2.c1, c2.c3                                                                         |
|                                                            |     TableScan: c2                                                                                  |
~~~
| logical_plan after simplify_expressions                    | Inner Join: c2.c1 = aggregate_test_100.c1                                                          |
|                                                            |   Projection: aggregate_test_100.c1, aggregate_test_100.c2                                         |
|                                                            |     TableScan: aggregate_test_100                                                                  |
|                                                            |   Projection: c2.c1, c2.c3                                                                         |
|                                                            |     TableScan: c2                                                                                  |
~~~
| logical_plan after optimize_projections                    | Inner Join: c2.c1 = aggregate_test_100.c1                                                          |
|                                                            |   TableScan: aggregate_test_100 projection=[c1, c2]                                                |
|                                                            |   TableScan: c2 projection=[c1, c3]                                                                |
~~~
| logical_plan after eliminate_cross_join                    | Inner Join: aggregate_test_100.c1 = c2.c1                                                          |
|                                                            |   TableScan: aggregate_test_100 projection=[c1, c2]                                                |
|                                                            |   TableScan: c2 projection=[c1, c3]                                                                |
~~~
| logical_plan after simplify_expressions                    | Inner Join: c2.c1 = aggregate_test_100.c1                                                          |
|                                                            |   TableScan: aggregate_test_100 projection=[c1, c2]                                                |
|                                                            |   TableScan: c2 projection=[c1, c3]                                                                |
~~~
| logical_plan                                               | Inner Join: c2.c1 = aggregate_test_100.c1                                                          |
|                                                            |   TableScan: aggregate_test_100 projection=[c1, c2]                                                |
|                                                            |   TableScan: c2 projection=[c1, c3]                                                                |
| initial_physical_plan                                      | Schema error: No field named c2.c1. Valid fields are aggregate_test_100.c1, aggregate_test_100.c2. |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
  • Removed rules which didn't change output

What is interesting here, is how simplify_expressions and eliminate_cross_join 'fight' over the order of the join, and in the end, simplify_expressions 'wins' with its order as there is no more eliminate_cross_join run after. This seems to suggest the order of the columns in the join expression is important.

Indeed, after some more sleuthing to find the cause (enabling backtrace feature and setting RUST_BACKTRACE=1 then running original test to see where the error originated), it is here:

https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/physical_planner.rs#L1055-L1065

Specifically line 1061.

keys is grabbed directly from the on condition of the LogicalPlan Join.

We can see here that it relies on the assumption that given a join on condition of a = b, column a must be from the left child and column b must be from the right child.

I hope this helps in explaining the error @yyy1000 (and the process to debug it)

As for how to fix it, I'm not sure. Either we respect the current behaviour of relying on the order of columns in a join on expression to be correct with respect to its children, or we try to fix that to not rely on that.

Going with the former might suggest having to ensure that this new canonicalize step won't reorder expressions in a Join on clause.

Going with the latter, I'm not so sure, would need some input from those more familiar with the Join logic in Datafusion.

Just throwing in my 2 cents

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 11, 2024

Woo, much appreciated that @Jefffrey ❤️
I learned that using explain is a good way to debug, will see the methods to fix it now. :)

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 12, 2024

As I looked into the code today,
if we respect the current behaviour, maybe we can change the info field in ExprSimplifier, and see whether info contains 'Join infomation' https://github.com/apache/arrow-datafusion/blob/d9a1d4261f78dfd3ead235d6b850c64e0ae9bec6/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L132-L148
if trying to fix that to not rely on join order, would it be implementable?

@github-actions github-actions bot added the core Core DataFusion crate label Jan 12, 2024
@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 12, 2024

Woops, I found implement the second way would be more easy.
I pushed my changes to the PR, it would try make column a the right child and column b the right child if it can't find them in left and right child.
Also, two test cases are modified because the name of t2.c1 is larger than t1.c1, and it meets expectation.

@alamb
Copy link
Contributor

alamb commented Jan 12, 2024

Retriggered CI

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 12, 2024

Ah, I believe the CI fails just because the test cases need to be changed due to bigger names would be in the front, I could do that then.

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 13, 2024

Well, I tried to fix the CI-error, but it's harder than I thought. :(
Now the problem is on, the two columns in Join, would change but some functions still pass left and right infos, didn't knowing that now the two columns in on has been swapped.
The failed test case ,

index out of bounds: the len is 2 but the index is 2

indicated that here is the reason, https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/physical-plan/src/joins/utils.rs#L688-L697

@alamb
Copy link
Contributor

alamb commented Jan 18, 2024

🤔 so this PR has turned out to be far more complicated than we imagined

What is the current state of this PR? Do we think the code is ok, and we just need to update the tests? Or do we think something more fundamental is wrong with it?

Thank you for this work @yyy1000 -- I am sorry it was not an easy first issue as I had thought

@Jefffrey
Copy link
Contributor

From what I understand, it seems this effort will be blocked as either have to:

  • Fix left/right join assumptions throughout the code base
  • Or enhance the expression simplifier to be able to tell when an Expr is from a join on clause, and if so don't do the reordering

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 18, 2024

Yeah, what @Jefffrey said is right.
I tried the first method but failed, now I'm trying the second method. It's kind of difficult for me. 🥲
Sorry for the delay. :(

@alamb
Copy link
Contributor

alamb commented Jan 19, 2024

Yeah, what @Jefffrey said is right. I tried the first method but failed, now I'm trying the second method. It's kind of difficult for me. 🥲 Sorry for the delay. :(

No worries -- thank you for your perseverence

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) and removed core Core DataFusion crate labels Jan 21, 2024
Copy link
Contributor Author

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here is what I tried the second method, now the Join would not call canonicalize.

}).collect::<Result<Vec<_>>>()?
},
_ => {
plan
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I tried for the second method, I define a new function called canonicalize in simplifier.
And the rule SimplifyExpressions will call this if the plan is not Join.
I didn't change the current simplify function because I think add a param to judge whether the Expr is from a Join plan would break a lot.

Copy link
Contributor

@Jefffrey Jefffrey Jan 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Could you also add a comment to the code here explaining this? Otherwise the reason is not immediately obvious for the near duplication of code 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 22, 2024

Ah, I see the CI failed still due to some test cases need to be changed.
Would following the CI flow https://github.com/apache/arrow-datafusion/blob/903ef94a4742a4ff6933dcf696a287972df3b5e2/.github/workflows/rust.yml#L210-L239 to see the failure in my local environment a good method?

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 22, 2024

Update: following that helps me fix the wrong test cases. :)

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 22, 2024

@alamb I think this PR is close to be merged. :)
I'd like to see your review and feedback. 😃

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now, just left a minor doc suggestion, as well as a question on the join utils changes 👍

Comment on lines 296 to 306
let on_left_reverse = &on.iter().map(|on| on.1.clone()).collect::<HashSet<_>>();
let left_missing_reverse =
on_left_reverse.difference(left).collect::<HashSet<_>>();
let on_right_reverse = &on.iter().map(|on| on.0.clone()).collect::<HashSet<_>>();
let right_missing_reverse =
on_right_reverse.difference(right).collect::<HashSet<_>>();
if !left_missing_reverse.is_empty() | !right_missing_reverse.is_empty() {
return plan_err!(
"The left or right side of the join does not have all columns on \"on\": \nMissing on the left: {left_missing:?}\nMissing on the right: {right_missing:?}"
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change still required if the order of the on clause in Joins should not be reordered, now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. I should remove this.

@alamb
Copy link
Contributor

alamb commented Jan 23, 2024

I retriggered CI and will try and check it out later today

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests passing, nice work 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sticking with this PR @yyy1000 and to @Jefffrey for the assist and review 🙏

@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

I merged this branch up from main to make sure all the CI passes, and if it does I plan to merge it in!

@alamb alamb merged commit bc0ba6a into apache:main Jan 24, 2024
22 checks passed
@alamb alamb mentioned this pull request Jan 24, 2024
@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

While reviewing this PR, I had a thought for a slightly more consistent API which I have proposed in #8983

@yyy1000 yyy1000 deleted the simplifier branch January 24, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expression simplifier does not simplify A = B AND B = A
3 participants