Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove reundant filters (e.g. c> 5 AND c>5 --> c>5) #436

Merged
merged 15 commits into from
Jun 3, 2021

Conversation

jgoday
Copy link
Contributor

@jgoday jgoday commented May 28, 2021

Which issue does this PR close?

Closes #410.

Rationale for this change

Uses boolean algebra laws to simplify binary boolean expressions.

What changes are included in this PR?

New RemoveDuplicateFilters optimizer.

Are there any user-facing changes?

@jgoday
Copy link
Contributor Author

jgoday commented May 28, 2021

(c > 5) OR ((d < 6) AND (c > 5) -- can't remove

@alamb
I Have updated the unit tests, but
(c > 5) OR ((d < 6) AND (c > 5)) should be simplified to (c > 5), isn't it?

Maybe you mean a different expression ?


fn simplify<'a>(expr: &'a Expr) -> Expr {
match expr {
Expr::BinaryExpr { left, op: _, right } if left == right => simplify(left),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not correct to me, it seems to "simplify" a + a to a too?

Copy link
Contributor

@Dandandan Dandandan May 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am seeing one other example from the test 1=1 is also a good example, it now is rewritten to 1

Copy link
Contributor Author

@jgoday jgoday May 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I was wrongly assuming that all expressions are boolean so ... my mistake.
I have updated it. I hope I can check your other comments soon. Thanks !

@Dandandan
Copy link
Contributor

Dandandan commented May 29, 2021

Looks like a good start!

I think it's good to think of these rules as simple rewrite rules we can apply to the expression tree.

p OR p => p
p AND p => p

Some weeks ago I implemented some rules in this custom DataFusion optimizer (based on graph rewriting), you can see them here:

https://github.com/Dandandan/datafusion-tokomak/blob/main/src/lib.rs#L44
There are a lot of other patterns we can add if we do term rewriting in this way.

Also, I think we should make this optimizer run on any Expr, not just those in Filter as this would benefit any Expr (in projections, aggregations, joins, etc)

@alamb
Copy link
Contributor

alamb commented May 29, 2021

(c > 5) OR ((d < 6) AND (c > 5)) should be simplified to (c > 5), isn't it?

Yes you are correct -- I guess I was quickly trying to cook up examples similar to @Dandandan that showed searching for an arbitrary instance of the same expression (even if it appeared in some nested structure) was going to incorrectly remove the nested instance in some cases

@jgoday
Copy link
Contributor Author

jgoday commented May 31, 2021

@Dandandan I've just added some more simplification rules (from https://github.com/Dandandan/datafusion-tokomak/blob/main/src/lib.rs#L44, as you mentioned before), What do you think ?

Still have to test and add more LogicalPlan cases (Projection, Window, Aggregation, ...) to optimize function.

/// # Introduction
/// It uses boolean algebra laws to simplify or reduce the number of terms in expressions.
///
/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can best use b > 2 AND b > 2 here in the docs.
Hopefully we can implement a better way of displaying expressions soon!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #347 for anyone following along :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, but I'm afraid this file docs will have to be fixed/completed later ...

/// Filter: #b Gt Int32(2)
pub struct RemoveDuplicateFilters {}

fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to use lifetime specifiers everywhere - we can't use just use &Expr in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, should be fixed now.

}

fn is_one<'a>(s: &'a Expr) -> bool {
match s {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are really good use cases for the matches! macro https://doc.rust-lang.org/std/macro.matches.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, nice macro (I did not know it).

Sorry if i'm being too clumsy, but how can I use it if I have to guard two different cases (each one with a different variable/type) ?

matches!(s,
   Scalar::Int8(1) |
   Scalar::Int16(1) |
   Scalar::Float32(Some(v)) if *v == 1. |
   Scalar::Float64(Some(v)) if *v == 1. )

Seems that maches! only allows one last guard, and the existing binding must have the same type in all alternatives.

Is there any other way ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A matches! returns a boolean so you can also use || / boolean or to cover the float cases and keep the rest in a single matches!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or it might be better to just use | patterns (without matches!) in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that last option :)

}
}

fn operator_is_boolean(op: &Operator) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator is probably Copy so no need to add &

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! should be fixed in latest commit

.map(|input| optimize(input))
.collect::<Result<Vec<_>>>()?;

let expr = plan.expressions();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use something like plan.expressions().into_iter().map(|x| simplify(x)).collect::<Vec<_>>() ? and the other patterns could be removed here? (So it also applies for any node containing expressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be valid ?

fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> {
    let new_inputs = plan
        .inputs()
        .iter()
        .map(|input| optimize(input))
        .collect::<Result<Vec<_>>>()?;
    let expr = plan
        .expressions()
        .into_iter()
        .map(|x| simplify(&x))
        .collect::<Vec<_>>();
    utils::from_plan(&plan, &expr, &new_inputs)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks right to me!

@@ -25,4 +25,5 @@ pub mod hash_build_probe_order;
pub mod limit_push_down;
pub mod optimizer;
pub mod projection_push_down;
pub mod remove_duplicate_filters;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should rename it now to something like "simplify expressions".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename it to 'optimizer::simplify_expressions' ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a more accurate name by now 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed to simplify_expressions.

left,
op: Operator::Minus,
right
} if left == right => lit(0),
Copy link
Contributor

@Dandandan Dandandan May 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed those by now from this PR #441 as I think they might be wrong in precense of nulls (null - null is not 0 ?) null * 0 is not 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, already removed.
maybe there should be some specific optimizations for arithmetic operators ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to do the optimization if we know the input is non null based on the schema. But probably something to find out later 👍

@codecov-commenter
Copy link

codecov-commenter commented Jun 1, 2021

Codecov Report

Merging #436 (af2c6ac) into master (db4f098) will increase coverage by 0.98%.
The diff coverage is 87.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #436      +/-   ##
==========================================
+ Coverage   74.94%   75.92%   +0.98%     
==========================================
  Files         146      154       +8     
  Lines       24314    26195    +1881     
==========================================
+ Hits        18221    19889    +1668     
- Misses       6093     6306     +213     
Impacted Files Coverage Δ
datafusion/src/execution/context.rs 92.08% <ø> (+0.03%) ⬆️
datafusion/src/optimizer/simplify_expressions.rs 87.45% <87.45%> (ø)
datafusion-cli/src/print_format.rs 81.25% <0.00%> (-9.17%) ⬇️
datafusion/src/physical_plan/mod.rs 78.70% <0.00%> (-4.06%) ⬇️
datafusion/src/physical_plan/window_functions.rs 85.71% <0.00%> (-3.01%) ⬇️
...tafusion/src/physical_plan/datetime_expressions.rs 67.29% <0.00%> (-2.52%) ⬇️
datafusion/src/physical_plan/common.rs 84.21% <0.00%> (-2.00%) ⬇️
datafusion/src/optimizer/hash_build_probe_order.rs 58.53% <0.00%> (-1.97%) ⬇️
datafusion/src/physical_plan/repartition.rs 82.45% <0.00%> (-1.89%) ⬇️
ballista/rust/scheduler/src/planner.rs 66.91% <0.00%> (-0.74%) ⬇️
... and 57 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db4f098...af2c6ac. Read the comment docs.

left,
op: Operator::Or,
right,
} if expr_contains(left, right) => as_binary_expr(left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems if this matches any oppurtunity to go deeper is lost? Is this pattern still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is needed for expressions like '((c > 5) AND (d < 6)) AND (c > 5)'.
Nested simplification should be fixed by latest commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it makes sense, I missed that expr_contains pattern matches on or/and 👍

left,
op: Operator::And,
right,
} if is_false(left) || is_false(right) => lit(false),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not correct I think if a null is present

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm this is actually correct 👍

left,
op: Operator::And,
right,
} if expr_contains(left, right) => as_binary_expr(left)
Copy link
Contributor

@Dandandan Dandandan Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here - is this still needed and doesn't this remove certain opportunities as the recursive call isn't done?

left,
op: Operator::Divide,
right,
} if left == right => lit(1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct if both sides can be null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, should be fixed now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged it, but see this part might still be not 100% correct.

if left and right are a column, the value can still be null (for some evaluations).
Also the division by 0 (e.g. 0/0 or x/x where x is 0) is tricky.
I think there is no easy way to do it safely.

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Dandandan
Copy link
Contributor

It looks good to me @jgoday
Could you fix the clippy and formatting errors?

@jgoday
Copy link
Contributor Author

jgoday commented Jun 2, 2021

It looks good to me @jgoday
Could you fix the clippy and formatting errors?

Should be fixed now 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jgoday -- this looks like a good step forward. 👍

As I mentioned I think this logic could be put into the constant folding pass but we can always do that as a follow on

datafusion/src/optimizer/simplify_expressions.rs Outdated Show resolved Hide resolved
datafusion/src/optimizer/simplify_expressions.rs Outdated Show resolved Hide resolved
/// Filter: b > 2 AND b > 2
/// is optimized to
/// Filter: b > 2
pub struct SimplifyExpressions {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't help but notice the similarity between this pass and ConstantFolding: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/constant_folding.rs#L130

Perhaps as a follow on PR, we could combine the code into a single pass


#[test]
fn test_simplify_divide_by_same() -> Result<()> {
let expr = binary_expr(col("c"), Operator::Divide, col("c"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we care that in some cases (where c has a 0) this simplification will avoid a runtime error when it would have generated one without the optimization pass

I personally think it is ok, but wanted to mention it

datafusion/src/optimizer/simplify_expressions.rs Outdated Show resolved Hide resolved
jgoday and others added 5 commits June 3, 2021 07:34
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Dandandan Dandandan merged commit 4392eea into apache:master Jun 3, 2021
@Dandandan
Copy link
Contributor

Awesome thanks @jgoday 💯

@alamb
Copy link
Contributor

alamb commented Jun 3, 2021

Thanks again @jgoday 👍

@jgoday
Copy link
Contributor Author

jgoday commented Jun 3, 2021

Thank you @Dandandan and @alamb for all your help!

@houqp houqp added datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
6 participants