-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove reundant filters (e.g. c> 5 AND c>5 --> c>5) #436
Conversation
@alamb Maybe you mean a different expression ? |
|
||
fn simplify<'a>(expr: &'a Expr) -> Expr { | ||
match expr { | ||
Expr::BinaryExpr { left, op: _, right } if left == right => simplify(left), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not correct to me, it seems to "simplify" a + a
to a
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing one other example from the test 1=1
is also a good example, it now is rewritten to 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I was wrongly assuming that all expressions are boolean so ... my mistake.
I have updated it. I hope I can check your other comments soon. Thanks !
Looks like a good start! I think it's good to think of these rules as simple rewrite rules we can apply to the expression tree.
Some weeks ago I implemented some rules in this custom DataFusion optimizer (based on graph rewriting), you can see them here: https://github.com/Dandandan/datafusion-tokomak/blob/main/src/lib.rs#L44 Also, I think we should make this optimizer run on any |
Yes you are correct -- I guess I was quickly trying to cook up examples similar to @Dandandan that showed searching for an arbitrary instance of the same expression (even if it appeared in some nested structure) was going to incorrectly remove the nested instance in some cases |
@Dandandan I've just added some more simplification rules (from https://github.com/Dandandan/datafusion-tokomak/blob/main/src/lib.rs#L44, as you mentioned before), What do you think ? Still have to test and add more LogicalPlan cases (Projection, Window, Aggregation, ...) to optimize function. |
/// # Introduction | ||
/// It uses boolean algebra laws to simplify or reduce the number of terms in expressions. | ||
/// | ||
/// Filter: #b Gt Int32(2) And #b Gt Int32(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can best use b > 2 AND b > 2
here in the docs.
Hopefully we can implement a better way of displaying expressions soon!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in #347 for anyone following along :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed, but I'm afraid this file docs will have to be fixed/completed later ...
/// Filter: #b Gt Int32(2) | ||
pub struct RemoveDuplicateFilters {} | ||
|
||
fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to use lifetime specifiers everywhere - we can't use just use &Expr
in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, should be fixed now.
} | ||
|
||
fn is_one<'a>(s: &'a Expr) -> bool { | ||
match s { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are really good use cases for the matches!
macro https://doc.rust-lang.org/std/macro.matches.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, nice macro (I did not know it).
Sorry if i'm being too clumsy, but how can I use it if I have to guard two different cases (each one with a different variable/type) ?
matches!(s,
Scalar::Int8(1) |
Scalar::Int16(1) |
Scalar::Float32(Some(v)) if *v == 1. |
Scalar::Float64(Some(v)) if *v == 1. )
Seems that maches! only allows one last guard, and the existing binding must have the same type in all alternatives.
Is there any other way ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A matches!
returns a boolean so you can also use ||
/ boolean or to cover the float cases and keep the rest in a single matches!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or it might be better to just use |
patterns (without matches!
) in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that last option :)
} | ||
} | ||
|
||
fn operator_is_boolean(op: &Operator) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Operator is probably Copy
so no need to add &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! should be fixed in latest commit
.map(|input| optimize(input)) | ||
.collect::<Result<Vec<_>>>()?; | ||
|
||
let expr = plan.expressions(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use something like plan.expressions().into_iter().map(|x| simplify(x)).collect::<Vec<_>>()
? and the other patterns could be removed here? (So it also applies for any node containing expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be valid ?
fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> {
let new_inputs = plan
.inputs()
.iter()
.map(|input| optimize(input))
.collect::<Result<Vec<_>>>()?;
let expr = plan
.expressions()
.into_iter()
.map(|x| simplify(&x))
.collect::<Vec<_>>();
utils::from_plan(&plan, &expr, &new_inputs)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks right to me!
datafusion/src/optimizer/mod.rs
Outdated
@@ -25,4 +25,5 @@ pub mod hash_build_probe_order; | |||
pub mod limit_push_down; | |||
pub mod optimizer; | |||
pub mod projection_push_down; | |||
pub mod remove_duplicate_filters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should rename it now to something like "simplify expressions".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename it to 'optimizer::simplify_expressions' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a more accurate name by now 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, changed to simplify_expressions.
left, | ||
op: Operator::Minus, | ||
right | ||
} if left == right => lit(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed those by now from this PR #441 as I think they might be wrong in precense of nulls (null - null
is not 0
?) null * 0
is not 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, already removed.
maybe there should be some specific optimizations for arithmetic operators ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be possible to do the optimization if we know the input is non null based on the schema. But probably something to find out later 👍
Codecov Report
@@ Coverage Diff @@
## master #436 +/- ##
==========================================
+ Coverage 74.94% 75.92% +0.98%
==========================================
Files 146 154 +8
Lines 24314 26195 +1881
==========================================
+ Hits 18221 19889 +1668
- Misses 6093 6306 +213
Continue to review full report at Codecov.
|
left, | ||
op: Operator::Or, | ||
right, | ||
} if expr_contains(left, right) => as_binary_expr(left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems if this matches any oppurtunity to go deeper is lost? Is this pattern still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is needed for expressions like '((c > 5) AND (d < 6)) AND (c > 5)'.
Nested simplification should be fixed by latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it makes sense, I missed that expr_contains
pattern matches on or/and 👍
left, | ||
op: Operator::And, | ||
right, | ||
} if is_false(left) || is_false(right) => lit(false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also not correct I think if a null
is present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm this is actually correct 👍
left, | ||
op: Operator::And, | ||
right, | ||
} if expr_contains(left, right) => as_binary_expr(left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here - is this still needed and doesn't this remove certain opportunities as the recursive call isn't done?
left, | ||
op: Operator::Divide, | ||
right, | ||
} if left == right => lit(1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't correct if both sides can be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, should be fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged it, but see this part might still be not 100% correct.
if left and right are a column, the value can still be null (for some evaluations).
Also the division by 0 (e.g. 0/0 or x/x where x is 0) is tricky.
I think there is no easy way to do it safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It looks good to me @jgoday |
Should be fixed now 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jgoday -- this looks like a good step forward. 👍
As I mentioned I think this logic could be put into the constant folding pass but we can always do that as a follow on
/// Filter: b > 2 AND b > 2 | ||
/// is optimized to | ||
/// Filter: b > 2 | ||
pub struct SimplifyExpressions {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't help but notice the similarity between this pass and ConstantFolding
: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/constant_folding.rs#L130
Perhaps as a follow on PR, we could combine the code into a single pass
|
||
#[test] | ||
fn test_simplify_divide_by_same() -> Result<()> { | ||
let expr = binary_expr(col("c"), Operator::Divide, col("c")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we care that in some cases (where c has a 0
) this simplification will avoid a runtime error when it would have generated one without the optimization pass
I personally think it is ok, but wanted to mention it
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Awesome thanks @jgoday 💯 |
Thanks again @jgoday 👍 |
Thank you @Dandandan and @alamb for all your help! |
Which issue does this PR close?
Closes #410.
Rationale for this change
Uses boolean algebra laws to simplify binary boolean expressions.
What changes are included in this PR?
New RemoveDuplicateFilters optimizer.
Are there any user-facing changes?