-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a ScalarUDFImpl::simplfy()
API, move SimplifyInfo
et al to datafusion_expr
#9304
Conversation
3242780
to
bba4940
Compare
} | ||
} | ||
|
||
fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
port from operator to function
, schema is removed since not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?
logical_plan after simplify_expressions SAME TEXT AS ABOVE
That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No special reason. I forgot this analyzer exists. Let me take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- this is looking really nice and close. Thank you so much for working on these items and I again apologize for the delay in review.
I left some comments -- let me know what you think.
} | ||
// Wrap with Expr::Cast() to Int64 | ||
fn simplify(&self, args: &[Expr]) -> Result<Simplified> { | ||
let dfs = DFSchema::new_with_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great example, thank you @jayzhan211
It seems to me that the UDF can't always know what the input schema would be, and thus the schema should be passed in.
Something like this
fn simplify(&self, args: &[Expr], schema: &DFSchema) -> Result<Simplified> {
...
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpliyContext has DFSchemaRef, so I change the type to DFSchemaRef, I think it is also fine
} | ||
|
||
#[tokio::test] | ||
async fn test_user_defined_functions_cast_to_i64() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this test / example -- it makes seeing how the API would work really clear. 👏
datafusion/expr/src/udf.rs
Outdated
fn simplify(&self, _args: &[Expr]) -> Result<Simplified> { | ||
Ok(Simplified::Original) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to make this take self
(and avoid a copy)? Something like
fn simplify(&self, _args: &[Expr]) -> Result<Simplified> { | |
Ok(Simplified::Original) | |
} | |
fn simplify(self, _args: &[Expr]) -> Result<Simplified> { | |
Ok(Simplified::Original(self)) | |
} | |
```? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems quite hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I +Sized for UDF trait, it can't be made into an object.
} | ||
} | ||
|
||
fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?
logical_plan after simplify_expressions SAME TEXT AS ABOVE
That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113
Thanks @jayzhan211 -- I have some more thoughts on this PR and I hope to share them soon. However it may not be until tomorrow as I have a bunch of other things going on right now so I may run out of time |
I ran out of time (again) today, but I will put this at the top of my list for tomorrow |
datafusion/expr/src/udf.rs
Outdated
// Do the function rewrite. | ||
// 'args': The arguments of the function | ||
// 'schema': The schema of the function | ||
fn simplify(&self, _args: &[Expr], _schema: DFSchemaRef) -> Result<Simplified> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played around with this PR today.
I think the core challenge is that the Simplification code uses SimplifyContext
which purposely doesn't have a reference to DFSchema
. This means the simplifier can't call ScalarUDFImpl::simplify
as written.
I tried several options, and here is my suggestion:
- Move SimplifyContext into datafusion_expr
- Change the signature to
fn simplify(&self, _args: &[Expr], info: &dyn SimplifyContext) -> Result<Simplified>
Then I think calling ScalarUDFImpl::simplify
will end up being straightforward to call
The one challenge I found when I tried to do this locally is that SImplifyInfo currently depends on ExecutionPropss which is in datafusion-physical-expr -- we probably have to move that structure into DataFusion common or datafusion expr as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what you are suggesting is something like
I view this PR as strategically important (as it is needed to unlock the migration of functions out of the core) I think figuring out how to get |
{ | ||
let schema = self | ||
.info | ||
.schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can get the DFSchema from SimplifyContext here.
Thank you @alamb , I don't mind that. |
@@ -1233,6 +1272,19 @@ impl<'a, S: SimplifyInfo> TreeNodeRewriter for Simplifier<'a, S> { | |||
out_expr.rewrite(self)? | |||
} | |||
|
|||
Expr::ScalarFunction(ScalarFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
I was thinking we should stay true to the design intent on If we were going to require DFSChema I think we could simply get rid of /// The information necessary to apply algebraic simplification to an
/// [Expr]. See [SimplifyContext] for one concrete implementation.
///
/// This trait exists so that other systems can plug schema
/// information in without having to create `DFSchema` objects. If you
/// have a [`DFSchemaRef`] you can use [`SimplifyContext`]
pub trait SimplifyInfo { |
ScalarUDFImpl::simplfy()
API, move SimplifyInfo
et al to datafusion_expr
LGTM |
I plan to resolve the conflicts with this PR later today and merge this PR unless there are any additional comments. |
// Casting should be done in `simplify`, so we just return the first argument | ||
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
assert_eq!(args.len(), 1); | ||
Ok(args.first().unwrap().clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add ExprSimplifyResult::Replace(Expr)
to cover this case, and eliminate UDF call when expression is simplified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put some context to my comment, let's say if we define function f(INT, INT) = $1 + $2
we can eliminate UDF call with Alias($1 + $2, "f(a,b)")
and get UDF free plan, which would be easier to distribute across ballista cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is an excellent idea -- I did so in fea82cb
// Casting should be done in `simplify`, so we just return the first argument | ||
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
assert_eq!(args.len(), 1); | ||
Ok(args.first().unwrap().clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is an excellent idea -- I did so in fea82cb
... `DROP FUNCTION` will look for function name in all available registries (udf, udaf, udwf). `remove` may be necessary if UDaF and UDwF do not get `simplify` method from apache#9304.
Ok(ExprSimplifyResult::Simplified(new_expr)) | ||
} | ||
|
||
// Casting should be done in `simplify`, so we just return the first argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment is a bit outdated
Here we go 🚀 Thank you again @jayzhan211 for all the help getting this one tee'd up and ready to merge. I think this will be a very powerful (but simple) API |
* Add plugable function factory * cover `DROP FUNCTION` as well ... ... partially, as `SessionState` does not expose unregister_udf at the moment. * update documentation * fix doc test * Address PR comments (code organization) * Address PR comments (factory interface) * fix test after rebase * `remove`'s gone from the trait ... ... `DROP FUNCTION` will look for function name in all available registries (udf, udaf, udwf). `remove` may be necessary if UDaF and UDwF do not get `simplify` method from #9304. * Rename FunctionDefinition and export it ... FunctionDefinition already exists, DefinitionStatement makes more sense. * Update datafusion/expr/src/logical_plan/ddl.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/execution/context/mod.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/tests/user_defined/user_defined_scalar_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/expr/src/logical_plan/ddl.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * resolve part of follow up comments * Qualified functions are not supported anymore * update docs and todos * fix clippy * address additional comments * Add sqllogicteset for CREATE/DROP function * Add coverage for DROP FUNCTION IF EXISTS * fix multiline error * revert dialect back to generic in test ... ... as `create function` gets support in latest sqlparser. * fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
}) => match udf.simplify(args, info)? { | ||
ExprSimplifyResult::Original(args) => { | ||
Transformed::yes(Expr::ScalarFunction(ScalarFunction { | ||
func_def: ScalarFunctionDefinition::UDF(udf), | ||
args, | ||
})) | ||
} | ||
ExprSimplifyResult::Simplified(expr) => Transformed::no(expr), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm a bit late to the party on this one but I keep rereading this and I can't help but think the logic is inverted - Original means unchanged which should result in Transformed::no, shouldn't it? And Simplified mean changed which should result in Transformed::yes?
I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔
Or am I completely missing something (very likely)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are correct that the logic is inverted. I wil make a PR to correct it.
I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔
It is a good question about why no tests fail if Transformed::no
I think it is because the expr simplifer code doens't do anything different if the expression was transformed or not: https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L188-L189
It just looks at data
:
https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/common/src/tree_node.rs#L456
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
Closes #9289 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?