POC: Splitting scalar functions outside Datafusion core #7752
Conversation
This is really cool @2010YOUY01 -- thank you very much 🙏

I suggest the next step is to get some more feedback (I will post this PR's link around and see if we can get anyone else to comment on it).

Then I agree it would be great to try porting 2-3 existing `BuiltInScalarFunction`s and see what they look like with the new API.

It is great to see this work pushed forward.
> 2. Managing extension points
> Another issue is how to manage those pluggable crates, possibly with an `extension/` folder under the datafusion root? Recently I saw discussions about pulling the parquet reader out of the core; those optional extension points could all be placed inside the `extension/` folder.

I left a suggestion inline
```diff
@@ -41,6 +41,7 @@ datafusion-common = { path = "../datafusion/common" }
 datafusion-expr = { path = "../datafusion/expr" }
 datafusion-optimizer = { path = "../datafusion/optimizer" }
 datafusion-sql = { path = "../datafusion/sql" }
+datafusion-extension-test-scalar-func = { path = "../extension/scalar-function/test-func" }
```
Since most of DataFusion is extensions, another possibility might be a path structure like the following?

```text
├── core                             # (already exists)
├── functions-aggregate-core
├── functions-aggregate-statistics
├── functions-scalar-array
├── functions-scalar-core
├── functions-scalar-crypto
├── functions-scalar-timestamp
├── physical-expr                    # (already exists)
...
└── physical-plan                    # (already exists)
```
```diff
@@ -792,6 +794,30 @@ impl SessionContext {
         .add_var_provider(variable_type, provider);
     }

+    /// Register a function package into this context
+    pub fn register_scalar_function_package(
```
If we are going to implement a new API, maybe we can have it cover all types of functions (window, aggregate, scalar) so there is a place to add table functions eventually too. Something like:

```rust
pub fn register_function_package(&self, func_pkg: Box<dyn FunctionPackage>)
```

Also I wonder if it should take `Box<dyn FunctionPackage>` (which is owned) or `Arc<dyn FunctionPackage>`, which can be shared. Looking at this code, it seems like there is no reason for an owned pointer and we could get away with `Arc` (so function packages can be quickly registered with multiple `SessionContext`s).
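For illustration, a self-contained toy sketch of the `Arc`-based variant (these stand-in types are simplified assumptions, not DataFusion's real definitions):

```rust
use std::sync::Arc;

// Toy stand-ins for the PR's types, just to illustrate the ownership question.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
}

trait FunctionPackage: Send + Sync {
    fn functions(&self) -> Vec<Arc<dyn ScalarFunctionDef>>;
}

struct SessionContext {
    registered: Vec<Arc<dyn ScalarFunctionDef>>,
}

impl SessionContext {
    // Taking a shared `Arc<dyn FunctionPackage>` instead of an owned `Box`
    // means one package instance can be registered with many contexts.
    fn register_function_package(&mut self, pkg: Arc<dyn FunctionPackage>) {
        self.registered.extend(pkg.functions());
    }
}
```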
```rust
// TODO: ReturnTypeFunction -> a ENUM
// most function's return type is either the same as 1st arg or a fixed type
fn return_type(&self) -> ReturnTypeFunction;
```
I think the current API requires `return_type` to be passed the specific argument types.

Also, #7657 suggests there is additional room for improvement.
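For comparison, a hypothetical sketch of a signature that receives the concrete argument types directly (this exact shape is an assumption, not code from this PR):

```rust
use arrow::datatypes::DataType;

// Sketch: rather than returning a ReturnTypeFunction closure, the trait
// method itself could take the argument types and compute the result type.
trait ScalarFunctionDefSketch {
    fn return_type(&self, arg_types: &[DataType]) -> DataType;
}
```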
I think we might mark this API as experimental for now; it is likely to change frequently, and other unexpected cases like #7657 might show up when porting some trickier functions.
I like that idea --
maybe the implementation plan sequence could be:

- Introduce `ScalarFunctionDef`; `impl ScalarFunctionDef` for `ScalarUDF` and `BuiltInFunction`
- Update the core DataFusion code to only use `ScalarFunctionDef`
- Port all `BuiltInFunction`s to `ScalarFunctionDef`
- Deprecate `ScalarUDF`
- Remove `BuiltInFunction`
Sounds great 👍🏼 My plan is:

1st PR: `impl ScalarFunctionDef for BuiltinScalarFunction` and replace the execution code with the new interface. This step ensures the new interface can cover all the existing functionality, and determines what the interface looks like.

2nd PR: replace the current UDF implementation with the new interface.

3rd and after: start porting existing functions.
```rust
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;

pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
```
I really like using a trait to define functions. To be honest, I am not sure why DataFusion didn't do that originally.

My only concern is that adding this means we would now have three ways to define scalar functions (`ScalarUDF`, `ScalarFunctionDef`, and `BuiltInScalarFunction`).

I wonder if this trait is needed, or can we extend `ScalarUDF` to account for aliases? Or perhaps should we be aiming to consolidate all functions to use `ScalarFunctionDef` 🤔
I think using a trait to define functions is clearer and easier to use. If we extend `ScalarUDF`, then we must initialize a struct for each function in an imperative way, which can get messy if there are a lot of function packages to manage.

And `trait ScalarFunctionDef` should be equivalent to `ScalarUDF` and `BuiltInScalarFunction`, so it's possible to replace both with `ScalarFunctionDef` in the future.

(I also updated the original PR's rationale section with a more detailed answer to this concern.)
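For illustration, a self-contained toy of the declarative style (the trait shape and the `AddOne` function are hypothetical, simplified from this PR's actual `ScalarFunctionDef`):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::DataType;

// Toy version of the trait, just to show the declarative style.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
    fn aliases(&self) -> Vec<&str> {
        vec![]
    }
    fn return_type(&self) -> DataType;
    fn execute(&self, args: &[ArrayRef]) -> ArrayRef;
}

// Defining a function is one struct plus one impl block -- no imperative
// create_udf() wiring at the registration site.
struct AddOne;

impl ScalarFunctionDef for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }

    fn return_type(&self) -> DataType {
        DataType::Float64
    }

    fn execute(&self, args: &[ArrayRef]) -> ArrayRef {
        let input = args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("expected a Float64 column");
        // Add 1.0 to every non-null value, preserving nulls.
        let result: Float64Array = input.iter().map(|v| v.map(|x| x + 1.0)).collect();
        Arc::new(result)
    }
}
```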
> I think using a trait to define functions is more clear and easy to use

Yes, I agree

> And trait ScalarFunctionDef should be equivalent to ScalarUDF

Maybe we could do something like

```rust
impl ScalarFunctionDef for ScalarUDF {
    ...
}
```

So all of DataFusion's code would be in terms of `ScalarFunctionDef`. And then (eventually) deprecate `ScalarUDF` as part of helping people migrate to using `ScalarFunctionDef` 🤔
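A rough sketch of what such a bridge could look like (self-contained, with simplified stand-ins rather than DataFusion's real types):

```rust
// Stand-in for the proposed trait.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
}

// Stand-in for today's struct-based UDF representation.
struct ScalarUDF {
    name: String,
    // ... signature, return_type, fun, etc.
}

// With this bridge, core code can be written purely against
// `ScalarFunctionDef`, and existing `ScalarUDF`s keep working unchanged.
impl ScalarFunctionDef for ScalarUDF {
    fn name(&self) -> &str {
        &self.name
    }
}
```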
```rust
}

pub trait ScalarFunctionPackage {
    fn functions(&self) -> Vec<Box<dyn ScalarFunctionDef>>;
```
I suggest using `Arc` here, as I don't see any reason for there to be state / a need for an owned copy.

I think @sadboy, @schulte-lukas and @wolfram-s may also be interested in this idea
Which issue does this PR close?

POC for #7110

Rationale for this change

This POC PR defines a new interface for `BuiltinScalarFunction`: `trait ScalarFunctionDef`, to make it possible to define scalar functions outside the core, as #7110 states. This PR also demonstrates how to define function packages in their own crate and register them into the context, in `datafusion-examples/examples/external_function_package.rs`.
Background

Currently, there are two ways to define a scalar function in Datafusion:

- `enum BuiltinScalarFunction`: it's a huge enum, and there are several large functions like `return_type()` that have to check every enum variant. This implementation makes it impossible to define functions anywhere else, so a second way to define UDFs exists.
- `struct ScalarUDF`: an internal representation for a UDF's logical plan. It's expressive enough to define almost all built-in functions (some minor modifications are needed, like supporting aliases). There is an interface `create_udf()` on top of `struct ScalarUDF` for users to define UDFs.

A new interface: `trait ScalarFunctionDef`

The `ScalarFunctionDef` trait can be used to define scalar functions, and the trait object can be converted into a `ScalarUDF`, reusing the UDF infrastructure to register and use the functions.

One potential concern is why this new trait is needed -- can't we just use the existing interface, `struct ScalarUDF`? If the function interface is defined as a struct, then a new function must be defined in an imperative way (see `examples/simple_udf.rs`). Function packages may have a lot of functions to manage, and defining them in a declarative way (using a trait) is cleaner and more ergonomic.

Besides, `trait ScalarFunctionDef` should cover the same functionality as `struct ScalarUDF` (and also `BuiltinScalarFunction`); it is converted into `ScalarUDF` for now only because that reuses the existing implementation and minimizes the patch size. It's possible to replace `ScalarUDF` with `trait ScalarFunctionDef` in the future as a cleanup task.

I think it's also possible to replace `BuiltinScalarFunction` with the new interface, as another cleanup refactor task: adding a new function to `BuiltinScalarFunction` requires modifications in multiple places/files, while the new `ScalarFunctionDef` interface only needs one place to be modified, which is simpler.
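For context, a minimal sketch of what the user-facing flow in that example might look like (the package crate, `TestFunctionPackage`, and `add_one` are assumptions for illustration; only `register_scalar_function_package` comes from this PR's diff):

```rust
use datafusion::prelude::*;

// Hypothetical package crate living under extension/, as in this PR's Cargo.toml.
use datafusion_extension_test_scalar_func::TestFunctionPackage;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register every scalar function in the package at once
    // (method name taken from this PR's SessionContext diff).
    ctx.register_scalar_function_package(Box::new(TestFunctionPackage));

    // The packaged functions are then usable from SQL like any built-in.
    let df = ctx.sql("SELECT add_one(1.0)").await?;
    df.show().await?;
    Ok(())
}
```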
What changes are included in this PR?

1. `ScalarFunctionDef` trait as an alternative way to define scalar functions

`struct ScalarUDF` is currently the internal representation of a UDF's logical plan; it should be expressive enough to implement all built-in functions (with some minor changes like supporting aliases). The `ScalarFunctionDef` trait can be used to define scalar functions, and the trait object can be converted into a `ScalarUDF`, reusing the UDF infrastructure to register and use the functions.

2. Managing extension points
Another issue is how to manage those pluggable crates, possibly with an `extension/` folder under the datafusion root? Recently I saw discussions about pulling the parquet reader out of the core; those optional extension points could all be placed inside the `extension/` folder.

For example, `arrow-datafusion/extension/scalar-function/hash-functions` would be a separate crate that defines all hash-related functions. If this approach is workable, I can try to pull some functions from the datafusion core into a separate crate afterwards.
Are these changes tested?
Are there any user-facing changes?