POC: Splitting scalar functions outside Datafusion core #7752
Conversation
This is really cool @2010YOUY01 -- thank you very much 🙏

I suggest the next step is to get some more feedback (I will post this PR's link around and see if we can get anyone else to comment on it).

Then I agree it would be great to try porting 2-3 existing `BuiltInScalarFunction`s and see what they look like with the new API.

It is great to see this work pushed forward.
> 2. Managing extension points
> Another issue is how to manage those pluggable crates, possibly with an `extension/` folder under the datafusion root? Recently I saw discussions about pulling the parquet reader out of the core; those optional extension points could all be placed inside the `extension/` folder.

I left a suggestion inline
```diff
@@ -41,6 +41,7 @@ datafusion-common = { path = "../datafusion/common" }
 datafusion-expr = { path = "../datafusion/expr" }
 datafusion-optimizer = { path = "../datafusion/optimizer" }
 datafusion-sql = { path = "../datafusion/sql" }
+datafusion-extension-test-scalar-func = { path = "../extension/scalar-function/test-func" }
```
Since most of DataFusion is extensions, another possibility might be a path structure like the following?

```text
├── core                             # (already exists)
├── functions-aggregate-core
├── functions-aggregate-statistics
├── functions-scalar-array
├── functions-scalar-core
├── functions-scalar-crypto
├── functions-scalar-timestamp
├── physical-expr                    # (already exists)
...
└── physical-plan                    # (already exists)
```
```diff
@@ -792,6 +794,30 @@ impl SessionContext {
         .add_var_provider(variable_type, provider);
     }

+    /// Register a function package into this context
+    pub fn register_scalar_function_package(
```
If we are going to implement a new API, maybe we can have it cover all types of functions (window, aggregate, scalar) so there is a place to add table functions eventually too. Something like:

```rust
pub fn register_function_package(&self, func_pkg: Box<dyn FunctionPackage>)
```

Also I wonder if it should take `Box<dyn FunctionPackage>` (which is owned) or `Arc<dyn FunctionPackage>`, which can be shared. Looking at this code, it seems like there is no reason for an owned pointer and we could get away with `Arc` (so function packages can be quickly registered with multiple `SessionContext`s).
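For illustration, a self-contained toy sketch of the `Arc`-based variant (these stand-in types are simplified assumptions, not DataFusion's real definitions):

```rust
use std::sync::Arc;

// Toy stand-ins for the PR's types, just to illustrate the ownership question.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
}

trait FunctionPackage: Send + Sync {
    fn functions(&self) -> Vec<Arc<dyn ScalarFunctionDef>>;
}

struct SessionContext {
    registered: Vec<Arc<dyn ScalarFunctionDef>>,
}

impl SessionContext {
    // Taking a shared `Arc<dyn FunctionPackage>` instead of an owned `Box`
    // means one package instance can be registered with many contexts.
    fn register_function_package(&mut self, pkg: Arc<dyn FunctionPackage>) {
        self.registered.extend(pkg.functions());
    }
}
```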
```rust
// TODO: ReturnTypeFunction -> a ENUM
// most function's return type is either the same as 1st arg or a fixed type
fn return_type(&self) -> ReturnTypeFunction;
```
I think the current API requires `return_type` to be passed the specific argument types.

Also, #7657 suggests there is additional room for improvement.
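For comparison, a hypothetical sketch of a signature that receives the concrete argument types directly (this exact shape is an assumption, not code from this PR):

```rust
use arrow::datatypes::DataType;

// Sketch: rather than returning a ReturnTypeFunction closure, the trait
// method itself could take the argument types and compute the result type.
trait ScalarFunctionDefSketch {
    fn return_type(&self, arg_types: &[DataType]) -> DataType;
}
```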
I think we might mark this API as experimental for now; it is likely to change frequently, and other unexpected cases like #7657 might show up when porting some trickier functions.
I like that idea --
maybe the implementation plan sequence could be:

- Introduce `ScalarFunctionDef`; `impl ScalarFunctionDef` for `ScalarUDF` and `BuiltInFunction`
- Update the core DataFusion code to only use `ScalarFunctionDef`
- Port all `BuiltInFunction`s to `ScalarFunctionDef`
- Deprecate `ScalarUDF`
- Remove `BuiltInFunction`
Sounds great 👍🏼 My plan is:

1st PR: `impl ScalarFunctionDef for BuiltinScalarFunction` and replace the execution code with the new interface. This step ensures the new interface can cover all the existing functionality, and determines what the interface looks like.

2nd PR: replace the current UDF implementation with the new interface.

3rd and after: start porting existing functions.
```rust
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;

pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
```
I really like using a trait to define functions. To be honest, I am not sure why DataFusion didn't do that originally.

My only concern is that adding this means we would now have three ways to define scalar functions (`ScalarUDF`, `ScalarFunctionDef`, and `BuiltInScalarFunction`).

I wonder if this trait is needed, or can we extend `ScalarUDF` to account for aliases? Or perhaps should we be aiming to consolidate all functions to use `ScalarFunctionDef` 🤔
I think using a trait to define functions is clearer and easier to use. If we extend `ScalarUDF`, then we must initialize a struct for each function in an imperative way, which can get messy if there are a lot of function packages to manage.

And `trait ScalarFunctionDef` should be equivalent to `ScalarUDF` and `BuiltInScalarFunction`, so it's possible to replace both with `ScalarFunctionDef` in the future.

(I also updated the original PR's rationale section with a more detailed answer to this concern.)
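For illustration, a self-contained toy of the declarative style (the trait shape and the `AddOne` function are hypothetical, simplified from this PR's actual `ScalarFunctionDef`):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::DataType;

// Toy version of the trait, just to show the declarative style.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
    fn aliases(&self) -> Vec<&str> {
        vec![]
    }
    fn return_type(&self) -> DataType;
    fn execute(&self, args: &[ArrayRef]) -> ArrayRef;
}

// Defining a function is one struct plus one impl block -- no imperative
// create_udf() wiring at the registration site.
struct AddOne;

impl ScalarFunctionDef for AddOne {
    fn name(&self) -> &str {
        "add_one"
    }

    fn return_type(&self) -> DataType {
        DataType::Float64
    }

    fn execute(&self, args: &[ArrayRef]) -> ArrayRef {
        let input = args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("expected a Float64 column");
        // Add 1.0 to every non-null value, preserving nulls.
        let result: Float64Array = input.iter().map(|v| v.map(|x| x + 1.0)).collect();
        Arc::new(result)
    }
}
```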
> I think using a trait to define functions is more clear and easy to use

Yes, I agree

> And trait ScalarFunctionDef should be equivalent to ScalarUDF

Maybe we could do something like

```rust
impl ScalarFunctionDef for ScalarUDF {
    ...
}
```

So all of DataFusion's code would be in terms of `ScalarFunctionDef`. And then (eventually) deprecate `ScalarUDF` as part of helping people migrate to using `ScalarFunctionDef` 🤔
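A rough sketch of what such a bridge could look like (self-contained, with simplified stand-ins rather than DataFusion's real types):

```rust
// Stand-in for the proposed trait.
trait ScalarFunctionDef: Send + Sync {
    fn name(&self) -> &str;
}

// Stand-in for today's struct-based UDF representation.
struct ScalarUDF {
    name: String,
    // ... signature, return_type, fun, etc.
}

// With this bridge, core code can be written purely against
// `ScalarFunctionDef`, and existing `ScalarUDF`s keep working unchanged.
impl ScalarFunctionDef for ScalarUDF {
    fn name(&self) -> &str {
        &self.name
    }
}
```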
```rust
}

pub trait ScalarFunctionPackage {
    fn functions(&self) -> Vec<Box<dyn ScalarFunctionDef>>;
```
I suggest using `Arc` here, as I don't see any reason for there to be state / a need for an owned copy.

I think @sadboy, @schulte-lukas and @wolfram-s may also be interested in this idea
Which issue does this PR close?

POC for #7110

Rationale for this change

This POC PR defines a new interface for `BuiltinScalarFunction`: `trait ScalarFunctionDef`, to make it possible to define scalar functions outside the core, as #7110 states. This PR also demonstrates how to define function packages in their own crate and register them into the context, in `datafusion-examples/examples/external_function_package.rs`.
Background

Currently, there are two ways to define a scalar function in Datafusion:

- `enum BuiltinScalarFunction`: it's a huge enum, and there are several large functions like `return_type()` that have to check every enum variant. This implementation makes it impossible to define functions anywhere else, so a second way to define UDFs exists.
- `struct ScalarUDF`: an internal representation for a UDF's logical plan. It's expressive enough to define almost all built-in functions (some minor modifications are needed, like supporting aliases). There is an interface `create_udf()` on top of `struct ScalarUDF` for users to define UDFs.

A new interface: `trait ScalarFunctionDef`

The `ScalarFunctionDef` trait can be used to define scalar functions, and the trait object can be converted into a `ScalarUDF`, reusing the UDF infrastructure to register and use the functions.

One potential concern is why this new trait is needed -- can't we just use the existing interface, `struct ScalarUDF`? If the function interface is defined as a struct, then a new function must be defined in an imperative way (see `examples/simple_udf.rs`). Function packages may have a lot of functions to manage, and defining them in a declarative way (using a trait) is cleaner and more ergonomic.

Besides, `trait ScalarFunctionDef` should cover the same functionality as `struct ScalarUDF` (and also `BuiltinScalarFunction`); it is converted into `ScalarUDF` for now only because that reuses the existing implementation and minimizes the patch size. It's possible to replace `ScalarUDF` with `trait ScalarFunctionDef` in the future as a cleanup task.

I think it's also possible to replace `BuiltinScalarFunction` with the new interface, as another cleanup refactor task: adding a new function to `BuiltinScalarFunction` requires modifications in multiple places/files, while the new `ScalarFunctionDef` interface only needs one place to be modified, which is simpler.
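For context, a minimal sketch of what the user-facing flow in that example might look like (the package crate, `TestFunctionPackage`, and `add_one` are assumptions for illustration; only `register_scalar_function_package` comes from this PR's diff):

```rust
use datafusion::prelude::*;

// Hypothetical package crate living under extension/, as in this PR's Cargo.toml.
use datafusion_extension_test_scalar_func::TestFunctionPackage;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register every scalar function in the package at once
    // (method name taken from this PR's SessionContext diff).
    ctx.register_scalar_function_package(Box::new(TestFunctionPackage));

    // The packaged functions are then usable from SQL like any built-in.
    let df = ctx.sql("SELECT add_one(1.0)").await?;
    df.show().await?;
    Ok(())
}
```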
What changes are included in this PR?

1. `ScalarFunctionDef` trait as an alternative way to define scalar functions

`struct ScalarUDF` is currently the internal representation of a UDF's logical plan; it should be expressive enough to implement all built-in functions (with some minor changes like supporting aliases). The `ScalarFunctionDef` trait can be used to define scalar functions, and the trait object can be converted into a `ScalarUDF`, reusing the UDF infrastructure to register and use the functions.

2. Managing extension points
Another issue is how to manage those pluggable crates, possibly with an `extension/` folder under the datafusion root? Recently I saw discussions about pulling the parquet reader out of the core; those optional extension points could all be placed inside the `extension/` folder.

For example, `arrow-datafusion/extension/scalar-function/hash-functions` would be a separate crate that defines all hash-related functions. If this approach is workable, I can try to pull some functions from the datafusion core into a separate crate afterwards.
Are these changes tested?
Are there any user-facing changes?