Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split built in functions into "packages" #7110

Open
alamb opened this issue Jul 27, 2023 · 4 comments
Open

Split built in functions into "packages" #7110

alamb opened this issue Jul 27, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jul 27, 2023

Is your feature request related to a problem or challenge?

Right now, DataFusionhas 104 built in functions: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.BuiltinScalarFunction.html

Screenshot 2023-07-27 at 8 51 45 AM

As we add new features and functions (most recently, date_diff #7097 (comment) or all the arary functions) this number will keep growing

This growth means that the size of the DataFusion library will keep growing even if users do not use those features

Describe the solution you'd like

Given that DataFusion has all the machinery to register user defined functions, and they are mostly handled the same way, I propose we split up the datafusion built into scalar function packages

Perhaps like

  • string_functions (upper, hex, etc)
  • crypto_functions (hash, etc)
  • array_functions (make_array, array_contains, etc)
  • date_time functions (date_trunc, etc)
  • ...

Not only would this give users better control over their binary size it would also ensure that the extensibility APIs of DataFusion were sufficient for all functions (and we could enhance the extension points if this was not the case)

This would replace the existing somewhat haphazard feature flags like crypto_functions:

https://github.com/apache/arrow-datafusion/blob/11b7b5c215012231e5768fc5be3445c0254d0169/datafusion/physical-expr/src/lib.rs#L21-L22

I would imagine these functions would be in their own crates like datafusion_functions_crypto with an entry point like

let ctx = SessionContext::new();
ctx.sql("select hash('foobar')");// would error 

// register all functions in the `datafusion_functions_crypto` package
datafusion_functions_crypto::register(&ctx)
ctx.sql("select hash('foobar')");// would now succeed

Describe alternatives you've considered

We could continue to use feature flags

If this works out, we could do the same thing for aggregate and window functions

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jul 27, 2023
@alamb alamb changed the title Split built in scalar functions into "packages" Split built in functions into "packages" Jul 30, 2023
@2010YOUY01
Copy link
Contributor

Not quite familiar with the binary size issue, but maybe we could let the default config ship with most functions pre-registered:

let ctx = SessionContext::new(); // Internally have most functions registered

It might be more convenient for general users and also make the API backward compatible.

If binary size becomes an issue, now there is another nice option available:

let ctx = SessionContext::new_with_minimal_functions();

@universalmind303
Copy link
Contributor

what about namespaced enums? it would still use feature flags, but it'd make it much more compartmentalized & able to easily split into sub crates.

so something like

// this could easily live in it's own crate.
pub enum StringFunctions {
    // string related functions
}

pub enum BuiltinScalarFunction {
    #[cfg(feature = "string_expressions")]
    StringNamespace(StringFunctions),
    #[cfg(feature = "list_expressions")]
    ListNamespace(ListFunctions),
    #[cfg(feature = "crypto_expressions")]
    CryptoNamespace(CryptoFunctions),
    Abs,
    // ...other non namespaced functions
}

@alamb
Copy link
Contributor Author

alamb commented Aug 5, 2023

what about namespaced enums? it would still use feature flags, but it'd make it much more compartmentalized & able to easily split into sub crates.

That is a neat idea @universalmind303 -- I was even thinking there is no reason to have BuiltInScalarFunction at all in theory -- the same effects can be achieved using the ScalarUDFI think

This would also make things like serialization simpler as well (as there is already a name --> function mapping)

@alamb
Copy link
Contributor Author

alamb commented Sep 13, 2023

I think it would be really nice to implement these packages as crates (possibly with feature flags in datafusion core) rather than just feature flags in datafusion core. cc @tustvold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants