Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Register scalars with boxed fn impl #9980

Closed
wants to merge 2 commits into from

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #9892.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the core Core DataFusion crate label Apr 7, 2024
if let Some(udf) = functions_array::get_array_udf(name) {
return Some(udf);
}

self.state.scalar_functions().get(name).cloned()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fallback to user defiend function. Assume they can register actual function . If they need to register closure instead, call array_functions().lock().unwrap().insert(name.to_string(), udf)

/// Repalce old regsiter_udf
pub fn register_array_udf(name: &str, udf: ScalarFactory) -> Option<ScalarFactory> {
// TODO: Check overwrite?
array_functions().lock().unwrap().insert(name.to_string(), udf)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we switch to static map for storing boxed fn impl, the user can only register with this method

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review April 8, 2024 13:35
@jayzhan211 jayzhan211 changed the title Register scalars with boxed fn impl [RFC] Register scalars with boxed fn impl Apr 8, 2024
@jayzhan211
Copy link
Contributor Author

@alamb I would like an early comment before moving on.

@@ -53,6 +53,16 @@ make_udf_function!(ArrayHasAny,
array_has_any_udf // internal function name
);

// TODO: Macro-ify aliases for all functions
pub fn array_has_aliases() -> Vec<String> {
Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get aliases without instantiating the struct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the downside with this approach

  1. is that now every time this function is called a new Vec of Strings gets made which is likely close to the same cost as instantiating the struct in the first place
  2. This will will be more expensive as if the struct is ever actually called
  3. There are now two Vecs of arrays that need to be kept in sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the first 2 points.

There are now two Vecs of arrays that need to be kept in sync

We can replace private aliases.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- this looks quite neat

I am quite worried about a mutable global hash table, but otherwise I like where this is heading. We should definitely check with @viirya what his usecase for deferred creation is (aka if we need to change SessionState...)

@@ -53,6 +53,16 @@ make_udf_function!(ArrayHasAny,
array_has_any_udf // internal function name
);

// TODO: Macro-ify aliases for all functions
pub fn array_has_aliases() -> Vec<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the downside with this approach

  1. is that now every time this function is called a new Vec of Strings gets made which is likely close to the same cost as instantiating the struct in the first place
  2. This will will be more expensive as if the struct is ever actually called
  3. There are now two Vecs of arrays that need to be kept in sync

/// Register a single new UDF, so the user can register their own functions
///
/// Repalce old regsiter_udf
pub fn register_array_udf(name: &str, udf: ScalarFactory) -> Option<ScalarFactory> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like that this is some global mutable state that can be changed by the user. I think this will be very confusing as it both may have unintended side effects as well as not being clear if users should register their functions with the SessionState or this global map

I think a much simpler to reason about interface would be that the list of functions in datafusion/functions is immutable. For example something like

pub fn array_functions() -> &'static <HashMap<String, ScalarFactory> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both may have unintended side effects as well as not being clear if users should register their functions with the SessionState or this global map

My intention is to replace the old one, so there is only a single map that the user need to care about.

@@ -139,13 +326,16 @@ pub fn register_all(registry: &mut dyn FunctionRegistry) -> Result<()> {
replace::array_replace_all_udf(),
replace::array_replace_udf(),
];

// TODO: Remove. Define in array_functions directly
functions.into_iter().try_for_each(|udf| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could avoid the duplication here by looping over the elements of array_functions and instantiating them all

For example (untested)

array_functions()
  .values()
  .try_for_each(|udf_factory| {
    let udf = (udf_factory)();
    ..
  });

Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are register functions here, so instantiation should be avoided. Only when we call get_udf is the time we instantiate the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should use Arc instead of Box to avoid clone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is worth clarifying if the goal is to when we are trying to avoi instantiating ScalarUDF

It seems like it could be either:

  1. When we have a physical plan already (aka as in Comet) and need to look up a function by name (but not iterate through the entire list)
  2. Any time a SessionContext is created (aka make FunctionFactory handle deferred registrations too).

I can see the usecase for 1, I am not sure about for 2 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Do you have suggestions on which usecase is your preferred one?

//
// Replace register_all with our built-in functions
// Replace scalar_functions: HashMap<String, Arc<ScalarUDF>> in SessionState
pub fn array_functions() -> &'static Mutex<HashMap<String, ScalarFactory>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we name it to reflect the fact the map is of factories (and not functions)?

Something like

Suggested change
pub fn array_functions() -> &'static Mutex<HashMap<String, ScalarFactory>> {
pub fn array_scalar_factories() -> &'static Mutex<HashMap<String, ScalarFactory>> {

🤔

Comment on lines +1332 to +1333
// Need to implement Clone for this, so probably not a good idea.
// scalar_functions_impl: HashMap<String, ScalarFactory>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear to me if @viirya 's usecase needs the defer udf creation in all cases. If the goal is really to avoid instantiting all ScalarUDFs I think a better approach would be to be to chang the signature of FunctionRegistry

However, we may be able to avoid changing session state at all (and simply have an API to get the function factories in addition to registering all functions with the SessionState)

@alamb
Copy link
Contributor

alamb commented Apr 14, 2024

Marking as draft as we sort out the requirements

@alamb alamb marked this pull request as draft April 14, 2024 12:30
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 14, 2024
@jayzhan211 jayzhan211 closed this Jun 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create physical scalar expression in functions modules from string (name)
2 participants