-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Register scalars with boxed fn impl #9980
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
if let Some(udf) = functions_array::get_array_udf(name) { | ||
return Some(udf); | ||
} | ||
|
||
self.state.scalar_functions().get(name).cloned() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fallback to user defiend function. Assume they can register actual function . If they need to register closure instead, call array_functions().lock().unwrap().insert(name.to_string(), udf)
/// Repalce old regsiter_udf | ||
pub fn register_array_udf(name: &str, udf: ScalarFactory) -> Option<ScalarFactory> { | ||
// TODO: Check overwrite? | ||
array_functions().lock().unwrap().insert(name.to_string(), udf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we switch to static map for storing boxed fn impl, the user can only register with this method
@alamb I would like an early comment before moving on. |
@@ -53,6 +53,16 @@ make_udf_function!(ArrayHasAny, | |||
array_has_any_udf // internal function name | |||
); | |||
|
|||
// TODO: Macro-ify aliases for all functions | |||
pub fn array_has_aliases() -> Vec<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get aliases without instantiating the struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the downside with this approach
- is that now every time this function is called a new Vec of Strings gets made which is likely close to the same cost as instantiating the struct in the first place
- This will will be more expensive as if the struct is ever actually called
- There are now two Vecs of arrays that need to be kept in sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the first 2 points.
There are now two Vecs of arrays that need to be kept in sync
We can replace private aliases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- this looks quite neat
I am quite worried about a mutable global hash table, but otherwise I like where this is heading. We should definitely check with @viirya what his usecase for deferred creation is (aka if we need to change SessionState...)
@@ -53,6 +53,16 @@ make_udf_function!(ArrayHasAny, | |||
array_has_any_udf // internal function name | |||
); | |||
|
|||
// TODO: Macro-ify aliases for all functions | |||
pub fn array_has_aliases() -> Vec<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the downside with this approach
- is that now every time this function is called a new Vec of Strings gets made which is likely close to the same cost as instantiating the struct in the first place
- This will will be more expensive as if the struct is ever actually called
- There are now two Vecs of arrays that need to be kept in sync
/// Register a single new UDF, so the user can register their own functions | ||
/// | ||
/// Repalce old regsiter_udf | ||
pub fn register_array_udf(name: &str, udf: ScalarFactory) -> Option<ScalarFactory> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like that this is some global mutable state that can be changed by the user. I think this will be very confusing as it both may have unintended side effects as well as not being clear if users should register their functions with the SessionState or this global map
I think a much simpler to reason about interface would be that the list of functions in datafusion/functions
is immutable. For example something like
pub fn array_functions() -> &'static <HashMap<String, ScalarFactory> {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both may have unintended side effects as well as not being clear if users should register their functions with the SessionState or this global map
My intention is to replace the old one, so there is only a single map that the user need to care about.
@@ -139,13 +326,16 @@ pub fn register_all(registry: &mut dyn FunctionRegistry) -> Result<()> { | |||
replace::array_replace_all_udf(), | |||
replace::array_replace_udf(), | |||
]; | |||
|
|||
// TODO: Remove. Define in array_functions directly | |||
functions.into_iter().try_for_each(|udf| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could avoid the duplication here by looping over the elements of array_functions
and instantiating them all
For example (untested)
array_functions()
.values()
.try_for_each(|udf_factory| {
let udf = (udf_factory)();
..
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are register functions here, so instantiation should be avoided. Only when we call get_udf
is the time we instantiate the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should use Arc
instead of Box
to avoid clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is worth clarifying if the goal is to when we are trying to avoi instantiating ScalarUDF
It seems like it could be either:
- When we have a physical plan already (aka as in Comet) and need to look up a function by name (but not iterate through the entire list)
- Any time a
SessionContext
is created (aka make FunctionFactory handle deferred registrations too).
I can see the usecase for 1, I am not sure about for 2 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Do you have suggestions on which usecase is your preferred one?
// | ||
// Replace register_all with our built-in functions | ||
// Replace scalar_functions: HashMap<String, Arc<ScalarUDF>> in SessionState | ||
pub fn array_functions() -> &'static Mutex<HashMap<String, ScalarFactory>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name it to reflect the fact the map is of factories (and not functions)?
Something like
pub fn array_functions() -> &'static Mutex<HashMap<String, ScalarFactory>> { | |
pub fn array_scalar_factories() -> &'static Mutex<HashMap<String, ScalarFactory>> { |
🤔
// Need to implement Clone for this, so probably not a good idea. | ||
// scalar_functions_impl: HashMap<String, ScalarFactory>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't clear to me if @viirya 's usecase needs the defer udf creation in all cases. If the goal is really to avoid instantiting all ScalarUDF
s I think a better approach would be to be to chang the signature of FunctionRegistry
However, we may be able to avoid changing session state at all (and simply have an API to get the function factories in addition to registering all functions with the SessionState)
Marking as draft as we sort out the requirements |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #9892.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?