-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add customizable equality and hash functions to UDFs #11392
Conversation
datafusion/expr/src/udaf.rs
Outdated
@@ -72,20 +76,19 @@ pub struct AggregateUDF { | |||
|
|||
impl PartialEq for AggregateUDF { | |||
fn eq(&self, other: &Self) -> bool { | |||
self.name() == other.name() && self.signature() == other.signature() | |||
self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we want to be so general but the issue with dynamic equality is that it might not be symmetric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility we could do is document that the equality test in the UDFs must be symmetric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with downcast_ref
is that it's one sided. Perhaps we can document that and then change the implementation of aliased UDFs to be symmetric. Should it compare the aliases as well? I'm not sure what would be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think comparing the inner function is the most straightforward... However, you are right that it does seem like it should be comparing the aliases too probably
Trophy for lowest PR number closed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes sense to me @joroKr21 -- thank you
My only question is if there is some way to test this functionality (e.g. I am thinking if we accidentally broke this feature would we know from tests?)
Maybe it is not something that could be feasibly tested though. I can't really think of a great example of how to do it (other than to show the traits could be extended, which doesn't seem like a very useful test)
I had some small comment suggestions, but I don't think they are required and e could merge this PR as is as well.
Let us know!
datafusion/expr/src/udaf.rs
Outdated
@@ -72,20 +76,19 @@ pub struct AggregateUDF { | |||
|
|||
impl PartialEq for AggregateUDF { | |||
fn eq(&self, other: &Self) -> bool { | |||
self.name() == other.name() && self.signature() == other.signature() | |||
self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility we could do is document that the equality test in the UDFs must be symmetric.
datafusion/expr/src/udaf.rs
Outdated
/// Dynamic equality. Allows customizing the equality of aggregate UDFs. | ||
/// By default, compares the UDF name and signature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a suggestion to improve the docstring
/// Dynamic equality. Allows customizing the equality of aggregate UDFs. | |
/// By default, compares the UDF name and signature. | |
/// Return true if this aggregate UDF is equal to the other. | |
/// | |
/// Allows customizing the equality of aggregate UDFs. Must be consistent | |
/// with [`Self::hash_value`]. | |
/// | |
/// By default, compares [`Self::name`] and [`Self::signature`] |
datafusion/expr/src/udaf.rs
Outdated
/// Dynamic hashing. Allows customizing the hash code of aggregate UDFs. | ||
/// By default, hashes the UDF name and signature. | ||
fn hash_value(&self) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to note here that eq and hash value need to be consistent.
Something like this perhaps:
/// Dynamic hashing. Allows customizing the hash code of aggregate UDFs. | |
/// By default, hashes the UDF name and signature. | |
fn hash_value(&self) -> u64 { | |
/// Returns a hash value for this aggregate UDF. | |
/// | |
/// Allows customizing the hash code of aggregate UDFs. Similarly to | |
/// [`std::hash::Hash`], if [`Self::equals`] | |
/// returns true for two aggregate UDFs, the value of `hash_value` must as well. | |
/// | |
/// By default, hashes [`Self::name`] and [`Self::signature`] | |
fn hash_value(&self) -> u64 { |
@@ -562,6 +580,18 @@ impl AggregateUDFImpl for AliasedAggregateUDFImpl { | |||
fn aliases(&self) -> &[String] { | |||
&self.aliases | |||
} | |||
|
|||
fn equals(&self, other: &dyn AggregateUDFImpl) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes sense to me as the name and signature are the same as the inner
datafusion/expr/src/udf.rs
Outdated
@@ -540,6 +541,21 @@ pub trait ScalarUDFImpl: Debug + Send + Sync { | |||
fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> { | |||
not_impl_err!("Function {} does not implement coerce_types", self.name()) | |||
} | |||
|
|||
/// Dynamic equality. Allows customizing the equality of scalar UDFs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend the same documentation updates here as for aggregateUDF
datafusion/expr/src/udwf.rs
Outdated
@@ -296,6 +299,21 @@ pub trait WindowUDFImpl: Debug + Send + Sync { | |||
fn simplify(&self) -> Option<WindowFunctionSimplification> { | |||
None | |||
} | |||
|
|||
/// Dynamic equality. Allows customizing the equality of window UDFs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here for doc updates
Yes, I have an idea. I think we already have some parameterized UDFs in the tests. I will just create a query that would be broken currently and be fixed by these changes. |
31a170b
to
89e8dae
Compare
@alamb this should be ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @joroKr21
|
||
assert_eq!( | ||
format!("{plan:?}"), | ||
"Filter: t.text IS NOT NULL\n Filter: regex_udf(t.text) AND regex_udf(t.text)\n TableScan: t projection=[text]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without the changes in this PR are the expressions combined by CSE or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular case is deduplicated in PushDownFilter
:
LogicalPlan::Filter(child_filter) => {
let parents_predicates = split_conjunction_owned(filter.predicate);
// remove duplicated filters
let child_predicates = split_conjunction_owned(child_filter.predicate);
let new_predicates = parents_predicates
.into_iter()
.chain(child_predicates)
// use IndexSet to remove dupes while preserving predicate order
.collect::<IndexSet<_>>()
.into_iter()
.collect::<Vec<_>>();
* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs
* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs
* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs
* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs
Which issue does this PR close?
Closes #127.
Rationale for this change
After #9436 it's possible to write and use all kinds of custom UDF functions e.g. parameterized by additional arguments, but the issue is that equality and hash code are based only on the name and signature. As such some plan rewrites that put expressions in a hash set (e.g.
FilterPushDown
) violate this assumption and we end up with semantically incorrect rewrites.What changes are included in this PR?
Added
equals
andhash_value
methods to scalar, aggregate and window UDFs. By default they delegate to the name and signature as before but custom UDFs can override them to account for additional parameters. Unfortunately we can't extend theEq
andHash
traits because they prevent us from using trait objects.Are these changes tested?
Yes, added unit tests.
Are there any user-facing changes?
Yes, the UDF interfaces grow with two new functions.