Add customizable equality and hash functions to UDFs #11392

joroKr21 · 2024-07-10T14:54:09Z

Which issue does this PR close?

Closes #127.

Rationale for this change

After #9436 it's possible to write and use all kinds of custom UDFs, e.g. ones parameterized by additional arguments, but equality and hash code are still based only on the name and signature. As a result, plan rewrites that put expressions in a hash set (e.g. PushDownFilter) treat differently parameterized UDFs as equal, and we end up with semantically incorrect rewrites.

What changes are included in this PR?

Added equals and hash_value methods to scalar, aggregate and window UDFs. By default they delegate to the name and signature as before but custom UDFs can override them to account for additional parameters. Unfortunately we can't extend the Eq and Hash traits because they prevent us from using trait objects.
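A minimal sketch of the idea, not the actual DataFusion code: `UdfImpl`, `RegexUdf` and the `as_any` helper are hypothetical stand-ins, and the default compares only the name (the real default also includes the signature). It shows how default methods on an object-safe trait can be overridden by a parameterized implementation.

```rust
use std::any::Any;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified stand-in for a UDF implementation trait.
trait UdfImpl {
    fn as_any(&self) -> &dyn Any;
    fn name(&self) -> &str;

    /// Default: two UDFs are equal when their names match.
    fn equals(&self, other: &dyn UdfImpl) -> bool {
        self.name() == other.name()
    }

    /// Default: hash only the name, consistent with the default `equals`.
    fn hash_value(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.name().hash(&mut hasher);
        hasher.finish()
    }
}

/// A UDF parameterized by a pattern that the name does not capture.
struct RegexUdf {
    pattern: String,
}

impl UdfImpl for RegexUdf {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "regex_udf"
    }

    /// Override: equal only to another `RegexUdf` with the same pattern.
    fn equals(&self, other: &dyn UdfImpl) -> bool {
        match other.as_any().downcast_ref::<RegexUdf>() {
            Some(o) => self.pattern == o.pattern,
            None => false,
        }
    }

    /// Override: include the pattern so the hash stays consistent with `equals`.
    fn hash_value(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.name().hash(&mut hasher);
        self.pattern.hash(&mut hasher);
        hasher.finish()
    }
}
```

With this sketch, two `RegexUdf` values with the same pattern compare equal and hash identically, while values with different patterns are no longer conflated just because they share a name.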

Are these changes tested?

Yes, added unit tests.

Are there any user-facing changes?

Yes, the UDF interfaces grow with two new functions.

joroKr21 · 2024-07-10T14:55:13Z

datafusion/expr/src/udaf.rs

@@ -72,20 +76,19 @@ pub struct AggregateUDF {

 impl PartialEq for AggregateUDF {
    fn eq(&self, other: &Self) -> bool {
-        self.name() == other.name() && self.signature() == other.signature()
+        self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref())


I'm not sure if we want to be so general but the issue with dynamic equality is that it might not be symmetric.

Another possibility we could do is document that the equality test in the UDFs must be symmetric.

The issue with downcast_ref is that it's one sided. Perhaps we can document that and then change the implementation of aliased UDFs to be symmetric. Should it compare the aliases as well? I'm not sure what would be correct.

I think comparing the inner function is the most straightforward... However, you are right that it does seem like it should be comparing the aliases too probably
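To make the symmetry concern in this thread concrete, here is a hedged sketch. `UdfImpl`, `Plain`, `Aliased` and `symmetric_eq` are hypothetical names, and comparing aliases is only one of the options discussed above. It shows how a name-based default and a `downcast_ref`-based override can disagree depending on which side is asked, and how the `||` in the diff hides that.

```rust
use std::any::Any;

/// Hypothetical, simplified stand-in for a UDF implementation trait.
trait UdfImpl {
    fn as_any(&self) -> &dyn Any;
    fn name(&self) -> &str;

    /// Default: name-based equality, which accepts any implementation.
    fn equals(&self, other: &dyn UdfImpl) -> bool {
        self.name() == other.name()
    }
}

/// A UDF that keeps the default equality.
struct Plain;

impl UdfImpl for Plain {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "f"
    }
}

/// Wrapper that adds aliases; its `equals` only accepts another `Aliased`.
struct Aliased {
    inner: Box<dyn UdfImpl>,
    aliases: Vec<String>,
}

impl UdfImpl for Aliased {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        self.inner.name()
    }

    /// One-sided: the downcast fails when `other` is not an `Aliased`.
    fn equals(&self, other: &dyn UdfImpl) -> bool {
        match other.as_any().downcast_ref::<Aliased>() {
            Some(o) => self.inner.equals(o.inner.as_ref()) && self.aliases == o.aliases,
            None => false,
        }
    }
}

/// Symmetric comparison, mirroring the `||` in the diff above.
fn symmetric_eq(a: &dyn UdfImpl, b: &dyn UdfImpl) -> bool {
    a.equals(b) || b.equals(a)
}
```

In this sketch `Plain.equals(&aliased)` is true while `aliased.equals(&Plain)` is false. `symmetric_eq` returns the same answer in both directions, at the cost of treating the two as equal whenever either side says so.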

alamb · 2024-07-10T17:11:09Z

> Closes #127.

Trophy for lowest PR number closed

alamb

This PR makes sense to me @joroKr21 -- thank you

My only question is if there is some way to test this functionality (e.g. I am thinking if we accidentally broke this feature would we know from tests?)

Maybe it is not something that could be feasibly tested though. I can't really think of a great example of how to do it (other than to show the traits could be extended, which doesn't seem like a very useful test)

I had some small comment suggestions, but I don't think they are required and we could merge this PR as is as well.

Let us know!

alamb · 2024-07-10T19:01:16Z

datafusion/expr/src/udaf.rs

@@ -72,20 +76,19 @@ pub struct AggregateUDF {

 impl PartialEq for AggregateUDF {
    fn eq(&self, other: &Self) -> bool {
-        self.name() == other.name() && self.signature() == other.signature()
+        self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref())


Another possibility we could do is document that the equality test in the UDFs must be symmetric.

alamb · 2024-07-10T19:03:56Z

datafusion/expr/src/udaf.rs

+    /// Dynamic equality. Allows customizing the equality of aggregate UDFs.
+    /// By default, compares the UDF name and signature.


Here is a suggestion to improve the docstring

Suggested change

-    /// Dynamic equality. Allows customizing the equality of aggregate UDFs.
-    /// By default, compares the UDF name and signature.
+    /// Return true if this aggregate UDF is equal to the other.
+    ///
+    /// Allows customizing the equality of aggregate UDFs. Must be consistent
+    /// with [`Self::hash_value`].
+    ///
+    /// By default, compares [`Self::name`] and [`Self::signature`]

alamb · 2024-07-10T19:06:46Z

datafusion/expr/src/udaf.rs

+    /// Dynamic hashing. Allows customizing the hash code of aggregate UDFs.
+    /// By default, hashes the UDF name and signature.
+    fn hash_value(&self) -> u64 {


I think it would be good to note here that eq and hash value need to be consistent.

Something like this perhaps:

Suggested change

-    /// Dynamic hashing. Allows customizing the hash code of aggregate UDFs.
-    /// By default, hashes the UDF name and signature.
-    fn hash_value(&self) -> u64 {
+    /// Returns a hash value for this aggregate UDF.
+    ///
+    /// Allows customizing the hash code of aggregate UDFs. Similarly to
+    /// [`std::hash::Hash`], if [`Self::equals`]
+    /// returns true for two aggregate UDFs, the value of `hash_value` must as well.
+    ///
+    /// By default, hashes [`Self::name`] and [`Self::signature`]
+    fn hash_value(&self) -> u64 {
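The consistency requirement matters as soon as a wrapper type forwards `PartialEq` and `Hash` to these dynamic methods. The following is a sketch under assumptions: `UdfImpl`, `AddN`, `Udf`, `udf` and `distinct_count` are hypothetical, and the extra parameter is exposed through a plain `param` method to keep the example short.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a UDF implementation trait.
trait UdfImpl {
    fn name(&self) -> &str;
    /// Hypothetical extra parameter that the name does not capture.
    fn param(&self) -> i64;

    fn equals(&self, other: &dyn UdfImpl) -> bool {
        self.name() == other.name() && self.param() == other.param()
    }

    /// Hashes exactly the fields that `equals` compares.
    fn hash_value(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.name().hash(&mut hasher);
        self.param().hash(&mut hasher);
        hasher.finish()
    }
}

struct AddN(i64);

impl UdfImpl for AddN {
    fn name(&self) -> &str {
        "add_n"
    }

    fn param(&self) -> i64 {
        self.0
    }
}

/// Wrapper that forwards `PartialEq` and `Hash` to the trait object.
struct Udf {
    inner: Arc<dyn UdfImpl>,
}

impl PartialEq for Udf {
    fn eq(&self, other: &Self) -> bool {
        self.inner.equals(other.inner.as_ref())
    }
}

impl Eq for Udf {}

impl Hash for Udf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.inner.hash_value());
    }
}

/// Convenience constructor for the example.
fn udf(n: i64) -> Udf {
    Udf { inner: Arc::new(AddN(n)) }
}

/// Counts distinct UDFs by collecting them into a hash set.
fn distinct_count(udfs: Vec<Udf>) -> usize {
    udfs.into_iter().collect::<HashSet<_>>().len()
}
```

If `hash_value` hashed a field that `equals` ignores, two equal values could land in different buckets and the set would silently keep both, which is why the two methods have to be overridden together.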

alamb · 2024-07-10T19:07:37Z

datafusion/expr/src/udaf.rs

@@ -562,6 +580,18 @@ impl AggregateUDFImpl for AliasedAggregateUDFImpl {
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
+
+    fn equals(&self, other: &dyn AggregateUDFImpl) -> bool {


this makes sense to me as the name and signature are the same as the inner

alamb · 2024-07-10T19:09:03Z

datafusion/expr/src/udf.rs

@@ -540,6 +541,21 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
    fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
        not_impl_err!("Function {} does not implement coerce_types", self.name())
    }
+
+    /// Dynamic equality. Allows customizing the equality of scalar UDFs.


I recommend the same documentation updates here as for AggregateUDF

alamb · 2024-07-10T19:09:17Z

datafusion/expr/src/udwf.rs

@@ -296,6 +299,21 @@ pub trait WindowUDFImpl: Debug + Send + Sync {
    fn simplify(&self) -> Option<WindowFunctionSimplification> {
        None
    }
+
+    /// Dynamic equality. Allows customizing the equality of window UDFs.


Likewise here for doc updates

joroKr21 · 2024-07-10T19:53:37Z

> Maybe it is not something that could be feasibly tested though. I can't really think of a great example of how to do it (other than to show the traits could be extended, which doesn't seem like a very useful test)

Yes, I have an idea. I think we already have some parameterized UDFs in the tests. I will just create a query that would be broken currently and be fixed by these changes.

joroKr21 · 2024-07-11T12:21:15Z

@alamb this should be ready

alamb

Thanks again @joroKr21

alamb · 2024-07-11T16:26:31Z

datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

+
+    assert_eq!(
+        format!("{plan:?}"),
+        "Filter: t.text IS NOT NULL\n  Filter: regex_udf(t.text) AND regex_udf(t.text)\n    TableScan: t projection=[text]"


without the changes in this PR are the expressions combined by CSE or something?

This particular case is deduplicated in PushDownFilter:

    LogicalPlan::Filter(child_filter) => {
        let parents_predicates = split_conjunction_owned(filter.predicate);

        // remove duplicated filters
        let child_predicates = split_conjunction_owned(child_filter.predicate);
        let new_predicates = parents_predicates
            .into_iter()
            .chain(child_predicates)
            // use IndexSet to remove dupes while preserving predicate order
            .collect::<IndexSet<_>>()
            .into_iter()
            .collect::<Vec<_>>();
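The effect of that deduplication can be imitated with the standard library alone. This is only an illustration: `Predicate`, `NameOnly` and `dedup_preserving_order` are hypothetical, and a `HashSet` plus `filter` stands in for `IndexSet`. With name-only equality, two `regex_udf` predicates that differ only in their pattern collapse into one, which is the incorrect rewrite the new test guards against.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Removes duplicates while keeping the first occurrence of each item in order.
fn dedup_preserving_order<T: Eq + Hash + Clone>(items: impl IntoIterator<Item = T>) -> Vec<T> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // `insert` returns false when the item was already present.
        .filter(|item| seen.insert(item.clone()))
        .collect()
}

/// Hypothetical predicate: a function name plus a parameter the name hides.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Predicate {
    func: &'static str,
    pattern: &'static str,
}

/// Same predicate, but equality and hash look at the function name only,
/// imitating the behaviour before this PR.
#[derive(Clone, Debug)]
struct NameOnly(Predicate);

impl PartialEq for NameOnly {
    fn eq(&self, other: &Self) -> bool {
        self.0.func == other.0.func
    }
}

impl Eq for NameOnly {}

impl Hash for NameOnly {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.func.hash(state);
    }
}
```

Deduplicating `Predicate` values keeps both patterns in their original order, whereas deduplicating the `NameOnly` wrappers drops the second predicate even though it filters on a different pattern.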

* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs

github-actions bot added the logical-expr Logical plan and expressions label Jul 10, 2024

alamb mentioned this pull request Jul 10, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 #11334

Closed

9 tasks

alamb approved these changes Jul 10, 2024

joroKr21 added 3 commits July 11, 2024 10:28

Add customizable equality and hash functions to UDFs

5f9968f

Improve equals and hash_value documentation

c574af4

Add tests for parameterized UDFs

89e8dae

joroKr21 force-pushed the udf-eq-hash-main branch from 31a170b to 89e8dae Compare July 11, 2024 12:15

github-actions bot added the core Core DataFusion crate label Jul 11, 2024

joroKr21 mentioned this pull request Jul 11, 2024

Allow customizing UDF equality and hash coralogix/arrow-datafusion#251

Merged

alamb approved these changes Jul 11, 2024

alamb merged commit 4bed04e into apache:main Jul 11, 2024
24 checks passed

joroKr21 deleted the udf-eq-hash-main branch July 11, 2024 17:03
