Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add customizable equality and hash functions to UDFs #11392

Merged
merged 3 commits into from
Jul 11, 2024

Conversation

joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Jul 10, 2024

Which issue does this PR close?

Closes #127.

Rationale for this change

After #9436 it's possible to write and use all kinds of custom UDF functions e.g. parameterized by additional arguments, but the issue is that equality and hash code are based only on the name and signature. As such some plan rewrites that put expressions in a hash set (e.g. FilterPushDown) violate this assumption and we end up with semantically incorrect rewrites.

What changes are included in this PR?

Added equals and hash_value methods to scalar, aggregate and window UDFs. By default they delegate to the name and signature as before but custom UDFs can override them to account for additional parameters. Unfortunately we can't extend the Eq and Hash traits because they prevent us from using trait objects.

Are these changes tested?

Yes, added unit tests.

Are there any user-facing changes?

Yes, the UDF interfaces grow with two new functions.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jul 10, 2024
@@ -72,20 +76,19 @@ pub struct AggregateUDF {

impl PartialEq for AggregateUDF {
fn eq(&self, other: &Self) -> bool {
self.name() == other.name() && self.signature() == other.signature()
self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to be so general but the issue with dynamic equality is that it might not be symmetric.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility we could do is document that the equality test in the UDFs must be symmetric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with downcast_ref is that it's one sided. Perhaps we can document that and then change the implementation of aliased UDFs to be symmetric. Should it compare the aliases as well? I'm not sure what would be correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think comparing the inner function is the most straightforward... However, you are right that it does seem like it should be comparing the aliases too probably

@alamb
Copy link
Contributor

alamb commented Jul 10, 2024

Closes #127.

Trophy for lowest PR number closed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes sense to me @joroKr21 -- thank you

My only question is if there is some way to test this functionality (e.g. I am thinking if we accidentally broke this feature would we know from tests?)

Maybe it is not something that could be feasibly tested though. I can't really think of a great example of how to do it (other than to show the traits could be extended, which doesn't seem like a very useful test)

I had some small comment suggestions, but I don't think they are required and e could merge this PR as is as well.

Let us know!

@@ -72,20 +76,19 @@ pub struct AggregateUDF {

impl PartialEq for AggregateUDF {
fn eq(&self, other: &Self) -> bool {
self.name() == other.name() && self.signature() == other.signature()
self.inner.equals(other.inner.as_ref()) || other.inner.equals(self.inner.as_ref())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility we could do is document that the equality test in the UDFs must be symmetric.

Comment on lines 514 to 515
/// Dynamic equality. Allows customizing the equality of aggregate UDFs.
/// By default, compares the UDF name and signature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a suggestion to improve the docstring

Suggested change
/// Dynamic equality. Allows customizing the equality of aggregate UDFs.
/// By default, compares the UDF name and signature.
/// Return true if this aggregate UDF is equal to the other.
///
/// Allows customizing the equality of aggregate UDFs. Must be consistent
/// with [`Self::hash_value`].
///
/// By default, compares [`Self::name`] and [`Self::signature`]

Comment on lines 520 to 534
/// Dynamic hashing. Allows customizing the hash code of aggregate UDFs.
/// By default, hashes the UDF name and signature.
fn hash_value(&self) -> u64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to note here that eq and hash value need to be consistent.

Something like this perhaps:

Suggested change
/// Dynamic hashing. Allows customizing the hash code of aggregate UDFs.
/// By default, hashes the UDF name and signature.
fn hash_value(&self) -> u64 {
/// Returns a hash value for this aggregate UDF.
///
/// Allows customizing the hash code of aggregate UDFs. Similarly to
/// [`std::hash::Hash`], if [`Self::equals`]
/// returns true for two aggregate UDFs, the value of `hash_value` must as well.
///
/// By default, hashes [`Self::name`] and [`Self::signature`]
fn hash_value(&self) -> u64 {

@@ -562,6 +580,18 @@ impl AggregateUDFImpl for AliasedAggregateUDFImpl {
fn aliases(&self) -> &[String] {
&self.aliases
}

fn equals(&self, other: &dyn AggregateUDFImpl) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me as the name and signature are the same as the inner

@@ -540,6 +541,21 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
not_impl_err!("Function {} does not implement coerce_types", self.name())
}

/// Dynamic equality. Allows customizing the equality of scalar UDFs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend the same documentation updates here as for aggregateUDF

@@ -296,6 +299,21 @@ pub trait WindowUDFImpl: Debug + Send + Sync {
fn simplify(&self) -> Option<WindowFunctionSimplification> {
None
}

/// Dynamic equality. Allows customizing the equality of window UDFs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here for doc updates

@joroKr21
Copy link
Contributor Author

Maybe it is not something that could be feasibly tested though. I can't really think of a great example of how to do it (other than to show the traits could be extended, which doesn't seem like a very useful test)

Yes, I have an idea. I think we already have some parameterized UDFs in the tests. I will just create a query that would be broken currently and be fixed by these changes.

@github-actions github-actions bot added the core Core DataFusion crate label Jul 11, 2024
@joroKr21
Copy link
Contributor Author

@alamb this should be ready

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @joroKr21


assert_eq!(
format!("{plan:?}"),
"Filter: t.text IS NOT NULL\n Filter: regex_udf(t.text) AND regex_udf(t.text)\n TableScan: t projection=[text]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without the changes in this PR are the expressions combined by CSE or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular case is deduplicated in PushDownFilter:

            LogicalPlan::Filter(child_filter) => {
                let parents_predicates = split_conjunction_owned(filter.predicate);

                // remove duplicated filters
                let child_predicates = split_conjunction_owned(child_filter.predicate);
                let new_predicates = parents_predicates
                    .into_iter()
                    .chain(child_predicates)
                    // use IndexSet to remove dupes while preserving predicate order
                    .collect::<IndexSet<_>>()
                    .into_iter()
                    .collect::<Vec<_>>();

@alamb alamb merged commit 4bed04e into apache:main Jul 11, 2024
24 checks passed
@joroKr21 joroKr21 deleted the udf-eq-hash-main branch July 11, 2024 17:03
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
* Add customizable equality and hash functions to UDFs

* Improve equals and hash_value documentation

* Add tests for parameterized UDFs
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Add customizable equality and hash functions to UDFs

* Improve equals and hash_value documentation

* Add tests for parameterized UDFs
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* Add customizable equality and hash functions to UDFs

* Improve equals and hash_value documentation

* Add tests for parameterized UDFs
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* Add customizable equality and hash functions to UDFs

* Improve equals and hash_value documentation

* Add tests for parameterized UDFs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the PartialEq implementation for UDF expressions
2 participants