Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate ordering equivalence for expressions (rather than just columns) #8281

Merged
merged 7 commits into from
Nov 23, 2023
Merged

Calculate ordering equivalence for expressions (rather than just columns) #8281

merged 7 commits into from
Nov 23, 2023

Conversation

mustafasrepo
Copy link
Contributor

Which issue does this PR close?

Improves the situation on #8064

Rationale for this change

Currently we cannot determine whether ordering requirement of the complex expressions are satisfied.
Consider the case where requirement is [a+b ASC], and we know that existing orderings are [a ASC], [b ASC]. It should be possible to understand that given [a ASC] and [b ASC], requirement: [a+b ASC] is satisfied (Please note that we already have mechanism to make deduction. However, it is not considered in requirement satisfaction analysis. Currently we search requirement among existing orderings, this limits our ability during analysis).

What changes are included in this PR?

This PR re-implements the find_longest_permutation and ordering_satisfy_requirement so that ordering of the complex(composed) expressions are calculated correctly, and considered during analysis. Most of the changes comes from the tests, and test related changes.

Are these changes tested?

Yes.

Are there any user-facing changes?

mustafasrepo and others added 5 commits November 17, 2023 08:59
* Discover ordering of complex expressions in group by and window partition by

* Remove unnecessary tests

* Update comments

* Minor changes

* Better projection support complex expression support

* Fix failing test

* Simplifications

* Simplifications

* Add is end flag

* Simplifications

* Simplifications

* Simplifications

* Minor changes

* Minor changes

* Minor changes

* All tests pass

* Change implementation of find_longest_permutation

* Minor changes

* Minor changes

* Remove projection section

* Remove projection implementation

* Fix linter errors

* Remove projection sections

* Minor changes

* Add docstring comments

* Add comments

* Minor changes

* Minor changes

* Add comments

* simplifications
@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 20, 2023
@ozankabak
Copy link
Contributor

I will do another deep dive on this tomorrow

@alamb
Copy link
Contributor

alamb commented Nov 21, 2023

Thank you @mustafasrepo -- I plan to review this carefully tomorrow

@alamb alamb changed the title Complex exprs requirement support Calculate ordering equivalence for expressions (rather than just columns) Nov 22, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a third pass today and LGTM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @mustafasrepo -- can you explain how non monotonic functions are handled (like extract(month from ts)) where the expressions don't preserve the ordering of their inputs? I didn't see any tests for such functions or code that handles them

Sorry I may have just missed it / don't understand the code

@@ -357,15 +357,19 @@ mod tests {
let physical_plan =
sort_preserving_merge_exec(vec![sort_expr("a", &schema)], sort);

let expected_input = ["SortPreservingMergeExec: [a@0 ASC NULLS LAST]",
let expected_input = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these plans actually change? Or is this just whitespace changes?

If it is just whitespace changes I would really appreciate breaking such changes out into their own PRs as they are much faster / easier to review and merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just whitespace changes. Agree with your suggestion in general, we will try to improve our self discipline on that

@@ -3787,7 +3787,7 @@ pub(crate) mod tests {
fn repartition_transitively_past_sort_with_projection_and_filter() -> Result<()> {
let schema = schema();
let sort_key = vec![PhysicalSortExpr {
expr: col("c", &schema).unwrap(),
expr: col("a", &schema).unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this test changed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the plan, there is a filter enforcing c = 0 -- so having a sort on c is a no-op logically. After recent improvements we are able to recognize things like that and remove sorts completely, so the test became moot in its old form.

query P
SELECT date_bin('15 minutes', ts) as time_chunks
FROM csv_with_timestamps
GROUP BY (date_bin('15 minutes', ts))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about the seeming extra ( and ) here and in the above query:

Is there any reason it is not:

Suggested change
GROUP BY (date_bin('15 minutes', ts))
GROUP BY date_bin('15 minutes', ts)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, it was unnecessary -- now removed

----
2018-12-13T12:00:00
2018-11-13T17:00:00

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add a negative test that can not work in streaming mode with an expression that doesn't preserve the order?

perhaps something like

SELECT extract(month from ts) as months,
  FROM csv_with_timestamps
  GROUP BY extract(month from ts)
  ORDER BY time_chunks DESC
  LIMIT 5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I added a test that shows the query works when the function is monotonic but not otherwise for unbounded sources. I also added a test that shows the query works regardless of the function's monotonicity for bounded sources.

@ozankabak
Copy link
Contributor

Thanks for the review @alamb. I answered your questions and added a couple tests to address your suggestions (verifying behavior difference for monotonic/nonmonotonic functions). Regarding your question, complex expression orderings are calculated via the get_expr_ordering/update_ordering functions which handle monotonic/nonmonotonic expressions.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ozankabak -- I didn't review the logic changes carefully but I don't think I need to either given you have. The tests added to datafusion/sqllogictest/test_files/groupby.slt look great to me and convince me this works correctly

@ozankabak
Copy link
Contributor

ozankabak commented Nov 22, 2023

For some reason sqllogictests are failing in CI (they pass locally for me). Looking at the logs, it seems like the expected/actual are the same, so I can't figure out why this is the case. I'll let @mustafasrepo figure it out and fix tomorrow and we can merge then.

In the meantime if there is more suggestions for improvements from the community we will incorporate them as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants