Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Don't repartition ProjectionExec when it does not compute anything #5074

Merged
merged 1 commit into from
Feb 12, 2023

Conversation

xiaoyong-z
Copy link
Contributor

Which issue does this PR close?

Closes #4968

Rationale for this change

ProjectionExec can have the following two types of computations:

  1. reorder/rename
  2. other computations like col1 + col2 For reorder/rename, ProjectionExec will not benefit from repartition, we should disable the repartition if all exprs are reorder and rename.

What changes are included in this PR?

In this pr, we introduce would_benefit to ProjectionExec, if it is true, then ProjectionExec would benefit from partitions, benefits_from_input_partitioning in ProjectionExec will return true. Otherwise, benefits_from_input_partitioning will return false. would_benefit will be false if only if all exprs are column_expr.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Jan 26, 2023
@xiaoyong-z
Copy link
Contributor Author

image

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me. Thank you @xiaoyong-z

@alamb
Copy link
Contributor

alamb commented Jan 26, 2023

Looks like there are a few CI tests that need to be updated

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense. I see two points of discussion about this:

  • Do we really need this new is_column_expr function in the API? It doesn't seem to have any upcoming usages elsewhere. Not adding a new API and starting with a simple downcast-and-check seems more reasonable to me at this point.
  • The namewould_benefit is not the clearest we can choose for this. For the unsuspecting reader, it is not obvious what it refers to until you see that it is used in benefits_from_repartitioning. Why not use the same name benefits_from_repartitioning for the field too? IIRC, Rust allows this and in other languages people use this idiom for simple 'getters' like this. Alternatively, if we want to have separate names for the attribute and the function, we can always use something like repartition_benefiting so that it has clarity.

I also left one minor in-line comment about code style.

datafusion/core/src/physical_plan/projection.rs Outdated Show resolved Hide resolved
@xiaoyong-z xiaoyong-z force-pushed the optimize-repartition-for-projection branch from f1b6a48 to fac165e Compare January 28, 2023 13:04
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) and removed physical-expr Physical Expressions labels Jan 28, 2023
@xiaoyong-z
Copy link
Contributor Author

xiaoyong-z commented Jan 28, 2023

after i disable the repartition for projection when it does not compute anything, csv_query_array_agg and csv_query_array_agg_one test fails:
image
i think this problem is introduced by a new bug, so i comment these test and plan to solve this problem in another pr,
i have opened an issue tracking the problem,
#5100.

@alamb
Copy link
Contributor

alamb commented Jan 29, 2023

I plan to review this PR carefully tomorrow

@ozankabak
Copy link
Contributor

As I mentioned in #5100, we plan to fix it early this week ASAP. Assuming no issues at the review, I suggest waiting on merging this a day or two so that this can get merge without the commented-out tests.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @xiaoyong-z -- I agree with @ozankabak that we should wait for the fix of #5100 prior to merging this PR so we don't have to disable the tests

I also think that columns with aliases (e.g. select column date as close_date) should not get repartitioned. I confirmed that in the physical expressions, I could not find any alias exprs https://docs.rs/datafusion-physical-expr/16.1.0/datafusion_physical_expr/index.html?search=alias 👍

datafusion/core/tests/sql/explain_analyze.rs Show resolved Hide resolved
@xiaoyong-z
Copy link
Contributor Author

Thanks again @xiaoyong-z -- I agree with @ozankabak that we should wait for the fix of #5100 prior to merging this PR so we don't have to disable the tests

I also think that columns with aliases (e.g. select column date as close_date) should not get repartitioned. I confirmed that in the physical expressions, I could not find any alias exprs https://docs.rs/datafusion-physical-expr/16.1.0/datafusion_physical_expr/index.html?search=alias 👍

alias expression itself is also a column_expr, we only need to check whether this expr is a column_expr. I have verified that select column date as close_date will not be repartitioned.
image

@mustafasrepo
Copy link
Contributor

We are working on a PR to refactor EnforceSorting rule, and add new capabilities. As part of that PR we have resolved issue #5100 also. It should be ready for upstream in 1 day or 2. I will let you know when it is sent to upstream. Thanks for your patience

@mustafasrepo
Copy link
Contributor

We are working on a PR to refactor EnforceSorting rule, and add new capabilities. As part of that PR we have resolved issue #5100 also. It should be ready for upstream in 1 day or 2. I will let you know when it is sent to upstream. Thanks for your patience

We have sent the PR upstream. We welcome your review. With this PR in place, you can uncomment the failing tests.

@ozankabak
Copy link
Contributor

@xiaoyong-z: FYI, the dependency has merged. If you resolve the conflicts and uncomment the previously failing tests, we can do a final review and this should be in good shape 🙂

…te anything

ProjectionExec can have the following two types of computations:
1. reorder/rename
2. other computations like col1 + col2
For reorder/rename, ProjectionExec will not benefit from repartition,
we should disable the repartition if all exprs are reorder and rename.

In this pr, we introduce `would_benefit` to ProjectionExec, if it is true,
then ProjectionExec would benefit from partitions, benefits_from_input_partitioning in
ProjectionExec will return true. Otherwise, benefits_from_input_partitioning will return
false. would_benefit will be false if only if all exprs are column_expr.

Signed-off-by: xyz <a997647204@gmail.com>
@xiaoyong-z xiaoyong-z force-pushed the optimize-repartition-for-projection branch from fac165e to 0fb0f2d Compare February 11, 2023 07:54
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 11, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks good to go -- thank you @xiaoyong-z

I reviewed the plan changes carefully, and I found reviewing with whitespace blond diff made the differences quite clear.

Thanks for sticking with this. 🏆

https://github.com/apache/arrow-datafusion/pull/5074/files?w=1

datafusion/core/tests/sql/joins.rs Show resolved Hide resolved
@xiaoyong-z
Copy link
Contributor Author

cc @ozankabak @Dandandan

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too, thank you!

@alamb alamb merged commit fc211c3 into apache:master Feb 12, 2023
@ursabot
Copy link

ursabot commented Feb 12, 2023

Benchmark runs are scheduled for baseline = 50f6e69 and contender = fc211c3. fc211c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't repartition ProjectionExec when it does not compute anything
6 participants