[Enhancement] Don't repartition ProjectionExec when it does not compute anything #5074
Looks reasonable to me. Thank you @xiaoyong-z
Looks like there are a few CI tests that need to be updated.
Thanks, this makes sense. I see two points of discussion about this:
- Do we really need this new `is_column_expr` function in the API? It doesn't seem to have any upcoming usages elsewhere. Not adding a new API and starting with a simple downcast-and-check seems more reasonable to me at this point.
- The name `would_benefit` is not the clearest we can choose for this. For the unsuspecting reader, it is not obvious what it refers to until you see that it is used in `benefits_from_repartitioning`. Why not use the same name `benefits_from_repartitioning` for the field too? IIRC, Rust allows this and in other languages people use this idiom for simple 'getters' like this. Alternatively, if we want to have separate names for the attribute and the function, we can always use something like `repartition_benefiting` so that it has clarity.
I also left one minor in-line comment about code style.
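The downcast-and-check suggested in the first point could look roughly like the sketch below. It is self-contained and does not use DataFusion: the `PhysicalExpr` trait, `Column`, and `BinaryExpr` are simplified stand-ins, and `all_columns` is a hypothetical helper name. The `Projection` struct also shows a field and a getter sharing the name `benefits_from_repartitioning`, which Rust permits because fields and methods live in separate namespaces.

```rust
use std::any::Any;
use std::sync::Arc;

// Simplified stand-in for a physical expression trait (not the DataFusion API).
trait PhysicalExpr {
    fn as_any(&self) -> &dyn Any;
}

// A bare column reference.
#[allow(dead_code)]
struct Column {
    name: String,
    index: usize,
}

// An expression that actually computes something, e.g. `a + b`.
#[allow(dead_code)]
struct BinaryExpr {
    left: Arc<dyn PhysicalExpr>,
    right: Arc<dyn PhysicalExpr>,
}

impl PhysicalExpr for Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl PhysicalExpr for BinaryExpr {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Hypothetical helper: true when every expression downcasts to `Column`.
fn all_columns(exprs: &[Arc<dyn PhysicalExpr>]) -> bool {
    exprs
        .iter()
        .all(|e| e.as_any().downcast_ref::<Column>().is_some())
}

// Hypothetical operator: the field and the getter share one name.
struct Projection {
    benefits_from_repartitioning: bool,
}

impl Projection {
    fn new(exprs: &[Arc<dyn PhysicalExpr>]) -> Self {
        Self {
            benefits_from_repartitioning: !all_columns(exprs),
        }
    }

    fn benefits_from_repartitioning(&self) -> bool {
        self.benefits_from_repartitioning
    }
}
```

With this shape, no new public function is needed: the check stays private to the operator and the getter simply returns the value computed at construction time.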
f1b6a48 to fac165e
After I disabled repartitioning for projections that do not compute anything, the csv_query_array_agg and csv_query_array_agg_one tests fail.
I plan to review this PR carefully tomorrow.
As I mentioned in #5100, we plan to fix it early this week. Assuming no issues come up in review, I suggest waiting a day or two before merging this so that it can get merged without the commented-out tests.
Thanks again @xiaoyong-z -- I agree with @ozankabak that we should wait for the fix of #5100 prior to merging this PR so we don't have to disable the tests.
I also think that columns with aliases (e.g. `select column date as close_date`) should not get repartitioned. I confirmed that in the physical expressions, I could not find any alias exprs https://docs.rs/datafusion-physical-expr/16.1.0/datafusion_physical_expr/index.html?search=alias 👍
An alias expression is itself also a column_expr, so we only need to check whether the expr is a column_expr. I have verified that `select column date as close_date` will not be repartitioned.
We have sent the PR upstream. We welcome your review. With this PR in place, you can uncomment the failing tests.
@xiaoyong-z: FYI, the dependency has merged. If you resolve the conflicts and uncomment the previously failing tests, we can do a final review and this should be in good shape 🙂
…te anything

ProjectionExec can have the following two types of computations:
1. reorder/rename
2. other computations like col1 + col2

For reorder/rename, ProjectionExec will not benefit from repartitioning, so we should disable the repartition if all exprs are reorders and renames.

In this PR, we introduce `would_benefit` to ProjectionExec. If it is true, ProjectionExec would benefit from partitioning, and benefits_from_input_partitioning in ProjectionExec will return true. Otherwise, benefits_from_input_partitioning will return false. would_benefit will be false if and only if all exprs are column_expr.

Signed-off-by: xyz <a997647204@gmail.com>
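The rule described in this commit message can be modelled in a few lines. The sketch below is not the DataFusion implementation: `Expr`, `Projection`, and the pairing of each expression with an output name are simplified assumptions for illustration. A rename such as `date AS close_date` is modelled as a column expression paired with a different output name, so it still counts as a pure column projection.

```rust
// Simplified expression model (an assumption, not DataFusion's PhysicalExpr).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Expr {
    // A reference to an input column by name.
    Column(String),
    // A computed expression such as `col1 + col2`.
    Add(Box<Expr>, Box<Expr>),
}

// Hypothetical stand-in for the projection operator.
#[allow(dead_code)]
struct Projection {
    // Each entry pairs an expression with its output name; a rename only
    // changes the name, not the kind of expression.
    exprs: Vec<(Expr, String)>,
    would_benefit: bool,
}

impl Projection {
    fn new(exprs: Vec<(Expr, String)>) -> Self {
        // False if and only if every expression is a bare column reference.
        let would_benefit = !exprs.iter().all(|(e, _)| matches!(e, Expr::Column(_)));
        Self { exprs, would_benefit }
    }

    fn benefits_from_input_partitioning(&self) -> bool {
        self.would_benefit
    }
}
```

Computing the flag once in the constructor keeps `benefits_from_input_partitioning` a trivial getter, which is the structure the message describes.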
fac165e to 0fb0f2d
I think this PR looks good to go -- thank you @xiaoyong-z
I reviewed the plan changes carefully, and I found that reviewing with a whitespace-blind diff made the differences quite clear.
Thanks for sticking with this. 🏆
https://github.com/apache/arrow-datafusion/pull/5074/files?w=1
LGTM too, thank you!
Benchmark runs are scheduled for baseline = 50f6e69 and contender = fc211c3. fc211c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #4968
Rationale for this change
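ProjectionExec can have the following two types of computations:
1. reorder/rename
2. other computations like col1 + col2
For reorder/rename, ProjectionExec will not benefit from repartitioning, so we should disable the repartition if all exprs are reorders and renames.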
What changes are included in this PR?
In this PR, we introduce `would_benefit` to ProjectionExec. If it is true, ProjectionExec would benefit from partitioning, and benefits_from_input_partitioning in ProjectionExec will return true. Otherwise, benefits_from_input_partitioning will return false. would_benefit will be false if and only if all exprs are column_expr.
Are these changes tested?
Yes
Are there any user-facing changes?
No