-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: RewriteCycle API for short-circuiting optimizer loops #10386
feat: RewriteCycle API for short-circuiting optimizer loops #10386
Conversation
cdb389a
to
7330dfe
Compare
ae31462
to
dcc9140
Compare
impl RewriteCycle { | ||
fn new() -> Self { | ||
RewriteCycle { | ||
// use usize::MAX as default to avoid checking null in is_done() comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason to avoid checking null because of cost? I'm fine with either way, but curious on the choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think modeling this with num_rewriters: Option<usize>
would make it clearer what is going on and help the compiler check the logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think modeling this with
num_rewriters: Option<usize>
would make it clearer what is going on and help the compiler check the logic
yes I tried to get too fancy here with micro-optimizations when I saw benchmark regressions 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cycle_length
would also be a more clear name for this variable 🤔 I think it is confusing that it is only initialized after the first iteration
} | ||
|
||
fn is_done(&self) -> bool { | ||
self.consecutive_unchanged_count >= self.num_rewriters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to stop looping is checking if transformed
meet the num of rewriters, but can we just stop if any of the transformed
is false? I think it is more straightforward and we don't need to have a max_iter
num too.
Idea is something like
loop {
transformed |= rewrite_rule()
if not transformed {
return
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I think there are cases when simplification will do something that allows more constant propagation to proceed so even if transformed=false
is returned another application of a different pass could actually simplify things -- so in this case for example if the const evaluator returns false, simpliy could still return true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in this case for example if the const evaluator returns false, simpliy could still return true
In this case, transformed is true, so we should run another pass.
I think once we found every rules is not transformed in a pass that means we done the optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once we found every rules is not transformed in a pass that means we done the optimization
Yes, I agree -- I think this is what the counter is attempting to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I think there are cases when simplification will do something that allows more constant propagation to proceed so even if
transformed=false
is returned another application of a different pass could actually simplify things -- so in this case for example if the const evaluator returns false, simpliy could still return true
yes I found this to be true when running tests against this code. an additional constant evaluation produced a new transformation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @erratic-pattern and @jayzhan211
I am going to run the planning benchmarks on this branch using a VM and report performance numbers
impl RewriteCycle { | ||
fn new() -> Self { | ||
RewriteCycle { | ||
// use usize::MAX as default to avoid checking null in is_done() comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think modeling this with num_rewriters: Option<usize>
would make it clearer what is going on and help the compiler check the logic
Here are my measurements on my gcp machine (it does seem to help by 1-2%) Details
|
My second run was consistent: Details
|
Other PR has been merged in: #10358 I think we can merge / rebase this PR now and mark it ready for review |
I will take another look at this and see if I can clean it up a bit more. |
a0b8397
to
4700004
Compare
83de911
to
f8e5c75
Compare
datafusion/common/src/tree_node.rs
Outdated
@@ -503,7 +503,7 @@ pub trait TreeNodeVisitor: Sized { | |||
/// | |||
/// # See Also: | |||
/// * [`TreeNode::visit`] to inspect borrowed `TreeNode`s | |||
pub trait TreeNodeRewriter: Sized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this was a supertrait of Sized
but it can't be if we want trait objects
datafusion/common/src/tree_node.rs
Outdated
@@ -172,7 +172,7 @@ pub trait TreeNode: Sized { | |||
/// TreeNodeRewriter::f_up(ChildNode2) | |||
/// TreeNodeRewriter::f_up(ParentNode) | |||
/// ``` | |||
fn rewrite<R: TreeNodeRewriter<Node = Self>>( | |||
fn rewrite<R: TreeNodeRewriter<Node = Self> + ?Sized>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed in order to be callable with a trait object
Is this PR ready for the next round of review @erratic-pattern ? Or do you plan to make further changes to it? |
@alamb I will try to update this today or tomorrow. I've been putting this off a bit. |
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
1886002
to
c1ffdfa
Compare
} | ||
|
||
pub type RewriteCycleControlFlow<T> = ControlFlow<Result<T>, T>; | ||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new tests
efb2966
to
3161f14
Compare
Checking this out... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @erratic-pattern and @jayzhan211 -- I think this PR looks really nice.
I have some documentation / naming suggestions, but nothing that I think is required
I also tried out @jayzhan211 's suggestion https://github.com/apache/datafusion/pull/10386/files#r1632502092 and it seems to work well , so I think you should respond to that before we merge this PR
While I didn't see any performance change with this PR, I think it is worthwhile simply for the readability improvements
Performance
I ran the planning benchmark with this brach an in summary there is basically no performance change
logical_plan_tpcds_all 1.00 155.0±1.26ms ? ?/sec 1.00 155.2±1.27ms ? ?/sec
logical_plan_tpch_all 1.00 17.1±0.21ms ? ?/sec 1.00 17.2±0.17ms ? ?/sec
This may be due to the fact we have improved the performance of some of the optimizer passes now so the overhead of running them multiple times is not as great.
Details
++ critcmp main simplifier-static-dispatch
group main simplifier-static-dispatch
----- ---- --------------------------
logical_aggregate_with_join 1.00 1009.0±12.76µs ? ?/sec 1.01 1014.2±13.57µs ? ?/sec
logical_plan_tpcds_all 1.00 155.0±1.26ms ? ?/sec 1.00 155.2±1.27ms ? ?/sec
logical_plan_tpch_all 1.00 17.1±0.21ms ? ?/sec 1.00 17.2±0.17ms ? ?/sec
logical_select_all_from_1000 1.00 19.0±0.12ms ? ?/sec 1.00 19.1±0.17ms ? ?/sec
logical_select_one_from_700 1.01 825.9±8.11µs ? ?/sec 1.00 819.5±8.51µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 768.0±11.59µs ? ?/sec 1.06 811.6±41.08µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 758.8±23.88µs ? ?/sec 1.00 756.8±10.68µs ? ?/sec
physical_plan_tpcds_all 1.00 1250.3±8.57ms ? ?/sec 1.00 1248.0±4.97ms ? ?/sec
physical_plan_tpch_all 1.00 87.0±1.17ms ? ?/sec 1.00 86.8±1.06ms ? ?/sec
physical_plan_tpch_q1 1.00 4.6±0.06ms ? ?/sec 1.03 4.7±0.07ms ? ?/sec
physical_plan_tpch_q10 1.00 4.1±0.07ms ? ?/sec 1.00 4.1±0.07ms ? ?/sec
physical_plan_tpch_q11 1.00 3.6±0.06ms ? ?/sec 1.01 3.6±0.05ms ? ?/sec
physical_plan_tpch_q12 1.00 2.7±0.03ms ? ?/sec 1.00 2.7±0.04ms ? ?/sec
physical_plan_tpch_q13 1.00 2.0±0.02ms ? ?/sec 1.00 2.0±0.02ms ? ?/sec
physical_plan_tpch_q14 1.00 2.4±0.04ms ? ?/sec 1.00 2.4±0.04ms ? ?/sec
physical_plan_tpch_q16 1.02 3.6±0.07ms ? ?/sec 1.00 3.5±0.06ms ? ?/sec
physical_plan_tpch_q17 1.00 3.4±0.06ms ? ?/sec 1.00 3.4±0.06ms ? ?/sec
physical_plan_tpch_q18 1.01 3.8±0.07ms ? ?/sec 1.00 3.8±0.07ms ? ?/sec
physical_plan_tpch_q19 1.02 5.6±0.08ms ? ?/sec 1.00 5.6±0.07ms ? ?/sec
physical_plan_tpch_q2 1.00 7.4±0.08ms ? ?/sec 1.03 7.6±0.10ms ? ?/sec
physical_plan_tpch_q20 1.02 4.5±0.13ms ? ?/sec 1.00 4.4±0.06ms ? ?/sec
physical_plan_tpch_q21 1.00 6.0±0.08ms ? ?/sec 1.00 6.0±0.10ms ? ?/sec
physical_plan_tpch_q22 1.01 3.3±0.09ms ? ?/sec 1.00 3.2±0.05ms ? ?/sec
physical_plan_tpch_q3 1.00 3.0±0.04ms ? ?/sec 1.01 3.0±0.07ms ? ?/sec
physical_plan_tpch_q4 1.00 2.2±0.02ms ? ?/sec 1.02 2.2±0.04ms ? ?/sec
physical_plan_tpch_q5 1.00 4.2±0.06ms ? ?/sec 1.02 4.3±0.09ms ? ?/sec
physical_plan_tpch_q6 1.00 1449.0±66.64µs ? ?/sec 1.01 1461.8±26.90µs ? ?/sec
physical_plan_tpch_q7 1.00 5.3±0.09ms ? ?/sec 1.01 5.3±0.08ms ? ?/sec
physical_plan_tpch_q8 1.00 6.8±0.09ms ? ?/sec 1.02 6.9±0.09ms ? ?/sec
physical_plan_tpch_q9 1.00 5.2±0.08ms ? ?/sec 1.01 5.2±0.09ms ? ?/sec
physical_select_all_from_1000 1.00 61.3±0.26ms ? ?/sec 1.01 61.7±0.24ms ? ?/sec
physical_select_one_from_700 1.00 3.6±0.04ms ? ?/sec 1.01 3.6±0.03ms ? ?/sec
self.max_cycles | ||
} | ||
|
||
/// Runs a rewrite cycle on the given [TreeNode] using the given callback function to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please define how a "cycle" and "iteration" are related in this function? They are used in the apis below but I didn't see a definition of the terms and how they are related
I think it is that a cycle is composed of several iterations
Alternately, maybe "rewrites" is a better term than "iteration" as this is a "rewrite cycle" 🤔
#[derive(Debug)] | ||
pub struct RewriteCycleState<Node: TreeNode> { | ||
node: Node, | ||
consecutive_unchanged_count: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help me follow the code better if these fields used the terms "cycle" and "iteration" consistently
Like
consecutive_unchanged_iterations: usize,
iteration_count: usize,
// the number of iterations in each cycle (set after the first cycle)
cycle_length: Option<usize>,
self.consecutive_unchanged_count >= cycle_length | ||
} | ||
|
||
/// Finishes the iteration by consuming the state and returning a [TreeNode] and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this finishes the cycle right (not the iteration?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this finishes the cycle right (not the iteration?)
it finishes all cycles. as in, it runs all of the remaining iterations and terminates the loop. I will reword to clarify
} | ||
// run remaining cycles | ||
match (1..self.max_cycles).try_fold(state, |state, _| f(state)) { | ||
ControlFlow::Break(result) => result?.finish(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this suggestion locally and it seems to work well 👍
ControlFlow::Break(result) => result?.finish(), | |
ControlFlow::Break(result) => return result?.finish(), |
I fixed a clippy lint and merged up from main to get the latest changes into this PR |
Yes that is what I observed as well. I assume the overhead offsets any potential improvement from skipping steps here. I think the value of this API comes from the potential reuseability of the logic for the entire optimizer. It is not ready to do that in its current state, but with some improvements it could be used across the optimizer. |
I will update this PR with review suggestions (maybe) this weekend |
Marking as draft as feedback is incorporated |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
d64dbe3
to
f5d6ed4
Compare
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Going to update this soon to get it ready for merging. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #1160.
Rationale for this change
This is a follow up to #10358 with a new approach that should short-circuit earlier. See previous discussion there.
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
no