
Refactor physical create_initial_plan to iteratively & concurrently construct plan from the bottom up #10023

Merged: 13 commits, Apr 16, 2024

Conversation

Contributor

@Jefffrey Jefffrey commented Apr 10, 2024

Which issue does this PR close?

Closes #9573

Rationale for this change

Rather than recursively constructing the initial physical plan from the top down, which can cause stack overflow errors on deeply nested plans, construct the plan iteratively from the bottom up, and do so concurrently.

These were the considerations that led to this design:

  • Can't construct top down, as parent ExecutionPlans need to know their children at construction time, unless we insert dummy children (like EmptyRelation, which might itself cause other issues) or introduce an intermediate representation between LogicalPlan and ExecutionPlan that allows incomplete nodes to have their children populated after construction (which seems like more work)
  • Use a tree flattened into a Vec: since we construct bottom up, children need a way to find their parents, and storing parent indices in a Vec is easier than trying to add Arcs to parents

What changes are included in this PR?

Split up create_initial_plan so that the mapping from LogicalPlan to Arc<dyn ExecutionPlan> happens in a separate function, for organization.

In create_initial_plan we first perform a DFS over the tree to build a flat Vec representation, which stores &LogicalPlan references so the tree isn't duplicated (this traversal is itself iterative).
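The flattening step described above can be sketched roughly as follows. This is a hypothetical standalone example, not the actual DataFusion code: `Node` stands in for `LogicalPlan`, and `FlatNode`/`flatten` are illustrative names. An explicit stack replaces the call stack, and each entry records the Vec index of its parent so tasks can later walk upward.

```rust
// `Node` stands in for `LogicalPlan` in this sketch.
struct Node {
    children: Vec<Node>,
}

struct FlatNode<'a> {
    parent_index: Option<usize>, // None only for the root
    node: &'a Node,
}

// Iterative DFS: returns the flat tree plus the indices of its leaves,
// which is where the build tasks will start.
fn flatten(root: &Node) -> (Vec<FlatNode<'_>>, Vec<usize>) {
    let mut flat = Vec::new();
    let mut leaf_indices = Vec::new();
    // Explicit stack of (node, parent index in `flat`) pairs.
    let mut stack = vec![(root, None)];
    while let Some((node, parent_index)) = stack.pop() {
        let index = flat.len();
        flat.push(FlatNode { parent_index, node });
        if node.children.is_empty() {
            leaf_indices.push(index);
        }
        for child in &node.children {
            stack.push((child, Some(index)));
        }
    }
    (flat, leaf_indices)
}

fn main() {
    // A 5-node tree: root -> [inner -> [leaf, leaf], leaf].
    let root = Node {
        children: vec![
            Node { children: vec![Node { children: vec![] }, Node { children: vec![] }] },
            Node { children: vec![] },
        ],
    };
    let (flat, leaf_indices) = flatten(&root);
    assert_eq!(flat.len(), 5);
    assert_eq!(leaf_indices.len(), 3);
    assert!(flat[0].parent_index.is_none()); // root is entered first
    println!("flattened {} nodes, {} leaves", flat.len(), leaf_indices.len());
}
```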

With the flat tree, we can spawn async tasks from the leaves, which attempt to build the individual branches of the tree from the bottom up, towards the root.

When a task encounters a node with 2 or more children, which represents a collision point with other tasks, it appends its completed branch to this parent node (whose ready child branches are kept behind a Mutex<Vec<_>> for concurrent safety) and then checks whether enough children are now available to build the node. If not, the task terminates.

If there are enough children, the current task is the last one to reach the parent, so it constructs the parent and then continues to traverse up towards the root.

This continues until only one task remains, which emits the root of the tree.
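The collision-point rule above can be sketched as a minimal standalone illustration, using OS threads in place of Tokio tasks, i64 values in place of Arc<dyn ExecutionPlan>, and std's Mutex (the final PR actually switched to tokio's Mutex). `Collision` and `deliver` are hypothetical names, not the PR's API.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-ins: `expected` is the node's child count, and the
// Mutex-guarded Vec holds the child branches that are ready so far.
struct Collision {
    expected: usize,
    ready: Mutex<Vec<i64>>,
}

// Each task calls this when it reaches a multi-child parent. Only the
// task that delivers the final child gets Some(..) and continues upward;
// every other task gets None and terminates.
fn deliver(point: &Collision, branch: i64) -> Option<i64> {
    let mut ready = point.ready.lock().unwrap();
    ready.push(branch);
    if ready.len() == point.expected {
        // Last arriver "builds the parent" from all child results.
        Some(ready.iter().sum())
    } else {
        None
    }
}

fn main() {
    let point = Arc::new(Collision { expected: 3, ready: Mutex::new(Vec::new()) });
    let handles: Vec<_> = (1..=3i64)
        .map(|branch| {
            let point = Arc::clone(&point);
            thread::spawn(move || deliver(&point, branch))
        })
        .collect();
    let results: Vec<Option<i64>> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    // Exactly one task survives the collision point, and it sees all children.
    assert_eq!(results.iter().filter(|r| r.is_some()).count(), 1);
    assert_eq!(results.iter().flatten().sum::<i64>(), 6);
}
```

Whichever thread arrives last is nondeterministic, but exactly one of them continues, which is what lets the task count shrink toward 1 at the root.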

Are these changes tested?

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Apr 10, 2024
Contributor Author

@Jefffrey Jefffrey left a comment

Keen to get some thoughts on this. Even if we decide not to go ahead with this, it was a fun exercise to try out 🙂

Comment on lines +559 to +566
let planning_concurrency = session_state
.config_options()
.execution
.planning_concurrency;
// Can never spawn more tasks than leaves in the tree, as these tasks must
// all converge down to the root node, which can only be processed by a
// single task.
let max_concurrency = planning_concurrency.min(flat_tree_leaf_indices.len());
Contributor Author

@Jefffrey Jefffrey Apr 10, 2024

I think this may be more accurate than what is currently present:

https://github.com/apache/arrow-datafusion/blob/215f30f74a12e91fd7dca0d30e37014c8c3493ed/datafusion/core/src/physical_planner.rs#L499-L542

Because the current create_initial_plan_multi can be called multiple times, and planning_concurrency is only enforced within each call, calling it multiple times can spawn more tasks in total than planning_concurrency configures, I think. Maybe this is intended?

Either way, with this new code, it will actually limit how many tasks are building the tree for the entire initial planning process.

Contributor

I don't think the original behavior is intended. Your change makes sense to me

Comment on lines 658 to 675
/// Given a single LogicalPlan node, map it to its physical ExecutionPlan counterpart.
async fn map_logical_node_to_physical(
&self,
node: &LogicalPlan,
session_state: &SessionState,
// TODO: refactor to not use Vec? Wasted for leaves/1 child
mut children: Vec<Arc<dyn ExecutionPlan>>,
) -> Result<Arc<dyn ExecutionPlan>> {
let exec_node: Arc<dyn ExecutionPlan> = match node {
// Leaves (no children)
LogicalPlan::TableScan(TableScan {
source,
projection,
filters,
fetch,
..
}) => {
let source = source_as_provider(source)?;
Contributor Author

This is largely unchanged, except for some minor changes to accommodate passing in the children, and also for joins (see next note). Can extract to separate PR to make things cleaner?

Comment on lines 1109 to 1149
let (new_logical, physical_left, physical_right) = if has_expr_join_key {
// TODO: Can we extract this transformation to somewhere before physical plan
// creation?
let (left_keys, right_keys): (Vec<_>, Vec<_>) =
keys.iter().cloned().unzip();

let (left, left_col_keys, left_projected) =
wrap_projection_for_join_if_necessary(
&left_keys,
left.as_ref().clone(),
)?;
let (right, right_col_keys, right_projected) =
wrap_projection_for_join_if_necessary(
&right_keys,
right.as_ref().clone(),
)?;
let column_on = (left_col_keys, right_col_keys);

let left = Arc::new(left);
let right = Arc::new(right);
let new_join = LogicalPlan::Join(Join::try_new_with_project_input(
node,
left.clone(),
right.clone(),
column_on,
)?);

Ok(Arc::new(UnionExec::new(physical_plans)))
}
LogicalPlan::Repartition(Repartition {
input,
partitioning_scheme,
}) => {
let physical_input = self.create_initial_plan(input, session_state).await?;
let input_dfschema = input.as_ref().schema();
let physical_partitioning = match partitioning_scheme {
LogicalPartitioning::RoundRobinBatch(n) => {
Partitioning::RoundRobinBatch(*n)
}
LogicalPartitioning::Hash(expr, n) => {
let runtime_expr = expr
.iter()
.map(|e| {
self.create_physical_expr(
e,
input_dfschema,
session_state,
)
})
.collect::<Result<Vec<_>>>()?;
Partitioning::Hash(runtime_expr, *n)
}
LogicalPartitioning::DistributeBy(_) => {
return not_impl_err!("Physical plan does not support DistributeBy partitioning");
}
// If inputs were projected then create ExecutionPlan for these new
// LogicalPlan nodes.
let physical_left = match (left_projected, left.as_ref()) {
// If left_projected is true we are guaranteed that left is a Projection
(
true,
LogicalPlan::Projection(Projection { input, expr, .. }),
) => self.create_project_physical_exec(
session_state,
physical_left,
input,
expr,
)?,
_ => physical_left,
Contributor Author

This is kinda nasty, as I mentioned here #9573 (comment)

Maybe can split this off too?

Contributor

I agree it is nasty and splitting it off, or moving it to some other part of the code I think sounds like a good idea to me

Perhaps as a follow on PR

Comment on lines +2145 to +2151
fn create_project_physical_exec(
&self,
session_state: &SessionState,
input_exec: Arc<dyn ExecutionPlan>,
input: &Arc<LogicalPlan>,
expr: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>> {
Contributor Author

Just extracted this into separate function as it was also used by Join to create the physical projections that are added during this planning if join has expression equijoin keys.

@Jefffrey Jefffrey marked this pull request as ready for review April 10, 2024 09:46
@Jefffrey
Contributor Author

/benchmark


Benchmark results

Benchmarks comparing 75c399c (main) and cf594d6 (PR)
Comparing 75c399c and cf594d6
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  75c399c ┃  cf594d6 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 290.13ms │ 291.90ms │     no change │
│ QQuery 2     │  55.12ms │  40.17ms │ +1.37x faster │
│ QQuery 3     │  77.24ms │  60.66ms │ +1.27x faster │
│ QQuery 4     │  75.90ms │  80.41ms │  1.06x slower │
│ QQuery 5     │  97.49ms │  99.94ms │     no change │
│ QQuery 6     │  15.88ms │  16.44ms │     no change │
│ QQuery 7     │ 225.02ms │ 243.51ms │  1.08x slower │
│ QQuery 8     │  42.52ms │  43.77ms │     no change │
│ QQuery 9     │ 119.72ms │ 120.64ms │     no change │
│ QQuery 10    │ 108.92ms │ 111.08ms │     no change │
│ QQuery 11    │  48.16ms │  45.14ms │ +1.07x faster │
│ QQuery 12    │  59.36ms │  59.55ms │     no change │
│ QQuery 13    │ 106.10ms │ 109.99ms │     no change │
│ QQuery 14    │  19.08ms │  19.34ms │     no change │
│ QQuery 15    │  32.15ms │  32.63ms │     no change │
│ QQuery 16    │  46.43ms │  48.67ms │     no change │
│ QQuery 17    │ 147.67ms │ 143.33ms │     no change │
│ QQuery 18    │ 549.85ms │ 548.61ms │     no change │
│ QQuery 19    │  63.05ms │  62.36ms │     no change │
│ QQuery 20    │ 115.85ms │ 117.12ms │     no change │
│ QQuery 21    │ 325.90ms │ 334.81ms │     no change │
│ QQuery 22    │  39.55ms │  39.74ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (75c399c)   │ 2661.08ms │
│ Total Time (cf594d6)   │ 2669.79ms │
│ Average Time (75c399c) │  120.96ms │
│ Average Time (cf594d6) │  121.35ms │
│ Queries Faster         │         3 │
│ Queries Slower         │         2 │
│ Queries with No Change │        17 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃  75c399c ┃  cf594d6 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 431.54ms │ 430.70ms │ no change │
│ QQuery 2     │  57.34ms │  55.70ms │ no change │
│ QQuery 3     │ 142.14ms │ 142.24ms │ no change │
│ QQuery 4     │  86.75ms │  90.49ms │ no change │
│ QQuery 5     │ 197.71ms │ 195.88ms │ no change │
│ QQuery 6     │ 107.82ms │ 104.88ms │ no change │
│ QQuery 7     │ 277.98ms │ 291.83ms │ no change │
│ QQuery 8     │ 189.39ms │ 191.19ms │ no change │
│ QQuery 9     │ 284.62ms │ 290.07ms │ no change │
│ QQuery 10    │ 233.96ms │ 237.63ms │ no change │
│ QQuery 11    │  63.13ms │  61.66ms │ no change │
│ QQuery 12    │ 127.59ms │ 123.42ms │ no change │
│ QQuery 13    │ 176.08ms │ 175.10ms │ no change │
│ QQuery 14    │ 127.31ms │ 125.96ms │ no change │
│ QQuery 15    │ 191.92ms │ 187.14ms │ no change │
│ QQuery 16    │  49.93ms │  50.15ms │ no change │
│ QQuery 17    │ 298.46ms │ 306.25ms │ no change │
│ QQuery 18    │ 439.63ms │ 444.79ms │ no change │
│ QQuery 19    │ 228.68ms │ 225.43ms │ no change │
│ QQuery 20    │ 184.67ms │ 191.37ms │ no change │
│ QQuery 21    │ 321.61ms │ 326.60ms │ no change │
│ QQuery 22    │  53.00ms │  54.74ms │ no change │
└──────────────┴──────────┴──────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (75c399c)   │ 4271.27ms │
│ Total Time (cf594d6)   │ 4303.19ms │
│ Average Time (75c399c) │  194.15ms │
│ Average Time (cf594d6) │  195.60ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         0 │
│ Queries with No Change │        22 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃   75c399c ┃   cf594d6 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 4244.49ms │ 4215.25ms │ no change │
│ QQuery 2     │  497.30ms │  480.93ms │ no change │
│ QQuery 3     │ 1685.44ms │ 1688.14ms │ no change │
│ QQuery 4     │  799.86ms │  811.48ms │ no change │
│ QQuery 5     │ 2139.34ms │ 2144.36ms │ no change │
│ QQuery 6     │ 1035.63ms │ 1004.57ms │ no change │
│ QQuery 7     │ 3715.85ms │ 3725.97ms │ no change │
│ QQuery 8     │ 2435.42ms │ 2430.54ms │ no change │
│ QQuery 9     │ 4016.56ms │ 4121.28ms │ no change │
│ QQuery 10    │ 2517.71ms │ 2526.47ms │ no change │
│ QQuery 11    │  551.71ms │  576.77ms │ no change │
│ QQuery 12    │ 1184.91ms │ 1181.49ms │ no change │
│ QQuery 13    │ 2324.45ms │ 2345.50ms │ no change │
│ QQuery 14    │ 1269.15ms │ 1243.66ms │ no change │
│ QQuery 15    │ 1934.50ms │ 1935.37ms │ no change │
│ QQuery 16    │  507.48ms │  523.68ms │ no change │
│ QQuery 17    │ 5184.91ms │ 5277.34ms │ no change │
│ QQuery 18    │ 6991.12ms │ 6954.28ms │ no change │
│ QQuery 19    │ 2248.65ms │ 2167.78ms │ no change │
│ QQuery 20    │ 2571.93ms │ 2579.34ms │ no change │
│ QQuery 21    │ 4299.32ms │ 4303.13ms │ no change │
│ QQuery 22    │  549.24ms │  558.32ms │ no change │
└──────────────┴───────────┴───────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (75c399c)   │ 52704.97ms │
│ Total Time (cf594d6)   │ 52795.64ms │
│ Average Time (75c399c) │  2395.68ms │
│ Average Time (cf594d6) │  2399.80ms │
│ Queries Faster         │          0 │
│ Queries Slower         │          0 │
│ Queries with No Change │         22 │
└────────────────────────┴────────────┘

@Jefffrey
Contributor Author

/benchmark


Benchmark results

Benchmarks comparing 5820507 (main) and e1b41a9 (PR)
Comparing 5820507 and e1b41a9
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  5820507 ┃  e1b41a9 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 291.20ms │ 295.10ms │     no change │
│ QQuery 2     │  47.81ms │  40.41ms │ +1.18x faster │
│ QQuery 3     │  78.82ms │  60.90ms │ +1.29x faster │
│ QQuery 4     │  76.27ms │  82.08ms │  1.08x slower │
│ QQuery 5     │  98.83ms │ 102.17ms │     no change │
│ QQuery 6     │  16.09ms │  16.44ms │     no change │
│ QQuery 7     │ 222.53ms │ 235.73ms │  1.06x slower │
│ QQuery 8     │  45.30ms │  43.99ms │     no change │
│ QQuery 9     │ 125.86ms │ 125.48ms │     no change │
│ QQuery 10    │ 112.28ms │ 111.70ms │     no change │
│ QQuery 11    │  50.95ms │  45.56ms │ +1.12x faster │
│ QQuery 12    │  60.22ms │  60.31ms │     no change │
│ QQuery 13    │ 112.75ms │ 114.68ms │     no change │
│ QQuery 14    │  19.26ms │  19.33ms │     no change │
│ QQuery 15    │  32.64ms │  32.54ms │     no change │
│ QQuery 16    │  47.65ms │  48.12ms │     no change │
│ QQuery 17    │ 158.91ms │ 152.14ms │     no change │
│ QQuery 18    │ 542.91ms │ 546.27ms │     no change │
│ QQuery 19    │  63.86ms │  64.58ms │     no change │
│ QQuery 20    │ 120.62ms │ 114.47ms │ +1.05x faster │
│ QQuery 21    │ 367.16ms │ 341.01ms │ +1.08x faster │
│ QQuery 22    │  39.52ms │  40.91ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (5820507)   │ 2731.43ms │
│ Total Time (e1b41a9)   │ 2693.93ms │
│ Average Time (5820507) │  124.16ms │
│ Average Time (e1b41a9) │  122.45ms │
│ Queries Faster         │         5 │
│ Queries Slower         │         2 │
│ Queries with No Change │        15 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  5820507 ┃  e1b41a9 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 434.28ms │ 434.75ms │    no change │
│ QQuery 2     │  57.50ms │  55.26ms │    no change │
│ QQuery 3     │ 142.69ms │ 142.57ms │    no change │
│ QQuery 4     │  84.58ms │  89.46ms │ 1.06x slower │
│ QQuery 5     │ 199.65ms │ 197.32ms │    no change │
│ QQuery 6     │ 108.27ms │ 104.99ms │    no change │
│ QQuery 7     │ 277.36ms │ 286.99ms │    no change │
│ QQuery 8     │ 186.31ms │ 185.19ms │    no change │
│ QQuery 9     │ 289.12ms │ 290.84ms │    no change │
│ QQuery 10    │ 232.46ms │ 235.75ms │    no change │
│ QQuery 11    │  62.73ms │  62.77ms │    no change │
│ QQuery 12    │ 127.22ms │ 123.35ms │    no change │
│ QQuery 13    │ 182.56ms │ 182.26ms │    no change │
│ QQuery 14    │ 127.66ms │ 125.14ms │    no change │
│ QQuery 15    │ 190.02ms │ 185.64ms │    no change │
│ QQuery 16    │  50.30ms │  50.74ms │    no change │
│ QQuery 17    │ 307.19ms │ 302.97ms │    no change │
│ QQuery 18    │ 444.69ms │ 459.30ms │    no change │
│ QQuery 19    │ 232.08ms │ 228.43ms │    no change │
│ QQuery 20    │ 194.81ms │ 190.45ms │    no change │
│ QQuery 21    │ 321.39ms │ 319.09ms │    no change │
│ QQuery 22    │  54.14ms │  55.81ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (5820507)   │ 4307.00ms │
│ Total Time (e1b41a9)   │ 4309.07ms │
│ Average Time (5820507) │  195.77ms │
│ Average Time (e1b41a9) │  195.87ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         1 │
│ Queries with No Change │        21 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃   5820507 ┃   e1b41a9 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 4277.53ms │ 4216.69ms │ no change │
│ QQuery 2     │  495.74ms │  495.26ms │ no change │
│ QQuery 3     │ 1711.66ms │ 1695.09ms │ no change │
│ QQuery 4     │  817.33ms │  819.14ms │ no change │
│ QQuery 5     │ 2182.53ms │ 2177.98ms │ no change │
│ QQuery 6     │ 1042.71ms │ 1010.44ms │ no change │
│ QQuery 7     │ 3610.39ms │ 3597.45ms │ no change │
│ QQuery 8     │ 2460.37ms │ 2456.10ms │ no change │
│ QQuery 9     │ 4030.72ms │ 4116.84ms │ no change │
│ QQuery 10    │ 2545.66ms │ 2523.34ms │ no change │
│ QQuery 11    │  563.06ms │  569.73ms │ no change │
│ QQuery 12    │ 1201.38ms │ 1181.40ms │ no change │
│ QQuery 13    │ 2295.25ms │ 2306.17ms │ no change │
│ QQuery 14    │ 1276.44ms │ 1262.88ms │ no change │
│ QQuery 15    │ 1959.77ms │ 1898.07ms │ no change │
│ QQuery 16    │  518.82ms │  509.16ms │ no change │
│ QQuery 17    │ 5182.47ms │ 5111.99ms │ no change │
│ QQuery 18    │ 6711.35ms │ 6733.76ms │ no change │
│ QQuery 19    │ 2270.26ms │ 2192.30ms │ no change │
│ QQuery 20    │ 2522.52ms │ 2509.84ms │ no change │
│ QQuery 21    │ 4312.16ms │ 4288.91ms │ no change │
│ QQuery 22    │  550.38ms │  547.19ms │ no change │
└──────────────┴───────────┴───────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (5820507)   │ 52538.50ms │
│ Total Time (e1b41a9)   │ 52219.73ms │
│ Average Time (5820507) │  2388.11ms │
│ Average Time (e1b41a9) │  2373.62ms │
│ Queries Faster         │          0 │
│ Queries Slower         │          0 │
│ Queries with No Change │         22 │
└────────────────────────┴────────────┘

@alamb
Contributor

alamb commented Apr 10, 2024

This looks pretty awesome @Jefffrey -- thank you. I hope to review it later today but it will likely be tomorrow

Contributor

@alamb alamb left a comment

Thank you @Jefffrey -- I went through this PR carefully and it is amazing -- both a joy to read and beautifully structured. Thank you for this contribution 🙏 . A very fine piece of engineering work indeed 🏆

I think this is a really cool design, and it would be awesome to add a high-level overview (maybe as comments on pub struct DefaultPhysicalPlanner) explaining the process. The comments in this PR already do a great job explaining the implementation; I was just thinking of something slightly higher level for people who aren't going to read the implementation -- like that the plan is converted in parallel, based on the configuration, etc. This could be done as a follow-on PR or never.

I ran the sql_planner benchmarks and they show this PR seems to improve planning speed slightly. Very cool

cargo bench --bench sql_planner
++ critcmp main refactor_create_initial_plan
group                                         main                                   refactor_create_initial_plan
-----                                         ----                                   ----------------------------
logical_aggregate_with_join                   1.00  1184.2±11.71µs        ? ?/sec    1.00  1187.6±17.96µs        ? ?/sec
logical_plan_tpcds_all                        1.01    155.4±0.74ms        ? ?/sec    1.00    154.2±0.77ms        ? ?/sec
logical_plan_tpch_all                         1.01     16.7±0.19ms        ? ?/sec    1.00     16.5±0.16ms        ? ?/sec
logical_select_all_from_1000                  1.05     19.4±0.10ms        ? ?/sec    1.00     18.4±0.08ms        ? ?/sec
logical_select_one_from_700                   1.00   782.8±18.46µs        ? ?/sec    1.00    786.5±6.83µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    727.1±6.67µs        ? ?/sec    1.01    731.6±6.56µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    715.6±7.55µs        ? ?/sec    1.01   719.8±22.84µs        ? ?/sec
physical_plan_tpcds_all                       1.02   1845.7±5.15ms        ? ?/sec    1.00   1809.8±5.84ms        ? ?/sec
physical_plan_tpch_all                        1.02    119.7±0.76ms        ? ?/sec    1.00    117.8±0.54ms        ? ?/sec
physical_plan_tpch_q1                         1.02      7.3±0.05ms        ? ?/sec    1.00      7.1±0.05ms        ? ?/sec
physical_plan_tpch_q10                        1.02      5.5±0.03ms        ? ?/sec    1.00      5.4±0.02ms        ? ?/sec
physical_plan_tpch_q11                        1.02      4.8±0.03ms        ? ?/sec    1.00      4.8±0.11ms        ? ?/sec
physical_plan_tpch_q12                        1.02      3.9±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.02      2.6±0.03ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.02      3.3±0.02ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpch_q16                        1.01      4.9±0.03ms        ? ?/sec    1.00      4.8±0.02ms        ? ?/sec
physical_plan_tpch_q17                        1.01      4.6±0.03ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
physical_plan_tpch_q18                        1.02      5.0±0.04ms        ? ?/sec    1.00      4.9±0.04ms        ? ?/sec
physical_plan_tpch_q19                        1.01      9.4±0.05ms        ? ?/sec    1.00      9.4±0.05ms        ? ?/sec
physical_plan_tpch_q2                         1.02     10.6±0.08ms        ? ?/sec    1.00     10.4±0.06ms        ? ?/sec
physical_plan_tpch_q20                        1.01      6.1±0.04ms        ? ?/sec    1.00      6.0±0.03ms        ? ?/sec
physical_plan_tpch_q21                        1.02      8.3±0.05ms        ? ?/sec    1.00      8.1±0.05ms        ? ?/sec
physical_plan_tpch_q22                        1.03      4.4±0.02ms        ? ?/sec    1.00      4.3±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.02      3.9±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.03      2.9±0.02ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.01      5.6±0.03ms        ? ?/sec    1.00      5.5±0.03ms        ? ?/sec
physical_plan_tpch_q6                         1.02  1995.8±10.47µs        ? ?/sec    1.00  1952.7±14.95µs        ? ?/sec
physical_plan_tpch_q7                         1.02      7.5±0.05ms        ? ?/sec    1.00      7.4±0.05ms        ? ?/sec
physical_plan_tpch_q8                         1.01      9.5±0.07ms        ? ?/sec    1.00      9.4±0.06ms        ? ?/sec
physical_plan_tpch_q9                         1.01      7.2±0.03ms        ? ?/sec    1.00      7.1±0.05ms        ? ?/sec
physical_select_all_from_1000                 1.05    128.5±0.62ms        ? ?/sec    1.00    122.4±0.46ms        ? ?/sec
physical_select_one_from_700                  1.00      4.0±0.02ms        ? ?/sec    1.00      4.0±0.02ms        ? ?/sec

cc @crepererum

Note to other reviewers: I found whitespace blind diff made the changes easier to review: https://github.com/apache/arrow-datafusion/pull/10023/files?w=1


}

/// Create a physical plan from a logical plan
fn create_initial_plan<'a>(
/// These tasks start at a leaf and traverse up the tree towards the root, building
Contributor

this is very clever

let mut guard = children.lock().unwrap();
// Safe unwrap on option as only the last task reaching this
// node will take the contents (which happens after this line).
let children = guard.as_mut().unwrap();
Contributor

Maybe we could return an internal error instead of panic if for some reason the option was None

// all children.
//
// This take is the only place the Option becomes None.
guard.take().unwrap()
Contributor

likewise it would be sweet if this was an internal error rather than panic. but I don't think it is necessary, just a suggestion


}

/// Given a single LogicalPlan node, map it to its physical ExecutionPlan counterpart.
async fn map_logical_node_to_physical(
Contributor

As I understand it, the main reason this structure reduces stack space is that this function requires a non-trivial stack frame, but now instead of recursively calling itself (which puts many such frames on the stack at once) it is driven iteratively (basically pushing the results onto a Vec)

Or put another way, map_logical_node_to_physical never calls map_logical_node_to_physical
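The distinction above can be shown with a toy example unrelated to the actual planner code: the recursive evaluator allocates one stack frame per tree level, while the iterative one keeps its pending work in a heap-allocated Vec, so its stack usage stays constant regardless of depth.

```rust
// A toy expression tree standing in for a plan tree.
enum Expr {
    Leaf(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Recursive: stack usage grows with tree depth; deep trees overflow.
fn eval_recursive(e: &Expr) -> i64 {
    match e {
        Expr::Leaf(v) => *v,
        Expr::Add(l, r) => eval_recursive(l) + eval_recursive(r),
    }
}

// Iterative: an explicit work stack on the heap replaces the call stack.
fn eval_iterative(root: &Expr) -> i64 {
    let mut work: Vec<(&Expr, bool)> = vec![(root, false)];
    let mut values: Vec<i64> = Vec::new();
    while let Some((node, children_done)) = work.pop() {
        match node {
            Expr::Leaf(v) => values.push(*v),
            Expr::Add(l, r) => {
                if children_done {
                    // Both children already evaluated: combine their results.
                    let right = values.pop().unwrap();
                    let left = values.pop().unwrap();
                    values.push(left + right);
                } else {
                    // Revisit this node after its children are done.
                    work.push((node, true));
                    work.push((r.as_ref(), false));
                    work.push((l.as_ref(), false));
                }
            }
        }
    }
    values.pop().unwrap()
}

fn main() {
    let small = Expr::Add(Box::new(Expr::Leaf(2)), Box::new(Expr::Leaf(3)));
    assert_eq!(eval_recursive(&small), 5);
    assert_eq!(eval_iterative(&small), 5);
    // A chain 10_000 levels deep is no problem for the iterative version.
    let deep = (0..10_000).fold(Expr::Leaf(0), |acc, _| {
        Expr::Add(Box::new(acc), Box::new(Expr::Leaf(1)))
    });
    assert_eq!(eval_iterative(&deep), 10_000);
}
```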

}

// 1 Child
LogicalPlan::Copy(CopyTo {
Contributor

As some follow on PR, we could refactor this logic out of one giant match statement into functions like

match plan { 
...
    LogicalPlan::Copy(copy_to) => copy_to_physical(copy_to),
...
}

But after this PR that refactoring seems like it would mostly improve readability rather than any stack usage

@Jefffrey
Contributor Author

Thanks for the review. I will definitely add that higher level comment and clean up some of the unwraps for sure, though I am travelling until Monday 👍

@metesynnada
Contributor

metesynnada commented Apr 15, 2024

I will also review this after you finish @alamb's review. Thanks for working on the issue.

Can you also enable tpcds_physical_q64 in this PR? I think stack overflow won't be an issue for this test anymore.

@Jefffrey
Contributor Author

I've cleaned up the unwraps and added the high level doc

Can you also enable tpcds_physical_q64 in this PR? I think stack overflow won't be an issue for this test anymore.

Uncommented, seems like it succeeds now (at least locally) 👍

- Use tokio::Mutex in async environment
- Remove Option from enum, since it is only used for taking.
@metesynnada
Contributor

I've just sent a small commit to lighten your load, considering the effort you've already put in.

  • Used tokio::Mutex instead of std for smoother operation in an async environment.
  • Changed the enum by removing Option, as it's solely utilized for taking the inner vector.

Your pull request appears flawless. Much appreciated for all the hard work.

@alamb
Contributor

alamb commented Apr 16, 2024

c6eaf74 looks good to me (it makes sense to avoid the use of Option)

@alamb
Contributor

alamb commented Apr 16, 2024

🚀

@alamb alamb merged commit b54adb3 into apache:main Apr 16, 2024
24 checks passed
@alamb
Contributor

alamb commented Apr 16, 2024

Thanks again @Jefffrey -- this is epic. Also thanks @metesynnada for the review and improvement

Labels
core Core DataFusion crate
Development

Successfully merging this pull request may close these issues.

Improve create_initial_plan to get rid of "stack overflow" issues on complex queries
3 participants