Refactor physical create_initial_plan to iteratively & concurrently construct plan from the bottom up #10023
Conversation
Keen to get some thoughts on this. Even if we decide not to go ahead with this, it was a fun exercise to try out 🙂
```rust
let planning_concurrency = session_state
    .config_options()
    .execution
    .planning_concurrency;
// Can never spawn more tasks than leaves in the tree, as these tasks must
// all converge down to the root node, which can only be processed by a
// single task.
let max_concurrency = planning_concurrency.min(flat_tree_leaf_indices.len());
```
I think this may be more accurate than what is currently present: the existing `create_initial_plan_multi` can be called multiple times, and `planning_concurrency` is only enforced within each call, so across multiple calls it can spawn more tasks than `planning_concurrency` configures. Maybe that is intended?
Either way, with this new code, it will actually limit how many tasks are building the tree for the entire initial planning process.
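To make the cap concrete, here is a minimal, self-contained illustration with invented numbers (the variable names mirror the quoted snippet, but the values are assumptions for the example only):

```rust
fn main() {
    // Hypothetical values: a configured planning_concurrency of 16, but a plan
    // with only 3 leaves. At most 3 tasks can ever make progress, because tasks
    // only merge as they climb towards the single root node.
    let planning_concurrency = 16usize;
    let flat_tree_leaf_indices = vec![0usize, 4, 7]; // invented leaf positions
    let max_concurrency = planning_concurrency.min(flat_tree_leaf_indices.len());
    assert_eq!(max_concurrency, 3);
    println!("spawning at most {max_concurrency} planning tasks");
}
```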
I don't think the original behavior is intended. Your change makes sense to me
```rust
/// Given a single LogicalPlan node, map it to its physical ExecutionPlan counterpart.
async fn map_logical_node_to_physical(
    &self,
    node: &LogicalPlan,
    session_state: &SessionState,
    // TODO: refactor to not use Vec? Wasted for leaves/1 child
    mut children: Vec<Arc<dyn ExecutionPlan>>,
) -> Result<Arc<dyn ExecutionPlan>> {
    let exec_node: Arc<dyn ExecutionPlan> = match node {
        // Leaves (no children)
        LogicalPlan::TableScan(TableScan {
            source,
            projection,
            filters,
            fetch,
            ..
        }) => {
            let source = source_as_provider(source)?;
```
This is largely unchanged, aside from some minor changes to accommodate passing in the children, and for joins (see the next note). I can extract this into a separate PR to make things cleaner?
```rust
let (new_logical, physical_left, physical_right) = if has_expr_join_key {
    // TODO: Can we extract this transformation to somewhere before physical plan
    // creation?
    let (left_keys, right_keys): (Vec<_>, Vec<_>) =
        keys.iter().cloned().unzip();

    let (left, left_col_keys, left_projected) =
        wrap_projection_for_join_if_necessary(
            &left_keys,
            left.as_ref().clone(),
        )?;
    let (right, right_col_keys, right_projected) =
        wrap_projection_for_join_if_necessary(
            &right_keys,
            right.as_ref().clone(),
        )?;
    let column_on = (left_col_keys, right_col_keys);

    let left = Arc::new(left);
    let right = Arc::new(right);
    let new_join = LogicalPlan::Join(Join::try_new_with_project_input(
        node,
        left.clone(),
        right.clone(),
        column_on,
    )?);

    Ok(Arc::new(UnionExec::new(physical_plans)))
}
LogicalPlan::Repartition(Repartition {
    input,
    partitioning_scheme,
}) => {
    let physical_input = self.create_initial_plan(input, session_state).await?;
    let input_dfschema = input.as_ref().schema();
    let physical_partitioning = match partitioning_scheme {
        LogicalPartitioning::RoundRobinBatch(n) => {
            Partitioning::RoundRobinBatch(*n)
        }
        LogicalPartitioning::Hash(expr, n) => {
            let runtime_expr = expr
                .iter()
                .map(|e| {
                    self.create_physical_expr(
                        e,
                        input_dfschema,
                        session_state,
                    )
                })
                .collect::<Result<Vec<_>>>()?;
            Partitioning::Hash(runtime_expr, *n)
        }
        LogicalPartitioning::DistributeBy(_) => {
            return not_impl_err!("Physical plan does not support DistributeBy partitioning");
        }
    // If inputs were projected then create ExecutionPlan for these new
    // LogicalPlan nodes.
    let physical_left = match (left_projected, left.as_ref()) {
        // If left_projected is true we are guaranteed that left is a Projection
        (
            true,
            LogicalPlan::Projection(Projection { input, expr, .. }),
        ) => self.create_project_physical_exec(
            session_state,
            physical_left,
            input,
            expr,
        )?,
        _ => physical_left,
```
This is kind of nasty, as I mentioned in #9573 (comment).
Maybe this can be split off too?
I agree it is nasty; splitting it off, or moving it to some other part of the code, sounds like a good idea to me.
Perhaps as a follow-on PR.
```rust
fn create_project_physical_exec(
    &self,
    session_state: &SessionState,
    input_exec: Arc<dyn ExecutionPlan>,
    input: &Arc<LogicalPlan>,
    expr: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>> {
```
Just extracted this into a separate function, as it is also used by `Join` to create the physical projections that are added during this planning when the join has expression equijoin keys.
/benchmark
Benchmark results: comparing 75c399c (main) and cf594d6 (PR)
/benchmark
Benchmark results: comparing 5820507 (main) and e1b41a9 (PR)
This looks pretty awesome @Jefffrey -- thank you. I hope to review it later today but it will likely be tomorrow
Thank you @Jefffrey -- I went through this PR carefully and it is amazing -- both a joy to read and beautifully structured. Thank you for this contribution 🙏 . A very fine piece of engineering work indeed 🏆
I think this is a really cool design and it would be awesome to add a high-level overview (maybe as comments on `pub struct DefaultPhysicalPlanner`) explaining the process. The comments in this PR already do a great job of explaining the implementation; I was just thinking of something slightly higher level for people who aren't going to read the implementation -- like the fact that the plan is converted in parallel, based on the configuration, etc. This could be done as a follow-on PR or never.
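For illustration, such an overview might look roughly like the following; the wording below is a hypothetical sketch, not the comment that was actually added to the code:

```rust
// Hypothetical sketch of a high-level overview doc comment (not DataFusion's
// actual documentation):

/// Default physical planner: converts a `LogicalPlan` into an `ExecutionPlan`.
///
/// Planning proceeds in two broad phases:
/// 1. The logical tree is flattened into a list by an iterative depth-first
///    traversal, so arbitrarily deep plans do not overflow the stack.
/// 2. Worker tasks, bounded by the `planning_concurrency` configuration option
///    (and by the number of leaves), start at the leaves and build the physical
///    tree bottom-up in parallel; the last task to reach a multi-child node
///    constructs that node and keeps climbing towards the root.
pub struct DefaultPhysicalPlanner {
    // fields elided in this sketch
}
```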
I ran the `sql_planner` benchmarks and they show this PR seems to improve planning speed slightly. Very cool
cargo bench --bench sql_planner
++ critcmp main refactor_create_initial_plan
group main refactor_create_initial_plan
----- ---- ----------------------------
logical_aggregate_with_join 1.00 1184.2±11.71µs ? ?/sec 1.00 1187.6±17.96µs ? ?/sec
logical_plan_tpcds_all 1.01 155.4±0.74ms ? ?/sec 1.00 154.2±0.77ms ? ?/sec
logical_plan_tpch_all 1.01 16.7±0.19ms ? ?/sec 1.00 16.5±0.16ms ? ?/sec
logical_select_all_from_1000 1.05 19.4±0.10ms ? ?/sec 1.00 18.4±0.08ms ? ?/sec
logical_select_one_from_700 1.00 782.8±18.46µs ? ?/sec 1.00 786.5±6.83µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 727.1±6.67µs ? ?/sec 1.01 731.6±6.56µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 715.6±7.55µs ? ?/sec 1.01 719.8±22.84µs ? ?/sec
physical_plan_tpcds_all 1.02 1845.7±5.15ms ? ?/sec 1.00 1809.8±5.84ms ? ?/sec
physical_plan_tpch_all 1.02 119.7±0.76ms ? ?/sec 1.00 117.8±0.54ms ? ?/sec
physical_plan_tpch_q1 1.02 7.3±0.05ms ? ?/sec 1.00 7.1±0.05ms ? ?/sec
physical_plan_tpch_q10 1.02 5.5±0.03ms ? ?/sec 1.00 5.4±0.02ms ? ?/sec
physical_plan_tpch_q11 1.02 4.8±0.03ms ? ?/sec 1.00 4.8±0.11ms ? ?/sec
physical_plan_tpch_q12 1.02 3.9±0.02ms ? ?/sec 1.00 3.8±0.02ms ? ?/sec
physical_plan_tpch_q13 1.02 2.6±0.03ms ? ?/sec 1.00 2.6±0.02ms ? ?/sec
physical_plan_tpch_q14 1.02 3.3±0.02ms ? ?/sec 1.00 3.3±0.02ms ? ?/sec
physical_plan_tpch_q16 1.01 4.9±0.03ms ? ?/sec 1.00 4.8±0.02ms ? ?/sec
physical_plan_tpch_q17 1.01 4.6±0.03ms ? ?/sec 1.00 4.6±0.02ms ? ?/sec
physical_plan_tpch_q18 1.02 5.0±0.04ms ? ?/sec 1.00 4.9±0.04ms ? ?/sec
physical_plan_tpch_q19 1.01 9.4±0.05ms ? ?/sec 1.00 9.4±0.05ms ? ?/sec
physical_plan_tpch_q2 1.02 10.6±0.08ms ? ?/sec 1.00 10.4±0.06ms ? ?/sec
physical_plan_tpch_q20 1.01 6.1±0.04ms ? ?/sec 1.00 6.0±0.03ms ? ?/sec
physical_plan_tpch_q21 1.02 8.3±0.05ms ? ?/sec 1.00 8.1±0.05ms ? ?/sec
physical_plan_tpch_q22 1.03 4.4±0.02ms ? ?/sec 1.00 4.3±0.03ms ? ?/sec
physical_plan_tpch_q3 1.02 3.9±0.02ms ? ?/sec 1.00 3.8±0.02ms ? ?/sec
physical_plan_tpch_q4 1.03 2.9±0.02ms ? ?/sec 1.00 2.8±0.02ms ? ?/sec
physical_plan_tpch_q5 1.01 5.6±0.03ms ? ?/sec 1.00 5.5±0.03ms ? ?/sec
physical_plan_tpch_q6 1.02 1995.8±10.47µs ? ?/sec 1.00 1952.7±14.95µs ? ?/sec
physical_plan_tpch_q7 1.02 7.5±0.05ms ? ?/sec 1.00 7.4±0.05ms ? ?/sec
physical_plan_tpch_q8 1.01 9.5±0.07ms ? ?/sec 1.00 9.4±0.06ms ? ?/sec
physical_plan_tpch_q9 1.01 7.2±0.03ms ? ?/sec 1.00 7.1±0.05ms ? ?/sec
physical_select_all_from_1000 1.05 128.5±0.62ms ? ?/sec 1.00 122.4±0.46ms ? ?/sec
physical_select_one_from_700 1.00 4.0±0.02ms ? ?/sec 1.00 4.0±0.02ms ? ?/sec
cc @crepererum
Note to other reviewers: I found whitespace blind diff made the changes easier to review: https://github.com/apache/arrow-datafusion/pull/10023/files?w=1
```rust
}

/// Create a physical plan from a logical plan
fn create_initial_plan<'a>(
/// These tasks start at a leaf and traverse up the tree towards the root, building
```
this is very clever
```rust
let mut guard = children.lock().unwrap();
// Safe unwrap on option as only the last task reaching this
// node will take the contents (which happens after this line).
let children = guard.as_mut().unwrap();
```
Maybe we could return an internal error instead of a panic, in case the option is `None` for some reason.
```rust
// all children.
//
// This take is the only place the Option becomes None.
guard.take().unwrap()
```
Likewise, it would be sweet if this were an internal error rather than a panic, but I don't think it is necessary -- just a suggestion.
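A minimal sketch of what that could look like, with the element type simplified to strings and using `DataFusionError::Internal` directly (the actual follow-up may word this differently):

```rust
use std::sync::Mutex;

use datafusion_common::{DataFusionError, Result};

// Simplified stand-in for the per-node children slot; the real code stores
// Arc<dyn ExecutionPlan> values rather than strings.
fn take_children(children: &Mutex<Option<Vec<String>>>) -> Result<Vec<String>> {
    let mut guard = children
        .lock()
        .map_err(|_| DataFusionError::Internal("children mutex poisoned".to_string()))?;
    // Instead of `guard.take().unwrap()`, surface an internal error if the slot
    // was unexpectedly empty (i.e. already taken by another task).
    guard.take().ok_or_else(|| {
        DataFusionError::Internal(
            "children were already taken while building the physical plan".to_string(),
        )
    })
}

fn main() -> Result<()> {
    let slot = Mutex::new(Some(vec!["scan_a".to_string(), "scan_b".to_string()]));
    println!("{:?}", take_children(&slot)?);
    // A second take now reports an internal error instead of panicking.
    assert!(take_children(&slot).is_err());
    Ok(())
}
```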
```rust
}

/// Given a single LogicalPlan node, map it to its physical ExecutionPlan counterpart.
async fn map_logical_node_to_physical(
```
As I understand it, the main reason this structure reduces stack space is that this function requires a non-trivial stack frame, but now, instead of recursively calling itself (which results in many such frames on the stack), it is driven iteratively (basically pushing the results onto a `Vec`).
Or put another way, `map_logical_node_to_physical` never calls `map_logical_node_to_physical`.
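A toy illustration of that shape change; the types and names below are invented for the example, not DataFusion's real ones:

```rust
// Toy stand-ins for the logical and physical plan trees.
struct Logical {
    name: String,
    children: Vec<Logical>,
}
struct Physical {
    name: String,
    children: Vec<Physical>,
}

// Old shape: the converter calls itself once per tree level, so a deep plan
// means a deep stack.
fn to_physical_recursive(node: &Logical) -> Physical {
    let children = node.children.iter().map(to_physical_recursive).collect();
    Physical { name: node.name.clone(), children }
}

// New shape: children are produced beforehand (driven by the flattened,
// iterative traversal) and passed in, so this function never re-enters itself
// and stack depth stays constant regardless of plan depth.
fn map_node(node: &Logical, children: Vec<Physical>) -> Physical {
    Physical { name: node.name.clone(), children }
}

fn main() {
    let scan = Logical { name: "scan".into(), children: vec![] };
    let plan = Logical { name: "filter".into(), children: vec![scan] };
    // Both shapes produce the same tree; only the stack behaviour differs.
    let a = to_physical_recursive(&plan);
    let b = map_node(&plan, vec![map_node(&plan.children[0], vec![])]);
    assert_eq!(a.name, b.name);
    assert_eq!(a.children.len(), b.children.len());
}
```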
```rust
}

// 1 Child
LogicalPlan::Copy(CopyTo {
```
As some follow-on PR, we could refactor this logic out of the one giant match statement into functions like:

```rust
match plan {
    ...
    LogicalPlan::Copy(copy_to) => copy_to_physical(copy_to),
    ...
}
```

But after this PR, that refactoring seems like it would mostly improve readability rather than stack usage.
Thanks for the review. I will definitely add that higher level comment and clean up some of the unwraps for sure, though I am travelling until Monday 👍
I will also review this after you finish addressing @alamb's review. Thanks for working on the issue. Can you also enable
I've cleaned up the unwraps and added the high-level doc.
Uncommented; seems like it succeeds now (at least locally) 👍
- Use `tokio::Mutex` in async environment
- Remove `Option` from enum, since it is only used for taking.
I've just sent a small commit to lighten your load, considering the effort you've already put in.
Your pull request appears flawless. Much appreciated for all the hard work.
c6eaf74 looks good to me (it makes sense to avoid the use of
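For context, a hedged sketch of the pattern that commit moves towards (simplified element type, and assuming the `tokio` crate with its sync and runtime features): holding a `std::sync::Mutex` guard across an `.await` is problematic in async code, whereas `tokio::sync::Mutex` is designed for it.

```rust
use std::sync::Arc;

use tokio::sync::Mutex;

// Simplified stand-in for the shared per-node children buffer; the real code
// stores Arc<dyn ExecutionPlan> values rather than strings.
type ChildrenSlot = Arc<Mutex<Vec<String>>>;

async fn append_child(slot: &ChildrenSlot, child: String) {
    // Locking awaits instead of blocking the executor thread, and the guard may
    // safely live across later await points if that ever becomes necessary.
    let mut children = slot.lock().await;
    children.push(child);
}

#[tokio::main]
async fn main() {
    let slot: ChildrenSlot = Arc::new(Mutex::new(Vec::new()));
    append_child(&slot, "child plan".to_string()).await;
    let contents = slot.lock().await;
    println!("{:?}", *contents);
}
```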
🚀
Thanks again @Jefffrey -- this is epic. Also thanks @metesynnada for the review and improvement
Which issue does this PR close?
Closes #9573
Rationale for this change
Rather than recursively constructing the initial physical plan from the top down, which can lead to stack overflow errors, iteratively construct the plan from the bottom up, which is also done concurrently.
These were my previous considerations that led to this design:

- `ExecutionPlan`s need to know their children at construction time, unless we insert dummy children (like `EmptyRelation`, which still might cause other issues) or we introduce an intermediate representation between `LogicalPlan` and `ExecutionPlan` to allow incomplete nodes that can populate their children after construction time (seems like more work)
- `Vec`: since we decided to construct bottom up, children need a way to know their parents, and a `Vec` is easier than trying to add `Arc`s to parents

What changes are included in this PR?
Split up `create_initial_plan` to do the mapping from `LogicalPlan` to `Arc<dyn ExecutionPlan>` in a separate function, for organization.

In `create_initial_plan` we first DFS the tree to get a flat `Vec` representation, which stores `&LogicalPlan` so the tree does not need to be duplicated (this traversal is iterative).

With the flat tree, we can spawn async tasks from the leaves, which attempt to build the individual branches of the tree from the bottom up, towards the root.

When these tasks encounter a node with 2 or more children, which represents a collision point with other tasks, they append their current tree branch to this parent node (whose already-built children branches sit behind a `Mutex<Vec<_>>` for concurrent safety) and then check if there are enough children to build the node. If not, the task terminates.

If there are enough children, the current task is the last one to reach the parent, so it constructs the parent and then continues to traverse up towards the root.

This continues until the number of tasks reduces to 1 and the last task emits the root of the tree. (A simplified sketch of this process is shown below.)
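A minimal, self-contained sketch of this strategy, using ordinary threads instead of Tokio tasks and invented toy types (this is not DataFusion's actual code, and it ignores details such as preserving child ordering and capping the number of workers by `planning_concurrency`):

```rust
use std::sync::Mutex;

// Toy stand-ins for LogicalPlan / ExecutionPlan; invented for this sketch.
#[derive(Debug)]
struct Logical { name: &'static str, children: Vec<Logical> }
#[derive(Debug)]
struct Physical { name: &'static str, children: Vec<Physical> }

// Flattened node: a reference into the logical tree, the index of its parent,
// and a slot holding (missing child count, finished child plans).
struct FlatNode<'a> {
    parent: Option<usize>,
    logical: &'a Logical,
    pending: Mutex<(usize, Vec<Physical>)>,
}

// Iterative DFS: flatten the tree and record the leaf indices. No recursion,
// so plan depth cannot overflow the stack.
fn flatten(root: &Logical) -> (Vec<FlatNode<'_>>, Vec<usize>) {
    let (mut nodes, mut leaves, mut stack) = (Vec::new(), Vec::new(), vec![(root, None)]);
    while let Some((logical, parent)) = stack.pop() {
        let idx = nodes.len();
        nodes.push(FlatNode {
            parent,
            logical,
            pending: Mutex::new((logical.children.len(), Vec::new())),
        });
        if logical.children.is_empty() {
            leaves.push(idx);
        }
        for child in &logical.children {
            stack.push((child, Some(idx)));
        }
    }
    (nodes, leaves)
}

// Each worker starts at a leaf and climbs towards the root. The last worker to
// deliver a child to a parent builds that parent and keeps climbing; the others
// stop. (Child ordering is ignored here for brevity.)
fn climb(nodes: &[FlatNode<'_>], mut idx: usize, mut plan: Physical) -> Option<Physical> {
    loop {
        let parent = match nodes[idx].parent {
            None => return Some(plan), // reached the root
            Some(p) => p,
        };
        let mut guard = nodes[parent].pending.lock().unwrap();
        guard.1.push(plan);
        guard.0 -= 1;
        if guard.0 > 0 {
            return None; // another worker will finish this parent later
        }
        let children = std::mem::take(&mut guard.1);
        drop(guard);
        plan = Physical { name: nodes[parent].logical.name, children };
        idx = parent;
    }
}

fn main() {
    let scan = |name| Logical { name, children: vec![] };
    let root = Logical { name: "join", children: vec![scan("scan_a"), scan("scan_b")] };
    let (nodes, leaves) = flatten(&root);
    let nodes = &nodes; // shared reference handed to every worker
    // One worker per leaf; the real code uses Tokio tasks and additionally caps
    // the number of workers by the planning_concurrency setting.
    let physical = std::thread::scope(|s| {
        let handles: Vec<_> = leaves
            .iter()
            .map(|&leaf| {
                let leaf_plan = Physical { name: nodes[leaf].logical.name, children: vec![] };
                s.spawn(move || climb(nodes, leaf, leaf_plan))
            })
            .collect();
        handles.into_iter().find_map(|h| h.join().unwrap())
    });
    println!("{physical:?}");
}
```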
Are these changes tested?
Are there any user-facing changes?
No