More scalar subqueries support #6372

mingmwang · 2023-05-17T06:56:01Z

Which issue does this PR close?

Partially Closes #3667
Closes #6370
Closes #6371.

Rationale for this change

What changes are included in this PR?

Support more types of scalar subqueries
Simplify the rewrite process

Are these changes tested?

Are there any user-facing changes?

jackwener · 2023-05-17T11:01:54Z

Great job, I plan to review it carefully.

alamb · 2023-05-17T19:52:29Z

I will try and find time to review this carefully as well if @jackwener doesn't beat me to it

mingmwang · 2023-05-19T05:49:55Z

@jackwener @alamb
Do you have time to review this PR? I think the new rewrite process is clearer and easier to understand, and it can support more patterns.

alamb · 2023-05-19T11:13:20Z

I will review it today

alamb

Looks like a great job to me @mingmwang -- sorry for the delay in reviewing. 🏆

alamb · 2023-05-19T12:52:21Z

benchmarks/expected-plans/q2.txt

@@ -3,12 +3,12 @@
 +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 | logical_plan  | Sort: supplier.s_acctbal DESC NULLS FIRST, nation.n_name ASC NULLS LAST, supplier.s_name ASC NULLS LAST, part.p_partkey ASC NULLS LAST                                                                                                                                                                    |
 |               |   Projection: supplier.s_acctbal, supplier.s_name, nation.n_name, part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone, supplier.s_comment                                                                                                                                                   |
-|               |     Inner Join: partsupp.ps_supplycost = __scalar_sq_1.__value, part.p_partkey = __scalar_sq_1.ps_partkey                                                                                                                                                                                                 |
-|               |       Projection: part.p_partkey, part.p_mfgr, partsupp.ps_supplycost, supplier.s_name, supplier.s_address, supplier.s_phone, supplier.s_acctbal, supplier.s_comment, nation.n_name                                                                                                                       |
+|               |     Inner Join: part.p_partkey = __scalar_sq_1.ps_partkey, partsupp.ps_supplycost = __scalar_sq_1.__value                                                                                                                                                                                                 |


The only difference in these plans appears to be the names / order of the columns used internally -- the actual output and plan appear to be the same 👍

alamb · 2023-05-19T12:53:19Z

datafusion/core/tests/sql/subqueries.rs

-        "          TableScan: t2 [t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
-        "  TableScan: t1 projection=[t1_id] [t1_id:UInt32;N]",
+        "Projection: t1.t1_id, __scalar_sq_1.__value AS t2_sum [t1_id:UInt32;N, t2_sum:UInt64;N]",
+        "  Left Join: t1.t1_id = __scalar_sq_1.t2_id [t1_id:UInt32;N, t2_id:UInt32;N, __value:UInt64;N]",


this is a nice change as the query can now actually run 👍
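The Left Join in the new plan can be read as the decorrelated form of a correlated aggregate subquery. As a minimal sketch (plain Rust over in-memory slices, not DataFusion code; the sample data, the assumed query shape `select t1_id, (select sum(t2_int) from t2 where t2.t2_id = t1.t1_id) as t2_sum from t1`, and both function names are assumptions for illustration), the following compares evaluating the subquery once per outer row with aggregating once and left-joining the result:

```rust
use std::collections::HashMap;

/// Row-by-row evaluation: for every outer key, rescan `t2` and sum the matches.
/// `None` models SQL NULL (SUM over zero rows).
fn per_row_subquery(t1: &[u32], t2: &[(u32, u64)]) -> Vec<(u32, Option<u64>)> {
    t1.iter()
        .map(|&id| {
            let mut sum: Option<u64> = None;
            for &(t2_id, v) in t2 {
                if t2_id == id {
                    sum = Some(sum.unwrap_or(0) + v);
                }
            }
            (id, sum)
        })
        .collect()
}

/// Decorrelated evaluation: aggregate `t2` once, grouped by the join key,
/// then LEFT JOIN the outer rows against the aggregated result.
fn aggregate_then_left_join(t1: &[u32], t2: &[(u32, u64)]) -> Vec<(u32, Option<u64>)> {
    let mut sums: HashMap<u32, u64> = HashMap::new();
    for &(t2_id, v) in t2 {
        *sums.entry(t2_id).or_insert(0) += v;
    }
    // Outer rows without a match keep NULL, as a left join would produce.
    t1.iter().map(|&id| (id, sums.get(&id).copied())).collect()
}
```

Both functions return the same rows for SUM, because an unmatched outer row is NULL in either form. An aggregate such as COUNT would need extra care in a sketch like this, since a subquery over zero rows yields 0 rather than NULL.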

alamb · 2023-05-19T12:56:27Z

datafusion/core/tests/sql/subqueries.rs

+async fn simple_uncorrelated_scalar_subquery2() -> Result<()> {
+    let ctx = create_join_context("t1_id", "t2_id", true)?;
+
+    let sql = "select (select count(*) from t1) as b, (select count(1) from t2) as c";


I suggest also adding an error case here for a query that doesn't produce a single row? Something like

select (select t1_id from t1) as b, (select count(1) from t2) as c

In which case I would expect an error

This query cannot be decorrelated by this rule; the subquery expression is kept, so the resulting physical plan is not executable.

select (select t1_id from t1) as b, (select count(1) from t2) as c

Because the subquery in this SQL appears in the projection, it is a special case that this rule can't handle.

The HyPer paper mentions this condition.

alamb · 2023-05-19T13:06:40Z

datafusion/optimizer/src/scalar_subquery_to_join.rs

@@ -510,6 +494,7 @@ mod tests {
            vec![
                Arc::new(ScalarSubqueryToJoin::new()),
                Arc::new(ExtractEquijoinPredicate::new()),
+                Arc::new(EliminateCrossJoin::new()),


I think these tests are getting a little hard to understand because they use a subset of the optimizer passes that is not used for actual queries, but is more than the pass that is defined in this module.

I recommend that in a follow-on PR we either

1. Remove all optimizer passes in the tests in this file other than ScalarSubqueryToJoin (so the tests here are focused on the code in this module)

2. Move any query plans for which we want to see the final plan after all rewrites to the end-to-end test files (ideally sqllogictest)

Sure, I will do this in a follow-up PR.

alamb · 2023-05-19T13:07:17Z

datafusion/optimizer/src/scalar_subquery_to_join.rs

-        \n      Projection: orders.o_custkey, MAX(orders.o_custkey) + Int32(1) AS __value [o_custkey:Int64, __value:Int64;N]\
-        \n        Aggregate: groupBy=[[orders.o_custkey]], aggr=[[MAX(orders.o_custkey)]] [o_custkey:Int64, MAX(orders.o_custkey):Int64;N]\
-        \n          TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]";
+        \n  Filter: customer.c_custkey = __scalar_sq_1.__value [c_custkey:Int64, c_name:Utf8, o_custkey:Int64;N, __value:Int64;N]\


In a real query plan, this filter is probably removed, right?

Yes, you are right. This filter will be converted to Join filters by the rule PushDownFilter in a real query plan.

alamb · 2023-05-19T13:07:26Z

datafusion/optimizer/src/scalar_subquery_to_join.rs

-          Filter: customer.c_custkey = orders.o_custkey [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
-            TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
-    TableScan: customer [c_custkey:Int64, c_name:Utf8]"#;
+        let expected = "Projection: customer.c_custkey [c_custkey:Int64]\


alamb · 2023-05-19T13:13:55Z

FYI I don't think this closes #3667

I tried the reproducer in that ticket with this branch and it still errors

$ echo "1,2" > /tmp/test.csv
$ cd arrow-datafusion/datafusion-cli
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.94s
     Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v24.0.0
❯ create external table test (a decimal(10,2), b decimal(10,2)) stored as csv location '/tmp/test.csv';
0 rows in set. Query took 0.004 seconds.
❯ select * from test where a between (select distinct a from test) and (select distinct b from test);
This feature is not implemented: Physical plan does not support logical expression (<subquery>)

mingmwang · 2023-05-23T02:36:11Z

Yes, the query in the original ticket cannot be decorrelated because select distinct is not guaranteed to return a single scalar value.

select * from test where a between (select distinct a from test) and (select distinct b from test);

What this PR does close is a pattern like the one below:

select * from test where a between (select min(a) from test) and (select max(b) from test);
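To make the distinction concrete, here is a minimal sketch in plain Rust (not DataFusion code; the function names and the error message are hypothetical). An aggregate without GROUP BY always yields exactly one row, while select distinct yields one row per distinct value, so only the former is guaranteed to be usable as a scalar:

```rust
/// Models `select min(a) from test`: always exactly one row (NULL on empty input).
fn min_subquery(col: &[i64]) -> Vec<Option<i64>> {
    vec![col.iter().copied().min()]
}

/// Models `select distinct a from test`: one row per distinct value.
fn distinct_subquery(col: &[i64]) -> Vec<Option<i64>> {
    let mut vals: Vec<i64> = col.to_vec();
    vals.sort();
    vals.dedup();
    vals.into_iter().map(Some).collect()
}

/// Scalar subquery semantics: zero rows -> NULL, one row -> its value,
/// more than one row -> error.
fn as_scalar(rows: Vec<Option<i64>>) -> Result<Option<i64>, String> {
    match rows.len() {
        0 => Ok(None),
        1 => Ok(rows[0]),
        n => Err(format!("scalar subquery returned {n} rows")),
    }
}
```

With a single-row table such as the CSV in the reproducer, `distinct_subquery` happens to return one row, but that depends on the data; the aggregate form returns one row regardless of the table contents.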

mingmwang · 2023-05-23T06:12:45Z

@alamb
Can I merge this PR?

jackwener

🥲I'm sorry for the late review of this PR as I've been busy at work lately

This work is a big step forward for "Subquery Unnesting"!

Thanks @mingmwang and @alamb

jackwener · 2023-05-23T07:23:35Z

BTW, I plan to add more subquery unnesting tests in sqllogictest to confirm that our PRs (related to subquery unnesting) are correct.

More scalar subquery support

4b8bc06

github-actions bot added the core (Core DataFusion crate) and optimizer (Optimizer rules) labels May 17, 2023

mingmwang requested review from alamb and jackwener May 17, 2023 06:56

mingmwang added 2 commits May 17, 2023 18:05

fix simple scalar subquery

0499366

add more UT

44cf728

alamb approved these changes May 19, 2023

jackwener approved these changes May 23, 2023

jackwener merged commit 924059a into apache:main May 23, 2023
