Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More scalar subqueries support #6372

Merged
merged 3 commits into from
May 23, 2023
Merged

Conversation

mingmwang
Copy link
Contributor

@mingmwang mingmwang commented May 17, 2023

Which issue does this PR close?

Partially Closes #3667
Closes #6370
Closes #6371.

Rationale for this change

What changes are included in this PR?

  1. Support more type of scalar subqueries
  2. Simply the rewrite process

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels May 17, 2023
@mingmwang mingmwang requested review from alamb and jackwener May 17, 2023 06:56
@jackwener
Copy link
Member

A great job, I prepare to review it carefully.

@alamb
Copy link
Contributor

alamb commented May 17, 2023

I will try and find time to review this carefully as well if @jackwener doesn't beat me to it

@mingmwang
Copy link
Contributor Author

@jackwener @alamb
Do you have time to review this PR? I think the new rewrite process is more clear and easy to understand and can support more patterns.

@alamb
Copy link
Contributor

alamb commented May 19, 2023

I will review it today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great job to me @mingmwang -- sorry for the delay in reviewing. 🏆

@@ -3,12 +3,12 @@
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: supplier.s_acctbal DESC NULLS FIRST, nation.n_name ASC NULLS LAST, supplier.s_name ASC NULLS LAST, part.p_partkey ASC NULLS LAST |
| | Projection: supplier.s_acctbal, supplier.s_name, nation.n_name, part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone, supplier.s_comment |
| | Inner Join: partsupp.ps_supplycost = __scalar_sq_1.__value, part.p_partkey = __scalar_sq_1.ps_partkey |
| | Projection: part.p_partkey, part.p_mfgr, partsupp.ps_supplycost, supplier.s_name, supplier.s_address, supplier.s_phone, supplier.s_acctbal, supplier.s_comment, nation.n_name |
| | Inner Join: part.p_partkey = __scalar_sq_1.ps_partkey, partsupp.ps_supplycost = __scalar_sq_1.__value |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference in these plans appear to be the names / orders of the columns used internally -- the actual output and plan appear to be the same) 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

" TableScan: t2 [t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
" TableScan: t1 projection=[t1_id] [t1_id:UInt32;N]",
"Projection: t1.t1_id, __scalar_sq_1.__value AS t2_sum [t1_id:UInt32;N, t2_sum:UInt64;N]",
" Left Join: t1.t1_id = __scalar_sq_1.t2_id [t1_id:UInt32;N, t2_id:UInt32;N, __value:UInt64;N]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice change as the query can now actually run 👍

async fn simple_uncorrelated_scalar_subquery2() -> Result<()> {
let ctx = create_join_context("t1_id", "t2_id", true)?;

let sql = "select (select count(*) from t1) as b, (select count(1) from t2) as c";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest also adding an error case here for a query that doesn't produce a single row? Something like

select (select t1_id from t1) as b, (select count(1) from t2) as c

In which case I would expect an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest also adding an error case here for a query that doesn't produce a single row? Something like

select (select t1_id from t1) as b, (select count(1) from t2) as c

In which case I would expect an error

This query can not be de-correlated by this rule, the subquery expression is kept and the real physical plan is not executable.

Copy link
Member

@jackwener jackwener May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select (select t1_id from t1) as b, (select count(1) from t2) as c

Because this sql is project subquery, it's a special situation, this rule can't handle it.

HyPer Paper mention this condition.

@@ -510,6 +494,7 @@ mod tests {
vec![
Arc::new(ScalarSubqueryToJoin::new()),
Arc::new(ExtractEquijoinPredicate::new()),
Arc::new(EliminateCrossJoin::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests are getting a little hard to understand because they use a subset of the optimizer passes that is not used for actual queries, bit is more than pass that is defined in this module.

I recommend that in a follow on PR we either

  1. remove all optimizer passes in the tests in this file other than ScalarSubqueryToJoin (so the tests here are focused on the code in this module)
  2. Move any query plans that we want to see the final plan after all rewrites to the end to end tests (ideally sqllogictest) files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will do this in a following PR.

\n Projection: orders.o_custkey, MAX(orders.o_custkey) + Int32(1) AS __value [o_custkey:Int64, __value:Int64;N]\
\n Aggregate: groupBy=[[orders.o_custkey]], aggr=[[MAX(orders.o_custkey)]] [o_custkey:Int64, MAX(orders.o_custkey):Int64;N]\
\n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]";
\n Filter: customer.c_custkey = __scalar_sq_1.__value [c_custkey:Int64, c_name:Utf8, o_custkey:Int64;N, __value:Int64;N]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a real query plan, this filter is probably removed, right?

Copy link
Contributor Author

@mingmwang mingmwang May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. This filter will be converted to Join filters by the rule PushDownFilter in a real query plan.

Filter: customer.c_custkey = orders.o_custkey [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]
TableScan: customer [c_custkey:Int64, c_name:Utf8]"#;
let expected = "Projection: customer.c_custkey [c_custkey:Int64]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@alamb
Copy link
Contributor

alamb commented May 19, 2023

FYI I don't think this closes #3667

I tried the reproducer in that ticket with this branch and it still errors

$ echo "1,2" > /tmp/test.csv
$ cd arrow-datafusion/datafusion-cli
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.94s
     Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v24.0.0
❯ create external table test (a decimal(10,2), b decimal(10,2)) stored as csv location '/tmp/test.csv';
0 rows in set. Query took 0.004 seconds.
❯ select * from test where a between (select distinct a from test) and (select distinct b from test);
This feature is not implemented: Physical plan does not support logical expression (<subquery>)

@mingmwang
Copy link
Contributor Author

FYI I don't think this closes #3667

I tried the reproducer in that ticket with this branch and it still errors

$ echo "1,2" > /tmp/test.csv
$ cd arrow-datafusion/datafusion-cli
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.94s
     Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v24.0.0
❯ create external table test (a decimal(10,2), b decimal(10,2)) stored as csv location '/tmp/test.csv';
0 rows in set. Query took 0.004 seconds.
❯ select * from test where a between (select distinct a from test) and (select distinct b from test);
This feature is not implemented: Physical plan does not support logical expression (<subquery>)

Yes, the query in the original ticket can not be de-correlated because the select distinct can not make sure to return a scalar value.

select * from test where a between (select distinct a from test) and (select distinct b from test);

What this ticket closed is a patten like below:

select * from test where a between (select min (a) from test) and (select max(b) from test);

@mingmwang
Copy link
Contributor Author

@alamb
Can I merge this PR ?

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥲I'm sorry for the late review of this PR as I've been busy at work lately

This work is a big step forward for "Subquery Unnesting"!

Thanks @mingmwang and @alamb

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥲I'm sorry for the late review of this PR as I've been busy at work lately

This work is a big step forward for "Subquery Unnesting"!

Thanks @mingmwang and @alamb

@jackwener
Copy link
Member

jackwener commented May 23, 2023

BTW, I prepare to add more subquery unnesting test in sqllogicaltest to confirm our PR (related with SubqueryUnnesting) is correct.

@jackwener jackwener merged commit 924059a into apache:main May 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
3 participants