-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More scalar subqueries support #6372
Conversation
A great job, I prepare to review it carefully. |
I will try and find time to review this carefully as well if @jackwener doesn't beat me to it |
@jackwener @alamb |
I will review it today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great job to me @mingmwang -- sorry for the delay in reviewing. 🏆
@@ -3,12 +3,12 @@ | |||
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | |||
| logical_plan | Sort: supplier.s_acctbal DESC NULLS FIRST, nation.n_name ASC NULLS LAST, supplier.s_name ASC NULLS LAST, part.p_partkey ASC NULLS LAST | | |||
| | Projection: supplier.s_acctbal, supplier.s_name, nation.n_name, part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone, supplier.s_comment | | |||
| | Inner Join: partsupp.ps_supplycost = __scalar_sq_1.__value, part.p_partkey = __scalar_sq_1.ps_partkey | | |||
| | Projection: part.p_partkey, part.p_mfgr, partsupp.ps_supplycost, supplier.s_name, supplier.s_address, supplier.s_phone, supplier.s_acctbal, supplier.s_comment, nation.n_name | | |||
| | Inner Join: part.p_partkey = __scalar_sq_1.ps_partkey, partsupp.ps_supplycost = __scalar_sq_1.__value | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference in these plans appear to be the names / orders of the columns used internally -- the actual output and plan appear to be the same) 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
" TableScan: t2 [t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]", | ||
" TableScan: t1 projection=[t1_id] [t1_id:UInt32;N]", | ||
"Projection: t1.t1_id, __scalar_sq_1.__value AS t2_sum [t1_id:UInt32;N, t2_sum:UInt64;N]", | ||
" Left Join: t1.t1_id = __scalar_sq_1.t2_id [t1_id:UInt32;N, t2_id:UInt32;N, __value:UInt64;N]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice change as the query can now actually run 👍
async fn simple_uncorrelated_scalar_subquery2() -> Result<()> { | ||
let ctx = create_join_context("t1_id", "t2_id", true)?; | ||
|
||
let sql = "select (select count(*) from t1) as b, (select count(1) from t2) as c"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest also adding an error case here for a query that doesn't produce a single row? Something like
select (select t1_id from t1) as b, (select count(1) from t2) as c
In which case I would expect an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest also adding an error case here for a query that doesn't produce a single row? Something like
select (select t1_id from t1) as b, (select count(1) from t2) as cIn which case I would expect an error
This query can not be de-correlated by this rule, the subquery expression is kept and the real physical plan is not executable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select (select t1_id from t1) as b, (select count(1) from t2) as c
Because this sql is project subquery
, it's a special situation, this rule can't handle it.
HyPer
Paper mention this condition.
@@ -510,6 +494,7 @@ mod tests { | |||
vec![ | |||
Arc::new(ScalarSubqueryToJoin::new()), | |||
Arc::new(ExtractEquijoinPredicate::new()), | |||
Arc::new(EliminateCrossJoin::new()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests are getting a little hard to understand because they use a subset of the optimizer passes that is not used for actual queries, bit is more than pass that is defined in this module.
I recommend that in a follow on PR we either
- remove all optimizer passes in the tests in this file other than
ScalarSubqueryToJoin
(so the tests here are focused on the code in this module) - Move any query plans that we want to see the final plan after all rewrites to the end to end tests (ideally sqllogictest) files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will do this in a following PR.
\n Projection: orders.o_custkey, MAX(orders.o_custkey) + Int32(1) AS __value [o_custkey:Int64, __value:Int64;N]\ | ||
\n Aggregate: groupBy=[[orders.o_custkey]], aggr=[[MAX(orders.o_custkey)]] [o_custkey:Int64, MAX(orders.o_custkey):Int64;N]\ | ||
\n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]"; | ||
\n Filter: customer.c_custkey = __scalar_sq_1.__value [c_custkey:Int64, c_name:Utf8, o_custkey:Int64;N, __value:Int64;N]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a real query plan, this filter is probably removed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. This filter will be converted to Join filters by the rule PushDownFilter
in a real query plan.
Filter: customer.c_custkey = orders.o_custkey [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N] | ||
TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N] | ||
TableScan: customer [c_custkey:Int64, c_name:Utf8]"#; | ||
let expected = "Projection: customer.c_custkey [c_custkey:Int64]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
FYI I don't think this closes #3667 I tried the reproducer in that ticket with this branch and it still errors $ echo "1,2" > /tmp/test.csv
$ cd arrow-datafusion/datafusion-cli
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.94s
Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v24.0.0
❯ create external table test (a decimal(10,2), b decimal(10,2)) stored as csv location '/tmp/test.csv';
0 rows in set. Query took 0.004 seconds.
❯ select * from test where a between (select distinct a from test) and (select distinct b from test);
This feature is not implemented: Physical plan does not support logical expression (<subquery>) |
Yes, the query in the original ticket can not be de-correlated because the
What this ticket closed is a patten like below:
|
@alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥲I'm sorry for the late review of this PR as I've been busy at work lately
This work is a big step forward for "Subquery Unnesting"!
Thanks @mingmwang and @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥲I'm sorry for the late review of this PR as I've been busy at work lately
This work is a big step forward for "Subquery Unnesting"!
Thanks @mingmwang and @alamb
BTW, I prepare to add more |
Which issue does this PR close?
Partially Closes #3667
Closes #6370
Closes #6371.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?