Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove DISTINCT inside IN subquery expression #7781

Closed
martint opened this issue Apr 7, 2017 · 3 comments
Closed

Remove DISTINCT inside IN subquery expression #7781

martint opened this issue Apr 7, 2017 · 3 comments

Comments

@martint
Copy link
Contributor

martint commented Apr 7, 2017

Add an optimization rule to remove unnecessary DISTINCT in this scenario:

SELECT ...
FROM t
WHERE c IN (SELECT DISTINCT ... FROM u)

Since the semi join already performs deduplication of the values in the subquery, that operation is unnecessary and results in an extra exchange and aggregation.

presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select custkey from tpch.tiny.customer);
                                                            Query Plan
-----------------------------------------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
             custkey_1 := tpch:custkey
presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select distinct custkey from tpch.tiny.customer);
                                           Query Plan
-------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [HASH]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - Aggregate(FINAL)[custkey_1] => [custkey_1:bigint]
         - LocalExchange[HASH] ("custkey_1") => custkey_1:bigint
             - RemoteSource[4] => [custkey_1:bigint]

 Fragment 4 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH [custkey_1]
     - Aggregate(PARTIAL)[custkey_1] => [custkey_1:bigint]
         - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
                 custkey_1 := tpch:custkey
@hellium01
Copy link
Contributor

Hi, Martin, I am trying to work on this. Is this method correct? If aggregation node has output symbol same as group symbol and it is under FilteringSource of SemiJoin, we should remove that aggregation node.

@martint
Copy link
Contributor Author

martint commented Apr 14, 2017

Yes, that should work.

@ssaumitra
Copy link
Contributor

Looks like this is not merged yet. @kokosing Are you still working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants