Remove DISTINCT inside IN subquery expression #7781

martint · 2017-04-07T18:40:11Z

Add an optimization rule to remove unnecessary DISTINCT in this scenario:

SELECT ...
FROM t
WHERE c IN (SELECT DISTINCT ... FROM u)

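Since the semi join already deduplicates the values produced by the subquery, the DISTINCT is unnecessary and only adds an extra exchange and aggregation. Compare the two plans below: the second one (with DISTINCT) has an additional fragment and a partial/final aggregation on custkey_1.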

presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select custkey from tpch.tiny.customer);
                                                            Query Plan
-----------------------------------------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
             custkey_1 := tpch:custkey

presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select distinct custkey from tpch.tiny.customer);
                                           Query Plan
-------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [HASH]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - Aggregate(FINAL)[custkey_1] => [custkey_1:bigint]
         - LocalExchange[HASH] ("custkey_1") => custkey_1:bigint
             - RemoteSource[4] => [custkey_1:bigint]

 Fragment 4 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH [custkey_1]
     - Aggregate(PARTIAL)[custkey_1] => [custkey_1:bigint]
         - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
                 custkey_1 := tpch:custkey
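
To illustrate why the DISTINCT cannot change the result, here is a minimal Python sketch of SQL's three-valued IN semantics. It is a toy model, not Presto code: `sql_in`, `distinct`, and the sample data are hypothetical names introduced only for this example. The outcome of IN depends only on which values (and whether a NULL) appear in the subquery output, not on how many times they appear.

```python
from typing import Iterable, List, Optional


def sql_in(value: Optional[int], candidates: Iterable[Optional[int]]) -> Optional[bool]:
    """Toy model of SQL `value IN (subquery)`: returns True, False, or None (unknown)."""
    candidates = list(candidates)
    if not candidates:
        return False  # IN over an empty set is always false
    if value is None:
        return None   # NULL compared with a non-empty set is unknown
    if value in [c for c in candidates if c is not None]:
        return True
    # No match: unknown if the subquery produced a NULL, otherwise false.
    return None if any(c is None for c in candidates) else False


def distinct(values: Iterable[Optional[int]]) -> List[Optional[int]]:
    """Toy model of SELECT DISTINCT: keep the first occurrence of each value."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


# Hypothetical sample data standing in for orders.custkey and customer.custkey.
orders = [1, 2, 2, 5, None]
customers = [1, 1, 2, None, 3]

without_distinct = [sql_in(o, customers) for o in orders]
with_distinct = [sql_in(o, distinct(customers)) for o in orders]
```

Under this model both lists are identical, which is the property the proposed rule relies on.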

hellium01 · 2017-04-14T06:06:59Z

Hi Martin, I am trying to work on this. Is this approach correct? If an aggregation node's output symbols are the same as its grouping symbols and the node sits under the FilteringSource of a SemiJoin, we should remove that aggregation node.

martint · 2017-04-14T06:10:35Z

Yes, that should work.
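
The approach discussed above could be sketched roughly as follows. This is a toy plan model in Python, not Presto's planner API: the `TableScan`, `Aggregation`, and `SemiJoin` classes and the `remove_distinct_from_semi_join` function are hypothetical stand-ins, and the sketch assumes a DISTINCT is represented as an aggregation with grouping keys and no aggregate functions.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TableScan:
    table: str
    outputs: Tuple[str, ...]


@dataclass(frozen=True)
class Aggregation:
    source: object
    grouping_keys: Tuple[str, ...]
    aggregates: Tuple[str, ...]  # output symbols of aggregate functions; empty for DISTINCT

    @property
    def outputs(self) -> Tuple[str, ...]:
        return self.grouping_keys + self.aggregates


@dataclass(frozen=True)
class SemiJoin:
    source: object
    filtering_source: object
    source_key: str
    filtering_key: str


def is_distinct(node: object) -> bool:
    """True if the node only groups: its outputs are exactly its grouping keys."""
    return (
        isinstance(node, Aggregation)
        and not node.aggregates
        and set(node.outputs) == set(node.grouping_keys)
    )


def remove_distinct_from_semi_join(node: SemiJoin) -> SemiJoin:
    """Drop a DISTINCT-style aggregation directly under the filtering source."""
    if is_distinct(node.filtering_source):
        return replace(node, filtering_source=node.filtering_source.source)
    return node


# Toy plan for: orders WHERE custkey IN (SELECT DISTINCT custkey FROM customer)
customer_scan = TableScan("customer", ("custkey_1",))
plan = SemiJoin(
    source=TableScan("orders", ("custkey",)),
    filtering_source=Aggregation(customer_scan, ("custkey_1",), ()),
    source_key="custkey",
    filtering_key="custkey_1",
)
rewritten = remove_distinct_from_semi_join(plan)

# An aggregation that computes something (e.g. a count) must be left alone.
counting = replace(plan, filtering_source=Aggregation(customer_scan, ("custkey_1",), ("count",)))
untouched = remove_distinct_from_semi_join(counting)
```

In this sketch the rewritten semi join reads the customer scan directly, while a filtering source that computes real aggregates is kept unchanged. A real rule would also have to handle cases the toy ignores, such as a projection between the semi join and the aggregation.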

ssaumitra · 2019-03-22T22:25:20Z

Looks like this is not merged yet. @kokosing Are you still working on this?

martint added beginner-task enhancement planner labels Apr 7, 2017

hellium01 mentioned this issue Apr 14, 2017

Optimize distinct from semijoin #7832

Closed

hellium01 mentioned this issue Nov 13, 2018

Optimize distinct from semi join #8092

Closed

Praveen2112 mentioned this issue Mar 27, 2019

Remove DISTINCT inside IN subquery expression trinodb/trino#551

Merged

kokosing closed this as completed in trinodb/trino#551 Apr 1, 2019

ankitdixit mentioned this issue May 13, 2019

Remove futile sort operations in sub queries trinodb/trino#759

Closed
