Remove futile sort operations in sub queries #759

ankitdixit · 2019-05-13T06:56:02Z

Add an optimization rule to remove unnecessary SORT in inner queries

select * from abc a join (select * from abc order by key) b on a.key = b.key;

Sort operation is unnecessary and results in extra exchange and sorts.

 presto:default> explain select * from abc a join (select * from abc order by key) b on a.key = b.key;

 - Output[key, key] => [key:bigint, key:bigint]
     - RemoteExchange[GATHER] => key:bigint
         - InnerJoin[("key" = "key_0")][$hashvalue, $hashvalue_16] => [key:bigint]
                 Distribution: PARTITIONED
             - RemoteExchange[REPARTITION][$hashvalue] => key:bigint, $hashvalue:bigint
                     Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: ?}
                 - ScanProject[table = hive:default:abc, originalConstraint = true] => [key:bigint, $hashvalue_15:bigint]
                         Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
                         $hashvalue_15 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key"), 0))
                         LAYOUT: default.abc
                         key := HiveColumnHandle{name=key, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}
             - LocalExchange[HASH][$hashvalue_16] ("key_0") => key_0:bigint, $hashvalue_16:bigint
                 - RemoteExchange[REPARTITION][$hashvalue_17] => key_0:bigint, $hashvalue_17:bigint
                     - RemoteMerge[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_18:bigint]
                         - LocalMerge[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_19:bigint]
                             - PartialSort[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_20:bigint]
                                 - RemoteExchange[REPARTITION] => key_0:bigint, $hashvalue_20:bigint
                                         Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: ?}
                                     - ScanProject[table = hive:default:abc, originalConstraint = true] => [key_0:bigint, $hashvalue_21:bigint]
                                             Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
                                             $hashvalue_21 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key_0"), 0))
                                             LAYOUT: default.abc
                                             key_0 := HiveColumnHandle{name=key, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}

(1 row)

We should write an optimizer which removes these inner sorts. This seems slightly similar to #551, prestodb/presto#7781.
@martint @electrum @Praveen2112 If this makes sense, i can pick this up.

The text was updated successfully, but these errors were encountered:

kokosing · 2019-05-13T07:55:37Z

It makes sense.

However, instead of removing sort I would add LIMIT 1 (i.e. wrap subquery plan with LimitNode(limit=1)), then things like sorting should be eliminated automatically with #441. Also it makes possible to pushdown the LIMIT into connector.

kokosing · 2019-05-13T08:03:28Z

Ah... adding LIMIT is incorrect. Please go ahead with your origin approach and remove sort.

ankitdixit · 2019-05-13T10:21:01Z

One case that i missed previously was that if there is a limit in the sub query on top of order by result then, we cannot remove it.
select * from abc a join (select * from abc order by key limit 10) b on a.key = b.key;
So, the logic becomes something like this,
If there is an order by operator which has a JOIN, AGGREGATE or DISTINCT node among its ancestors without having a LIMIT node in between then we should remove it.
This seems easier to implement as a Legacy Visitor style optimizer as compared to Rule based Optimizer, which one should i use?
@kokosing @Praveen2112

Praveen2112 · 2019-05-13T10:39:58Z

We can go for rule based optimizer ( as other visitor based optimizer might move a similar rule based one ) and these rules has to be placed post the merging of OrderBy and Limit node so that it might not affect the overall plan.

martint · 2019-05-25T23:28:40Z

I'm copying a comment I made in #807 so that it doesn't get lost, as it applies to the general approach to how we solve this.

The fundamental problem is that there's currently no way for the engine to reason about whether a sort is required (i.e., was it added by the planner to enforce some physical organization, was it a result of the user typing ORDER BY in their query in a place where it matters, or was it the result of the user typing ORDER BY in a place where it doesn't matter).

The SQL spec says that ORDER BY is relevant only for the immediate query expression that contains it. Since ORDER BY logically evaluates before FETCH FIRST (LIMIT) and OFFSET, those two operations in a query expression are sensitive the ORDER BY. If the ORDER BY was in a subquery, then it wouldn't. So:

SELECT * FROM t ORDER BY c LIMIT 1

and

SELECT * FROM (
    SELECT * FROM t ORDER BY c
)
LIMIT 1

have the same query plan:

- Limit
   - Sort
     - TableScan

yet, the Sort is required in the first one but not in the second one.

I think a better approach for now, instead of attempting to implement this as an optimization rule, is for the planner skip adding sort nodes in those cases altogether. The analyzer could tag the ORDER BY as irrelevant and the planner would just ignore it.

To ease the transition for the cases where people are relying on that behavior, we should:

Gate it behind a configuration option / session property
Emit a warning during analysis

martint · 2019-05-26T03:05:11Z

Here's a prototype of the approach I described above: #818

hustnn mentioned this issue May 23, 2019

Ignore order by in a from subquery #807

Closed

ankitdixit added a commit to ankitdixit/presto-1 that referenced this issue May 23, 2019

fixes trinodb#759

e7dd1b3

martint mentioned this issue May 26, 2019

Do not plan unnecessary Sort #818

Merged

martint closed this as completed in #818 May 28, 2019

fgwang7w mentioned this issue Jul 27, 2020

remove Limit/TopN/Sort/DistinctLimit when relation is know to single row or less rows than requested prestodb/presto#14897

Closed

raunaqmorarka mentioned this issue Jun 19, 2024

ORDER BY caluse of INSERT INTO statement is unexpectantly eliminated #22422

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove futile sort operations in sub queries #759

Remove futile sort operations in sub queries #759

ankitdixit commented May 13, 2019 •

edited by findepi

Loading

kokosing commented May 13, 2019

kokosing commented May 13, 2019

ankitdixit commented May 13, 2019

Praveen2112 commented May 13, 2019

martint commented May 25, 2019

martint commented May 26, 2019

Remove futile sort operations in sub queries #759

Remove futile sort operations in sub queries #759

Comments

ankitdixit commented May 13, 2019 • edited by findepi Loading

kokosing commented May 13, 2019

kokosing commented May 13, 2019

ankitdixit commented May 13, 2019

Praveen2112 commented May 13, 2019

martint commented May 25, 2019

martint commented May 26, 2019

ankitdixit commented May 13, 2019 •

edited by findepi

Loading