Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove futile sort operations in sub queries #759

Closed
ankitdixit opened this issue May 13, 2019 · 6 comments · Fixed by #818
Closed

Remove futile sort operations in sub queries #759

ankitdixit opened this issue May 13, 2019 · 6 comments · Fixed by #818

Comments

@ankitdixit
Copy link
Member

ankitdixit commented May 13, 2019

Add an optimization rule to remove unnecessary SORT in inner queries

select * from abc a join (select * from abc order by key) b on a.key = b.key;

Sort operation is unnecessary and results in extra exchange and sorts.

 presto:default> explain select * from abc a join (select * from abc order by key) b on a.key = b.key;

 - Output[key, key] => [key:bigint, key:bigint]
     - RemoteExchange[GATHER] => key:bigint
         - InnerJoin[("key" = "key_0")][$hashvalue, $hashvalue_16] => [key:bigint]
                 Distribution: PARTITIONED
             - RemoteExchange[REPARTITION][$hashvalue] => key:bigint, $hashvalue:bigint
                     Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: ?}
                 - ScanProject[table = hive:default:abc, originalConstraint = true] => [key:bigint, $hashvalue_15:bigint]
                         Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
                         $hashvalue_15 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key"), 0))
                         LAYOUT: default.abc
                         key := HiveColumnHandle{name=key, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}
             - LocalExchange[HASH][$hashvalue_16] ("key_0") => key_0:bigint, $hashvalue_16:bigint
                 - RemoteExchange[REPARTITION][$hashvalue_17] => key_0:bigint, $hashvalue_17:bigint
                     - RemoteMerge[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_18:bigint]
                         - LocalMerge[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_19:bigint]
                             - PartialSort[key_0 ASC_NULLS_LAST] => [key_0:bigint, $hashvalue_20:bigint]
                                 - RemoteExchange[REPARTITION] => key_0:bigint, $hashvalue_20:bigint
                                         Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: ?}
                                     - ScanProject[table = hive:default:abc, originalConstraint = true] => [key_0:bigint, $hashvalue_21:bigint]
                                             Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
                                             $hashvalue_21 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key_0"), 0))
                                             LAYOUT: default.abc
                                             key_0 := HiveColumnHandle{name=key, hiveType=bigint, hiveColumnIndex=0, columnType=REGULAR}

(1 row)  

We should write an optimizer which removes these inner sorts. This seems slightly similar to #551, prestodb/presto#7781.
@martint @electrum @Praveen2112 If this makes sense, i can pick this up.

@kokosing
Copy link
Member

It makes sense.

However, instead of removing sort I would add LIMIT 1 (i.e. wrap subquery plan with LimitNode(limit=1)), then things like sorting should be eliminated automatically with #441. Also it makes possible to pushdown the LIMIT into connector.

@kokosing
Copy link
Member

Ah... adding LIMIT is incorrect. Please go ahead with your origin approach and remove sort.

@ankitdixit
Copy link
Member Author

One case that i missed previously was that if there is a limit in the sub query on top of order by result then, we cannot remove it.
select * from abc a join (select * from abc order by key limit 10) b on a.key = b.key;
So, the logic becomes something like this,
If there is an order by operator which has a JOIN, AGGREGATE or DISTINCT node among its ancestors without having a LIMIT node in between then we should remove it.
This seems easier to implement as a Legacy Visitor style optimizer as compared to Rule based Optimizer, which one should i use?
@kokosing @Praveen2112

@Praveen2112
Copy link
Member

We can go for rule based optimizer ( as other visitor based optimizer might move a similar rule based one ) and these rules has to be placed post the merging of OrderBy and Limit node so that it might not affect the overall plan.

ankitdixit added a commit to ankitdixit/presto-1 that referenced this issue May 23, 2019
@martint
Copy link
Member

martint commented May 25, 2019

I'm copying a comment I made in #807 so that it doesn't get lost, as it applies to the general approach to how we solve this.

The fundamental problem is that there's currently no way for the engine to reason about whether a sort is required (i.e., was it added by the planner to enforce some physical organization, was it a result of the user typing ORDER BY in their query in a place where it matters, or was it the result of the user typing ORDER BY in a place where it doesn't matter).

The SQL spec says that ORDER BY is relevant only for the immediate query expression that contains it. Since ORDER BY logically evaluates before FETCH FIRST (LIMIT) and OFFSET, those two operations in a query expression are sensitive the ORDER BY. If the ORDER BY was in a subquery, then it wouldn't. So:

SELECT * FROM t ORDER BY c LIMIT 1

and

SELECT * FROM (
    SELECT * FROM t ORDER BY c
)
LIMIT 1

have the same query plan:

- Limit
   - Sort
     - TableScan

yet, the Sort is required in the first one but not in the second one.

I think a better approach for now, instead of attempting to implement this as an optimization rule, is for the planner skip adding sort nodes in those cases altogether. The analyzer could tag the ORDER BY as irrelevant and the planner would just ignore it.

To ease the transition for the cases where people are relying on that behavior, we should:

  • Gate it behind a configuration option / session property
  • Emit a warning during analysis

@martint
Copy link
Member

martint commented May 26, 2019

Here's a prototype of the approach I described above: #818

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

4 participants