Design how to respect output stream ordering #424

tustvold · 2021-05-25T10:39:01Z

As identified in #423 and #378 (and #412) there needs to be a mechanism for physical operators to express their behaviour with respect to sort order, so that optimisation passes can handle it correctly.

It is assumed that the initial physical plan created from the logical plan is valid, and that the requirement is for the optimisation phase to not alter the plan in a way that violates its implicit ordering requirements. I think it is therefore sufficient to encode some notion of sort sensitivity, as opposed to what the sort order necessarily is. I believe any optimisations related to the specific sort orders being utilised would take place at the LogicalPlan level, and avoiding this at the physical layer sidesteps issues around equality for PhysicalExpressions, etc...

The proposal would be to introduce a new member function to ExecutionPlan called partition_order() that returns a variant of a new enum PartitionOrder. This would have three variants:

PartitionOrder::Preserving - operators that preserve the ordering of their input partition(s) - e.g. FilterExec, CoalesceBatchesExec
PartitionOrder::Sensitive - operators which rely on the order of their input partition(s) - e.g. GlobalLimitExec, LocalLimitExec
PartitionOrder::Agnostic - operators which do not rely on, nor preserve the order of their input partition(s) - e.g. HashAggregateExec, MergeExec, RepartitionExec

Note that the formulation does not distinguish between 1 or many partitions, as this is a detail already encapsulated by required_child_distribution (although I do wonder if this should be a property of the plan and not the operators themselves). There is no mechanism to express an ordering requirement across partitions, I'm not sure that this would be useful.

The default implementation of partition_order() would return PartitionOrder::Sensitive. Or to put it another way, unless explicitly told otherwise the optimiser cannot assume that an operator isn't sensitive to the ordering of its inputs.

The Repartition pass would then be modified to only insert a RepartitionExec on branches of the graph that have no PartitionOrder::Sensitive operators without an intervening PartitionOrder::Agnostic operator. This would fix #423. AddMergeExec could additionally be modified to error if it find itself needing to insert a MergeExec on an order sensitive branch.

Eventually as certain forms of RepartitionExec are order preserving, e.g. splitting a single partition into multiple, this could be codified and combined with a modified version of AddMergeExec that inserts an order preserving merge. This would naturally fit into the proposed framework.

I'm not sure how ordering is typically handled in query engines, so if there is a standard solution I'd be happy to go with that instead, but thought I'd write up the simplest solution I can see to the issue in #423

The text was updated successfully, but these errors were encountered:

andygrove · 2021-05-25T13:11:24Z

We may want to consider following Spark's approach here, since we are already implementing other methods such as required_child_distribution based on Spark.

For sorting, Spark has:

/** Specifies how data is ordered in each partition. */
def outputOrdering: Seq[SortOrder]

/** Specifies sort order for each partition requirements on the input data for this operator. */
def requiredChildOrdering: Seq[Seq[SortOrder]]

alamb · 2021-05-25T15:41:48Z

My thoughts:

I think it will be simpler, as @tustvold has suggested, to do the majority / all of sort based optimizations (e.g. optimize away a Sort) on the LogicalPlan level, rather than in the physical plan. That way:

We can work with Exprs rather than PhysicalExprs.
The knowledge of sort order can also feed into potential cost model decisions too (e.g. join ordering, algorithm selection)

Encoding the requirements / assumptions of LogicalPlan nodes via outputOrdering or requiredChildOrdering seems like a good idea to me.

In terms of physical plans, what about adding something like ExecutionPlan::requires_output_sort() that would communicate to the various physical optimizer passes when they had to preserve the output sort (and thus might preclude things like "repartition exec" from rewriting the plan)

andygrove · 2021-05-25T16:17:39Z

I agree that doing this at the logical plan level is simpler. My concern is that later on, if/when we make Ballista viable, we may want to dynamically optimize the query while it is executing, based on statistics from completed stages, and I was hoping to avoid Spark's approach of going back to the logical plan for this, and just optimize the physical plan. However, we can always revisit this later.

alamb · 2021-05-25T17:04:30Z

if/when we make Ballista viable, we may want to dynamically optimize the query while it is executing,

I do think dynamically changing the plan (or algorithms) based on actual execution experience is currently state of the art. I have often wondered if it is better done within the operators themselves (like maybe a join deciding to switch to sort-merge-join when it filled up its hash tables, or sampling both inputs to decide which was smaller. etc)

I would have to think of the kinds of dynamic plan changes we might want to do

alamb · 2022-02-07T21:25:00Z

Proposed fix: #1776

alamb · 2023-09-28T18:02:57Z

Related PR: #7671 (review)

tustvold added the enhancement New feature or request label May 25, 2021

alamb changed the title ~~Operator Ordering~~ Design how to respect output stream ordering May 25, 2021

tustvold mentioned this issue Feb 4, 2022

Prevent repartitioning of certain operator's direct children (#1731) #1732

Merged

This was referenced Feb 7, 2022

Do not repartition sorted inputs SortPreservingMerge #1748

Closed

Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776

Merged

alamb closed this as completed in #1776 Feb 9, 2022

alamb mentioned this issue Sep 28, 2023

Enhance Enforce Dist capabilities to fix, sub optimal bad plans #7671

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design how to respect output stream ordering #424

Design how to respect output stream ordering #424

tustvold commented May 25, 2021 •

edited

Loading

andygrove commented May 25, 2021 •

edited

Loading

alamb commented May 25, 2021

andygrove commented May 25, 2021

alamb commented May 25, 2021

alamb commented Feb 7, 2022

alamb commented Sep 28, 2023

Design how to respect output stream ordering #424

Design how to respect output stream ordering #424

Comments

tustvold commented May 25, 2021 • edited Loading

andygrove commented May 25, 2021 • edited Loading

alamb commented May 25, 2021

andygrove commented May 25, 2021

alamb commented May 25, 2021

alamb commented Feb 7, 2022

alamb commented Sep 28, 2023

tustvold commented May 25, 2021 •

edited

Loading

andygrove commented May 25, 2021 •

edited

Loading