Skip to content

Commit

Permalink
Add high level CBO documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
losipiuk authored and findepi committed Mar 13, 2019
1 parent 1d50b15 commit f79058e
Show file tree
Hide file tree
Showing 2 changed files with 77 additions and 0 deletions.
1 change: 1 addition & 0 deletions presto-docs/src/main/sphinx/optimizer.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,4 @@ Query Optimizer

optimizer/statistics
optimizer/cost-in-explain
optimizer/cost-based-optimizations
76 changes: 76 additions & 0 deletions presto-docs/src/main/sphinx/optimizer/cost-based-optimizations.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
========================
Cost based optimizations
========================

Presto supports several cost based optimizations, described below.

Join Enumeration
----------------

The order in which joins are executed in a query can have a significant impact
on the query's performance. The aspect of join ordering that has the largest
impact on performance is the size of the data being processed and transferred
over the network. If a join that produces a lot of data is performed early in
the execution, then subsequent stages will need to process large amounts of
data for longer than necessary, increasing the time and resources needed for
the query.

With cost based join enumeration, Presto uses
:doc:`/optimizer/statistics` provided by connectors to estimate
the costs for different join orders and automatically pick the
join order with the lowest computed costs.

The join enumeration strategy is governed by the ``join_reordering_strategy``
session property, with the ``optimizer.join-reordering-strategy``
configuration property providing the default value.

The valid values are:
* ``AUTOMATIC`` (default) - full automatic join enumeration enabled
* ``ELIMINATE_CROSS_JOINS`` - eliminate unnecessary cross joins
* ``NONE`` - purely syntactic join order

If using ``AUTOMATIC`` and statistics are not available, or if for any other
reason a cost could not be computed, the ``ELIMINATE_CROSS_JOINS`` strategy is
used instead.

Join Distribution Selection
---------------------------

Presto uses a hash based join algorithm. That implies that for each join
operator a hash table must be created from one join input (called build side).
The other input (probe side) is then iterated and for each row the hash table is
queried to find matching rows.

There are two types of join distributions:
* Partitioned: each node participating in the query builds a hash table
from only fraction of the data
* Broadcast: each node participating in the query builds a hash table
from all of the data (data is replicated to each node)

Each type have their trade offs. Partitioned joins require redistributing both
tables using a hash of the join key. This can be slower (sometimes
substantially) than broadcast joins, but allows much larger joins. In
particular, broadcast joins will be faster if the build side is much smaller
than the probe side. However, broadcast joins require that the tables on the
build side of the join after filtering fit in memory on each node, whereas
distributed joins only need to fit in distributed memory across all nodes.

With cost based join distribution selection, Presto automatically chooses to
use a partitioned or broadcast join. With cost based join enumeration, Presto
automatically chooses which side is the probe and which is the build.

The join distribution strategy is governed by the ``join_distribution_type ``
session property, with the ``join-distribution-type`` configuration property
providing the default value.

The valid values are:
* ``AUTOMATIC`` (default) - join distribution type is determined automatically
for each join
* ``BROADCAST`` - broadcast join distribution is used for all joins
* ``PARTITIONED`` - partitioned join distribution is used for all join

Connector Implementations
-------------------------

In order for the Presto optimizer to use the cost based strategies,
the connection implementation must provide :doc:`statistics`.

0 comments on commit f79058e

Please sign in to comment.