feat: clarify the meaning of plans (#616)
#612 and #613 highlighted a number of ambiguities in plan
interpretation. This should address those ambiguities. 

Co-authored-by: Ingo Müller <github.com@ingomueller.net>
westonpace and ingomueller-net authored Aug 1, 2024
1 parent e4f5b68 commit c1553df
Showing 3 changed files with 38 additions and 5 deletions.
12 changes: 9 additions & 3 deletions site/docs/relations/basics.md
@@ -1,6 +1,14 @@
# Basics

Substrait is designed to allow a user to construct an arbitrarily complex data transformation plan. The plan is composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output transformations. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.
Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.

## Plans

A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan.

Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If the references between plans contain no cycles (there is only one plan, or each reference relation only references later plans), then the plans together form a DAG (Directed Acyclic Graph).
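
As a rough illustration of the paragraphs above, the following Python sketch models each plan as a tree of relations and checks whether the reference relations between plans form a DAG. The `Rel` class, its `ref` field, and the function names are hypothetical and chosen only for this example; they are not Substrait message types or part of any Substrait library.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Rel:
    """Hypothetical relation node (assumption, not a Substrait type)."""
    op: str
    # Children of the node are the inputs to the operation.
    inputs: List["Rel"] = field(default_factory=list)
    # For a reference relation: index of the referenced plan in the collection.
    ref: Optional[int] = None


def referenced_plans(rel: Rel) -> Iterator[int]:
    """Yield every plan index referenced anywhere in the tree rooted at rel."""
    if rel.ref is not None:
        yield rel.ref
    for child in rel.inputs:
        yield from referenced_plans(child)


def is_dag(plans: List[Rel]) -> bool:
    """Return True if following references never leads back to a plan in progress."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = [UNVISITED] * len(plans)

    def visit(i: int) -> bool:
        if state[i] == IN_PROGRESS:
            return False  # reached a plan that is still being explored: a cycle
        if state[i] == DONE:
            return True
        state[i] = IN_PROGRESS
        ok = all(visit(j) for j in referenced_plans(plans[i]))
        state[i] = DONE
        return ok

    return all(visit(i) for i in range(len(plans)))


# Plan 0 references plan 1, which is a plain tree with a leaf read.
plans = [
    Rel("project", [Rel("reference", ref=1)]),
    Rel("filter", [Rel("read")]),
]
print(is_dag(plans))  # True
```

In this sketch, a collection in which plan 0 references plan 1 and plan 1 references plan 0 would make `is_dag` return `False`.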

## Relational Operators

Each relational operation is composed of several properties. Common properties for relational operations include the following:

@@ -10,8 +18,6 @@ Each relational operation is composed of several properties. Common properties f
| Hints | A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information. | Physical |
| Constraint | A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc. | Physical |



## Relational Signatures

In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the speed of change and number of total operations. Relational operations in the specification are expected to be <100 for several years with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.
6 changes: 4 additions & 2 deletions site/docs/serialization/_config
@@ -1,3 +1,5 @@
arrange:
- binary_serialization.md
- text_serialization.md

- basics.md
- binary_serialization.md
- text_serialization.md
25 changes: 25 additions & 0 deletions site/docs/serialization/basics.md
@@ -0,0 +1,25 @@
# Basics

Substrait is designed to be serialized into various different formats. Currently we support a binary serialization for
transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future.

These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted.
For example, the following scenarios are all valid uses of a collection of plans:

- A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The
top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans
which are referenced from the root plan.
- A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then
it could output a serialized binary containing multiple root plans. Each root plan is a representation of the
input plan in a different dialect.
- A distributed scheduler might expect one or more root plans. Each root plan describes a different stage of computation.

Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed.

## Root plans

We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we
encode this graph as a collection of trees with a single root tree that references other trees (which may also
transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are "root"
plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan
can safely be ignored.
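
The rule in the last sentence can be sketched as a reachability walk. The example below is a minimal sketch that assumes the references between plans have already been extracted into a dictionary mapping each plan index to the plan indices it references; the function name and this input shape are assumptions made for illustration and are not part of any Substrait serialization or library.

```python
from typing import Dict, Iterable, Set


def reachable_plans(references: Dict[int, Iterable[int]], roots: Iterable[int]) -> Set[int]:
    """Return the root plans plus every plan they reference, directly or transitively."""
    seen: Set[int] = set()
    stack = list(roots)
    while stack:
        plan = stack.pop()
        if plan in seen:
            continue  # already handled; this also keeps recursive plans from looping
        seen.add(plan)
        stack.extend(references.get(plan, ()))
    return seen


# Plan 0 is the root; it references plan 1, which references plan 2.
# Plan 3 references plan 2 but is neither a root nor referenced by one.
references = {0: [1], 1: [2], 2: [], 3: [2]}
needed = reachable_plans(references, roots=[0])
ignorable = set(references) - needed
print(sorted(needed))     # [0, 1, 2]
print(sorted(ignorable))  # [3]
```

A consumer following this approach would only need to process the plans in `needed`.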
