diff --git a/docs/life_of_a_query.md b/docs/life_of_a_query.md index a940419bef6d..bd7ad37d1e75 100644 --- a/docs/life_of_a_query.md +++ b/docs/life_of_a_query.md @@ -1,6 +1,6 @@ # Life of a SQL Query -This post aims to explain the execution of an SQL query against CockroachDB, explaining the code paths through the various layers of the system (network protocol, SQL session management, parsing, AST building, syntax tree transformations, query running, interface with the KV code, routing of KV requests, request processing, Raft, on-disk storage engine). The idea is to provide a high-level unifying view of the structure of the various components; no one will be explored in particular depth but pointers to other documentation will be provided where such documentation exists. Code pointers will abound. +This post aims to explain the execution of an SQL query against CockroachDB, explaining the code paths through the various layers of the system (network protocol, SQL session management, parsing, execution planning, syntax tree transformations, query running, interface with the KV code, routing of KV requests, request processing, Raft, on-disk storage engine). The idea is to provide a high-level unifying view of the structure of the various components; no one will be explored in particular depth but pointers to other documentation will be provided where such documentation exists. Code pointers will abound. This post will generally not discuss design decisions; it will rather focus on tracing through the actual (current) code. The intended audience for this post is folks curious about a dive through the architecture of a modern, albeit young, database presented differently than in a [design doc](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md). It will hopefully also be helpful for open source contributors and [new Cockroach Labs engineers](https://www.cockroachlabs.com/careers/jobs/). @@ -18,7 +18,7 @@ The [`sql.Executor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb3 `Executor.execRequest()` implements a state-machine of sorts by receiving batches of statements from `pgwire`, executing them one by one, updating the `Session`'s transaction state (did a new transaction just begin or an old transaction just end? did we encounter an error which forces us to abort the current transaction?) and returning results and control back to pgwire. The next batch of statements received from the client will continue from the transaction state left by the previous batch. ### Parsing -The first thing the `Executor` does is [parse the statements](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L495); parsing uses a [Yacc grammar file](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/parser/sql.y), originally copied from Postgres and stripped down, and then gradually grown organically with ever-more SQL support. The process of parsing transforms a `string` into an array of trees, one for each statement. The trees' nodes are structs defined in the `sql/parser` package, generally of two types - statements and expressions. Expressions implement a common interface useful for applying tree transformations. These parse trees will later be transformed by the `planner` into an Abstract Syntax Tree. +The first thing the `Executor` does is [parse the statements](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L495); parsing uses a [Yacc grammar file](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/parser/sql.y), originally copied from Postgres and stripped down, and then gradually grown organically with ever-more SQL support. The process of parsing transforms a `string` into an array of ASTs (Abstract Syntax Trees), one for each statement. The AST nodes are structs defined in the `sql/parser` package, generally of two types - statements and expressions. Expressions implement a common interface useful for applying tree transformations. These ASTs will later be transformed by the `planner` into an execution plan. ### Statement Execution @@ -28,16 +28,18 @@ There's an impedance mismatch that has to be explained here, around the interfac The `Txn.Exec` interface takes a callback and some execution options and, based on those options, executes the callback possibly multiple times and commits the transaction afterwards. If allowed by the options, the callback might be called multiple times, to deal with retries of transactions that are [sometimes necessary](https://www.cockroachlabs.com/docs/transactions.html#transaction-retries) in CockroachDB (usually because of data contention). The SQL `Executor` might or might not want to let the KV client perform such retries automatically. To hint at the complications: a single SQL statement executed outside of a SQL transaction (i.e. an "implicit transaction") can be safely retried. However, a SQL transaction spanning multiple client requests will have different statements executed in different callbacks passed to `Txn.Exec()`; as such, it's not sufficient to retry one of these callbacks - we have to retry all the statements in the transaction, and generally some of these statements might be conditional on the client's logic and thus can't be retried verbatim (i.e. different results for a `SELECT` might trigger different subsequent statements). In this case, we bubble up a retryable error to the client; more details about this can be read in our [transaction documentation](https://www.cockroachlabs.com/docs/transactions.html#client-side-intervention). This complexity is captured in [`Executor.execRequest()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L495), which has logic for setting the different execution options and contains a [suitable callback](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L604) that's [passed to `Txn.Exec()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L643); this callback will call [`runTxnAttempt()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L621). The statement execution code path continues inside the callback, but it's worth noting that, from this moment on, we've interfaced with the (client of the) KV layer and everything below is executing in the context of a KV transaction. -### AST Building +### Building execution plans Now that we've figured out what (KV) transaction we're running inside of, we're concerned with executing SQL statements one at a time. `runTxnAttempt()` has a few layers below it dealing with the various states a SQL transaction can be in ([open](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L966s) / [aborted](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L887) / [waiting for a user retry](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L944), etc.), but the interesting one is [execStmt](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1258). This guy [creates an "execution plan"](https://github.com/cockroachdb/cockroach/blob/677f6f18b63355cb2d040b251af202fe6505128f/pkg/sql/executor.go#L1261) for a statement and runs it. -An execution plan in CockroachDB is a tree of [`planNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad/pkg/sql/plan.go#L72) nodes, similar in spirit to the parse tree but, this time, containing semantic information and also runtime state. This tree is built by [`planner.makePlan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L199), which takes a parsed statement and returns the root of the `planNode` tree after having performed all the semantic analysis and various transformations. The nodes in this tree are actually "executable" (they have `Start()` and `Next()` methods), and each one will consume data produced by its children (e.g. a `JoinNode` has [`left and right`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/join.go#L125) children whose data it consumes). +An execution plan in CockroachDB is a tree of [`planNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad/pkg/sql/plan.go#L72) nodes, similar in spirit to the AST but, this time, containing semantic information and also runtime state. This tree is built by [`planner.makePlan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L199), which takes a parsed statement and returns the root of the `planNode` tree after having performed all the semantic analysis and various transformations. The nodes in this tree are actually "executable" (they have `Start()` and `Next()` methods), and each one will consume data produced by its children (e.g. a `JoinNode` has [`left and right`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/join.go#L125) children whose data it consumes). -Currently building the AST, performing semantic analysis and applying various transformations is a pretty ad-hoc process, but we're working on replacing the code with a more structured process and separating the Intermediate Representation used for analysis and transforms from the runtime structures (see this WIP RFC)[https://github.com/cockroachdb/cockroach/pull/10055/files#diff-542aa8b21b245d1144c920577333ceed]. +Currently building the execution plan, performing semantic analysis and applying various transformations is a pretty ad-hoc process, but we're working on replacing the code with a more structured process and separating the IR (Intermediate Representation) used for analysis and transforms from the runtime structures (see this WIP RFC)[https://github.com/cockroachdb/cockroach/pull/10055/files#diff-542aa8b21b245d1144c920577333ceed]. -In the meantime, the `planner` [looks at the type of the statement](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L248) at the top of the parse tree and, for each statement type, invokes a specific method that builds the tree. For example, the tree for a `SELECT` statement is produced by [`planner.SelectClause()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L257). Notice how different aspects of a `SELECT` statement are handled there: a `scanNode` is created ([`selectNode.initFrom()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L336)->...-> [`planner.Scan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/data_source.go#L441)) to scan a table, a `WHERE` clause is transformed into an expression and [assigned to a `selectNode.filter`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L383), an `ORDER BY` clause is [turned into a `sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L296), etc. In the end, a [`selectTopNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L320) is produced, which in fact is a tree of a `groupNode`, a `windowNode`, a `sortNode`, a `distinctNode` and a `selectNode` wrapping a `scanNode` acting as an original data source). +In the meantime, the `planner` [looks at the type of the statement](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/plan.go#L248) at the top of the AST and, for each statement type, invokes a specific method that builds the execution plan. For example, the tree for a `SELECT` statement is produced by [`planner.SelectClause()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L257). Notice how different aspects of a `SELECT` statement are handled there: a `scanNode` is created ([`selectNode.initFrom()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L336)->...-> [`planner.Scan()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/data_source.go#L441)) to scan a table, a `WHERE` clause is transformed into an expression and [assigned to a `selectNode.filter`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L383), an `ORDER BY` clause is [turned into a `sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L296), etc. In the end, a [`selectTopNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L320) is produced, which in fact is a tree of a `groupNode`, a `windowNode`, a `sortNode`, a `distinctNode` and a `selectNode` wrapping a `scanNode` acting as an original data source). -To make this notion of the AST more real, let's see one actually "rendered" by the `EXPLAIN` statement: +Finally, the execution plan is simplified and optimized somewhat; this includes removing the `selectTopNode` wrappers and eliding all no-op intermediate nodes. + +To make this notion of the execution plan more concrete, let's see one actually "rendered" by the `EXPLAIN` statement: ```sql root@:26257> create table customers( @@ -52,28 +54,49 @@ root@:26257> insert into customers values ('Apple', '1 Infinite Loop', 'CA'), ('IBM', '1 New Orchard Road ', 'NY'); -root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE address like '%Infinite%' ORDER BY state; -+-------+---------------+----------+---------------------------+ -| Level | Type | Field | Description | -+-------+---------------+----------+---------------------------+ -| 0 | sort | | | -| 0 | | order | +state | -| 1 | render/filter | | | -| 1 | | render 0 | name | -| 1 | | render 1 | address | -| 1 | | render 2 | state | -| 2 | scan | | | -| 2 | | table | customers@primary | -| 2 | | spans | ALL | -| 2 | | filter | address LIKE '%Infinite%' | -+-------+---------------+----------+---------------------------+ +root@:26257> EXPLAIN(EXPRS,NOEXPAND,NOOPTIMIZE,METADATA) SELECT * FROM customers WHERE address like '%Infinite%' ORDER BY state; ++-------+--------+----------+---------------------------+------------------------+----------+ +| Level | Type | Field | Description | Columns | Ordering | ++-------+--------+----------+---------------------------+------------------------+----------+ +| 0 | select | | | (name, address, state) | +state | +| 1 | nosort | | | (name, address, state) | +state | +| 1 | | order | +@3 | | | +| 1 | render | | | (name, address, state) | | +| 1 | | render 0 | name | | | +| 1 | | render 1 | address | | | +| 1 | | render 2 | state | | | +| 2 | filter | | | (name, address, state) | | +| 2 | | filter | address LIKE '%Infinite%' | | | +| 3 | scan | | | (name, address, state) | | +| 3 | | table | customers@primary | | | ++-------+--------+----------+---------------------------+------------------------+----------+ ``` -The `selectTopNode` node is not represented in the `EXPLAIN` output, but you can see data being produced by a `scanNode`, being filtered by a `selectNode` (presented as "render/filter"), and then sorted by a `sortNode`. +You can see data being produced by a `scanNode`, being filtered by a +`selectNode` (presented as "render"), and then sorted by a `sortNode` +(presented as "nosort", because we've turned off order analysis with +NOEXPAND and the sort node doesn't know yet whether sorting is +needed), wrapped in a `selectTopNode` (presented as "select"). + +With plan simplification turned on, the EXPLAIN output becomes: + +``` +root@:26257> EXPLAIN (EXPRS,METADATA) SELECT * FROM customers WHERE address LIKE '%Infinite%' ORDER BY state; ++-------+------+--------+---------------------------+------------------------+--------------+ +| Level | Type | Field | Description | Columns | Ordering | ++-------+------+--------+---------------------------+------------------------+--------------+ +| 0 | sort | | | (name, address, state) | +state | +| 0 | | order | +state | | | +| 1 | scan | | | (name, address, state) | +name,unique | +| 1 | | table | customers@primary | | | +| 1 | | spans | ALL | | | +| 1 | | filter | address LIKE '%Infinite%' | | | ++-------+------+--------+---------------------------+------------------------+--------------+ +``` #### Expressions -A special class of parser (sub)trees are [`parser.Expr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L25), representing various "expressions" - parts of statements that can occur in many various places - in a `WHERE` clause, in a `LIMIT` clause, in an `ORDER BY` clause, as the projections of a `SELECT` statement, etc. Expressions (i.e. various parser nodes) implement a common interface so that a [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) can be applied to them for different transformations and analysis. Regardless of where they appear in the query, all expressions need some common processing (e.g. names appearing in them need to be resolved to columns from data sources). These tasks are performed by [`planner.analyzeExpr`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/analyze.go#L1596). Each `planNode` is responsible for calling `analyzeExpr` on the expressions it contains, usually at node creation time (again, we hope to unify our AST processing more in the future). +A subset of ASTs are [`parser.Expr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L25), representing various "expressions" - parts of statements that can occur in many various places - in a `WHERE` clause, in a `LIMIT` clause, in an `ORDER BY` clause, as the projections of a `SELECT` statement, etc. Expressions nodes implement a common interface so that a [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) can be applied to them for different transformations and analysis. Regardless of where they appear in the query, all expressions need some common processing (e.g. names appearing in them need to be resolved to columns from data sources). These tasks are performed by [`planner.analyzeExpr`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/analyze.go#L1596). Each `planNode` is responsible for calling `analyzeExpr` on the expressions it contains, usually at node creation time (again, we hope to unify our execution planning more in the future). `planner.analyzeExpr` performs the following tasks: 1. resolving names (the `colA` in `select 3 * colA from MyTable` needs to be replaced by an index within the rows produced by the underlying data source (usually a `scanNode`)) @@ -81,14 +104,14 @@ A special class of parser (sub)trees are [`parser.Expr`](https://github.com/cock 1. type checking (see [the typing RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/typing.md) for an in-depth discussion of Cockroach's typing system). 1. constant folding (e.g. `1 + 2` becomes `3`): we perform exact arithmetic using [the same library used by the Go compiler](https://golang.org/pkg/go/constant/) and classify all the constants into two categories: [numeric - `NumVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L108) or [string-like - `StrVal`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L281). These representations of the constants are smart enough to figure out the set of types that can represent the value (e.g. [`NumVal.AvailableTypes`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L188) - `5` can be represented as `int, decimal or float`, but `5.4` can only be represented as `decimal or float`) This will come in useful in the next step. 1. type inference and propagation: this analysis phase assigns a result type to an expression, and in the process types all the sub-expressions. Typed expressions are represented by the [`TypedExpr`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/expr.go#L48) interface, and they are finally able to evaluate themselves to a result value through the `Eval` method. The typing algorithm is presented in detail in the typing RFC: the general idea is that it's a recursive algorithm operating on sub-expressions; each level of the recursion may take a hint about the desired outcome, and each expression node takes that hint into consideration while weighting what options it has. In the absence of a hint, there's also a set of "natural typing" rules. For example, a `NumVal` described above [checks](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/parser/constant.go#L61) whether the hint is compatible with its list of possible types. This process also deals with [`overload resolution`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/parser/type_check.go#L371) for function calls and operators. -1. replacing sub-query syntax nodes by a `sql.subquery` AST node. +1. replacing sub-query syntax nodes by a `sql.subquery` execution plan node. 1. resolving names (the `colA` in `select colA from MyTable` needs to be replaced by an index within the rows produced by the underlying data source (usually a `scanNode`)). -A note about sub-queries: consider a query like `select * from Employees where DepartmentID in (select DepartmentID from Departments where NumEmployees > 100)`. The query on the `Departments` table is called a sub-query. Subqueries are recognized and replaced with an AST by [`subqueryVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L294). The subqueries are then run and replaced by their results through the [`subqueryPlanVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L194). This is usually done by various top-level nodes when they start execution (e.g. [`selectNode.Start()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L140)). +A note about sub-queries: consider a query like `select * from Employees where DepartmentID in (select DepartmentID from Departments where NumEmployees > 100)`. The query on the `Departments` table is called a sub-query. Subqueries are recognized and replaced with an execution node by [`subqueryVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L294). The subqueries are then run and replaced by their results through the [`subqueryPlanVisitor`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/subquery.go#L194). This is usually done by various top-level nodes when they start execution (e.g. [`selectNode.Start()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L140)). ### Notable `planNodes` -As hinted throughout, AST nodes are responsible for executing parts of a query. Each one consumes data from lower-level nodes, performs some logic, and feeds data into a higher-level one. +As hinted throughout, execution plan nodes are responsible for executing parts of a query. Each one consumes data from lower-level nodes, performs some logic, and feeds data into a higher-level one. After being constructed, their main methods are [`Start`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L142), which initiates the processing, and [`Next`](https://github.com/cockroachdb/cockroach/blob/a83c960a0547720a3179e05eb54ea5b67d107d10/pkg/sql/plan.go#L149), which is called repeatedly to produce the next row. To tie this to the [SQL Executor](#SQL Executor) section above, [`executor.execClassic()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/executor.go#L1251), the method responsible for executing one statement, calls `plan.Next()` repeatedly and accumulates the results. @@ -100,44 +123,34 @@ SELECT * FROM customers WHERE State LIKE 'C%' AND strpos(address, 'Infinite') != as a slightly contrived example. This is supposed to return customers from states starting with "N" and whose address contains the string "Infinite". To get excited, let's see the query plan for this statement: ```sql root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE State LIKE 'C%' and strpos(address, 'Infinite') != 0 order by name; -+-------+---------------+----------+----------------------------------+ -| Level | Type | Field | Description | -+-------+---------------+----------+----------------------------------+ -| 0 | sort | | | -| 0 | | order | +name | -| 1 | render/filter | | | -| 1 | | render 0 | name | -| 1 | | render 1 | address | -| 1 | | render 2 | state | -| 2 | index-join | | | -| 3 | scan | | | -| 3 | | table | customers@SI | -| 3 | | spans | /"C"-/"D" | -| 3 | | filter | state LIKE 'C%' | -| 3 | scan | | | -| 3 | | table | customers@primary | -| 3 | | filter | strpos(address, 'Infinite') != 0 | -+-------+---------------+----------+----------------------------------+ ++-------+------------+--------+----------------------------------+ +| Level | Type | Field | Description | ++-------+------------+--------+----------------------------------+ +| 0 | sort | | | +| 0 | | order | +name | +| 1 | index-join | | | +| 2 | scan | | | +| 2 | | table | customers@SI | +| 2 | | spans | /"C"-/"D" | +| 2 | | filter | state LIKE 'C%' | +| 2 | scan | | | +| 2 | | table | customers@primary | +| 2 | | filter | strpos(address, 'Infinite') != 0 | ++-------+------------+--------+----------------------------------+ ``` -So the AST produced for this query, from top (highest-level) to bottom, looks like: +So the plan produced for this query, from top (highest-level) to bottom, looks like: ``` -selectTopNode -> sortNode -> selectNode -> indexJoinNode -> scanNode (index) - -> scanNode (PK) +sortNode -> indexJoinNode -> scanNode (index) + -> scanNode (PK) ``` -Before we inspect the nodes in turn, one thing deserves explanation: how did the `indexJoinNode` (which indicates that the query is going to use the "SI" index) come to be? The fact that this query uses an index is not apparent in the syntactical structure of the `SELECT` statement, and so this AST is not simply a product of the mechanical tree building hinted to above. Indeed, there's a step that we haven't mentioned before: ["plan expansion"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L28). Among other things, this step performs "index selection" (more information about the algorithms currently used for index selection can be found in [Radu's blog post](https://www.cockroachlabs.com/blog/index-selection-cockroachdb-2/)). We're looking for indexes that can be scanned to efficiently retrieve only rows that match (part of) the filter. In our case, the "SI" index (indexing the state) can be scanned to efficiently retrieve only the rows that are candidates for satisfying the `state LIKE 'C%'` expression (in an ecstasy to agony moment, we see that our index selection / expression normalization code [is smart enough](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/analyze.go#L1436) to infer that `state LIKE 'C%'` implies `state >= 'C' AND state < 'D'`, but is not smart enough to infer that the two expressions are in fact equivalent and thus the filter can be elided altogether). We won't go into plan expansion or index selection here, but the index selection process happens [in the expansion of the `SelectNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L283) and, as a byproduct, produces `indexJoinNode`s configured with the index spans to be scanned. +Before we inspect the nodes in turn, one thing deserves explanation: how did the `indexJoinNode` (which indicates that the query is going to use the "SI" index) come to be? The fact that this query uses an index is not apparent in the syntactical structure of the `SELECT` statement, and so this plan is not simply a product of the mechanical tree building hinted to above. Indeed, there's a step that we haven't mentioned before: ["plan expansion"](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L28). Among other things, this step performs "index selection" (more information about the algorithms currently used for index selection can be found in [Radu's blog post](https://www.cockroachlabs.com/blog/index-selection-cockroachdb-2/)). We're looking for indexes that can be scanned to efficiently retrieve only rows that match (part of) the filter. In our case, the "SI" index (indexing the state) can be scanned to efficiently retrieve only the rows that are candidates for satisfying the `state LIKE 'C%'` expression (in an ecstasy to agony moment, we see that our index selection / expression normalization code [is smart enough](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/analyze.go#L1436) to infer that `state LIKE 'C%'` implies `state >= 'C' AND state < 'D'`, but is not smart enough to infer that the two expressions are in fact equivalent and thus the filter can be elided altogether). We won't go into plan expansion or index selection here, but the index selection process happens [in the expansion of the `SelectNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/expand_plan.go#L283) and, as a byproduct, produces `indexJoinNode`s configured with the index spans to be scanned. Now let's see how these `planNode`s run: -1. [`selectTopNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select_top.go#L21): -The `selectTopNode` is a bit of an artificial node; it does not participate in running a query, instead it serves as a container linking together a bunch of other nodes (sorting, grouping, window functions, limits, etc.) at query optimization time. It is replaced by one of those nodes (e.g. the `sortNode` in our example query) once optimization is done. - 1. [`sortNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L31): The `sortNode` sorts the rows produced by its child and corresponds to the `ORDER BY` SQL clause. The [constructor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L60) has a bunch of logic related to the quirky rules for name resolution from SQL92/99. Another interesting fact is that, if we're sorting by a non-trivial expression (e.g. `SELECT a, b ... ORDER BY a + b`), we need the `a + b` values (for every row) to be produced by a lower-level node. This is achieved through a patter that's also present in other node: the lower node capable of evaluating expressions and rendering their results is the `selectNode`; the `sortNode` ctor checks if the expressions it needs are already rendered by that node and, if they aren't, asks for them to be produced through the [`selectNode.addOrMergeRenders()`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L206) method. The actual sorting is performed in the `sortNode.Next()` method. The first time it is called, [it consumes all the data produced by the child node](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/sort.go#L359) and accumulates it into `n.sortStrategy` (an interface hiding multiple sorting algorithms). When the last row is consumed, `n.sortStrategy.Finish()` is called, at which time the sorting algorithm finishes its processing. Subsequent calls to `sortNode.Next()` simply iterate through the results of sorting algorithm. -1. [`selectNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L28) (represented as "render/filter" in the output of the `EXPLAIN` above) encapsulates the core logic of a select statement: retrieving filtered results from -the SQL sources. The "targets" (columns, constants, or generally expressions whose results need to be produced for every row) are initialized by [`selectNode.initTargets`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L350). The `selectNode` is going to hang on to all the expressions it needs to evaluate in the context of every row - the targets and the filter. You can then see it evaluating these expressions (i.e. ultimately calling `TypedExpr.Eval()`) in its `Next()` implementation, [here for the filter](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L163) and [here for the targets](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/select.go#L169). Notice that, in the case of our query, the `selectNode` doesn't actually have a filter. Instead, the filtering is performed by the two `scanNode`s, see below. - 1. [`indexJoinNode`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L30): The `indexJoinNode` implements joining of results from an index with the rows of a table. It is used when an index can be used for a query, but it doesn't contain all the necessary columns; columns not available in the index need to be retrieved from the Primary Key (PK) key-values. The `indexJoinNode` sits on top of two scan nodes - one configured to scan the index, and one that is constantly reconfigured to do "point lookups" by PK. In the case of our query, we can see that the "SI" index is used to read a compact set of rows that match the "state" filter but, since it doesn't contain the "address" columns, the PK also needs to be used. Each index KV pair contains the primary key of the row, so there's enough information to do PK lookups. [`indexJoinNode.Next`](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L273) keeps reading rows from the index and, for each one, adds a spans to be read by the PK. Once enough such spans have been batched, they're all [read from the PK](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L257). As described in the section on [SQL rows to KV pairs](https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#data-mapping-between-the-sql-model-and-kv)) from the design doc, each SQL row is represented as a single KV pair in the indexes, but as multiple consecutive rows in the PK (represented by a "key span"). An interesting detail has to do with how filters are handled: note that the `state LIKE 'C%'` condition is evaluated by the index scan, and the `strpos(address, 'Infinite') != 0` condition is evaluated by the PK scan. This is nice because it means that we'll be filtering as much as we can on the index side and we'll be doing fewer expensive PK lookups. The code that figures out which conjunction is to be evaluated where is in `splitFilter()`, called by the [`indexJoinNode` ctor](https://github.com/cockroachdb/cockroach/blob/33c18ad1bcdb37ed6ed428b7527148977a8c566a/pkg/sql/index_join.go#L180). @@ -146,7 +159,7 @@ An interesting detail has to do with how filters are handled: note that the `sta /* Select the orders placed by each customer in the first year of membership. */ SELECT * FROM Orders o inner join Customers c ON o.CustomerID = c.ID WHERE Orders.amount > 10 AND Customers.State = 'NY' AND age(c.JoinDate, o.Date) < INTERVAL '1 year' ``` -is going to be compiled into two `scanNode`s, one for `Customers`, one for `Orders`. Each one of them can do the part of filtering that refers exclusively to their respective tables, and then the higher-level `selectNode` only needs to evaluate expressions that need data from both (i.e. `age(c.JoinDate, o.Date) < INTERVAL '1 year'`). +is going to be compiled into two `scanNode`s, one for `Customers`, one for `Orders`. Each one of them can do the part of filtering that refers exclusively to their respective tables, and then the higher-level `joinNode` only needs to evaluate expressions that need data from both (i.e. `age(c.JoinDate, o.Date) < INTERVAL '1 year'`). Let's continue downwards, looking at the structures that the `scanNode` uses for actually reading data.