rework and vastly expand the MIR section #67

Merged
merged 4 commits into from
Feb 28, 2018
4 changes: 4 additions & 0 deletions src/SUMMARY.md
@@ -24,11 +24,15 @@
- [Type checking](./type-checking.md)
- [The MIR (Mid-level IR)](./mir.md)
- [MIR construction](./mir-construction.md)
- [MIR visitor and traversal](./mir-visitor.md)
- [MIR passes: getting the MIR for a function](./mir-passes.md)
- [MIR borrowck](./mir-borrowck.md)
- [MIR-based region checking (NLL)](./mir-regionck.md)
- [MIR optimizations](./mir-optimizations.md)
- [Constant evaluation](./const-eval.md)
- [miri const evaluator](./miri.md)
- [Parameter Environments](./param_env.md)
- [Generating LLVM IR](./trans.md)
- [Background material](./background.md)
- [Glossary](./glossary.md)
- [Code Index](./code-index.md)
122 changes: 122 additions & 0 deletions src/background.md
@@ -0,0 +1,122 @@
# Background topics
Review comment (Member): I think it would be good to add all of these topics to the glossary in brief form with links to the right spot in this chapter.


This section covers a number of common compiler terms that arise in
this guide. We try to give the general definition while providing some
Rust-specific context.

<a name="cfg"></a>

## What is a control-flow graph?

A control-flow graph is a common term from compilers. If you've ever
used a flow-chart, then the concept of a control-flow graph will be
pretty familiar to you. It's a representation of your program that
exposes the underlying control flow in a very clear way.

A control-flow graph is structured as a set of **basic blocks**
connected by edges. The key idea of a basic block is that it is a
sequence of statements that execute "together" -- that is, whenever you
branch to a basic block, you start at the first statement and then
execute all of the remaining statements in order. Only at the end of
the block is there the possibility of branching to more than one place
(in MIR, we call that final statement the **terminator**):

```
bb0: {
    statement0;
    statement1;
    statement2;
    ...
    terminator;
}
```

Many expressions that you are used to in Rust compile down to multiple
basic blocks. For example, consider an if statement:

```rust
a = 1;
if some_variable {
    b = 1;
} else {
    c = 1;
}
d = 1;
```

This would compile into four basic blocks:

```
BB0: {
    a = 1;
    if some_variable { goto BB1 } else { goto BB2 }
}

BB1: {
    b = 1;
    goto BB3;
}

BB2: {
    c = 1;
    goto BB3;
}

BB3: {
    d = 1;
    ...;
}
```

When using a control-flow graph, a loop simply appears as a cycle in
the graph, and the `break` keyword translates into a path out of that
cycle.
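
To make the "loop as a cycle" idea concrete, here is a minimal sketch of a control-flow graph as a data structure. The `BasicBlock` and `Terminator` types and the helper functions are hypothetical and heavily simplified for illustration; they are not rustc's MIR types.

```rust
// Hypothetical, simplified types for illustration only (not rustc's MIR).
enum Terminator {
    Goto(usize),
    // Conditional branch: (target if true, target if false).
    If(usize, usize),
    Return,
}

struct BasicBlock {
    statements: Vec<&'static str>,
    terminator: Terminator,
}

/// The blocks that `block` can branch to.
fn successors(block: &BasicBlock) -> Vec<usize> {
    match block.terminator {
        Terminator::Goto(target) => vec![target],
        Terminator::If(then_bb, else_bb) => vec![then_bb, else_bb],
        Terminator::Return => vec![],
    }
}

/// CFG for: `loop { if done { break; } work(); } after();`
fn loop_cfg() -> Vec<BasicBlock> {
    vec![
        // bb0: loop header; `break` is the edge bb0 -> bb2.
        BasicBlock { statements: vec![], terminator: Terminator::If(2, 1) },
        // bb1: loop body; the back edge bb1 -> bb0 closes the cycle.
        BasicBlock { statements: vec!["work()"], terminator: Terminator::Goto(0) },
        // bb2: code after the loop.
        BasicBlock { statements: vec!["after()"], terminator: Terminator::Return },
    ]
}

/// True if `target` can be reached from `start` by following at least one edge.
fn reaches(cfg: &[BasicBlock], start: usize, target: usize) -> bool {
    let mut visited = vec![false; cfg.len()];
    let mut stack = successors(&cfg[start]);
    while let Some(bb) = stack.pop() {
        if bb == target {
            return true;
        }
        if !visited[bb] {
            visited[bb] = true;
            stack.extend(successors(&cfg[bb]));
        }
    }
    false
}

fn main() {
    let cfg = loop_cfg();
    for (i, bb) in cfg.iter().enumerate() {
        println!("bb{}: {:?} -> {:?}", i, bb.statements, successors(bb));
    }
    // A block lies on a cycle exactly when it can reach itself.
    println!("bb0 on a cycle: {}", reaches(&cfg, 0, 0));
    println!("bb2 on a cycle: {}", reaches(&cfg, 2, 2));
}
```

In this sketch the loop never appears as a construct of its own: it is only visible as the fact that `bb0` can reach itself, and `break` is just the edge that leaves that cycle.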

<a name="dataflow"></a>

## What is a dataflow analysis?

*to be written*

<a name="quantified"></a>

## What is "universally quantified"? What about "existentially quantified"?

*to be written*

<a name="variance"></a>

## What is co- and contra-variance?
Review comment (Member): I think there is content from the nomicon that could be borrowed here...


Check out the subtyping chapter from the
[Rust Nomicon](https://doc.rust-lang.org/nomicon/subtyping.html).

<a name="free-vs-bound"></a>

## What is a "free region" or a "free variable"? What about "bound region"?

Let's describe the concepts of free vs bound in terms of program
variables, since that's the thing we're most familiar with.

- Consider this expression, which creates a closure: `|a, b| a + b`.
  Here, the `a` and `b` in `a + b` refer to the arguments
  that the closure will be given when it is called. We say that the
  `a` and `b` there are **bound** to the closure, and that the closure
  signature `|a, b|` is a **binder** for the names `a` and `b`
  (because any references to `a` or `b` within refer to the variables
  that it introduces).
- Consider this expression: `a + b`. In this expression, `a` and `b`
  refer to local variables that are defined *outside* of the
  expression. We say that those variables **appear free** in the
  expression (i.e., they are **free**, not **bound** (tied up)).

So there you have it: a variable "appears free" in some
expression/statement/whatever if it refers to something defined
outside of that expression/statement/whatever. Equivalently, we can
then refer to the "free variables" of an expression -- which is just
the set of variables that "appear free".

So what does this have to do with regions? Well, we can apply the
analogous concept to types and regions. For example, in the type `&'a
u32`, `'a` appears free. But in the type `for<'a> fn(&'a u32)`, it
does not, because the `for<'a>` is a binder for `'a`.
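
The distinction can be seen in ordinary Rust code. The following is a small illustrative sketch; the function names are made up for this example.

```rust
// Within the type `&'a u32`, `'a` appears free: the type itself does not
// declare it. Here it refers to the lifetime declared on the function.
fn read<'a>(x: &'a u32) -> &'a u32 {
    x
}

// In the type `for<'a> fn(&'a u32) -> u32`, the `for<'a>` is a binder,
// so `'a` is bound within the type.
fn apply(f: for<'a> fn(&'a u32) -> u32, value: u32) -> u32 {
    f(&value)
}

fn double(x: &u32) -> u32 {
    *x * 2
}

fn sum_with_closure() -> i32 {
    // `|a, b|` is a binder: the `a` and `b` in the body are bound.
    let add = |a: i32, b: i32| a + b;
    add(1, 2)
}

fn sum_with_free_variables() -> i32 {
    let a = 10;
    let b = 20;
    // Looking only at the expression `a + b`, both variables appear free:
    // they are defined outside of it.
    a + b
}

fn main() {
    println!("{}", read(&7));
    println!("{}", apply(double, 21));
    println!("{}", sum_with_closure());
    println!("{}", sum_with_free_variables());
}
```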
12 changes: 12 additions & 0 deletions src/glossary.md
@@ -6,33 +6,44 @@ The compiler uses a number of...idiosyncratic abbreviations and things. This glo
Term | Meaning
------------------------|--------
AST | the abstract syntax tree produced by the syntax crate; reflects user syntax very closely.
binder | a "binder" is a place where a variable or type is declared; for example, the `<T>` is a binder for the generic type parameter `T` in `fn foo<T>(..)`, and `|a| ...` is a binder for the parameter `a`. See [the background chapter for more](./background.html#free-vs-bound)
bound variable | a "bound variable" is one that is declared within an expression/term. For example, the variable `a` is bound within the closure expression `|a| a * 2`. See [the background chapter for more](./background.html#free-vs-bound)
codegen unit | when we produce LLVM IR, we group the Rust code into a number of codegen units. Each of these units is processed by LLVM independently from one another, enabling parallelism. They are also the unit of incremental re-use.
completeness | completeness is a technical term in type theory. Completeness means that every type-safe program also type-checks. Having both soundness and completeness is very hard, and usually soundness is more important. (see "soundness").
control-flow graph | a representation of the control-flow of a program; see [the background chapter for more](./background.html#cfg)
cx | we tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc.
DAG | a directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](incremental-compilation.html))
data-flow analysis | a static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow)
DefId | an index identifying a definition (see `librustc/hir/def_id.rs`). Uniquely identifies a `DefPath`.
free variable | a "free variable" is one that is not bound within an expression or term; see [the background chapter for more](./background.html#free-vs-bound)
'gcx | the lifetime of the global arena ([see more](ty.html))
generics | the set of generic type parameters defined on a type or item
HIR | the High-level IR, created by lowering and desugaring the AST ([see more](hir.html))
HirId | identifies a particular node in the HIR by combining a def-id with an "intra-definition offset".
HIR Map | The HIR map, accessible via tcx.hir, allows you to quickly navigate the HIR and convert between various forms of identifiers.
ICE | internal compiler error. When the compiler crashes.
ICH | incremental compilation hash. ICHs are used as fingerprints for things such as HIR and crate metadata, to check if changes have been made. This is useful in incremental compilation to see if part of a crate has changed and should be recompiled.
inference variable | when doing type or region inference, an "inference variable" is a kind of special type/region that represents what you are trying to infer. Think of X in algebra. For example, if we are trying to infer the type of a variable in a program, we create an inference variable to represent that unknown type.
infcx | the inference context (see `librustc/infer`)
IR | Intermediate Representation. A general term in compilers. During compilation, the code is transformed from raw source (UTF-8 text) to various IRs. In Rust, these are primarily HIR, MIR, and LLVM IR. Each IR is well-suited for some set of computations. For example, MIR is well-suited for the borrow checker, and LLVM IR is well-suited for codegen because LLVM accepts it.
local crate | the crate currently being compiled.
LTO | Link-Time Optimizations. A set of optimizations offered by LLVM that occur just before the final binary is linked. These include optimizations such as removing functions that are never used in the final program. _ThinLTO_ is a variant of LTO that aims to be a bit more scalable and efficient, but possibly sacrifices some optimizations. You may also read issues in the Rust repo about "FatLTO", which is the loving nickname given to non-Thin LTO. LLVM documentation: [here][lto] and [here][thinlto]
[LLVM] | (actually not an acronym :P) an open-source compiler backend. It accepts LLVM IR and outputs native binaries. Various languages (e.g. Rust) can then implement a compiler front-end that outputs LLVM IR and use LLVM to compile to all the platforms LLVM supports.
MIR | the Mid-level IR that is created after type-checking for use by borrowck and trans ([see more](./mir.html))
miri | an interpreter for MIR used for constant evaluation ([see more](./miri.html))
newtype | a "newtype" is a wrapper around some other type (e.g., `struct Foo(T)` is a "newtype" for `T`). This is commonly used in Rust to give a stronger type for indices.
NLL | [non-lexical lifetimes](./mir-regionck.html), an extension to Rust's borrowing system that bases it on the control-flow graph.
node-id or NodeId | an index identifying a particular node in the AST or HIR; gradually being phased out and replaced with `HirId`.
obligation | something that must be proven by the trait system ([see more](trait-resolution.html))
promoted constants | constants extracted from a function and lifted to static scope; see [this section](./mir.html#promoted) for more details.
provider | the function that executes a query ([see more](query.html))
quantified | in math or logic, existential and universal quantification are used to ask questions like "is there any type T for which this is true?" or "is this true for all types T?"; see [the background chapter for more](./background.html#quantified)
query | perhaps some sub-computation during compilation ([see more](query.html))
region | another term for "lifetime" often used in the literature and in the borrow checker.
sess | the compiler session, which stores global data used throughout compilation
side tables | because the AST and HIR are immutable once created, we often carry extra information about them in the form of hashtables, indexed by the id of a particular node.
sigil | like a keyword but composed entirely of non-alphanumeric tokens. For example, `&` is a sigil for references.
skolemization | a way of handling subtyping around "for-all" types (e.g., `for<'a> fn(&'a u32)`) as well as solving higher-ranked trait bounds (e.g., `for<'a> T: Trait<'a>`). See [the chapter on skolemization and universes](./mir-regionck.html#skol) for more details.
soundness | soundness is a technical term in type theory. Roughly, if a type system is sound, then if a program type-checks, it is type-safe; i.e. I can never (in safe Rust) force a value into a variable of the wrong type. (see "completeness").
span | a location in the user's source code, used for error reporting primarily. These are like a file-name/line-number/column tuple on steroids: they carry a start/end point, and also track macro expansions and compiler desugaring. All while being packed into a few bytes (really, it's an index into a table). See the Span datatype for more.
substs | the substitutions for a given generic type or item (e.g. the `i32`, `u32` in `HashMap<i32, u32>`)
@@ -43,6 +54,7 @@ token | the smallest unit of parsing. Tokens are produced aft
trans | the code to translate MIR into LLVM IR.
trait reference | a trait and values for its type parameters ([see more](ty.html)).
ty | the internal representation of a type ([see more](ty.html)).
variance | variance determines how changes to a generic type/lifetime parameter affect subtyping; for example, if `T` is a subtype of `U`, then `Vec<T>` is a subtype of `Vec<U>` because `Vec` is *covariant* in its generic parameter. See [the background chapter for more](./background.html#variance).

[LLVM]: https://llvm.org/
[lto]: https://llvm.org/docs/LinkTimeOptimization.html
57 changes: 56 additions & 1 deletion src/mir-borrowck.md
@@ -1 +1,56 @@
# MIR borrowck
# MIR borrow check

The borrow check is Rust's "secret sauce" -- it is tasked with
enforcing a number of properties:

- That all variables are initialized before they are used.
- That you can't move the same value twice.
- That you can't move a value while it is borrowed.
- That you can't access a place while it is mutably borrowed (except through the reference).
- That you can't mutate a place while it is shared borrowed.
- etc
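
As a rough illustration of the properties listed above, the sketch below shows code that the borrow checker accepts, with a commented-out line next to each step marking a variant it would reject. The function name is made up for this example.

```rust
fn borrow_rules_demo() -> (String, usize, i32) {
    // Variables must be initialized before they are used.
    let s: String;
    // println!("{}", s);   // rejected: `s` is not initialized yet
    s = String::from("hello");

    // The same value cannot be moved twice.
    let t = s;
    // let u = s;           // rejected: `s` was already moved into `t`

    // A value cannot be moved while it is borrowed.
    let r = &t;
    // let moved = t;       // rejected: `t` is still borrowed by `r`
    let len = r.len();

    // A place cannot be accessed while it is mutably borrowed, except
    // through the reference.
    let mut n = 1;
    let m = &mut n;
    // println!("{}", n);   // rejected: `n` is mutably borrowed by `m`
    *m += 1;

    // A place cannot be mutated while it is shared borrowed.
    let shared = &n;
    // n += 1;              // rejected: `shared` is still used below
    let seen = *shared;

    (t, len, seen)
}

fn main() {
    println!("{:?}", borrow_rules_demo());
}
```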

At the time of this writing, the code is in a state of transition. The
"main" borrow checker still works by processing [the HIR](hir.html),
but that is being phased out in favor of the MIR-based borrow checker.
Doing borrow checking on MIR has two key advantages:

- The MIR is *far* less complex than the HIR; the radical desugaring
helps prevent bugs in the borrow checker. (If you're curious, you
can see
[a list of bugs that the MIR-based borrow checker fixes here][47366].)
- Even more importantly, using the MIR enables ["non-lexical lifetimes"][nll],
which are regions derived from the control-flow graph.

[47366]: https://github.com/rust-lang/rust/issues/47366
[nll]: http://rust-lang.github.io/rfcs/2094-nll.html
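
A small sketch of what non-lexical lifetimes allow in practice: the shared borrow below ends at its last use rather than at the end of the enclosing scope, so the later mutation is accepted. The function name is made up for this example.

```rust
fn push_after_last_use() -> Vec<i32> {
    let mut v = vec![1, 2, 3];
    let first = &v[0]; // shared borrow of `v` starts here
    let copy = *first; // last use of `first`; under NLL the borrow ends here
    // With purely lexical lifetimes, `first` would keep `v` borrowed until
    // the end of the function and this call would be rejected.
    v.push(copy + 3);
    v
}

fn main() {
    println!("{:?}", push_after_last_use());
}
```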

### Major phases of the borrow checker

The borrow checker source is found in
[the `rustc_mir::borrow_check` module][b_c]. The main entry point is
the `mir_borrowck` query. At the time of this writing, MIR borrowck can operate
in several modes, but this text will describe only the mode when NLL is enabled
(what you get with `#![feature(nll)]`).

[b_c]: https://github.com/rust-lang/rust/tree/master/src/librustc_mir/borrow_check

The overall flow of the borrow checker is as follows:

- We first create a **local copy** C of the MIR. In the coming steps,
  we will modify this copy in place so that its types (and other data)
  refer to the new regions that we are computing.
- We then invoke `nll::replace_regions_in_mir` to modify this copy C.
  Among other things, this function will replace all of the regions in
  the MIR with fresh [inference variables](glossary.html).
  - (More details can be found in [the regionck section](./mir-regionck.html).)
- Next, we perform a number of [dataflow analyses](./background.html#dataflow)
  that compute what data is moved and when. The results of these analyses
  are needed to do both borrow checking and region inference.
- Using the move data, we can then compute the values of all the regions in the MIR.
  - (More details can be found in [the NLL section](./mir-regionck.html).)
- Finally, the borrow checker itself runs, taking as input (a) the
  results of move analysis and (b) the regions computed by the region
  checker. This allows us to figure out which loans are still in scope
  at any particular point.
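
To give a feel for the move-related dataflow step, here is a toy forward analysis that computes which variables may already have been moved on entry to each basic block. This is only a sketch under simplifying assumptions: the `Block` type, the field names, and the function names are hypothetical, variables are plain strings, and re-initialization is ignored. It is not how rustc's analyses are implemented.

```rust
use std::collections::BTreeSet;

// Hypothetical toy CFG node (not a rustc type).
struct Block {
    /// Variables moved out of somewhere in this block.
    moves: Vec<&'static str>,
    /// Indices of the blocks this block can branch to.
    successors: Vec<usize>,
}

/// For each block, the set of variables that may have been moved on entry.
/// The state only ever grows, so iterating until nothing changes terminates.
fn maybe_moved_on_entry(cfg: &[Block]) -> Vec<BTreeSet<&'static str>> {
    let mut entry: Vec<BTreeSet<&'static str>> = vec![BTreeSet::new(); cfg.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (i, block) in cfg.iter().enumerate() {
            // Exit state = entry state plus everything this block moves.
            let mut exit = entry[i].clone();
            exit.extend(block.moves.iter().copied());
            // Propagate the exit state along every outgoing edge.
            for &succ in &block.successors {
                for var in &exit {
                    if entry[succ].insert(*var) {
                        changed = true;
                    }
                }
            }
        }
    }
    entry
}

/// CFG for: `if cond { drop(x); } else { } use(x);`
fn example_cfg() -> Vec<Block> {
    vec![
        Block { moves: vec![], successors: vec![1, 2] }, // bb0: branch
        Block { moves: vec!["x"], successors: vec![3] }, // bb1: moves `x`
        Block { moves: vec![], successors: vec![3] },    // bb2: does nothing
        Block { moves: vec![], successors: vec![] },     // bb3: uses `x`
    ]
}

fn main() {
    let entry = maybe_moved_on_entry(&example_cfg());
    for (i, set) in entry.iter().enumerate() {
        println!("bb{}: maybe moved on entry = {:?}", i, set);
    }
}
```

In this example `x` is maybe-moved on entry to `bb3` because one of the two incoming paths moved it, which is the kind of fact a borrow checker needs in order to reject the later use of `x`.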
