Skip to content

Commit

Permalink
Auto merge of #44505 - nikomatsakis:lotsa-comments, r=steveklabnik
Browse files Browse the repository at this point in the history
rework the README.md for rustc and add other readmes

OK, so, long ago I committed to the idea of trying to write some high-level documentation for rustc. This has proved to be much harder for me to get done than I thought it would! This PR is far from as complete as I had hoped, but I wanted to open it so that people can give me feedback on the conventions that it establishes. If this seems like a good way forward, we can land it and I will open an issue with a good check-list of things to write (and try to take down some of them myself).

Here are the conventions I established on which I would like feedback.

**Use README.md files**. First off, I'm aiming to keep most of the high-level docs in `README.md` files, rather than entries on forge. My thought is that such files are (a) more discoverable than forge and (b) closer to the code, and hence can be edited in a single PR. However, since they are not *in the code*, they will naturally get out of date, so the intention is to focus on the highest-level details, which are least likely to bitrot. I've included a few examples of common functions and so forth, but never tried to (e.g.) exhaustively list the names of functions and so forth.
    - I would like to use the tidy scripts to try and check that these do not go out of date. Future work.

**librustc/README.md as the main entrypoint.** This seems like the most natural place people will look first. It lays out how the crates are structured and **is intended** to give pointers to the main data structures of the compiler (I didn't update that yet; the existing material is terribly dated).

**A glossary listing abbreviations and things.** It's much harder to read code if you don't know what some obscure set of letters like `infcx` stands for.

**Major modules each have their own README.md that documents the high-level idea.** For example, I wrote some stuff about `hir` and `ty`. Both of them have many missing topics, but I think that is roughly the level of depth that would be good. The idea is to give people a "feeling" for what the code does.

What is missing primarily here is lots of content. =) Here are some things I'd like to see:

- A description of what a QUERY is and how to define one
    - Some comments for `librustc/ty/maps.rs`
- An overview of how compilation proceeds now (i.e., the hybrid demand-driven and forward model) and how we would like to see it going in the future (all demand-driven)
- Some coverage of how incremental will work under red-green
- An updated list of the major IRs in use of the compiler (AST, HIR, TypeckTables, MIR) and major bits of interesting code (typeck, borrowck, etc)
- More advice on how to use `x.py`, or at least pointers to that
- Good choice for `config.toml`
- How to use `RUST_LOG` and other debugging flags (e.g., `-Zverbose`, `-Ztreat-err-as-bug`)
- Helpful conventions for `debug!` statement formatting

cc @rust-lang/compiler @mgattozzi
  • Loading branch information
bors committed Sep 19, 2017
2 parents 325ba23 + 638958b commit f60bc3a
Show file tree
Hide file tree
Showing 20 changed files with 2,571 additions and 1,757 deletions.
341 changes: 185 additions & 156 deletions src/librustc/README.md

Large diffs are not rendered by default.

119 changes: 119 additions & 0 deletions src/librustc/hir/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# Introduction to the HIR

The HIR -- "High-level IR" -- is the primary IR used in most of
rustc. It is a desugared version of the "abstract syntax tree" (AST)
that is generated after parsing, macro expansion, and name resolution
have completed. Many parts of HIR resemble Rust surface syntax quite
closely, with the exception that some of Rust's expression forms have
been desugared away (as an example, `for` loops are converted into a
`loop` and do not appear in the HIR).

This README covers the main concepts of the HIR.

### Out-of-band storage and the `Crate` type

The top-level data-structure in the HIR is the `Crate`, which stores
the contents of the crate currently being compiled (we only ever
construct HIR for the current crate). Whereas in the AST the crate
data structure basically just contains the root module, the HIR
`Crate` structure contains a number of maps and other things that
serve to organize the content of the crate for easier access.

For example, the contents of individual items (e.g., modules,
functions, traits, impls, etc) in the HIR are not immediately
accessible in the parents. So, for example, if had a module item `foo`
containing a function `bar()`:

```
mod foo {
fn bar() { }
}
```

Then in the HIR the representation of module `foo` (the `Mod`
stuct) would have only the **`ItemId`** `I` of `bar()`. To get the
details of the function `bar()`, we would lookup `I` in the
`items` map.

One nice result from this representation is that one can iterate
over all items in the crate by iterating over the key-value pairs
in these maps (without the need to trawl through the IR in total).
There are similar maps for things like trait items and impl items,
as well as "bodies" (explained below).

The other reason to setup the representation this way is for better
integration with incremental compilation. This way, if you gain access
to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
gain access to the contents of the function `bar()`. Instead, you only
gain access to the **id** for `bar()`, and you must invoke some
function to lookup the contents of `bar()` given its id; this gives us
a chance to observe that you accessed the data for `bar()` and record
the dependency.

### Identifiers in the HIR

Most of the code that has to deal with things in HIR tends not to
carry around references into the HIR, but rather to carry around
*identifier numbers* (or just "ids"). Right now, you will find four
sorts of identifiers in active use:

- `DefId`, which primarily name "definitions" or top-level items.
- You can think of a `DefId` as being shorthand for a very explicit
and complete path, like `std::collections::HashMap`. However,
these paths are able to name things that are not nameable in
normal Rust (e.g., impls), and they also include extra information
about the crate (such as its version number, as two versions of
the same crate can co-exist).
- A `DefId` really consists of two parts, a `CrateNum` (which
identifies the crate) and a `DefIndex` (which indixes into a list
of items that is maintained per crate).
- `HirId`, which combines the index of a particular item with an
offset within that item.
- the key point of a `HirId` is that it is *relative* to some item (which is named
via a `DefId`).
- `BodyId`, this is an absolute identifier that refers to a specific
body (definition of a function or constant) in the crate. It is currently
effectively a "newtype'd" `NodeId`.
- `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
- While these are still in common use, **they are being slowly phased out**.
- Since they are absolute within the crate, adding a new node
anywhere in the tree causes the node-ids of all subsequent code in
the crate to change. This is terrible for incremental compilation,
as you can perhaps imagine.

### HIR Map

Most of the time when you are working with the HIR, you will do so via
the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
the `hir::map` module). The HIR map contains a number of methods to
convert between ids of various kinds and to lookup data associated
with a HIR node.

For example, if you have a `DefId`, and you would like to convert it
to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
returns an `Option<NodeId>` -- this will be `None` if the def-id
refers to something outside of the current crate (since then it has no
HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
the definition.

Similarly, you can use `tcx.hir.find(n)` to lookup the node for a
`NodeId`. This returns a `Option<Node<'tcx>>`, where `Node` is an enum
defined in the map; by matching on this you can find out what sort of
node the node-id referred to and also get a pointer to the data
itself. Often, you know what sort of node `n` is -- e.g., if you know
that `n` must be some HIR expression, you can do
`tcx.hir.expect_expr(n)`, which will extract and return the
`&hir::Expr`, panicking if `n` is not in fact an expression.

Finally, you can use the HIR map to find the parents of nodes, via
calls like `tcx.hir.get_parent_node(n)`.

### HIR Bodies

A **body** represents some kind of executable code, such as the body
of a function/closure or the definition of a constant. Bodies are
associated with an **owner**, which is typically some kind of item
(e.g., a `fn()` or `const`), but could also be a closure expression
(e.g., `|x, y| x + y`). You can use the HIR map to find find the body
associated with a given def-id (`maybe_body_owned_by()`) or to find
the owner of a body (`body_owner_def_id()`).
4 changes: 4 additions & 0 deletions src/librustc/hir/map/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
HIR and convert between various forms of identifiers. See [the HIR README] for more information.

[the HIR README]: ../README.md
26 changes: 25 additions & 1 deletion src/librustc/hir/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -413,6 +413,10 @@ pub struct WhereEqPredicate {

pub type CrateConfig = HirVec<P<MetaItem>>;

/// The top-level data structure that stores the entire contents of
/// the crate currently being compiled.
///
/// For more details, see [the module-level README](README.md).
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
pub struct Crate {
pub module: Mod,
Expand Down Expand Up @@ -927,7 +931,27 @@ pub struct BodyId {
pub node_id: NodeId,
}

/// The body of a function or constant value.
/// The body of a function, closure, or constant value. In the case of
/// a function, the body contains not only the function body itself
/// (which is an expression), but also the argument patterns, since
/// those are something that the caller doesn't really care about.
///
/// # Examples
///
/// ```
/// fn foo((x, y): (u32, u32)) -> u32 {
/// x + y
/// }
/// ```
///
/// Here, the `Body` associated with `foo()` would contain:
///
/// - an `arguments` array containing the `(x, y)` pattern
/// - a `value` containing the `x + y` expression (maybe wrapped in a block)
/// - `is_generator` would be false
///
/// All bodies have an **owner**, which can be accessed via the HIR
/// map using `body_owner_def_id()`.
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
pub struct Body {
pub arguments: HirVec<Arg>,
Expand Down
23 changes: 22 additions & 1 deletion src/librustc/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,28 @@
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! The Rust compiler.
//! The "main crate" of the Rust compiler. This crate contains common
//! type definitions that are used by the other crates in the rustc
//! "family". Some prominent examples (note that each of these modules
//! has their own README with further details).
//!
//! - **HIR.** The "high-level (H) intermediate representation (IR)" is
//! defined in the `hir` module.
//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
//! defined in the `mir` module. This module contains only the
//! *definition* of the MIR; the passes that transform and operate
//! on MIR are found in `librustc_mir` crate.
//! - **Types.** The internal representation of types used in rustc is
//! defined in the `ty` module. This includes the **type context**
//! (or `tcx`), which is the central context during most of
//! compilation, containing the interners and other things.
//! - **Traits.** Trait resolution is implemented in the `traits` module.
//! - **Type inference.** The type inference code can be found in the `infer` module;
//! this code handles low-level equality and subtyping operations. The
//! type check pass in the compiler is found in the `librustc_typeck` crate.
//!
//! For a deeper explanation of how the compiler works and is
//! organized, see the README.md file in this directory.
//!
//! # Note
//!
Expand Down
165 changes: 165 additions & 0 deletions src/librustc/ty/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
# Types and the Type Context

The `ty` module defines how the Rust compiler represents types
internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
which is the central data structure in the compiler.

## The tcx and how it uses lifetimes

The `tcx` ("typing context") is the central data structure in the
compiler. It is the context that you use to perform all manner of
queries. The struct `TyCtxt` defines a reference to this shared context:

```rust
tcx: TyCtxt<'a, 'gcx, 'tcx>
// -- ---- ----
// | | |
// | | innermost arena lifetime (if any)
// | "global arena" lifetime
// lifetime of this reference
```

As you can see, the `TyCtxt` type takes three lifetime parameters.
These lifetimes are perhaps the most complex thing to understand about
the tcx. During Rust compilation, we allocate most of our memory in
**arenas**, which are basically pools of memory that get freed all at
once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
you know that it refers to arena-allocated data (or data that lives as
long as the arenas, anyhow).

We use two distinct levels of arenas. The outer level is the "global
arena". This arena lasts for the entire compilation: so anything you
allocate in there is only freed once compilation is basically over
(actually, when we shift to executing LLVM).

To reduce peak memory usage, when we do type inference, we also use an
inner level of arena. These arenas get thrown away once type inference
is over. This is done because type inference generates a lot of
"throw-away" types that are not particularly interesting after type
inference completes, so keeping around those allocations would be
wasteful.

Often, we wish to write code that explicitly asserts that it is not
taking place during inference. In that case, there is no "local"
arena, and all the types that you can access are allocated in the
global arena. To express this, the idea is to us the same lifetime
for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
confusing, we tend to use the name `'tcx` in such contexts. Here is an
example:

```rust
fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
// ---- ----
// Using the same lifetime here asserts
// that the innermost arena accessible through
// this reference *is* the global arena.
}
```

In contrast, if we want to code that can be usable during type inference, then you
need to declare a distinct `'gcx` and `'tcx` lifetime parameter:

```rust
fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
// ---- ----
// Using different lifetimes here means that
// the innermost arena *may* be distinct
// from the global arena (but doesn't have to be).
}
```

### Allocating and working with types

Rust types are represented using the `Ty<'tcx>` defined in the `ty`
module (not to be confused with the `Ty` struct from [the HIR]). This
is in fact a simple type alias for a reference with `'tcx` lifetime:

```rust
pub type Ty<'tcx> = &'tcx TyS<'tcx>;
```

[the HIR]: ../hir/README.md

You can basically ignore the `TyS` struct -- you will basically never
access it explicitly. We always pass it by reference using the
`Ty<'tcx>` alias -- the only exception I think is to define inherent
methods on types. Instances of `TyS` are only ever allocated in one of
the rustc arenas (never e.g. on the stack).

One common operation on types is to **match** and see what kinds of
types they are. This is done by doing `match ty.sty`, sort of like this:

```rust
fn test_type<'tcx>(ty: Ty<'tcx>) {
match ty.sty {
ty::TyArray(elem_ty, len) => { ... }
...
}
}
```

The `sty` field (the origin of this name is unclear to me; perhaps
structural type?) is of type `TypeVariants<'tcx>`, which is an enum
definined all of the different kinds of types in the compiler.

> NB: inspecting the `sty` field on types during type inference can be
> risky, as there are may be inference variables and other things to
> consider, or sometimes types are not yet known that will become
> known later.).
To allocate a new type, you can use the various `mk_` methods defined
on the `tcx`. These have names that correpond mostly to the various kinds
of type variants. For example:

```rust
let array_ty = tcx.mk_array(elem_ty, len * 2);
```

These methods all return a `Ty<'tcx>` -- note that the lifetime you
get back is the lifetime of the innermost arena that this `tcx` has
access to. In fact, types are always canonicalized and interned (so we
never allocate exactly the same type twice) and are always allocated
in the outermost arena where they can be (so, if they do not contain
any inference variables or other "temporary" types, they will be
allocated in the global arena). However, the lifetime `'tcx` is always
a safe approximation, so that is what you get back.

> NB. Because types are interned, it is possible to compare them for
> equality efficiently using `==` -- however, this is almost never what
> you want to do unless you happen to be hashing and looking for
> duplicates. This is because often in Rust there are multiple ways to
> represent the same type, particularly once inference is involved. If
> you are going to be testing for type equality, you probably need to
> start looking into the inference code to do it right.
You can also find various common types in the tcx itself by accessing
`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).

### Beyond types: Other kinds of arena-allocated data structures

In addition to types, there are a number of other arena-allocated data
structures that you can allocate, and which are found in this
module. Here are a few examples:

- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`.
- `TraitRef`, typically passed by value -- a **trait reference**
consists of a reference to a trait along with its various type
parameters (including `Self`), like `i32: Display` (here, the def-id
would reference the `Display` trait, and the substs would contain
`i32`).
- `Predicate` defines something the trait system has to prove (see `traits` module).

### Import conventions

Although there is no hard and fast rule, the `ty` module tends to be used like so:

```rust
use ty::{self, Ty, TyCtxt};
```

In particular, since they are so common, the `Ty` and `TyCtxt` types
are imported directly. Other types are often referenced with an
explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
choose to import a larger or smaller set of names explicitly.
7 changes: 4 additions & 3 deletions src/librustc/ty/context.rs
Original file line number Diff line number Diff line change
Expand Up @@ -793,9 +793,10 @@ impl<'tcx> CommonTypes<'tcx> {
}
}

/// The data structure to keep track of all the information that typechecker
/// generates so that so that it can be reused and doesn't have to be redone
/// later on.
/// The central data structure of the compiler. It stores references
/// to the various **arenas** and also houses the results of the
/// various **compiler queries** that have been performed. See [the
/// README](README.md) for more deatils.
#[derive(Copy, Clone)]
pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
gcx: &'a GlobalCtxt<'gcx>,
Expand Down
Loading

0 comments on commit f60bc3a

Please sign in to comment.