Auto merge of #44505 - nikomatsakis:lotsa-comments, r=steveklabnik

rework the README.md for rustc and add other readmes

OK, so, long ago I committed to the idea of trying to write some high-level documentation for rustc. This has proved to be much harder for me to get done than I thought it would! This PR is far from as complete as I had hoped, but I wanted to open it so that people can give me feedback on the conventions that it establishes. If this seems like a good way forward, we can land it and I will open an issue with a good check-list of things to write (and try to take down some of them myself).

Here are the conventions I established on which I would like feedback.

**Use README.md files**. First off, I'm aiming to keep most of the high-level docs in `README.md` files, rather than entries on forge. My thought is that such files are (a) more discoverable than forge and (b) closer to the code, and hence can be edited in a single PR. However, since they are not *in the code*, they will naturally get out of date, so the intention is to focus on the highest-level details, which are least likely to bitrot. I've included a few examples of common functions and so forth, but never tried to (e.g.) exhaustively list the names of functions and so forth.

- I would like to use the tidy scripts to try and check that these do not go out of date. Future work.

**librustc/README.md as the main entrypoint.** This seems like the most natural place people will look first. It lays out how the crates are structured and **is intended** to give pointers to the main data structures of the compiler (I didn't update that yet; the existing material is terribly dated).

**A glossary listing abbreviations and things.** It's much harder to read code if you don't know what some obscure set of letters like `infcx` stands for.

**Major modules each have their own README.md that documents the high-level idea.** For example, I wrote some stuff about `hir` and `ty`. Both of them have many missing topics, but I think that is roughly the level of depth that would be good. The idea is to give people a "feeling" for what the code does.

What is missing primarily here is lots of content. =) Here are some things I'd like to see:

- A description of what a QUERY is and how to define one
- Some comments for `librustc/ty/maps.rs`
- An overview of how compilation proceeds now (i.e., the hybrid demand-driven and forward model) and how we would like to see it going in the future (all demand-driven)
- Some coverage of how incremental will work under red-green
- An updated list of the major IRs in use of the compiler (AST, HIR, TypeckTables, MIR) and major bits of interesting code (typeck, borrowck, etc)
- More advice on how to use `x.py`, or at least pointers to that
- Good choice for `config.toml`
- How to use `RUST_LOG` and other debugging flags (e.g., `-Zverbose`, `-Ztreat-err-as-bug`)
- Helpful conventions for `debug!` statement formatting

cc @rust-lang/compiler @mgattozzi
rust-lang · Sep 19, 2017 · f60bc3a · f60bc3a
2 parents 325ba23 + 638958b
commit f60bc3a
Showing 20 changed files with 2,571 additions and 1,757 deletions.
diff --git a/src/librustc/README.md b/src/librustc/README.md
diff --git a/src/librustc/hir/README.md b/src/librustc/hir/README.md
@@ -0,0 +1,119 @@
+# Introduction to the HIR
+
+The HIR -- "High-level IR" -- is the primary IR used in most of
+rustc. It is a desugared version of the "abstract syntax tree" (AST)
+that is generated after parsing, macro expansion, and name resolution
+have completed. Many parts of HIR resemble Rust surface syntax quite
+closely, with the exception that some of Rust's expression forms have
+been desugared away (as an example, `for` loops are converted into a
+`loop` and do not appear in the HIR).
+
+This README covers the main concepts of the HIR.
+
+### Out-of-band storage and the `Crate` type
+
+The top-level data-structure in the HIR is the `Crate`, which stores
+the contents of the crate currently being compiled (we only ever
+construct HIR for the current crate). Whereas in the AST the crate
+data structure basically just contains the root module, the HIR
+`Crate` structure contains a number of maps and other things that
+serve to organize the content of the crate for easier access.
+
+For example, the contents of individual items (e.g., modules,
+functions, traits, impls, etc) in the HIR are not immediately
+accessible in the parents. So, for example, if you had a module item `foo`
+containing a function `bar()`:
+
+```
+mod foo {
+  fn bar() { }
+}
+```
+
+Then in the HIR the representation of module `foo` (the `Mod`
+struct) would have only the **`ItemId`** `I` of `bar()`. To get the
+details of the function `bar()`, we would look up `I` in the
+`items` map.
+
+One nice result from this representation is that one can iterate
+over all items in the crate by iterating over the key-value pairs
+in these maps (without the need to trawl through the IR in total).
+There are similar maps for things like trait items and impl items,
+as well as "bodies" (explained below).
+
+The other reason to set up the representation this way is for better
+integration with incremental compilation. This way, if you gain access
+to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
+gain access to the contents of the function `bar()`. Instead, you only
+gain access to the **id** for `bar()`, and you must invoke some
+function to look up the contents of `bar()` given its id; this gives us
+a chance to observe that you accessed the data for `bar()` and record
+the dependency.
+
+### Identifiers in the HIR
+
+Most of the code that has to deal with things in HIR tends not to
+carry around references into the HIR, but rather to carry around
+*identifier numbers* (or just "ids"). Right now, you will find four
+sorts of identifiers in active use:
+
+- `DefId`, which primarily names "definitions" or top-level items.
+  - You can think of a `DefId` as being shorthand for a very explicit
+    and complete path, like `std::collections::HashMap`. However,
+    these paths are able to name things that are not nameable in
+    normal Rust (e.g., impls), and they also include extra information
+    about the crate (such as its version number, as two versions of
+    the same crate can co-exist).
+  - A `DefId` really consists of two parts, a `CrateNum` (which
+    identifies the crate) and a `DefIndex` (which indexes into a list
+    of items that is maintained per crate).
+- `HirId`, which combines the index of a particular item with an
+  offset within that item.
+  - The key point of a `HirId` is that it is *relative* to some item (which is named
+    via a `DefId`).
+- `BodyId`, which is an absolute identifier that refers to a specific
+  body (definition of a function or constant) in the crate. It is currently
+  effectively a "newtype'd" `NodeId`.
+- `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
+  - While these are still in common use, **they are being slowly phased out**.
+  - Since they are absolute within the crate, adding a new node
+    anywhere in the tree causes the node-ids of all subsequent code in
+    the crate to change. This is terrible for incremental compilation,
+    as you can perhaps imagine.
+
+### HIR Map
+
+Most of the time when you are working with the HIR, you will do so via
+the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
+the `hir::map` module). The HIR map contains a number of methods to
+convert between ids of various kinds and to look up data associated
+with a HIR node.
+
+For example, if you have a `DefId`, and you would like to convert it
+to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
+returns an `Option<NodeId>` -- this will be `None` if the def-id
+refers to something outside of the current crate (since then it has no
+HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
+the definition.
+
+Similarly, you can use `tcx.hir.find(n)` to look up the node for a
+`NodeId`. This returns an `Option<Node<'tcx>>`, where `Node` is an enum
+defined in the map; by matching on this you can find out what sort of
+node the node-id referred to and also get a pointer to the data
+itself. Often, you know what sort of node `n` is -- e.g., if you know
+that `n` must be some HIR expression, you can do
+`tcx.hir.expect_expr(n)`, which will extract and return the
+`&hir::Expr`, panicking if `n` is not in fact an expression.
+
+Finally, you can use the HIR map to find the parents of nodes, via
+calls like `tcx.hir.get_parent_node(n)`.
+
+### HIR Bodies
+
+A **body** represents some kind of executable code, such as the body
+of a function/closure or the definition of a constant. Bodies are
+associated with an **owner**, which is typically some kind of item
+(e.g., a `fn()` or `const`), but could also be a closure expression
+(e.g., `|x, y| x + y`). You can use the HIR map to find the body
+associated with a given def-id (`maybe_body_owned_by()`) or to find
+the owner of a body (`body_owner_def_id()`).
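The out-of-band storage described in the HIR README above can be illustrated with a toy model. This is a minimal sketch and not rustc's actual code. The names `ItemId`, `Mod`, `Item`, and `Crate` echo the README, but the field layout, the `ItemKind` enum, and the helpers `example_crate` and `child_names` are assumptions made up for illustration. The point is that a module holds only the ids of its children, and all item contents sit in one crate-wide map that every lookup has to go through.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an item id (hypothetical representation).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ItemId(pub u32);

/// A module stores only the ids of its children, never their contents.
#[derive(Debug)]
pub struct Mod {
    pub item_ids: Vec<ItemId>,
}

/// Hypothetical, heavily simplified set of item kinds.
#[derive(Debug)]
pub enum ItemKind {
    Mod(Mod),
    Fn,
}

#[derive(Debug)]
pub struct Item {
    pub name: &'static str,
    pub kind: ItemKind,
}

/// All item contents live out-of-band in one crate-wide map.
#[derive(Debug, Default)]
pub struct Crate {
    pub items: BTreeMap<ItemId, Item>,
}

impl Crate {
    /// Every access to an item's contents goes through this lookup; a real
    /// implementation could record a dependency here.
    pub fn item(&self, id: ItemId) -> &Item {
        &self.items[&id]
    }
}

/// Builds the toy crate for `mod foo { fn bar() { } }`.
pub fn example_crate() -> Crate {
    let mut krate = Crate::default();
    let foo = Mod { item_ids: vec![ItemId(1)] };
    krate.items.insert(ItemId(0), Item { name: "foo", kind: ItemKind::Mod(foo) });
    krate.items.insert(ItemId(1), Item { name: "bar", kind: ItemKind::Fn });
    krate
}

/// Names of the items directly inside the module with the given id.
pub fn child_names(krate: &Crate, id: ItemId) -> Vec<&'static str> {
    match &krate.item(id).kind {
        ItemKind::Mod(m) => m.item_ids.iter().map(|&child| krate.item(child).name).collect(),
        ItemKind::Fn => Vec::new(),
    }
}
```

With this layout, visiting every item in the crate is just `krate.items.values()`, with no tree walk, while getting from `foo` to `bar` requires an explicit `krate.item(..)` call.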
diff --git a/src/librustc/hir/map/README.md b/src/librustc/hir/map/README.md
@@ -0,0 +1,4 @@
+The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
+HIR and convert between various forms of identifiers. See [the HIR README] for more information.
+
+[the HIR README]: ../README.md
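The id conversion that the HIR README describes for `as_local_node_id` (an `Option` that is `None` for definitions from other crates) can be mimicked in a standalone toy. This is a sketch under stated assumptions: `ToyMap`, its `def_index_to_node` table, the field names on `DefId`, and the choice of `CrateNum(0)` as the local crate are all hypothetical and only stand in for whatever the real HIR map does.

```rust
/// Toy id types; the numeric representations are assumptions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CrateNum(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DefIndex(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(pub u32);

/// In this sketch, crate number 0 plays the role of the crate being compiled.
pub const LOCAL_CRATE: CrateNum = CrateNum(0);

/// A definition id: which crate, plus an index into that crate's definitions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DefId {
    pub krate: CrateNum,
    pub index: DefIndex,
}

/// Hypothetical stand-in for the HIR map's id tables.
pub struct ToyMap {
    /// Indexed by `DefIndex`; gives the node id of each local definition.
    def_index_to_node: Vec<NodeId>,
}

impl ToyMap {
    pub fn new(def_index_to_node: Vec<NodeId>) -> Self {
        ToyMap { def_index_to_node }
    }

    /// Returns the node id for a local definition, or `None` if the
    /// definition lives in another crate and so has no HIR node here.
    pub fn as_local_node_id(&self, def_id: DefId) -> Option<NodeId> {
        if def_id.krate != LOCAL_CRATE {
            return None;
        }
        self.def_index_to_node.get(def_id.index.0 as usize).copied()
    }
}
```

Returning `Option` rather than panicking forces callers to decide what to do with definitions that come from other crates.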
diff --git a/src/librustc/hir/mod.rs b/src/librustc/hir/mod.rs
@@ -413,6 +413,10 @@ pub struct WhereEqPredicate {
 
 pub type CrateConfig = HirVec<P<MetaItem>>;
 
+/// The top-level data structure that stores the entire contents of
+/// the crate currently being compiled.
+///
+/// For more details, see [the module-level README](README.md).
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
 pub struct Crate {
     pub module: Mod,
@@ -927,7 +931,27 @@ pub struct BodyId {
     pub node_id: NodeId,
 }
 
-/// The body of a function or constant value.
+/// The body of a function, closure, or constant value. In the case of
+/// a function, the body contains not only the function body itself
+/// (which is an expression), but also the argument patterns, since
+/// those are something that the caller doesn't really care about.
+///
+/// # Examples
+///
+/// ```
+/// fn foo((x, y): (u32, u32)) -> u32 {
+///     x + y
+/// }
+/// ```
+///
+/// Here, the `Body` associated with `foo()` would contain:
+///
+/// - an `arguments` array containing the `(x, y)` pattern
+/// - a `value` containing the `x + y` expression (maybe wrapped in a block)
+/// - `is_generator` would be false
+///
+/// All bodies have an **owner**, which can be accessed via the HIR
+/// map using `body_owner_def_id()`.
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
 pub struct Body {
     pub arguments: HirVec<Arg>,

diff --git a/src/librustc/lib.rs b/src/librustc/lib.rs
@@ -8,7 +8,28 @@
 // option. This file may not be copied, modified, or distributed
 // except according to those terms.
 
-//! The Rust compiler.
+//! The "main crate" of the Rust compiler. This crate contains common
+//! type definitions that are used by the other crates in the rustc
+//! "family". Some prominent examples follow (note that each of these
+//! modules has its own README with further details):
+//!
+//! - **HIR.** The "high-level (H) intermediate representation (IR)" is
+//!   defined in the `hir` module.
+//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
+//!   defined in the `mir` module. This module contains only the
+//!   *definition* of the MIR; the passes that transform and operate
+//!   on MIR are found in the `librustc_mir` crate.
+//! - **Types.** The internal representation of types used in rustc is
+//!   defined in the `ty` module. This includes the **type context**
+//!   (or `tcx`), which is the central context during most of
+//!   compilation, containing the interners and other things.
+//! - **Traits.** Trait resolution is implemented in the `traits` module.
+//! - **Type inference.** The type inference code can be found in the `infer` module;
+//!   this code handles low-level equality and subtyping operations. The
+//!   type check pass in the compiler is found in the `librustc_typeck` crate.
+//!
+//! For a deeper explanation of how the compiler works and is
+//! organized, see the README.md file in this directory.
 //!
 //! # Note
 //!

diff --git a/src/librustc/ty/README.md b/src/librustc/ty/README.md
@@ -0,0 +1,165 @@
+# Types and the Type Context
+
+The `ty` module defines how the Rust compiler represents types
+internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
+which is the central data structure in the compiler.
+
+## The tcx and how it uses lifetimes
+
+The `tcx` ("typing context") is the central data structure in the
+compiler. It is the context that you use to perform all manner of
+queries. The struct `TyCtxt` defines a reference to this shared context:
+
+```rust
+tcx: TyCtxt<'a, 'gcx, 'tcx>
+//          --  ----  ----
+//          |   |     |
+//          |   |     innermost arena lifetime (if any)
+//          |   "global arena" lifetime
+//          lifetime of this reference
+```
+
+As you can see, the `TyCtxt` type takes three lifetime parameters.
+These lifetimes are perhaps the most complex thing to understand about
+the tcx. During Rust compilation, we allocate most of our memory in
+**arenas**, which are basically pools of memory that get freed all at
+once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
+you know that it refers to arena-allocated data (or data that lives as
+long as the arenas, anyhow).
+
+We use two distinct levels of arenas. The outer level is the "global
+arena". This arena lasts for the entire compilation: so anything you
+allocate in there is only freed once compilation is basically over
+(actually, when we shift to executing LLVM).
+
+To reduce peak memory usage, when we do type inference, we also use an
+inner level of arena. These arenas get thrown away once type inference
+is over. This is done because type inference generates a lot of
+"throw-away" types that are not particularly interesting after type
+inference completes, so keeping around those allocations would be
+wasteful.
+
+Often, we wish to write code that explicitly asserts that it is not
+taking place during inference. In that case, there is no "local"
+arena, and all the types that you can access are allocated in the
+global arena.  To express this, the idea is to use the same lifetime
+for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
+confusing, we tend to use the name `'tcx` in such contexts. Here is an
+example:
+
+```rust
+fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
+    //                                        ----  ----
+    //                                        Using the same lifetime here asserts
+    //                                        that the innermost arena accessible through
+    //                                        this reference *is* the global arena.
+}
+```
+
+In contrast, if you want to write code that is usable during type inference, then you
+need to declare distinct `'gcx` and `'tcx` lifetime parameters:
+
+```rust
+fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
+    //                                                ----  ----
+    //                                        Using different lifetimes here means that
+    //                                        the innermost arena *may* be distinct
+    //                                        from the global arena (but doesn't have to be).
+}
+```
+
+### Allocating and working with types
+
+Rust types are represented using the `Ty<'tcx>` defined in the `ty`
+module (not to be confused with the `Ty` struct from [the HIR]). This
+is in fact a simple type alias for a reference with `'tcx` lifetime:
+
+```rust
+pub type Ty<'tcx> = &'tcx TyS<'tcx>;
+```
+
+[the HIR]: ../hir/README.md
+
+You can basically ignore the `TyS` struct -- you will almost never
+access it explicitly. We always pass it by reference using the
+`Ty<'tcx>` alias -- the only exception I think is to define inherent
+methods on types. Instances of `TyS` are only ever allocated in one of
+the rustc arenas (never e.g. on the stack).
+
+One common operation on types is to **match** and see what kinds of
+types they are. This is done by doing `match ty.sty`, sort of like this:
+
+```rust
+fn test_type<'tcx>(ty: Ty<'tcx>) {
+    match ty.sty {
+        ty::TyArray(elem_ty, len) => { ... }
+        ...
+    }
+}
+```
+
+The `sty` field (the origin of this name is unclear to me; perhaps
+structural type?) is of type `TypeVariants<'tcx>`, which is an enum
+defining all of the different kinds of types in the compiler.
+
+> NB: inspecting the `sty` field on types during type inference can be
+> risky, as there may be inference variables and other things to
+> consider, or sometimes types are not yet known that will become
+> known later.
+
+To allocate a new type, you can use the various `mk_` methods defined
+on the `tcx`. These have names that correspond mostly to the various kinds
+of type variants. For example:
+
+```rust
+let array_ty = tcx.mk_array(elem_ty, len * 2);
+```
+
+These methods all return a `Ty<'tcx>` -- note that the lifetime you
+get back is the lifetime of the innermost arena that this `tcx` has
+access to. In fact, types are always canonicalized and interned (so we
+never allocate exactly the same type twice) and are always allocated
+in the outermost arena where they can be (so, if they do not contain
+any inference variables or other "temporary" types, they will be
+allocated in the global arena). However, the lifetime `'tcx` is always
+a safe approximation, so that is what you get back.
+
+> NB. Because types are interned, it is possible to compare them for
+> equality efficiently using `==` -- however, this is almost never what
+> you want to do unless you happen to be hashing and looking for
+> duplicates. This is because often in Rust there are multiple ways to
+> represent the same type, particularly once inference is involved. If
+> you are going to be testing for type equality, you probably need to
+> start looking into the inference code to do it right.
+
+You can also find various common types in the tcx itself by accessing
+`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).
+
+### Beyond types: Other kinds of arena-allocated data structures
+
+In addition to types, there are a number of other arena-allocated data
+structures that you can allocate, and which are found in this
+module. Here are a few examples:
+
+- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
+  specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
+  would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`).
+- `TraitRef`, typically passed by value -- a **trait reference**
+  consists of a reference to a trait along with its various type
+  parameters (including `Self`), like `i32: Display` (here, the def-id
+  would reference the `Display` trait, and the substs would contain
+  `i32`).
+- `Predicate` defines something the trait system has to prove (see `traits` module).
+
+### Import conventions
+
+Although there is no hard and fast rule, the `ty` module tends to be used like so:
+
+```rust
+use ty::{self, Ty, TyCtxt};
+```
+
+In particular, since they are so common, the `Ty` and `TyCtxt` types
+are imported directly. Other types are often referenced with an
+explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
+choose to import a larger or smaller set of names explicitly.
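The interning behaviour described in the `ty` README above (the same type is never allocated twice, so identity can be checked cheaply) can be sketched outside the compiler. This is a toy under loud assumptions: it uses `Rc` and a `HashSet` in place of rustc's arenas and lifetimes, and `TyKind`, `Interner`, and this `mk_array` signature are invented for illustration. They are not the real `TypeVariants`, `TyCtxt`, or `tcx.mk_array`.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Toy type representation (hypothetical; far smaller than the real enum).
#[derive(Debug, PartialEq, Eq, Hash)]
pub enum TyKind {
    Bool,
    U32,
    Array(Ty, u64),
}

/// In this sketch a "type" is a shared pointer to its interned kind.
pub type Ty = Rc<TyKind>;

/// Hands out one shared allocation per structurally distinct type.
#[derive(Default)]
pub struct Interner {
    set: HashSet<Rc<TyKind>>,
}

impl Interner {
    /// Returns the existing allocation for `kind` if there is one,
    /// otherwise allocates and remembers a new one.
    pub fn intern(&mut self, kind: TyKind) -> Ty {
        if let Some(existing) = self.set.get(&kind) {
            return Rc::clone(existing);
        }
        let ty = Rc::new(kind);
        self.set.insert(Rc::clone(&ty));
        ty
    }

    /// Analogue of a `mk_` method: build an array type from its parts.
    pub fn mk_array(&mut self, elem: &Ty, len: u64) -> Ty {
        self.intern(TyKind::Array(Rc::clone(elem), len))
    }
}

/// Cheap identity check: true only if both handles share one allocation.
pub fn same_interned(a: &Ty, b: &Ty) -> bool {
    Rc::ptr_eq(a, b)
}
```

Building `[u32; 4]` twice through `mk_array` yields two handles for which `same_interned` is true, while `[u32; 8]` gets its own allocation. Note that `==` on `Rc` compares structurally in this toy, which is why the sketch uses `Rc::ptr_eq` to show the pointer-identity idea.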
diff --git a/src/librustc/ty/context.rs b/src/librustc/ty/context.rs
@@ -793,9 +793,10 @@ impl<'tcx> CommonTypes<'tcx> {
     }
 }
 
-/// The data structure to keep track of all the information that typechecker
-/// generates so that so that it can be reused and doesn't have to be redone
-/// later on.
+/// The central data structure of the compiler. It stores references
+/// to the various **arenas** and also houses the results of the
+/// various **compiler queries** that have been performed. See [the
+/// README](README.md) for more details.
 #[derive(Copy, Clone)]
 pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
     gcx: &'a GlobalCtxt<'gcx>,