Skip to content

Latest commit

 

History

History
74 lines (55 loc) · 7.12 KB

Bootstrapping.md

File metadata and controls

74 lines (55 loc) · 7.12 KB

Bootstrapping the Virgil Compiler

Virgil is a programming language designed for building systems, such as virtual machines, kernels, networking stacks, databases, etc. Such software usually runs directly on the CPU, rather that itself on a virtual machine or interpreter, and thus necessitates a compiler that translates Virgil source code to native (machine) code.

The Virgil compiler Aeneas serves the role of both an interpreter and compiler to several targets. Aeneas is itself written in Virgil. This is often called "self-hosted", meaning that a compiler is written in the language it accepts, and thus can compile itself. But how does the Virgil compiler run, if not for another compiler or interpreter for Virgil?

Self-hosting requires bootstrapping

The process of creating a self-hosted compiler K for a new programming language L is called bootstrapping. This follows a tried-and-true pattern: before building the first compiler K, a compiler (or interpreter) for the language, PreK, is written in an existing language. The very first compilers had no pre-existing language to bootstrap with; they were written in assembly or machine code. The PreK implementation of L of Virgil III was an interpreter written in Java circa 2009 that was adapted from a Virgil II compiler, known as VPC (for Virgil Prototype Compiler) also written in Java. When Aeneas running on VPC could finally generate an executable of itself, VPC became unnecessary.

Virgil repo contains compiler binaries

Virgil III no longer requires a compiler written in a separate language to bootstrap--i.e. no more VPC written in Java. Maintaining a separate compiler simply to bootstrap slows evolution and risks incompatibility. In fact, as the language has evolved to include new features, these new features are eventually used by the Aeneas compiler itself. The original VPC for Virgil would no longer be able to compile or run today's compiler.

Nowadays, the Virgil repo maintains several stable executable binaries that are simply snapshots of the compiler compiling itself. These stable executables are built from the source code in the repo using the immediately preceding stable binaries and then checked in. Thus each revision in the repository history contains at least one stable binary capable of compiling the source code of the compiler in that same revision. There's no need to have a separate Virgil compiler to get started; this repo is always bootstrappable.

The stable binaries thus form a complete chain back to the very first bootstrap, the "primordial compiler binary" that was first generated by Aeneas running on VPC. That first revision which included that primordial compiler binary thus represents a "something from nothing" moment in time, after which Virgil has been fully self-hosted.

New features need a rev

New features for Virgil must first be implemented in the compiler, tested thoroughly, and a new set of stable binaries produced, before the feature can be used in the compiler's implementation itself. This is to preserve the chain so that any revision of the compiler's source can be built with the corresponding stable binaries checked in at the same revision. Thus each stable revision represents a checkpoint in the evolution in the language. In practice, it's a good idea to be conservative in using new language features in the compiler, since the use of a new feature means that that compiler version can no longer bootstrap with older stable binaries. Though the compiler can always bootstrap from the current stable version, the impact of compiler bugs in stable binaries is mitigated by being able to use earlier stable versions to bootstrap.

Show-stopping compiler bugs

Compiler bugs carry increased risk for self-hosted languages, since miscompilation of the compiler itself may turn a rare bug into a bug that affects all programs (by affecting the behavior of the compiler itself). This is particularly bad if the stable compiler snapshots bugs, since the stable compiler is needed to bootstrap a new compiler (and thus fix a bug in stable!). Development of Aeneas tries to mitigate this problem by broad testing (the current repo contains over 14,000 mini-programs as tests) and bootstrap testing (the testsuite is always run twice, once on a compiler compiled with stable and once with the new compiler having compiled itself). Often, just after the release of a new stable revision, the current compiler will generate exactly the same binary as the stable compiler, so if it compiles itself and produces the same binary as the stable compiler for itself, then no need for a second round of testing! When the compiler improves and generates better code than stable, then binaries diverge and it becomes imperative to do two-stage testing.

A compiler bug in stable could also prevent bootstrapping the compiler, e.g. if the stable compiler crashes on the new compiler. To mitigate this, development of Aeneas always procedes in steps small enough to ensure that every revision can build with the stable compiler in the same revision. If such a bug arises, it would first be fixed, the stable binaries are rev'd, and extensive testing makes sure that the bug doesn't persist. To date, no show-stopper bug has yet appeared.

Relatively minor bugs in stable do happen, however. They are treated according to their severity; e.g. an incorrect rejection of a valid program is not as serious as a wrong code bug. Generally, users are encouraged to work with a bootstrapped compiler (rather than stable) to subject the current compiler to as much testing as possible before the next stable revision.

Reflections on Trusting Trust

In Ken Thompson's now famous Turing Award lecture, he demonstrated a backdoor that was inserted into the compiler's binary and then erased from its source. The backdoor in the compiler pattern-matched the source code of specific programs (such as the login command) in order to insert vulnerabilities into them. Interestingly, the compiler backdoor also pattern-matched the source code of the compiler itself and inserted the very same backdoor logic. Thus the backdoor was not present in the source code of the compiler, only the binary, yet recompiling from source would reintroduce it, and unwary users would never know. Without disassembling the compromised compiler or auditing its output, the backdoor and vulnerability would never be detected.

How does this break down for Virgil? Don't we necessarily trust stable Aeneas compiler binaries to not have backdoors that insert vulnerabilities? Yes, but we can get more confidence that these binaries do not contain backdoors because:

  • Diversity. There are stable binaries for many different platforms
  • Determinism. As a language with well-specified semantics and no platform-specific behavior, all stable compiler binaries produce bit-equivalent results, particularly when compiling the compiler itself
  • Auditability. At the revision where a new stable compiler is checked in, we can run the compiler in the interpreter (even an external one), which has a tracing mode that we can actively verify the execution steps of interpreting the compiler