Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bonus Points: The "I'll F**king Show Them" Bootstrapping Plan #6723

Closed
ghost opened this issue Oct 18, 2020 · 12 comments
Closed

Bonus Points: The "I'll F**king Show Them" Bootstrapping Plan #6723

ghost opened this issue Oct 18, 2020 · 12 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@ghost
Copy link

ghost commented Oct 18, 2020

This may be the single most ambitious, most awesome thing we could possibly do.

Absolute Zero

Bootstrapping Zig from below C level.

Introduction

In the early days, when BCPL was considered high-level, compiler languages were bootstrapped from assembly. The process was to define a subset of the target language which was easier to process, writing a compiler for that in assembly, writing a compiler for the full language in the subset, then writing a full compiler in the full language. At the time, this was done out of necessity, but it also meant you had a fully auditable system right down to bare metal, as well as a concrete path to recreate everything from scratch.

This isn't done anymore -- computers are so ubiquitous, and C-based software stacks so widespread, that quite literally all languages and systems that "bootstrap" start from C or C++, and abandon their roots as soon as they can build themselves. Zig stands out by maintaining its bootstrap chain, but this only extends down to LLVM/Clang/LLD, and there are currently no plans to go further down than a system C compiler. "Good enough", maybe, but we can do better.

The Plan

To maintain a bootstrap chain that goes right down to assembly, such that we don't even need a system C compiler to get off the ground. We do this by defining a strict subset of Zig, herein referred to as tZig (for "thin Zig" or "tiny Zig", take your pick), which can be compiled in one pass and more closely matches the operations performed by (most) physical architectures. The stage 1 compiler is then written in this subset, and it's Zig all the way down. We can provide tZig compilers in various assembly languages for those who want a fully auditable system, and one in C as a catch-all -- and if we miss one, and that target doesn't have a C compiler either, the user can write their own.

Our job is more complicated than C's, since we need to support multiple architectures, and Zig the language is designed assuming a fairly sophisticated compiler to start with. This means we need to be more restrictive than is strictly necessary for any one architecture, and we do need to make some assumptions. It also means that we need to keep implementational simplicity as a key consideration, so that other users can cover ground that we haven't.

Restrictions

Any language feature not explicitly prohibited by these is allowed. The goal is to have a single-pass compilable language with only one operation per statement that is also fully compliant with ZIg -- doubtless I've missed some things which make that impossible, additions welcome.

  • No forward references -- declarations may only reference previously defined symbols
  • No comptime -- only global constants are comptime-known, and see below
  • Global declarations must not include ALU ops, function calls, dereferences, or indexes, and must have type annotations
    • One exception: lone @import shall be allowed, but the @import("std") special case shall not -- see below
  • No anonymous keyword types (structs, enums, unions) -- must be bound to a global constant before use
    • No union (enum) -- enum type must be explicitly declared
  • No anonymous functions -- see above (relevant once RFC: Make function definitions expressions #1717 comes through)
  • In a compound type declaration, all members must precede all decls
  • No type or error set inference
    • Two exceptions: integer and enum literals
  • Local declarations must all be made at the start of a function, and must have type annotations
    • Local constants must precede local variables
  • No compound types or operations:
    • A local statement must contain at most one defer/errdefer and ((one return OR one assignment/declaration) and (one try/catch/orelse and one ALU op/function call/dereference/index/member chain/global variable on either side OR one complex literal)) OR one if statement with optional else branch OR one switch statement OR one while loop with optional continue statement
    • A while/if clause must consist of a single ALU op on simple values/variables OR a single boolean/optional value/variable with optional capture
    • A switch payload must be a simple value/variable
    • No for loops
  • No floating point types or values
  • No non-power of 2 integers
  • No packed structs
  • No unaligned accesses
  • No async
  • No inline assembly
  • No builtin functions that would contradict previous rules

To provide necessary platform functionality, there shall be one special-case import, @import("bootstrap"), that contains
hooks into the implementation for heap allocation, file I/O, and other things necessary for a compiler. This interface needs to be carefully designed to be as small and portable as possible. When compiling as Zig, this will alias selected standard library functionality.

Notes

  • @import("std") requires an implementation to process the whole standard library, or to understand lazy evaluation (and by extension the full complexity of Zig syntax) to avoid doing so. There is no reasonable way to mitigate this, and no significant reason to use arbitrary stdlib functionality anyway, so it is disallowed. Interfacing with platform binary functionality can be done with extern.
  • The tight restrictions on individual ops were chosen such that a typical RISC machine can convert them to single instructions -- in particular, the extra memory access involved in global variable use is significant. Exceptions include:
    • Member chains. These represent the standard use case for structs, and are not excessively more difficult to compile than a single access. Also, the module system is expressed through this, and I feel it's important we encourage a highly structured and decoupled programming style, even at the lowest level.
    • The reference operator, &. Note that this has no restrictions on its use at all -- its presence actually means less work for the compiler.
    • return. The presence of this is immediately known, and attaching it to any computation is not much effort. Also, in case an implementation ever wants to include tail recursion, compound return is necessary.
    • try/catch/orelse. Similar to return. This represents the standard use case for these keywords.
  • defer/errdefer require some implementational complexity and potential duplicate machine code to fit into a single-pass framework, but they are too useful to leave out.
  • Floating point values are not necessary for compilers, and not supported on all platforms.
  • Arbitrary-width integers, packed structs, and unaligned accesses share a common problem: requiring potentially multiple intricate machine operations to perform a single language operation.
  • Any for loop can be rewritten as a while loop with a continue statement, and for loops comparatively have hidden state and control flow (the current index and bounds check). These are small details, but everything must be explicit.
  • async is a deliberate departure from platform ABIs, hence it might not be obvious how to implement it on every platform. If it is useful, however, it can be accommodated.
  • Inline assembly makes code non-portable without comptime switching, and requires an integrated assembler, hurting simplicity.
  • @import("bootstrap") is necessary in the absence of inline assembly and the impossibility of the standard library, but I acknowledge it's ugly. However, the only other solution I can see given the restrictions is directly interfacing with platform binaries, which would be possible but much more fiddly and less portable. The best we can do is make sure we provide the simplest and most portable interface we can.

But Why?

In no particular order:

  • As long as we require C to bootstrap, we depend on it, and are at the mercy of its standards committee.
  • As simple as C is, a compiler for it is still a pretty big ROT.
  • C's original bootstrap chain has been lost to time, so if we somehow lose all binaries someday, we're screwed.
  • If you're Stallman-level paranoid, you want to be able to audit every byte of your system, which is only possible if you start from assembly.
  • Bootstrapping Zig on a platform with no C compiler (for instance, a custom one, or a new one designed in the far future when we have supplanted C) under the current plan would involve first writing a C compiler to compile stage 1. It's the same problem that other languages have when they don't maintain their bootstrap chains, but even worse because it recapitulates not only our history but a different project's history.
  • The aim of Zig is to replace C in every use case. If we depend on it for bootstrapping, we're admitting defeat. We're saying, "What has been done before will never be done again. This is the best we can do in the modern day."
  • It's impossible. Unprecedented. The only even remotely similar project is single-platform. Therefore, when we do it, it will be the single most awesome thing ever done in computing, and we'll be lauded as gods. From there, it will be a simple matter to convert everyone in the world to Zig. (Exaggerating? Facetious? Maybe. But there is some truth in it.)

Parting Words

According to all known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway, because bees don't care what humans think is impossible.

@tauoverpi
Copy link
Contributor

Guess this is related to https://github.com/oriansj/stage0 and http://guix.gnu.org/en/blog/2020/guix-further-reduces-bootstrap-seed-to-25/ which aim to do the same thing. It would be interesting and go closer to stage0 or the sort for a completely clean bootstrapping build pipeline but that might be less practical.

@Rocknest
Copy link
Contributor

A nice vanity project for the post 1.0 future. Although an intermediate dialect of zig is a bad idea, because it could lead to fragmentation. There is already zir, and it could be a lot simpler to implement in the assembly.

@ghost
Copy link
Author

ghost commented Oct 20, 2020

Easier to parse, sure, but ZIR still has the full power of Zig -- so, easier to implement, no. And I'm not necessarily sure a tZIR would be better than a tZig -- ZIR makes comptime more explicit than Zig, so that functionality would be harder to special-case. It's worth considering though.

@zenith391
Copy link
Contributor

zenith391 commented Oct 22, 2020

(what i'm saying assume we use stage2 compiler, which is likely since your idea would be implemented in post-1.0)

Don't know if i already said that, but days have changed since BCPL times, seeing the goal of this project (your "But Why ?" section), can't it just be implemented by cross-compiling from a compatible architecture (assuming stage2 is done) ? It sure sounds less cool but it's easier than having a custom platform creator create a whole compiler in assembly language and it's also easier for the compiler writers to not write in a dialect that strict. We don't have to depend on C for bootstrapping (assuming stage2), we don't even have to bootstrap if we do cross compiling correctly. The only case bootstrapping is actually useful is if all binaries of the Zig compiler magically disappear from everyone's PC on all the planet and nobody did a backup, but like Rocknest said, your idea will create fragmentation, because we will have architecture creators that will have to create a stage0 compiler (in assembly) to compile the stage1 compiler (written in your tZig proposal) to compile the stage2 compiler (in Zig). That isn't DRY, even if it's subsets, you end up implementing 3 times a Zig compiler, and the stage1 and stage2 compilers will have a lot of code to share, but unless rewriting 95% of the stage2 compiler in tZig, this won't be possible.
This also mean the architecture creator will have to implement his architecture in stage1 after writing a stage0 compiler in assembly, that's quite a lot of work isn't it? Mostly compared to just adding his architecture to stage2 and cross-compiling, voilà.

@ghost
Copy link

ghost commented Oct 22, 2020

I have to agree with @zenith391 here. The scope of the project is inspiring, but also a bit ... unnecessary? Kind of like the Space Shuttle -- technically brilliant, but very resource intensive and not really doing anything useful that can't be achieved in other ways. Excising C++ and LLVM as bootstrapping dependencies (eventually) seems like a worthwhile goal, but going beyond C the cost-benefit ratio seems to drop off quite a bit:

  • As mentioned before, Zig's awesome cross-compilation abilities make it unnecessary to bootstrap on every new architecture. It is sufficient to be able to bootstrap on commonly used systems.

  • Trusting-trust issues can be mitigated by setting up a reproducible bootstrap build and running it on multiple independent C compilers on several different platforms and then comparing hashes. Unless literally everything is subverted, this should be enough to eliminate binary backdoors.

  • C is a reasonably lightweight and very ubiquitous dependency. Depending on C is not much different than depending on the presence of an operating system of some kind. So unless we want to also bootstrap a ZigOS and ideally a RISC-Z CPU from punch cards, I'd say there is not much point in eliminating C at the start of the bootstrap chain.

@Michael-S
Copy link

* Trusting-trust issues can be mitigated by setting up a reproducible bootstrap build and running it on multiple independent C compilers on several different platforms and then comparing hashes. Unless literally everything is subverted, this should be enough to eliminate binary backdoors.

I'm pretty sure just about everyone active in Zig knows more about this kind of thing than I do, low level compiler details are outside my skill. But my understanding is that it would be difficult or impossible to get a byte-for-byte identical binary out of different C compilers running on different platforms. i.e. there is always "drift", which is what makes reproducible, provably safe builds such a hard problem to solve.

On the general proposal, I love it - but again, I wouldn't be able to contribute. (I'd love to learn, but my day job working with languages a hundred times uglier than Zig takes up my time.) I'll leave the heavier feedback to the people who would be doing the work.

@zenith391
Copy link
Contributor

zenith391 commented Oct 23, 2020

Sure, it could make a "drift", but if we compile the stage1 compiler, use it to compile the stage2 compiler and then use the stage2 compiler to compile itself, there will be no drift and the build will be byte-for-byte identical.

There's also, like i said in my above comment, the fact that you will be able to cross-compile directly using the stage2 compiler, which will make us independent from C once the stage2 is done.

@Michael-S
Copy link

I think I understand that, thanks.

@ghost
Copy link

ghost commented Oct 23, 2020

And another counterpoint: Writing stage0 in assembly could easily make the code less understandable/auditable/portable for distro maintainers, who, in my understanding, will be the primary users of the bootstrap chain. Other devs will use an existing compiler binary 99.9% of the time anyway.

I think that Andrew's original plan (#853) to keep a one-stage bootstrap process by continuing to maintain the existing stage1 compiler is sound. Optionally, stage1 can be converted to C output or even pure C (#5246), so that the rather large and brittle dependency of LLVM can be dropped from the bootstrap chain. But I really fail to see the value of going beyond that, other than just for the sake of it.

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Oct 26, 2020
@andrewrk andrewrk added this to the 0.8.0 milestone Oct 26, 2020
@matu3ba
Copy link
Contributor

matu3ba commented Jan 21, 2021

As mentioned before, Zig's awesome cross-compilation abilities make it unnecessary to bootstrap on every new architecture. It is sufficient to be able to bootstrap on commonly used systems.
Trusting-trust issues can be mitigated by setting up a reproducible bootstrap build and running it on multiple independent C compilers on several different platforms and then comparing hashes. Unless literally everything is subverted, this should be enough to eliminate binary backdoors.
C is a reasonably lightweight and very ubiquitous dependency. Depending on C is not much different than depending on the presence of an operating system of some kind. So unless we want to also bootstrap a ZigOS and ideally a RISC-Z CPU from punch cards, I'd say there is not much point in eliminating C at the start of the bootstrap chain.

From an ultra-paranoide perspective, you need to be able to bootstrap from inspectable hardware like the precursor or better microscopical inspectable hardware. Otherwise, you are trusting that at no time in the past the binary code of compilers was compromised.

The much easier and currently favoured way is to use platform backdoors like IME and hardware backdoors (like special CPU instructions, chip configurations etc). The problem with those is, that they may enable the aforement problem of subverting everything and you cant detect it.

@andrewrk
Copy link
Member

This is a cool idea but it is out of scope for this project. All Zig needs to do is provide a reasonable way to bootstrap from another well supported language, and C is ideal for this.

I hope somebody tries to bootstrap Zig from scratch in a third party project some day. That would be sweet.

@Sivecano
Copy link

most based proposal I've ever read

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Projects
None yet
Development

No branches or pull requests

7 participants