-
Notifications
You must be signed in to change notification settings - Fork 0
MIRROR of the Ubik programming environment. Upstream is hosted on git.haldean.org.
haldean/ubik-mirror
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
| | -- | | / | | | | | -- | | | | | | | -- ---- | | | This document attempts to fully explain the implementation and behavior of the Ubik runtime. When the code and this document disagree, the code is incorrect; on the other hand, don't trust anything in this document. Structure of the project ----------------------------------------------- The project itself is divided into a few software components. By far the most interesting is libubik, which is the shared library that contains the entirety of the Ubik runtime. There's runubik, which is a simple thing that uses libubik to load and run Ubik bytecode. There's ubikc and ubiki, which are a compiler and interpreter, respectively; both are thin wrappers around libubik, which contains the actual compiler implementation. And then there's pyasm, which is a hilarious "assembler" written as a Python DSL; pyasm will be gone the moment there's a working Ubik compiler, but for now, with my focus on the runtime, pyasm is here to stay. There are unit tests and "integration tests"; unit tests are haphazard but the integration tests (in the form of pyasm scripts in test/pyasm/*.upy) are pretty comprehensive. You can run all of the tests by running `make check`. make check When you're done with all that, the compiler will be present at bin/ubic, the interpreter will be at bin/ubi and the runtime will be at bin/ubik. One last thing: before contributing, please ensure you have installed the pre-commit hook in the `res` directory before you commit anything. It does some simple checks to make sure that you're not doing anything dumb. For completeness, building Ubik requires: - GCC 4.9 or later (it's possible that the build works with other versions of GCC and with clang, but they're untested) - Bison 3.0 or later - Flex 2.6 or later Ubik has no runtime dependencies outside of libc. The rest of this document is more interesting, I promise. Runtime representations ------------------------------------------------ Ubik is based on a simple primitive: unbalanced binary trees of 64-bit values (called "words"). Everything is expressed in this way, from integers to types to functions; this homogeneity of data representation allows us to simplify otherwise-complicated tasks and has a satisfying purity. Note that the in-memory representation is extremely close to the on-disk representation of compiled Ubik programs. Each node in the tree has a left value, a right value, and a tag. Each value can be one of two things: a pointer to another node or a word; the tag tells you how to interpret the left and right values. Tags Tags are 16-bit integers which contain types and subtypes for a given binary blob. The available tag types are: 0x1000: object is a tree 0x2000: object is a graph 0x3000: object is a URI Values that are trees have the following lower bytes: 0b00000001 (0x01): right child is a node 0b00000010 (0x02): right child is a word 0b00000100 (0x04): right child is a graph 0b00010000 (0x10): left child is a node 0b00100000 (0x20): left child is a word 0b01000000 (0x40): left child is a graph Thus, a node whose left value is a node and whose right value is a word would have the lower byte `0b00001001`, and its tag would be `0x1009`. Graphs have the following lower bytes: 0b00000001 (0x01): graph is native 0b00000010 (0x02): graph is unresolved 0b00000100 (0x04): graph is a modinit graph TODO: what do those mean. Types Types can be base types or derived types. Base types are special in that they are identified only by an integer constant; derived types (quite predictably) require quite a bit more information about the type itself. The base types of Ubik are a few flavors of integer, lists and tuples. Each of these has a base type code and comes with a tree encoding. The full list of base type codes is: Code Description ....word unsigned 64-bit integer ...sword signed 64-bit integer ....bool boolean ..packed packed list ..lambda function ...float 64-bit floating-point number Each type code is actually a word that contains the numeric equivalent of a 8-byte string, with the first letter in the byte with the lowest address. Note that periods in the above table stand in for space characters (`0x20`). The tree encoding of integer types is very simple; on its left is a word containing the type code of the relevant integer type and on its right is the representation of the value. The value representation is a word in which the represented value is stored in the lowest bits. The tree encoding of list types puts a list type descriptor on its left and the value of the list on its right. The list type descriptor has a constant word `list` on its left and a node representing the type of the elements of the list on its right. Each item on the right of the list should have a value on its left and either a node representing a list or a word set to zero on its right; finding the zero-word is a sentinel that the end of the list has been reached. The tree encoding of tuple types is similar to lists; there's a tuple type descriptor on the left and a tuple value on the right. The values are cons cells just like lists. However, the type descriptor is more complicated, as tuple elements are not homogenous. Instead, the type descriptor is a node whose left is the constant word `tuple` and whose right is a zero-word-terminated list of type encodings. The tree encoding of packed list types is provided for better space usage than a straight list would provide. The left type descriptor for a packed list is a node whose left is the constant word `packed` and whose right is the type of the elements of the list; this type is restricted to be a built-in integral type or `char`. The right value of a packed list is a node whose left is a word containing the number of items in the list, and whose right is a list of packed values encoded as cons cells, where each cell has a left of 8 bytes of packed data, and a right that either points to the next node or a zero-word. If the data ends off of an 8-byte boundary, the remaining bytes in the left word are zeroed; it is up to the user of the tree value to respect the length given in the first word of the value. Strings are stored as UTF-8 and are represented as packed lists of bytes; the represented string is the concatenation of all left values in the list. Note that this means that each individual left value may in fact be poorly-formed UTF-8, as a multibyte code point may be split across words. A type descriptor describes a type on its own; its left is the constant word `type` and its right is the descriptor itself. Anything that would appear on the left of a typed value is a type descriptor value, and could appear on the right of a type node. A resource identifier is how all entities within the ubik runtime are identified. Resources all have a name, author, version and scope; all of these are encoded in the URI tree. The left of a URI tree is the constant word `uri`, and the right is a cons-cell list, where the first element in the list is the name of the resource as a packed string, the second element in the list is the author (also as a packed string), the third item is a 64-bit integer representing the version number and the final element is a 64-bit integer represending the scope of the resource. The meanings of these fields will be covered in the Bindings section below. Building logic --------------------------------------------------------- Logic in Ubik is encoded in a directed acyclic graph of computations (DAGC); nodes in this graph then reference values stored as trees. There are four kinds of nodes in a DAGC: `apply` Points to a node with a function value and a node with an argument. Applies the argument to the function value. If the function is then fully applied the value of the node is the result of the function; if the function is partially applied the value of the node is a function that has an arity of one fewer than the input function. `const` A constant type and value; the value can be either a tree or a graph. `load` A load operation. Has a URI that references where the value should come from. `store` A store operation. Has a URI and a reference to a node with a value, and stores the given value at that URI. `input` The input to a DAGC. `cond` Points to a condition, a true value node and a false value node. If the condition evaluates as true, the value of the node is the value of the true value node; if it evaluates as false, it as the value of the false value node. `native` This one is sneaky; the native node is a node that is filled in by native code in "native DAGCs". Native DAGCs are backed by native code and evaluating them just calls the associated native code block. That code block then fills in the result of the compuation over the graph on the native node. Reading native nodes in from ubik bytecode is not supported; this is only a construct used internally in the runtime. `ref` This node has the value of another node in the graph, and is used primarily as a simplification in the code generation phase before it is optimized away. Each individual DAGC represents a function; each DAGC contains an input node for each of the function's arguments, and a single result node that represents the function's result. However, the graph may have multiple nodes whose value must be evaluated for the graph to have been called fully evaluated; these nodes are called terminal nodes, and the result node is one of them. The minimal set of nodes required to evaluate all terminal nodes is found, and all nodes in that set are evaluated in turn. This gives the result for the DAGC. Once evaluated, the DAGC is semantically and computationally equivalent to a single value; this is possible because all functions in Ubik are idempotent, and thus a zero-arity function is equivalent to a value. Logic is then built up by creating a number of graphs and referencing them from each other. When a node with a graph value is applied, a copy of the referenced DAGC is made and that DAGC's `input` nodes are completed. Once all required input nodes have been completed, the DAGC can be evaluated. An ubik bytecode blob contains some number of graphs; the various graphs can reference each other, and in general their ordering in the bytecode does not matter. There is only one exception: when evaluating a bytecode blob, the first graph is considered to represent the desired computation; the other graphs are only evaluated if it is necessary to do so to computate the value of the first graph. In this sense, the first graph is the "main method" of other languages. Encoding in-flight and at-rest ----------------------------------------- Ubik bytecode has a single binary storage format that is used for network operations and for on-disk storage. The format begins with a header, and then contains a number of encoded graphs. All numerically-encoded data is stored in big-endian ("network") byte order, which means the most significant byte comes first. The header has the following fields: Byte index Field 0-3 The constant bytes 0x65 0x78 0x70 0x6C ("expl") 4-7 The version number of the encoding format, as a big-endian 4-byte unsigned integer. 8-15 The number of graphs in the file. 16-23 The number of values in the file. The version of the encoding specified in this document is 1, and thus bytes 4-7 should be 0x00000001. The decoding proceeds by concatenating a number of value encodings followed by a number of graph encodings. Each graph or value that is encoded is identified by its index in the file. Immediately after the bytecode header comes a series of tree-encoded values. The tree encoding proceeds as follows: Begin with the root of the tree to be encoded. The node's 8-bit tag is written to the stream, followed by the encoding of its left child, then its right child. If a child is a value, the value is written to the stream in little-endian format as an 8-byte integer. If a child is another value, recurse and apply this same operation. If a child is a reference to a graph, the word-encoded index of the graph in the blob is written. The tag of the value is used to disambiguate the types of the children of the node. Tree decoding proceeds in much the same way: Take a byte off of the head of the stream. If the byte indicates the left is a node, recurse. If it indicates the left is a value, read the value as a little-endian 8-byte integer. Repeat the same operation for the right node. The graphs are specified after the values have all been read. The graph block has a header of its own, that is given once for each graph: Byte index Field 0-3 The tag of the graph 8-15 The number of nodes in the graph 16-23 The index of the result node in the graph 24-31 The value index of the identity URI of the graph 32-39 The value index of the type of the graph The identity URI represents the meaning of the graph; in most cases this will be the URI of whatever the graph is bound to. The result node should be set to all-ones if the graph has no identity. Following the graph header, the specified number of node encodings follows. The order of encoding matters; each node is identified by other nodes by its index. A node is encoded by a type-independent header followed by a type-dependent body. The type-independent node header has the following format: Byte index Field 0-7 Node type (word-encoded) 8-15 The ID of the node (for error reporting) 16 Set to 0x01 if the node is a terminal node, meaning that its value is intended as an output of computation. Set to 0x00 otherwise. 17-23 Unused Once the header has been loaded, the type-dependent decoding of the body is given. Nodes with node type ` apply` have the following body: Byte index Field 0-7 Node index of function node 8-15 Node index of argument node Nodes with node type ` load` have the following body: Byte index Field 0-7 Value index of the URI value Nodes with node type ` store` have the following body: Byte index Field 0-7 Node index of value node 8-15 Value index of the URI value Nodes with node type ` const` have the following subheader: Byte index Field 0-7 The subtype of the associated value (either `cvalue` or `cgraph`) which informs whether this constant points at a DAGC or a tree. 8-15 Value index of the tree-encoded type of the constant Then, depending on the associated subtype, this header is either followed by a word-encoded index of a value (if the subtype is `cvalue`) or contains a word-encoded index of another graph in the encoding (if the subtype is `cgraph`). Nodes with node type ` input` have the following body: Byte index Field 0-7 Argument index of the node Nodes with node type ` cond` have the following body: Byte index Field 0-7 Node index of the condition 8-15 Node index of the true value 16-23 Node index of the false value Nodes with node type ` ref` have the following body: Byte index Field 0-7 Node index of referrent node Ubik bytecode blobs may have arbitrary data following the data encoded by the standard; conforming parsers will ignore the data that follows the encoded graphs. This is provided as a means for tools to attach metadata onto the end of the bytecode if so desired. Compilation ------------------------------------------------------------ The compiler for Ubik is distributed as part of the Ubik library; a compiler executable can be built using: $ make $ ub/ubic path/to/file.uk out.ub The process of creating an executable is divided into two phases, not unlike compilation for C and other native languages: compilation and linking. However, unlike C, the user has less control over the build process; object files are always created for each compilation unit, "headers" (called "interface files") are automatically generated from the source, and all linking is static. The compiler abstracts most of the details away from the user, but they still get the advantage of fast, incremental builds. Compilation begins by loading the given Ubik file and generating code for the module (see below). In general, at the end of this process there are still unresolved references; we collect all of the imports that provide those references, and find the files that provide the requisite units among the compiler's search path. If the file is found, we check to see if there already exists an interface file for it in the build cache directory with the correct content hash; if there is, definitions are loaded from the interface file, and if not, we compile the module from scratch. Code generation -------------------------------------------------------- Code generation begins by lexing and parsing the provided source code, using the lexer and parser given in token.l and grammar.y. At the end of parsing, an in-memory representation of the AST is available. That AST is walked and DAGC nodes are attached to each AST element. Graphs are then built from these elements and a DAGC structure is emitted. In addition, each compiled file contains a "modinit", which is a DAGC to be executed upon the loading of the file. This modinit contains everything required to add the contents of the module to an environment, as well as any code the user has specified should be run upon module load using the immediate operator. Special care must be taken when calling the modinit to ensure that it is not called with an ephemeral child environment, which will cause the modue's contents to be discarded. The convention held by the compiler is that the modinit is always the first graph returned by the compilation phase. When assigning nodes to AST elements, the first applicable rule applies: 1. Constant AST atoms (strings, integers, floats) are assigned a CONST node with the given value. a. At the moment, no common subexpression elimination is in place to allow constant nodes with the same value to point at the same value in-memory. 2. AST atoms that are names, where the name is in the argument list that is in the nearest enclosing function in scope, are assigned REF nodes whose referrent is the input node corresponding to that name. 3. AST atoms that are names, where the name is defined in a scope that is within the current function, are assigned LOAD nodes whose URI has the given name, a NULL source and the LOCAL scope. 4. AST atoms that are names, where the name is in the scope of an enclosing function that is not the nearest function in scope, raise the nearest function to a closure. The closure transformation adds an argument to the head of the argument list for the closure and adds a REF node from the AST element that raises the function to a closure to the new INPUT node created corresponding to the argument. The function is then immediately applied to the argument from the enclosing scope. This process continues recursively through the stack until all modified functions have been raised if necessary. This transforms this: \x -> (\y -> + x y) To this: \x -> ((\x0 y -> + x0 y) x) It transforms this: \x -> (\y -> (\z -> x)) Into this: \x -> ((\x0 y -> ((\x1 z -> x1) x0)) x) It transforms this: \x -> { : y = + x 10 ! \z -> y } Into this: \x -> { : y = + x 10 ! (\y0 z -> y0) y } This closes these functions over the enclosing scope by making all enclosing scope information an explicit argument to the function, and then partially applying the function over that information. 5. AST atoms that are qualified names are checked against the imports that correspond to the source of the qualifier. If a corresponding element in the module is found, the AST atom is assigned a LOAD node whose URI has the given name, whose source is the name of the package, and the PACKAGE source. If no corresponding import is found, or there is no item within the imported module with the given name, compilation fails. 6. AST atoms that are names are assigned LOAD nodes whose URI has the given name and the UNKNOWN scope. 7. AST expressions that represent application are assigned an APPLY node with the func and args of the appropriate AST substructures. 8. Bindings in the AST are assigned a STORE node with the location given by the name of the binding and the LOCAL scope, and the value given by the appropriate AST substructure. 9. Lambda statements in the AST are assigned a CONST node that points at the graph represented by the lambda's body (the "lambda subgraph"). The lambda subgraph contains one input node for each item in the lambda statement's argument list. After nodes have been assigned to all AST elements, all URIs present within the combined AST are resolved. At the end of this process, there do not exist any URIs in the graph structures with an UNKNOWN scope. When resolving URIs, the first applicable rule applies: 1. URIs whose name occurs as a binding in the module that is currently being compiled are assigned the LOCAL scope. 2. URIs whose name occurs as a binding in the native library are assigned NATIVE scope.
About
MIRROR of the Ubik programming environment. Upstream is hosted on git.haldean.org.
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published