Goals for the format, in order from most to least important:
- Longevity: The format should be designed to be long-lived. This means the format must have sufficient documentation to permit independent implementations to be designed, without requiring any reverse-engineering effort.
- Simplicity: The format should be straightforward. In the worst case, it should be possible to write your own parser for the format.
- Compactness: The format shouldn't waste excessive space.
The database is a gzip-encoded JSON blob. The decompressed contents consist of:
- A JSON-encoded object containing the database size and checksum (JSON Schema).
- The byte `0xA` (i.e. the ASCII character `\n`).
- A JSON-encoded object containing the database contents (JSON Schema).
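A minimal sketch of reading this layout in Rust, assuming the flate2 and serde_json crates (field-level handling is elided, since the actual schemas are defined by the JSON Schema documents referenced above):

```rust
use std::io::{BufRead, BufReader, Read};

use flate2::read::GzDecoder;
use serde_json::Value;

/// Decompress the database and split it into its two JSON documents:
/// the header (size and checksum) and the contents.
fn read_database(gzipped: &[u8]) -> std::io::Result<(Value, Value)> {
    let mut reader = BufReader::new(GzDecoder::new(gzipped));

    // The header object occupies everything up to the 0x0A separator.
    let mut header_line = String::new();
    reader.read_line(&mut header_line)?;
    let header: Value = serde_json::from_str(&header_line)?;

    // The rest of the stream is the contents object.
    let mut rest = String::new();
    reader.read_to_string(&mut rest)?;
    let contents: Value = serde_json::from_str(&rest)?;

    Ok((header, contents))
}
```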
The format is designed to be agnostic to the hash algorithm used. Multiple algorithms may be used simultaneously. By default, the following algorithm is used:
- SHA2-512/256
The following algorithms are also supported:
- BLAKE2b
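As a rough illustration, the two digests might be computed with the sha2 and blake2 crates as follows; the 512-bit BLAKE2b output length here is an assumption for the example, not something the format prescribes:

```rust
use blake2::Blake2b512;
use sha2::{Digest, Sha512_256};

/// Compute the default digest (SHA2-512/256) plus an optional BLAKE2b
/// digest over the same input, mirroring the format's support for
/// recording multiple algorithms simultaneously.
fn digests(data: &[u8]) -> (Vec<u8>, Vec<u8>) {
    // Default algorithm: SHA2-512/256 (32-byte output).
    let sha = Sha512_256::digest(data).to_vec();
    // Also supported: BLAKE2b (the 64-byte Blake2b512 variant, chosen
    // here for illustration).
    let blake = Blake2b512::digest(data).to_vec();
    (sha, blake)
}
```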
Here are some formats under consideration:
- Relational databases (SQL)

  SQLite is embeddable, has Rust bindings, and can produce a database file that could plausibly be used as the integrity database. The main disadvantages of such an approach are:
  - Any relational database will support massively more features than we need, introducing unnecessary complexity. In general, these formats are optimized for efficiency of lookup over simplicity.
  - It's unclear how compact the resulting database would be: both the format itself, and the artifacts resulting from trying to force fundamentally tree-shaped data into a relational schema.
  - It's unclear what kind of longevity to expect from the database format. Major revisions to the database software may introduce incompatible changes to the database format.
  - If the database format does change, reverse-engineering the format and writing a parser could be non-trivial.

  Other SQL implementations such as Postgres and MySQL have an additional drawback: they are generally intended to be run as daemons, and do not expose the database files to the end user at all.
- Other databases

  cdb is a "constant database" by D. J. Bernstein. At first glance, cdb looks like a good fit for an integrity database: it supports a one-time creation operation followed by read-only queries, and the format is compact and simple enough to explain in a page of text. However, the format has a number of drawbacks:
  - The format has a built-in limit of 4 GB. For cdb's intended use cases this is more than sufficient, but it is plausible that an integrity database for a large file system could grow to exceed 4 GB.
  - The format is non-hierarchical, which will inevitably result in duplicated data in our use case (e.g. each entry's key would have to repeat its full directory path).
- Archive formats

  The tar file format is well-known and could plausibly serve our use case. Essentially, the integrity database would mirror the original directory structure, but instead of storing the contents of files, we'd store checksums and other metadata. However:
  - The tar format introduces unnecessary complexity (multiple historical variants and header extension schemes).
  - Tar archives may waste space for our purposes, because the format pads records to 512-byte boundaries.
  - It's unclear how you'd interact with the archive. Existing tools are intended to expand the archive into a set of files, but in this case we just want to scan through the contents in memory. While this is probably possible, it would likely require writing a custom implementation, which would (due to the complexity of the format) introduce the possibility of bugs.
- Message formats

  Various formats (JSON, CBOR, Protocol Buffers, Avro, etc.) were originally intended as formats to encode messages for transfer over the wire, but can also be used to describe data at rest.

  A key advantage of these formats is that they are well supported by Serde, making it easy to produce robust and high-quality serializers and deserializers for any of them.
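  For illustration, here is a minimal sketch of what that looks like; the struct and field names are hypothetical, not the actual schema:

  ```rust
  use serde::{Deserialize, Serialize};

  // Hypothetical entry type; the real schema is defined by the JSON
  // Schema documents above.
  #[derive(Serialize, Deserialize)]
  struct FileEntry {
      path: String,
      size: u64,
      sha2_512_256: String, // digest, e.g. base64-encoded
  }

  fn main() -> serde_json::Result<()> {
      let entry = FileEntry {
          path: "docs/example.txt".into(),
          size: 1024,
          sha2_512_256: "<base64 digest>".into(),
      };
      // The same derived implementation works unchanged with the
      // CBOR, Avro, etc. Serde crates.
      println!("{}", serde_json::to_string_pretty(&entry)?);
      Ok(())
  }
  ```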
  The advantage of JSON specifically is that it is ubiquitous: every language can be expected to have a mature JSON parser available. Also, because the format is self-describing, no prior knowledge of the format is required. Even better, the format is human-readable, so you don't even need to decode it to understand what you are looking at. The main disadvantages of JSON are:
  - The cost of being human-readable and self-describing is compactness. The field names of objects are repeated for every object, and binary strings (such as hashes) must be encoded in something like base64, which inflates a 32-byte digest to 44 characters.
  - The format must be read sequentially, and does not permit random access. This is not an issue for the features described above, but could prove problematic for certain plausible extensions to the functionality of the tool.
  There are other formats that are less ubiquitous but improve on the first point by offering compact binary encodings.

  CBOR is a self-describing binary format that is relatively simple and has an official standard (RFC 8949). Because it is self-describing, there will still be some inefficiency in the encoding, especially of objects.
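  For a rough feel of the size difference, a sketch comparing the two encodings of one value, using serde_json and the ciborium crate (one of several Rust CBOR implementations):

  ```rust
  use serde_json::json;

  fn main() {
      // An illustrative value, not the actual schema.
      let value = json!({ "path": "docs/example.txt", "size": 1024 });

      let json_bytes = serde_json::to_vec(&value).unwrap();

      let mut cbor_bytes = Vec::new();
      ciborium::ser::into_writer(&value, &mut cbor_bytes).unwrap();

      // CBOR drops the braces, quotes, and colons in favor of compact
      // type tags, but object keys are still spelled out in full.
      println!("JSON: {} bytes, CBOR: {} bytes",
               json_bytes.len(), cbor_bytes.len());
  }
  ```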
  Apache Avro is a binary format in which all documents are accompanied by a schema. Thus a document can be interpreted without any external knowledge of the schema, while avoiding the per-record repetition of traditional self-describing formats. However, the format is not standardized, and it is unclear what stability guarantees are offered. Overall, Avro is less popular, so there is additional risk of the format becoming unsupported in the future.
- Custom format

  It would also be possible to design a custom format. Version control systems like Git and Mercurial essentially do this. However, those projects have the advantage of being popular, so there is interest in having multiple robust implementations of their formats. For a new tool, it would be better to use existing, well-known formats, as these are more likely to provide the desired longevity.