-
Notifications
You must be signed in to change notification settings - Fork 32
Format
This is the current database file format used by Couchbase Server 2.0. It's similar, but not identical, to the format used by Apache CouchDB 1.2. It is implemented in a Couchbase 2.0 fork of CouchDB (in Erlang), and also by CouchStore in C.
The most important thing to understand about a CouchDB file is that it is append-only. The only way the file is modified is to append new data at the end; bytes written are never overwritten. As a consequence, the critical file "header" data actually lives at the tail end of the file, since it has to be re-appended every time the file is changed.
This has several advantages:
- The data format is extremely robust, since the file is never in an inconsistent state. Even if the process crashes or the kernel panics partway through writing an update, the software can recover simply by scanning back from the end of the file to the last valid header.
- Writers don't disturb readers at all, allowing concurrent access without read locks.
- By default, earlier versions of a record remain in the file, making it easy to implement a version history and multi-version concurrency control (as exposed in the CouchDB API.)
The disadvantage, of course, is that the file grows without bound as it's modified. To work around this, the file is periodically compacted by writing the live data to a new file, then replacing the old file with the new one. This can be done in the background without locking out either readers or writers.
- Numbers are in big-endian byte order (most significant byte first).
- All values are tightly packed; there is no padding or multi-byte alignment.
- Some values occupy partial bytes and have lengths that are not a multiple of 8 bits. These are stored in most-significant to least-significant bit order. So for example, if a 1-bit field is followed by a 47-bit field, the first field occupies the MSB of the first byte, and the second occupies the rest of the first byte and the entirety of the next five bytes.
For purposes of locating the current file header (q.v.), the file is organized into 4096-byte blocks. The first byte of every block is reserved and identifies whether it's a data block (zero) or a header block (nonzero). Therefore only 4095 of the bytes are available for storing data.
Above the block level, these prefix bytes are invisible and simply skipped. So a data value that spans a block boundary will be written out with a zero byte inserted at the boundary, and this byte will be removed when reading the value.
The data in the file (above the block level) is grouped into variable-length chunks. All chunks are prefixed with their length as a 32-bit big endian integer, followed by the CRC32 checksum of the data. The CRC32 is not included in the length.
- 1 bit -- Whether there is a CRC32 checksum or not (there always is, hence it's always 1)
- 31 bits -- body length
- 32 bits -- CRC32 checksum of body
- followed by the body data
A file header always appears on a 4096-byte block boundary, and the first byte of the block is nonzero to signal that it contains a header. A file will contain many headers as it grows, since a new updated one is appended whenever data is saved; the current header is the last one in the file. So the algorithm to find the header when opening the file is to seek to the last block boundary, read one byte, and keep skipping back 4096 bytes until the byte read is nonzero.
The file header is prefixed with a 32 bit length and a checksum, similarly to other data chunks, but the length field does include the length of the hash.
Values in the body of a file header:
- 8 bits -- File format version (Currently 11)
- 48 bits -- Update sequence number counter. This is the sequence number that will appear in the by-sequence index for the next update.
- 48 bits -- Purge sequence number.
- 48 bits -- Purged documents pointer
- 16 bits -- Size of by-sequence B-tree root
- 16 bits -- Size of by-ID B-tree root
- 16 bits -- Size of local documents B-tree root
- The B-tree roots, in the order of the sizes, are B-tree node pointers as described in the "Node Pointers" section.
The B-trees used in CouchDB files are a bit different than in a typical implementation, because the file is append-only. A tree node is never updated in place; instead, a new copy of the updated node is written at the end of the file. Of course this means that the node's parent also has to be updated to point at the updated node, so the effect is that every modification has to rewrite all the nodes from the affected leaf up to the root. In practice this isn't too expensive, especially if multiple writes are batched together into one update.
CouchDB's B-trees also have to forgo the sibling-node chaining that's typically used to speed up sequential access. The reason is that updating a node would invalidate the pointers to it in its siblings, forcing those nodes to be updated as well, resulting in a huge cascade of updates. Instead, the iteration algorithm remembers the path back to the root and periodically backtracks to find the next sibling node.
All B-tree nodes are compressed using the Snappy algorithm. The descriptions following all refer to the uncompressed form.
- First byte -- 1 if a leaf (key/value) node, 0 if an interior (key/pointer) node
- A list of key-value or key-pointer pairs, each of which is:
- 12 bits -- Key size
- 28 bits -- Value size
- Followed by the Key data,
- then the Value or Pointer data
In interior nodes the Value parts of these pairs are pointers to another B-tree node, where keys less than or equal to that pair's Key will be.
In leaf nodes the values are interpreted differently by each index; see the Indexes section below.
- 48 bits -- Node position in file
- 48 bits -- Sub tree size, or disk size of B-tree data below this node. For pointers to leaf nodes this is the size on disk of the pointed to node, otherwise it is the sum of the size of the pointed to node and all the values of this field on the pointers in that node.
- 16 bits -- Reduce value size (NOT present in the root pointers embedded in the file header, as it can be inferred from the lengths in the header)
- Reduce value -- Stores statistics about the subtree beneath this node. The exact data format is different in each type of B-tree, and is detailed below.
A CouchDB database file contains three B-trees:
- The by-ID index, which maps document IDs to values.
- The by-sequence index, which maps sequence numbers (database change numbers, which increase monotonically with every change) to document IDs and values.
- The local-documents index, which is conceptually the same as the by-ID index except that the documents in it do not appear in the
by-sequence index, and by CouchDB convention the document IDs all begin with the ASCII sequence
_local/
.
A CouchDB mapreduce view file contains two B-trees:
- The by-ID index, which maps document IDs to values. This is also called the back-index.
- The mapreduce-view index, which maps the the keys emitted by the view to its emitted values.
This B-tree is an index of documents by their IDs.
The keys are simply the document IDs, ordered lexicographically by raw bytes (as by the memcmp
function.)
The values are:
-
48 bits -- The sequence number this document was last modified at. If another revision of this document is later saved, this sequence number should be removed from the by-sequence index.
-
32 bits -- Size of the document data
-
1 bit -- Deleted flag. If this is set this document should be ignored in indexing.
-
47 bits -- Position of the document content on disk
-
48 bits -- Document revision number (rev_seq)
-
1 bit -- 1 if the value is compressed using the Snappy algorithm
-
7 bits -- Content type code (q.v.)
-
Revision Metadata (q.v.)
CouchStore does not interpret the deleted flag, content type code, document revision number, or revision metadata. These values are used by CouchDB and Couchbase Server.
The reduce value in interior nodes of this B-tree is:
- 40 bits -- Number of documents that are not flagged as deleted.
- 40 bits -- Number of documents that are flagged as deleted
- 48 bits -- Total disk size of document content
CouchDB uses this to identify the type of the document data.
- 0 -- JSON data
- 1 -- BLOB data (Attempted to parse as JSON but failed)
- 2 -- BLOB data (Parsed as syntactically valid JSON, but illegally contains reserved keys)
- 3 -- BLOB data (No attempt to parse as JSON)
This is identifying information about a document revision that should be small enough to be kept in the indexes. This value is not interpreted by Couchstore.
It's currently used by Couchbase as
- 64 bits -- Memcached CAS value (unique cookie used by memcache to allow atomic writes)
- 32 bits -- Expiration time of a value
- 32 bits -- User flags as specified by Memcache protocol
The keys in this B-tree are 48-bit numbers representing the sequence number of a change to the database. This number is strictly increasing.
The values are:
- 12 bits -- Size of the document ID
- 28 bits -- Size of the document data
- 1 bit -- Deleted flag (this item in the by-sequence index represents a deletion action if this is set)
- 47 bits -- Position of the document content on disk
- 48 bits -- Document revision number (rev_seq)
- 1 bit -- 1 if the value is compressed using the Snappy algorithm
- 7 bits -- Content type code
- Document ID
- Revision Metadata (variable size; extends to the end of the value record)
The reduce value in interior nodes of this B-tree is:
- 40 bits -- The number of records
TBD!
The reduce value in interior nodes of this B-tree is unused (empty).
The keys in this B-tree consist of the emitted key and the document ID. For node pointer it is the emitted key and document ID of the rightmost child node.
- 16 bits -- Size of the emitted Key
- Emitted Key -- The emitted key is a JSON string
- Document ID
The values are:
- 16 bits -- Partition ID (aka vBucket ID)
- List of emitted values:
- 24 bits -- Size of the emitted value
- Emitted Value