Skip to content

An RDF and URN-based content-addressing datastore for creating, processing, exporting, and sharing snapshots of files and directories

Notifications You must be signed in to change notification settings

TOGoS/ContentCouch

Repository files navigation

ContentCouch

ContentCouch is a Git-like system designed to store snapshots of files and directories for backup/synchronization/sharing purposes.

The model is based on RDF, and is thereby somewhat agnostic to specific data formats and URI schemes. This implementation uses RDF+XML for serializing objects and bitprint URIs for referencing blobs.

The URI schemes and serialization formats are common to several projects. See https://github.com/TOGoS/ContentCouchRepositoryFormat for a more formal specification of the on-disk repository format.

Data model

Like Git, the data model consists of a graph of blobs, trees, and commits, and objects are referenced by hash.

Unlike Git:

  • Blobs are stored internally as themselves, optionally by hardlinking
    • Good for storing large media files, since data can be deduplicated at the filesystem level
    • Hash of the file = the object ID
    • Contrast to Git, which adds a small object type + length header, messing up your hashes
  • Data structures are serialized as RDF+XML
    • Easy to inspect or author with a text editor
    • Can be extended with arbitrary metadata as desired
  • Objects are referenced by URI
    • So different URI schemes may be used

To differentiate between a blob and the serialization of a non-blob data structure (e.g. the serialization of a Directory object), URIs of abstract concepts are made by prepending “x-rdf-subject:” to the URI of the object’s serialized form.

e.g. the URI “x-rdf-subject:urn:bitprint:7ZI4YNKSIYB3OZGVLAUYBBQ4YQZDS5BK.L4LVKJV4XBF7MSXVP7FXLCFN6PXGYKMYERATYQA” refers to a directory containing two subdirectories, which is RDF+XML encoded as ”urn:bitprint:7ZI4YNKSIYB3OZGVLAUYBBQ4YQZDS5BK.L4LVKJV4XBF7MSXVP7FXLCFN6PXGYKMYERATYQA”.

RDF Schema Documentation

  • http://ns.nuke24.net/ContentCouch/ - the primary namespace
  • http://ns.nuke24.net/ContentCouch/Blob - a finite byte squence, i.e. the contents of a file
    • http://bitzi.com/xmlns/2002/01/bz-core#fileLength - the length of the blob, in bytes
  • http://ns.nuke24.net/ContentCouch/Commit - a commit object; can point to a target and any number of parent commits
    • http://ns.nuke24.net/ContentCouch/description - commit message
    • http://ns.nuke24.net/ContentCouch/parent - a parent commit, corresponding to a state of the thing being versioned from which this state was derived. More than one parent attribute may be present (unlike directory entries, the set of parents is not modeled as a collection; you just have multiple attribute values).
    • http://ns.nuke24.net/ContentCouch/target - the target of the commit, usually but not necessarily a directory; a snapshot of the ‘thing being versioned’. Usually provided as a reference.
    • http://ns.nuke24.net/ContentCouch/targetType - (deprecated) indicator of what is to be found at the target, “Blob” or “Directory”
    • http://purl.org/dc/terms/created - creation date of the commit (e.g. “2009-02-25 14:40:39”)
    • http://purl.org/dc/terms/creator - name of the author (e.g. “Fred Q”)
  • http://ns.nuke24.net/ContentCouch/Directory - a directory
    • http://ns.nuke24.net/ContentCouch/entries - the list of DirectoryEntry objects
  • http://ns.nuke24.net/ContentCouch/DirectoryEntry - a directory entry
    • http://ns.nuke24.net/ContentCouch/name - directory entry name, i.e. filename of contained file or subdirectory
    • http://ns.nuke24.net/ContentCouch/target - content of the directory entry; probably a Directory or Blob
    • http://ns.nuke24.net/ContentCouch/targetType - (deprecated) indicator of what is to be found at the target, “Blob” or “Directory”
    • http://purl.org/dc/terms/modified - modification time of the file; may be omitted

Examples

urn:bitprint:7ZI4YNKSIYB3OZGVLAUYBBQ4YQZDS5BK.L4LVKJV4XBF7MSXVP7FXLCFN6PXGYKMYERATYQA (browseable link) describes a directory containing two subdirectories:

<Directory xmlns="http://ns.nuke24.net/ContentCouch/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<entries rdf:parseType="Collection">
		<DirectoryEntry>
			<name>2009</name>
			<target>
				<Directory rdf:about="x-rdf-subject:urn:bitprint:UNYPOT6UW4GQCZH3ZNLWXEDK7ULMEF6Y.YKMAGQZNPTIHD2WYGRTZOO7KAFPOOPAY6R3XICA"/>
			</target>
		</DirectoryEntry>
		<DirectoryEntry>
			<name>2010</name>
			<target>
				<Directory rdf:about="x-rdf-subject:urn:bitprint:7YXK4RQGBCGT5NH6SU7GEWVV2TUMACFN.GA2FOEGTT7YLO6YTV46AHXRMEIVT4B5JVMHIPHQ"/>
			</target>
		</DirectoryEntry>
	</entries>
</Directory>

A subdirectory of the above, ’2009’, containing two JPEG files:

<Directory xmlns="http://ns.nuke24.net/ContentCouch/" xmlns:bz="http://bitzi.com/xmlns/2002/01/bz-core#" xmlns:dc="http://purl.org/dc/terms/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<entries rdf:parseType="Collection">
		<DirectoryEntry>
			<name>0817-IMG_0233.JPG</name>
			<target>
				<Blob rdf:about="urn:bitprint:RO5JL2GVCACEG7MGIK4JATPB6RTWCLFC.INYA5BK64TKIB2B7FEXWBVD56H55FL3ZLF3TGGY">
					<bz:fileLength>797878</bz:fileLength>
				</Blob>
			</target>
			<dc:modified>2009-08-17 14:45:01 GMT</dc:modified>
		</DirectoryEntry>
		<DirectoryEntry>
			<name>0817-IMG_0236.JPG</name>
			<target>
				<Blob rdf:about="urn:bitprint:EQLJSS4FNM4NATV4CZDOISF7YMAVPPWL.GFKPQM5ZQ5FLZRX77PMTBI4RWG4H547MU4JW4OA">
					<bz:fileLength>651881</bz:fileLength>
				</Blob>
			</target>
			<dc:modified>2009-08-17 14:45:52 GMT</dc:modified>
		</DirectoryEntry>
	</entries>
</Directory>

Note: references to blobs can be made simply with a “urn:bitprint:” URI. References to objects described by blobs require the “x-rdf-subject:” prefix (in other situations I have simply appended “#” to mean the same thing; see my notes on the subject in ’Existing standards for hash-based URN schemes’).

Note: DirectoryEntry target attributes, when represented in RDF+XML, may link directly to the target (<target rdf:resource="urn:WHATEVER"/>) or provide some metadata (<target><Blob rdf:about="urn:WHATEVER"><bz:fileLength>1234</bz:fileLength></Blob></target>). The latter is redundant, but provides potentially useful information about the target so that it can be considered without having to actually download the linked object, which may be large or unavailable. e.g. to show a directory listing, one generally would like to see subdirectoriers listed differently than files, and for files to at least have their size shown. The deprecated targetType attribute served this same role for type information in a hackier way.

TODO: Example of a commit object

Implementation(s)

This repository contains the original, still-in-use but not-actively-maintained Java implementation of the system. It was created in 2007 targetting JDK 1.4. In 2008 I rearchitected the project using a ‘resource-oriented’ approach. Components communicate by making lightweight (no network sockets are involved) REST calls to each other. This provides some flexibility and uniformity between internal and external APIs, but resulted in logic for anything other than very basic GET/PUT blob calls being in kind of weird places, and important information being crammed into Request and Response metadata. Lesson learned: it’s probably best to keep your functions simple and the call tree shallow.

See also:

Old documentation

./doc/old-README.txt