From a98020f64aed0c59b50bd2a836bb7b86f5295d01 Mon Sep 17 00:00:00 2001 From: Juan Benet Date: Sun, 8 Nov 2015 08:09:28 -0500 Subject: [PATCH 01/31] ipld spec --- merkledag/ipld.md | 464 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 464 insertions(+) create mode 100644 merkledag/ipld.md diff --git a/merkledag/ipld.md b/merkledag/ipld.md new file mode 100644 index 000000000..23d254e84 --- /dev/null +++ b/merkledag/ipld.md @@ -0,0 +1,464 @@ +# IPLD -- the "thin-waist" merkle dag format. + +There are a variety of systems that use merkle-tree and hash-chain inspired datastructures (e.g. git, bittorrent, ipfs, tahoe-lafs, sfsro). IPLD defines: + +- **_merkle-links_**: the core unit of a merkle-graph +- **_merkle-dag_**: any graphs whose edges are _merkle-links_. +- **_merkle-paths_**: unix-style paths for traversing _merkle-dags_ with _named merkle-links** +- **IPLD Data Model**: a flexible, JSON-based data model for representing merkle-dags. +- **IPLD Serialized Formats**: a set of formats in which IPLD objects can be represented, for example JSON, CBOR, CSON, YAML, Protobuf, XML, RDF, etc. +- **IPLD Canonical Format**: a deterministic description on a serialized format that ensures the same _logical_ object is always serialized to _the exact same sequence of bits_. This is critical for merkle-linking, and all cryptographic applications. + +In short: JSON documents with named merkle-links that can be traversed. + +## Intro + +### What is a _merkle-link_? + +A _merkle-link_ is a link between two objects which is content-addressed with the _cryptographic hash_ of the target object, and embedded in the source object. Content addressing with merkle-links allows: + +- **Cryptographic Integrity Checking**: resolving a link's value can be tested by hashing. In turn, this allows wide, secure, trustless exchanges of data (e.g. git or bittorrent), as others cannot give you any data that does not hash to the link's value. 
+- **Immutable Datastructures**: data structures with merkle links cannot mutate, which is a nice property for distributed systems. This is useful for versioning, for representing distributed mutable state (eg CRDTs), and for long term archival. + +### What is a _merkle-graph_ or a _merkle-dag_? + +Objects with merkle-links form a Graph (merkle-graph), which necessarily is both Directed, and which can be counted on to be Acyclic, iff the properties of the cryptographic hash function hold. I.e. a _merkle-dag_. Hence all graphs which use _merkle-linking_ (_merkle-graph_) are necessarily also Directed Acyclic Graphs (DAGs, hence _merkle-dag_). + +### What is a _merkle-path_? + +A _merkle-path_ is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. + +For example, suppose we have this _merkle-path_: + +``` +/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d +``` +Where: +- `ipfs` is a protocol namespace (to allow the computer to discern what to do) +- `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` is a cryptographic hash. +- `a/b/c/d` is a path _traversal_, as in unix. +- this link traverses five objects. + +Resolving it involves looking up each object and attaining a hash value, then traversing to the next. + +``` + +-------------------+ +O_1 = | "a": "QmV76pU..." | whose hash value is QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k + +-------------------+ + | + v + +-------------------+ +O_2 = | "b": "QmV76pU..." | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT + +-------------------+ + | + v + +-------------------+ +O_3 = | "c": "QmV76pU..." | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE + +-------------------+ + | + v + +-------------------+ +O_4 = | "d": "QmV76pU..." 
| whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT + +-------------------+ + | + v + +-------------------+ +O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCNFiaGhP1UjywA43j + +-------------------+ +``` + +This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. + +## What is the IPLD Data Model? + +The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into. + +### Constraints and Desires + +Some Constraints: +- IPLD paths MUST be unambiguous. A given path string MUST always deterministically traverse to the same object. (e.g. avoid duplicating link names) +- IPLD paths MUST be universal and avoid opressing non-english societies (e.g. use UTF-8, not ASCII). +- IPLD paths MUST layer cleanly over UNIX and The Web (use `/`, have deterministic transforms for ASCII systems). +- Given the wide success of JSON, a huge number of systems present JSON interfaces. IPLD MUST be able to import and export to JSON trivially. +- The JSON data model is also very simple and easy to use. IPLD MUST be just as easy to use. +- Definining new datastructures MUST be trivially easy. It should not be cumbersome -- or require much knowledge -- to experiment with new definitions on top of IPLD. +- IPLD MUST be compatible with RDF and the proper W3C Semantic Web / Linked Data standards. We can achieve this easily through JSON-LD. +- IPLD Serialized Formats (on disk and on the wire) MUST be fast and space efficient. (should not use JSON as the storage format, and instead use CBOR or similar formats) +- IPLD cryptographic hashes MUST be upgradeable (use [multihash](https://github.com/jbenet/multihash)) + +Some nice-to-haves: +- IPLD SHOULD NOT carry over mistakes, e.g. the lack of integers in JSON. +- IPLD SHOULD be upgradable, e.g. 
if a better on-disk format emerges, systems should be able to migrate to it and minimize costs of doing so. +- IPLD objects SHOULD be able to resolve properties too as paths, not just merkle links. +- IPLD Canonical Format SHOULD be easy to write a parser for. +- IPLD Canonical Format SHOULD enable seeking without parsing full objects. (CBOR and Protobuf allow this). + + +### Format Definition + +(**NOTE:** Here we will use both JSON and YML to show what formats look like. We explicitly use both to show equivalence of the object across two different formats.) + +At its core, IPLD Data Model "is just JSON" in that it (a) is also tree based documents with a few primitive types, (b) maps 1:1 to json, (c) users can use it through JSON itself. It "is not JSON" in that (a) it improves on some mistakes, (b) has an efficient serialized representation, and (c) does not actually specify a single on-wire format, as the world is known to improve. + +#### Basic Node + +Here is an example IPLD object in JSON: + +```json +{ + "name": "Vannevar Bush" +} +``` + +Suppose it hashes to the multihash value `QmAAA...AAA`. Note that it has no links at all, just a string name value. But we are still be able to "resolve" the key `name` under it: + +```sh +> ipld cat --json QmAAA...AAA +{ + "name": "Vannevar Bush" +} + +> ipld cat --json QmAAA...AAA/name +"Vannevar Bush" +``` + +And -- of course -- we are able to view it in other formats + +```sh +> ipld cat --yml QmAAA...AAA +--- +name: Vannevar Bush + +> ipld cat --xml QmAAA...AAA + + + Vannevar Bush + +``` + +#### Linking Between Nodes + +Merkle-Linking between nodes is the reason for IPLD to exist. A Link in IPLD is just an embedded node with a special format: + +```js +{ + "title": "As We May Think", + "author": { + "mlink": "QmAAA...AAA" // links to the node above. + } +} +``` + +Suppose this hashes to the multihash value `QmBBB...BBB`. This node links the _subpath `author` to `QmAAA...AAA`, the node in the section above. 
So we can now do: + +```sh +> ipld cat --json QmBBB...BBB +{ + "title": "As We May Think", + "author": { + "mlink": "QmAAA...AAA" // links to the node above. + } +} + +> ipld cat --json QmBBB...BBB/author +{ + "title": "As We May Think", + "author": { + "mlink": "QmAAA...AAA" // links to the node above. + } +} + +> ipld cat --yml QmBBB...BBB/author +--- +title: As We May Think +author: + mlink: QmAAA...AAA + +> ipld cat --json QmBBB...BBB/author/name +"Vannevar Bush" +``` + +#### Link Properties + +IPLD allows for links to have other properties themselves. This is useful to encode other invormation into a link, such as the kind of relationship, or ancilliary data required in the link. This is _different from_ "Link Objects", discussed below, which are very useful in their own right. But sometimes, you just want to add a bit of data on the link and not have to make another object. IPLD doesn't get in your way. + +For example, supposed you have a file system, and want to assign metadata like permissions, or owners in the link between objects. Suppose you have a `directory` object with hash `QmCCC...CCC` like this: + +```js +{ + "foo": { + "mlink": "QmCCC...111" + "mode": "0755", + "owner": "jbenet" + }, + "cat.jpg": { + "mlink": "QmCCC...222" + "mode": "0644", + "owner": "jbenet" + }, + "doge.jpg": { + "mlink": "QmCCC...333", + "mode": "0644", + "owner": "jbenet" + } +} +``` + +or in YML + +```yml +--- +foo: + mlink: QmCCC...111 + mode: 0755 + owner: jbenet +cat.jpg: + mlink: QmCCC...222 + mode: 0644 + owner: jbenet +doge.jpg: + mlink: QmCCC...333 + mode: 0644 + owner: jbenet +``` + +Though we have new properties in the links that are _specific to this datastructure_, we can still resolve links just fine: + +```js +> ipld cat --json QmCCC...CCC/cat.jpg +{ + "data": "\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..." 
+} + +> ipld cat --json QmCCC...CCC/doge.jpg +{ + "subfiles": [ + { + "mlink": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" + }, + { + "mlink": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" + }, + { + "mlink": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" + } + ] +} + +> ipld cat --yml QmCCC...CCC/doge.jpg +--- +subfiles: + - mlink: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh + - mlink: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR + - mlink: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 + +> ipld cat --json QmCCC...CCC/doge.jpg/subfiles/1 +{ + "data": "\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..." +} +``` + +But we can't extract the link as nicely as other properties, as links are meant to _resolve through_. + +#### Duplicate property keys + +Note that having two properties with _the same_ name IS NOT ALLOWED, but actually impossible to prevent (someone will do it and feed it to parsers), so to be safe, we define the value of the path traversal to be _the first_ entry in the serialized representation. For example, suppose we have the object: + +```json +{ + "name": "J.C.R. Licklider", + "name": "Hans Moravec" +} +``` + +Suppose _this_ was the _exact order_ in the _Canonical Format_ (not json, but cbor), and it hashes to `QmDDD...DDD`. We would _ALWAYS_ get: + +```sh +> ipld cat --json QmDDD...DDD +{ + "name": "J.C.R. Licklider", + "name": "Hans Moravec" +} +> ipld cat --json QmDDD...DDD/name +"J.C.R. Licklider" +``` + + +#### Path Restrictions + +There are some important problems that come about with path descriptions in Unix and the web. For a discussion see (TODO link to path issue in go-ipfs or go-ipld). In order to be compatible with the models and expectations of unix and the web, IPLD explicitly disallows paths with certain path components. **Note that the data itself _may_ still contain these properties (someone will do it, and there are legitimate uses for it). 
So it is only _Path Resolvers_ that MUST NOT resolve through those paths.** The restrictions are the same as typical unix and UTF-8 path systems: + + +TODO: +- [ ] list path resolving restrictions +- [ ] show examples + +#### Integers in JSON + +IPLD is _directly compatible_ with JSON, to take advantage of JSON's successes, but it need not be _held back_ by JSON's mistakes. This is where we can afford to follow format idiomatic choices, though care MUST be given to ensure there is always a well-defined 1:1 mapping. + +On the subject of integers, there exist a variety of formats which represent integers as strings in JSON, for example, [EJSON](https://www.meteor.com/ejson). These can be used and conversion to and from other formats should happen naturally-- that is, when converting JSON to CBOR, an EJSON integer should be transformed naturally to a proper CBOR integer, instead of representing it as a map with string values. + + +## Serialized Data Formats + +IPLD supports a variety of serialized data formats trough [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. + +### Canonical Format + +In order to preserve merkle-linking's power, we muste ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. 
Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade. + +**The IPLD Canonical format is _canonicalized CBOR_.** + + +## Datastructure Examples + +It is important that IPLD be a simple, nimble, and flexible format that does not get in the way of users defining new or importing old datastractures. For this purpose, below I will show a few example data structures. + + +### Unix Filesystem + + +#### A small File + +```js +{ + "data": "hello world", + "size": "11" +} +``` + +#### A Chunked File + +Split into multiple independent sub-Files. + +```js +{ + "size": "1424119", + "subfiles": [ + { + "mlink": "QmAAA...", + "size": "100324" + }, + { + "mlink": "QmAA1...", + "size": "120345", + "repeat": "10" + }, + { + "mlink": "QmAA1...", + "size": "120345" + }, + ] +} +``` + +#### A Directory + +```js +{ + "foo": { + "mlink": "QmCCC...111" + "mode": "0755", + "owner": "jbenet" + }, + "cat.jpg": { + "mlink": "QmCCC...222" + "mode": "0644", + "owner": "jbenet" + }, + "doge.jpg": { + "mlink": "QmCCC...333", + "mode": "0644", + "owner": "jbenet" + } +} +``` + +### git + +#### git blob + +```js +{ + "data": "hello world" +} +``` + +#### git tree + +```js +{ + "foo": { + "mlink": "QmCCC...111" + "mode": "0755" + }, + "cat.jpg": { + "mlink": "QmCCC...222" + "mode": "0644" + }, + "doge.jpg": { + "mlink": "QmCCC...333", + "mode": "0644" + } +} +``` + +#### git commit + +```js +{ + "tree": {"mlink": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, + "parents": [ + {"mlink": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, + {"mlink": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} + ], + "author": { + "name": "Juan Batiz-Benet", + "email": "juan@benet.ai", + "time": "1435398707 -0700" + }, + "committer": { + "name": "Juan Batiz-Benet", + "email": "juan@benet.ai", + "time": "1435398707 -0700" + }, + "message": "Merge pull request #7 from ipfs/iprs\n\n(WIP) records + merkledag specs" +} +``` + +### Bitcoin + +#### 
Bitcoin Block + +```js +{ + "parent": {"mlink": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, + "transactions": {"mlink": "QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, + "nonce": "UJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8" +} +``` + +#### Bitcoin Transaction + +This time, im YML. TODO: make this a real txn + +```yml +--- +inputs: + - input: {mlink: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + amount: 100 +outputs: + - output: {mlink: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + amount: 50 + - output: {mlink: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} + amount: 30 + - output: {mlink: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} + amount: 15 + - output: {mlink: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} + amount: 5 +script: OP_VERIFY +``` + + + From 74bc794794c3cfea3e5229546766cd38efdf387d Mon Sep 17 00:00:00 2001 From: Juan Benet Date: Sun, 8 Nov 2015 08:16:09 -0500 Subject: [PATCH 02/31] link to paths issue --- merkledag/ipld.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 23d254e84..b159f3d46 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -286,7 +286,7 @@ Suppose _this_ was the _exact order_ in the _Canonical Format_ (not json, but cb #### Path Restrictions -There are some important problems that come about with path descriptions in Unix and the web. For a discussion see (TODO link to path issue in go-ipfs or go-ipld). In order to be compatible with the models and expectations of unix and the web, IPLD explicitly disallows paths with certain path components. **Note that the data itself _may_ still contain these properties (someone will do it, and there are legitimate uses for it). So it is only _Path Resolvers_ that MUST NOT resolve through those paths.** The restrictions are the same as typical unix and UTF-8 path systems: +There are some important problems that come about with path descriptions in Unix and the web. 
For a discussion see [this discussion](https://github.com/ipfs/go-ipfs/issues/1710). In order to be compatible with the models and expectations of unix and the web, IPLD explicitly disallows paths with certain path components. **Note that the data itself _may_ still contain these properties (someone will do it, and there are legitimate uses for it). So it is only _Path Resolvers_ that MUST NOT resolve through those paths.** The restrictions are the same as typical unix and UTF-8 path systems: TODO: From 37c662a73092e14fda73c986e7b89026dd0b7a82 Mon Sep 17 00:00:00 2001 From: Juan Benet Date: Sat, 21 Nov 2015 22:12:00 -0800 Subject: [PATCH 03/31] CR fixes --- merkledag/ipld.md | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index b159f3d46..619ac37c6 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -33,12 +33,15 @@ For example, suppose we have this _merkle-path_: ``` /ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d ``` + Where: - `ipfs` is a protocol namespace (to allow the computer to discern what to do) - `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` is a cryptographic hash. - `a/b/c/d` is a path _traversal_, as in unix. - this link traverses five objects. +Suppose also that this path points to the object `{ "hello": "world" }`. + Resolving it involves looking up each object and attaining a hash value, then traversing to the next. ``` @@ -82,7 +85,7 @@ Some Constraints: - Given the wide success of JSON, a huge number of systems present JSON interfaces. IPLD MUST be able to import and export to JSON trivially. - The JSON data model is also very simple and easy to use. IPLD MUST be just as easy to use. - Definining new datastructures MUST be trivially easy. It should not be cumbersome -- or require much knowledge -- to experiment with new definitions on top of IPLD. -- IPLD MUST be compatible with RDF and the proper W3C Semantic Web / Linked Data standards. 
We can achieve this easily through JSON-LD. +- Since IPLD is based on the JSON data model, it is fully compatible with RDF and Linked Data standards through JSON-LD. - IPLD Serialized Formats (on disk and on the wire) MUST be fast and space efficient. (should not use JSON as the storage format, and instead use CBOR or similar formats) - IPLD cryptographic hashes MUST be upgradeable (use [multihash](https://github.com/jbenet/multihash)) @@ -162,17 +165,12 @@ Suppose this hashes to the multihash value `QmBBB...BBB`. This node links the _s > ipld cat --json QmBBB...BBB/author { - "title": "As We May Think", - "author": { - "mlink": "QmAAA...AAA" // links to the node above. - } + "name": "Vannevar Bush" } > ipld cat --yml QmBBB...BBB/author --- -title: As We May Think -author: - mlink: QmAAA...AAA +name: "Vannevar Bush" > ipld cat --json QmBBB...BBB/author/name "Vannevar Bush" @@ -302,11 +300,11 @@ On the subject of integers, there exist a variety of formats which represent int ## Serialized Data Formats -IPLD supports a variety of serialized data formats trough [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. +IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. 
Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. ### Canonical Format -In order to preserve merkle-linking's power, we muste ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade. +In order to preserve merkle-linking's power, we must ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade. **The IPLD Canonical format is _canonicalized CBOR_.** From 878cb317c0dc7830deb51ebaa67091cb9913e451 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sat, 9 Jan 2016 00:52:43 +0100 Subject: [PATCH 04/31] IPLD: precisions about canonical format --- merkledag/ipld.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 619ac37c6..4fa69db06 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -308,6 +308,16 @@ In order to preserve merkle-linking's power, we must ensure that there is a sing **The IPLD Canonical format is _canonicalized CBOR_.** +The legacy canonical format is protocol buffers. 
+ +This canonical format is used to decide which format to use when creating the object for the first time and computing its hash. Once the format is decided for an IPLD object, it must be used in all communications so senders and receivers can check the data against the hash. + +For example, when sending a legacy object encoded in protocol buffers over the wire, the sender must not send the CBOR version as the receiver will not be able to check the file validity. + +In the same way, when the receiver is storing the object, it must make sure that the canonical format for this object is stored along with the object so it will be able to share the object with other peers. + +A simple way to store such objects with their format is to store them with their multicodec header. + ## Datastructure Examples From 2dd135e3c31e2bcd8f4adc1a2b6658dbf0675028 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sun, 10 Jan 2016 16:05:14 +0100 Subject: [PATCH 05/31] Rename Merkle links key from mlink to link --- merkledag/ipld.md | 75 ++++++++++++++++++++++++----------------------- 1 file changed, 38 insertions(+), 37 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 4fa69db06..e6f962bfb 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -20,6 +20,10 @@ A _merkle-link_ is a link between two objects which is content-addressed with th - **Cryptographic Integrity Checking**: resolving a link's value can be tested by hashing. In turn, this allows wide, secure, trustless exchanges of data (e.g. git or bittorrent), as others cannot give you any data that does not hash to the link's value. - **Immutable Datastructures**: data structures with merkle links cannot mutate, which is a nice property for distributed systems. This is useful for versioning, for representing distributed mutable state (eg CRDTs), and for long term archival. +A _merkle-link_ is represented in the IPLD object model by a map containing a key `link` whose value is the actual link. 
When dereferencing the link, the map itself is to be replaced by the object it points to. + +The link can either be a base58 hash, in which case it is assumed that it is a link in the `/ipfs` hierarchy, or directly the absolute path to the object. + ### What is a _merkle-graph_ or a _merkle-dag_? Objects with merkle-links form a Graph (merkle-graph), which necessarily is both Directed, and which can be counted on to be Acyclic, iff the properties of the cryptographic hash function hold. I.e. a _merkle-dag_. Hence all graphs which use _merkle-linking_ (_merkle-graph_) are necessarily also Directed Acyclic Graphs (DAGs, hence _merkle-dag_). @@ -147,7 +151,7 @@ Merkle-Linking between nodes is the reason for IPLD to exist. A Link in IPLD is { "title": "As We May Think", "author": { - "mlink": "QmAAA...AAA" // links to the node above. + "link": "QmAAA...AAA" // links to the node above. } } ``` @@ -159,7 +163,7 @@ Suppose this hashes to the multihash value `QmBBB...BBB`. This node links the _s { "title": "As We May Think", "author": { - "mlink": "QmAAA...AAA" // links to the node above. + "link": "QmAAA...AAA" // links to the node above. 
} } @@ -185,17 +189,17 @@ For example, supposed you have a file system, and want to assign metadata like p ```js { "foo": { - "mlink": "QmCCC...111" + "link": "QmCCC...111" "mode": "0755", "owner": "jbenet" }, "cat.jpg": { - "mlink": "QmCCC...222" + "link": "QmCCC...222" "mode": "0644", "owner": "jbenet" }, "doge.jpg": { - "mlink": "QmCCC...333", + "link": "QmCCC...333", "mode": "0644", "owner": "jbenet" } @@ -207,15 +211,15 @@ or in YML ```yml --- foo: - mlink: QmCCC...111 + link: QmCCC...111 mode: 0755 owner: jbenet cat.jpg: - mlink: QmCCC...222 + link: QmCCC...222 mode: 0644 owner: jbenet doge.jpg: - mlink: QmCCC...333 + link: QmCCC...333 mode: 0644 owner: jbenet ``` @@ -232,13 +236,13 @@ Though we have new properties in the links that are _specific to this datastruct { "subfiles": [ { - "mlink": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" + "link": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" }, { - "mlink": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" + "link": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" }, { - "mlink": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" + "link": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" } ] } @@ -246,9 +250,9 @@ Though we have new properties in the links that are _specific to this datastruct > ipld cat --yml QmCCC...CCC/doge.jpg --- subfiles: - - mlink: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh - - mlink: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR - - mlink: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 + - link: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh + - link: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR + - link: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 > ipld cat --json QmCCC...CCC/doge.jpg/subfiles/1 { @@ -300,7 +304,7 @@ On the subject of integers, there exist a variety of formats which represent int ## Serialized Data Formats -IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). 
These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. +IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. ### Canonical Format @@ -345,16 +349,16 @@ Split into multiple independent sub-Files. "size": "1424119", "subfiles": [ { - "mlink": "QmAAA...", + "link": "QmAAA...", "size": "100324" }, { - "mlink": "QmAA1...", + "link": "QmAA1...", "size": "120345", "repeat": "10" }, { - "mlink": "QmAA1...", + "link": "QmAA1...", "size": "120345" }, ] @@ -366,17 +370,17 @@ Split into multiple independent sub-Files. 
```js { "foo": { - "mlink": "QmCCC...111" + "link": "QmCCC...111" "mode": "0755", "owner": "jbenet" }, "cat.jpg": { - "mlink": "QmCCC...222" + "link": "QmCCC...222" "mode": "0644", "owner": "jbenet" }, "doge.jpg": { - "mlink": "QmCCC...333", + "link": "QmCCC...333", "mode": "0644", "owner": "jbenet" } @@ -398,15 +402,15 @@ Split into multiple independent sub-Files. ```js { "foo": { - "mlink": "QmCCC...111" + "link": "QmCCC...111" "mode": "0755" }, "cat.jpg": { - "mlink": "QmCCC...222" + "link": "QmCCC...222" "mode": "0644" }, "doge.jpg": { - "mlink": "QmCCC...333", + "link": "QmCCC...333", "mode": "0644" } } @@ -416,10 +420,10 @@ Split into multiple independent sub-Files. ```js { - "tree": {"mlink": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, + "tree": {"link": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, "parents": [ - {"mlink": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, - {"mlink": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} + {"link": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, + {"link": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} ], "author": { "name": "Juan Batiz-Benet", @@ -441,8 +445,8 @@ Split into multiple independent sub-Files. ```js { - "parent": {"mlink": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, - "transactions": {"mlink": "QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, + "parent": {"link": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, + "transactions": {"link": "QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, "nonce": "UJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8" } ``` @@ -454,19 +458,16 @@ This time, im YML. 
TODO: make this a real txn ```yml --- inputs: - - input: {mlink: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + - input: {link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} amount: 100 outputs: - - output: {mlink: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + - output: {link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} amount: 50 - - output: {mlink: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} + - output: {link: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} amount: 30 - - output: {mlink: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} + - output: {link: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} amount: 15 - - output: {mlink: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} + - output: {link: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} amount: 5 script: OP_VERIFY ``` - - - From 038188831cfb3be7ab37f69aad12b437561dc51a Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Thu, 7 Jan 2016 10:29:19 +0100 Subject: [PATCH 06/31] Relationship with Protocol Buffers legacy IPFS node format --- merkledag/ipld.md | 111 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 4fa69db06..b0937a980 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -319,6 +319,117 @@ In the same way, when the receiver is storing the object, it must make sure that A simple way to store such objects with their format is to store them with their multicodec header. +## Relationship with Protocol Buffers legacy IPFS node format + +IPLD has a known conversion with the legacy Protocol Buffers format. 
This format is defined with the Protocol Buffers syntax as: + + message PBLink { + optional bytes Hash = 1; + optional string Name = 2; + optional uint64 Tsize = 3; + } + + message PBNode { + repeated PBLink Links = 2; + optional bytes Data = 1; + } + +The conversion to the IPLD data model must have the following properties: + +- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. +- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. +- Link names should not conflict with other keys. + +There are multiple ways to do that that will be described next. + +### Current encoding in go-ipld + +go-ipld implements the following conversion: + + { + "": { + "hash": "", + "name": "", + "size": + }, + "": { + "hash": "", + "name": "", + "size": + }, + ... + "@attrs": { + "data": "", + "links": [ + { + "hash": "", + "name": "", + "size": + }, + { + "hash": "", + "name": "", + "size": + } + ] + } + } + +Notes : + +- The `links` array in the `@attrs` section is there to preserve order and duplicate links to hash back to the exact same protocol buffer object. + +- The link names are escaped to prevent clashing with the `@attr` key. + +- The escaping mechanism transforms the `@` character into `\@`. This mechanism also implies a modification of the path algorithm. When a path component contains the `@` character, it must be escaped to look it up in the IPLD Node object. + + For example, a path a path `/root/first@component/second@component/third_component` would look for object `root["first\@component"]["second\@component"]["third_component"]` (following mlinks when necessary). + +- Links are represented using the `hash` key instead of `mlink` as used in this specification. This must be changed. 
+ +### Other proposition that does away with escaping + +We can imagine another transformation where the link names are not escaped. For example: + + { + "": { + "mlink": "", + "tsize": + }, + "": { + "mlink": "", + "tsize": + }, + ... + ".": { + "data": "", + "links": [ + "", + { + "name": "", + "mlink": "", + "tsize": + } + "", + ... + ] + } + } + +Notes: + +- Very conveniently, we use the key `.` to represent data for the current node, and any other key can represent a link. This means that we forbid link to be named `.`. This is in any case a good thing to do as the `.` element in paths can always be removed (the same way `..` can be replaced by the parent directory) + +- No escaping is needed, and no modification to the path algorithm is needed. + +- Link order is kept by using the link names for links present in the top node. This avoid repeating identical data (even though it is probably generated on the fly). + + Links that cannot be present in the top node (the case for the link named `.`, which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object. + +### Other encodings + +Speak up in comments if you have other encoding suggestions, or you have another implementation with another encoding. + ## Datastructure Examples It is important that IPLD be a simple, nimble, and flexible format that does not get in the way of users defining new or importing old datastractures. For this purpose, below I will show a few example data structures. 
From bc2050cbebbde31ec9ff814527649bb8d4e7cfe3 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 09:07:14 +0100 Subject: [PATCH 07/31] Talk about escaping keys in merkle-paths --- merkledag/ipld.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index b0937a980..7450db344 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -72,6 +72,23 @@ O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCN This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. +**[In case we use escaping in protobuf IPLD format]** + +In order to not restrict individual path component by disallowing some file names and still allow storing arbitrary data in IPLD objects, path components must be escaped when they are looked up in IPLD objects. + +To escape a path component in order to look it up in an IPLD object: + +- every `\` character in the path component must be replaced with `\\` +- every `@` character in the path component must be replaced with `\@` + +This makes any key containing a `@` character unescaped in an IPLD object not accessible through a _filesystem merkle-path_. This is a reserved key that can be used to store auxiliary data without making it a link and visible in regular filesystems. This data can be made available in filesystems through extended attributes or opening and reading file contents. + +To unescape IPLD object keys that are not reserved and get the corresponding path component: + +- every `\@` sequence in the key must be replaced by `@` +- every `\\` sequence in the key must be replaced by `\` + + ## What is the IPLD Data Model? The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into. 
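The escape and unescape rules proposed in this patch can be sketched as follows. This is a minimal illustration, not go-ipld's implementation: the function names are hypothetical, and `lookup` walks plain nested dictionaries without following any merkle-links.

```python
import re


def escape_component(component: str) -> str:
    """Escape a path component before using it as an IPLD object key."""
    # Backslashes go first, so the backslash added for '@' is not doubled again.
    return component.replace("\\", "\\\\").replace("@", "\\@")


def unescape_key(key: str) -> str:
    """Inverse of escape_component, for keys that are not reserved."""
    # A single left-to-right pass turns '\@' into '@' and '\\' into '\'.
    return re.sub(r"\\([\\@])", lambda m: m.group(1), key)


def is_reserved_key(key: str) -> bool:
    """True if the key contains an '@' that is not escaped."""
    # Drop every valid escape pair; any '@' still left was unescaped.
    return "@" in re.sub(r"\\[\\@]", "", key)


def lookup(node: dict, path: str):
    """Resolve a path in nested dicts, escaping each component (hypothetical helper)."""
    for component in path.strip("/").split("/"):
        node = node[escape_component(component)]
    return node
```

Under these rules a reserved key such as `@attrs` can never be produced by `escape_component`, so no path component can reach it.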
@@ -338,7 +355,7 @@ The conversion to the IPLD data model must have the following properties: - It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. - When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. -- Link names should not conflict with other keys. +- Link names should be able to be any valid file name. As such, the encoding must ensure that link names do not conflict with other keys in the model. There are multiple ways to do that that will be described next. From 89dd82d801a7b3acc6033863564efe22f52e9553 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 19:40:54 +0100 Subject: [PATCH 08/31] Move (and update) section about protobuf compat to separate file --- merkledag/ipld-compat-protobuf.md | 138 ++++++++++++++++++++++++++++++ merkledag/ipld.md | 107 ----------------------- 2 files changed, 138 insertions(+), 107 deletions(-) create mode 100644 merkledag/ipld-compat-protobuf.md diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md new file mode 100644 index 000000000..525e995c9 --- /dev/null +++ b/merkledag/ipld-compat-protobuf.md @@ -0,0 +1,138 @@ +# IPLD conversion with Protocol Buffer legacy IPFS node format + +IPLD has a known conversion with the legacy Protocol Buffers format in order for new IPLD objects to interact with older protocol buffer objects. + +## Detecting if the legacy format is in use + +The format is encapsulated after a multicodec header that tells which codec to use. In addition, older applications that do not yet use the multicodec header will transmit a protocol buffer stream. 
This can be detected by looking at the first byte: + +- if the first byte is between 0 and 127, it is a multicodec header +- if the first byte if between 128 and 255, it is a protocol buffer stream + +In case a multicodec header is in use, the actual IPLD object is encapsulated first with a multicodec header which identifier is `/mdagv1`, then by a second header which identifier corresponds to the actual encoding of the object: + +- `/protobuf/msgio`: is the encapsulation for protocol buffer message +- `/json`: is the encapsulation for JSON encoding +- `/cbor`: is the encapsulation for CBOR encoding + +For example, a protocol buffer object encapsulated in a multicodec header would start with "`\x08/mdagv1\n\x10/protobuf/msgio\n`" corresponding to the bytes : + + 08 2f 6d 64 61 67 76 31 0a + 10 2f 70 72 6f 74 6f 62 75 66 2f 6d 73 67 69 6f 0a + +A JSON encoded object would start with "`\x08/mdagv1\n\x06/json\n`" and a CBOR encoded object would start with "`\x08/mdagv1\n\x06/cbor\n`". + + +## Description of the legacy protocol buffers format + +This format is defined with the Protocol Buffers syntax as: + + message PBLink { + optional bytes Hash = 1; + optional string Name = 2; + optional uint64 Tsize = 3; + } + + message PBNode { + repeated PBLink Links = 2; + optional bytes Data = 1; + } + +## Conversion to IPLD model + +The conversion to the IPLD data model must have the following properties: + +- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. +- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. +- Link names should be able to be any valid file name. As such, the encoding must ensure that link names do not conflict with other keys in the model. 
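The header bytes quoted in the detection section above can be reproduced with a short sketch. It assumes each header is a single length byte, the codec path, and a trailing newline, which matches the byte sequences given there; the helper names are hypothetical, and lengths above 127 are not handled.

```python
def header(path: str) -> bytes:
    """Build one multicodec-style header: length byte, path, newline."""
    body = path.encode("ascii") + b"\n"
    if len(body) > 127:
        raise ValueError("this sketch only handles single-byte lengths")
    return bytes([len(body)]) + body


def read_header(buf: bytes, offset: int = 0):
    """Parse one header at offset; return (path, offset of the next byte)."""
    length = buf[offset]
    body = buf[offset + 1 : offset + 1 + length]
    if len(body) != length or not body.endswith(b"\n"):
        raise ValueError("not a multicodec header")
    return body[:-1].decode("ascii"), offset + 1 + length
```

Calling `read_header` twice, passing the offset returned by the first call into the second, yields `/mdagv1` followed by the encoding path such as `/json` or `/cbor`.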
+ +There is a canonical form which is described below: + +**FIXME: decide on that form. Until now, multiple possible forms are presented here** + + +### Escape encoding + +A protocol buffer message would be converted the following way: + + { + "": { + "mlink": "", + "name": "", + "size": + }, + "": { + "mlink": "", + "name": "", + "size": + }, + ... + "@attrs": { + "data": "", + "links": [ + { + "mlink": "", + "name": "", + "size": + }, + { + "mlink": "", + "name": "", + "size": + } + ] + } + } + +Notes : + +- The `links` array in the `@attrs` section is there to preserve order and duplicate links to hash back to the exact same protocol buffer object. + +- Link hashes are encoded in base58 + +- The link names are escaped to prevent clashing with the `@attr` key. + +- The escaping mechanism transforms the `@` character into `\@`. This mechanism also implies a modification of the path algorithm. When a path component contains the `@` character, it must be escaped to look it up in the IPLD Node object. + + For example, a _filesystem merkle-path_ `/root/first@component/second@component/third_component` would look for object `root["first\@component"]["second\@component"]["third_component"]` (following mlinks when necessary). + +**FIXME: Using the `@` character is not mandatory. Any other character could fit. Don't hesitate to give your ideas.** + +### Other proposition that avoids escaping + +We can imagine another transformation where the link names are not escaped. For example: + + { + "": { + "mlink": "", + "tsize": + }, + "": { + "mlink": "", + "tsize": + }, + ... + ".": { + "data": "", + "links": [ + "", + { + "name": "", + "mlink": "", + "tsize": + } + "", + ... + ] + } + } + +Notes: + +- Very conveniently, we use the key `.` to represent data for the current node, and any other key can represent a link. This means that we forbid link to be named `.`. 
This is in any case a good thing to do as the `.` element in paths can always be removed (the same way `..` can be replaced by the parent directory) + +- No escaping is needed, and no modification to the path algorithm is needed. + +- Link order is kept by using the link names for links present in the top node. This avoid repeating identical data (even though it is probably generated on the fly). + + Links that cannot be present in the top node (the case for the link named `.`, which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object. diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 7450db344..affc300b8 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -336,113 +336,6 @@ In the same way, when the receiver is storing the object, it must make sure that A simple way to store such objects with their format is to store them with their multicodec header. -## Relationship with Protocol Buffers legacy IPFS node format - -IPLD has a known conversion with the legacy Protocol Buffers format. This format is defined with the Protocol Buffers syntax as: - - message PBLink { - optional bytes Hash = 1; - optional string Name = 2; - optional uint64 Tsize = 3; - } - - message PBNode { - repeated PBLink Links = 2; - optional bytes Data = 1; - } - -The conversion to the IPLD data model must have the following properties: - -- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. -- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. -- Link names should be able to be any valid file name. As such, the encoding must ensure that link names do not conflict with other keys in the model. 
- -There are multiple ways to do that that will be described next. - -### Current encoding in go-ipld - -go-ipld implements the following conversion: - - { - "": { - "hash": "", - "name": "", - "size": - }, - "": { - "hash": "", - "name": "", - "size": - }, - ... - "@attrs": { - "data": "", - "links": [ - { - "hash": "", - "name": "", - "size": - }, - { - "hash": "", - "name": "", - "size": - } - ] - } - } - -Notes : - -- The `links` array in the `@attrs` section is there to preserve order and duplicate links to hash back to the exact same protocol buffer object. - -- The link names are escaped to prevent clashing with the `@attr` key. - -- The escaping mechanism transforms the `@` character into `\@`. This mechanism also implies a modification of the path algorithm. When a path component contains the `@` character, it must be escaped to look it up in the IPLD Node object. - - For example, a path a path `/root/first@component/second@component/third_component` would look for object `root["first\@component"]["second\@component"]["third_component"]` (following mlinks when necessary). - -- Links are represented using the `hash` key instead of `mlink` as used in this specification. This must be changed. - -### Other proposition that does away with escaping - -We can imagine another transformation where the link names are not escaped. For example: - - { - "": { - "mlink": "", - "tsize": - }, - "": { - "mlink": "", - "tsize": - }, - ... - ".": { - "data": "", - "links": [ - "", - { - "name": "", - "mlink": "", - "tsize": - } - "", - ... - ] - } - } - -Notes: - -- Very conveniently, we use the key `.` to represent data for the current node, and any other key can represent a link. This means that we forbid link to be named `.`. This is in any case a good thing to do as the `.` element in paths can always be removed (the same way `..` can be replaced by the parent directory) - -- No escaping is needed, and no modification to the path algorithm is needed. 
- -- Link order is kept by using the link names for links present in the top node. This avoid repeating identical data (even though it is probably generated on the fly). - - Links that cannot be present in the top node (the case for the link named `.`, which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object. - ### Other encodings Speak up in comments if you have other encoding suggestions, or you have another implementation with another encoding. From 5b97e14c23a13a0869897951583361ddfc722fbc Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sun, 10 Jan 2016 16:42:14 +0100 Subject: [PATCH 09/31] Change protocol buffer compatibility format. Links needs not to be present at the top level. having them in a separate map removes all complexity of key escaping. --- merkledag/ipld-compat-protobuf.md | 113 ++++++++++++++++++++++++------ 1 file changed, 90 insertions(+), 23 deletions(-) diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md index 525e995c9..d72c96787 100644 --- a/merkledag/ipld-compat-protobuf.md +++ b/merkledag/ipld-compat-protobuf.md @@ -48,40 +48,107 @@ The conversion to the IPLD data model must have the following properties: There is a canonical form which is described below: -**FIXME: decide on that form. Until now, multiple possible forms are presented here** + { + "data": "", + "named-links": { + "": { + "link": "", + "name": "", + "size": + }, + "": { + "link": "", + "name": "", + "size": + }, + ... + } + "ordered-links": [ + "", + { + "name": "", + "link": "", + "tsize": + } + "", + ... + ] + } +- Here we assume that the link #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named link section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message. 
-### Escape encoding +- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects. -A protocol buffer message would be converted the following way: +- No escaping is needed and no conflict is possible + +----------------- + +### Simple variation on that solution { - "": { - "mlink": "", + "data": "", + "": { + "link": "", "name": "", "size": }, - "": { - "mlink": "", + "": { + "link": "", "name": "", "size": }, - ... - "@attrs": { - "data": "", - "links": [ - { - "mlink": "", - "name": "", - "size": - }, - { - "mlink": "", - "name": "", - "size": - } - ] + ... + "ordered-links": [ + "", + { + "name": "", + "link": "", + "tsize": + } + "", + ... + ] + } + +- Here we assume that the link #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named link section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message. + +- Link whose name would conflict with other top level keys are not included in the top level object. They are only accessible in `ordered-links` section by iterating through the values. + +- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects. + +- No escaping is needed and no conflict is possible + +### Other variation: escape encoding + +A protocol buffer message would be converted the following way: + + { + "data": "", + "named-links": { + "": { + "mlink": "", + "name": "", + "size": + }, + "": { + "mlink": "", + "name": "", + "size": + }, + ... 
} + "ordered-links": [ + { + "mlink": "", + "name": "", + "size": + }, + { + "mlink": "", + "name": "", + "size": + } + ] } Notes : @@ -98,7 +165,7 @@ Notes : **FIXME: Using the `@` character is not mandatory. Any other character could fit. Don't hesitate to give your ideas.** -### Other proposition that avoids escaping +### Other variation that avoids escaping We can imagine another transformation where the link names are not escaped. For example: From 691b2b0fe6b88b38b7684089316a63b26b00b591 Mon Sep 17 00:00:00 2001 From: findkiko Date: Tue, 12 Jan 2016 18:32:18 -0800 Subject: [PATCH 10/31] Fix traversal example in ipld-spec Copy/Paste error made it hard to follow design. --- merkledag/ipld.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 4fa69db06..069efe707 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -51,17 +51,17 @@ O_1 = | "a": "QmV76pU..." | whose hash value is QmUmg7BZC1YP1ca66rRtWKxpXp77WgV | v +-------------------+ -O_2 = | "b": "QmV76pU..." | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT +O_2 = | "b": "QmQmkZP..." | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT +-------------------+ | v +-------------------+ -O_3 = | "c": "QmV76pU..." | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE +O_3 = | "c": "QmWkyYN..." | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE +-------------------+ | v +-------------------+ -O_4 = | "d": "QmV76pU..." | whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT +O_4 = | "d": "QmR8Bzg..." 
| whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT +-------------------+ | v From 33ca56e7b27e66fab9aee58fbde2eab656ad96f7 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Tue, 9 Feb 2016 17:50:32 +0100 Subject: [PATCH 11/31] IPLD Protocol Buffer compatibility: fix errors Fix the paragraph about the first byte that is able to determine if the data in prefixed by a multicodec or is a protocol buffer object. --- merkledag/ipld-compat-protobuf.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md index d72c96787..860f81d94 100644 --- a/merkledag/ipld-compat-protobuf.md +++ b/merkledag/ipld-compat-protobuf.md @@ -2,14 +2,11 @@ IPLD has a known conversion with the legacy Protocol Buffers format in order for new IPLD objects to interact with older protocol buffer objects. -## Detecting if the legacy format is in use +## Detecting the format in use -The format is encapsulated after a multicodec header that tells which codec to use. In addition, older applications that do not yet use the multicodec header will transmit a protocol buffer stream. This can be detected by looking at the first byte: +The format is encapsulated after two multicodec headers. The first have the codec path `/mdagv1` and can be used to detect whether IPLD objects are transmitted or just legacy protocol buffer messages. 
-- if the first byte is between 0 and 127, it is a multicodec header -- if the first byte if between 128 and 255, it is a protocol buffer stream - -In case a multicodec header is in use, the actual IPLD object is encapsulated first with a multicodec header which identifier is `/mdagv1`, then by a second header which identifier corresponds to the actual encoding of the object: +The second multicodec header is used to detect the actual format in which the IPLD object is encoded: - `/protobuf/msgio`: is the encapsulation for protocol buffer message - `/json`: is the encapsulation for JSON encoding From 1ab421b3500e0e9e96ae312fcb8b5730899ba727 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Tue, 9 Feb 2016 21:03:10 +0100 Subject: [PATCH 12/31] Only keep first alternative. --- merkledag/ipld-compat-protobuf.md | 131 +----------------------------- 1 file changed, 4 insertions(+), 127 deletions(-) diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md index 860f81d94..6ae9ac0b4 100644 --- a/merkledag/ipld-compat-protobuf.md +++ b/merkledag/ipld-compat-protobuf.md @@ -49,12 +49,12 @@ There is a canonical form which is described below: "data": "", "named-links": { "": { - "link": "", + "@link": "", "name": "", "size": }, "": { - "link": "", + "@link": "", "name": "", "size": }, @@ -64,43 +64,8 @@ There is a canonical form which is described below: "", { "name": "", - "link": "", - "tsize": - } - "", - ... - ] - } - -- Here we assume that the link #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named link section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message. - -- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects. 
- -- No escaping is needed and no conflict is possible - ------------------ - -### Simple variation on that solution - - { - "data": "", - "": { - "link": "", - "name": "", - "size": - }, - "": { - "link": "", - "name": "", - "size": - }, - ... - "ordered-links": [ - "", - { - "name": "", - "link": "", - "tsize": + "@link": "", + "size": } "", ... @@ -109,94 +74,6 @@ There is a canonical form which is described below: - Here we assume that the link #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named link section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message. -- Link whose name would conflict with other top level keys are not included in the top level object. They are only accessible in `ordered-links` section by iterating through the values. - - Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects. - No escaping is needed and no conflict is possible - -### Other variation: escape encoding - -A protocol buffer message would be converted the following way: - - { - "data": "", - "named-links": { - "": { - "mlink": "", - "name": "", - "size": - }, - "": { - "mlink": "", - "name": "", - "size": - }, - ... - } - "ordered-links": [ - { - "mlink": "", - "name": "", - "size": - }, - { - "mlink": "", - "name": "", - "size": - } - ] - } - -Notes : - -- The `links` array in the `@attrs` section is there to preserve order and duplicate links to hash back to the exact same protocol buffer object. - -- Link hashes are encoded in base58 - -- The link names are escaped to prevent clashing with the `@attr` key. - -- The escaping mechanism transforms the `@` character into `\@`. This mechanism also implies a modification of the path algorithm. 
When a path component contains the `@` character, it must be escaped to look it up in the IPLD Node object. - - For example, a _filesystem merkle-path_ `/root/first@component/second@component/third_component` would look for object `root["first\@component"]["second\@component"]["third_component"]` (following mlinks when necessary). - -**FIXME: Using the `@` character is not mandatory. Any other character could fit. Don't hesitate to give your ideas.** - -### Other variation that avoids escaping - -We can imagine another transformation where the link names are not escaped. For example: - - { - "": { - "mlink": "", - "tsize": - }, - "": { - "mlink": "", - "tsize": - }, - ... - ".": { - "data": "", - "links": [ - "", - { - "name": "", - "mlink": "", - "tsize": - } - "", - ... - ] - } - } - -Notes: - -- Very conveniently, we use the key `.` to represent data for the current node, and any other key can represent a link. This means that we forbid link to be named `.`. This is in any case a good thing to do as the `.` element in paths can always be removed (the same way `..` can be replaced by the parent directory) - -- No escaping is needed, and no modification to the path algorithm is needed. - -- Link order is kept by using the link names for links present in the top node. This avoid repeating identical data (even though it is probably generated on the fly). - - Links that cannot be present in the top node (the case for the link named `.`, which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object. 
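The form this patch keeps (`data`, `named-links`, `ordered-links`, with `@link` keys) can be sketched as a round trip. The sketch assumes that `ordered-links` holds a bare name for a link already present in `named-links` and a full object for a later link that repeats a name, as the notes on duplicate names suggest. The function names, the input dictionary layout (`hash`, `name`, `tsize`), and the pre-encoded hash strings are all hypothetical.

```python
def pb_to_ipld(data, links):
    """Convert legacy node fields to the named-links / ordered-links form."""
    named = {}
    ordered = []
    for link in links:
        entry = {"@link": link["hash"], "name": link["name"], "size": link["tsize"]}
        if link["name"] not in named:
            # First link with this name: reachable by name, referenced by name.
            named[link["name"]] = entry
            ordered.append(link["name"])
        else:
            # Duplicate name: kept in full so the original order can be rebuilt.
            ordered.append(entry)
    return {"data": data, "named-links": named, "ordered-links": ordered}


def ipld_to_pb_links(node):
    """Rebuild the ordered link list, duplicates included."""
    links = []
    for item in node["ordered-links"]:
        entry = node["named-links"][item] if isinstance(item, str) else item
        links.append({"hash": entry["@link"], "name": entry["name"], "tsize": entry["size"]})
    return links
```

Because `ordered-links` records every link in its original position, the list returned by `ipld_to_pb_links` matches the input list, which is what lets the protocol buffer message be re-serialized to the same bytes.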
From d1ceeb303c4fc4ce648da92e54694cd0c4f61b00 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Tue, 9 Feb 2016 22:07:02 +0100 Subject: [PATCH 13/31] Do not make use of escaping --- merkledag/ipld.md | 21 --------------------- 1 file changed, 21 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index affc300b8..4fa69db06 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -72,23 +72,6 @@ O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCN This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. -**[In case we use escaping in protobuf IPLD format]** - -In order to not restrict individual path component by disallowing some file names and still allow storing arbitrary data in IPLD objects, path components must be escaped when they are looked up in IPLD objects. - -To escape a path component in order to look it up in an IPLD object: - -- every `\` character in the path component must be replaced with `\\` -- every `@` character in the path component must be replaced with `\@` - -This makes any key containing a `@` character unescaped in an IPLD object not accessible through a _filesystem merkle-path_. This is a reserved key that can be used to store auxiliary data without making it a link and visible in regular filesystems. This data can be made available in filesystems through extended attributes or opening and reading file contents. - -To unescape IPLD object keys that are not reserved and get the corresponding path component: - -- every `\@` sequence in the key must be replaced by `@` -- every `\\` sequence in the key must be replaced by `\` - - ## What is the IPLD Data Model? The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into. 
@@ -336,10 +319,6 @@ In the same way, when the receiver is storing the object, it must make sure that A simple way to store such objects with their format is to store them with their multicodec header. -### Other encodings - -Speak up in comments if you have other encoding suggestions, or you have another implementation with another encoding. - ## Datastructure Examples It is important that IPLD be a simple, nimble, and flexible format that does not get in the way of users defining new or importing old datastractures. For this purpose, below I will show a few example data structures. From 6f5a6017f93895b64b092c7ac384a2f38d9fabd6 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Tue, 9 Feb 2016 22:21:55 +0100 Subject: [PATCH 14/31] Replace link key by @link --- merkledag/ipld.md | 74 ++++++++++++++++++++++++----------------------- 1 file changed, 38 insertions(+), 36 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index e6f962bfb..33c6d3481 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -20,9 +20,11 @@ A _merkle-link_ is a link between two objects which is content-addressed with th - **Cryptographic Integrity Checking**: resolving a link's value can be tested by hashing. In turn, this allows wide, secure, trustless exchanges of data (e.g. git or bittorrent), as others cannot give you any data that does not hash to the link's value. - **Immutable Datastructures**: data structures with merkle links cannot mutate, which is a nice property for distributed systems. This is useful for versioning, for representing distributed mutable state (eg CRDTs), and for long term archival. -A _merkle-link_ is represented in the IPLD object model by a map containing a key `link` which value is the actual link. When dereferencing the link, the map itself is to be replaced by the object it points to. +A _merkle-link_ is represented in the IPLD object model by a map containing a key `@link` mapped to a string value: the actual link. 
When dereferencing the link, the map itself is to be replaced by the object it points to unless the link path is invalid. -The link can either be a base58 hash, in which case it is assumed that it is a link in the `/ipfs` hierarchy, or directly the absolute path to the object. +The link can either be a base58 hash, in which case it is assumed that it is a link in the `/ipfs` hierarchy, or directly the absolute path to the object. Currently, only the `/ipfs` hierarchy is allowed. + +If an application wants to use the `@link` key for other purposes, the application itself is responsible to escape the keys in the IPLD object so that the application keys do not conflict with the `@link` key. When discussing application specific paths, it may be worth escaping all keys starting with `@` in case future versions of IPLD make use of other keys. ### What is a _merkle-graph_ or a _merkle-dag_? @@ -151,7 +153,7 @@ Merkle-Linking between nodes is the reason for IPLD to exist. A Link in IPLD is { "title": "As We May Think", "author": { - "link": "QmAAA...AAA" // links to the node above. + "@link": "QmAAA...AAA" // links to the node above. } } ``` @@ -163,7 +165,7 @@ Suppose this hashes to the multihash value `QmBBB...BBB`. This node links the _s { "title": "As We May Think", "author": { - "link": "QmAAA...AAA" // links to the node above. + "@link": "QmAAA...AAA" // links to the node above. 
} } @@ -189,17 +191,17 @@ For example, supposed you have a file system, and want to assign metadata like p ```js { "foo": { - "link": "QmCCC...111" + "@link": "QmCCC...111" "mode": "0755", "owner": "jbenet" }, "cat.jpg": { - "link": "QmCCC...222" + "@link": "QmCCC...222" "mode": "0644", "owner": "jbenet" }, "doge.jpg": { - "link": "QmCCC...333", + "@link": "QmCCC...333", "mode": "0644", "owner": "jbenet" } @@ -211,15 +213,15 @@ or in YML ```yml --- foo: - link: QmCCC...111 + @link: QmCCC...111 mode: 0755 owner: jbenet cat.jpg: - link: QmCCC...222 + @link: QmCCC...222 mode: 0644 owner: jbenet doge.jpg: - link: QmCCC...333 + @link: QmCCC...333 mode: 0644 owner: jbenet ``` @@ -236,13 +238,13 @@ Though we have new properties in the links that are _specific to this datastruct { "subfiles": [ { - "link": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" + "@link": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" }, { - "link": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" + "@link": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" }, { - "link": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" + "@link": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" } ] } @@ -250,9 +252,9 @@ Though we have new properties in the links that are _specific to this datastruct > ipld cat --yml QmCCC...CCC/doge.jpg --- subfiles: - - link: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh - - link: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR - - link: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 + - @link: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh + - @link: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR + - @link: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 > ipld cat --json QmCCC...CCC/doge.jpg/subfiles/1 { @@ -304,7 +306,7 @@ On the subject of integers, there exist a variety of formats which represent int ## Serialized Data Formats -IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). 
These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. +IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. ### Canonical Format @@ -349,16 +351,16 @@ Split into multiple independent sub-Files. "size": "1424119", "subfiles": [ { - "link": "QmAAA...", + "@link": "QmAAA...", "size": "100324" }, { - "link": "QmAA1...", + "@link": "QmAA1...", "size": "120345", "repeat": "10" }, { - "link": "QmAA1...", + "@link": "QmAA1...", "size": "120345" }, ] @@ -370,17 +372,17 @@ Split into multiple independent sub-Files. 
```js { "foo": { - "link": "QmCCC...111" + "@link": "QmCCC...111" "mode": "0755", "owner": "jbenet" }, "cat.jpg": { - "link": "QmCCC...222" + "@link": "QmCCC...222" "mode": "0644", "owner": "jbenet" }, "doge.jpg": { - "link": "QmCCC...333", + "@link": "QmCCC...333", "mode": "0644", "owner": "jbenet" } @@ -402,15 +404,15 @@ Split into multiple independent sub-Files. ```js { "foo": { - "link": "QmCCC...111" + "@link": "QmCCC...111" "mode": "0755" }, "cat.jpg": { - "link": "QmCCC...222" + "@link": "QmCCC...222" "mode": "0644" }, "doge.jpg": { - "link": "QmCCC...333", + "@link": "QmCCC...333", "mode": "0644" } } @@ -420,10 +422,10 @@ Split into multiple independent sub-Files. ```js { - "tree": {"link": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, + "tree": {"@link": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, "parents": [ - {"link": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, - {"link": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} + {"@link": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, + {"@link": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} ], "author": { "name": "Juan Batiz-Benet", @@ -445,8 +447,8 @@ Split into multiple independent sub-Files. ```js { - "parent": {"link": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, - "transactions": {"link": "QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, + "parent": {"@link": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, + "transactions": {"@link": "QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, "nonce": "UJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8" } ``` @@ -458,16 +460,16 @@ This time, im YML. 
TODO: make this a real txn ```yml --- inputs: - - input: {link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + - input: {@link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} amount: 100 outputs: - - output: {link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} + - output: {@link: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} amount: 50 - - output: {link: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} + - output: {@link: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} amount: 30 - - output: {link: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} + - output: {@link: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} amount: 15 - - output: {link: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} + - output: {@link: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} amount: 5 script: OP_VERIFY ``` From c8452235e33979c811f37004943a05004eb17c53 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Thu, 11 Feb 2016 21:46:02 +0100 Subject: [PATCH 15/31] Minor wording tweaks --- merkledag/ipld-compat-protobuf.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md index 6ae9ac0b4..59839c719 100644 --- a/merkledag/ipld-compat-protobuf.md +++ b/merkledag/ipld-compat-protobuf.md @@ -39,8 +39,8 @@ This format is defined with the Protocol Buffers syntax as: The conversion to the IPLD data model must have the following properties: -- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. -- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. +- It MUST be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). 
This implies that ordering and duplicate links must be preserved in some way. +- When using paths as defined in the IPLD specification, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. - Link names should be able to be any valid file name. As such, the encoding must ensure that link names do not conflict with other keys in the model. There is a canonical form which is described below: From 33e48ff720790ea2be5187c8323d1ed2ad40a3ae Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 20:40:51 +0100 Subject: [PATCH 16/31] Describe CBOR tagging --- merkledag/ipld.md | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index bb30bbd63..8f3adba35 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -308,11 +308,37 @@ On the subject of integers, there exist a variety of formats which represent int IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. +## Serialised CBOR with tags + +IPLD objects can be represented using cbor using the tags described below when possible. 
Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4): + +- ``: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrences of `\` are considered to be `\\`, and every occurrences of `@` are considered to be `\@`. + +- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string. + +- ``: the text string (major type 2) that follows (or the byte string tagged with ``) is to be interpreted with "`/ipfs/`" added in front of the string. + +- ``: an array (major type 4) must follow. The array must have two elements: a text string (or a byte string tagged using ``) followed by a map (major type 5). This whole must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string contained in the array as value. + +**FIXME:** register tags with IANA. + +When encoding an IPLD node to CBOR with tags, these tags must be included whenever possible, and avoided if not necessary. This will ensure a unique encoding across implementations. More specifically (and in this order): + +- If map key is a text string `s` and `escape(unescape(s)) == s`, then this string is transformed to `unescape(s)` and tagged with `` + +- if a text string starts with "`/ipfs/`", this prefix is removed and the string is tagged with ``. + +- If a text string contains a valid base58 encoded value, it is converted to a binary string and tagged with `` + +- If a map contains an entry which key is the text string "`link`", this entry is removed from the map, an array is created containing the entry value and the map, and this array is prefixed by the tag ``. The result is used in place of the map. 
+ +When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags. + ### Canonical Format In order to preserve merkle-linking's power, we must ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade. -**The IPLD Canonical format is _canonicalized CBOR_.** +**The IPLD Canonical format is _canonicalized CBOR with tags_.** The legacy canonical format is protocol buffers. From 9c22d9a1005ae65a61ce423eae613c73490a8dd4 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sun, 10 Jan 2016 15:44:53 +0100 Subject: [PATCH 17/31] CBOR tagging: simplify tagging and remove key escapes management --- merkledag/ipld.md | 33 +++++++++++++++++++++++---------- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 8f3adba35..bb7a7bef1 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -312,25 +312,38 @@ IPLD supports a variety of serialized data formats through [multicodec](https:// IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4): -- ``: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrences of `\` are considered to be `\\`, and every occurrences of `@` are considered to be `\@`. 
+- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string (using the IPFS alphabet). -- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string. +- ``: an array (major type 4) of two or three elements (link prefix, link hash (optional) and map) must follow: -- ``: the text string (major type 2) that follows (or the byte string tagged with ``) is to be interpreted with "`/ipfs/`" added in front of the string. + - The link prefix must be an integer representing the first path of the link, or a text string appended at the beginning of the link. Available integers are: + - `1`: represents the prefix `/ipfs/` -- ``: an array (major type 4) must follow. The array must have two elements: a text string (or a byte string tagged using ``) followed by a map (major type 5). This whole must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string contained in the array as value. + - The link hash, either a text string to be appended after the link prefix, or a tag `` followed by the binary string representing the hash digest. -**FIXME:** register tags with IANA. + - a map -When encoding an IPLD node to CBOR with tags, these tags must be included whenever possible, and avoided if not necessary. This will ensure a unique encoding across implementations. More specifically (and in this order): + The link value is constructed by concatenating the link prefix and the link hash (if present) in its text form. 
-- If map key is a text string `s` and `escape(unescape(s)) == s`, then this string is transformed to `unescape(s)` and tagged with `` + This must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string representing the link formed by the first two elements of the array. When iterating over the map, this entry must appear first. -- if a text string starts with "`/ipfs/`", this prefix is removed and the string is tagged with ``. +**TODO:** -- If a text string contains a valid base58 encoded value, it is converted to a binary string and tagged with `` +- [ ] register tags with IANA. +- [ ] specify tags we use for escaping (if we want to store escaped string in unescaped form) -- If a map contains an entry which key is the text string "`link`", this entry is removed from the map, an array is created containing the entry value and the map, and this array is prefixed by the tag ``. The result is used in place of the map. +When encoding an IPLD node to CBOR with tags, some conversion steps are necessary (in this order): + +- if a map contains an entry which key is the text string `link` and the value is a text string, the map is converted to a link object: + + - the `link` entry is removed from the map + - if the link value cannot be split in a prefix and a base58 suffix, an array is created with the link value (a text string) and the transformed map. 
+ - else, the link is split in a textual prefix and a base58 binary digest (the base58 value is decoded) and an array with the prefix, the `` followed by the binary hash, and the transformed map is created + - the original map is transformed to a `` followed by the array just created + +- if a text string is a canonical base58 representation of a binary string, the text string is converted to binary and `` is added at the beginning + +- if a text string is a canonical base64 representation (with no stray characters) of a binary string, the text string is converted to binary and the tag `22` (defined in RFC7049) is added at the beginning When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags. From ccadf1d3f023a178c0813e029d9def2bb5baffb5 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Tue, 9 Feb 2016 23:02:42 +0100 Subject: [PATCH 18/31] Simple CBOR format --- merkledag/ipld.md | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index bb7a7bef1..cca5526aa 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -310,42 +310,28 @@ IPLD supports a variety of serialized data formats through [multicodec](https:// ## Serialised CBOR with tags -IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4): +IPLD objects can be represented using cbor using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). -- ``: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). 
This text string is the base58 encoded version of the byte string (using the IPFS alphabet). +A tag `` is defined. This tag must be followed by an array (major type 4) containing two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5). -- ``: an array (major type 4) of two or three elements (link prefix, link hash (optional) and map) must follow: +When encoding an IPLD object to CBOR, every map that contain a link key is transformed to a `` followed by the array containing the link and then containing the CBOR version of the map without the link key. - - The link prefix must be an integer representing the first path of the link, or a text string appended at the beginning of the link. Available integers are: - - `1`: represents the prefix `/ipfs/` +- if the link key is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is stored as a binary multiaddress as the first array item. +- else, the link is stored as text as the first array item. - - The link hash, either a text string to be appended after the link prefix, or a tag `` followed by the binary string representing the hash digest. +When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed. - - a map +- If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. +- The map that follows is augmented with a new pair. The key is the standard IPLD link property, the value is the link in its textual format. +- When iterating over this augmented map, the link property must come first and not in any other order. This guarantee a consistent ordering. +- This augmented map is used instead of the `` in the IPLD output. 
- The link value is constructed by concatenating the link prefix and the link hash (if present) in its text form. - - This must be interpreted as a map identical to the map of the array, but with an additional entry. The additional entry would have a text string containing `link` as a key, and the text string representing the link formed by the first two elements of the array. When iterating over the map, this entry must appear first. +When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags. **TODO:** - [ ] register tags with IANA. -- [ ] specify tags we use for escaping (if we want to store escaped string in unescaped form) - -When encoding an IPLD node to CBOR with tags, some conversion steps are necessary (in this order): - -- if a map contains an entry which key is the text string `link` and the value is a text string, the map is converted to a link object: - - - the `link` entry is removed from the map - - if the link value cannot be split in a prefix and a base58 suffix, an array is created with the link value (a text string) and the transformed map. 
- - else, the link is split in a textual prefix and a base58 binary digest (the base58 value is decoded) and an array with the prefix, the `` followed by the binary hash, and the transformed map is created - - the original map is transformed to a `` followed by the array just created -- if a text string is a canonical base58 representation of a binary string, the text string is converted to binary and `` is added at the beginning - -- if a text string is a canonical base64 representation (with no stray characters) of a binary string, the text string is converted to binary and the tag `22` (defined in RFC7049) is added at the beginning - -When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags. ### Canonical Format @@ -353,6 +339,8 @@ In order to preserve merkle-linking's power, we must ensure that there is a sing **The IPLD Canonical format is _canonicalized CBOR with tags_.** +Users of this format should not expect any specific ordering of the keys, as the keys might be ordered differently in non canonical formats. + The legacy canonical format is protocol buffers. This canonical format is used to decide which format to use when creating the object for the first time and computing its hash. Once the format is decided for an IPLD object, it must be used in all communications so senders and receivers can check the data against the hash. 
From 7619b64e5305ef74893c144bbe02a1a39f365c07 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Wed, 10 Feb 2016 12:55:43 +0100 Subject: [PATCH 19/31] Change wording and don't store an empty map when links have no attributes --- merkledag/ipld.md | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index cca5526aa..2c8d7bf40 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -310,27 +310,35 @@ IPLD supports a variety of serialized data formats through [multicodec](https:// ## Serialised CBOR with tags -IPLD objects can be represented using cbor using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). +IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). -A tag `` is defined. This tag must be followed by an array (major type 4) containing two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5). +A tag `` is defined. This tag must be followed by an array (major type 4) containing one or two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5) and can be omitted if the map is empty. The canonical format is to omit this map if it is empty. -When encoding an IPLD object to CBOR, every map that contain a link key is transformed to a `` followed by the array containing the link and then containing the CBOR version of the map without the link key. 
+When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm: -- if the link key is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is stored as a binary multiaddress as the first array item. -- else, the link is stored as text as the first array item. +- If the IPLD object doesn't contain a link property, it is encoded in CBOR as a map. +- If the IPLD object contain a link property but it is not a string, it is encoded in CBOR as a map. +- The link property is extracted and the object is converted to a map that don't contain the link. +- If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). +- Else, the link is stored as text (major type 3) +- A CBOR array is constructed containing the link as first item +- If the map created earlier is not empty, the map is added to the array as its second item +- The array is prefixed by the ``, this is the final CBOR representation of a link. -When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed. +When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed by the following algorithm: - If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. -- The map that follows is augmented with a new pair. The key is the standard IPLD link property, the value is the link in its textual format. -- When iterating over this augmented map, the link property must come first and not in any other order. 
This guarantee a consistent ordering. -- This augmented map is used instead of the `` in the IPLD output. +- If the array contains a second item (which should be a map), it is extracted. Else an empty map is created. +- The map is augmented with a new key value pair. The key is the standard IPLD link property, the valus is the string containing the link. +- This map should be interpreted as an IPLD object instead of the tag. +- When iterating over the map in its canonical form, the link must be come before every other key even if the canonical CBOR order says otherwise. + +When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers should be able to use an optimized reading process to detect links using these tags. -When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers will be able to use an optimized reading process to detect links using these tags. **TODO:** -- [ ] register tags with IANA. +- [ ] register tag with IANA. ### Canonical Format @@ -339,6 +347,8 @@ In order to preserve merkle-linking's power, we must ensure that there is a sing **The IPLD Canonical format is _canonicalized CBOR with tags_.** +The canonical CBOR format must follow rules defines in [RFC 7049 section 3.9](http://tools.ietf.org/html/rfc7049#section-3.9) in addition to the rules defined here. + Users of this format should not expect any specific ordering of the keys, as the keys might be ordered differently in non canonical formats. The legacy canonical format is protocol buffers. 
From 1a3f4a99cfb85d047741892c5554e16d897f9ba5 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Wed, 10 Feb 2016 13:15:13 +0100 Subject: [PATCH 20/31] Don't require the tag to be followed by an array if there are no properties --- merkledag/ipld.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 2c8d7bf40..4acdc42ce 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -312,7 +312,10 @@ IPLD supports a variety of serialized data formats through [multicodec](https:// IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). -A tag `` is defined. This tag must be followed by an array (major type 4) containing one or two elements. The first being either a text string (major type 3) or a byte string (major type 2). The second element is defined to be a map (major type 5) and can be omitted if the map is empty. The canonical format is to omit this map if it is empty. +A tag `` is defined. This tag can be followed by: + +- a text string (major type 3) or byte string (major type 2) corresponding to the link target. This is the canonical format for links with no link properties. +- an array (major type 4) containing as first element the link target (text or binary string) and as optional second argument the link properties (a map, major type 5) When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm: @@ -321,15 +324,15 @@ When encoding an IPLD object to CBOR, every IPLD object can be considered to be - The link property is extracted and the object is converted to a map that don't contain the link. 
- If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). - Else, the link is stored as text (major type 3) -- A CBOR array is constructed containing the link as first item -- If the map created earlier is not empty, the map is added to the array as its second item -- The array is prefixed by the ``, this is the final CBOR representation of a link. +- If the map created earlier is empty, the resulting encoding is the `` followed by the CBOR representation of the link +- If the map is not empty, the resulting encoding is the `` followed by an array of two elements containing the link followed by the map -When decoding CBOR and converting it to IPLD, each occurences of `` with its following array is transformed by the following algorithm: +When decoding CBOR and converting it to IPLD, each occurences of `` is transformed by the following algorithm: -- If the first array item is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. -- If the array contains a second item (which should be a map), it is extracted. Else an empty map is created. -- The map is augmented with a new key value pair. The key is the standard IPLD link property, the valus is the string containing the link. +- If the following value is an array, its elements are extracted. First the link followed by the link properties. If there are no link properties, an empty map is used instead. +- Else, the following value must be the link, which is extracted. The link properties are created as an empty map. +- If the link is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. 
+- The map of the link properties is augmented with a new key value pair. The key is the standard IPLD link property, the value is the textual string containing the link. - This map should be interpreted as an IPLD object instead of the tag. - When iterating over the map in its canonical form, the link must be come before every other key even if the canonical CBOR order says otherwise. From dfb8903b182586a6e8ebdf764bbd4694ccc31161 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Thu, 11 Feb 2016 21:53:22 +0100 Subject: [PATCH 21/31] Minor edits --- merkledag/ipld.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 4acdc42ce..9e14a378f 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -308,7 +308,7 @@ On the subject of integers, there exist a variety of formats which represent int IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. -## Serialised CBOR with tags +### Serialised CBOR with tags IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). @@ -320,8 +320,8 @@ A tag `` is defined. 
This tag can be followed by: When encoding an IPLD object to CBOR, every IPLD object can be considered to be encoded using `` using this algorithm: - If the IPLD object doesn't contain a link property, it is encoded in CBOR as a map. -- If the IPLD object contain a link property but it is not a string, it is encoded in CBOR as a map. -- The link property is extracted and the object is converted to a map that don't contain the link. +- If the IPLD object contains a link property but it is not a string, it is encoded in CBOR as a map. +- The link property is extracted and the object is converted to a map that doesn't contain the link. - If the link is a valid [multiaddress](https://github.com/jbenet/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). - Else, the link is stored as text (major type 3) - If the map created earlier is empty, the resulting encoding is the `` followed by the CBOR representation of the link From a15797a2891c4af95d01f2d0bfe1bc379acfbfc4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Shanti=20Bouchez-Mongard=C3=A9?= Date: Thu, 7 Jan 2016 15:08:23 +0100 Subject: [PATCH 22/31] Separate filesystem merkle-path from IPLD merkle-path --- merkledag/ipld.md | 88 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 87 insertions(+), 1 deletion(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index bb30bbd63..dcc8f77ee 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -32,7 +32,18 @@ Objects with merkle-links form a Graph (merkle-graph), which necessarily is both ### What is a _merkle-path_? -A _merkle-path_ is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. 
Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. +A merkle-path is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and allows access of elements of the referenced node and other nodes transitively. + +There is no single merkle-path, but there are two: + +- merkle-path for filesystems: this is a merkle-path that is designed to be used in the context of filesystems (that also includes network protocols such as HTTP or FTP). Their idea is to be as close as possible to the traditional filesystem semantic +- merkle-path for IPLD: this is a merkle-path that can be used to access more elements of the IPLD data model (specifically: link properties) but that doesn't fit within the traditional filesystem model. + +When you use a merkle path, make sure of which one you use. Command line tools are encouraged to allow switching between the two flavors using a switch. + +### Filesystem merkle-path + +A _filesystem merkle-path_ is a unix-style path which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. For example, suppose we have this _merkle-path_: @@ -78,6 +89,81 @@ O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCN This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. +### IPLD merkle-path (best solution) + +An _IPLD merkle-path_ is an extension of a _filesystem merkle-path_ which uses a special syntax to access link properties. + +Path elements are suffixed by either `.link` to access the link properties or by `.object` to dereference the _merkle-link_. 
if no suffix is present, the _merkle-link_ is dereferenced (to be compatible with _filesystem merkle-paths_ in most cases) + +**FIXME**: perhaps use different suffixes so we are less likely to have ambiguities. Using a character that is denied by Windows would be a good idea since those are less likely to be present in filenames. For most cases, this would make _IPLD merkle-paths_ a superset of _filesystem merkle-paths_. For example we could use `?link` and `?object` + +Suppose we have object which hashes to QmCCC...000: + + --- + stuff: + foo: + mlink: QmCCC...111 + mode: 0755 + owner: jbenet + +and we have object which hashes to QmCCC...111 (the foo link): + + --- + other: + cat.link: + mlink: QmCCC...222 + mode: 0644 + owner: jbenet + +Now: + +- the path `/ipfs/QmCCC...000/stuff/foo.link/mode` yields `0755` +- the path `/ipfs/QmCCC...000/stuff/foo/other/cat.link/mode` does not exists because `other` does not have a `cat` object, only a `cat.link` +- the path `/ipfs/QmCCC...000/stuff/foo/other/cat.link.link/mode` yields `0644` +- the path `/ipfs/QmCCC...000/stuff/foo.object/other/cat.link` yields object `QmCCC...222` + +### IPLD merkle-path (other solution) + +An _IPLD merkle-path_ is a path which initially dereferences through a _merkle-link_ and then follows elements in intermediate objects through the separator `.`, and follows _merkle-links_ through the separator `/`. + +**Variation:** The separator `/` can also be used instead of `.` if there is no ambiguity. + +The separator can be escaped in any path element using `\.`, and the `\` character is escaped using `\\`. 
+ +Suppose we have object which hashes to QmCCC...000: + + --- + stuff: + foo: + mlink: QmCCC...111 + mode: 0755 + owner: jbenet + +and we have object which hashes to QmCCC...111 (the foo link): + + --- + other: + cat.jpg: + mlink: QmCCC...222 + mode: 0644 + owner: jbenet + +Now: + +- the path `/ipfs/QmCCC...000/stuff.foo.mode` yields `0755` +- the path `/ipfs/QmCCC...000/stuff/foo` does not exists because in the object `QmCCC...000`, the `stuff` object cannot is not a _merkle-link_ (it doesn't have the `mlink` key) +- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields object `QmCCC...222` +
**FIXME:** or does it yields `{"mlink": "QmCCC...222", "mode": 0644, "owner": "jbenet"}` and `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields the object `QmCCC...222`? +- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg.mode` yields `0644` + +Variation: + +- the path `/ipfs/QmCCC...000/stuff/foo.mode` yields `0755` +- the path `/ipfs/QmCCC...000/stuff.foo.mode` yields `0755` +- the path `/ipfs/QmCCC...000/stuff/foo/mode` does not exists because object `QmCCC...111` does not have a `mode` key. +- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields same as above (**FIXME**) +- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg.mode` yields `0644` + ## What is the IPLD Data Model? The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into. From 24bd624ffe8c8ee6095f014820208bd838717432 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 08:52:09 +0100 Subject: [PATCH 23/31] Add note about uses of filesystem merkle-paths --- merkledag/ipld.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index dcc8f77ee..e708b37df 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -41,6 +41,8 @@ There is no single merkle-path, but there are two: When you use a merkle path, make sure of which one you use. Command line tools are encouraged to allow switching between the two flavors using a switch. +Filesystem representations (fuse mounts, HTTP or FTP protocols) should use the _filesystem merkle-paths_ if they intend to store arbitrary file. They can allow switching to _IPLD merkle-paths_ using a mount option or a configuration switch to allow object inspection, and turn the filesystem something like `/proc` or `/sys` on unix machines where storing user files is not the objective. 
+ ### Filesystem merkle-path A _filesystem merkle-path_ is a unix-style path which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. From ec26862e1d49dbddb2f078d831bf49ce5445e5cf Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 09:07:14 +0100 Subject: [PATCH 24/31] Talk about escaping keys in merkle-paths --- merkledag/ipld.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index e708b37df..6c78ce099 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -91,6 +91,23 @@ O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCN This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. +**[In case we use escaping in protobuf IPLD format]** + +In order to not restrict individual path component by disallowing some file names and still allow storing arbitrary data in IPLD objects, path components must be escaped when they are looked up in IPLD objects. + +To escape a path component in order to look it up in an IPLD object: + +- every `\` character in the path component must be replaced with `\\` +- every `@` character in the path component must be replaced with `\@` + +This makes any key containing a `@` character unescaped in an IPLD object not accessible through a _filesystem merkle-path_. This is a reserved key that can be used to store auxiliary data without making it a link and visible in regular filesystems. This data can be made available in filesystems through extended attributes or opening and reading file contents. 
+ +To unescape IPLD object keys that are not reserved and get the corresponding path component: + +- every `\@` sequence in the key must be replaced by `@` +- every `\\` sequence in the key must be replaced by `\` + + ### IPLD merkle-path (best solution) An _IPLD merkle-path_ is an extension of a _filesystem merkle-path_ which uses a special syntax to access link properties. From 201a0d4125e2e21fe025da55d9f83fae0b6d3144 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 8 Jan 2016 09:13:08 +0100 Subject: [PATCH 25/31] Consideration about key escaping --- merkledag/ipld.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 6c78ce099..3740289bf 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -110,7 +110,8 @@ To unescape IPLD object keys that are not reserved and get the corresponding pat ### IPLD merkle-path (best solution) -An _IPLD merkle-path_ is an extension of a _filesystem merkle-path_ which uses a special syntax to access link properties. +An _IPLD merkle-path_ is an extension of a _filesystem merkle-path_ which uses a special syntax to access link properties. **[In case we use escaping in protobuf IPLD format** Except that key escaping is not performed when looking up items in the IPLD objects. This allow accessing reserved keys using _IPLD merkle-paths_ that are not accessible in filesystems.**]** + Path elements are suffixed by either `.link` to access the link properties or by `.object` to dereference the _merkle-link_. 
if no suffix is present, the _merkle-link_ is dereferenced (to be compatible with _filesystem merkle-paths_ in most cases) From e1a1071a9cabfbb70e3d7f1523db5af48ddd3f34 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sat, 9 Jan 2016 00:37:36 +0100 Subject: [PATCH 26/31] Go back to a single merkle-link with two separators --- merkledag/ipld.md | 170 +++++++++++++++++----------------------------- 1 file changed, 61 insertions(+), 109 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 3740289bf..940aab54c 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -34,23 +34,18 @@ Objects with merkle-links form a Graph (merkle-graph), which necessarily is both A merkle-path is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and allows access of elements of the referenced node and other nodes transitively. -There is no single merkle-path, but there are two: +Merkle paths aren't suited to be used in filesystem representations (fuse mounts, HTTP or FTP protocols) as they describe the underlying IPLD data structure. Their use in filesystems is howver well suited for debug purposes (like `/proc` on unix). -- merkle-path for filesystems: this is a merkle-path that is designed to be used in the context of filesystems (that also includes network protocols such as HTTP or FTP). Their idea is to be as close as possible to the traditional filesystem semantic -- merkle-path for IPLD: this is a merkle-path that can be used to access more elements of the IPLD data model (specifically: link properties) but that doesn't fit within the traditional filesystem model. +Filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model -When you use a merkle path, make sure of which one you use. Command line tools are encouraged to allow switching between the two flavors using a switch. +### How do _merkle-paths_ work? 
-Filesystem representations (fuse mounts, HTTP or FTP protocols) should use the _filesystem merkle-paths_ if they intend to store arbitrary file. They can allow switching to _IPLD merkle-paths_ using a mount option or a configuration switch to allow object inspection, and turn the filesystem something like `/proc` or `/sys` on unix machines where storing user files is not the objective. - -### Filesystem merkle-path - -A _filesystem merkle-path_ is a unix-style path which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. +A _merkle-path_ is a unix-style path which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. For example, suppose we have this _merkle-path_: ``` -/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d +/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d/ ``` Where: @@ -64,125 +59,82 @@ Suppose also that this path points to the object `{ "hello": "world" }`. Resolving it involves looking up each object and attaining a hash value, then traversing to the next. ``` - +-------------------+ -O_1 = | "a": "QmV76pU..." | whose hash value is QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k - +-------------------+ - | - v - +-------------------+ -O_2 = | "b": "QmQmkZP..." | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT - +-------------------+ - | - v - +-------------------+ -O_3 = | "c": "QmWkyYN..." | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE - +-------------------+ - | - v - +-------------------+ -O_4 = | "d": "QmR8Bzg..." 
| whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT - +-------------------+ - | - v - +-------------------+ -O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCNFiaGhP1UjywA43j - +-------------------+ + +-----------------------------+ +O_1 = | "a": {"link": "QmV76pU..."} | whose hash value is QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k + +-----------------------------+ + | + v + +-----------------------------+ +O_2 = | "b": {"link": "QmQmkZP..."} | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT + +-----------------------------+ + | + v + +-----------------------------+ +O_3 = | "c": {"link": "QmWkyYN..."} | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE + +-----------------------------+ + | + v + +-----------------------------+ +O_4 = | "d": {"link": "QmR8Bzg..."} | whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT + +-----------------------------+ + | + v + +-------------------+ +O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCNFiaGhP1UjywA43j + +-------------------+ ``` This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. -**[In case we use escaping in protobuf IPLD format]** - -In order to not restrict individual path component by disallowing some file names and still allow storing arbitrary data in IPLD objects, path components must be escaped when they are looked up in IPLD objects. - -To escape a path component in order to look it up in an IPLD object: - -- every `\` character in the path component must be replaced with `\\` -- every `@` character in the path component must be replaced with `\@` - -This makes any key containing a `@` character unescaped in an IPLD object not accessible through a _filesystem merkle-path_. This is a reserved key that can be used to store auxiliary data without making it a link and visible in regular filesystems. 
This data can be made available in filesystems through extended attributes or opening and reading file contents. - -To unescape IPLD object keys that are not reserved and get the corresponding path component: - -- every `\@` sequence in the key must be replaced by `@` -- every `\\` sequence in the key must be replaced by `\` - - -### IPLD merkle-path (best solution) +#### Accessing properties within IPLD objects -An _IPLD merkle-path_ is an extension of a _filesystem merkle-path_ which uses a special syntax to access link properties. **[In case we use escaping in protobuf IPLD format** Except that key escaping is not performed when looking up items in the IPLD objects. This allow accessing reserved keys using _IPLD merkle-paths_ that are not accessible in filesystems.**]** +Now, to travel within an IPLD object, we introduce a second separator: the dot (`.`). This separator can be used to avoid dereferencing `merkle-links` but travel within the IPLD object. +For example, suppose we have this _merkle-path_: -Path elements are suffixed by either `.link` to access the link properties or by `.object` to dereference the _merkle-link_. if no suffix is present, the _merkle-link_ is dereferenced (to be compatible with _filesystem merkle-paths_ in most cases) - -**FIXME**: perhaps use different suffixes so we are less likely to have ambiguities. Using a character that is denied by Windows would be a good idea since those are less likely to be present in filenames. For most cases, this would make _IPLD merkle-paths_ a superset of _filesystem merkle-paths_. 
For example we could use `?link` and `?object` - -Suppose we have object which hashes to QmCCC...000: - - --- - stuff: - foo: - mlink: QmCCC...111 - mode: 0755 - owner: jbenet - -and we have object which hashes to QmCCC...111 (the foo link): - - --- - other: - cat.link: - mlink: QmCCC...222 - mode: 0644 - owner: jbenet - -Now: +``` +/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a.b/c.d +``` -- the path `/ipfs/QmCCC...000/stuff/foo.link/mode` yields `0755` -- the path `/ipfs/QmCCC...000/stuff/foo/other/cat.link/mode` does not exists because `other` does not have a `cat` object, only a `cat.link` -- the path `/ipfs/QmCCC...000/stuff/foo/other/cat.link.link/mode` yields `0644` -- the path `/ipfs/QmCCC...000/stuff/foo.object/other/cat.link` yields object `QmCCC...222` +The link will: -### IPLD merkle-path (other solution) +- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `root` +- Look up the key `root["a"]["b"]` and find here a _merkle-link_ +- Dereference this _merkle-link_ to get `object1` +- Look up the key `object1["c"]["d"]` which will be returned as the result -An _IPLD merkle-path_ is a path which initially dereferences through a _merkle-link_ and then follows elements in intermediate objects through the separator `.`, and follows _merkle-links_ through the separator `/`. +Note that if we added a trailing slash to the path (`/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a.b/c.d/`), we would perform a last _merkle-link_ dereferencing: -**Variation:** The separator `/` can also be used instead of `.` if there is no ambiguity. 
+- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `root` +- Look up the key `root["a"]["b"]` and find here a _merkle-link_ +- Dereference this _merkle-link_ to get `object1` +- Look up the key `object1["c"]["d"]` and find a _merkle-link_ +- Dereference this _merkle-link_ and return the IPLD object as the result -The separator can be escaped in any path element using `\.`, and the `\` character is escaped using `\\`. +Also, in case the IPLD object does not contain a _merkle-link_, it is possible to use both the `/` or the `.` separator as there is no ambiguity. -Suppose we have object which hashes to QmCCC...000: +To be able to access objects that are behind keys containing either a `/` or a `.` character, the individual path element can be character escaped using `\`. - --- - stuff: - foo: - mlink: QmCCC...111 - mode: 0755 - owner: jbenet +For example, resolving `/ipfs/QmUmg7B.../a\.b.c/d\/e/f\\g/` will: -and we have object which hashes to QmCCC...111 (the foo link): +- look for the IPLD node that we call `root` whose hash is `QmUmg7B...` +- resolve _merkle-link_ found in `root["a.b"]["c"]` to `object1` +- resolve _merkle-link_ found in `object1["d/e"]` to `object2` +- resolve _merkle-link_ found in `object2["f\\g"]` and return the result. - --- - other: - cat.jpg: - mlink: QmCCC...222 - mode: 0644 - owner: jbenet +#### Escaping algorithm -Now: +To escape a path component you have to: -- the path `/ipfs/QmCCC...000/stuff.foo.mode` yields `0755` -- the path `/ipfs/QmCCC...000/stuff/foo` does not exists because in the object `QmCCC...000`, the `stuff` object cannot is not a _merkle-link_ (it doesn't have the `mlink` key) -- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields object `QmCCC...222` -
**FIXME:** or does it yields `{"mlink": "QmCCC...222", "mode": 0644, "owner": "jbenet"}` and `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields the object `QmCCC...222`? -- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg.mode` yields `0644` +- replace `.` by `\.` +- replace `/` by `\/` +- replace `\` by `\\` -Variation: +To unescape a path component you have to: -- the path `/ipfs/QmCCC...000/stuff/foo.mode` yields `0755` -- the path `/ipfs/QmCCC...000/stuff.foo.mode` yields `0755` -- the path `/ipfs/QmCCC...000/stuff/foo/mode` does not exists because object `QmCCC...111` does not have a `mode` key. -- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg` yields same as above (**FIXME**) -- the path `/ipfs/QmCCC...000/stuff.foo/other.cat\.jpg.mode` yields `0644` +- replace `\\` by `\` +- replace `\/` by `/` +- replace `\.` by `.` ## What is the IPLD Data Model? From 5062f8c5377e1ced78d23c89aefa08cbef4c9dc2 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sat, 9 Jan 2016 01:11:33 +0100 Subject: [PATCH 27/31] IPLD merkle paths: typo fixes --- merkledag/ipld.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 940aab54c..a53b5cf52 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -98,27 +98,27 @@ For example, suppose we have this _merkle-path_: The link will: -- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `root` -- Look up the key `root["a"]["b"]` and find here a _merkle-link_ +- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `object0` +- Look up the key `object0["a"]["b"]` and find here a _merkle-link_ - Dereference this _merkle-link_ to get `object1` - Look up the key `object1["c"]["d"]` which will be returned as the result Note that if we added a trailing slash to the path (`/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a.b/c.d/`), we would perform a last _merkle-link_ dereferencing: -- 
look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `root` -- Look up the key `root["a"]["b"]` and find here a _merkle-link_ +- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `object0` +- Look up the key `object0["a"]["b"]` and find here a _merkle-link_ - Dereference this _merkle-link_ to get `object1` - Look up the key `object1["c"]["d"]` and find a _merkle-link_ - Dereference this _merkle-link_ and return the IPLD object as the result -Also, in case the IPLD object does not contain a _merkle-link_, it is possible to use both the `/` or the `.` separator as there is no ambiguity. +Also, in case the IPLD object does not contain a _merkle-link_, it is possible to use both the `/` or the `.` separator to access internal properties as there is no ambiguity. To be able to access objects that are behind keys containing either a `/` or a `.` character, the individual path element can be character escaped using `\`. For example, resolving `/ipfs/QmUmg7B.../a\.b.c/d\/e/f\\g/` will: -- look for the IPLD node that we call `root` whose hash is `QmUmg7B...` -- resolve _merkle-link_ found in `root["a.b"]["c"]` to `object1` +- look for the IPLD node that we call `object0` whose hash is `QmUmg7B...` +- resolve _merkle-link_ found in `object0["a.b"]["c"]` to `object1` - resolve _merkle-link_ found in `object1["d/e"]` to `object2` - resolve _merkle-link_ found in `object2["f\\g"]` and return the result. 
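The escaping rules at this point in the series (escape `.`, `/` and `\` with a leading `\`) can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names `escape_component`, `unescape_component` and `split_keys` are hypothetical. One detail is worth noting. Applying the three replacements one after another in the listed order would double the backslashes introduced by the first two, so the sketch escapes `\` first and unescapes in a single left-to-right pass.

```python
def escape_component(key: str) -> str:
    """Escape a raw object key so it can be embedded in a merkle-path."""
    # Backslash goes first; otherwise the backslashes added for '.' and '/'
    # would themselves be escaped again.
    return key.replace("\\", "\\\\").replace(".", "\\.").replace("/", "\\/")


def unescape_component(component: str) -> str:
    """Reverse escape_component in one pass over the characters."""
    out = []
    chars = iter(component)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling backslash at end of component")
            out.append(nxt)  # keep the escaped character literally
        else:
            out.append(ch)
    return "".join(out)


def split_keys(segment: str) -> list:
    """Split one '/'-delimited segment on unescaped '.' and unescape each key."""
    keys, current = [], []
    chars = iter(segment)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling backslash at end of segment")
            current.append(nxt)
        elif ch == ".":
            keys.append("".join(current))
            current = []
        else:
            current.append(ch)
    keys.append("".join(current))
    return keys
```

With these helpers, the segment `a\.b.c` from the example above splits into the keys `a.b` and `c`, and any key survives an escape followed by an unescape unchanged.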
From bb5ad86fe13f1ff8af8e3ab8bdc723d0f72ffb39 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Sat, 9 Jan 2016 14:24:30 +0100 Subject: [PATCH 28/31] ipld merkle paths: clarify the usage scope --- merkledag/ipld.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index a53b5cf52..4255fdf84 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -34,9 +34,9 @@ Objects with merkle-links form a Graph (merkle-graph), which necessarily is both A merkle-path is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and allows access of elements of the referenced node and other nodes transitively. -Merkle paths aren't suited to be used in filesystem representations (fuse mounts, HTTP or FTP protocols) as they describe the underlying IPLD data structure. Their use in filesystems is howver well suited for debug purposes (like `/proc` on unix). +_Merkle-paths_ aren't suited for using them in a general purpose filesystem because it introduces many restrictions on file names. However, it can be used to work on special purpose filesystems. It can be compared to the `/proc` filesystem on unix computers or HTTP Web APIs where the allowed paths is restricted. -Filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model +General purpose filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model ### How do _merkle-paths_ work? 
From 213c7ed87fdb825243a15629477a46c60c838bc9 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Thu, 11 Feb 2016 22:52:50 +0100 Subject: [PATCH 29/31] New description of merkle-paths --- merkledag/ipld.md | 96 +++++++++++++++-------------------------------- 1 file changed, 30 insertions(+), 66 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 4255fdf84..0fcd4529d 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -36,7 +36,7 @@ A merkle-path is a unix-style path (e.g. `/a/b/c/d`) which initially dereference _Merkle-paths_ aren't suited for using them in a general purpose filesystem because it introduces many restrictions on file names. However, it can be used to work on special purpose filesystems. It can be compared to the `/proc` filesystem on unix computers or HTTP Web APIs where the allowed paths is restricted. -General purpose filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model +General purpose filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model. ### How do _merkle-paths_ work? @@ -52,89 +52,53 @@ Where: - `ipfs` is a protocol namespace (to allow the computer to discern what to do) - `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` is a cryptographic hash. - `a/b/c/d` is a path _traversal_, as in unix. -- this link traverses five objects. -Suppose also that this path points to the object `{ "hello": "world" }`. +Path traversal can either happen inside a single IPLD object, or can happen between objects throught *merkle-links*. The simple rule is that in case the traversal is possible in the same IPLD object, *merkle-links* should not be followed. -Resolving it involves looking up each object and attaining a hash value, then traversing to the next. 
+In order to specify a path that follows a *merkle-link* even in case the traversal can be done without fetching another IPLD object, there are two mechanisms: -``` - +-----------------------------+ -O_1 = | "a": {"link": "QmV76pU..."} | whose hash value is QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k - +-----------------------------+ - | - v - +-----------------------------+ -O_2 = | "b": {"link": "QmQmkZP..."} | whose hash value is QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT - +-----------------------------+ - | - v - +-----------------------------+ -O_3 = | "c": {"link": "QmWkyYN..."} | whose hash value is QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE - +-----------------------------+ - | - v - +-----------------------------+ -O_4 = | "d": {"link": "QmR8Bzg..."} | whose hash value is QmWkyYNrN5wnHgX5vfs88q7QUaFKq52TVNTFeTzxm73UbT - +-----------------------------+ - | - v - +-------------------+ -O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCNFiaGhP1UjywA43j - +-------------------+ -``` +- Use `//` instead of `/` as a path separator when following the *merkle-link* is desired. This may not always be possible depending on the filesystem implementation. -This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names. +- Use the special path component `/@link/` instead of a simple path separator. That also signifies the path needs to dereference the *merkle-link* -#### Accessing properties within IPLD objects +As a consequence, `@link` keys that are not *merkle-links* cannot be referenced in *merkle-paths*. -Now, to travel within an IPLD object, we introduce a second separator: the dot (`.`). This separator can be used to avoid dereferencing `merkle-links` but travel within the IPLD object. 
+#### Examples -For example, suppose we have this _merkle-path_: +The IPLD object with the hash `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` contains: -``` -/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a.b/c.d -``` + a: + b: + @link: QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT + c: "d" + foo: + @link: QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE -The link will: +And the object `QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT` contains: -- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `object0` -- Look up the key `object0["a"]["b"]` and find here a _merkle-link_ -- Dereference this _merkle-link_ to get `object1` -- Look up the key `object1["c"]["d"]` which will be returned as the result + c: "e" + d: + e: "f" + foo: + name: "second/foo" -Note that if we added a trailing slash to the path (`/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a.b/c.d/`), we would perform a last _merkle-link_ dereferencing: +And the object `QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE` contains: -- look up the first object `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` that we call `object0` -- Look up the key `object0["a"]["b"]` and find here a _merkle-link_ -- Dereference this _merkle-link_ to get `object1` -- Look up the key `object1["c"]["d"]` and find a _merkle-link_ -- Dereference this _merkle-link_ and return the IPLD object as the result + name: "third" -Also, in case the IPLD object does not contain a _merkle-link_, it is possible to use both the `/` or the `.` separator to access internal properties as there is no ambiguity. +An example of the paths: -To be able to access objects that are behind keys containing either a `/` or a `.` character, the individual path element can be character escaped using `\`. 
- -For example, resolving `/ipfs/QmUmg7B.../a\.b.c/d\/e/f\\g/` will: - -- look for the IPLD node that we call `object0` whose hash is `QmUmg7B...` -- resolve _merkle-link_ found in `object0["a.b"]["c"]` to `object1` -- resolve _merkle-link_ found in `object1["d/e"]` to `object2` -- resolve _merkle-link_ found in `object2["f\\g"]` and return the result. +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/c` will only traverse the first object and lead to string `d`. +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b//c` will traverse both objects and lead to the string `e` +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/@link/c` is equivalent +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/d/e` traverse both objects and leads to the string `f` +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/foo/name` traverse the first and last object and lead to string `third` #### Escaping algorithm -To escape a path component you have to: - -- replace `.` by `\.` -- replace `/` by `\/` -- replace `\` by `\\` - -To unescape a path component you have to: +Elements named `@link` that are not *merkle-links* are not addressable with this scheme. For example, if a `@link` key points to an array, it is not a valid *merkle-link*. -- replace `\\` by `\` -- replace `\/` by `/` -- replace `\.` by `.` +If this is not desirable, a simple escaping mechanism can be devised. For example any key matching the regular expression `^\@+link$` can be escaped by adding `@` at the beginning, or unescaped by removing one `@` sign. ## What is the IPLD Data Model? 
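The path rules added by this patch can be sketched in Python. This is only an illustration, not code from the IPLD project: the names `STORE`, `resolve`, `dereference`, `escape_key` and `unescape_key` are hypothetical, the object ids `obj1` and `obj2` stand in for real hashes, and the data is a reduced subset of the example dataset above. It assumes that a map whose `@link` value is a string is a *merkle-link*, that a single `/` looks inside the current object first, and that `//` or `/@link/` always dereferences.

```python
import re

LINK_KEY = "@link"

# Hypothetical in-memory object store; "obj1" and "obj2" stand in for real hashes.
STORE = {
    "obj1": {"a": {"b": {"@link": "obj2", "c": "d"}}},
    "obj2": {"c": "e", "d": {"e": "f"}},
}


def dereference(store, node):
    """Follow a merkle-link (assumed here: a map whose `@link` value is a string)."""
    if not (isinstance(node, dict) and isinstance(node.get(LINK_KEY), str)):
        raise KeyError("not a merkle-link")
    return store[node[LINK_KEY]]


def resolve(store, root_id, path):
    """Resolve a relative path such as "a/b//c" starting at the object root_id."""
    node = store[root_id]
    force_link = False
    for part in path.split("/"):
        if part == "" or part == LINK_KEY:
            # An empty component comes from `//`; both forms force a dereference.
            force_link = True
            continue
        if force_link:
            node = dereference(store, node)
            force_link = False
        elif not (isinstance(node, dict) and part in node):
            # The key is not available in-object, so fall back to following the link.
            node = dereference(store, node)
        if not isinstance(node, dict) or part not in node:
            raise KeyError(part)
        node = node[part]
    if force_link:
        # A trailing `//` or `/@link` dereferences the final link.
        node = dereference(store, node)
    return node


def escape_key(key):
    """Escape a key matching ^@+link$ by adding one leading `@`."""
    return "@" + key if re.fullmatch(r"@+link", key) else key


def unescape_key(component):
    """Undo escape_key by removing one leading `@` from an escaped key."""
    return component[1:] if re.fullmatch(r"@@+link", component) else component
```

With this sketch, `resolve(STORE, "obj1", "a/b/c")` stays in the first object, while `a/b//c` and `a/b/@link/c` both cross into the second one. The escaping helpers are kept separate from `resolve` because the patch only suggests that such a mechanism can be devised.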
From 2ee690996d0dc6e1f6c919ebe7ed68e5e220f343 Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 12 Feb 2016 10:04:01 +0100 Subject: [PATCH 30/31] Improve spec on paths --- merkledag/ipld.md | 39 +++++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/merkledag/ipld.md b/merkledag/ipld.md index 0fcd4529d..b12c1aa43 100644 --- a/merkledag/ipld.md +++ b/merkledag/ipld.md @@ -34,8 +34,6 @@ Objects with merkle-links form a Graph (merkle-graph), which necessarily is both A merkle-path is a unix-style path (e.g. `/a/b/c/d`) which initially dereferences through a _merkle-link_ and allows access of elements of the referenced node and other nodes transitively. -_Merkle-paths_ aren't suited for using them in a general purpose filesystem because it introduces many restrictions on file names. However, it can be used to work on special purpose filesystems. It can be compared to the `/proc` filesystem on unix computers or HTTP Web APIs where the allowed paths is restricted. - General purpose filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model. ### How do _merkle-paths_ work? @@ -45,7 +43,7 @@ A _merkle-path_ is a unix-style path which initially dereferences through a _mer For example, suppose we have this _merkle-path_: ``` -/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d/ +/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d ``` Where: @@ -53,20 +51,33 @@ Where: - `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` is a cryptographic hash. - `a/b/c/d` is a path _traversal_, as in unix. -Path traversal can either happen inside a single IPLD object, or can happen between objects throught *merkle-links*. The simple rule is that in case the traversal is possible in the same IPLD object, *merkle-links* should not be followed. 
+Path traversals are divided into two kinds: + +- **in-object traversals** traverse maps within the same object, and are denoted with `/` +- **cross-object traversals** traverse across objects, resolving through merkle-links, and are denoted with `/` **(TODO)**, `//` or with `/@link/`. + +The case for strict path traversals: -In order to specify a path that follows a *merkle-link* even in case the traversal can be done without fetching another IPLD object, there are two mechanisms: +> We divide the traversals strictly, to avoid ambiguity in accessing properties within a *merkle-link* map itself. This is not transparent resolution, and thus a path reveals the objects it traverses. For example, `a/b//c/d//e` traverses across 3 objects. +> +> We use `//` or `/@link/` for cross-object traversals depending on the filesystem implementation. For example, in unix filesystems, double slashes (`//`) are meaningless and often cleaned into a single slash (`/`). In such a case, the use of `/@link/` is required to traverse links. -- Use `//` instead of `/` as a path separator when following the *merkle-link* is desired. This may not always be possible depending on the filesystem implementation. +The case for lenient path traversals: -- Use the special path component `/@link/` instead of a simple path separator. That also signifies the path needs to dereference the *merkle-link* +> We use `/` to transparently traverse inside a single IPLD object or traverse across multiple. A single slash (`/`) ALWAYS traverses in-object first, and cross-object otherwise. A double slash (`//`) or `/@link/` ALWAYS traverses cross-object. +> +> To avoid potential ambiguity, we MUST use cross-object traversals (`//` or `/@link/`) wherever possible. For example, merkle-links can themselves carry properties and sub-maps. When `/` path traversals are ambiguous, they default to in-object (the local operation). In that case, we must use `//` or `/@link/` to traverse cross-object. 
-As a consequence, `@link` keys that are not *merkle-links* cannot be referenced in *merkle-paths*. +Note: a filesystem implementation might not be able to support the separator `//`, as it is generally folded into `/` on unix. In that case, usage of `/@link/` is preferred. + +Because the `@link` path component denotes cross-object traversals, it becomes a reserved path component, which makes it impossible to access arbitrary `@link` keys that are not themselves *merkle-links*. Escaping can be used to make those keys accessible if so desired. #### Examples -The IPLD object with the hash `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` contains: +Using the following dataset: + > ipfs cat --fmt=yaml QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k + --- a: b: @link: QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT @@ -74,23 +85,23 @@ The IPLD object with the hash `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` c foo: @link: QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE -And the object `QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT` contains: - + > ipfs cat --fmt=yaml QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT + --- c: "e" d: e: "f" foo: name: "second/foo" -And the object `QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE` contains: - + > ipfs cat --fmt=yaml QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE + --- name: "third" An example of the paths: - `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/c` will only traverse the first object and lead to string `d`. 
- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b//c` will traverse both objects and lead to the string `e` -- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/@link/c` is equivalent +- `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/@link/c` is equivalent (will traverse both objects and lead to the string `e`) - `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/d/e` traverse both objects and leads to the string `f` - `/ipfs/QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT/a/b/foo/name` traverse the first and last object and lead to string `third` From ffa001e645d727cbd460bef400338266b1ea750e Mon Sep 17 00:00:00 2001 From: Mildred Ki'Lya Date: Fri, 12 Feb 2016 12:50:14 +0100 Subject: [PATCH 31/31] Remove named links section (but leave the possibility open) --- merkledag/ipld-compat-protobuf.md | 44 +++++++++++++++---------------- 1 file changed, 21 insertions(+), 23 deletions(-) diff --git a/merkledag/ipld-compat-protobuf.md b/merkledag/ipld-compat-protobuf.md index 59839c719..9a79658e6 100644 --- a/merkledag/ipld-compat-protobuf.md +++ b/merkledag/ipld-compat-protobuf.md @@ -37,43 +37,41 @@ This format is defined with the Protocol Buffers syntax as: ## Conversion to IPLD model -The conversion to the IPLD data model must have the following properties: - -- It MUST be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way. -- When using paths as defined in the IPLD specification, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names. -- Link names should be able to be any valid file name. As such, the encoding must ensure that link names do not conflict with other keys in the model. +The conversion to the IPLD data model MUST be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). 
This implies that ordering and duplicate links must be preserved in some way. As such, they are stored in an array and not in a map indexed by their name. There is a canonical form which is described below: { "data": "", - "named-links": { - "": { - "@link": "", + "links": [ + { + "@link": "/ipfs/", "name": "", "size": }, - "": { - "@link": "", - "name": "", - "size": - }, - ... - } - "ordered-links": [ - "", { + "@link": "/ipfs/", "name": "", - "@link": "", "size": - } - "", + }, + { + "@link": "/ipfs/", + "name": "", + "size": + }, ... ] } -- Here we assume that the link #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named link section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message. +The main object contains: + +- A `data` key containing the binary data string +- A `links` array containing links in the correct order + +Each link consists of: -- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects. +- A `@link` key containing the path to the destination document (using the `/ipfs/` prefix) +- A `name` key containing the link name (a text string) +- A `size` unsigned integer containing the link size as stored in the Protocol Buffer object -- No escaping is needed and no conflict is possible +Implementations are free to add any other top level key they need. In particular it may be interesting to access the links indexed by their name. This is a purely optional feature, and such additional keys cannot be encoded back to the original Protocol Buffer format.
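The canonical form above can be sketched as a pair of conversion functions. This is a sketch under stated assumptions rather than the actual implementation: `legacy_to_ipld` and `ipld_to_legacy` are hypothetical names, a legacy link is modelled as a plain `(hash, name, size)` tuple, and `data` is passed through unchanged because the patch does not say how the binary string is encoded. It shows why an ordered array is needed: duplicate names and link order both survive the round trip.

```python
IPFS_PREFIX = "/ipfs/"


def legacy_to_ipld(data, links):
    """Build the canonical map from data and an ordered list of (hash, name, size)."""
    return {
        "data": data,
        "links": [
            {"@link": IPFS_PREFIX + link_hash, "name": name, "size": size}
            for (link_hash, name, size) in links
        ],
    }


def ipld_to_legacy(obj):
    """Recover data and the ordered (hash, name, size) list from the canonical map."""
    links = []
    for link in obj["links"]:
        target = link["@link"]
        if not target.startswith(IPFS_PREFIX):
            raise ValueError("expected a link with the /ipfs/ prefix")
        links.append((target[len(IPFS_PREFIX):], link["name"], link["size"]))
    return obj["data"], links
```

A map indexed by link name would collapse two links that share a name, so the original byte stream (and therefore the hash) could no longer be reproduced. Extra top-level keys, such as a name index, are ignored by `ipld_to_legacy` in this sketch.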