Relationship with Protocol Buffers legacy IPFS node format #59

Merged
merged 9 commits into from
Feb 12, 2016
205 changes: 205 additions & 0 deletions merkledag/ipld-compat-protobuf.md
@@ -0,0 +1,205 @@
# IPLD conversion with Protocol Buffer legacy IPFS node format

IPLD defines a conversion to and from the legacy Protocol Buffers format so that new IPLD objects can interoperate with older protocol buffer objects.

## Detecting if the legacy format is in use

The format is encapsulated after a multicodec header that tells which codec to use. In addition, older applications that do not yet use the multicodec header will transmit a protocol buffer stream. This can be detected by looking at the first byte:

- if the first byte is between 0 and 127, it is a multicodec header
- if the first byte is between 128 and 255, it is a protocol buffer stream
Contributor Author

I believe I am wrong here. The multicodec header length is not limited to one byte, but can be encoded in multiple bytes if it is above 127. Setting the MSB to 1 is just the way varint works.

I also assumed that protocol buffers always started with a byte with the MSB set to 1, but I don't know if that is true. Probably not.

So, it's probably not possible to detect in such an easy way whether we are transmitting a multicodec header or a protocol buffer message. I'm currently rewriting this part (as I am implementing it in go-ipld).
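
To make the doubt above concrete, here is a minimal Python sketch. It assumes the standard protocol buffers wire format (tag = field number shifted left by 3, OR-ed with the wire type) and an unsigned varint length prefix for the multicodec header. The helper names `encode_varint` and `protobuf_tag` are hypothetical and only serve the illustration.

```python
def encode_varint(n):
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the MSB
        else:
            out.append(byte)
            return bytes(out)


def protobuf_tag(field_number, wire_type):
    """Tag value of a protobuf field (fits in one byte for field numbers < 16)."""
    return (field_number << 3) | wire_type


LENGTH_DELIMITED = 2  # wire type for bytes, strings and embedded messages

# A PBNode message starts with either its Data field (1) or its Links field (2).
data_tag = protobuf_tag(1, LENGTH_DELIMITED)
links_tag = protobuf_tag(2, LENGTH_DELIMITED)

# A hypothetical multicodec path of 200 bytes needs a two-byte length prefix.
long_prefix = encode_varint(200)
```

Under these assumptions both possible first bytes of a `PBNode` (`0x0a` and `0x12`) are below 128, while a long multicodec header starts with a byte above 127, so the first byte alone does not separate the two cases.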


When a multicodec header is in use, the actual IPLD object is encapsulated first with a multicodec header whose identifier is `/mdagv1`, then by a second header whose identifier corresponds to the actual encoding of the object:

- `/protobuf/msgio`: the encapsulation for protocol buffer messages
- `/json`: the encapsulation for JSON encoding
- `/cbor`: the encapsulation for CBOR encoding

For example, a protocol buffer object encapsulated in a multicodec header would start with "`\x08/mdagv1\n\x10/protobuf/msgio\n`", corresponding to the bytes:

08 2f 6d 64 61 67 76 31 0a
10 2f 70 72 6f 74 6f 62 75 66 2f 6d 73 67 69 6f 0a

A JSON encoded object would start with "`\x08/mdagv1\n\x06/json\n`" and a CBOR encoded object would start with "`\x08/mdagv1\n\x06/cbor\n`".
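
The prefixes above can be reproduced with a short Python sketch. It assumes each header is a one-byte length (counting the trailing newline) followed by the path and a newline, which matches the byte dumps shown here. The function name `multicodec_header` is hypothetical, and lengths above 127 are deliberately rejected rather than varint-encoded.

```python
def multicodec_header(path):
    """Build a length-prefixed header; the length includes the trailing newline."""
    body = path.encode("utf-8") + b"\n"
    if len(body) > 127:
        # Longer headers would need a multi-byte varint; out of scope for this sketch.
        raise ValueError("sketch only handles single-byte length prefixes")
    return bytes([len(body)]) + body


protobuf_prefix = multicodec_header("/mdagv1") + multicodec_header("/protobuf/msgio")
json_prefix = multicodec_header("/mdagv1") + multicodec_header("/json")
cbor_prefix = multicodec_header("/mdagv1") + multicodec_header("/cbor")
```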


## Description of the legacy protocol buffers format

This format is defined with the Protocol Buffers syntax as:

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
}

message PBNode {
  repeated PBLink Links = 2;
  optional bytes Data = 1;
}
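
As an illustration of what these messages look like on the wire, here is a hand-rolled Python encoder that does not depend on a protobuf library. It is only a sketch under stated assumptions: every field is always emitted (the schema marks them optional), and `Links` is written before `Data`, mirroring the declaration order above; real implementations may differ. All function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer as an unsigned varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Wire type 2: tag, payload length, payload."""
    return bytes([(field_number << 3) | 2]) + encode_varint(len(payload)) + payload


def varint_field(field_number, value):
    """Wire type 0: tag, varint value."""
    return bytes([(field_number << 3) | 0]) + encode_varint(value)


def encode_pblink(hash_bytes, name, tsize):
    return (length_delimited(1, hash_bytes)
            + length_delimited(2, name.encode("utf-8"))
            + varint_field(3, tsize))


def encode_pbnode(links, data):
    """links is a list of (hash_bytes, name, tsize) tuples."""
    out = b""
    for hash_bytes, name, tsize in links:
        out += length_delimited(2, encode_pblink(hash_bytes, name, tsize))
    out += length_delimited(1, data)
    return out
```

Because the hash of an object is computed over these bytes, any conversion that must reproduce the same hash has to preserve link order and duplicates exactly.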

## Conversion to IPLD model

The conversion to the IPLD data model must have the following properties:

- It should be convertible back to protocol buffers, resulting in an identical byte stream (so the hash corresponds). This implies that ordering and duplicate links must be preserved in some way.
Member

  • "- It MUST be ..."

- When using paths as defined earlier in this document, links should be accessible without further indirection. This requires the top node object to have keys corresponding to link names.
Member

  • "When using paths as defined in the IPLD spec..."

- Any valid file name should be usable as a link name. As such, the encoding must ensure that link names do not conflict with other keys in the model.

There is a canonical form which is described below:

{
  "data": "<Data>",
  "named-links": {
    "<Links[0].Name>": {
      "link": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
Member

these could be offsets into the array, to decrease memory concerns. otherwise we blow up by ~2x

Member

(im ok with either, can maybe state implementations can just dedup?)

Contributor Author

We don't duplicate links here. A link is represented only once. The rule is that:

  • if the link can be represented in named-links, it is represented just there. The ordered-links section contains just the string key referencing the named link.
  • if the link cannot be represented in named-links (name conflict, malformed protocol buffer document), it only appears, this time in full, in the ordered-links section.

Member

Ah I see. i think we should reverse to simplify, i.e.

  • always keep the full link object in ordered-links
  • always point (with an offset number) from the named-links section.

Contributor Author

I don't know. It depends on which information you want to have most accessible. The named links section was initially the only section, but the ordered links was necessary to be able to convert objects with duplicate links back to protocol buffers.

The idea is that all named links are available in the named-links section by their name, and the key is the name of the link. It's easy to make a mapping between the link and its name.

If the named links contain the index to the ordered links, making the mapping between each link and its name is a little more difficult if you are not parsing both sections at the same time. Also, the benefit of having access to the link with a path containing the link name is gone.

If we want to simplify, I'd be in favor of removing the named links section entirely instead and adding a name attribute inside the link object. There is no benefit of having a named links section being just a mapping between name and link index.

Contributor Author

The main argument here is that link order was never meant to be significant in the protocol buffer serialization format. We keep the order only to allow conversion back to the original format and be guaranteed we have the same hash. If not for that (and for the possibility of link conflicts), we would have removed the ordered links section entirely

Member

Im ok removing the named-links section, but it may help-- makes it easier to find stuff. In any case, we can remove it for now, and add it later if we need it, as the ordered-links section is the important part anyway.

Link conflicts require the ordering, so ordering is very significant. Most protobuf objects out there have lots of links with the same link name (""), this is current unixfs files' sub-files.

    "<Links[2].Name>": {
      "link": "<Links[2].Hash.(base58)>",
      "name": "<Links[2].Name>",
      "size": <Links[2].Tsize>
    },
    ...
  },
  "ordered-links": [
    "<Links[0].Name>",
    {
      "name": "<Links[1].Name>",
      "link": "<Links[1].Hash.(base58)>",
      "tsize": <Links[1].Tsize>
    },
    "<Links[2].Name>",
    ...
  ]
}

- Here we assume that links #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named links section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message.

- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects.

- No escaping is needed and no conflict is possible
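
The rule discussed in the review thread (a link appears once, either by name in `named-links` or in full in `ordered-links`) can be sketched in Python as a round trip. This is a minimal illustration, not the go-ipld implementation: it assumes hashes are already base58 text, that every link has a name and a size, and it mirrors the key names used in the template above (`size` in `named-links`, `tsize` in `ordered-links`). The function names and the input layout are hypothetical.

```python
def to_canonical(data, links):
    """links: list of dicts with keys 'hash' (base58 text), 'name', 'tsize'."""
    named = {}
    ordered = []
    for link in links:
        name = link["name"]
        if name not in named:
            # First link with this name: store it once, reference it by key.
            named[name] = {"link": link["hash"], "name": name, "size": link["tsize"]}
            ordered.append(name)
        else:
            # Duplicate name: keep the link in full, only in ordered-links.
            ordered.append({"name": name, "link": link["hash"], "tsize": link["tsize"]})
    return {"data": data, "named-links": named, "ordered-links": ordered}


def from_canonical(node):
    """Rebuild the original (data, links) pair, preserving order and duplicates."""
    links = []
    for entry in node["ordered-links"]:
        if isinstance(entry, str):
            named = node["named-links"][entry]
            links.append({"hash": named["link"], "name": named["name"], "tsize": named["size"]})
        else:
            links.append({"hash": entry["link"], "name": entry["name"], "tsize": entry["tsize"]})
    return node["data"], links


example_links = [
    {"hash": "QmA", "name": "a", "tsize": 1},
    {"hash": "QmB", "name": "a", "tsize": 2},  # same name as link #0
    {"hash": "QmC", "name": "b", "tsize": 3},
]
canonical = to_canonical("payload", example_links)
restored = from_canonical(canonical)
```

The hashes `QmA`, `QmB` and `QmC` are placeholders, not real multihashes. Since `from_canonical` walks `ordered-links` only, the link order needed to rebuild the original protocol buffer bytes survives even when names repeat.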

-----------------

### Simple variation on that solution

{
  "data": "<Data>",
  "<Links[0].Name>": {
    "link": "<Links[0].Hash.(base58)>",
    "name": "<Links[0].Name>",
    "size": <Links[0].Tsize>
  },
  "<Links[2].Name>": {
    "link": "<Links[2].Hash.(base58)>",
    "name": "<Links[2].Name>",
    "size": <Links[2].Tsize>
  },
  ...
  "ordered-links": [
    "<Links[0].Name>",
    {
      "name": "<Links[1].Name>",
      "link": "<Links[1].Hash.(base58)>",
      "tsize": <Links[1].Tsize>
    },
    "<Links[2].Name>",
    ...
  ]
}

- Here we assume that links #0 and #1 have the same name. As specified in [ipld.md](ipld.md) in paragraph **Duplicate property keys**, only the first link is present in the named links section. The other link is present in the `ordered-links` section for completeness and to allow recreating the original protocol buffer message.

- Links whose names would conflict with other top level keys are not included in the top level object. They are only accessible in the `ordered-links` section by iterating through the values.

- Links are not accessible on the top level object. Applications that are using protocol buffer objects such as unixfs will have to handle that and special case for legacy objects.

- No escaping is needed and no conflict is possible

### Other variation: escape encoding

A protocol buffer message would be converted the following way:

{
  "data": "<Data>",
  "named-links": {
    "<Links[0].Name>": {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    "<Links[1].Name>": {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    },
    ...
  },
  "ordered-links": [
    {
      "mlink": "<Links[0].Hash.(base58)>",
      "name": "<Links[0].Name>",
      "size": <Links[0].Tsize>
    },
    {
      "mlink": "<Links[1].Hash.(base58)>",
      "name": "<Links[1].Name>",
      "size": <Links[1].Tsize>
    }
  ]
}

Notes:

- The `links` array in the `@attrs` section is there to preserve order and duplicate links to hash back to the exact same protocol buffer object.

- Link hashes are encoded in base58

- The link names are escaped to prevent clashing with the `@attr` key.

- The escaping mechanism transforms the `@` character into `\@`. This mechanism also implies a modification of the path algorithm. When a path component contains the `@` character, it must be escaped to look it up in the IPLD Node object.

For example, a _filesystem merkle-path_ `/root/first@component/second@component/third_component` would look for object `root["first\@component"]["second\@component"]["third_component"]` (following mlinks when necessary).
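
A small Python sketch of that lookup, assuming nested dictionaries stand in for IPLD node objects and that only `@` is escaped, as described in the notes above. Following mlinks is omitted, and the names `escape_component` and `lookup` are hypothetical.

```python
def escape_component(component):
    """Escape a path component before using it as a key in an IPLD node object."""
    return component.replace("@", "\\@")


def lookup(root, path):
    """Resolve a slash-separated path below `root`, escaping each component."""
    node = root
    for component in path.strip("/").split("/"):
        node = node[escape_component(component)]
    return node


# Stand-in for the object named `root` in the example path above.
root = {
    "first\\@component": {
        "second\\@component": {
            "third_component": "leaf",
        },
    },
}
found = lookup(root, "first@component/second@component/third_component")
```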

**FIXME: Using the `@` character is not mandatory. Any other character could fit. Don't hesitate to give your ideas.**

### Other variation that avoids escaping

We can imagine another transformation where the link names are not escaped. For example:

{
  "<Links[0].Name>": {
    "mlink": "<Links[0].Hash.(base58)>",
    "tsize": <Links[0].Tsize>
  },
  "<Links[2].Name>": {
    "mlink": "<Links[2].Hash.(base58)>",
    "tsize": <Links[2].Tsize>
  },
  ...
  ".": {
    "data": "<Data>",
    "links": [
      "<Links[0].Name>",
      {
        "name": "<Links[1].Name>",
        "mlink": "<Links[1].Hash.(base58)>",
        "tsize": <Links[1].Tsize>
      },
      "<Links[2].Name>",
      ...
    ]
  }
}

Notes:

- Very conveniently, we use the key `.` to represent data for the current node, and any other key can represent a link. This means that we forbid links to be named `.`. This is in any case a good thing to do, as the `.` element in paths can always be removed (the same way `..` can be replaced by the parent directory).

- No escaping is needed, and no modification to the path algorithm is needed.

- Link order is kept by using the link names for links present in the top node. This avoids repeating identical data (even though it is probably generated on the fly).

Links that cannot be present in the top node (the case for the link named `.`, which is forbidden, or for links that are repeated with the same name) are present in full to allow reconstructing the exact protocol buffer object.
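
A minimal Python sketch of this variation, under the same illustrative assumptions as before: hashes are already base58 text, every link carries a name and a size, and the function name `to_dot_form` and the input layout are hypothetical.

```python
def to_dot_form(data, links):
    """links: list of dicts with keys 'hash' (base58 text), 'name', 'tsize'."""
    node = {}
    ordered = []
    for link in links:
        name = link["name"]
        if name != "." and name not in node:
            # Representable at the top level: store once, reference by name.
            node[name] = {"mlink": link["hash"], "tsize": link["tsize"]}
            ordered.append(name)
        else:
            # Forbidden name "." or repeated name: keep the link in full.
            ordered.append({"name": name, "mlink": link["hash"], "tsize": link["tsize"]})
    node["."] = {"data": data, "links": ordered}
    return node


dot_node = to_dot_form("payload", [
    {"hash": "QmA", "name": "a", "tsize": 1},
    {"hash": "QmB", "name": ".", "tsize": 2},  # forbidden at the top level
    {"hash": "QmC", "name": "a", "tsize": 3},  # duplicate of link #0
])
```

The placeholder hashes are not real multihashes. With this layout a path component can be looked up directly as `dot_node["a"]`, without any escaping step.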
21 changes: 21 additions & 0 deletions merkledag/ipld.md
@@ -72,6 +72,23 @@ O_5 = | "hello": "world" | whose hash value is QmR8Bzg59Y4FGWHeu9iTYhwhiP8PHCN

This entire _merkle-path_ traversal is a unix-style path traversal over a _merkle-dag_ which uses _merkle-links_ with names.

**[In case we use escaping in protobuf IPLD format]**

To avoid restricting individual path components by disallowing some file names, while still allowing arbitrary data to be stored in IPLD objects, path components must be escaped when they are looked up in IPLD objects.

To escape a path component in order to look it up in an IPLD object:

- every `\` character in the path component must be replaced with `\\`
- every `@` character in the path component must be replaced with `\@`

As a result, any IPLD object key that contains an unescaped `@` character is not accessible through a _filesystem merkle-path_. Such a key is reserved: it can be used to store auxiliary data without making it a link or making it visible in regular filesystems. This data can be made available in filesystems through extended attributes or by opening and reading file contents.

To unescape IPLD object keys that are not reserved and get the corresponding path component:

- every `\@` sequence in the key must be replaced by `@`
- every `\\` sequence in the key must be replaced by `\`
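
The two rule sets above can be sketched in Python as follows. The function names are hypothetical, and the unescaping is written as a single left-to-right pass instead of two successive replacements, which is a design choice of this sketch; for keys produced by `escape_key` both approaches give the same result.

```python
def escape_key(component):
    """Escape a path component: backslashes first, then the @ character."""
    return component.replace("\\", "\\\\").replace("@", "\\@")


def unescape_key(key):
    """Turn a non-reserved IPLD object key back into a path component."""
    out = []
    i = 0
    while i < len(key):
        if key[i] == "\\" and i + 1 < len(key) and key[i + 1] in "\\@":
            out.append(key[i + 1])  # drop the escaping backslash
            i += 2
        else:
            out.append(key[i])
            i += 1
    return "".join(out)


def is_reserved(key):
    """A key is reserved when it contains an @ that is not escaped."""
    i = 0
    while i < len(key):
        if key[i] == "\\":
            i += 2  # skip the escaped character
        elif key[i] == "@":
            return True
        else:
            i += 1
    return False
```

For instance, `escape_key("first@component")` yields a key that `is_reserved` rejects, while a key such as `@attrs` is reported as reserved and therefore never matches a path component.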


## What is the IPLD Data Model?

The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into.
@@ -319,6 +336,10 @@ In the same way, when the receiver is storing the object, it must make sure that
A simple way to store such objects with their format is to store them with their multicodec header.


### Other encodings

Speak up in comments if you have other encoding suggestions, or if you have another implementation with another encoding.

## Datastructure Examples

It is important that IPLD be a simple, nimble, and flexible format that does not get in the way of users defining new or importing old datastructures. For this purpose, below I will show a few example data structures.