Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPLD CBOR tagging #61

Merged
merged 6 commits into from
Feb 12, 2016
Merged

IPLD CBOR tagging #61

merged 6 commits into from
Feb 12, 2016

Conversation

mildred
Copy link
Contributor

@mildred mildred commented Jan 8, 2016

See PR #37

TODO items:

  • register CBOR tags with IANA

@jbenet jbenet added the backlog label Jan 8, 2016
@mildred mildred mentioned this pull request Jan 8, 2016
5 tasks
@mildred
Copy link
Contributor Author

mildred commented Jan 8, 2016

@davidar, any opinions on this? @jbenet? anyone?


- `<tag-escaped-key>`: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrences of `\` are considered to be `\\`, and every occurrences of `@` are considered to be `\@`.

- `<tag-base58>`: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! wonder --then-- if we can store the hashes compressed. having the hashes encoded on base58 is a space waste, so if a small conversion can be done transparently, it may be worth it.

(https://github.com/jbenet/multiaddr/ does something like this. this may be well beyond scope here, and may not be worth the time.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec describes how to transform textual base58 encoded data to actual binary data in the wire format. It can be done with other textual formats that could be represented in binary, only if the function to convert between the two is bijective (there is only one canonical way to convert).

We can add this later, that's why the multicodec header contains a version number (/cbor/ipld-tagsv1)

@davidar
Copy link
Member

davidar commented Jan 9, 2016

@mildred @jbenet This LGTM for the most part. I had a slightly different tagging system in my head - I suspect it's more or less equivalent to what you're doing, but here it is for completeness:

Assuming no escaping, I think we could get by (perhaps?) with a single extra tag:

  • <tag-link-object>: Either
    • a pair (size-2 array, major type 4) must follow. The first element is a byte string representing a (binary) multihash (i.e. for an address "/ipfs/QmFooBar", strip the "/ipfs/" prefix and base58-decode the remainder). The second element is a map (major type 5) containing any additional metadata attributes.
    • or, if there is no additional metadata, the tag may be followed by a single (binary multihash) bytestring (instead of a pair of the bytestring and the empty map)

If we do have escaping, would it would make more sense to tag the _un_escaped keys rather than the escaped ones? In other words, should we tag the "special" @keys rather than the "regular" keys? For instance:

  • <tag-special-key>: The string that follows (major type 2 or 3) is interpreted as a special @directive key. For example, a JSON key "@mode" is translated into CBOR as <tag-special-key><string>mode. Untagged keys are assumed to be taken literally, and hence @ and \ characters should be escaped when encoded as JSON, e.g. CBOR <string>foo@bar\baz -> JSON "foo\@bar\\baz"

Please let me know if the assumptions I'm making about valid mlink addresses and/or escaping are incorrect.

@mildred
Copy link
Contributor Author

mildred commented Jan 9, 2016

@davidar I thought about the three tags after these comments from issue #37. The idea was to be able to express not only /ipfs/ links, but also other kind of links. The idea was also to allow compressing base58 data independently from the fact that it is a link. is that over-engineering?

Now that I think about it, the solution presented in this PR is not ideal since new tags must be registered with IANA when we need to express other kind of links (/ipld/... or others).

As for the escaping, I don't know if we should give it special treatment. I think we should not. Currently, escaping is only needed to convert legacy protocol buffer objects (see PR #59). On new IPLD objects, escaping will not be used at all (or will be defined by individual applications). Merkle-paths are no longer involved in key escaping if you look at PR #62.

@mildred
Copy link
Contributor Author

mildred commented Jan 9, 2016

I would still like to keep a separate base58 tag that is using the same idea as the existing base64 tag defined in the CBOR RFC section 2.4.4.2. So perhaps we should define at least two tags:

  • <tag-base58>: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string.
  • <tag-link-object>: an array of three items must follow containing a link prefix (text string), a link hash (text string or base58 tag) and a map. The tag when interpreted is a map with a first element with key link and value link_prefix + link_hash, and the other elements as defined by the map (3rd element).

@davidar
Copy link
Member

davidar commented Jan 10, 2016

I would still like to keep a separate base58 tag that is using the same idea as the existing base64 tag

Sounds reasonable. We might need some way of specifying which variant of base58 we're using though.

when we need to express other kind of links (/ipld/... or others).

Hmm, that does make things more tricky. Embedding textual prefixes seems a little bit wasteful, but numeric identifiers might be too rigid.

On that topic, I'm having some difficulties resolving the inconsistency between multihash and path addresses, since we're mixing two different prefixing strategies. Something like /ipfs/QmFooBar would be more consistent as either

  • /ipfs/sha256/FooBar, or
  • IpQmFooBar (or whatever)

@jbenet could you clarify the situation here?

@mildred
Copy link
Contributor Author

mildred commented Jan 10, 2016

Sounds reasonable. We might need some way of specifying which variant of base58 we're using though.

I know there are multiple alphabets, but perhaps we shouldn't encourage using too much variations. To handle different base58 alphabets, I would introduce a new tag for each. What do you think?

Hmm, that does make things more tricky. Embedding textual prefixes seems a little bit wasteful, but numeric identifiers might be too rigid.

We can also introduce a numerical mapping, but defined in this spec (and further iteration of this spec). And fallback to a string if the numerical mapping is not enough. We avoid registering every new link with IANA and allow extensions if necessary.

As for the escaping, I don't know if we should give it special treatment. I think we should not.

If we do, adding tags to the non escaped keys is more difficult because the IPLD data model defines keys as escaped. I prefer to take the approach of adding tags to the values we modified compared to the IPLD data model (the escaped keys we are unescape) rather than adding tags to the unmodified values (the non escaped values).

What I would do to add tags to escaped keys only would be:

  • add <escaped-map> tag to CBOR maps that contain escaped keys
  • keys in this map that are escaped in IPLD data model (escape(unescape(key)) == key) are stored unescaped in that map
  • keys that do not contain all escape sequences (escape(unescape(key)) != key) are stored with <tag-unescaped> in front

@mildred
Copy link
Contributor Author

mildred commented Jan 10, 2016

Pushed a new version which has a simpler escaping mechanism (more flexible) and removed tagging of escaped keys (until it is more needed).

@mildred
Copy link
Contributor Author

mildred commented Jan 10, 2016

@davidar Is it still LGTM? it is for me at this point.

@davidar
Copy link
Member

davidar commented Jan 19, 2016

waiting for clarification from @jbenet first

@jbenet
Copy link
Member

jbenet commented Jan 22, 2016

On the path addresses

  • A multihash is explicitly designed not to have textual hash names, it is precisely done this way to include the function along with the digest value and ensure it is transferred always in-band correctly. many systems have no out of band way of specifying such values. see https://github.com/jbenet/multihash
  • to solve the annoying "what base is this" question and to allow other bases, a new version of multihash may gain a base encoding character prefix, see discussion here First byte used to represent the encoding or settle on base58bitcoin-multihash? multiformats/multihash#14 (we may be able to get away with this while being backwards compatible)
  • the /ipfs/<hash>/<path> address is a path, effectively a "Filesystem URI" -- a URI whose scheme identifier is unix filesystem friendly. the schemes may vary and have different URIs altogether. like /ipns/<name>/<path>, or /dns/<domain>/<path>.
  • @davidar while it may seem "more consistent" to do all text or all encoded, it's not-- the point is that the hash is a component in the path. note that the path naming has vast implications when dealing with all sorts of systems (filesystems, browsers, etc). and an explicit goal is creating trustable URIs (trustable paths).


IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4):

- `<tag-base58>`: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string (using the IPFS alphabet).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the presence of multiformats/multihash#14 change the thinking here? (im not set that multiformats/multihash#14 will happen, but it would solve an annoying wrinkle, and speed up a bunch of applications, allowing us to chose the encoding to fit the use case better)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should endeavor to always store the compressed version of the multihash. perhaps what we really need is a format like multiaddr's binary and text (compressed/uncompressed) duality: https://github.com/jbenet/multiaddr#binary-format

We could actually get away with doing this with multiaddr directly, we would just need to add types and an int code for all schemes we want to support (ipfs, ipns, dns for now). OR we could have a thing like multiaddr, but with a different table, explicitly for URI things... (we likely should just use multiaddr)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the presence of multiformats/multihash#14 change the thinking here?

I commented there, but I don't like this idea when applied to lower level components. It creates a few problems, the first being that a single multihash could now be encoded in various different ways. We could always make sure the protocol would only have a canonical version (say binary encoded)... This could work.

And CBOR already have a tag to represent base64 encoded data, it only makes sense to have a similar tag for base58.

As for encoding the path prefix, I admit I don't understand how you imagine them. I had the idea that a unix path was relative only to a single computer, and depending on the computer you have, the path might represent different things. URLs add a prefix (followed by :) to the path to express how they should be interpreted. Since day one, I was puzzled by the use of unprefixed paths in IPFS. So I have little idea how they should be encoded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And CBOR already have a tag to represent base64 encoded data, it only makes sense to have a similar tag for base58.

nice

As for encoding the path prefix, I admit I don't understand how you imagine them. I had the idea that a unix path was relative only to a single computer, and depending on the computer you have, the path might represent different things. URLs add a prefix (followed by :) to the path to express how they should be interpreted. Since day one, I was puzzled by the use of unprefixed paths in IPFS. So I have little idea how they should be encoded.

IPFS paths are similar, im sorry for the confusion there-- it's an annoying uphill battle for unixy composability. (someone recorded a video of me explaining in detail (ranting?) about this-- wonder what became of it.) the path /ipfs/<hash> is "like" ipfs://<hash> it's just that we make the scheme just the first path component instead of a colon-separated structure, so that it fits in more places + is nicely composable/mappable.

@mildred
Copy link
Contributor Author

mildred commented Jan 22, 2016

The more I think about multiformats/multihash#14, the more I like it. Still I see so many use cases to the multihash without the encoding prefix I think it should be done as a separate layer.

There is one issue left with that vision of the multihash. We can't encode a binary multihash as a path component because the binary string might contain the byte / and that specifically disallowed for path components. When splitting a path into components, that would be a problem.

We can decide that this is becoming too complex and don't strive for binary encoding of paths. However, let's suppose we have solved this for the rest of this post.

If we apply this to our problem here, we could encode links as paths containing binary multihash, having a way to decode them to a textual representation:

  • <tag-link> followed by a binary string:

    We'll have to make sure we don't transform the path too much. In particular, if there are double // in the paths, even though they can be simplified to a single /, this simplification should not happen. We strive to get the exact binary compatible representation we had in input.

Now, for encoding:

  • Suppose we have a textual link we want to encode in CBOR:

    • examine each path component, if a path component can be interpreted as a multihash with a byte prefix encoding, encode it to binary
    • if there were multiple path components and they were encoded differently (one in base64, the other in base58), abort and keep the original string
    • if any were encoded in a way that cannot be completely reproduced (hexadecimal for example where uppercase/lowercase is cannot be reproduced from binary), abort

    If we didn't abort, we have one path component or multiple path components converted to binary, but all with the same original encoding. Take the character encoding and put it in front of the string.


This is becoming quite complex. A little too much for my liking. Why not just require paths to be base58 encoded and support no other encoding?

@jbenet
Copy link
Member

jbenet commented Jan 24, 2016

The more I think about multiformats/multihash#14, the more I like it. Still I see so many use cases to the multihash without the encoding prefix I think it should be done as a separate layer.

Agreed on separate layer.

There is one issue left with that vision of the multihash. We can't encode a binary multihash as a path component because the binary string might contain the byte / and that specifically disallowed for path components. When splitting a path into components, that would be a problem.

Yes, this is indeed a problem. a binary encoded multihash in a string path is dangerous and should not be supported. we were already bitten by something like this.

We can decide that this is becoming too complex and don't strive for binary encoding of paths. However, let's suppose we have solved this for the rest of this post.

If we apply this to our problem here, we could encode links as paths containing binary multihash, having a way to decode them to a textual representation:

  • <tag-link> followed by a binary string:

    We'll have to make sure we don't transform the path too much. In particular, if there are double // in the paths, even though they can be simplified to a single /, this simplification should not happen. We strive to get the exact binary compatible representation we had in input.

Now, for encoding:

  • Suppose we have a textual link we want to encode in CBOR:

    • examine each path component, if a path component can be interpreted as a multihash with a byte prefix encoding, encode it to binary
    • if there were multiple path components and they were encoded differently (one in base64, the other in base58), abort and keep the original string
    • if any were encoded in a way that cannot be completely reproduced (hexadecimal for example where uppercase/lowercase is cannot be reproduced from binary), abort

    If we didn't abort, we have one path component or multiple path components converted to binary, but all with the same original encoding. Take the character encoding and put it in front of the string.


This is becoming quite complex. A little too much for my liking. Why not just require paths to be base58 encoded and support no other encoding?

I think the above would work, yes. But agreed. this is getting very complex. I think paths should work like this:


On the compression-- it may be the case that gzipping or huffman coding everything (wire transports and disk representations) would get most of the benefits anyway.

@jbenet
Copy link
Member

jbenet commented Jan 24, 2016

@mildred have you taken a look at how multiaddr's binary encoding works?

its readme is not very good, but the gist is that something like this:

# get a multiaddr
> ipfs swarm addrs local --id | head -n1
/ip4/127.0.0.1/tcp/4001/ipfs/Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk

# save it to var
> addr=/ip4/127.0.0.1/tcp/4001/ipfs/Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk

# export the compressed binary representation
> multiaddr --format bytes $(addr) | wc -c
  46

So we can have a compressed representation, which also turns a scheme string like "/ipfs" into a single byte. so a path lik /ipfs/<base58-encoded-string> could be as small as <1-byte-for-scheme><1-byte-for-len><binary-string>.

The multiaddr route may not be what we want here, but it is something we already have, and solves this exact use case. And it already has support for base58 text/binary coding. (if it is a multihash (i.e. our use case) we have redundant len varints, but useful for base58 things that are not self-described). this facet of multiaddr could be extended with whatever multiformats/multihash#14 becomes.

@davidar
Copy link
Member

davidar commented Jan 24, 2016

What kind of addresses do we want links to support? Will they just be flat /ipfs/QmFooBar, or the full gamut of /ipfs/QmFooBar/foo/bar/baz.quux? Personally, the latter seems excessive as it doesn't add any expressive power over the former (unless we want to be linking to fragments of objects, which seems odd). The former would be much easier to implement (as opaque pointers, with a namespace):

<cbor-tag-pointer> <cbor-integer=namespace> <cbor-bytestring=address>
                   /ipfs/                   QmFooBar...

@jbenet
Copy link
Member

jbenet commented Jan 24, 2016

the latter seems excessive as it doesn't add any expressive power over the former (unless we want to be linking to fragments of objects, which seems odd).

We do want to link to fragments. that's one of the important parts of IPLD. being able to make larger objects where needed/reasonable, but be able to link seamlessly to parts of them without having to duplicate entities.

@davidar
Copy link
Member

davidar commented Jan 24, 2016

@jbenet hrm, that does complicate things a bit. We could still encode the /ipfs/QmFoo prefix as above, but I suspect encoding the rest of the path isn't going to buy us much over a plain text representation. Another option would be to introduce (yet another) CBOR tag to express "extract this attribute from this object", but that might be over-engineering the problem. Unless that would be a generally useful thing to have in its own right, but I can't think of any cases off the top of my head.

I'm tempted to say, make every fragment semantically a separate object, and only compose them into larger objects as a practical optimisation, but I suspect that would be a rather major change.

[ If we had computable objects already it would also be easier --- just link to the the program that extracts the relevant fragment from an object :p ]

@mildred
Copy link
Contributor Author

mildred commented Jan 25, 2016

I haven't got any opinion on that. I didn't know multiaddr had a binary encoding, and that may be a nice thing to use. We'll have to encode the leading path fragment of course.

We have to keep in mind that any binary encoding we have will have to ensure that any text we encode gets decoded in the exact same way. Possible issues would be double /, case insensitive characters, leading or trailing / that might be ignored / non significant...

For all link we must ensure that decode(encode(link)) == link.

@mildred
Copy link
Contributor Author

mildred commented Feb 9, 2016

I am now convinced that we should define the format of the address somewhere else than in IPLD. And it could be multiaddr or an extension of it if we want to express link fragments. Then, the encoding would be simple:

  • if the link property is a valid multiaddr and we can guarantee that decode-maddr(encode-to-maddr(link)) == link in every case, encode the link to a multiaddr format
  • else keep the link in plain text (it's an invalid link anyway)

Are you all ok with that ?

@mildred
Copy link
Contributor Author

mildred commented Feb 9, 2016

@davidar, I pushed a version of the encoding with a single tag as you suggested first. This is much more simple. The only difference is that the binary encoding of the address is not specified. it is left to the multiaddress to specify.

@mildred
Copy link
Contributor Author

mildred commented Feb 11, 2016

I implemented this spec in go-ipd/coding/cbor.

@davidar
Copy link
Member

davidar commented Feb 11, 2016

@mildred LGTM :)

@jbenet
Copy link
Member

jbenet commented Feb 11, 2016

@mildred

if the link property is a valid multiaddr and we can guarantee that decode-maddr(encode-to- maddr(link)) == link in every case, encode the link to a multiaddr format
else keep the link in plain text (it's an invalid link anyway)
Are you all ok with that ?

yes sgtm

IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes.
IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes.

## Serialised CBOR with tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be ### level

@jbenet
Copy link
Member

jbenet commented Feb 11, 2016

@mildred minor edits, else im good to merge this.

@mildred
Copy link
Contributor Author

mildred commented Feb 11, 2016

Are multiaddresses the place to define merkle-links ? Or do we need something that works the same but is specifically designed for merkle links?

edit: sorry, I didn't see your last comment

@mildred
Copy link
Contributor Author

mildred commented Feb 11, 2016

I rebased this branch on top of ipld-spec

@jbenet
Copy link
Member

jbenet commented Feb 12, 2016

Thanks @mildred -- this LGTM!

jbenet added a commit that referenced this pull request Feb 12, 2016
@jbenet jbenet merged commit 6734c93 into ipfs:ipld-spec Feb 12, 2016
@jbenet jbenet removed the backlog label Feb 12, 2016
@mildred mildred mentioned this pull request Feb 12, 2016
@daviddias daviddias added the IPLD label Mar 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants