IPLD CBOR tagging #61
- `<tag-escaped-key>`: **[If key escaping is necessary]** The string that follows (major type 2 or 3) is interpreted as an escaped string (of the same major type). Every occurrence of `\` is considered to be `\\`, and every occurrence of `@` is considered to be `\@`.
- `<tag-base58>`: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string.
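For readers following the `<tag-escaped-key>` rule quoted above, here is a minimal sketch of the escaping it describes: a backslash becomes `\\` and `@` becomes `\@`. The function names `escape_key` and `unescape_key` are hypothetical and do not come from the PR; this only illustrates the rule, not how the spec or any implementation does it.

```python
def escape_key(raw: str) -> str:
    """Escape a raw key: double every backslash first, then prefix every '@'."""
    return raw.replace("\\", "\\\\").replace("@", "\\@")


def unescape_key(escaped: str) -> str:
    """Inverse of escape_key: drop each escaping backslash, keep the next char."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch == "\\":
            # A backslash always escapes exactly one following character.
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)


# Round trip on a few keys, including ones that already contain '\' or '@'.
for key in ["@link", "a\\b", "plain", "\\@"]:
    assert unescape_key(escape_key(key)) == key
```

Backslashes have to be escaped before `@`; otherwise the backslash added in front of `@` would itself be doubled.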
nice! wonder --then-- if we can store the hashes compressed. having the hashes encoded on base58 is a space waste, so if a small conversion can be done transparently, it may be worth it.
(https://github.com/jbenet/multiaddr/ does something like this. this may be well beyond scope here, and may not be worth the time.)
This spec describes how to transform textual base58 encoded data to actual binary data in the wire format. It can be done with other textual formats that could be represented in binary, only if the function to convert between the two is bijective (there is only one canonical way to convert).
We can add this later, that's why the multicodec header contains a version number (`/cbor/ipld-tagsv1`).
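To make the bijectivity requirement concrete, here is a minimal base58 round-trip sketch. It assumes the Bitcoin alphabet, which is the one IPFS uses; the helper names `b58encode` and `b58decode` are illustrative and not taken from any particular library.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as base58; each leading zero byte becomes a leading '1'."""
    n = int.from_bytes(data, "big")
    digits = ""
    while n > 0:
        n, rem = divmod(n, 58)
        digits = ALPHABET[rem] + digits
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + digits


def b58decode(text: str) -> bytes:
    """Decode base58 text; each leading '1' becomes a leading zero byte."""
    n = 0
    for ch in text:
        n = n * 58 + ALPHABET.index(ch)  # raises ValueError on invalid chars
    zeros = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body


# The hash used elsewhere in this thread survives text -> bytes -> text unchanged.
text = "Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk"
assert b58encode(b58decode(text)) == text
```

Because leading zero bytes and leading `1` characters map onto each other, every byte string has exactly one text form and vice versa, which is the property the comment above asks for.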
@mildred @jbenet This LGTM for the most part. I had a slightly different tagging system in my head - I suspect it's more or less equivalent to what you're doing, but here it is for completeness: Assuming no escaping, I think we could get by (perhaps?) with a single extra tag:
If we do have escaping, would it make more sense to tag the _un_escaped keys rather than the escaped ones? In other words, should we tag the "special"
Please let me know if the assumptions I'm making about valid mlink addresses and/or escaping are incorrect. |
@davidar I thought about the three tags after these comments from issue #37. The idea was to be able to express not only /ipfs/ links, but also other kinds of links. The idea was also to allow compressing base58 data independently of whether it is a link. Is that over-engineering? Now that I think about it, the solution presented in this PR is not ideal since new tags must be registered with IANA when we need to express other kinds of links ( As for the escaping, I don't know if we should give it special treatment. I think we should not. Currently, escaping is only needed to convert legacy protocol buffer objects (see PR #59). On new IPLD objects, escaping will not be used at all (or will be defined by individual applications). Merkle-paths are no longer involved in key escaping if you look at PR #62.
I would still like to keep a separate base58 tag that is using the same idea as the existing base64 tag defined in the CBOR RFC section 2.4.4.2. So perhaps we should define at least two tags:
Sounds reasonable. We might need some way of specifying which variant of base58 we're using though.
Hmm, that does make things more tricky. Embedding textual prefixes seems a little bit wasteful, but numeric identifiers might be too rigid. On that topic, I'm having some difficulties resolving the inconsistency between multihash and path addresses, since we're mixing two different prefixing strategies. Something like
@jbenet could you clarify the situation here? |
I know there are multiple alphabets, but perhaps we shouldn't encourage using too many variations. To handle different base58 alphabets, I would introduce a new tag for each. What do you think?
We can also introduce a numerical mapping, but defined in this spec (and further iteration of this spec). And fallback to a string if the numerical mapping is not enough. We avoid registering every new link with IANA and allow extensions if necessary.
If we do, adding tags to the non-escaped keys is more difficult because the IPLD data model defines keys as escaped. I prefer to take the approach of adding tags to the values we modified compared to the IPLD data model (the escaped keys we unescape) rather than adding tags to the unmodified values (the non-escaped values). What I would do to add tags to escaped keys only would be:
Pushed a new version which has a simpler escaping mechanism (more flexible) and removed tagging of escaped keys (until it is more needed).
@davidar Is it still LGTM? it is for me at this point.
waiting for clarification from @jbenet first
On the path addresses
IPLD objects can be represented using cbor using the tags described below when possible. Tags are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4):
- `<tag-base58>`: the byte string that follows (major type 2) is to be interpreted as a text string instead (major type 3). This text string is the base58 encoded version of the byte string (using the IPFS alphabet).
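As a rough illustration of what the `<tag-base58>` rule above buys on the wire, the sketch below hand-encodes a tagged byte string and an untagged text string using the CBOR head layout from RFC 7049 section 2.1. The tag number `HYPOTHETICAL_TAG_BASE58` is a placeholder invented for this example (the PR does not fix a number here), and the 34-byte payload is dummy data shaped like a sha2-256 multihash.

```python
def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (RFC 7049 section 2.1)."""
    if value < 24:
        return bytes([(major << 5) | value])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):
            return bytes([(major << 5) | extra]) + value.to_bytes(size, "big")
    raise ValueError("value too large for CBOR")


# Placeholder only: NOT a registered or proposed tag number.
HYPOTHETICAL_TAG_BASE58 = 999


def encode_text(text: str) -> bytes:
    """Plain CBOR text string (major type 3)."""
    raw = text.encode("utf-8")
    return cbor_head(3, len(raw)) + raw


def encode_tagged_bytes(tag: int, payload: bytes) -> bytes:
    """CBOR tag (major type 6) wrapping a byte string (major type 2)."""
    return cbor_head(6, tag) + cbor_head(2, len(payload)) + payload


# Dummy multihash-shaped payload: 2 header bytes followed by 32 digest bytes.
payload = bytes([0x12, 0x20]) + bytes(32)
as_text = encode_text("Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk")
as_tagged = encode_tagged_bytes(HYPOTHETICAL_TAG_BASE58, payload)
print(len(as_text), len(as_tagged))
```

With a tag number this large the tag itself costs three bytes; a tag below 24 would cost one.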
Would the presence of multiformats/multihash#14 change the thinking here? (im not set that multiformats/multihash#14 will happen, but it would solve an annoying wrinkle, and speed up a bunch of applications, allowing us to choose the encoding to fit the use case better)
i think we should endeavor to always store the compressed version of the multihash. perhaps what we really need is a format like multiaddr's binary and text (compressed/uncompressed) duality: https://github.com/jbenet/multiaddr#binary-format
We could actually get away with doing this with multiaddr directly, we would just need to add types and an int code for all schemes we want to support (`ipfs`, `ipns`, `dns` for now). OR we could have a thing like multiaddr, but with a different table, explicitly for URI things... (we likely should just use multiaddr)
> Would the presence of multiformats/multihash#14 change the thinking here?

I commented there, but I don't like this idea when applied to lower level components. It creates a few problems, the first being that a single multihash could now be encoded in various different ways. We could always make sure the protocol would only have a canonical version (say binary encoded)... This could work.
And CBOR already has a tag to represent base64 encoded data, so it only makes sense to have a similar tag for base58.
As for encoding the path prefix, I admit I don't understand how you imagine them. I had the idea that a unix path was relative only to a single computer, and depending on the computer you have, the path might represent different things. URLs add a prefix (followed by `:`) to the path to express how they should be interpreted. Since day one, I was puzzled by the use of unprefixed paths in IPFS. So I have little idea how they should be encoded.
> And CBOR already has a tag to represent base64 encoded data, so it only makes sense to have a similar tag for base58.

nice

> As for encoding the path prefix, I admit I don't understand how you imagine them. I had the idea that a unix path was relative only to a single computer, and depending on the computer you have, the path might represent different things. URLs add a prefix (followed by `:`) to the path to express how they should be interpreted. Since day one, I was puzzled by the use of unprefixed paths in IPFS. So I have little idea how they should be encoded.

IPFS paths are similar, im sorry for the confusion there-- it's an annoying uphill battle for unixy composability. (someone recorded a video of me explaining in detail (ranting?) about this-- wonder what became of it.) the path `/ipfs/<hash>` is "like" `ipfs://<hash>`; it's just that we make the scheme just the first path component instead of a colon-separated structure, so that it fits in more places + is nicely composable/mappable.
The more I think about multiformats/multihash#14, the more I like it. Still I see so many use cases to the multihash without the encoding prefix I think it should be done as a separate layer. There is one issue left with that vision of the multihash. We can't encode a binary multihash as a path component because the binary string might contain the byte We can decide that this is becoming too complex and don't strive for binary encoding of paths. However, let's suppose we have solved this for the rest of this post. If we apply this to our problem here, we could encode links as paths containing binary multihash, having a way to decode them to a textual representation:
Now, for encoding:
This is becoming quite complex. A little too much for my liking. Why not just require paths to be base58 encoded and support no other encoding? |
Agreed on separate layer.
Yes, this is indeed a problem. a binary encoded multihash in a string path is dangerous and should not be supported. we were already bitten by something like this.
I think the above would work, yes. But agreed. this is getting very complex. I think paths should work like this:
On the compression-- it may be the case that gzipping or huffman coding everything (wire transports and disk representations) would get most of the benefits anyway. |
@mildred have you taken a look at how multiaddr's binary encoding works? its readme is not very good, but the gist is something like this:

    # get a multiaddr
    > ipfs swarm addrs local --id | head -n1
    /ip4/127.0.0.1/tcp/4001/ipfs/Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk
    # save it to var
    > addr=/ip4/127.0.0.1/tcp/4001/ipfs/Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk
    # export the compressed binary representation and count its bytes
    > multiaddr --format bytes "$addr" | wc -c
    46

So we can have a compressed representation, which also turns a scheme string like "/ipfs" into a single byte. so a path lik The multiaddr route may not be what we want here, but it is something we already have, and solves this exact use case. And it already has support for base58 text/binary coding. (if it is a multihash (i.e. our use case) we have redundant len varints, but useful for base58 things that are not self-described). this facet of multiaddr could be extended with whatever multiformats/multihash#14 becomes.
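The text/binary duality described above can be sketched in a few lines. This is only an illustration of the idea, not multiaddr's actual wire format: the one-byte codes in `CODES`, the single length byte, and the function name `pack_addr` are all assumptions made for the example, so the byte count will not match the real tool's output.

```python
import ipaddress

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Illustrative one-byte scheme codes; NOT the real multiaddr protocol table.
CODES = {"ip4": 0x04, "tcp": 0x06, "ipfs": 0x2A}


def b58decode(text: str) -> bytes:
    """Decode base58 (Bitcoin/IPFS alphabet) into bytes."""
    n = 0
    for ch in text:
        n = n * 58 + ALPHABET.index(ch)
    zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * zeros + n.to_bytes((n.bit_length() + 7) // 8, "big")


def pack_addr(addr: str) -> bytes:
    """Pack a /scheme/value/... address into a compact binary form."""
    parts = addr.strip("/").split("/")
    out = bytearray()
    # The address is treated as a flat list of (scheme, value) pairs.
    for scheme, value in zip(parts[0::2], parts[1::2]):
        out.append(CODES[scheme])
        if scheme == "ip4":
            out += ipaddress.IPv4Address(value).packed  # 4 bytes
        elif scheme == "tcp":
            out += int(value).to_bytes(2, "big")  # 2-byte port
        elif scheme == "ipfs":
            raw = b58decode(value)
            out.append(len(raw))  # single length byte; assumes len(raw) < 256
            out += raw
    return bytes(out)


addr = "/ip4/127.0.0.1/tcp/4001/ipfs/Qmbatr97vqpmrXg6tubdpUBwJrLzvYPHe9vS7xc9b7y3nk"
packed = pack_addr(addr)
print(len(addr), len(packed))
```

The saving comes from two places: each scheme name collapses to a code, and the base58 hash is stored as raw bytes.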
What kind of addresses do we want links to support? Will they just be flat
We do want to link to fragments. that's one of the important parts of IPLD. being able to make larger objects where needed/reasonable, but be able to link seamlessly to parts of them without having to duplicate entities. |
@jbenet hrm, that does complicate things a bit. We could still encode the I'm tempted to say, make every fragment semantically a separate object, and only compose them into larger objects as a practical optimisation, but I suspect that would be a rather major change. [ If we had computable objects already it would also be easier --- just link to the program that extracts the relevant fragment from an object :p ]
I haven't got any opinion on that. I didn't know multiaddr had a binary encoding, and that may be a nice thing to use. We'll have to encode the leading path fragment of course. We have to keep in mind that any binary encoding we have will have to ensure that any text we encode gets decoded in the exact same way. Possible issues would be double For all |
I am now convinced that we should define the format of the address somewhere else than in IPLD. And it could be multiaddr or an extension of it if we want to express link fragments. Then, the encoding would be simple:
Are you all OK with that?
@davidar, I pushed a version of the encoding with a single tag as you suggested first. This is much simpler. The only difference is that the binary encoding of the address is not specified; it is left to multiaddr to specify.
I implemented this spec in go-ipld/coding/cbor.
@mildred LGTM :) |
yes sgtm |
IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `mlink`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. | ||
IPLD supports a variety of serialized data formats through [multicodec](https://github.com/jbenet/multicodec). These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. | ||
## Serialised CBOR with tags
this should be `###` level
@mildred minor edits, else im good to merge this. |
Are multiaddresses the place to define merkle-links? Or do we need something that works the same but is specifically designed for merkle-links? edit: sorry, I didn't see your last comment
63cf2ab to dfb8903
I rebased this branch on top of |
Thanks @mildred -- this LGTM! |
See PR #37
TODO items: