-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The unified multicodecs theory #16
Conversation
05dbfb1
to
39712a2
Compare
| | /ip4/ | | 0x04 | n/a | | ||
| | /ip6/ | | 0x29 | n/a | | ||
| | /tcp/ | | 0x06 | n/a | | ||
| | /udp/ | | 0x11 | n/a | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clash with sha1
9431794
to
ad444ff
Compare
UpdateAfter much discussion (🚲🏚), we've understood that we can improve the clarity of how multicodec, multicodec-packed and multistream are design and their purpose, if we adjust the names of these protocols. In simple terms this means that:
This way:
|
@@ -38,67 +38,43 @@ Moreover, this self-description allows straightforward layering of protocols wit | |||
`multicodec` is a _self-describing multiformat_, it wraps other formats with a tiny bit of self-description: | |||
|
|||
```sh | |||
<multicodec-header><encoded-data> | |||
# or | |||
<varint-len><code>\n<encoded-data> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put back the newlines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
| **Multiformats** | | ||
| 0x182f6d756c7469636f6465632f | /multicodec/ | | 0x30 | n/a | | ||
| 0x162f6d756c7469686173682f | /multihash/ | | 0x31 | n/a | | ||
| 0x162f6d756c7469616464722f | /multiaddr/ | | 0x32 | n/a | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add multibase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
48f7366
to
f2d877e
Compare
f2d877e
to
b9eba22
Compare
b9eba22
to
a0eab30
Compare
| **IPLD formats** | | ||
| dag-pb | MerkleDAG protobuf | | | ||
| dag-cbor | MerkleDAG cbor | | | ||
| eth-rlp | Ethereum Block RLP | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add:
- git
- bitcoin
- stellar
@jbenet I'm making js-cid complete (as in, being able to handle bs58(multihash), raw multihash, cid, cidStr and cid construction by parts (version, codec, multihash)) and I found myself in a position of 'luck', because neighter CID v0 or CID v1 clash with the multihash table. For now it is ok, because multihash only starts at 0x11, which gives us 14 more CID versions (if we ever need to change it). However, a thought crossed my mind that CID code should be governed as a multicodec, so that we can then update the CID table, add another version that doesn't represent an incremental update to the previous number, so that all the parsers implemented meanwhile do not break. |
IPLD formats | ||
dag-pb, MerkleDAG protobuf, 0x70 | ||
dag-cbor, MerkleDAG cbor, 0x71 | ||
eth-block, Ethereum Block (RLP), 0x90 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these should be above 0x80
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because that means they take up two bytes when encoded. Anything that is going to be commonly used we would really love to have take only a single byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed sctp
is also above 0x80 (0x84)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Transport multicodecs are 'more okay' because they get transferred less, in the other hand, format multicodecs get transferred every time a block is transferred, so a byte actually means a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@whyrusleeping is this still a concern? I believe it stopped being as soon you merged IPLD in go-ipfs into master, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and, there are simply more than 127 things we "really care about". so 2 bytes is in the long run unavoidable.
codec, description, code | ||
|
||
miscelaneous | ||
bin, raw binary, 0x00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure i like zero being 'binary'. Zero is a default value in a lot of cases and this makes it difficult to tell between 'unknown/invalid' and 'binary'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Binary could be 0x55 which is 0b10101010
and is also value of Ethernet frame preamble.
@diasdavid can we sit down today and finish this? Its blocking my work on ipld |
@whyrusleeping sure, why is it blocking though? The IPLD formats have codes. |
I'm taking |
why not reserve the top level table only for the different encoding types? so something like
0x00 would be specail. What would follow is would be a custom encoding table which would map strings to their binary representation. This would allow apps to use dynamically create new tables. |
@diasdavid is a proposed format for the dictionaries? EDIT: i opened an issue on it here #18 |
sha1 and udp clash solutionI've added an extra 0 byte to
UPDATE: @Kubuxu add a very valuable point that I missed entirely here -- #16 (comment) -- , moving udp to 0x0111 instead of 0x0011. |
Stamp remaining multicodecsI'm feeling resistant to stamp the remaining multicodecs only by following rules of thumb. Is there any reasoning we can use to pick the remaining unspecified codes? This is the last item in order to merge this PR |
update: SOLVED I think that specifying that I asked the question about this quite a time ago: multiformats/unsigned-varint#5 Summing up, I think we shouldn't abuse varints in this way, we should specify that the number is prefixed with infinitely many leading zeros and that the canonical form is when there is minimum number of bytes taken to express the number. Then before crafting IPLD objects we would canonize such miss-formatted CIDs which should be fast and easy operation. |
@Kubuxu can you follow up in multiformats/unsigned-varint#5, or another issue in that repo, with examples? i'm not understanding what you're getting at. Either way, this spec uses https://github.com/multiformats/unsigned-varint so whatever that does, this will do too. let's resolve it there. |
I don't think there's a nice function for assigning these numbers. unfortunately it's a nasty problem, because there's a bunch of "domain specific" functions that may make sense-- but don't make sense throughout. A function i can think of that works uniformly well (even if not very well) is "first come first assign". meaning, a queue, assign them as they're requested. this is certainly going to make people unhappy about having to use 2 bytes for a commonly used function, but it's unavoidable anyway given there's definitely more than 127 common things users will care about... So, no. i can't think of a nice way that is not "domain specific and non-universal" or "rules of thumb" :( --- but there may be one if people want to keep thinking :) |
@diasdavid what is the diff here? what protocols changed value in the table? (i.e. what will break, if anything? I see UDP, is that the only one) |
bitcoin-block, Bitcoin Block, 0xb0 | ||
bitcoin-tx, Bitcoin Tx, 0xb1 | ||
stellar-block, Stellar Block, 0xd0 | ||
stellar-tx, Stellar Tx, 0xd1 | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this table should move to a CSV, and the readme should point to it or embed it.
curious why "binary has been migrated from 0x00 to 0x55" -- what was the reasoning? (sgtm, jw) |
but there was discussion elsewhere on it too. |
Yes, that one and the binary which moved to 0x55 as @Kubuxu mentioned and got stamped in go-ipfs.
Sounds good to me, will add a note to that in the readme. |
All right, this looks ready to merge :)I'll give it a day and merge it Monday (tomorrow) |
3a6f28e
to
76a23f7
Compare
76a23f7
to
0c9d2df
Compare
All right, I gave it 3 days, going for the merge 🎉🎉 |
It was changed from 17 to 273 in multiformats/multicodec#16.
BREAKING CHANGE: The UDP code was changed in the multicodec table The UDP code is now `273` instead of `17`. For the full discussion of this change please see multiformats/multicodec#16. Fixes #17.
The unified multicodecs theory.
tl;dr; This PR updates the multicodecs table to incorporate all the multiformats binary packed tables (multihash and multiaddr) into one.
With the introduction of CIDv1, we needed a way to describe several types of data(multicodec like) in a way that a program could parse(self described), that didn't increase the data size by much, and so, multicodec-packed was born.
One of the requirements for multicodec-packed is that it needed to be as extensible as the normal multicodec, so that applications developers and protocol engineers could add their own multicodec tables (dictionaries) of their own data structures.
Once we added this in, we realized how multihash, multiaddr or even multibase are just users of multicodec-packed with their own custom tables already.
Given this, we found that it is extremely convinient to have all of these tables converge into one, so that we can expand it overtime and avoid code clashes in table extensions.
This PR fixes some spec errors and merges the multiaddr, multihash and multibase tables in a new table with a more clear format.
TODO:
Future work:
Bring in all the mimetypes and give them an packed codeSeee The unified multicodecs theory #16 (comment)