The unified multicodecs theory #16

Merged
merged 17 commits into from
Nov 16, 2016

daviddias
Member

@daviddias daviddias commented Sep 25, 2016

The unified multicodecs theory.

The theory that unites all the self-described multiformats for a beautifully colored 🚲🏚

tl;dr; This PR updates the multicodecs table to incorporate all the multiformats binary packed tables (multihash and multiaddr) into one.

With the introduction of CIDv1, we needed a way to describe several types of data (multicodec-like) that a program could parse (self-described) and that didn't increase the data size by much. And so multicodec-packed was born.
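
As a rough illustration of the idea above, the sketch below prefixes a payload with a code encoded as an unsigned varint and then recovers it. The helper names (`add_prefix`, `split_prefix`) are hypothetical, the varint layout (7 bits per byte, least significant group first) is assumed to match the unsigned-varint format referenced later in this thread, and 0x70 is the dag-pb code proposed in this PR.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Return (value, number_of_bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def add_prefix(code: int, data: bytes) -> bytes:
    """Hypothetical helper: prepend the packed code to the payload."""
    return encode_varint(code) + data


def split_prefix(buf: bytes) -> Tuple[int, bytes]:
    """Hypothetical helper: read the packed code and return (code, payload)."""
    code, used = decode_varint(buf)
    return code, buf[used:]


DAG_PB = 0x70  # code proposed for dag-pb in this PR
packed = add_prefix(DAG_PB, b"hello")
```

Any code below 0x80 costs a single extra byte, which is the "didn't increase the data size by much" property.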

One of the requirements for multicodec-packed is that it be as extensible as the normal multicodec, so that application developers and protocol engineers can add their own multicodec tables (dictionaries) for their own data structures.

Once we added this, we realized that multihash, multiaddr, and even multibase are already just users of multicodec-packed with their own custom tables.

Given this, we found that it is extremely convenient to have all of these tables converge into one, so that we can expand it over time and avoid code clashes in table extensions.

This PR fixes some spec errors and merges the multiaddr, multihash and multibase tables into a new table with a clearer format.

TODO:

  • Solve the clash issue (udp and sha1 share the same code)
  • Get the table reviewed by @jbenet
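
A minimal sketch of how such a clash can be detected when merging tables. The dictionaries below are hypothetical excerpts that only contain codes mentioned in this thread, and `find_clashes` is an illustrative helper, not part of any multiformats library.

```python
from collections import defaultdict

# Hypothetical excerpts of the per-format tables, using codes from this thread.
MULTIADDR_CODES = {"ip4": 0x04, "ip6": 0x29, "tcp": 0x06, "udp": 0x11}
MULTIHASH_CODES = {"sha1": 0x11}


def find_clashes(*tables):
    """Return {code: [names]} for every code claimed by more than one name."""
    by_code = defaultdict(list)
    for table in tables:
        for name, code in table.items():
            by_code[code].append(name)
    return {code: names for code, names in by_code.items() if len(names) > 1}


clashes = find_clashes(MULTIADDR_CODES, MULTIHASH_CODES)
```

Running a check like this before merging two tables would surface the udp/sha1 conflict at 0x11.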

Future work:

| | /ip4/ | | 0x04 | n/a |
| | /ip6/ | | 0x29 | n/a |
| | /tcp/ | | 0x06 | n/a |
| | /udp/ | | 0x11 | n/a |
Member Author

@daviddias daviddias Sep 26, 2016

clash with sha1

@daviddias daviddias changed the title from "spec/update" to "The unified multicodecs theory" Sep 26, 2016
@daviddias
Member Author

Update

After much discussion (🚲🏚), we've understood that we can make the design and purpose of multicodec, multicodec-packed and multistream clearer if we adjust the names of these protocols. In simple terms, this means:

  • rename: multicodec-packed -> multicodec
  • rename: multicodec -> multistream
  • rename: multistream -> multistream-select

This way:

  • multicodec becomes the protocol that simply defines { codec: code } pairs (e.g. { ip4: 0x04 })
  • multistream becomes the protocol used for describing streams of data in a human-readable way (<varint>/<codec>\n)
  • multistream-select becomes the protocol for negotiating protocols between two endpoints; it uses multistream to describe the protocol the endpoints want to speak.
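
To make the `<varint>/<codec>\n` shape concrete, here is a small sketch of writing and reading such a header. It assumes the varint counts the bytes of the path plus the trailing newline, which the thread does not spell out, and the function names are hypothetical.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Return (value, number_of_bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def encode_header(path: str) -> bytes:
    """Assumed layout: varint(len(path + newline)) + path + newline."""
    body = path.encode("utf-8") + b"\n"
    return encode_varint(len(body)) + body


def decode_header(buf: bytes) -> Tuple[str, bytes]:
    """Return (path, remaining_bytes); raise if the header is malformed."""
    length, used = decode_varint(buf)
    body = buf[used:used + length]
    if len(body) != length or not body.endswith(b"\n"):
        raise ValueError("malformed header")
    return body[:-1].decode("utf-8"), buf[used + length:]


header = encode_header("/multihash/")
```

The human-readable path costs many more bytes than a packed code, which is why the renamed multicodec table exists alongside it.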

@@ -38,67 +38,43 @@ Moreover, this self-description allows straightforward layering of protocols wit
`multicodec` is a _self-describing multiformat_, it wraps other formats with a tiny bit of self-description:

```sh
<multicodec-header><encoded-data>
# or
<varint-len><code>\n<encoded-data>
Member

put back the newlines

Member Author

✔️

| **Multiformats** |
| 0x182f6d756c7469636f6465632f | /multicodec/ | | 0x30 | n/a |
| 0x162f6d756c7469686173682f | /multihash/ | | 0x31 | n/a |
| 0x162f6d756c7469616464722f | /multiaddr/ | | 0x32 | n/a |
Member

add multibase

Member Author

✔️

| **IPLD formats** |
| dag-pb | MerkleDAG protobuf | |
| dag-cbor | MerkleDAG cbor | |
| eth-rlp | Ethereum Block RLP | |
Member

add:

  • git
  • bitcoin
  • stellar

@daviddias
Member Author

@jbenet I'm making js-cid complete (as in, able to handle bs58(multihash), raw multihash, cid, cidStr, and cid construction by parts (version, codec, multihash)), and I found myself in a position of 'luck', because neither CID v0 nor CID v1 clashes with the multihash table.

For now this is ok, because multihash only starts at 0x11, which gives us 14 more CID versions (if we ever need to change it). However, it crossed my mind that the CID version code should be governed as a multicodec, so that we can update the CID table and add a version whose code is not an incremental update to the previous number, without breaking the parsers implemented in the meantime.

IPLD formats
dag-pb, MerkleDAG protobuf, 0x70
dag-cbor, MerkleDAG cbor, 0x71
eth-block, Ethereum Block (RLP), 0x90
Member

I don't think these should be above 0x80.

Member

why is that?

Member

Because that means they take up two bytes when encoded. Anything that is going to be commonly used we would really love to have take only a single byte.

Member

😢

Contributor

I noticed sctp is also above 0x80 (0x84)

Member Author

Transport multicodecs are 'more okay' because they get transferred less often. Format multicodecs, on the other hand, get transferred every time a block is transferred, so a byte actually means a lot.

Member Author

@whyrusleeping is this still a concern? I believe it stopped being one as soon as you merged IPLD in go-ipfs into master, correct?

Member

and, there are simply more than 127 things we "really care about". so 2 bytes is in the long run unavoidable.

codec, description, code

miscelaneous
bin, raw binary, 0x00
Member

I'm not sure i like zero being 'binary'. Zero is a default value in a lot of cases and this makes it difficult to tell between 'unknown/invalid' and 'binary'

Member

Binary could be 0x55, which is 0b01010101 and is also the byte value of the Ethernet frame preamble.

@whyrusleeping
Member

@diasdavid can we sit down today and finish this? It's blocking my work on ipld

@daviddias
Member Author

@whyrusleeping sure, why is it blocking though? The IPLD formats have codes.

@whyrusleeping
Member

I'm taking 0x22 for murmur3

@wanderer
Member

why not reserve the top level table only for the different encoding types? so something like

0x00 an encoding table
0x01 bases encodings
0x02 serialization formats
0x03 multicodec
0x04 multihash
0x05 multiaddr
0x06 multibase
0x07 multihashes
0x08 multiaddrs
0x09 archiving formats
0x10 IPLD formats
....
etc.

0x00 would be special. What follows it would be a custom encoding table that maps strings to their binary representation. This would allow apps to dynamically create new tables.

@wanderer
Member

wanderer commented Oct 13, 2016

own multicodec tables (dictionaries) of their own data structures.

@diasdavid is there a proposed format for the dictionaries?

EDIT: i opened an issue on it here #18

@daviddias
Member Author

daviddias commented Nov 9, 2016

sha1 and udp clash solution

I've added an extra 0 byte to the udp multicodec so that it doesn't collide with sha1 anymore. It is more convenient to change udp than sha1, as changing sha1 would mean changing all the multihashes that use it.

Adding an extra 0 byte is enough to differentiate it while keeping the value 11 in the code.

UPDATE:

@Kubuxu added a very valuable point that I missed entirely here -- #16 (comment) -- so udp moves to 0x0111 instead of 0x0011.

@daviddias
Member Author

Stamp remaining multicodecs

I'm reluctant to stamp the remaining multicodecs only by following rules of thumb. Is there any reasoning we can use to pick the remaining unspecified codes? This is the last item needed in order to merge this PR.

@Kubuxu
Member

Kubuxu commented Nov 9, 2016

update: SOLVED

I think that specifying that 0x0011 and 0x11 are different things is a really bad idea. It means that we can't use normal varint libraries and will need to craft our own. Our spec doesn't specify it, and then it isn't really an integer; it is a variable-length bytestream where the highest bit says "there is data still coming".

I asked a question about this quite a while ago: multiformats/unsigned-varint#5
but it was about whether we should canonize such misformatted varints before storing them in, for example, CBOR notation, because if we don't, the same object has infinitely many binary CIDs.

Summing up, I think we shouldn't abuse varints in this way. We should specify that the number is prefixed with infinitely many leading zeros and that the canonical form is the one that takes the minimum number of bytes to express the number. Then, before crafting IPLD objects, we would canonize such misformatted CIDs, which should be a fast and easy operation.

@jbenet
Member

jbenet commented Nov 12, 2016

@Kubuxu can you follow up in multiformats/unsigned-varint#5, or another issue in that repo, with examples? i'm not understanding what you're getting at. Either way, this spec uses https://github.com/multiformats/unsigned-varint so whatever that does, this will do too. let's resolve it there.

@jbenet
Member

jbenet commented Nov 12, 2016

@diasdavid

Is there any reasoning we can use to pick the remaining unspecified codes?

I don't think there's a nice function for assigning these numbers. unfortunately it's a nasty problem, because there's a bunch of "domain specific" functions that may make sense-- but don't make sense throughout.

A function i can think of that works uniformly well (even if not very well) is "first come first assign". meaning, a queue, assign them as they're requested. this is certainly going to make people unhappy about having to use 2 bytes for a commonly used function, but it's unavoidable anyway given there's definitely more than 127 common things users will care about...

So, no. i can't think of a nice way that is not "domain specific and non-universal" or "rules of thumb" :( --- but there may be one if people want to keep thinking :)

@jbenet
Member

jbenet commented Nov 12, 2016

@diasdavid what is the diff here? what protocols changed value in the table? (i.e. what will break, if anything? I see UDP, is that the only one)

bitcoin-block, Bitcoin Block, 0xb0
bitcoin-tx, Bitcoin Tx, 0xb1
stellar-block, Stellar Block, 0xd0
stellar-tx, Stellar Tx, 0xd1
```
Member

this table should move to a CSV, and the readme should point to it or embed it.
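
A minimal sketch of what consuming the table from a CSV could look like, assuming the `codec, description, code` column layout shown in the diff. The embedded rows are copied from this PR's diff, `load_table` is a hypothetical helper, and the duplicate check is an illustrative addition rather than something the spec requires.

```python
import csv
import io

# Rows copied from the table in this PR; a real table would live in its own file.
TABLE_CSV = """\
codec, description, code
dag-pb, MerkleDAG protobuf, 0x70
dag-cbor, MerkleDAG cbor, 0x71
eth-block, Ethereum Block (RLP), 0x90
bitcoin-block, Bitcoin Block, 0xb0
bitcoin-tx, Bitcoin Tx, 0xb1
stellar-block, Stellar Block, 0xd0
stellar-tx, Stellar Tx, 0xd1
"""


def load_table(text: str) -> dict:
    """Parse CSV text into {codec_name: code}, rejecting duplicate codes."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    table = {}
    seen = {}
    for row in reader:
        name = row["codec"].strip()
        code = int(row["code"].strip(), 16)  # codes are written as hex, e.g. 0x70
        if code in seen:
            raise ValueError(f"code {code:#x} used by {seen[code]} and {name}")
        seen[code] = name
        table[name] = code
    return table


table = load_table(TABLE_CSV)
```

Keeping the table in a machine-readable file lets every implementation generate its constants from one source instead of copying them by hand.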

@jbenet
Member

jbenet commented Nov 12, 2016

curious why "binary has been migrated from 0x00 to 0x55" -- what was the reasoning? (sgtm, jw)

@Kubuxu
Member

Kubuxu commented Nov 12, 2016

@jbenet #16 (comment)

but there was discussion elsewhere on it too.

@daviddias
Member Author

@diasdavid what is the diff here? what protocols changed value in the table? (i.e. what will break, if anything? I see UDP, is that the only one)

Yes, that one and binary, which moved to 0x55 as @Kubuxu mentioned and got stamped in go-ipfs.

"first come first assign".

Sounds good to me, will add a note about that in the readme.

@daviddias
Member Author

All right, this looks ready to merge :)

I'll give it a day and merge it Monday (tomorrow)

@daviddias daviddias force-pushed the spec/update branch 2 times, most recently from 3a6f28e to 76a23f7 Compare November 13, 2016 05:30
@daviddias
Member Author

All right, I gave it 3 days, going for the merge 🎉🎉

@daviddias daviddias merged commit d6e0ec1 into master Nov 16, 2016
@ghost ghost mentioned this pull request Nov 17, 2016
Stebalien added a commit to multiformats/rust-multiaddr that referenced this pull request Nov 28, 2018
It was changed from 17 to 273 in
multiformats/multicodec#16.
Stebalien added a commit to multiformats/js-multiaddr that referenced this pull request Nov 28, 2018
BREAKING CHANGE: The UDP code was changed in the multicodec table

The UDP code is now `273` instead of `17`. For the full discussion of this change
please see multiformats/multicodec#16.

Fixes #17.
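
To make the 17 to 273 change concrete, the sketch below shows how both values look as unsigned varints and why padding 0x11 with a zero byte would not have produced a distinct integer. It assumes the common 7-bits-per-byte, low-group-first layout; `is_minimal` is a hypothetical helper illustrating the "canonical form" idea raised in the thread, not an API of any varint library.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Shortest unsigned varint for n: 7 bits per byte, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Return (value, number_of_bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def is_minimal(buf: bytes) -> bool:
    """True if buf is exactly the shortest encoding of the value it holds."""
    value, used = decode_varint(buf)
    return used == len(buf) and encode_varint(value) == buf


OLD_UDP = 17   # 0x11, the code that clashed with sha1
NEW_UDP = 273  # 0x0111, the udp code after this PR

old_bytes = encode_varint(OLD_UDP)
new_bytes = encode_varint(NEW_UDP)
padded_old = b"\x91\x00"  # 17 written with a redundant extra byte
```

Under this layout the padded form still decodes to 17, so only a genuinely different number such as 273 separates udp from sha1, at the cost of a second byte on the wire.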
6 participants