feat: assign codes for MIME types #159
base: master
Conversation
Force-pushed from 2542896 to 669fea4.
The import script is not future-proof. If I understand it correctly, the script parses the XML file and assigns numbers to the codecs based on the order in the XML file. They are sorted alphabetically in the XML file, so if a new MIME type is added and you re-run the script, you would end up with different codes.

Some `<record>`s have a `date` attribute; it looks like it got introduced in 2014. Hence I propose sorting the records by the `date` attribute first (the ones without a date first) and, if they have the same date, alphabetically by `<file>`. This way we hopefully end up with a future-proof, reproducible assignment of codes.
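The proposed date-then-name ordering could be sketched roughly like this (the XML snippet and tag names are illustrative stand-ins for the IANA registry format, not the real file):

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the IANA media-types registry XML.
SAMPLE = """<registry>
  <record><file>application/json</file></record>
  <record date="2014-06-03"><file>application/cbor</file></record>
  <record date="2014-06-03"><file>application/calendar+json</file></record>
</registry>"""

def sort_key(record):
    # Records without a date attribute sort first (empty string);
    # ties are broken alphabetically by the <file> element.
    return (record.get("date", ""), record.findtext("file", default=""))

records = sorted(ET.fromstring(SAMPLE).iter("record"), key=sort_key)
names = [r.findtext("file") for r in records]
```

Re-running this on a registry that gained new entries leaves the relative order of older records unchanged.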
I'd also like to see more explanatory text about why we include MIME types and why multicodecs are not MIME types. If I understand things correctly, in an ideal multicodec-only world all those `+json` MIME types would just be multicodec JSON `0x0200`.
I think that we should factor out consumption of the entire multiformat table in our multiformat clients before we drastically increase the size of the table. The parse time and table size are already noticeable in our bundle size, and this would make that much worse. My older HTTP clients include a full MIME database and it makes them unsuitable for browser bundles. Luckily, we already have a tentative plan to move to using integer references that won't require the full table in code.
Unless there's a bug, it first loads the already-assigned numbers. Then, for all new mime types, it assigns increasing (unique) codes.
Sounds like a good starting point. I'll keep the current "don't change things" logic, but having a stable conversion would be nice.

@mikeal Hm. Fair point. Even compressed, the table is going to grow from 3K to 22K, or 26K to 260K uncompressed. I'm fine leaving this in a PR for now. I submitted it because we kept getting requests to do something like this.
Oh, I missed that when I read the code. Though I'm still in favour of having a stable conversion that can be run at any time and results in the same output.
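The "don't change things" behavior described above (load already-assigned numbers, then hand out fresh codes only to new names) could look roughly like this; `assign_codes` and the starting offset are hypothetical names of mine, not taken from the actual script:

```python
def assign_codes(existing, names, start):
    # existing: name -> code pairs already in the table; never changed.
    codes = dict(existing)
    # Continue numbering after the highest code handed out so far.
    next_code = max(codes.values(), default=start - 1) + 1
    for name in sorted(names):
        if name not in codes:
            codes[name] = next_code
            next_code += 1
    return codes
```

Re-running it with the same inputs yields the same table; only genuinely new names receive fresh codes.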
table.csv (Outdated)

libp2p-peer-record, libp2p, 0x0301, libp2p peer record type
x11, multihash, 0x1100,
sm3-256, multihash, 0x00534d,
blake2b-8, multihash, 0x00b201, Blake2b consists of 64 output lengths that give different hashes
The hashes are prefixed with an additional 0x00. Is that intentional or a bug (the MIME types as well)?
Yes. I'm trying to make it clear how many bytes are in each code. Every two hex digits is a single byte.
Sorry, I don't get it. What do you mean by "how many bytes"? Do you mean how many bytes it takes when encoded as a varint?
Yes.
Shouldn't values over 0x80 then also have a 0x00 prefix?
Hm. Yes.
Fixed.
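If I follow the convention settled on above, the hex padding mirrors the varint length of the code; a rough sketch (function names are mine):

```python
def varint_len(n):
    # Bytes needed to encode n as an unsigned base-128 varint (7 bits per byte).
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def format_code(n):
    # Zero-pad the hex form so each two-digit group corresponds to one varint byte.
    return "0x" + n.to_bytes(varint_len(n), "big").hex()
```

So 0x534d (two raw bytes, but three varint bytes) renders as 0x00534d, and values in the 0x80–0xFF range get a 0x0080-style prefix.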
Force-pushed from 669fea4 to 0e370e5.
I've put making this stable on my todo list. I don't know when I will find the time to do it. When it becomes urgent for this to be merged, please let me know and I'll prioritize it.
A downside of a 4-byte range is that it makes a base32 sha256 CID 64 characters, which doesn't fit in a DNS segment. A 3-byte range would work, and with a single range instead of sub-ranges for each major type it wouldn't reserve too much of that space.
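The length arithmetic can be checked with a quick sketch, assuming CIDv1 with a sha2-256 multihash (1 version byte, the codec varint, a 2-byte multihash header, a 32-byte digest) and counting unpadded base32 characters without the multibase prefix:

```python
import base64

def cid_base32_len(codec_varint_bytes):
    # CIDv1 byte layout: version (1) + codec varint + multihash header (2) + digest (32).
    n = 1 + codec_varint_bytes + 2 + 32
    # Unpadded base32 length; a DNS label tops out at 63 characters.
    return len(base64.b32encode(b"\x00" * n).rstrip(b"="))
```

A 4-byte code can need a 5-byte varint, giving 64 characters; a code that fits in 3 varint bytes stays at 61.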
See GH/multiformats#158. Changes are analogous to those proposed in GH/multiformats#159.
What's the status of this? I'm hoping to use CIDs to refer to image types soon.
I have a suggestion, perhaps instead of one big …

Also I noticed that …

Another question: adding in MIME types overlaps with existing codecs in table.csv. For example, how do we compare …
Looks like this effort has been stalled for a while, mostly due to concerns around the drastic increase in table size? The README of the project describes a first-come, first-served policy for adding new codecs, and I wonder if we could apply that here as well with MIME types. I.e. maybe we can start with a small handful of the most commonly used MIME types on the internet today (say, this list), and then add more over time based on demand, instead of dumping in all known MIME types at once. Is there some particular need for all the MIME types to be in a contiguous block that I'm not aware of?
And add a script to automate this.

fixes #4

Questions:
- mime/ as in feat: adding MIME types as codecs #84.