RFC: Future-proofed cryptographic hash values. #1
What are the scenarios this will get used in? There is a widely used hash function X. Later, people develop better hash functions, but there will still be plenty of people using X. Later still, weaknesses are discovered in X. That is far, far fewer hash functions than if you just always used the best hash function around. You probably shouldn't just update your app to the current flavor of the month... this would really be something you might do once in a decade or two. If you have hash trees, which I'm guessing you would, then you will have to rebuild all your data anyway. Maybe all you need is to pass the current hash function in the handshake or config file, and have tests where you inject a different hash function, or allow building with a new hash function.
Re. the UTF-8 question, I think it is so that "continuation" bytes are unambiguous vs. "starting" bytes. This means that, e.g., if a stream is chopped off in the middle of a multibyte glyph, the remote will know that it does not have a complete glyph at the start of the stream.
Anyway, @jbenet, when I was thinking about the wire protocol for my "new OSI model", I came to roughly the same conclusion: we need a prefix field to specify ciphersuite/protocol version. The way I figured it, the cost of a varint is not worth it, since if push comes to shove, a "protocol version" could be standardized that has an additional ciphersuite field with a bigger size. However, I decided on two bytes rather than one (the same size as the TLS ciphersuite field, and with IANA's "port numbers" in mind).

BTW, a bonus of a ciphersuite prefix field is that an appropriate lookup table can be used to determine the length of the hash, so other content can follow it on the wire with no delimiters.

I would like to put in a word for making it a true ciphersuite field, rather than a "hash function suite" field. A ciphersuite specifies a hash function anyway; why invent an entirely new, incompatible numbering scheme for a smaller concept? If you want to use shiny new hashes like BLAKE2b that aren't part of a TLS ciphersuite yet, just use an existing unassigned block like {0xdc, *} and define your own set of ciphersuites there, copying some existing ciphersuite but modifying the hash function. (I recommend copying the ECDHE-ECDSA-ChaCha20-Poly1305 ciphersuite if it's all the same to you.)
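A minimal sketch of the delimiter-free parsing described above, assuming a made-up registry in which a fixed two-byte suite code maps to a digest length. The codes, package name, and helper are illustrative, not real TLS assignments:

```go
package wiresketch

import (
	"encoding/binary"
	"fmt"
)

// digestLen maps a two-byte suite code to the digest length it implies.
// The codes below are invented for illustration ("unassigned block" style
// numbering), not actual TLS ciphersuite values.
var digestLen = map[uint16]int{
	0xdc01: 32, // e.g. a suite whose hash is SHA-256
	0xdc02: 64, // e.g. a suite whose hash is BLAKE2b-512
}

// splitHash reads the suite code, looks up how long the digest must be, and
// returns the digest plus whatever content follows it, with no delimiter needed.
func splitHash(wire []byte) (suite uint16, digest, rest []byte, err error) {
	if len(wire) < 2 {
		return 0, nil, nil, fmt.Errorf("input too short for suite code")
	}
	suite = binary.BigEndian.Uint16(wire)
	n, ok := digestLen[suite]
	if !ok {
		return 0, nil, nil, fmt.Errorf("unknown suite 0x%04x", suite)
	}
	if len(wire) < 2+n {
		return 0, nil, nil, fmt.Errorf("truncated digest for suite 0x%04x", suite)
	}
	return suite, wire[2 : 2+n], wire[2+n:], nil
}
```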
Yeah definitely.
Maybe, not always. For example, you might be switching from sha256 to blake2b for the speed improvement, but you don't necessarily need to change all the stored data. (And even if you do, in a live p2p system you'll be upgrading over time.) So as I see it, hash functions will need to coexist in systems aiming to keep data in use for ~10 years. So git + bittorrent definitely qualify :)
Ahh, makes sense. But seems brittle. Reliability (and message/item atomicity) will still have to be the responsibility of the stream's client. So this probably doesn't need it.
Yeah, figured you ran into this too :)
Seems like UTF-16 where the varint cost is enormous. :/
Oh yeah! That's great.
Yeah, sgtm.
It's kind of silly that the combinations need to be enumerated. Does it save space in the end, due to the sparsity of valid combinations? Or is it just good old standard IANA protocol? For many use cases, though, one only needs a hash function (say, hashes in git). It would be nice to just use standardized values for each of the components, and allocate a byte per component wanted in the application.
Hey look! It's a really slow package manager. Guess the size of the id namespace calls for a really slow publish process. :) Aaaaaand, here's what we're looking for:
Boom. Maybe just use those.
Btw, @davidad's new OSI (an awesome proposal) is here: http://davidad.github.io/blog/2014/04/24/an-osi-layer-model-for-the-21st-century/
@jbenet Re. UTF-8 vs UTF-16, saving a byte is one thing, but the CPU cost of processing or filtering packets is another, and I think the UTF-8 style varint imposes a heavy penalty on the latter that is not really justified, since there's only going to be one of these objects per packet. (UTF-8 makes sense because text files are entirely full of packed glyphs, among other reasons like ASCII compatibility.) And they really ought to be able to fit in two bytes for pretty much ever (but unlike Y2K, if that assumption turns out to be wrong, it is at least possible to back out of it).
Yeah, that's fair.
I plan to use this per-hash. So, say, an ipfs tree object would have many.
Yep. Unless there's a proliferation of hash functions made possible by the emergence of a more open + interoperable way of specifying hash functions. (I can imagine variants/wrappers of others.)
Very neat!
@msparks said
You mean the way
(Note: I'm not saying anything novel; just mentioning prior work.)
@msparks yep. You can see https://github.com/jbenet/go-multihash/blob/master/multihash.go -- this makes it en/decodable between binary <--> string. (prepending
You can see this technique being used in https://github.com/jbenet/node-ipfs
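For readers following along, here is a rough sketch of the prepend-a-function-code idea being discussed. The code table, package, and helper names are made up for illustration; they are not the actual go-multihash API:

```go
package mhsketch

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// codeSHA256 is an assumed one-byte function code for sha256.
const codeSHA256 = 0x12

// encode hashes data and prepends <function code><digest length> to the
// digest, making the value self-describing in binary form.
func encode(data []byte) []byte {
	d := sha256.Sum256(data)
	return append([]byte{codeSHA256, byte(len(d))}, d[:]...)
}

// toString hex-encodes the whole value, so the same self-describing hash can
// also live in text contexts (paths, config files, JSON).
func toString(mh []byte) string { return hex.EncodeToString(mh) }

// decode splits a binary value back into function code and digest.
func decode(mh []byte) (code byte, digest []byte, err error) {
	if len(mh) < 2 || int(mh[1]) != len(mh)-2 {
		return 0, nil, fmt.Errorf("malformed self-describing hash")
	}
	return mh[0], mh[2:], nil
}
```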
So what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name. If you start using multiple hashes it gets weird, because what happens when the same object is referred to by two hashes? But this morning I had an idea: what if you just tag the object being hashed with the type of hash that should be used? Now, this idea is not without problems, but it would mean there is a canonical referent for any object, so you could easily mix sha256 with BLAKE2, and follow refs from one to the other. Something like this:

```js
// automatically create a sha256 hash
var hash = autohash({
  message: "hello", hash: 'sha256'
})

// automatically create a blake2 hash
var hash = autohash({
  message: "hello", hash: 'blake2'
})
```

Now, instances could make a local choice about when they wanted to upgrade their hash. If you need to refer to an object with a hash that you now consider insecure, you can refer to it by its old hash, and then give a second check hash with a secure algorithm. So, old data could be incrementally secured when it's referenced by new data!
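To make that last part concrete, here is one hedged sketch (in Go, with made-up type and field names) of what "refer to it by its old hash, plus a check hash in a secure algorithm" could look like:

```go
package refsketch

import (
	"crypto/sha1"
	"crypto/sha256"
)

// ref points at an object by the (possibly weakened) hash it was originally
// published under, and re-asserts its content with a newer check hash.
type ref struct {
	legacyAlgo string // e.g. "sha1", the hash the old data is addressed by
	legacy     []byte
	checkAlgo  string // e.g. "sha256", an algorithm still considered secure
	check      []byte
}

// makeRef builds such a reference for an object we already address by sha1.
func makeRef(obj []byte) ref {
	old := sha1.Sum(obj)      // [20]byte
	chk := sha256.Sum256(obj) // [32]byte
	return ref{"sha1", old[:], "sha256", chk[:]}
}
```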
This isn't true in practice. It's possible to get everything to use the same name, but in practice you end up chunking large files anyway (can't sha256 the whole thing), so now you've got an infinite number of ways you could hash the same file with the same hash function.
So putting the multihash function tag into the data too? Would have to also describe how to hash it (whole thing, how to get block boundaries, etc, which just turns into a dag anyway). I like this idea.
Yeah I like this.
I'm curious who does this? As a counterexample, you can download the Geocities archive (640GB+) as one torrent, with one info hash (2DC18F47AFEE0307E138DAB3015EE7E5154766F6).
And what does that infohash mean, @feross? What is it a hash of? (spoiler alert: chunks)
I still think that @dominictarr's original point is valid though:
At least 2DC18F47AFEE0307E138DAB3015EE7E5154766F6 uniquely identifies the tree of folders and files that comprise this particular user's dump of Geocities.
This could be standardized. Remove silly fields like "title" from the info dictionary (so it's not part of the content that gets hashed). Standardize the piece length selection algorithm (webtorrent uses piece-length btw). And then you're pretty dang close to the promise of one file/folder = one hash.
Precisely the point I made :). When you're hashing a massive file, you end up with a hash that uniquely identifies the last layer of the particular chunking/blocking algorithms you chose, not the final, underlying file itself.
How is this different from saying, "in this application, we will hash with hash function X, and chunk files using deterministic function Y"? The IPFS merkledag, and the rabin-fingerprinting file chunking, are examples of such standards. @dominictarr's point is this:
My point is this: standards are the only way to address this problem, and the "hash functions go stale" problem.
I don't think that this precludes some sort of deterministic tree hash. That said, there is quite a bit more design space in the field of replicating large files.
So, if a hash points indirectly to some large object that must be replicated over some special protocol,
In the case of bittorrent, the infofile is small but defines what to get next. With a merkle or rabin tree that may not be the case... Basically, treating this like a type of hash? Of course, this is a hash with parameters; you could use different types of hashes within the tree, etc., so maybe you need more than just one number anyway.
Or maybe you could always point to a header file which contains metadata such as what parameters the large file uses... Of course this assumes that there is a threshold where anything smaller is just a straightforward hash.
These all sound like special cases of the more general "just use a merkle dag". All of these can be implemented on top; it's up to the users to decide how to index, but all the indices work everywhere.
That would be correct, but you can eliminate round trips by having more than one replication protocol. Maybe you have a structure that has some well-defined properties, say a linked list. If you really want to make this InterPlanetary, then round trips become a huge problem. The light-speed round trip to the moon is 2.6 seconds! To Mars, Venus, the Asteroids? I'll leave that as an exercise for the reader... There are even places on Earth where the round trip time is a source of frustration when people misdesign protocols or applications to unnecessarily round-trip. You should come on a holiday to visit me in New Zealand and experience it first hand!
Oh hang on, did I just advocate cypherspace URLs now?
Yeah, absolutely. I'd use a routing system that's aware of this (i.e. make requests across high-latency links rarely). (Technically coral is already latency-cluster aware, but planetary scale is a completely different beast.)
This is what (the first path component is a protocol. One of my goals in life is to move us away from the "scheme identifier". It breaks links + mounting in these links.)
Okay so how do you plan to "make requests rarely"? I thought you just had a want-list? Having obsessively studied data-replication for a few years now, I just don't believe there is a golden hammer that solves the problem efficiently in general.
No, I mean the routing system should be aware of this. One extreme solution is to have 2 planetary DHTs (if you're on Mars you don't use Earth's) and use something else, latency-aware, to route requests + fetch things back across interplanetary links. A better solution bakes all of this into itself with latency measurement + ways to express variable blocks (i.e. as you've suggested, retrieving path globs 👍).
So if you are gonna have path globs, why not have multiple protocols?
The replication protocol in scuttlebutt is certainly "a way to express variable blocks",
Yes, encoding the number of continuation bytes in unary (base-1), followed by a zero bit, followed by enough data bits to finish the byte, followed by the continuation bytes (in network (big endian) byte order) makes a lot of sense. (In other words, like UTF-8, but without making all of the continuation bytes 10xxxxxx.) Inside Google, these are known as prefix varints. This (1) keeps values in lexicographical order (ignoring the zigzag encoding that Google uses to move the sign bit to the least significant bit) and (2) is faster to decode and encode than the base-128 varint scheme used by Protocol Buffers and LLVM bitcode. I think if Google had it to do over, they'd use prefix varints in Protocol Buffers.
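A sketch of that prefix-varint layout as described above, capped at 32-bit values to keep it short. This is illustrative only, not Google's actual implementation:

```go
package prefixvarint

import (
	"fmt"
	"math/bits"
)

// Encode writes v as: the continuation-byte count in unary, a zero bit, the
// high bits of v in the rest of the first byte, then the remaining bytes in
// big-endian order. A uint32 needs at most 4 continuation bytes this way.
func Encode(v uint32) []byte {
	n := 0 // number of continuation bytes
	for uint64(v) >= 1<<(7*uint(n)+7) {
		n++
	}
	var prefix byte
	if n > 0 {
		prefix = ^byte(0) << uint(8-n) // n leading one bits, then a zero bit
	}
	out := make([]byte, n+1)
	out[0] = prefix | byte(v>>(8*uint(n)))
	for i := 1; i <= n; i++ {
		out[i] = byte(v >> (8 * uint(n-i)))
	}
	return out
}

// Decode reverses Encode, returning the value and how many bytes it consumed.
func Decode(buf []byte) (uint32, int, error) {
	if len(buf) == 0 {
		return 0, 0, fmt.Errorf("empty input")
	}
	n := bits.LeadingZeros8(^buf[0]) // leading one bits = continuation bytes
	if n > 4 || len(buf) < n+1 {
		return 0, 0, fmt.Errorf("truncated or oversized varint")
	}
	v := uint64(buf[0] & (byte(0x7F) >> uint(n))) // data bits of the first byte
	for i := 1; i <= n; i++ {
		v = v<<8 | uint64(buf[i])
	}
	return uint32(v), n + 1, nil
}
```

For example, Encode(300) yields {0x81, 0x2C}: one leading 1 bit signals one continuation byte, and the remaining bits carry the value in big-endian order.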
yeah +1
very much +1
I wish the various authors published lessons learned / recommendations for the future. Btw, Jeff Dean had a nice discussion on very large varints here: http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
Problem
As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger, ... one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place.

It's already common to see hashes prefixed with a function id. Is this the best way? Maybe it is. But there are some problems: the size of a prefix like blake2b- may matter. So we might want to use a much narrower prefix. Particularly given that "widely used and accepted secure cryptographic hash functions" tend to change very little over time (by 2014 there are fewer than 256 that you might seriously consider).

Is there an RFC for this? I haven't found a "Hash Function Suite" like the "Cipher Suite" in TLS (RFC 5246/A.5).
Potential solutions
Use a short prefix mapping to some "cryptographic hash function" suite. This already has to be done: the sha1- prefix is more human readable, but it's probably not a good idea to blindly dispatch a function based on the string sha1. Whitelisting specific strings (a blessed table) already happens.

So what would this look like? For example, suppose sha1 is 0x01.

Pros:

Cons:

0x01 is sha1, 0x02 is sha256, etc.

on varints
Ideally, for proper future proofing, we want a varint. Though it is to be noted that varints are annoying to parse + slower than fixed-width ints. There are so few "widely used... hash functions" that it may be okay to get away with one byte. Luckily, we can wait until we reach 127 functions before we have to decide which one :)
We may be able to repurpose UTF-8 implementations for this.
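A tiny hedged sketch of the "one byte is enough for now" point above: treat codes 0x00-0x7F as complete single-byte function IDs and reserve the high bit, so the field can later grow into a varint without breaking today's readers (the function and error text are illustrative, not an existing API):

```go
package hashcodesketch

import "fmt"

// readFunctionCode reads a one-byte hash function code. Codes with the high
// bit set are reserved for a future multi-byte (varint) extension, which this
// reader predates and therefore rejects.
func readFunctionCode(buf []byte) (code byte, rest []byte, err error) {
	if len(buf) == 0 {
		return 0, nil, fmt.Errorf("empty input")
	}
	if buf[0]&0x80 != 0 {
		return 0, nil, fmt.Errorf("extended function code 0x%02x not supported", buf[0])
	}
	return buf[0], buf[1:], nil
}
```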
Random UTF-8 question: why are the subsequent bytes wasting two bits each? (The '10' prefix; see http://en.wikipedia.org/wiki/UTF-8#Description.)
Is it to keep the code point ranges nice and rounded-ish?