feat: require deterministic map serialization #235
Conversation
Good, although the bolding is probably not necessary; the doc is sparse enough that everything in here is pretty important.
I would specify the ordering in the Data Model and have the codec store it in whichever way it wants (I agree that it should be fixed per codec). When you interact with Selectors, which operate on the Data Model level, you want to have that one specific ordering; you don't care how it is actually stored.
If a format already has deterministic ordering by other means it’s preferable to use that. If we specify it we could potentially exclude a format from being data model compliant if it doesn’t match what we specify.
This means that traversals (e.g. with Selectors) depend on the codec.
Yes, but that’s already true. If we allow for streaming parsers then the keys will come out in the encoded order, which means it relies on codec write-time determinism. Similarly, when you have an advanced selector over a HAMT or other multi-block data structure the keys MUST be yielded in encoded order because you can’t re-order without reading all the keys from all the blocks into memory. In other words, selector determinism for key ordering is “read them in the order they are encoded” and we are relying on consistent ordering from the data writers.
Yes, that seems reasonable to me. Since the codec varies the hash and CID of an object, determinism at that level, as long as it's properly deterministic within that codec, should be enough. Then we just have to put some wording in selectors about leaning on this. 👍 good reasoning.
To me a strength of the IPLD Data Model is that, as long as your codec implements the full Data Model, it doesn't matter which codec you use: the system will behave the same. You can just switch to another codec. You can build libraries on top of the IPLD Data Model (like Selectors) and things will work. With this change, that's no longer true. We break that assumption (one that I would guess some would expect, despite the spec saying it's not the case) for potential future codecs (we don't know if they will ever happen) that might want to encode things differently.
I would much prefer we specify the Data Model as having "stable order" and leave it at that.
This enables the best of all worlds. When in actual practice we evaluate a selector starting at some Node (or just as well, starting at some CID), and order is defined to be stable, the outcome is stable. (Mission accomplished.)
We will specify many (most?) of our codecs to have strict ordering. This supports our convergence-without-coordination goals and visions at the high level.
And at the same time, this definition makes it possible for us to build a jq-like tool that works with, say, a JSON codec -- n.b. distinct from dag-json (dag-json can rightly have strict key ordering, while a plain json codec should not) -- and would allow handling data and doing transformations without reordering. And I think we sign up for some very nasty usability hurdles if we don't have this; whereas making it possible is quite easy... it just relies on wording this very carefully and making sure we keep the Data Model and the Codecs separate here.
I think we all agree we should be seriously allergic to getting sorting orders specified in the Data Model... I just think we should go even farther in saying that. While this diff does say "the sorting is not defined by the data model", by saying "regardless of insertion order", it kind of leaves a rock and a hard place that I'm not sure how to interpret.
Thanks @warpfork, your comment makes a lot of sense (sorry to others, I only now got the point (hopefully)). So is the idea that we specify a stable order in our Data Model and we define which stable order it is in our Codecs? We can have as many Dag* codecs as we like and there specify that it is one specific order. As long as you use Dag* codecs only, you'll have the same order, no matter which one you use. If what I'm saying is accurate, I'm in full support of this.
I really like the comment from the top of the issue:
Idea: we should have a page ( Then the page for
...etc
Some of my thoughts on this are also shaped by the way I structured code in go-ipld-prime, so I should maybe talk about that a little. In go-ipld-prime, the The first implementation of There's an idea that we could build a
I like this language and will update the PR to reflect it. I’d like to define stable ordering in this doc as (and this is draft language): “consistent key ordering of a map regardless of how that map was created (does not vary by insertion order)”.
Co-Authored-By: Volker Mische <volker.mische@gmail.com>
Just to focus in on one minor quibble @warpfork raised: are we prepared to rule out insertion order as a possible ordering heuristic a codec chooses? The current language seems to do that. Maybe a codec that's focused on encoding streaming data wants to retain insertion order, and it has a way of ensuring that in a round-trip decode/encode too.
How about just: "(key ordering should be consistent and stable, using rules specified by the codec and mechanisms in implementations as appropriate for each language)"? In most cases I think the codec will control the insertion order at the lowest layer. Like now, in (I think) all of our codecs in Go and JS, we hand off blobs for marshalling by the codec, which picks it apart and feeds it through a coding mechanism. It's the "picking apart" step that we're trying to codify. (Aside: I think it's worth noting that "insertion order" is much more of a JavaScript programmer obsession, because of our Maps, than it is for Go programmers who get YOLO ordering.)
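For illustration, here is a minimal Go sketch of what that "picking apart" step could look like, assuming the RFC 7049 canonical CBOR rule (shorter keys sort first, equal-length keys compare bytewise); the `canonicalKeys` helper is hypothetical, not taken from any of the actual codec implementations:

```go
package main

import (
	"fmt"
	"sort"
)

// canonicalKeys returns a map's keys in RFC 7049 canonical order:
// shorter keys first, equal-length keys compared bytewise. This is
// the kind of rule a dag-* codec could apply at encode time, so the
// serialized form is the same no matter how the map was built.
func canonicalKeys(m map[string]interface{}) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) < len(keys[j])
		}
		return keys[i] < keys[j] // Go string comparison is bytewise
	})
	return keys
}

func main() {
	m := map[string]interface{}{"bb": 2, "a": 1, "ab": 3}
	// Go map iteration is randomized; sorting at encode time is what
	// makes the serialized output deterministic.
	fmt.Println(canonicalKeys(m)) // [a ab bb]
}
```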
I approve, but I'm also fine making the change @rvagg proposed in #235 (comment).
There’s a tradeoff here no matter which way we go. Streaming a single map encode, to me, is not a high priority since it’s a difficult-to-use micro-optimization that only works on a single block, which we already want to keep small for other reasons. The problem with insertion order is that there’s no way to fully ensure determinism. Even if you do the work of preserving decode order for maps in languages that don’t preserve it, you still have the problem that two nodes that create the same new map would have to somehow agree on their insertion order in order to guarantee determinism. That’s a lot of heavy lifting, and is something we should try to ensure at the codec layer instead. It’s hard for me to see insertion order ever being the easier thing to implement once you take into account the work across languages and these coordination problems between nodes. I’m still in agreement that we shouldn’t define a specific ordering in the data model, but I’m comfortable ruling out insertion order as “stable.” By definition it isn’t really stable between nodes without some other method of coordinating the insertion order.
OK, I'm fine with that, just wanted to confirm that it had been properly considered. Will let @warpfork dissent if he still wants to. |
I still got a lotta beef. "regardless of insertion order" does not jibe a single whit for me. I don't think we've fully identified even half the relevant user stories here. We have a very fundamental desync on the meaning of "determinism" -- I don't think it can be well defined without defining "input" and "output" better than discussion here has; and I think most of the time when we say "determinism" here, we've meant "convergence", which is a subtle but importantly distinct thing -- which we should get our heads around ASAP because it's super central and gonna keep coming up in this and other discussions. And @mikeal, I think you're throwing out some user stories that I'm not.
So, I'm sorry to be so staunchly unyielding here, but truly: I think we must not reject the idea of simply maintaining a handed-in order in the spec, or things collapse. We can say "codecs should have sorted order" too at the same time and that's fine. But saying the data model must be sorted and may not simply persist a given order is just not going to end well for us. Let's use the word "stable". Stable is sufficient. Stable has all critical good properties, and permits all bonus good properties. Stable.
I've been brewing on a longer exploration report-style writeup about this and the big-picture "determinism"-vs-"convergence" thing and I'll try to wrap that up and push it soon to give a little more zoomed out / principles perspective. But to spoil some snippets:

Some User Stories

- using IPLD for (human-written) config
- reproducible builds
- indexes
- filesystem ingesting
- web page scraping and ingesting
- stream processing and 'jq-like' usage

Extracted patterns from user stories

- using the data model and a strict sorting codec
- using the data model and a non-sorting codec
- using the data model with two differently-sorting codecs
- using the data model alone
- using a schema (without codegen)
- using a schema (with codegen)
- using a schema plus user sourced map content

I identify at least seven different patterns of usage that have different constraints, some of which (as described in previous comment) staunchly reject any solution other than "maintain original order", and some of which provide interesting lenses for considering performance tradeoffs; and in the example user stories, there's a mix of "strictness is essential" and "sorting is irrelevant" and "strict sorting is murderously unusable". I think we should be sure we've factored all these.
Another thing I should note here, in furtherance of my concern that "determinism" can't be well defined without defining "input" and "output" better: I don't consider golang's maps an "input", or at least certainly not a very good one that I'm going to define anything around. They're randomly ordered and there's nothing I can do about that. So, if those are my "input" then there's not much I can do, yeah. So this is why the Similar issues probably exist in other languages but can be solved the same way. When handling anything serialized, that kind of de facto ordering of a token stream always exists, so it's not a particularly unnatural thing to make a
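To illustrate the kind of order-preserving structure being described here, a hypothetical Go sketch (not go-ipld-prime's actual API) of a map node that remembers the order its entries arrived off a token stream and iterates in that same order:

```go
package main

import "fmt"

// orderedMapNode is a hypothetical map node that preserves the order
// in which entries were decoded from a token stream, so iteration
// (and thus re-serialization) reproduces the encoded order exactly.
type orderedMapNode struct {
	keys   []string               // keys in the order they were decoded
	values map[string]interface{} // key -> value lookup
}

func newOrderedMapNode() *orderedMapNode {
	return &orderedMapNode{values: map[string]interface{}{}}
}

// Insert records the key's position the first time it is seen.
func (n *orderedMapNode) Insert(k string, v interface{}) {
	if _, exists := n.values[k]; !exists {
		n.keys = append(n.keys, k)
	}
	n.values[k] = v
}

// Iterate yields entries in decoded order, unlike a bare Go map.
func (n *orderedMapNode) Iterate(fn func(k string, v interface{})) {
	for _, k := range n.keys {
		fn(k, n.values[k])
	}
}

func main() {
	n := newOrderedMapNode()
	// Simulate entries arriving off a token stream in this order:
	n.Insert("z", 1)
	n.Insert("a", 2)
	n.Iterate(func(k string, v interface{}) { fmt.Println(k, v) }) // z 1, then a 2
}
```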
I’ll try to be clearer. As I see it, there’s really only one problem with block sizes, and all the other problems descend from that problem. The problem is that a block is the smallest unit that is verifiably addressed. A transport’s max block size is not some arbitrary limit they place for reasons we can discard when we don’t use that protocol. Those max sizes are how the transport deals with the fundamental problem. Even when you aren’t using that transport, you will still run into situations in which you cannot have an arbitrarily large value and cannot simply accept data for an indefinite period of time without being able to verify it. So yes, we should not design everything to the max block size of a particular transport, but we can assume that all use cases will implement a max block size as they mature; we just can’t say with certainty what that size will be. Given that some max size will be imposed, I’m skeptical that enough use cases will have a sufficiently large max size that their performance bottleneck will be streaming the serialization of keys. In the rare instance that this is your performance bottleneck, I think it’s reasonable for us to say that the way you should handle this is to use a custom non-native map in your code that inserts in stable order at insertion time, so that you can then safely stream the key serialization.
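A minimal Go sketch of that suggestion (the type and its ordering rule are hypothetical; plain bytewise order is used here, where a real codec might use length-first ordering instead): a map wrapper that keeps its keys sorted at insertion time, so key serialization can be streamed without a sort or copy step at encode time.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedInsertMap is a hypothetical map wrapper that keeps its keys in
// the codec's canonical order at insertion time, so serialization can
// stream the entries without sorting at encode time.
type sortedInsertMap struct {
	keys   []string
	values map[string]interface{}
}

func newSortedInsertMap() *sortedInsertMap {
	return &sortedInsertMap{values: map[string]interface{}{}}
}

// Insert places each new key at its sorted position.
func (m *sortedInsertMap) Insert(k string, v interface{}) {
	if _, exists := m.values[k]; !exists {
		i := sort.SearchStrings(m.keys, k) // find insertion point
		m.keys = append(m.keys, "")        // grow by one
		copy(m.keys[i+1:], m.keys[i:])     // shift the tail right
		m.keys[i] = k
	}
	m.values[k] = v
}

func main() {
	m := newSortedInsertMap()
	m.Insert("c", 3)
	m.Insert("a", 1)
	m.Insert("b", 2)
	// Keys are already in canonical order; streaming them is safe.
	fmt.Println(m.keys) // [a b c]
}
```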
I think I clarified this in comments elsewhere but I’ll do it again here. While it is certainly possible for two actors to find some way of aligning their insertion order, it’s difficult and problematic. If we do not specify insertion order as being unstable we will be deferring this problem to IPLD consumers and they will need to find their own consensus mechanism for aligning insertion order. That is what I’m trying to avoid. If you don’t agree and think it’s reasonable to push this up the stack then that is where our disagreement is. I have not been compelled enough by the possibility of use cases for insertion order as a stable ordering mechanism to change my view here.
I think we may be lacking alignment on terminology. We’ve stated a few times now that If you have a bunch of CBOR data encoded with whatever and you want to link to it, you can do so with a

I do not think that the codec should re-order keys it de-serializes. If possible, it should just fail if the ordering is not consistent with the spec. If it does not fail, then the re-ordering would only happen during re-serialization. And before you say “that’s a mutation” let me say “so what?” If the ordering is nondeterministic then nobody else can reproduce that block independently anyway!
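A sketch of what that fail-on-decode behavior might look like in Go, assuming the length-first, then bytewise, canonical ordering rule discussed above (both function names are hypothetical):

```go
package main

import "fmt"

// keyLess reports whether a sorts before b under an assumed
// length-first, then bytewise, canonical ordering rule.
func keyLess(a, b string) bool {
	if len(a) != len(b) {
		return len(a) < len(b)
	}
	return a < b
}

// checkCanonicalOrder is what a strict decoder might do: verify each
// key sorts strictly after the previous one, and fail otherwise
// rather than silently re-ordering.
func checkCanonicalOrder(keys []string) error {
	for i := 1; i < len(keys); i++ {
		if !keyLess(keys[i-1], keys[i]) {
			return fmt.Errorf("map key %q out of canonical order (follows %q)", keys[i], keys[i-1])
		}
	}
	return nil
}

func main() {
	fmt.Println(checkCanonicalOrder([]string{"a", "b", "aa"})) // <nil>
	fmt.Println(checkCanonicalOrder([]string{"b", "a"}))       // error
}
```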
One thing that every user story for IPLD will have in common is that the data is addressed by hash. Any user story that is addressed by hash will have issues if the serialization is nondeterministic, which is why we’re trying to handle these requirements at the serialization layer of IPLD. There may be cases where you don’t care about consistency between languages or even applications written in the same language. But if you don’t care, then doing this work won’t hurt unless the performance penalty is large enough to exclude your use case. The stable ordering that CBOR defines is certainly problematic in this regard, but is it so bad that it would break or otherwise exclude a class of user stories?
I just don’t find this argument compelling given the tradeoffs. Will someone, some day, want to write their own sorting algorithm? Sure. Is that important enough for us to allow for application-specific sorting of our generic codecs and fail to provide these assurances to everyone else? Absolutely not. This falls far enough outside of the generic/default path that it’s perfectly acceptable to me that we say this needs to be an application-specific codec. We’ve got a codec table with plenty of room, they are welcome to it ;)
I have an application that does the following:
Consider three cases:
When I use an iterator to iterate over key-value pairs, which order do you expect? When I serialize, which order do you expect? Are the answers the same in all cases? Are all cases even defined, if "maintain order" is outlawed? If one of these situations is undefined, how is that not a blocking problem?
Consistent ordering requires that you prepare the keys to be serialized, either as you insert them or before you serialize. The ordering for

If you’re taking an arbitrary Map-like object and serializing, this means preparing the keys right before serialization, and this may involve an extra memcopy. If the API for inserting the keys is custom, and it knows the codec you’ll be using, you could keep the internal ordering consistent with the codec’s ordering rules as you insert them. If you have a

This would actually be easier if we defined consistent ordering rules for all the data model codecs and didn’t have any variations, but I don’t know if we are prepared to do that.
I've been trying to grok the difference here for the past few hours and just can't quite grasp it. It's making me feel very small-brained. @warpfork perhaps you could propose an alternate paragraph in place of @mikeal's that expresses what you'd like to see re ordering?
Why must it be defined? Same as with your ipld-prime example above, why do we need to define in the data model how the ordering is presented to the user? This is one point where I think language differences are getting in our way - Go's yolo map ordering, for one.

If we changed the paragraph to say "Data model compliant codecs must ..." would that change anything? We're really only talking about dag-cbor and dag-json here; what happens with json and cbor is a different matter, and they are not data model compliant in themselves so wouldn't fall under this "must". But we could clarify that a little more.

We also need to work more on this convergence vs determinism business and clarify core principles that are guiding decisions like this, so that when we make them, we can reference those principles that we already agree on. This might be a case for a team week sometime soon.
Here's a fresh proposal for text I'd be happy with:
You'll notice an important effort I'm making here in this larger text is to identify what libraries are actually capable of doing in various situations. (We could expand on this further with recommendations and remarks on efficiency and performance and how different implementation choices can relocate those costs to different parts of a pipeline of deserialization, operations, and serialization. Probably we'd want to do that in a larger document in another file, though, as the file we're currently in is supposed to be just a terse introduction to Kinds in the Data Model, not full advice for library implementors nor performance tips for users.)

I also think we should introduce as much as possible of this topic in documentation in codecs, as previously commented. Yes, even if that means introducing new (and unfortunately fairly skeletal) files. Accordingly, there are "TODO" links in my proposed text. If anything, I think my example here doesn't go far enough in moving the codec topics somewhere else -- it could be improved by further separating those. We really need to get the distinction between Data Model concerns and Codec concerns right.

To refresh ourselves on an important detail here, in case it's become glazed over or lost from focus, the path currently targeted by this diff is

There's more that could be done here. There's minimal talk about streaming in the text I proposed, and there's minimal talk about how strictness should result in error handling, and there's minimal talk about support for deserializing-loosely-and-then-sorting... all of which absolutely also deserve discussion! But... all of which belong in text in the codec area.
Interestingly, in a surprising bit of coincidental synchronicity, the Python community is also making a change to maps and ordering right around now: they're going full order-preservation. There's a plethora of comments on Hacker News about this (as well as elsewhere, I'm sure), but to cherry-pick two which resonate pretty strongly with me, as they contain language-independent user stories about the upsides of this:
Just food for thought. (I couldn't not link to it given the coincidental timing.)
Closing for now, need to work on a better document about all of our determinism requirements. |
In order to be “data model compliant” we should require deterministic maps.
This means that all the dag-{format} codecs must add deterministic key sorting during serialization. We should add the details for each codec in those codec specs, but there should be a general requirement that some form of determinism is defined.