Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

dag-pb: more accurate schema and up-to-date impl description #297

Merged
merged 1 commit into from
Oct 1, 2020

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Sep 15, 2020

Round 1,490 of DAG-PB specification cleanup..

I'm working on getting this just right in https://github.com/ipld/js-dag-pb/ and am taking the opportunity to be very strict about Data Model representation. We have the challenge of wanting to ensure that we have previously accepted some flexibility in the way the components of a DAG-PB were constructed (friendly API) but now we want to be much more strict - you must give us an object of exactly the right shape as defined by the Schema such that if we round-trip it it'll come out the same way (while also offering utilities on the side to regain some of the friendly API functionality).

Thanks to Hannah's efforts in https://github.com/ipld/go-ipld-prime-proto we have a good chance at full hamonisation of the two implementations! go-ipld-prime-proto has a much more strict interpretation of the Data Model form than the previous Go implementation (which offers the shortcut named pathing form). So I want to try and get that impl and the new JS impl aligned and the Schema here reflecting it correctly. I don't think we quite got it right the last time around, which was hampered by having to conform to the varying legacy implementations and hand-waving about Data Model form. I think we can be much clearer now that we have a clearer concept of the Data Model in the new implementations.

I'm looking mainly at https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go and its use of https://github.com/ipfs/go-merkledag/tree/master/pb to pull in properties to inform some of the Schema changes, I'd appreciate double-checking on that please. I plan to conform JS to this.

Some additional notes inline.

block-layer/codecs/dag-pb.md Outdated Show resolved Hide resolved
block-layer/codecs/dag-pb.md Outdated Show resolved Hide resolved
block-layer/codecs/dag-pb.md Outdated Show resolved Hide resolved
@rvagg
Copy link
Member Author

rvagg commented Sep 15, 2020

Also pushed in an update that adds the sorting order to Links which we've neglected to document before. I don't know if go-ipld-prime-proto does this when encoding, maybe it defers to go-merkledag for it but if it does there might be a non-roundtrippyness problem there where you construct a DAG-PB object in memory, traverse through it assuming it's "correct" but it comes out in a different form when it does a round-trip (same problem in general with strictness in object shapes I'm trying to correct for in JS).

@rvagg
Copy link
Member Author

rvagg commented Sep 17, 2020

OK, did some investigation with Go to see what it's doing at the lowest level and that resulted in ipfs/go-merkledag#58 and some winding back of most of the proposals in here. The only real one that changes the schema now is:

Hash optional Link

Plus I've added some notes below with further constraints that can't be expressed with the schema. Particularly that Links: [] is invalid and an empty Links array must be null. There's no difference at the byte level. But it's possible to have zero-length byte arrays in both Data and Name, as well as null for both. I've added additional constraints around nulls and zero-length byte arrays and the validity of the Hash -> CID conversion.

I was going to use go-ipld-prime-proto as a reference but noticed this in the README:

It does not include robust facilities to support actual creation of DAG protobuf nodes

It really is minimal and defers mostly to go-merkledag and has some very rough rules about what it's doing with the data. So what we define here should be the rules used when go-ipld-prime-proto is properly fleshed out so it's worth getting this right.

I've also added a section about a zero-length block being valid. Both current Go and JS implementations will decode this into { Data: null, Links: null }, so it's important for implementers to be aware this is a case they may need to test for.

@rvagg
Copy link
Member Author

rvagg commented Sep 21, 2020

ok! I've had my head in this all week and I'm much more comfortable with where this is at now. I've reversed a bit from the previous comment, the schema uses optional for these fields and not a kinded union that includes Null so { Data: null, Links: null } is not valid according to the schema unless we go with a kinded union. In Go there's nil for these fields, but go-ipld-prime has Null as a separate thing, and IsAbsent() can tell you whether it's there or not, it just doesn't have to be part of a graph. In JavaScript we just don't define it (or undefined). So:

  • The only major change to the schema as making the Hash optional. I don't see a way around this, an empty PBLink is possible in the binary format, who knows how many of those are in the wild. { Links: [{}] } is valid.
  • I've added a bunch of constraints clarifying things:
    • one notable one is that the Hash bytes must be a valid CID. I think that's reasonable, I believe all implementations are treating them as such so even if there are blocks in the wild with bad Hash values, they should be rejected already.
    • empty Links array is indistinguishable from an omitted one in the binary format, that's an important clarification (and this is different to an empty Data byte array which is distinguishable from an omitted one).

Implementation stuff:

  • Go "forms" tests is here: chore: add tests to verify allowable data layouts ipfs/go-merkledag#58 - it allows a few things that we should disallow in go-ipld-prime-proto - the zero-length Links case (indistinguishable from no Links) and the non-CID Hash.
  • New JS implementation is now fully compatible with this spec, it's strict on the way in, and mostly strict on the way out (won't check sorting of Links array is the only bit of non-strictness). It also has a prepare() method that lets you re-shape a sloppy input object into a precise Data Model form, and encode() is super-strict so you won't end up having an in-memory representation that's different from its round-trip version. I also have a mirror of the "forms" tests to validate these edge cases and their binary representations. https://github.com/ipld/js-dag-pb

@rvagg
Copy link
Member Author

rvagg commented Sep 22, 2020

OK, after sleeping on it and then talking it through this morning in the meeting and with Mikeal I've decided to even wind back the optional for Hash. So the only real change to the schema text is that it must be a Link, that is, it must be interpreted as a CID and anything else is invalid. I've added this to my notes, that a block should be rejected. Now a [{}] is not a valid Links array.

I've also added that a block with "unrecognized" data should also be rejected. This is a should because most PB decoders will silently accept extraneous data. In go-merkledag there's a XXX_unrecognized property on both nodes to capture it, but any data in there should signal a bad block.

I'm about to go enforce this on the new JS DAG-PB library. Maybe we'll find some source of data out there that has these bad blocks, but until we see them in the wild we can afford to be strict in our newer implementations.

Name optional String
Tsize optional Int
}

type PBNode struct {
Links optional [PBLink]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood correctly that the protobuf wire format doesn't disambiguate between a lack of a list here vs a zero length list... maybe it would be best to describe it in our schemas as Links [PBLink] without the optional modifer on the field?

If we have the optional modifier on the field, that means it adds Absent to the cardinality (it +1's the cardinality), but we really have no way of taking away the empty list as a valid value (if we could, that'd be -1 on the cardinality, and balance things -- but we don't have such an ability).

Just having the codec handle the translation from {empty list}<->{absense of encoded bytes} seems correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so I've been wanting to lean this way but the thing that's stopped me is that it creates a lot of situations where you're forcing the use (and allocation) of an empty list where you don't want anything. For decoding a dag-pb block with just data in it you're going to need to throw in an extra array just to satisfy this case. When creating an object manually in our APIs you're going to have to allocate an array. In JavaScript this means that I have to force users to give me { Data: [...], Links: [] } when they're just wanting to ingest terminal data blocks. This isn't a huge inconvenience as I've created this prepare() method to help get the forms right before instantiating, but it adds a level of hostility that I'm trying to avoid. I think in ipld-prime you'd have data structures that have an empty array for Links when you create a new PBNode anyway so I suppose you get around this hostility fairly easily.

But.. I take your point on cardinality. And the reason why I've been tempted in this direction is that there's something to be said on the other end of this API where you always get an array in that position and don't need to check for its existence. If I can write a traversal that safely assumes Links is always there then it's much nicer than the incessant checking.

I'm going to ponder this today and maybe try out some code.

@rvagg
Copy link
Member Author

rvagg commented Sep 26, 2020

OK, I'm going with @warpfork's argument about Links cardinality. An omitted one in the binary form will be interpreted as an empty array. In the Data Model the array must always be present.

I've done another pass of the doc and cleaned up the clarifications to the schema. I've also added notes in the Protobuf schema about our recommendations for strict decoding.

https://github.com/ipld/js-dag-pb now matches this. I'm going to work on a Go form too I think now that I have a good test suite but it'll just be a simple replacement to go-merkledag/pb, I don't think I can manage the ipld-prime piece of it and the question of whether to integrate it in ipld-prime-proto is still up for discussion. So I'll just go with the low-level pieces for now.

@rvagg rvagg merged commit 9ebaa17 into master Oct 1, 2020
@rvagg rvagg deleted the rvagg/dag-pb-schema branch October 1, 2020 03:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants