dag-json: cleanup and fully specify reserved space #356

rvagg · 2021-01-25T10:31:52Z

I've been working on figuring out how to codify the rules that we discussed during a meeting some time ago. I can't find the particular meeting by looking at the minutes in https://github.com/ipld/team-mgmt so I'm doing this from memory. But the main points were:

1. Strictly only allow certain forms in the `{"/":...}` namespace so that tokenizing parsers don't have to buffer and unroll too much data and can quickly reject invalid forms without getting too deep.
2. ~~Remove the multibase prefix from the encoded bytes.~~

After working through some of this, I've flipped on my opinion of the second point and would rather it stay as proper multibase, even if it's still strictly base64. It adds clarity and there's scope for using this for further validation if we want to. It's a single byte and I think it increases the clarity and reduces the number of edge cases. So I haven't included that change here, but discussion is welcome if you feel strongly!

The bit to pay most attention to below is "The Reserved Namespace" where I've tried to enumerate the rules. It turns out to be quite hard to explain it with crystal clarity, but that could be me so suggestions welcome!

Along with this, as an exploration, I've been working on a version of JS DAG-JSON that does tokenized decoding and uses the same base encoder/decoder as the new JS DAG-CBOR codec, just by swapping out the backend. There are some consistency benefits here in having the same code paths and the same affordances for applying data model strictness. I'm still uncertain whether I'll actually propose that we merge this, although I did uncover some bugs in the current implementation in the process. The code is here: https://github.com/ipld/js-dag-json/compare/rvagg/cborg (see the innards of DagJsonTokenizer for the token parsing rules that arise from the reserved space rules).

block-layer/codecs/dag-json.md

vmx · 2021-01-25T11:16:33Z

block-layer/codecs/dag-json.md

+ * Maps with more than one key, where the first key is `"/"` and its value is a string. e.g. `{"/":"foo","bar":"baz"}`.
+   - Where a key exists that sorts before `"/"`, the map is valid, e.g. `{"0bar":"baz","/":"foo"}`.
+   - Where the value of the `"/"` entry is not a string, the map is valid, e.g. `{"/":true,"bar":"baz"}`


It took me a while to grasp that the main item is the rule for rejecting something and that the sub-items are cases that would not be rejected.

Perhaps it could be expanded, and we could even give the rules names. Example:

Multiple keys

Rejection criteria: Maps with more than one key, where the first key is `"/"` and its value is a string.

Examples of invalid DAG-JSON:

- `{"/":"foo","bar":"baz"}`
- `{"/":"foo","bar":true}`

Examples of valid DAG-JSON:

- `{"0bar":"baz","/":"foo"}`: There is a key that sorts before `"/"`. The value following the `"/"` will not be interpreted as a link.
- `{"/":true,"bar":"baz"}`: The value after the `"/"` is not a string.
- `{"/":"foo"}`: There is only the single map key `"/"`. The value is expected to be a valid base encoded CID.
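
As a rough illustration of the "Multiple keys" rule, here is a minimal JavaScript sketch. The function name `violatesMultipleKeysRule` is hypothetical and not part of any DAG-JSON library. It assumes that "first key" means the first key in serialized order, and it leans on `JSON.parse`, which only preserves that order for keys that are not integer-like (a key such as `"1"` would be moved to the front). A real decoder would apply the rule at the token level instead.

```javascript
// Hypothetical helper, not taken from js-dag-json.
// Returns true when a map has more than one key, its first key is "/",
// and the value under "/" is a string.
// Assumption: JSON.parse keeps serialized key order, which holds only
// for keys that are not integer-like (e.g. "1" would be reordered).
function violatesMultipleKeysRule(json) {
  const value = JSON.parse(json);
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return false; // the rule only concerns maps
  }
  const keys = Object.keys(value);
  return keys.length > 1 && keys[0] === '/' && typeof value['/'] === 'string';
}

console.log(violatesMultipleKeysRule('{"/":"foo","bar":"baz"}'));  // true
console.log(violatesMultipleKeysRule('{"0bar":"baz","/":"foo"}')); // false
console.log(violatesMultipleKeysRule('{"/":true,"bar":"baz"}'));   // false
console.log(violatesMultipleKeysRule('{"/":"foo"}'));              // false
```

The single-key case returns false here only because this particular rule does not apply to it; whether `"foo"` is a valid CID is a separate check that this sketch does not attempt.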

rvagg · 2021-05-08

I think I've addressed this, but it's worth a review.

warpfork · 2021-01-25T17:51:38Z

I went digging for our discussion notes (and merged some stale PRs) and found them here:

https://github.com/ipld/team-mgmt/blob/master/meeting-notes/2020/2020-11-05--ipld-specs.md#dag-json-bytes-and-links-discuss-parsing-rules-edge-cases

rvagg · 2021-05-08T08:21:13Z

I've addressed the reviews so far and I think this is ready for landing. The reserved namespace rules are hard to be clear about but I've done the shuffle recommended by @vmx and separated things out a little more to hopefully make it easier to digest.

I went with the `.0` rule for floats with no fractional component. This is no help to JavaScript, but Go can implement it.

I also have a working JavaScript version that I'm pretty much ready to release. It uses the same backend as the DAG-CBOR codec but with JSON semantics. We've lost some speed in the process but we get all the reserved namespace rules and all the strictness.

@warpfork care to review the changes?

vmx

The rewrite of the "Parse rejection modes" reads great now!

vmx · 2021-05-10T12:32:20Z

block-layer/codecs/dag-json.md

-JavaScript itself that has trouble with them. This means JS implementations
-of DAG-JSON can't use the native JSON parser and serializer if integers bigger
-than `2^53 - 1` should be supported.
+Data Model floats that do not have a fractional component should be encoded **with** a decimal point, and will therefore be distinguishable from an integer during round-trip. (Note that since JavaScript still cannot distinguish a float from an integer where the number has no fractional component, this rule will not impact JavaScript encoding or decoding).


I'm not sure if the note about JavaScript is needed as it does exactly that when encoding/decoding JSON:

    > JSON.stringify(2.1)
    '2.1'
    > JSON.stringify(2.0)
    '2'
    > JSON.parse('2.1')
    2.1
    > JSON.parse('2.0')
    2

rvagg · 2021-05-11

yeah, you're making my point @vmx. The spec is saying that `2.0` should be encoded as `2.0`, but JavaScript can only do `2` since we can't distinguish. Then in the reverse, the spec says that a `2.0` should be decoded as a "float", but JavaScript still can't distinguish, so a `2` and a `2.0` would come out as the same thing regardless. Maybe one day we'll retain the information that "this was a float in the encoded form", but there's no infrastructure to attach that information today.

So the note basically says "JavaScript can't conform to this part of the spec". The spec says "... floats that do not have a fractional component should be encoded with a decimal point".
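
To make the limitation concrete, here is a small JavaScript sketch. The helper `encodeFloat` is hypothetical: it only works if the caller already knows, from some source outside the number itself, that the value is a Data Model float. Leaving exponent forms untouched is a simplification of this sketch, not something the spec text above states.

```javascript
// Hypothetical helper: serialize a number that the caller *knows* is a float.
// JavaScript cannot make this decision on its own, because 2 and 2.0 are
// the same value at runtime.
function encodeFloat(n) {
  if (!Number.isFinite(n)) {
    // JSON has no representation for NaN or Infinity.
    throw new RangeError('cannot encode a non-finite number');
  }
  const s = JSON.stringify(n);
  // Append ".0" when the output has no decimal point. Exponent forms
  // such as 1e+21 are left as they are in this sketch.
  return /[.e]/i.test(s) ? s : s + '.0';
}

console.log(encodeFloat(2));   // 2.0
console.log(encodeFloat(2.1)); // 2.1

// The round trip loses the distinction again on decode:
console.log(JSON.parse('2.0') === JSON.parse('2')); // true
```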

rvagg · 2021-05-11T02:14:37Z

As per a brief discussion today:

- go-ipld-prime just had "bytes" implemented in ipld/go-ipld-prime#166 ("Allow emitting & parsing of bytes per dagjson codec spec"), but this was done without the multibase prefix that the current version of the spec calls for.
- We have previously discussed removing the multibase prefix, but this spec change doesn't include that. See the original post for a discussion of that and why it was removed from scope.
- The current Rust implementation has a multibase prefix, Forest has some test fixtures using dag-json, and the current JavaScript dag-json has a multibase prefix too.

So either we should fix go-ipld-prime to match the spec, or we just bite the bullet, remove the prefix, and get it done everywhere. I can do this in JavaScript pretty quickly after merging this.

@warpfork @vmx @willscott (and anyone else): 👍 for "just remove the multibase prefix" and 👎 for "leave the spec alone, darn it!"
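
For readers weighing the two options, here is a minimal JavaScript sketch of what each would produce for the same bytes. The helper names are hypothetical. Two details are assumptions rather than statements from this thread: that the base64 is unpadded, and that the multibase prefix character for that encoding is `m` (taken from the multibase table).

```javascript
// Hypothetical helpers contrasting the two candidate encodings for bytes.
// Assumptions: unpadded base64, and "m" as the multibase prefix for it.
function base64NoPad(bytes) {
  return Buffer.from(bytes).toString('base64').replace(/=+$/, '');
}

// Option kept by the spec text under review at this point: with the prefix.
function encodeBytesWithPrefix(bytes) {
  return { '/': { bytes: 'm' + base64NoPad(bytes) } };
}

// Option being voted on: the same form without the prefix.
function encodeBytesWithoutPrefix(bytes) {
  return { '/': { bytes: base64NoPad(bytes) } };
}

const sample = Uint8Array.from([0xde, 0xad, 0xbe, 0xef]);
console.log(JSON.stringify(encodeBytesWithPrefix(sample)));    // {"/":{"bytes":"m3q2+7w"}}
console.log(JSON.stringify(encodeBytesWithoutPrefix(sample))); // {"/":{"bytes":"3q2+7w"}}
```

The difference on the wire is the single leading character, which is why the discussion turns on consistency and clarity rather than size.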

vmx · 2021-05-11T09:14:18Z

My reasons for having the multibase prefix:

- consistency: the CIDs are also Multibases. The rule would be: all base encoded things are Multibases (how cool would it be if that were true across all our specs?)
- we do self-describing things: to me, having a multibase isn't necessarily a signal that any arbitrary Multibase is allowed; it's for being self-describing.
- not performance: if you care about performance, don't use DAG-JSON. And even if you care about allocations, for base encodings you need to over-allocate slightly anyway, so just allocate one byte more for the prefix.
- why change now: generally we are hesitant to change things in specs that are already out there, unless we really have to. We don't know who else has implemented that spec already. So why change the status quo if it doesn't really make a practical difference (except for one of our own libraries, which we control, that didn't follow the spec)?

All that said, I don't have a strong opinion and can live with either way.

Ref: ipld/specs#356 Ref: ipld/go-ipld-prime#168

rvagg · 2021-05-11T11:58:58Z

- ipld/go-ipld-prime#168 ("Add multibase prefix to DAG-JSON bytes") has a multibase prefix
- ipld/js-dag-json#36 ("chore: add go-ipld-prime compat test case for bytes") has the same test case ipld-prime has

at first I couldn't get them to produce the same bytes and thought there must be a multibase incompatibility (the horror!), but it was `[]byte("deadbeef")` being deceptive, because that's actually a string, not a hex value, in spite of the nice hex characters.
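
The same trap can be reproduced in JavaScript. This is only an analogue of the Go fixture, not the fixture itself, and the two helper names are made up for the illustration: the text `deadbeef` is eight bytes, while the hex value `deadbeef` is four, so their base64 forms differ.

```javascript
// Hypothetical helpers showing the two readings of "deadbeef".
function bytesFromText(s) {
  return Buffer.from(s, 'utf8'); // one byte per ASCII character
}

function bytesFromHex(s) {
  return Buffer.from(s, 'hex'); // one byte per pair of hex digits
}

const asText = bytesFromText('deadbeef');
const asHex = bytesFromHex('deadbeef');

console.log(asText.length, asText.toString('base64')); // 8 ZGVhZGJlZWY=
console.log(asHex.length, asHex.toString('base64'));   // 4 3q2+7w==
```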

also, sorting, ipld-prime don't sort and that engages hostilities in @warpfork and is not fun for compatibility fixtures 🤦 and I kind of don't want to have that argument over and over so will skip it for now.

But basically, it's not a big deal to do it in go-ipld-prime. It's also not a big deal to strip it in @ipld/dag-json, but I'm a soft -1 on changing the spec and would rather leave the prefix in there, for the reasons stated in this thread and restated by @vmx. But it's not a big enough deal to hold out if someone else has particularly strong feelings and justification.

warpfork · 2021-05-11T13:13:14Z

Arguments against inserting multibase in this bytes field, just to have them written:

- it will become "another one of those bytes" that anyone wanting compactness will question and question and question again, just like the extra "zero" byte in CIDs in DAG-CBOR which confuses, surprises, and annoys every single person who ever has cause to notice it.
  - I've stopped saving references to every time this has happened about that zero in DAG-CBOR, but it's... many times. People (me included) repeatedly implement DAG-CBOR without it, and then go "what?!" when a fixture points out that it's needed; or people think it's a bug, and try to remove it again later on in development cycles! And this regularly comes from people working "down the hall" from us, not just distant strangers. Even if we never mind the excess bytes on the wire, the number of human hours this has wasted over time does not sit well with me. I'd like to not do this again.
- having a multibase indicator would suggest it's acceptable to use other multibase indicators here... which it is not; in turn, it is not desirable to allow multiple multibases to be used here because it would provide negative value if DAG-JSON were to be more expressive than the IPLD Data Model in this position.
- holistically: I think multibases are to be used when you've got a string and it's fairly user-facing, and there are no other disambiguators or indicators in the area that could help you know what that value is. This is not the scenario we have in the middle of dag-json: we already have 1) the multicodec indicator, from the outside, 2) all the information/confidence provided by the success of parsing as dag-json throughout the block, 3) the information/confidence provided by a single-key map with a "/" key, and 4) the information/confidence provided by a single-key map with a "bytes" key, and then 5) the confidence provided by the success of parsing the value therein as base64. This is a huge amount of confidence and a ton of non-accidental structure, and it seems to me heaping on another byte just to "be sure" it's clear what base is expected here would be wild overkill.
  - The (admittedly rather fine) distinction I'm trying to make here is that the "natural form" of this data is bytes (it's not a string of some multibase, it's bytes) and dag-json is in the business of specifying how those bytes are encoded in the serial format. I think that we have a goal of using indicators like multibase in contexts where there are not yet indicators; but the middle of a codec's serial forms is not such a context.
    - Can you imagine if we put a "zero" prefix on every bytes scalar in DAG-CBOR, too, just to make it really clear that they're bytes?
- We argued about this before in https://github.com/ipld/team-mgmt/blob/master/meeting-notes/2020/2020-11-05--ipld-specs.md#votes.
  - I distinctly remember trying to hold a vote and do an actual formal count (and now do regret not forcing that!) but as the notes say, and we reviewed and merged, in video, we had entirely assenting voices for removing multibase. There were more than a half dozen people there, fully engaged, and having just collaboratively reviewed the entire subject.
  - That meeting had more people and more of their attention dedicated to it than this re-review today has got, so I'd prefer to stick with it.

rvagg · 2021-05-12T11:25:30Z

fine, you appear to have a stronger opinion than either of us (although I'm not sure I'm convinced it's all that strong)

have pushed a new version that removes it and will do likewise in the JS impl

rvagg · 2021-05-12T11:25:50Z

(PTAL & merge if 👍)

Ref: ipld/specs#356

rvagg requested review from vmx, mikeal, mvdan, chafey, ribasushi and warpfork January 25, 2021 10:37

vmx reviewed Jan 25, 2021

warpfork reviewed Jan 25, 2021

warpfork mentioned this pull request Mar 27, 2021

dag-json encoding should proactively reject maps that would be confusable with links during decode ipld/go-ipld-prime#155

Open

rvagg mentioned this pull request Apr 27, 2021

Allow emitting & parsing of bytes per dagjson codec spec ipld/go-ipld-prime#166

Merged

dag-json: cleanup and fully specify reserved space

ed0d689

rvagg force-pushed the rvagg/dag-json-cleanup branch 2 times, most recently from 881cad3 to 0911186 Compare May 8, 2021 08:12

dag-json: further spec clarifications and cleanup

9de1861

rvagg force-pushed the rvagg/dag-json-cleanup branch from 0911186 to 9de1861 Compare May 8, 2021 08:13

rvagg marked this pull request as ready for review May 8, 2021 08:13

rvagg mentioned this pull request May 8, 2021

Use cborg backend, implement new strictness hotness ipld/js-dag-json#35

Merged

vmx approved these changes May 10, 2021

rvagg mentioned this pull request May 11, 2021

Add multibase prefix to DAG-JSON bytes ipld/go-ipld-prime#168

Closed

rvagg added a commit to ipld/js-dag-json that referenced this pull request May 11, 2021

chore: add go-ipld-prime compat test case for bytes

b7bfc8a

Ref: ipld/specs#356 Ref: ipld/go-ipld-prime#168

rvagg mentioned this pull request May 11, 2021

chore: add go-ipld-prime compat test case for bytes ipld/js-dag-json#36

Closed

dag-json: remove multibase prefix from bytes

c32eecf

rvagg added a commit to ipld/js-dag-json that referenced this pull request May 12, 2021

fix!: remove multibase prefix from bytes

35515b4

Ref: ipld/specs#356

dag-json: more go-ipld-prime spec difference clarification

4d0fd8b

rvagg added a commit to ipld/js-dag-json that referenced this pull request May 13, 2021

fix!: remove multibase prefix from bytes

864dbda

Ref: ipld/specs#356

rvagg merged commit 9bab36f into master May 19, 2021

rvagg deleted the rvagg/dag-json-cleanup branch May 19, 2021 05:30

rvagg added a commit to ipld/js-dag-json that referenced this pull request May 19, 2021

fix!: remove multibase prefix from bytes

88c4f43

Ref: ipld/specs#356
