dag-json: cleanup and fully specify reserved space #356
block-layer/codecs/dag-json.md
Outdated
* Maps with more than one key, where the first key is `"/"` and its value is a string, e.g. `{"/":"foo","bar":"baz"}`.
  - Where a key exists that sorts before `"/"`, the map is valid, e.g. `{"0bar":"baz","/":"foo"}`.
  - Where the value of the `"/"` entry is not a string, the map is valid, e.g. `{"/":true,"bar":"baz"}`.
It took me a while to grasp that the main item is the rule to reject something and that the sub-items are rules that would not be rejected.

Perhaps it could be expanded and we give the rules even names. Example:

**Multiple keys**

Rejection criteria: Maps with more than one key, where the first key is `"/"` and its value is a string.

Examples of invalid DAG-JSON:

- `{"/":"foo","bar":"baz"}`
- `{"/":"foo","bar":true}`

Examples of valid DAG-JSON:

- `{"0bar":"baz","/":"foo"}`: There is a key that sorts before `"/"`. The value following the `"/"` will not be interpreted as a link.
- `{"/":true,"bar":"baz"}`: The value after the `"/"` is not a string.
- `{"/":"foo"}`: There is only the single map key `"/"`. The value is expected to be a valid base-encoded CID.
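The rejection rule being discussed can be sketched as a small check over a decoded map, assuming the map's keys are available in their encoded order (`mustRejectMap` is an illustrative name, not an API from any dag-json library):

```javascript
// Hypothetical helper for the rule above: a map must be rejected only
// when its FIRST key is "/", the value of that entry is a string, and
// the map has more than one key.
function mustRejectMap(obj) {
  const keys = Object.keys(obj);
  if (keys.length < 2) return false;    // a single-key {"/": ...} may be a link
  if (keys[0] !== '/') return false;    // a key sorting before "/" makes it plain data
  return typeof obj['/'] === 'string';  // "/" with a non-string value is plain data
}

mustRejectMap({ '/': 'foo', bar: 'baz' });    // → true (link form with extra keys)
mustRejectMap({ '0bar': 'baz', '/': 'foo' }); // → false
mustRejectMap({ '/': true, bar: 'baz' });     // → false
```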
I think I've addressed this, but worth a review
I went digging for our discussion notes (and merged some stale PRs) and found them here:
force-pushed from 881cad3 to 0911186

force-pushed from 0911186 to 9de1861
I've addressed the reviews so far and I think this is ready for landing. The reserved namespace rules are hard to be clear about, but I've done the shuffle recommended by @vmx and separated things out a little more to hopefully make it easier to digest.

I also have a working JavaScript version that I'm pretty much ready to release. It uses the same backend as the DAG-CBOR codec but with JSON semantics. We've lost some speed in the process but we get all the reserved namespace rules and all the strictness.

@warpfork care to review the changes?
The rewrite of the "Parse rejection modes" reads great now!
JavaScript itself that has trouble with them. This means JS implementations of DAG-JSON can't use the native JSON parser and serializer if integers bigger than `2^53 - 1` should be supported.

Data Model floats that do not have a fractional component should be encoded **with** a decimal point, and will therefore be distinguishable from an integer during round-trip. (Note that since JavaScript still cannot distinguish a float from an integer where the number has no fractional component, this rule will not impact JavaScript encoding or decoding.)
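The integer limitation above is easy to demonstrate in Node.js: `JSON.parse` produces IEEE-754 doubles, so an integer above `2^53 - 1` silently loses precision and cannot round-trip through the native JSON machinery.

```javascript
// 2^53 + 1 cannot be represented exactly as a double; JSON.parse
// rounds it to the nearest representable value (2^53).
const big = '9007199254740993'; // 2^53 + 1
const parsed = JSON.parse(big);
console.log(Number.isSafeInteger(parsed)); // false (outside the safe range)
console.log(String(parsed) === big);       // false (the value was rounded)
```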
I'm not sure if the note about JavaScript is needed, as it does exactly that when encoding/decoding JSON:

```
> JSON.stringify(2.1)
'2.1'
> JSON.stringify(2.0)
'2'
> JSON.parse('2.1')
2.1
> JSON.parse('2.0')
2
```
Yeah, you're making my point @vmx. The spec is saying that `2.0` should be encoded as `2.0`, but JavaScript can only do `2` since we can't distinguish. Then in the reverse, the spec says that a `2.0` should be decoded as a "float", but JavaScript still can't distinguish, so a `2` and a `2.0` would come out as the same thing regardless. Maybe one day we'll retain the information that "this was a float in the encoded form", but there's no infrastructure to attach that information today.

So the note basically says "JavaScript can't conform to this part of the spec". The spec says "... floats that do not have a fractional component should be encoded with a decimal point".
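The "can't distinguish" point above comes down to JavaScript having a single `Number` type, so a float with no fractional part is literally the same value as the integer; the information the spec asks encoders to preserve doesn't exist in a JS `Number`:

```javascript
// A "float" 2.0 and an "integer" 2 are the same JS value, so an
// encoder has nothing to consult when deciding whether to emit "2.0".
console.log(2.0 === 2);             // true, same value and type
console.log(Number.isInteger(2.0)); // true
console.log(JSON.stringify(2.0));   // '2', the decimal point can't be kept
```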
As per a brief discussion today: either we fix go-ipld-prime to match the spec, or we bite the bullet, remove it, and get it done everywhere—I can do this in JavaScript pretty quickly after merging this. @warpfork @vmx @willscott (and anyone else): 👍 for "just remove the multibase prefix" and 👎 for "leave the spec alone, darn it!".
My reasons for having the multibase prefix:
All that said, I don't have a strong opinion and can live with it either way.
At first I couldn't get them making the same bytes and thought there must be a multibase incompatibility (the horror!), but it's also sorting: ipld-prime doesn't sort, and that engages hostilities in @warpfork and is not fun for compatibility fixtures 🤦, and I kind of don't want to have that argument over and over so will skip it for now. But basically, it's not a big deal to do it in go-ipld-prime. It's also not a big deal to strip it in
Arguments against inserting multibase in this bytes field, just to have them written:
Fine, you appear to have a stronger opinion than either of us (although I'm not sure I'm convinced it's all that strong). I've pushed a new version that removes it and will do likewise in the JS impl.
(PTAL & merge if 👍)
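For anyone following along, the prefix change being settled here can be sketched roughly as below, assuming the `{"/":{"bytes":"..."}}` bytes form under discussion in this PR; `stripMultibasePrefix` is an illustrative helper, not a real API. The bytes string previously carried a leading `m` (the multibase code for unpadded base64), and the pushed version drops it, leaving bare base64:

```javascript
// Hypothetical helper: convert the old prefixed form to the new bare form.
function stripMultibasePrefix(s) {
  if (s[0] !== 'm') throw new Error('expected multibase base64 prefix "m"');
  return s.slice(1);
}

const b64 = Buffer.from([1, 2, 3]).toString('base64'); // 'AQID'
const oldForm = { '/': { bytes: 'm' + b64 } }; // with multibase prefix
const newForm = { '/': { bytes: stripMultibasePrefix(oldForm['/'].bytes) } };
console.log(JSON.stringify(newForm)); // {"/":{"bytes":"AQID"}}
```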
I've been working on figuring out how to codify the rules that we discussed during some meeting. I can't find the particular meeting by looking at the minutes in https://github.com/ipld/team-mgmt so I'm doing this from memory. But the main points were:
- Fully specify the `{"/":...}` namespace so that tokenizing parsers don't have to buffer and unroll too much data but can quickly reject bad forms without getting too deep.
- Remove the multibase prefix from the encoded bytes.

After working through some of this, I've flipped on my opinion of the second point and would rather it stay as proper multibase, even if it's still strictly base64. It adds clarity and there's scope for using this for further validation if we want to. It's a single byte and I think it increases the clarity and reduces the number of edge cases. So I haven't included that change here, but discussion is welcome if you feel strongly!
The bit to pay most attention to below is "The Reserved Namespace", where I've tried to enumerate the rules. It turns out to be quite hard to explain it with crystal clarity, but that could be me, so suggestions welcome!
Along with this, as exploration, I've been working on a version of JS DAG-JSON that does tokenized decoding and uses the same base encoder/decoder as the new JS DAG-CBOR codec, just by swapping out the backend. There are some consistency benefits here in having the same code paths and same affordances for applying data model strictness. I'm still uncertain whether I'll actually propose that we merge this, although I did uncover some bugs in the current implementation in the process, but the code is here: https://github.com/ipld/js-dag-json/compare/rvagg/cborg (see the innards of `DagJsonTokenizer` for the token parsing rules that arise from the reserved space rules).
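The "quickly reject without getting too deep" idea can be sketched roughly as follows; the token kinds and the function name are invented for illustration and are not the actual `DagJsonTokenizer` API:

```javascript
// Illustrative sketch: once a tokenizer has seen '{', the first key, and
// the kind of the value token that follows, it can already classify the
// map without buffering the rest of it.
function classifyAfterFirstEntry(firstKey, valueTokenType) {
  if (firstKey !== '/') return 'plain-map';                // reserved rules don't apply
  if (valueTokenType === 'string') return 'maybe-link';    // must be the only entry
  if (valueTokenType === 'map-open') return 'maybe-bytes'; // e.g. {"/":{"bytes":...}}
  return 'plain-map';                                      // e.g. {"/":true,...}
}

classifyAfterFirstEntry('/', 'string');   // → 'maybe-link'
classifyAfterFirstEntry('bar', 'string'); // → 'plain-map'
```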