Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of undefined for old blocks #44

Closed
ukstv opened this issue Dec 1, 2021 · 7 comments
Closed

Handling of undefined for old blocks #44

ukstv opened this issue Dec 1, 2021 · 7 comments

Comments

@ukstv
Copy link

ukstv commented Dec 1, 2021

We at Ceramic used the legacy codec to store data, when it was the only option. Apparently, some of the data contain undefined. New js-ipfs release can retrieve a block, but can not decode it, as undefined is not allowed anymore in code. IMO, it is a bug to so drastically change the code and not make it visible for a general audience. You know, SemVer was created for a reason.

As ipfs is released with this dag-cbor codec which is more strict than the legacy dag-cbor codec, I am wondering, what is the plan to accomodate for old blocks created by the legacy codec? Could you lift the restrictions for undefined, maybe, so that old IPFS records could be properly decoded still?

@rvagg
Copy link
Member

rvagg commented Dec 2, 2021

Sorry, but undefined should never have been allowed and was only possible because it used https://github.com/dignifiedquire/borc which would even let you encode Dates, and its own custom BigNumber objects as well. The IPLD data model has a strict set of kinds that work throughout systems that support IPLD and undefined will break in many places. Your existing data with undefined is going to break if you pass it through Go IPLD / IPFS code for instance. https://ipld.io/docs/data-model/kinds/

You know, SemVer was created for a reason.

Snark isn't particularly helpful. And while I agree and would prefer that js-ipfs be >1 by now so we can do proper signalling, that's a decision by others and they are currently using <0 semver semantics which do allow for breaking changes in minor release. So technically it is following semver.

Could you lift the restrictions for undefined, maybe, so that old IPFS records could be properly decoded still?

You're welcome to do this now by using the BLOCK API to fetch the raw bytes and use the older DAG-CBOR codec or just use a CBOR decoder directly (using borc) and deal with the CID tags yourself. If you really wanted to you could even fork this repo and replace the CBOR decoder with borc to make it behave how you want and then force js-ipfs to use it instead of this codec whenever it encounters a DAG-CBOR block to decode (I believe you can override existing codecs when you supply new ones that conflict). In fact, it's probably easier than that since you may be able to just undo https://github.com/ipld/js-dag-cbor/blob/master/index.js#L88 and the undefineds will flow freely.

It's really important that you ensure that new data doesn't get encoded with undefined, it'll be treated as a bad block by the newer JS stack and the Go stack is the same, and since Go runs most of the backend infra around the world for IPFS, that's bad news: https://github.com/polydawn/refmt/blob/30ac6d18308e584ca6a2e74ba81475559db94c5f/cbor/cborDecoder.go#L213-L218

Sorry for the annoyance on this, clarifying the boundaries of our codecs and ensuring fully compatibility across the stack has been a long journey. Hopefully the current level of clarity helps, though.

@rvagg rvagg closed this as completed Dec 2, 2021
@ukstv
Copy link
Author

ukstv commented Dec 2, 2021

@rvagg Reference to go-ipfs is misleading in this case. Despite the linked change is 3 years old, go-ipfs 0.9 is able to return an old cbor too, unlike go-ipfs 0.10. Maybe this repository is not the right venue to discuss this, but anyway with the recent changes both go-ipfs and js-ipfs can not load old blocks.

@ukstv
Copy link
Author

ukstv commented Dec 2, 2021

I could blame own negligence on not following IPLD spec, but changes to the spec making undefined prohibited are also recent. On 2020-10-14 it became prohibited, on 2021-01-28 the old codec became "legacy", and at that time there were no proper codec available.

@oed
Copy link

oed commented Dec 2, 2021

Hey @rvagg this is a big problem for us as it basically breaks all existing data on Ceramic.

It's really important that you ensure that new data doesn't get encoded with undefined

This makes sense and we will definitely do it going forward, but we need a solution for loading existing data. Does it really hurt to keep decode functions that allow for undefined to allow for loading of data that has already been created with older versions?

We also found that this strictness is also enforced on the go-ipfs side with the new go-ipld-prime, while it works in go-ipfs < 0.10.

@rvagg
Copy link
Member

rvagg commented Dec 8, 2021

Hey, so a few of us have been furiously discussing this issue and the merits and technical issues involved. One outstanding question has been the Go stack and whether it's ever been able to properly recognise undefined. As I pointed out above, refmt isn't happy about undefined by default so my assumption has been that trying to get go-ipfs to decode one of these blocks is going to end up with pain, either with current versions or historical versions.

But I decided to dig deeper because you've insisted that this isn't correct and we've been doing some head scratching. So here's what I've found, and it looks like we're in a dejavu situation from almost exactly 4 years ago:

  1. go-ipld-cbor originally used whyrusleeping/cbor/go for its CBOR handling. My read of its major-7 handling is that it just ignored undefined and any of the other possible strange values that can come in via that route, effectively treating them as unset values (likely nil or whatever the unset value of the field would be, although I haven't spent much time figuring this bit out): https://github.com/whyrusleeping/cbor/blob/63513f603b11583741970c5045ea567130ddb492/go/cbor.go#L408-L438
  2. In September 2018, go-ipld-cbor has a big refactor and switches to the fairly new refmt: Refactor to refmt ipfs/go-ipld-cbor#30, at some point soon after this got released into go-ipfs.
  3. Someone files a bug report to go-ipfs complaining that their JS encoded data (using our old codec, which uses borc, which will encode undefined among other inappropriate terminals) no longer works with go-ipfs, because undefined is now rejected as an invalid token.
  4. The fix is to introduce coercion into refmt (polydawn/refmt@ede3ae2), make it into an opt-in (polydawn/refmt@bbf0067), and then opt in go-ipld-cbor (ipfs/go-ipld-cbor@c306514) and ship that in go-ipfs. That change is live today in go-ipld-cbor: https://github.com/ipfs/go-ipld-cbor/blame/master/encoding/unmarshaller.go#L29
  5. go-ipld-prime gets its own dag-cbor encoder, still using refmt, but not enabling that option.
  6. Over the course of this year, go-ipld-prime was integrated into go-ipfs, primarily for the DAG API, but this mostly (or completely?) replaces go-ipld-cbor as of v0.10. IPLD Prime In IPFS: Target Merge Branch ipfs/kubo#7976 is the main PR for this but there are others for this work.
  7. Last year, I rewrote all the primary JS codecs, including dag-cbor, and it now uses a new encoder, cborg, which is intentionally strict and limited in comparison to borc as it's focused on our content-addressed use-case.
  8. During the course of this year, Alex upgraded js-ipfs to use our new js-multiformats stack and all our new codecs—our old set of codecs have been retired, we no longer use them anywhere.
  9. This issue.
  10. dagcbor: coerce undef to null. go-ipld-prime#308
  11. ... I guess a change on this repo is due too now.

So it looks like we're repeating 2018, but now with strictness on both JS and Go sides.

We still need to retain strictness on encode for this to all work nicely, though, so hopefully Ceramic, and anyone else who happens to be relying on undefined can find a work-around for existing code that may rely on it.

@oed
Copy link

oed commented Dec 8, 2021

Thanks for digging into this and finding a path forward @rvagg extremely appreciated! Let us know if we can help in any way on the js side, since the go side has a fix (yet to be confirmed to work?).

We still need to retain strictness on encode for this to all work nicely, though, so hopefully Ceramic, and anyone else who happens to be relying on undefined can find a work-around for existing code that may rely on it.

Definitely, no problem for us to not use undefined on encode. We just need a way to be able to retrieve (and pin) old dag-cbor objects.

@rvagg
Copy link
Member

rvagg commented Dec 8, 2021

#45 WIP, it's slightly more complicated than on the Go side which had the "coerce" flag ready to go.

@rvagg rvagg reopened this Dec 8, 2021
rvagg added a commit that referenced this issue Dec 8, 2021
rvagg added a commit that referenced this issue Dec 13, 2021
For legacy data since this has been allowed by dag-cbor for a long time.
New data should _not_ take advantage of this, `undefined` should not appear
in IPLD data structures as it is not part of the IPLD data model.

Fixes: #44
Ref: ipld/go-ipld-prime#308
@rvagg rvagg closed this as completed in 8145728 Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants