Migration 11-12: migrate CIDsv1 to raw multihashes #95
Conversation
Force-pushed from 823c99f to 4658cb0
Comments and open questions
Force-pushed from 4658cb0 to b79320a
(you checked in a binary)
Force-pushed from 95cbf1b to d330dcf
(removed binary file that was checked in)
ipfs-8-to-9/migration/go.mod (outdated)

github.com/ipfs/go-cid v0.0.5
github.com/ipfs/go-datastore v0.3.1
github.com/ipfs/go-filestore v0.0.4-0.20200203163551-8cc45055f624 // indirect
github.com/ipfs/go-ipfs v0.4.22-0.20200130064341-6750ee973e2a
TODO: link to newer after ipfs/kubo#6816 has the right dep-set and is green.
So, I have some bad news: reverts are going to get more complicated. If we already have a block as both CIDv1 and CIDv0, the revert will delete the CIDv0 copy. We can easily fix this by either not deleting on revert, or by checking whether we already have the block before marking it as something we've swapped.

Unfortunately, the second issue is much worse: if we upgrade, add some files with CIDv1, then downgrade, we'll GC those files. It'll still be possible to read the files, given that we check both the V0 and V1 CIDs on read. The only solution I can think of is to revert by walking the pinset/MFS. If we do that, with https://github.com/ipfs/go-ipfs/blob/master/thirdparty/cidv0v1/blockstore.go, we'll avoid GCing anything we want to keep. Thoughts? I'd rather not say "thou shalt not revert" given all the other features and potential issues with this release, but this is getting to be a very complicated process.
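For context, the linked cidv0v1 blockstore wraps reads with a version fallback. As a hedged sketch only (assumed names and the context-free Blockstore API of that era, not the actual go-ipfs code), the idea looks roughly like this:

```go
package main

import (
	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
	bstore "github.com/ipfs/go-ipfs-blockstore"
	mh "github.com/multiformats/go-multihash"
)

// v0v1Blockstore wraps a blockstore so that a read retries with the other
// CID version of the same dag-pb multihash when the first lookup misses.
type v0v1Blockstore struct {
	bstore.Blockstore
}

// otherVersion returns the CIDv0 form of a sha2-256 dag-pb CIDv1 and vice
// versa. Other CIDs have no CIDv0 form and are returned unchanged.
func otherVersion(c cid.Cid) cid.Cid {
	p := c.Prefix()
	if p.Codec != cid.DagProtobuf {
		return c
	}
	if c.Version() == 0 {
		return cid.NewCidV1(cid.DagProtobuf, c.Hash())
	}
	if p.MhType != mh.SHA2_256 || p.MhLength != 32 {
		return c
	}
	return cid.NewCidV0(c.Hash())
}

// Get tries the CID as given and falls back to the other version on a miss.
func (bs *v0v1Blockstore) Get(c cid.Cid) (blocks.Block, error) {
	b, err := bs.Blockstore.Get(c)
	if err == bstore.ErrNotFound {
		return bs.Blockstore.Get(otherVersion(c))
	}
	return b, err
}
```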
OK, doable somehow. We can tag certain migrations as "keep" and not delete the blocks during revert.
This is getting hellish. The GC code (I assume) already knows how to list all the CIDs under the pinset and MFS, so maybe it is not overly complicated to add reverting those blocks as an extra step in the revert process?
SGTM.
I agree. I'm starting to think we should put this in its own release instead of squeezing it into 0.5.0. On the other hand, I know the browsers team has been looking forward to this.
Yeah, we'd need to use the GC coloring code to get all "live" blocks.
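For illustration, a minimal sketch of such a "coloring" walk, collecting every block reachable from the pin (and MFS) roots. The names and structure here are assumptions, not go-ipfs's actual gc.ColoredSet:

```go
package main

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// liveSet walks the DAG under every root (pin roots, MFS root) and returns
// the set of all reachable CIDs, i.e. the blocks a revert must not lose.
func liveSet(ctx context.Context, ng ipld.NodeGetter, roots []cid.Cid) (*cid.Set, error) {
	live := cid.NewSet()
	var walk func(c cid.Cid) error
	walk = func(c cid.Cid) error {
		if !live.Visit(c) {
			return nil // already colored
		}
		nd, err := ng.Get(ctx, c)
		if err != nil {
			return err
		}
		for _, l := range nd.Links() {
			if err := walk(l.Cid); err != nil {
				return err
			}
		}
		return nil
	}
	for _, r := range roots {
		if err := walk(r); err != nil {
			return nil, err
		}
	}
	return live, nil
}
```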
@Stebalien Any progress on this? I came to look up progress on this change from ipfs/kubo#6815.
js-ipfs moved forward with their migration: ipfs/js-ipfs#3124. The go team doesn't have anyone to work on this at the moment.
Force-pushed from b5db12d to 704c741
I've brought the branch up to date and will continue where I left off in the good old days.
Force-pushed from 704c741 to 89c5b50
In this case, we need to keep them around during the revert. We do this by prepending + to the list. When doing reverts, we do not delete raw multihashes that already existed. This happens only if the user had both the CIDv1 and the CIDv0 for something. A small sketch of the idea follows.
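As a hedged illustration only (the backup file format and helper name below are assumptions, not the migration's actual code): a leading "+" marks an entry that already existed, so the revert keeps it.

```go
package main

import "strings"

// parseBackupLine is a hypothetical helper: a leading "+" marks a multihash
// that existed before the migration, so the revert must not delete it.
func parseBackupLine(line string) (entry string, deleteOnRevert bool) {
	if strings.HasPrefix(line, "+") {
		return strings.TrimPrefix(line, "+"), false
	}
	return line, true
}
```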
This is not addressed yet, but I don't remember if we wanted to address it, because the migration was meant to happen on a big go-ipfs release (with high chances for regressions and downgrades). This happens when the user downgrades after having added CIDv1-addressed content; in that case, those "new" blocks would stay referenced by multihash in the datastore (as CIDv0). They could still be read, because the blockstore is wrapped in something that tries reading the CIDv1 but fails over to the CIDv0, but if a GC runs at that point, they could be GC'ed. @Stebalien @aschmahmann is this something we still want to address? The proposed solution is to:
I'm pretty sure there are still corner cases: e.g. if the user, post-migration, adds something with CIDv0 to MFS, then adds the same thing with CIDv1 and pins it. It is likely that we will be removing the CIDv0 block unless we have an additional check to see whether the equivalent of a CIDv1 is present in the pin/MFS trees as CIDv0. This will add more memory pressure, or we have to opt for a bloom filter check.
OK, what we can do is revert everything from the backup file plus all the CIDv1 pins from the pinset and MFS, and leave the raw-multihash blocks in place (do not delete anything on revert). That would cover the corner cases of the migration deleting something it should not have deleted on revert, it is simple, and the user can choose to GC after the revert. The downside is that the repo size will potentially grow during revert. Note that we are always talking about CIDv1 content (not the default).
This commit ensures that we traverse all the CIDs in the pinset and MFS trees and revert them to CID-based addressing in the datastore when the migration is reversed. This addresses the problem where a user has added new CIDv1 pins or modified MFS with CIDv1 content before reverting the migration. In that case, those new multihash-addressed blocks would stay as such and, while they could potentially still be read, they would be wiped out by GC in the older version of the code.

Additionally, there are corner cases when content is added with both CIDv1 and CIDv0 (but the same multihash). These were partially addressed already by keeping track of CIDv0 blocks that already existed for CIDv1 equivalents, and by some logic during reverts to handle blocks referenced from several CIDv1s. However, things like adding CIDv1- and CIDv0-equivalent blocks to the pinset after migration would not be handled properly.

To address all these issues, and to simplify, we have opted not to delete any raw-multihash-addressed block during revert. We convert as much as we can to CIDv1 (using the backup file and the pinset/MFS trees), but otherwise leave all the blocks. This simplifies the code and makes reverts more memory-friendly at the expense of extra disk space.

As a final reminder, CIDv1s are not the default. We can expect that, in the general case, people have content that is CIDv0-addressed and requires no action whatsoever when migrating either forward or backwards.
The last commit walks the MFS and pinset trees and ensures all those blocks are reverted to CIDv1 addressing. Also, we do not delete any blocks on revert anymore. There are many corner cases around deleting blocks that were in fact expected to be there after migration because they had both CIDv1 and CIDv0. The last commit mostly reverts the previous one, as it is no longer necessary to track which blocks had both CIDv1 and CIDv0 addressing if we do not delete them on revert.
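A hedged sketch of the revert idea described in these last comments (assumed helper names and the context-free go-datastore API of that era; not the migration's actual code):

```go
package main

import (
	cid "github.com/ipfs/go-cid"
	ds "github.com/ipfs/go-datastore"
	dshelp "github.com/ipfs/go-ipfs-ds-help"
)

// revertBlock re-creates the CID-addressed entry for a CIDv1 found in the
// backup file or while walking the pinset/MFS. The multihash-addressed entry
// is left in place: nothing is deleted on revert, at the cost of disk space.
func revertBlock(dstore ds.Datastore, c cid.Cid) error {
	if c.Version() == 0 {
		return nil // CIDv0 entries are already keyed by their raw multihash
	}
	mhKey := ds.NewKey("/blocks" + dshelp.MultihashToDsKey(c.Hash()).String())
	cidKey := ds.NewKey("/blocks" + dshelp.CidToDsKey(c).String())

	val, err := dstore.Get(mhKey)
	if err != nil {
		return err
	}
	return dstore.Put(cidKey, val)
}
```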
@hsanjuan does this now mean that reversions are safe (even in the CIDv1 cases), and you'll just end up with more data you'll have to GC?
I have added tests. @gammazero it seems this wants me to publish the new migration to dist.ipfs.io before it can successfully check that
It may be impossible to review once the vendored dependencies are added, but the tests break when they are not there.
One thing to double-check is whether it is ok to use ColoredSet() from go-ipfs v0.8.0 for the reverts (probably yes, because it does not convert CIDs).
Afaik tests fail because they were broken in master after #130.
The tests appear to be passing now. Maybe a network issue in CI?
hm, thanks for checking @gammazero!
Force-pushed from 3c2d9b1 to 2ea97fd
Dependency vendoring is in #131
Blocked until the release (ipfs/kubo#8058)
I have been testing the migration using a 2TB badger repository on an AWS EC2 machine with CPU/16GB RAM and 15TB of EBS storage (IO optimized). This repository is additionally full of CIDv1s (2M+ blocks need migration). This takes about a day or two.
Looks like CI failure due to needing to re-vendor. |
2021-07-01 conversation notes:
Merged …-vendored: Migration 11-12: vendor dependencies
This adds the migration to convert CIDv1s in the datastore "/blocks" namespace to raw multihashes.
Testing is missing, but initial testing on my local repo (~5000 such CIDs) seems to work on both flatfs (slow) and badger (really fast). To be rebased on #94. More comments inlined.
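To make the re-keying step concrete, here is a hedged sketch of the forward direction (assumed helper names and the context-free datastore API of that era; not the migration's actual code):

```go
package main

import (
	cid "github.com/ipfs/go-cid"
	ds "github.com/ipfs/go-datastore"
	dshelp "github.com/ipfs/go-ipfs-ds-help"
)

// migrateBlock moves a CIDv1-addressed entry in the "/blocks" namespace so it
// is keyed by its raw multihash instead of by the full CID. CIDv0 entries are
// untouched: a v0 CID's bytes are already exactly its sha2-256 multihash.
func migrateBlock(dstore ds.Datastore, c cid.Cid) error {
	if c.Version() == 0 {
		return nil
	}
	oldKey := ds.NewKey("/blocks" + dshelp.CidToDsKey(c).String())
	newKey := ds.NewKey("/blocks" + dshelp.MultihashToDsKey(c.Hash()).String())

	val, err := dstore.Get(oldKey)
	if err != nil {
		return err
	}
	if err := dstore.Put(newKey, val); err != nil {
		return err
	}
	// The old CID is recorded in a backup file (not shown) so the migration
	// can be reverted; only then is the CID-addressed entry removed.
	return dstore.Delete(oldKey)
}
```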