Migration of Blockstore to use multihash instead of CID as key #2415
Comments
I’m a little concerned about how this could paint us into a corner in the long term. This change makes sense for the current implementation of our storage engine but there’s a lot of upgrades we could make in the future that wouldn’t be possible if we aren’t keying on the CID. For instance, we could dramatically speed up GC if we were creating indexes at write time of the blocks that are linked to by each block. This is only possible if we know how to interpret the block, which means we need the CID. If we migrate the internals to lose this information we won’t be able to make improvements like this in the future. |
You are definitely right that this should be thought through with the long term in mind! Yet I am not sure I will be able to give much input on that part. What I feel, though, is that with the introduction of CIDv1 the CID stopped being as good an identifier as CIDv0 was, which I believe was designed to be one and hence was used as such in implementations. I am not sure whether this has already been discussed somewhere, but don't you think that the position of the CID has shifted a bit with the introduction of CIDv1? An example of that could be the problem that @Stebalien mentioned. If I understand it correctly, if you have stored some data under a base58 CID and then try to fetch it using a base32 CID, IPFS won't find the content because it relies on CIDs and not hashes. For this reason I think a shift to some better identifier is needed, and multihashes seem a pretty good option. I am not sure if there is some other option. As I mentioned in my |
I’m not familiar with this specific problem, but that sounds like a base encoding issue and not a CID vs Multihash issue. You’d have the same issue with multihash if you were storing it by the base encoded hash and accepting base encoded strings as identifiers. On the IPLD side, we’ve taken to using instances of
That’s certainly a possibility, but it would impact performance as it would mean an additional write whenever a block is added or removed. However, if we were to do something like the optimization I suggested above, we’d already be doing this kind of thing anyway. Moving to a more complex storage system with indexing should be on the roadmap somewhere. IMO, that’s a pre-requisite to moving to storing the underlying block data keyed by multihash (which I think is a good idea once we have a more sophisticated storage layer). In other words, my concerns are only regarding what we consider the key at a high level, which I think should be a |
I believe this is pretty much what we are aiming to do: keep the CID throughout nearly all of the internals, and use the multihash only in a few of the lowest-level components (storage and maybe Bitswap(?)). PS.: Thanks for fixing the typo in the title |
As I look into this more and more (playing around with a migration for this), I think it is necessary to have a way to get the original CID. This is mainly important for querying which blocks are available. I am leaning towards extending the block's data with some metadata that makes it possible to rebuild the CID(s) for the block. The rebuild process will be needed only for querying purposes. My idea is to have the following data schema:
This should not add big space overhead as the biggest part of CID is the multihash itself, which will be available already as the key of the block. Also, this approach will not require additional write/read as the "map" idea would have. The whole concept would be hidden in the implementation of What are your thoughts about this? |
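The schema from the comment above is not reproduced in this thread, so the following is only a minimal sketch of the general idea under assumed names. It uses a hypothetical value layout of `<cid version><cid codec><block bytes>`, keeps the multihash as the key, and rebuilds the CID from both. The helpers `encodeValue` and `decodeValue` are made up for illustration, and single-byte version and codec values are assumed for brevity.

```javascript
// Hypothetical value layout (an assumption, not the schema from the thread):
//   <cid version (1 byte)><cid codec (1 byte)><block bytes>
function encodeValue (version, codec, data) {
  return Buffer.concat([Buffer.from([version, codec]), data])
}

// Rebuild the CID from the multihash key plus the two metadata bytes.
function decodeValue (multihashKey, value) {
  const version = value[0]
  const codec = value[1]
  const data = value.subarray(2)
  // A CIDv0 is the bare multihash; a CIDv1 is <version><codec><multihash>.
  const cid = version === 0
    ? multihashKey
    : Buffer.concat([Buffer.from([version, codec]), multihashKey])
  return { cid, data }
}

// Example: a made-up sha2-256 multihash (0x12, 0x20, then 32 digest bytes).
const exampleKey = Buffer.concat([Buffer.from([0x12, 0x20]), Buffer.alloc(32, 1)])
const storedV1 = encodeValue(1, 0x71, Buffer.from('abc'))
const storedV0 = encodeValue(0, 0x70, Buffer.from('abc'))

const rebuiltV1 = decodeValue(exampleKey, storedV1)
const rebuiltV0 = decodeValue(exampleKey, storedV0)
console.log(rebuiltV1.cid.toString('hex'))
console.log(rebuiltV0.cid.equals(exampleKey))
```

The overhead per block is two bytes in this layout, which matches the point that most of a CID is the multihash already present in the key.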
It’s not entirely clear to me how you’re storing that. One thing I’d point you at that might help is bytewise: https://github.com/deanlandolt/bytewise This would allow you to encode much more complex sorting rules and additional data into the key rather than the value. So, if you want this metadata you could just encode
In fact, if what you need is a list of all the cids for a given multihash, and you know you need to support multiple cids for the same multihash, I’d store this on every write:
const bulk = [
  { key: bytewise([multihash, true]), value: blockData },
  { key: bytewise([multihash, cidVersion, cidCodec]), value: null }
]
Now it’s super easy to get a block by multihash, just get
This locks us into k/v stores that support range queries, but I think we’re heading in that direction anyway. |
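To make the range-query idea concrete, here is a rough sketch that does not use bytewise or charwise at all. It assumes a hypothetical string key layout (`<multihash>/block` and `<multihash>/cid/<version>/<codec>`) and a tiny in-memory `OrderedStore` stand-in for any key/value store that can scan a key range. All names here are illustrative.

```javascript
// Minimal in-memory stand-in for an ordered key/value store.
class OrderedStore {
  constructor () {
    this.entries = new Map()
  }

  put (key, value) {
    this.entries.set(key, value)
  }

  get (key) {
    return this.entries.get(key)
  }

  // Return all [key, value] pairs with gte <= key < lt, sorted by key.
  range (gte, lt) {
    return [...this.entries.entries()]
      .filter(([key]) => key >= gte && key < lt)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
  }
}

// Hypothetical key layout: one entry for the data, one per known CID variant.
const dataKey = (mh) => `${mh}/block`
const cidKey = (mh, version, codec) => `${mh}/cid/${version}/${codec}`

function putBlock (store, mh, version, codec, blockData) {
  store.put(dataKey(mh), blockData)
  store.put(cidKey(mh, version, codec), null)
}

// List every (version, codec) pair recorded for a multihash via a prefix scan.
function cidsFor (store, mh) {
  const prefix = `${mh}/cid/`
  return store.range(prefix, prefix + '\uffff').map(([key]) => {
    const [version, codec] = key.slice(prefix.length).split('/')
    return { version: Number(version), codec: Number(codec) }
  })
}

const store = new OrderedStore()
const block = Buffer.from('some block')
putBlock(store, '1220abcd', 1, 0x70, block)
putBlock(store, '1220abcd', 1, 0x55, block)
putBlock(store, '1220abce', 1, 0x71, block)

const variants = cidsFor(store, '1220abcd')
console.log(variants.length, store.get(dataKey('1220abcd')) === block)
```

Fetching the block is a single `get`, while listing the CIDs for a multihash is one bounded scan, which is why this design depends on range-query support.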
It has been a while since I wrote custom databases with level + bytewise. Apparently there’s an alternative that just uses strings which is probably better for our needs https://github.com/dominictarr/charwise . |
I'm not sure about js, but the base encoding isn't the problem in go (we always use
This is mostly a DHT issue.
Bitswap should still use CIDs (it gives us more information). |
We'd store the CIDs in the index. Then we'd strip the codec part when looking the block up in the datastore. This would prevent us from walking the datastore to find the structure but we always start GC from some pin root. That pin root would be a CID. |
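As an illustration of "strip the codec part", the sketch below splits a binary CID into version, codec, and multihash using only the published CID layout (CIDv0 is a bare sha2-256 multihash, CIDv1 is `<version varint><codec varint><multihash>`). The helper names `readVarint` and `splitCid` are hypothetical and not taken from any js-ipfs package.

```javascript
// Read one unsigned varint (LEB128) from buf, starting at offset.
function readVarint (buf, offset) {
  let value = 0
  let shift = 0
  let pos = offset
  while (pos < buf.length) {
    const byte = buf[pos++]
    value += (byte & 0x7f) * 2 ** shift
    if ((byte & 0x80) === 0) return { value, next: pos }
    shift += 7
  }
  throw new Error('truncated varint')
}

// Hypothetical helper: split a binary CID into its parts.
function splitCid (cid) {
  // CIDv0 is a bare sha2-256 multihash: 0x12, 0x20, then 32 digest bytes.
  if (cid.length === 34 && cid[0] === 0x12 && cid[1] === 0x20) {
    return { version: 0, codec: 0x70, multihash: cid }
  }
  const version = readVarint(cid, 0)
  const codec = readVarint(cid, version.next)
  return {
    version: version.value,
    codec: codec.value,
    multihash: cid.subarray(codec.next) // this would be the datastore key
  }
}

// A made-up multihash and three CIDs that wrap it.
const multihash = Buffer.concat([Buffer.from([0x12, 0x20]), Buffer.alloc(32, 0xab)])
const v0 = multihash
const v1 = Buffer.concat([Buffer.from([0x01, 0x71]), multihash])
// Codec 0x0129 needs two varint bytes: 0xa9 0x02.
const wide = Buffer.concat([Buffer.from([0x01, 0xa9, 0x02]), multihash])

for (const cid of [v0, v1, wide]) {
  const parts = splitCid(cid)
  console.log(parts.version, parts.codec.toString(16), parts.multihash.equals(multihash))
}
```

All three CIDs resolve to the same datastore key, while the index can still keep the full CID for traversal from a pin root.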
The plan was to assume the raw ipld codec in those cases. For example, |
Thanks @Stebalien, it is clearer to me now!
You are right, in JS it is the same. I missed this part.
Well, in that case, this assumption simplifies everything. Yet don't you think there might be a use case of people doing something like |
@mikeal thanks for the tips! I was not aware of these tools, they are pretty smart! But I am not sure how applicable they are to our use case. If I understand it correctly, this targets leveldb very specifically, right? The problem I see is that we use the Datastore abstraction. In a browser that translates to a leveldb implementation, but in Node it translates to using the filesystem, as the goal is to have the same IPFS Repo for the JS and Go implementations. So I guess this would have to be worked directly into the Datastore interface and its implementations, which I think is out of scope for this initiative. |
From the docs we appear to use level for the datastore in both Node.js and the browser. Where they differ is the blockstore, if the docs are up to date. The defaults for Node.js and JS differ slightly but you can still configure Node.js to use a Let’s be a little careful about terminology around “leveldb.” For historical reasons, a whole collection of interfaces in JS are called “level*” but aren’t actually built on the leveldb project maintained by Google, it’s just that the community started with bindings to leveldb.
Interesting. I didn’t realize this was a requirement, but it makes changing the blockstore implementation more difficult than I had imagined. |
IMO, it's quite internal. We could store all known codecs for every block but I'd like to see some motivation for that first.
Note: The goal was to use the same repo layout within the datastore, not necessarily use the same datastore backends. We also don't absolutely have to do this. |
@Stebalien: I see. Well, I don't have any concrete use case and I am afraid I am not really able to assess the reach of this change, so I will leave that up to you. My only concern is about the migration, as it will lose data and will be semi-irreversible: we won't be able to reconstruct the original CID (but we can construct some CID using the IPLD raw codec). Should the migration be irreversible or semi-reversible using CIDv1 and the IPLD raw codec? |
I see, well this is something that the current Datastore's interface does not support...
Thanks for the clarification! |
BREAKING CHANGE: repo.blocks.query() now returns multihashes as keys instead of CIDs. If you want CIDs returned, call it as query({}, true), which constructs a CIDv1 using IPLD's RAW codec. This means that the constructed CID might not be equal to the one under which the block was originally saved. Related to ipfs/js-ipfs#2415
@Stebalien is 💯 at #2415 (comment)
Exactly the same in JS. JS is repo compatible with Go
bullseye. This is the main thing being solved by using the multihash as the key inside the ipfs repo. @mikeal this is part of the large endeavour of moving to base32Cids by default overall for IPLD (and hence, files) that was reviewed, re-reviewed, decided on the dev meetings of 2018 https://github.com/ipfs/ipfs/issues/337 |
Actually, I realized that if the reversion moves the blocks to CIDv0, then we should be able to achieve full reversibility. So unless somebody has objections, I will move forward this way. |
Blocking discussion: ipfs/specs#242 |
Integration of js-ipfs-repo-migrations brings automatic repo migrations to ipfs-repo (both in-browser and fs). It is possible to control the automatic migration using either the config setting 'repoDisableAutoMigration' or IPFSRepo's option 'disableAutoMigration'. BREAKING CHANGE: repo.blocks.query() now returns multihashes as keys instead of CIDs. If you want CIDs returned, call it as query({}, true), which constructs a CIDv1 using IPLD's RAW codec. This means that the constructed CID might not be equal to the one under which the block was originally saved. Related to ipfs/js-ipfs#2415
Integration of js-ipfs-repo-migrations brings automatic repo migrations to ipfs-repo (both in-browser and fs). It is possible to control the automatic migration using either the config setting 'repoDisableAutoMigration' or IPFSRepo's option 'disableAutoMigration'. BREAKING CHANGE: repo.blocks.query() now returns multihashes as keys instead of CIDs. If you want CIDs returned, call it as query({}, true), which constructs a CIDv1 using IPLD's RAW codec. This means that the constructed CID might not be equal to the one under which the block was originally saved. Related to ipfs/js-ipfs#2415 Co-authored-by: achingbrain <alex@achingbrain.net>
This is related to ipfs/js-ipfs#2415 Breaking changes: - Repo version incremented to `8`, requires a migration - Blocks are now stored using the multihash, not the full CID - `repo.blocks.query({})` now returns an async iterator that yields blocks - `repo.blocks.query({ keysOnly: true })` now returns an async iterator that yields CIDs - Those CIDs are v1 with the raw codec Co-authored-by: achingbrain <alex@achingbrain.net>
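To show what "v1 with the raw codec" means for a stored multihash key, here is a small self-contained sketch. It does not use ipfs-repo or the cids package. The function `rawCidV1FromMultihash` and the hand-written base32 encoder are illustrative helpers, built only from the CIDv1 layout (`0x01`, codec `0x55` for raw, then the multihash) and the multibase `b` prefix for lower-case unpadded base32.

```javascript
const crypto = require('crypto')

const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567'

// RFC 4648 base32, lower case, without padding.
function base32 (bytes) {
  let bits = 0
  let value = 0
  let out = ''
  for (const byte of bytes) {
    value = (value << 8) | byte
    bits += 8
    while (bits >= 5) {
      out += ALPHABET[(value >>> (bits - 5)) & 31]
      bits -= 5
    }
    value &= (1 << bits) - 1 // keep only the bits not yet emitted
  }
  if (bits > 0) out += ALPHABET[(value << (5 - bits)) & 31]
  return out
}

// Illustrative helper: wrap a multihash key as a CIDv1 with the raw codec.
function rawCidV1FromMultihash (multihash) {
  const cidBytes = Buffer.concat([Buffer.from([0x01, 0x55]), multihash])
  return 'b' + base32(cidBytes) // 'b' is the multibase prefix for base32
}

// Multihash of an empty block: sha2-256 (0x12), 32 bytes (0x20), digest.
const digest = crypto.createHash('sha256').update(Buffer.alloc(0)).digest()
const emptyMultihash = Buffer.concat([Buffer.from([0x12, 0x20]), digest])

const cidString = rawCidV1FromMultihash(emptyMultihash)
console.log(cidString)
```

A block that was originally added as, say, dag-pb would come back from such a query with a different CID string than the one it was saved under, even though the multihash is identical.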
This is part of #1440 endeavor.
Motivation
Currently, the Block-store (a key-value store) uses the CID as the key for a block's data. As a CIDv1 can be encoded with different bases, it can happen that the same data is duplicated several times because of CIDs with different base encodings. The main motivation to tackle this problem is the shift from CIDv0 (i.e. base58) to CIDv1 (base32 by default, yet as mentioned any other encoding is also possible).
Solution
Use CID's multihash as a key in Block-store.
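A minimal sketch of why this removes duplicates, assuming binary CIDs and using only Node's `crypto` module: the same bytes can be addressed as a CIDv0, a CIDv1 with the dag-pb codec, or a CIDv1 with the raw codec, and all three share one multihash. The two `Map` objects below are stand-ins for the blockstore, and the prefix stripping assumes single-byte version and codec values.

```javascript
const crypto = require('crypto')

// Multicodec codes: sha2-256, raw, dag-pb.
const SHA2_256 = 0x12
const RAW = 0x55
const DAG_PB = 0x70

// multihash = <hash function code><digest length><digest>
function sha256Multihash (data) {
  const digest = crypto.createHash('sha256').update(data).digest()
  return Buffer.concat([Buffer.from([SHA2_256, digest.length]), digest])
}

// CIDv1 = <version><codec><multihash>; the codes used here fit in one byte each.
function cidV1 (codec, multihash) {
  return Buffer.concat([Buffer.from([0x01, codec]), multihash])
}

const data = Buffer.from('hello block')
const mh = sha256Multihash(data)

// Three identifiers for the same bytes; a CIDv0 is the bare multihash.
const cids = [mh, cidV1(DAG_PB, mh), cidV1(RAW, mh)]

const byCid = new Map()
const byMultihash = new Map()
for (const cid of cids) {
  byCid.set(cid.toString('hex'), data)
  // Drop the two-byte CIDv1 prefix to recover the multihash.
  const key = cid[0] === 0x01 ? cid.subarray(2) : cid
  byMultihash.set(key.toString('hex'), data)
}

console.log(byCid.size, byMultihash.size)
```

Keyed by CID the store holds three copies of the block, keyed by multihash it holds one.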
Parts affected
This change will ripple through several commands/packages. Here is a list of things I have discovered in my analysis. The main parts affected will be the parts of the code that use `query` on the repo's blockstore.
Problems and possible solutions
- `ipfs refs local` returns a list of CIDs of locally stored objects. One possible solution is to build `{ cid: key, data: buff }` and store that in the datastore. Another is to keep a separate Map stored aside to track this, yet there is a possibility of "conflicts" (e.g. several CIDs having the same multihash). How should that be handled?
- Class `Block` (in `js-ipfs-block`) has a `cid` property. Should it be changed to `multihash`? It is used heavily in many packages, so I guess not, but then it won't always be possible to create a Block with a CID (e.g. see the previous problem).
Questions
As discussed in the weekly call, @Stebalien mentioned that "provider records need to use raw multihashes". Is this related to Bitswap and `ipfs dht findprovs`, @Stebalien? If so, does it mean that Bitswap should be changed to use/negotiate/exchange multihashes instead of CIDs? How far should this ripple? Content routing? I am not so familiar with this part of the codebase, so I will need some guidance on this. Also, I am not sure whether this needs to happen right away. I feel like this is related to, but not required for, what we are doing here right now.
@alanshaw please also provide your input.