Migration of Blockstore to use multihash instead of CID as key #2415

AuHau · 2019-09-02T14:57:50Z

This is part of #1440 endeavor.

Motivation

Currently, Block-store (key-value store) uses CID as a key for the block's data. As CIDv1 can have different bases used for encoding, it can happen that the same data will be duplicated several times because of CID with different base encodings. The main motivation to tackle this problem is the shift from CIDv0 (eq. base58) to CIDv1 (eq. default base32, yet as mentioned any other encoding is also possible).

Solution

Use CID's multihash as a key in Block-store.

Parts affected

This change will ripple through several commands/packages. Here is a list of things I have discovered in analysis. The main parts affected will be related to parts of code that uses query on the repo's blockstore.

Garbage Collection - lists all stored blocks CIDs, compare it with pinned CIDs and remove those not pinned.
- This will have to be changed to not depend on comparing CIDs but multihashes. Eq. take only pinset, extract multihashes out of those and do GC based on those.

Problems and possible solutions

ipfs refs local - returns list of CIDs of locally stored objects
- Constructing new CIDs - We could return new CIDs with base32 wrapped around the stored multihashes. But if somebody would store a block under different encoding, then they won't find it in this listing.
- Retain original CID - We could wrap the data in object to keep the original CID as metadata something like: { cid: key, data: buff } and store that in datastore. Or have different Map stored aside to track this, yet there will be a possibility of "conflicts" (eq. several CIDs having the same multihash), how should that be handled?
Class Block (in js-ipfs-block) has cid property, should it be changed to multihash? This is used heavily in many packages though, so I guess not, but then it won't be always possible to create a Block with CID (eq. see the previous problem).

Questions

As discussed in weekly call, @Stebalien mentioned that "provider records need to use raw multihashes". Is this related to Bitswap and ipfs dht findprovs? @Stebalien? If so, then does it mean that Bitswap should be changed to use/negotiate/exchange around multihashes instead of CIDs? How far should this ripple? Content routing? I am not so familiar with this part of the codebase, so I will need some guidance on this. Also I am not sure if this needs to happen right away? I feel like this is related but not required for what we are doing here right now.

@alanshaw please also provide your input.

The text was updated successfully, but these errors were encountered:

mikeal · 2019-09-03T17:28:00Z

I’m a little concerned about how this could paint us into a corner in the long term.

This change makes sense for the current implementation of our storage engine but there’s a lot of upgrades we could make in the future that wouldn’t be possible if we aren’t keying on the CID.

For instance, we could dramatically speed up GC if we were creating indexes at write time of the blocks that are linked to by each block. This is only possible if we know how to interpret the block, which means we need the CID. If we migrate the internals to lose this information we won’t be able to make improvements like this in the future.

AuHau · 2019-09-04T08:27:56Z

You are definitely right, that this should be thought out in long-term scope! Yet, not sure if I will be able to give much input on that part.

What I feel though, is that with the introduction of CIDv1, the CID stopped being good identifier as CIDv0 which I believe was designed to be such a thing and hence was used in implementations as such. I am not sure if this was not already discussed somewhere earlier, but don't you think that the position of CID has bit shifted with introduction of CIDv1?

Example of that could be the problem that @Stebalien mentioned. If I understand it correctly if you have now stored some data under base58 CID and then try to fetch it using base32 CID, IPFS won't find this content because it relies upon CIDs and not hashes.

And for this reason, I think the shift to some better identifier is needed and I guess multihashes seem a pretty good option. Not sure if there is some other option.

As I mentioned in my ipfs refs local point, we don't have to throw away the CID information. We could store it aside for exactly this sort of purposes.

mikeal · 2019-09-04T17:55:16Z

If I understand it correctly if you have now stored some data under base58 CID and then try to fetch it using base32 CID, IPFS won't find this content because it relies upon CIDs and not hashes.

I’m not familiar with this specific problem, but that sounds like a base encoding issue and not a CID vs Multihash issue. You’d have the same issue with multihash if you were storing it by the base encoded hash and accepting base encoded strings as identifiers.

On the IPLD side, we’ve taken to using instances of CID or instances of Block everywhere we can and never using strings as keys above the lowest storage layers. If a storage layer needs a string key, that base encoding happens close enough to the storage driver that it can remain consistent regardless of what base encoding the CID may have had in the application. There’s been a bunch of caching work in the CID class in order to avoid unnecessary base encodings to keep this performant.

As I mentioned in my ipfs refs local point, we don't have to throw away the CID information. We could store it aside for exactly this sort of purposes.

That’s certainly a possibility, but it would impact performance as it would mean an additional write whenever a block is added or removed. However, if we were to do something like the optimization I suggested above, we’d already be doing this kind of thing anyway.

Moving to a more complex storage system with indexing should be on the roadmap somewhere. IMO, that’s a pre-requisite to moving to storing the underlying block data keyed by multihash (which I think is a good idea once we have a more sophisticated storage layer).

In other words, my concerns are only regarding what we consider the key at a high level, which I think should be a CID instance. If the underlying storage layer preserves that data but de-duplicates blocks by storing the raw block data by multihash, great!

AuHau · 2019-09-05T07:47:45Z

In other words, my concerns are only regarding what we consider the key at a high level, which I think should be a CID instance. If the underlying storage layer preserves that data but de-duplicates blocks by storing the raw block data by multihash, great!

I believe this is pretty much what we are aiming to do. Keep CID pretty much throughout whole internals, only at few lowest components (storage and maybe Bitswap(?)) use multihash instead.

PS.: Thanks for fixing the typo in title

AuHau · 2019-09-06T15:10:02Z

As I am looking into this more and more (playing around with migration for this), I think that it is necessary to have a way to get the original CID. This is mainly important for query what blocks are available.

I am inclining to approach of extending the block's data with some metadata that will enable rebuild the CID(s) for the block. The rebuild process will be needed only for querying purpose.

My idea is to have the following data schema:

<countOfCids:vint>(<cidVersion:vint><multibase-prefix><multicodec-content-type>)*countOfCids<blocksData>

This should not add big space overhead as the biggest part of CID is the multihash itself, which will be available already as the key of the block. Also, this approach will not require additional write/read as the "map" idea would have. The whole concept would be hidden in the implementation of blockstore of ipfs-repo and hence hidden to other components.

What are your thoughts about this?

mikeal · 2019-09-06T18:59:00Z

It’s not entirely clear to me how you’re storing that.

One thing I’d point you at that might help is bytewise: https://github.com/deanlandolt/bytewise

This would allow you to encode much more complex sorting rules and additional data into the key rather than the value. So, if you want this metadata you could just encode [mutlihash, ...moreData] as the key and when you query for a block you would just do a range query for {start: bytewise([multihash, null]), end: bytewise([multihash, {}])}. This could keep the value cleaner (just the raw block data).

In fact, if what you need is a list of all the cids for a given multihash, and you know you need to support multiple cids for the same multihash, I’d store this on every write:

const bulk = [
  { key: bytewise([multihash, true]), value: blockData },
  { key: bytewise([multihash, cidVersion, cidCodec]), value: null }
]

Now it’s super easy to get a block by multihash, just get bytewise([multihash, true]) and if you need a list of CIDS you just do {start: bytewise([mutlihash, 0]), end: bytewise([multihash, {}]) }.

This locks us into k/v stores that support range queries, but I think we’re heading in that direction anyway.

mikeal · 2019-09-06T19:01:45Z

It has been a while since I wrote custom databases with level + bytewise. Apparently there’s an alternative that just uses strings which is probably better for our needs https://github.com/dominictarr/charwise .

Stebalien · 2019-09-07T03:01:41Z

Currently, Block-store (key-value store) uses CID as a key for the block's data. As CIDv1 can have different bases used for encoding, it can happen that the same data will be duplicated several times because of CID with different base encodings. The main motivation to tackle this problem is the shift from CIDv0 (eq. base58) to CIDv1 (eq. default base32, yet as mentioned any other encoding is also possible).

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)). The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

As discussed in weekly call, @Stebalien mentioned that "provider records need to use raw multihashes". Is this related to Bitswap and ipfs dht findprovs

This is mostly a DHT issue.

ipfs dht findprovs can continue to take CIDs (and extract the multihash part).
The DHT will need to be changed to take multihashes instead of CIDs. FindProviders could take either but Provide will have to take multihashes as the provider subsystem won't actually be able to generate CIDs (it walks the datastore and the datastore will be using raw multihashes).

Bitswap should still use CIDs (it gives us more information).

Stebalien · 2019-09-07T03:04:51Z

@mikeal

For instance, we could dramatically speed up GC if we were creating indexes at write time of the blocks that are linked to by each block. This is only possible if we know how to interpret the block, which means we need the CID. If we migrate the internals to lose this information we won’t be able to make improvements like this in the future.

We'd store the CIDs in the index. Then we'd strip the codec part when looking the block up in the datastore. This would prevent us from walking the datastore to find the structure but we always start GC from some pin root. That pin root would be a CID.

Stebalien · 2019-09-07T03:06:38Z

This is mainly important for query what blocks are available.

The plan was to assume the raw ipld codec in those cases. For example, ipfs refs local would return a list of raw blocks. In all other cases (pins, get, cat, etc.), we have the CID of the root.

AuHau · 2019-09-12T07:55:21Z

Thanks @Stebalien it is more clear to me know!

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)). The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

You are right, in JS it is the same. I missed this part.

The plan was to assume the raw ipld codec in those cases. For example, ipfs refs local would return a list of raw blocks. In all other cases (pins, get, cat, etc.), we have the CID of the root.

Well, in that case, this assumption simplifies everything. Yet don't you think there might be use-case of people doing something like ipfs pin add <cid> and then at some point ipfs refs local | grep <cid>? I am not sure how ipfs refs local is to be looked upon and how much "internal" this command is.

AuHau · 2019-09-12T08:11:07Z

@mikeal thanks for the tips! I was not aware of these tools, it is pretty smart!

But I am not sure how much applicable it is for our use case. If I understand it correctly this targets very specifically leveldb, right? The problem I see is that we use Datastore abstraction. In a browser that translates to leveldb implementation, but in Node it translates to using the filesystem as the goal is to have the same IPFS Repo for JS and Go implementation. So I guess this would have to be worked directly into the interface of Datastore and its implementations, that I think is out of scope for this initiative.

mikeal · 2019-09-12T16:26:57Z

charwise is useful for any key/value store that supports sorted range queries. It’s just a way to encode keys with some nice sorting semantics and when you use arrays you get column-like sorting. Wouldn’t be great for a pure filesystem approach though.

The problem I see is that we use Datastore abstraction. In a browser that translates to leveldb implementation, but in Node it translates to using the filesystem

From the docs we appear to use level for the datastore in both Node.js and the brower. Where they differ is the blockstore, if the docs are up to date.

The defaults for Node.js and JS differ slightly but you can still configure Node.js to use a levelup compatible store for both — which is just an interface that has storage layer options for filesystem, s3, actual leveldb, IndexedDatabase, etc.

Let’s be a little careful about terminology around “leveldb.” For historical reasons, a whole collection of interfaces in JS are called “level*” but aren’t actually built on the leveldb project maintained by Google, it’s just that the community started with bindings to leveldb.

as the goal is to have the same IPFS Repo for JS and Go implementation

Interesting. I didn’t realize this was a requirement, but it makes changing the blockstore implementation more difficult than I had imagined.

Stebalien · 2019-09-13T00:26:36Z

@AuHau

I am not sure how ipfs refs local is to be looked upon and how much "internal" this command is.

IMO, it's quite internal. We could store all known codecs for every block but I'd like to see some motivation for that first.

I didn’t realize this was a requirement, but it makes changing the blockstore implementation more difficult than I had imagined.

Note: The goal was to use the same repo layout within the datastore, not necessarily use the same datastore backends. We also don't absolutely have to do this.

AuHau · 2019-09-13T07:01:22Z

@Stebalien: I see. Well, I don't have any concrete use-case and I am afraid I am not really able to asses the reach of this change so I will leave that up to you.

My only concern is about the migration perspective as the migration will lose data and will be semi-irreversible as we won't be able to reconstruct the original CID (but we can construct some CID using the IPLD raw codec). Should the migration be irreversible or semi-reversible using CIDv1 and IPLD raw codec?

AuHau · 2019-09-13T07:22:23Z

@mikeal

charwise is useful for any key/value store that supports sorted range queries. It’s just a way to encode keys with some nice sorting semantics and when you use arrays you get column-like sorting. Wouldn’t be great for a pure filesystem approach though.

I see, well this is something that the current Datastore's interface does not support...

Let’s be a little careful about terminology around “leveldb.” For historical reasons, a whole collection of interfaces in JS are called “level*” but aren’t actually built on the leveldb project maintained by Google, it’s just that the community started with bindings to leveldb.

Thanks for the clarification!

BREAKING CHANGE: repo.blocks.query() now returns multihashes as a key instead of CID. If you want to have CID returned call it as query({}, true), which will constructs CIDv1 using IPLD's RAW codec. This means that this constructed CID might not equal to the one that the block was originally saved. Related to ipfs/js-ipfs#2415

Related to ipfs/js-ipfs#2415

Related to #2415

Related to ipfs/js-ipfs#2415

BREAKING CHANGE: repo.blocks.query() now returns multihashes as a key instead of CID. If you want to have CID returned call it as query({}, true), which will constructs CIDv1 using IPLD's RAW codec. This means that this constructed CID might not equal to the one that the block was originally saved. Related to ipfs/js-ipfs#2415

daviddias · 2019-10-10T09:31:03Z

@Stebalien is 💯 at #2415 (comment)

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)).

Exactly the same in JS. JS is repo compatible with Go

The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

bullseye. This is the main thing being solved by using the multihash as the key inside the ipfs repo.

@mikeal this is part of the large endeavour of moving to base32Cids by default overall for IPLD (and hence, files) that was reviewed, re-reviewed, decided on the dev meetings of 2018 https://github.com/ipfs/ipfs/issues/337

Related to #2415

Related to ipfs/js-ipfs#2415

Related to #2415

Related to ipfs/js-ipfs#2415

Related to #2415

AuHau · 2019-10-20T14:28:34Z

Should the migration be irreversible or semi-reversible using CIDv1 and IPLD raw codec?

Actually I realized that if reversion will move the blocks to CIDv0, then we should be able to achieve full reversibility. So unless somebody has some objections, I will move forward this way.

BREAKING CHANGE: repo.blocks.query() now returns multihashes as a key instead of CID. If you want to have CID returned call it as query({}, true), which will constructs CIDv1 using IPLD's RAW codec. This means that this constructed CID might not equal to the one that the block was originally saved. Related to ipfs/js-ipfs#2415

Stebalien · 2020-03-12T21:15:19Z

Blocking discussion: ipfs/specs#242

Integration of js-ipfs-repo-migrations brings automatic repo migrations to ipfs-repo (both in-browser and fs). It is possible to control the automatic migration using either config's setting 'repoDisableAutoMigration' or IPFSRepo's option 'disableAutoMigration'. BREAKING CHANGE: repo.blocks.query() now returns multihashes as a key instead of CID. If you want to have CID returned call it as query({}, true), which will constructs CIDv1 using IPLD's RAW codec. This means that this constructed CID might not equal to the one that the block was originally saved. Related to ipfs/js-ipfs#2415

Integration of js-ipfs-repo-migrations brings automatic repo migrations to ipfs-repo (both in-browser and fs). It is possible to control the automatic migration using either config's setting 'repoDisableAutoMigration' or IPFSRepo's option 'disableAutoMigration'. BREAKING CHANGE: repo.blocks.query() now returns multihashes as a key instead of CID. If you want to have CID returned call it as query({}, true), which will constructs CIDv1 using IPLD's RAW codec. This means that this constructed CID might not equal to the one that the block was originally saved. Related to ipfs/js-ipfs#2415 Co-authored-by: achingbrain <alex@achingbrain.net>

This is related to ipfs/js-ipfs#2415 Breaking changes: - Repo version incremented to `8`, requires a migration - Blocks are now stored using the multihash, not the full CID - `repo.blocks.query({})` now returns an async iterator that yields blocks - `repo.blocks.query({ keysOnly: true })` now returns an async iterator that yields CIDs - Those CIDs are v1 with the raw codec Co-authored-by: achingbrain <alex@achingbrain.net>

AuHau self-assigned this Sep 2, 2019

mikeal changed the title ~~Migration of Blockstore to use multihas instead of CID as key~~ Migration of Blockstore to use multihash instead of CID as key Sep 4, 2019

AuHau mentioned this issue Oct 2, 2019

Create an utility for IPFS Repo migrations #1115

Closed

AuHau mentioned this issue Oct 3, 2019

feat: store blocks under multihash key ipfs/js-ipfs-repo#211

Merged

AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 8, 2019

feat: support for reconstructed CIDs of refs-local

920edcf

Related to ipfs/js-ipfs#2415

AuHau mentioned this issue Oct 8, 2019

feat: support for reconstructed CIDs of refs-local ipfs-inactive/interface-js-ipfs-core#540

Open

AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 8, 2019

feat: support for reconstructed CIDs of refs-local

0abb4a5

Related to ipfs/js-ipfs#2415

AuHau added a commit that referenced this issue Oct 8, 2019

feat: refs-local reconstructs CIDs

6d6e8c6

Related to #2415

AuHau mentioned this issue Oct 8, 2019

Migration of Blockstore to use multihash instead of CID as key #2522

Closed

9 tasks

AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 9, 2019

feat: documentation of GC using multihashes

9e1b754

Related to ipfs/js-ipfs#2415

AuHau added a commit that referenced this issue Oct 15, 2019

feat: refs-local reconstructs CIDs

83b606f

Related to #2415

AuHau added a commit to ipfs-inactive/js-ipfs-http-client that referenced this issue Oct 15, 2019

feat: store blocks under multihash key

c153482

Related to ipfs/js-ipfs#2415

AuHau mentioned this issue Oct 15, 2019

feat: store blocks under multihash key ipfs-inactive/js-ipfs-http-client#1128

Open

AuHau added a commit that referenced this issue Oct 15, 2019

feat: refs-local reconstructs CIDs

9df5dc1

Related to #2415

AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 16, 2019

feat: support for reconstructed CIDs of refs-local

69c5a29

Related to ipfs/js-ipfs#2415

AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 16, 2019

feat: documentation of GC using multihashes

b148e21

Related to ipfs/js-ipfs#2415

AuHau added a commit that referenced this issue Oct 16, 2019

feat: refs-local reconstructs CIDs

6fecca8

Related to #2415

alanshaw mentioned this issue Nov 11, 2019

Make js-ipfs capable of working with cidv1b32 #1440

Closed

vmx mentioned this issue Jan 22, 2020

CAR format spec doc ipld/specs#230

Closed

koivunej mentioned this issue Mar 10, 2020

BlockStore considerations rs-ipfs/rust-ipfs#84

Closed

Stebalien mentioned this issue Mar 12, 2020

Naming blocks in the datastore ipfs/specs#242

Open

achingbrain assigned achingbrain and unassigned AuHau Jun 25, 2020

achingbrain added the status/in-progress In progress label Jun 25, 2020

achingbrain linked a pull request Jun 25, 2020 that will close this issue

feat: store blocks by multihash instead of CID #3124

Merged

achingbrain closed this as completed in #3124 Jul 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migration of Blockstore to use multihash instead of CID as key #2415

Migration of Blockstore to use multihash instead of CID as key #2415

AuHau commented Sep 2, 2019 •

edited

Loading

mikeal commented Sep 3, 2019

AuHau commented Sep 4, 2019

mikeal commented Sep 4, 2019

AuHau commented Sep 5, 2019 •

edited

Loading

AuHau commented Sep 6, 2019 •

edited

Loading

mikeal commented Sep 6, 2019

mikeal commented Sep 6, 2019

Stebalien commented Sep 7, 2019

Stebalien commented Sep 7, 2019

Stebalien commented Sep 7, 2019

AuHau commented Sep 12, 2019

AuHau commented Sep 12, 2019

mikeal commented Sep 12, 2019

Stebalien commented Sep 13, 2019

AuHau commented Sep 13, 2019

AuHau commented Sep 13, 2019

daviddias commented Oct 10, 2019

AuHau commented Oct 20, 2019

Stebalien commented Mar 12, 2020

Migration of Blockstore to use multihash instead of CID as key #2415

Migration of Blockstore to use multihash instead of CID as key #2415

Comments

AuHau commented Sep 2, 2019 • edited Loading

Motivation

Solution

Parts affected

Problems and possible solutions

Questions

mikeal commented Sep 3, 2019

AuHau commented Sep 4, 2019

mikeal commented Sep 4, 2019

AuHau commented Sep 5, 2019 • edited Loading

AuHau commented Sep 6, 2019 • edited Loading

mikeal commented Sep 6, 2019

mikeal commented Sep 6, 2019

Stebalien commented Sep 7, 2019

Stebalien commented Sep 7, 2019

Stebalien commented Sep 7, 2019

AuHau commented Sep 12, 2019

AuHau commented Sep 12, 2019

mikeal commented Sep 12, 2019

Stebalien commented Sep 13, 2019

AuHau commented Sep 13, 2019

AuHau commented Sep 13, 2019

daviddias commented Oct 10, 2019

AuHau commented Oct 20, 2019

Stebalien commented Mar 12, 2020

AuHau commented Sep 2, 2019 •

edited

Loading

AuHau commented Sep 5, 2019 •

edited

Loading

AuHau commented Sep 6, 2019 •

edited

Loading