This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Migration of Blockstore to use multihash instead of CID as key #2415

Closed
AuHau opened this issue Sep 2, 2019 · 19 comments · Fixed by #3124
Comments

@AuHau
Member

AuHau commented Sep 2, 2019

This is part of the #1440 endeavor.

Motivation

Currently, the block-store (a key-value store) uses the CID as the key for a block's data. Since CIDv1 can be encoded with different bases, the same data may end up duplicated several times under CIDs with different base encodings. The main motivation to tackle this is the shift from CIDv0 (base58) to CIDv1 (base32 by default, though any other encoding is possible).

Solution

Use the CID's multihash as the key in the block-store.
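
To make that concrete, here is a minimal sketch (cidToKey is a hypothetical helper, not the final js-ipfs-repo API): both the v0 and v1 forms of a CID resolve to the same blockstore key.

const CID = require('cids')

// Key on the multihash only; version, codec and base encoding are dropped.
function cidToKey (cid) {
  return '/' + cid.multihash.toString('hex')
}

const v0 = new CID('QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn') // a CIDv0
console.log(cidToKey(v0) === cidToKey(v0.toV1())) // true: one entry, not two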

Parts affected

This change will ripple through several commands/packages. Here is a list of things I have discovered during analysis. The main parts affected will be those related to code that uses query on the repo's blockstore.

  1. Garbage Collection - lists the CIDs of all stored blocks, compares them with the pinned CIDs and removes those not pinned.
    • This will have to be changed to compare multihashes rather than CIDs, e.g. take only the pinset, extract the multihashes out of it and run GC based on those (see the sketch below).
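
A minimal sketch of that GC flow follows; blockstore.queryKeys and blockstore.delete are hypothetical names, used purely to show that the comparison happens on multihashes rather than CIDs (a real GC would also traverse the DAG from the pin roots to collect indirect pins).

async function gc (blockstore, pinnedCids) {
  // Collect the multihashes of everything pinned (directly or indirectly).
  const pinned = new Set()
  for await (const cid of pinnedCids) {
    pinned.add(cid.multihash.toString('hex'))
  }

  // Delete any stored block whose multihash key is not in the pinset.
  for await (const multihash of blockstore.queryKeys({})) {
    if (!pinned.has(multihash.toString('hex'))) {
      await blockstore.delete(multihash)
    }
  }
}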

Problems and possible solutions

  1. ipfs refs local - returns a list of CIDs of locally stored objects

    • Constructing new CIDs - We could return new base32 CIDs wrapped around the stored multihashes (see the sketch after this list). But if somebody stored a block under a different encoding, they won't find it in this listing.
    • Retain the original CID - We could wrap the data in an object to keep the original CID as metadata, something like { cid: key, data: buff }, and store that in the datastore. Or keep a separate map aside to track this, yet there is the possibility of "conflicts" (e.g. several CIDs having the same multihash); how should that be handled?
  2. The Block class (in js-ipfs-block) has a cid property; should it be changed to multihash? It is used heavily in many packages though, so I guess not, but then it won't always be possible to create a Block with a CID (e.g. see the previous problem).
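
For the "constructing new CIDs" option above, a sketch could look like this (listLocalRefs and blockstore.queryKeys are illustrative names; the 'raw' codec is an assumption, since the original codec is no longer known at this point):

const CID = require('cids')

// Wrap each stored multihash in a CIDv1. The result may differ from the
// CID the block was originally added under.
async function * listLocalRefs (blockstore) {
  for await (const multihash of blockstore.queryKeys({})) {
    yield new CID(1, 'raw', multihash)
  }
}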

Questions

As discussed in the weekly call, @Stebalien mentioned that "provider records need to use raw multihashes". Is this related to Bitswap and ipfs dht findprovs, @Stebalien? If so, does that mean Bitswap should be changed to use/negotiate/exchange multihashes instead of CIDs? How far should this ripple? Content routing? I am not so familiar with this part of the codebase, so I will need some guidance on this. Also, I am not sure this needs to happen right away; it feels related but not required for what we are doing here right now.

@alanshaw please also provide your input.

@AuHau AuHau self-assigned this Sep 2, 2019
@mikeal
Contributor

mikeal commented Sep 3, 2019

I’m a little concerned about how this could paint us into a corner in the long term.

This change makes sense for the current implementation of our storage engine but there’s a lot of upgrades we could make in the future that wouldn’t be possible if we aren’t keying on the CID.

For instance, we could dramatically speed up GC if we were creating indexes at write time of the blocks that are linked to by each block. This is only possible if we know how to interpret the block, which means we need the CID. If we migrate the internals to lose this information we won’t be able to make improvements like this in the future.

@AuHau
Member Author

AuHau commented Sep 4, 2019

You are definitely right that this should be thought out with the long term in mind! Yet I am not sure I will be able to give much input on that part.

What I feel, though, is that with the introduction of CIDv1 the CID stopped being as good an identifier as CIDv0, which I believe was designed to be exactly that and was therefore used as such in implementations. I am not sure whether this was already discussed somewhere earlier, but don't you think that the position of the CID has shifted a bit with the introduction of CIDv1?

An example of that could be the problem that @Stebalien mentioned. If I understand it correctly, if you store some data under a base58 CID and then try to fetch it using a base32 CID, IPFS won't find this content because it relies upon CIDs and not hashes.

For this reason, I think a shift to some better identifier is needed, and multihashes seem a pretty good option. I am not sure if there is another option.

As I mentioned in my ipfs refs local point, we don't have to throw away the CID information. We could store it aside for exactly this sort of purpose.

@mikeal
Contributor

mikeal commented Sep 4, 2019

If I understand it correctly, if you store some data under a base58 CID and then try to fetch it using a base32 CID, IPFS won't find this content because it relies upon CIDs and not hashes.

I’m not familiar with this specific problem, but that sounds like a base encoding issue and not a CID vs Multihash issue. You’d have the same issue with multihash if you were storing it by the base encoded hash and accepting base encoded strings as identifiers.

On the IPLD side, we’ve taken to using instances of CID or instances of Block everywhere we can and never using strings as keys above the lowest storage layers. If a storage layer needs a string key, that base encoding happens close enough to the storage driver that it can remain consistent regardless of what base encoding the CID may have had in the application. There’s been a bunch of caching work in the CID class in order to avoid unnecessary base encodings to keep this performant.

As I mentioned in my ipfs refs local point, we don't have to throw away the CID information. We could store it aside for exactly this sort of purpose.

That’s certainly a possibility, but it would impact performance as it would mean an additional write whenever a block is added or removed. However, if we were to do something like the optimization I suggested above, we’d already be doing this kind of thing anyway.

Moving to a more complex storage system with indexing should be on the roadmap somewhere. IMO, that’s a pre-requisite to moving to storing the underlying block data keyed by multihash (which I think is a good idea once we have a more sophisticated storage layer).

In other words, my concerns are only regarding what we consider the key at a high level, which I think should be a CID instance. If the underlying storage layer preserves that data but de-duplicates blocks by storing the raw block data by multihash, great!

@mikeal mikeal changed the title Migration of Blockstore to use multihas instead of CID as key Migration of Blockstore to use multihash instead of CID as key Sep 4, 2019
@AuHau
Member Author

AuHau commented Sep 5, 2019

In other words, my concerns are only regarding what we consider the key at a high level, which I think should be a CID instance. If the underlying storage layer preserves that data but de-duplicates blocks by storing the raw block data by multihash, great!

I believe this is pretty much what we are aiming to do: keep the CID throughout the whole of the internals, and only in a few of the lowest components (storage and maybe Bitswap?) use the multihash instead.

PS: Thanks for fixing the typo in the title.

@AuHau
Member Author

AuHau commented Sep 6, 2019

As I look into this more and more (playing around with the migration for this), I think it is necessary to have a way to get the original CID. This is mainly important for querying which blocks are available.

I am inclined towards the approach of extending the block's data with some metadata that enables rebuilding the CID(s) for the block. The rebuilding process will only be needed for querying purposes.

My idea is to have the following data schema:

<countOfCids:vint>(<cidVersion:vint><multibase-prefix><multicodec-content-type>)*countOfCids<blocksData>

This should not add a big space overhead, as the biggest part of a CID is the multihash itself, which will already be available as the block's key. Also, this approach does not require the additional write/read that the "map" idea would. The whole concept would be hidden in the implementation of the ipfs-repo blockstore and hence invisible to other components.
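
A rough sketch of writing such an entry, using the varint package (the codec table and writeBlockEntry are hypothetical, only to make the schema above concrete):

const varint = require('varint')

// Partial multicodec table, just for the sketch.
const CODEC_CODES = { raw: 0x55, 'dag-pb': 0x70, 'dag-cbor': 0x71 }

function writeBlockEntry (cids, blockData) {
  const parts = [Buffer.from(varint.encode(cids.length))]

  for (const cid of cids) {
    parts.push(Buffer.from(varint.encode(cid.version)))
    parts.push(Buffer.from('b')) // multibase prefix; base32 assumed here
    parts.push(Buffer.from(varint.encode(CODEC_CODES[cid.codec])))
  }

  // The multihash itself is not stored; it is already the block's key.
  return Buffer.concat([...parts, blockData])
}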

What are your thoughts about this?

@mikeal
Contributor

mikeal commented Sep 6, 2019

It’s not entirely clear to me how you’re storing that.

One thing I’d point you at that might help is bytewise: https://github.com/deanlandolt/bytewise

This would allow you to encode much more complex sorting rules and additional data into the key rather than the value. So, if you want this metadata you could just encode [multihash, ...moreData] as the key and when you query for a block you would just do a range query for {start: bytewise([multihash, null]), end: bytewise([multihash, {}])}. This could keep the value cleaner (just the raw block data).

In fact, if what you need is a list of all the cids for a given multihash, and you know you need to support multiple cids for the same multihash, I’d store this on every write:

const bytewise = require('bytewise').encode

const bulk = [
  // the block itself, keyed by multihash
  { key: bytewise([multihash, true]), value: blockData },
  // one index entry per CID variant seen for this multihash
  { key: bytewise([multihash, cidVersion, cidCodec]), value: null }
]

Now it’s super easy to get a block by multihash, just get bytewise([multihash, true]) and if you need a list of CIDs you just do {start: bytewise([multihash, 0]), end: bytewise([multihash, {}]) }.

This locks us into k/v stores that support range queries, but I think we’re heading in that direction anyway.

@mikeal
Contributor

mikeal commented Sep 6, 2019

It has been a while since I wrote custom databases with level + bytewise. Apparently there’s an alternative that just uses strings, which is probably better for our needs: https://github.com/dominictarr/charwise

@Stebalien
Member

Currently, the block-store (a key-value store) uses the CID as the key for a block's data. Since CIDv1 can be encoded with different bases, the same data may end up duplicated several times under CIDs with different base encodings. The main motivation to tackle this is the shift from CIDv0 (base58) to CIDv1 (base32 by default, though any other encoding is possible).

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)). The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

As discussed in the weekly call, @Stebalien mentioned that "provider records need to use raw multihashes". Is this related to Bitswap and ipfs dht findprovs

This is mostly a DHT issue.

  • ipfs dht findprovs can continue to take CIDs (and extract the multihash part), roughly as sketched below.
  • The DHT will need to be changed to take multihashes instead of CIDs. FindProviders could take either but Provide will have to take multihashes as the provider subsystem won't actually be able to generate CIDs (it walks the datastore and the datastore will be using raw multihashes).

Bitswap should still use CIDs (it gives us more information).
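
A sketch of that split (the dht.findProviders/dht.provide signatures below are assumptions for illustration, not the actual libp2p-kad-dht API):

const CID = require('cids')

// The CLI keeps accepting CIDs; only the multihash reaches the routing layer.
async function findprovs (dht, cidStr) {
  const cid = new CID(cidStr)
  return dht.findProviders(cid.multihash)
}

// The provider subsystem walks the datastore, so it only ever has multihashes.
async function provide (dht, multihash) {
  return dht.provide(multihash)
}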

@Stebalien
Member

@mikeal

For instance, we could dramatically speed up GC if we were creating indexes at write time of the blocks that are linked to by each block. This is only possible if we know how to interpret the block, which means we need the CID. If we migrate the internals to lose this information we won’t be able to make improvements like this in the future.

We'd store the CIDs in the index. Then we'd strip the codec part when looking the block up in the datastore. This would prevent us from walking the datastore to find the structure but we always start GC from some pin root. That pin root would be a CID.

@Stebalien
Member

This is mainly important for querying which blocks are available.

The plan was to assume the raw ipld codec in those cases. For example, ipfs refs local would return a list of raw blocks. In all other cases (pins, get, cat, etc.), we have the CID of the root.

@AuHau
Member Author

AuHau commented Sep 12, 2019

Thanks @Stebalien, it is clearer to me now!

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)). The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

You are right, in JS it is the same. I missed this part.

The plan was to assume the raw ipld codec in those cases. For example, ipfs refs local would return a list of raw blocks. In all other cases (pins, get, cat, etc.), we have the CID of the root.

Well, in that case, this assumption simplifies everything. Yet don't you think there might be a use-case of people doing something like ipfs pin add <cid> and then at some point ipfs refs local | grep <cid>? I am not sure how ipfs refs local should be viewed and how much "internal" this command is.

@AuHau
Member Author

AuHau commented Sep 12, 2019

@mikeal thanks for the tips! I was not aware of these tools, they are pretty smart!

But I am not sure how applicable it is to our use case. If I understand it correctly, this targets leveldb very specifically, right? The problem I see is that we use the Datastore abstraction. In a browser that translates to a leveldb implementation, but in Node it translates to using the filesystem, as the goal is to have the same IPFS repo for the JS and Go implementations. So I guess this would have to be worked directly into the Datastore interface and its implementations, which I think is out of scope for this initiative.

@mikeal
Contributor

mikeal commented Sep 12, 2019

charwise is useful for any key/value store that supports sorted range queries. It’s just a way to encode keys with some nice sorting semantics and when you use arrays you get column-like sorting. Wouldn’t be great for a pure filesystem approach though.

The problem I see is that we use the Datastore abstraction. In a browser that translates to a leveldb implementation, but in Node it translates to using the filesystem

From the docs we appear to use level for the datastore in both Node.js and the browser. Where they differ is the blockstore, if the docs are up to date.

The defaults for Node.js and the browser differ slightly, but you can still configure Node.js to use a levelup-compatible store for both, which is just an interface that has storage-layer options for the filesystem, S3, actual leveldb, IndexedDB, etc.

Let’s be a little careful about terminology around “leveldb.” For historical reasons, a whole collection of interfaces in JS are called “level*” but aren’t actually built on the leveldb project maintained by Google, it’s just that the community started with bindings to leveldb.

as the goal is to have the same IPFS repo for the JS and Go implementations

Interesting. I didn’t realize this was a requirement, but it makes changing the blockstore implementation more difficult than I had imagined.

@Stebalien
Member

@AuHau

I am not sure how ipfs refs local is to be looked upon and how much "internal" this command is.

IMO, it's quite internal. We could store all known codecs for every block but I'd like to see some motivation for that first.

I didn’t realize this was a requirement, but it makes changing the blockstore implementation more difficult than I had imagined.

Note: The goal was to use the same repo layout within the datastore, not necessarily use the same datastore backends. We also don't absolutely have to do this.

@AuHau
Member Author

AuHau commented Sep 13, 2019

@Stebalien: I see. Well, I don't have any concrete use-case and I am afraid I am not really able to assess the reach of this change, so I will leave that up to you.

My only concern is the migration perspective, as the migration will lose data and will be semi-irreversible: we won't be able to reconstruct the original CID (but we can construct some CID using the IPLD raw codec). Should the migration be irreversible, or semi-reversible using CIDv1 and the IPLD raw codec?

@AuHau
Member Author

AuHau commented Sep 13, 2019

@mikeal

charwise is useful for any key/value store that supports sorted range queries. It’s just a way to encode keys with some nice sorting semantics and when you use arrays you get column-like sorting. Wouldn’t be great for a pure filesystem approach though.

I see; well, this is something that the current Datastore interface does not support...

Let’s be a little careful about terminology around “leveldb.” For historical reasons, a whole collection of interfaces in JS are called “level*” but aren’t actually built on the leveldb project maintained by Google, it’s just that the community started with bindings to leveldb.

Thanks for the clarification!

AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 3, 2019
BREAKING CHANGE: repo.blocks.query() now returns multihashes as keys
instead of CIDs. If you want CIDs returned, call it as query({}, true),
which constructs CIDv1 CIDs using IPLD's raw codec. This means the
constructed CID might not equal the one the block was originally saved
under. Related to ipfs/js-ipfs#2415
AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 3, 2019
AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 4, 2019
AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 7, 2019
AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 7, 2019
AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 8, 2019
AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 8, 2019
AuHau added a commit that referenced this issue Oct 8, 2019
AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 9, 2019
AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Oct 9, 2019
@daviddias
Member

@Stebalien is 💯 at #2415 (comment)

I'm not sure about js, but the base encoding isn't the problem in go (we always use base32(binary cid)).

Exactly the same in JS. JS is repo-compatible with Go.

The problem is the version. We shouldn't have to store a block twice, once for CID version 1 and once for CID version 0.

Bullseye. This is the main thing being solved by using the multihash as the key inside the IPFS repo.

@mikeal this is part of the large endeavour of moving to base32 CIDs by default overall for IPLD (and hence files) that was reviewed, re-reviewed and decided on at the dev meetings of 2018: https://github.com/ipfs/ipfs/issues/337

AuHau added a commit that referenced this issue Oct 15, 2019
AuHau added a commit to ipfs-inactive/js-ipfs-http-client that referenced this issue Oct 15, 2019
AuHau added a commit that referenced this issue Oct 15, 2019
AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 16, 2019
AuHau added a commit to ipfs-inactive/interface-js-ipfs-core that referenced this issue Oct 16, 2019
AuHau added a commit that referenced this issue Oct 16, 2019
@AuHau
Member Author

AuHau commented Oct 20, 2019

Should the migration be irreversible or semi-reversible using CIDv1 and IPLD raw codec?

Actually, I realized that if the reversal moves the blocks back to CIDv0, then we should be able to achieve full reversibility. So unless somebody has objections, I will move forward this way.

AuHau added a commit to ipfs/js-ipfs-repo that referenced this issue Nov 6, 2019
@Stebalien
Member

Blocking discussion: ipfs/specs#242

achingbrain pushed a commit to ipfs/js-ipfs-repo that referenced this issue Jun 18, 2020
Integration of js-ipfs-repo-migrations brings automatic repo migrations
to ipfs-repo (both in-browser and fs). It is possible to control the
automatic migration using either the config setting
'repoDisableAutoMigration' or the IPFSRepo option 'disableAutoMigration'.

BREAKING CHANGE: repo.blocks.query() now returns multihashes as keys
instead of CIDs. If you want CIDs returned, call it as query({}, true),
which constructs CIDv1 CIDs using IPLD's raw codec. This means the
constructed CID might not equal the one the block was originally saved
under. Related to ipfs/js-ipfs#2415
achingbrain pushed a commit to ipfs/js-ipfs-repo that referenced this issue Jun 18, 2020
achingbrain pushed a commit to ipfs/js-ipfs-repo that referenced this issue Jun 19, 2020
achingbrain added a commit to ipfs/js-ipfs-repo that referenced this issue Jun 19, 2020
achingbrain added a commit to ipfs/js-ipfs-repo that referenced this issue Jun 19, 2020
achingbrain added a commit to ipfs/js-ipfs-repo that referenced this issue Jun 19, 2020
achingbrain added a commit to ipfs/js-ipfs-repo that referenced this issue Jun 19, 2020
achingbrain added a commit to ipfs/js-ipfs-repo that referenced this issue Jun 25, 2020
This is related to ipfs/js-ipfs#2415

Breaking changes:

- Repo version incremented to `8`, requires a migration
- Blocks are now stored using the multihash, not the full CID
- `repo.blocks.query({})` now returns an async iterator that yields blocks
- `repo.blocks.query({ keysOnly: true })` now returns an async iterator that yields CIDs
  - Those CIDs are v1 with the raw codec

Co-authored-by: achingbrain <alex@achingbrain.net>
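
For reference, usage after this change looks roughly like the following (repo is an IPFSRepo instance; the yielded CIDs are v1 with the raw codec, so they may not match the CID a block was originally written under):

for await (const block of repo.blocks.query({})) {
  // block data, keyed internally by multihash
}

for await (const cid of repo.blocks.query({ keysOnly: true })) {
  console.log(cid.version, cid.codec) // 1 'raw'
}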
@achingbrain achingbrain assigned achingbrain and unassigned AuHau Jun 25, 2020
@achingbrain achingbrain added the status/in-progress In progress label Jun 25, 2020
@achingbrain achingbrain linked a pull request Jun 25, 2020 that will close this issue