
feat: w3 index spec #121

Merged · 5 commits merged from feat/w3-index into main on Apr 22, 2024

Conversation

@Gozala (Collaborator) commented Apr 17, 2024

📽️ Preview

This is a draft of the proposed content index publishing spec, spun off from the https://github.com/web3-storage/RFC/blob/main/rfc/ipni-w3c.md RFC from @gammazero.

@Gozala (Collaborator, Author) commented Apr 17, 2024

@gammazero I'm putting down what we talked about face to face today to get feedback from @alanshaw and to have a place to discuss tradeoffs.

@Gozala (Collaborator, Author) commented Apr 17, 2024

Things requiring consensus

  • Should the location commitment be part of the proposed bundle format?

Questions that discussions with @gammazero have inspired

  • Can we support partial queries (e.g. I have a content CID and want to query IPNI to get links to the shards, as opposed to getting links to the index archive)?
  • Can we support joins in queries (e.g. I have a content CID and I want to query IPNI to get shard-named links from the published index, and then query IPNI by those as well)?

@Gozala (Collaborator, Author) commented Apr 17, 2024

[ ] Should the location commitment be part of the proposed bundle format?

I have been noodling on this for a while and, after discussion with @gammazero, I came to the conclusion that we should not bundle location commitments, for the following reasons.

  1. The index format described in this PR represents an immutable claim about the DAG; once verified and published it should not require any maintenance, neither from us nor from the user publishing it.
  • With enough consideration the format can also be made deterministic, meaning two different users would produce the same index for the same data, which I think should be our goal.
  2. The location commitment, on the other hand, is temporary, because content location is mutable.
  • By bundling immutable facts with mutable ones we impose a maintenance burden: whenever a commitment expires someone needs to either drop the index bundle or create an altered version with a different commitment.
  3. A location commitment can be revoked, requiring index invalidation.
  4. The same blob could be located in several places, which would imply creating a content index per location. With shared content that would be a cartesian product.
  5. A location commitment is to a specific space (owner), enabling them to authorize who can perform a read from that location. In other words, some locations MAY NOT be publicly readable, which raises the question of whether the index should be published without making the underlying blobs readable by everyone. By decoupling location from index we make it possible to find locations that authorize me separately from the content index itself.

The rationale for bundling was to cut down the roundtrips needed to replicate a DAG locally, which in most cases would be:

  1. IPNI Query
  2. Fetch linked Index
  3. Concurrently fetch all blobs containing DAG blocks

If we do not include the location commitment in a published index, the number of roundtrips increases by one if we can embed a location claim, or by two if we have to fetch the location commitment instead:

  1. IPNI Query
  2. Fetch linked Index
  3. IPNI Query for blob location (concurrent per each blob)
  4. Fetch blobs containing blocks (concurrent per each blob)
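
A minimal sketch of that read path in TypeScript, assuming hypothetical queryIPNI / fetchIndex / fetchBlob helpers rather than any real client API:

// Sketch only: hypothetical helpers stand in for the IPNI and HTTP clients.
declare function queryIPNI(multihash: string): Promise<string>
declare function fetchIndex(link: string): Promise<{ shards: [string, [string, number, number][]][] }>
declare function fetchBlob(location: string, slices: [string, number, number][]): Promise<Uint8Array>

async function replicateDag(dagCid: string): Promise<Uint8Array[]> {
  const indexLink = await queryIPNI(dagCid)        // 1. IPNI query for the DAG root
  const index = await fetchIndex(indexLink)        // 2. fetch the linked index
  return Promise.all(index.shards.map(async ([blob, slices]) => {
    const location = await queryIPNI(blob)         // 3. IPNI query per blob location
    return fetchBlob(location, slices)             // 4. fetch the blob bytes for the blocks
  }))
}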

💔 Extra roundtrips would be very unfortunate. However, perhaps it is worth considering how we can avoid them without introducing all the duplication and the maintenance burden when locations change. Here are a few ideas, without having thought too much about them:

  1. GraphQL is a tool designed exactly for cutting down roundtrips. I suspect it would be a lot simpler to put it in front of IPNI to cut down roundtrips, as opposed to doing all the index rebuilds each time a location changes (see the sketch after this list).
  2. Utilize datalog to do exactly the same as ☝️
    • Also note that this and the GraphQL variant would cache really well and could utilize commitment expirations to derive a TTL
  3. Instead of making index re-packing (swapping content location commitments) a user concern, we could instead derive a bundle containing the location commitments and the corresponding content index when the user publishes a location commitment.
    • This would also provide a TTL for us to prune on commitment expiry
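
To make the first idea concrete, here is a sketch of what a query against a GraphQL resolver sitting in front of IPNI could look like; the endpoint and schema are entirely hypothetical:

// Sketch only: https://indexer.example/graphql and the Resolve schema are assumptions.
const RESOLVE_QUERY = `
  query Resolve($dag: String!) {
    index(content: $dag) {
      shards {
        digest
        location { url expires }          # TTL derived from commitment expiry
        slices { digest offset length }
      }
    }
  }
`

async function resolveDag(dagCid: string): Promise<unknown> {
  const res = await fetch('https://indexer.example/graphql', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ query: RESOLVE_QUERY, variables: { dag: dagCid } })
  })
  return res.json()    // index, slices and locations resolved in a single roundtrip
}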

@Gozala
Copy link
Collaborator Author

Gozala commented Apr 17, 2024

So as far as I understand (after @gammazero explained this to me), when a user publishes the index (in some format, e.g. the one described in the spec here), we will create an IPNI advertisement with a reverse index. Specifically, given this example:

// "bag...index"
{
  "index/sharded/dag@0.1": {
    "content": { "/": "bafy..dag" },
    "shards": [
      link([
        // blob multihash
        { "/": { "bytes": "blb...left" } },
        // sliced within the blob
        [
          [{ "/": { "bytes": "block..1"} }, 0, 128],
          [{ "/": { "bytes": "block..2"} }, 129, 256],
          [{ "/": { "bytes": "block..3"} }, 257, 384],
          [{ "/": { "bytes": "block..4"} }, 385, 512]
        ]
      ]),
      link([
        // blob multihash
        { "/": { "bytes": "blb...right" } },
        // sliced within the blob
        [
          [{ "/": { "bytes": "block..5"} }, 0, 128],
          [{ "/": { "bytes": "block..6"} }, 129, 256],
          [{ "/": { "bytes": "block..7"} }, 257, 384],
          [{ "/": { "bytes": "block..8"} }, 385, 512]
        ]
      ])
    ]
  }
}
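
For reference, a rough TypeScript rendering of that structure -- a sketch that glosses over the dag-cbor encoding, the fact that each shard entry is a separate linked block, and the exact interpretation of the two slice numbers (offset/length vs start/end):

// Sketch: approximate shape of the "index/sharded/dag@0.1" example above.
type Slice = [digest: Uint8Array, offset: number, length: number]
type Shard = [blob: Uint8Array, slices: Slice[]]

interface ShardedDagIndex {
  'index/sharded/dag@0.1': {
    content: { '/': string }   // DAG root CID, e.g. bafy..dag
    shards: Shard[]            // links to [blob multihash, slices] blocks
  }
}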

I expect the following k:v mapping:

bafy..dag → bag...index
blb...left → bag...index
block..1 → bag...index
...
block..4 → bag...index
blb...right → bag...index
block..5 → bag...index
...
block..8 → bag...index

Which makes me wonder if we can do something a bit more clever to capture not just the relation to the container but all relations, along with a relation (name). Specifically, what I mean is when I query block..8 I would like to get something more like this:

block..8 → ["blb...right", "slice",  block..8]
block..8 → ["bafy..dag", "content/slice",  block..8]
block..8 → ["bag...index", "index/sharded/dag@0.1/content/slice",  block..8]

Meaning capture all the relations upward, not just the top. That would allow the enquirer to get a lot more value out of it. We could also make it possible to query by a relation, as opposed to having to get all of it, if we get a bit more creative with the keys, e.g.

multihash([?, 'slice', block...8]) → blb...right
multihash([?, ?, block...8]) → inlineLink([blb...right, 'slice'])
multihash([?, 'content/slice', block..8]) → bafy..dag
multihash([?, ?, block..8]) → inlineLink([bafy..dag, 'content/slice'])
multihash([?, 'index/sharded/dag@0.1/content/slice', block...8]) → bag...index
multihash([?, ?, block...8]) → inlineLink([bag...index, 'index/sharded/dag@0.1/content/slice'])

P.S. ☝️ That is roughly how Datomic and other datalog stores work; they allow you to sync the set of facts relevant to you so you can query locally.

@hannahhoward (Member) commented:

Which makes me wonder if we can do something a bit more clever to capture not just the relation to the container but all relations, along with a relation (name). Specifically, what I mean is when I query block..8 I would like to get something more like this:

block..8 → ["blb...right", "slice",  block..8]
block..8 → ["bafy..dag", "content/slice",  block..8]
block..8 → ["bag...index", "index/sharded/dag@0.1/content/slice",  block..8]

@Gozala @gammazero will need to confirm, but I believe the ability of IPNI to update frequently and at scale is dependent on not altering the records on a per-block basis.

i.e. if you imagine the original goal was to index blocks in Filecoin CARs in deals, we expected that lots of block-level CIDs would have the same reference -- and we'd want to change them en masse.

So I think this would destroy the optimization path for IPNI.

@hannahhoward (Member) left a review comment:

A couple suggestions:

  1. Why are we keying the content claims index on the dag cid rather than the blob CID?

The reason I ask is twofold:

  1. If we group the claims index on the blob CID, even if we take the location commitment out, then it's still just a single query to IPNI for everything about that blob -- i.e. you query the blob CID and that returns a link for the claims index AND a location commitment (potentially multiple). All one query.

  2. @Gozala you've suggested, and I strongly agree, that we really need to look at removing the concept of sharding in favor of multipart uploads and on-demand sharding for Filecoin pieces. This feels like it further commits us to the concept of a permanent sharded format.

Controversial view: if IPNI provides a block level index to blob records, there's no need to even put the concept of the shards in it. Why? Cause the root of the DAG is still a block. So if I get the root cid of sharded content, I can pull the first CAR, traverse till I hit a block that isn't in the first CAR, then go looking for that.

Is there something critical I'm missing? Otherwise I feel pretty strongly we should just do everything based on the blob index.

If we need to do shards separately, we can do that, and publish a separate record that's just a partition claim for the shards, and nothing else.

@hannahhoward (Member) commented:

  • With enough consideration the format can also be made deterministic, meaning two different users would produce the same index for the same data, which I think should be our goal.

Is this feasible? I thought we discussed specifically introducing user choice into indexing granularity.

Also, I'm really not that interested in trying to share these indexes, space-saving though it might be. Sharing always seems to lead to further confusion.

@alanshaw (Member) commented Apr 17, 2024

👍 I like that this is an index that encapsulates multiple shards and has a version! In general I agree we probably shouldn't include the location claim and I am LGTM on this proposal.

Which makes me wonder if we can do something a bit more clever to capture not just the relation to the container but all relations, along with a relation (name). Specifically, what I mean is when I query block..8 I would like to get something more like this:

block..8 → ["blb...right", "slice",  block..8]
block..8 → ["bafy..dag", "content/slice",  block..8]
block..8 → ["bag...index", "index/sharded/dag@0.1/content/slice",  block..8]

@Gozala I don't understand anything from here onwards.

IPNI metadata is at the advertisement level, so provided all the blocks you mention exist in the entries, I believe you'll receive bag...index in the query response independent of which block you query - as you want.

The question for me is, if we're not including a location claim with the IPNI metadata...what is the proposal?

Can the client invoke an assert/location and read from the existing content claims API, to bound this work until we figure out a good decentralized way to publish and read these claims?

Why are we keying the content claims index on the dag cid rather than the blob CID?

@hannahhoward how do we then map DAG CID to blob CID when resolving?


What does it look like to export a DAG from bafy...dag?

  1. Ask IPNI for bafy...dag and receive link to bag...index
  2. Get location claim for bag...index from claims.web3.storage/claims/bag...index
  3. Fetch CAR from returned location
  4. Fetch location claims for blb...left and blb...right from claims.web3.storage/claims/blb...left and claims.web3.storage/claims/blb...right
  5. Use HTTP range queries to export data!
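
A sketch of those steps, assuming a hypothetical queryIPNI helper, a JSON-ish view of the index, an assumed claims-response shape, and treating slice entries loosely as [digest, start, end] byte ranges:

declare function queryIPNI(cid: string): Promise<string>                      // 1. cid → index CID

async function locate(cid: string): Promise<string> {
  const res = await fetch(`https://claims.web3.storage/claims/${cid}`)        // 2 & 4. location claims
  const { location } = await res.json()    // assumed response shape
  return location
}

async function exportDag(dagCid: string): Promise<Uint8Array[]> {
  const indexCid = await queryIPNI(dagCid)
  const indexUrl = await locate(indexCid)
  const index = await (await fetch(indexUrl)).json()                          // 3. fetch the index
  const shards: [string, [string, number, number][]][] = index.shards
  return Promise.all(shards.map(async ([blob, slices]) => {
    const url = await locate(blob)
    const range = `bytes=${slices[0][1]}-${slices[slices.length - 1][2]}`
    const res = await fetch(url, { headers: { range } })                      // 5. HTTP range read
    return new Uint8Array(await res.arrayBuffer())
  }))
}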

How do we map blb...left and blb...right to piece CIDs?

??? not covered here - stick with assert/equals for now?

How do we map piece CIDs to blobs?

??? not covered here - stick with assert/equals for now?

@alanshaw (Member) commented:

FWIW this is building on top of a lot of ground already trodden:

The index in this proposal is a better version of: #63

The relation claim I proposed in the current content claims feels very similar to this (although designed for a different purpose): https://github.com/web3-storage/content-claims?tab=readme-ov-file#relation-claim-

Oli's proposed block index looks very similar to this (although not sharded): storacha/RFC#9

@Gozala (Collaborator, Author) commented Apr 17, 2024

@Gozala @gammazero will need to confirm, but I believe the ability of IPNI to update frequently and at scale is dependent on not altering the records on a per-block basis.

i.e. if you imagine the original goal was to index blocks in Filecoin CARs in deals, we expected that lots of block-level CIDs would have the same reference -- and we'd want to change them en masse.

So I think this would destroy the optimization path for IPNI.

You might be correct and I would love @gammazero's input on this. As far as I understood from our conversation yesterday, however, every block CID in the content claims bundle is still a key you'd be able to look up by, except it would resolve to the bundle address. If so, I'm effectively suggesting mapping it to not just one multihash but three: one for the bundle, one for the blob and one for the DAG root.

While there might be some critical details I'm overlooking, I'd be very worried if scaling it 3x were prohibitively costly, since given our throughput we can assume we'll need to scale far more than that.

@Gozala (Collaborator, Author) commented Apr 17, 2024

A couple suggestions:

  1. Why are we keying the content claims index on the dag cid rather than the blob CID?

Because clients (today) start from something like ipfs://bafy...dnr4fa and they would not know which blob(s) that DAG corresponds to. We could decouple this through equivalence claims, as @alanshaw points out, but that would mean increasing roundtrips and I don't know what the benefit would be to compensate for that tradeoff.

The reason I ask is twofold:

  1. If we group the claims index on the blob CID, even if we take the location commitment out, then it's still just a single query to IPNI for everything about that blob -- i.e. you query the blob CID and that returns a link for the claims index AND a location commitment (potentially multiple). All one query.

As far as I understand, if you look up by the digest of the blob you'd still get the same claims bundle CID back with the current design; it just also lets you look up by the unixfs root CID, because that is what I assume most clients will do lookups by.

  1. @Gozala you've suggested, and I strongly agree, that we really need to look at removing the concept of sharding in favor of multipart uploads and on-demand sharding for Filecoin pieces. This feels like it further commits us to the concept of a permanent sharded format.

I mean, I would prefer if we just had a single shard, and in fact in many cases it probably will be so, especially in the future if we adopt multipart uploads. However, that is not the case right now and I assume we do want to support current workflows. That said, if there is one shard it will work the same, and if we really want to we can lean into the embedded versioning and deprecate the sharded format in favor of a new one that simply maps the DAG root to a single blob.

Thinking more on this, we could probably alter the format to something like this instead:

{
  "index/dag@0.1": {
    // dag root
    "dag": { "/": "bafy..dag" },
    // blob digest that is the concatenation of the shards
    "blob": { "/": { "bytes": "concat..blob" } },
    "slices": [
      // shard that contains subslices
      [{ "/": { "bytes": "shard..left"} }, 0, 512],
      // slices within the above shard
      [{ "/": { "bytes": "block..1"} }, 0, 128],
      [{ "/": { "bytes": "block..2"} }, 129, 256],
      [{ "/": { "bytes": "block..3"} }, 257, 384],
      [{ "/": { "bytes": "block..4"} }, 385, 512],
      // right shard
      [{ "/": { "bytes": "blb...right" } }, 512, 1024],
      // slices within the right shard
      [{ "/": { "bytes": "block..5"} }, 512, 640],
      [{ "/": { "bytes": "block..6"} }, 640, 768],
      [{ "/": { "bytes": "block..7"} }, 768, 896],
      [{ "/": { "bytes": "block..8"} }, 896, 1024]
    ]
  }
}

The tradeoff here is that you will not know which blobs to look up location commitments for.

Controversial view: if IPNI provides a block level index to blob records, there's no need to even put the concept of the shards in it. Why? Cause the root of the DAG is still a block. So if I get the root cid of sharded content, I can pull the first CAR, traverse till I hit a block that isn't in the first CAR, then go looking for that.

I'm not sure I follow this. What does the DAG root lookup resolve to in this version? You mean it would still resolve to the CAR? If so, that is not too different from what I was proposing with triples.

Is there something critical I'm missing? Otherwise I feel pretty strongly we should just do everything based on the blob index.

Lookups start with the DAG root and we would like to cut down roundtrips, so putting the blob index and the DAG root together allows us to cut those down.

If we need to do shards separately, we can do that, and publish a separate record that's just a partition claim for the shards, and nothing else.

We could; we'd just need more roundtrips.

@Gozala (Collaborator, Author) commented Apr 17, 2024

Can the client invoke an assert/location and read from the existing content claims API, to bound this work until we figure out a good decentralized way to publish and read these claims?

I think we’ll extend the Index variant with another branch to cover location claims, and the client will look those up.

@gammazero commented Apr 17, 2024

Which makes me wonder if we can do something a bit more clever to capture not just the relation to the container but all relations, along with a relation (name). Specifically, what I mean is when I query block..8 I would like to get something more like this ...

No, the whole design of IPNI is to map large numbers of individual multihashes to the same provider record. That provider record can be easily swapped out, but IPNI efficiency is predicated on not making changes at the individual block level. Here is a description of how data is stored within the indexer (double-hashing and encrypted provider keys add an additional layer, but it does not alter the storage concepts here): https://github.com/ipni/go-indexer-core/blob/main/doc/data-storage.md

Exactly as @hannahhoward said above.

@gammazero commented:

  • With enough consideration the format can also be made deterministic, meaning two different users would produce the same index for the same data, which I think should be our goal.

What we want to index is not the data itself, but a user's claims + w3up location commitments. Even if the data is the same, and stored in the same location (which I am not sure is realistic), the user's statements and claims about that data are unique to that user, and that is what is being indexed.

Consider the case where 2 users store identical data, and store it in the same place. However, one user requests an additional replica (maybe part of a higher priced storage plan).

@Gozala (Collaborator, Author) commented Apr 17, 2024

I think there is some misunderstanding here; I never suggested making changes at the block level. Here is my attempt to illustrate what I mean:

index.add({ // blob...idx
   content: { '/': { "bytes": "left...shard" } },
   slices: [
    [{ "/": { "bytes": "block..1"} }, 0, 128],
    [{ "/": { "bytes": "block..2"} }, 129, 256],
    [{ "/": { "bytes": "block..3"} }, 257, 384],
    [{ "/": { "bytes": "block..4"} }, 385, 512]
  ]
})

I expect this to provide reverse lookups like

block...1 → blob...idx
...
block..4 → blob..idx

Similarly I could do something like

index.add({ // dag...idx
   content: { '/': { "bytes": "dag...idx" } },
   'dag/blocks': [
    [{ "/": { "bytes": "block..1"} }, 0, 128],
    [{ "/": { "bytes": "block..2"} }, 129, 256],
    [{ "/": { "bytes": "block..3"} }, 257, 384],
    [{ "/": { "bytes": "block..4"} }, 385, 512]
    ...
    [{ "/": { "bytes": "block..8"} }, 896, 1,024]
  ]
})

I expect this to provide reverse lookups like

block...1 → dag...idx
...
block..8 → dag...idx

Assuming

  1. Above works (which I don't see why it would not)
  2. There is a way to define the resolution value (as opposed to making it the hash of what we pass into index.add; I assume that is what contextID does, no?)

We can effectively do what I was describing; it's just that instead of using the hash of the payload passed to index.add we would use multiformat(utf8('slices').concat(payload.content)) in the first case and multiformat(utf8('dag/blocks').concat(payload.content)) in the second.
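
A sketch of that key derivation; the choice of sha2-256 and raw byte concatenation here are assumptions standing in for whatever multiformat(...) ends up being:

import { createHash } from 'node:crypto'

// key = hash(relationName || contentDigest), used as the multihash we advertise under
function relationKey(relation: string, contentDigest: Uint8Array): Uint8Array {
  const prefix = new TextEncoder().encode(relation)          // e.g. 'slices' or 'dag/blocks'
  return createHash('sha256')
    .update(Buffer.concat([prefix, Buffer.from(contentDigest)]))
    .digest()
}

// relationKey('slices', leftShardDigest)      ~ multiformat(utf8('slices').concat(payload.content))
// relationKey('dag/blocks', dagRootDigest)    ~ multiformat(utf8('dag/blocks').concat(payload.content))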

@gammazero commented:

I think we’ll extend the Index variant with another branch to cover location claims, and the client will look those up.

I am not sure I really understand this. Does this mean mapping all the block multihashes to a second piece of information coming from a different advertisement chain?

@Gozala (Collaborator, Author) commented Apr 17, 2024

Small update: I had a call with @gammazero who confirmed that my "trickery" of capturing the name of 1:n relations (blob:blocks or dag:blocks) is doable and does not impose block-level changes, although it does abuse IPNI to a degree (that I think is acceptable).

That said, I realize that listing those ideas created an impression that we need to figure this out to move forward. I do not think that is the case; the format in the spec is not affected by that line of thought. I think we should map that data to the IPNI advertisement as per https://github.com/web3-storage/RFC/blob/main/rfc/ipni-w3c.md

I'll create a separate RFC to explain my "trickery" there and provide some rationale on why I think it is a good idea. Also note that if we decide to adopt it, the format described here would not need to change; we'd just derive different IPNI advertisements from it.

@Gozala (Collaborator, Author) commented Apr 17, 2024

🚦 I think we need to get a consensus on the following questions (@gammazero suggested we have a call with @alanshaw and @hannahhoward to do it)

  1. Do we require that the location commitment be included in the bundle that the user creates, or do we leave it out?
  2. Do we support shards in the format, or do we have a separate advertisement to map dag → shard relations?
  3. Do we support multiple shards or just go with one?

On the 1st, I think we should omit location commitments and here is my rationale for this: #121 (comment). I believe @alanshaw agrees per his comment here #121 (comment), and I'm not sure what @hannahhoward's position is.

In regards to the 2nd, I responded with some rationale, and as @alanshaw pointed out a lot of it is grounded in our prior experience; I'd say it is just a revision of https://github.com/web3-storage/content-claims?tab=readme-ov-file#relation-claim- and I assume he's on board. @hannahhoward perhaps you can comment on whether the provided rationale resonates or if you think we should go about this differently.

In regards to the 3rd, I provided some rationale on why and also offered an alternative approach here: #121 (comment). The main tradeoff of the proposed version is that it is no longer explicit which slices one should seek location commitments for.

@hannahhoward (Member) left a review comment:

@gammazero @Gozala

I have an idea about separating the location commitment without paying a major price on round trips.

Currently, I imagine a dag traversal from root cid to be:

  • look up dag cid in ipni to get claim index
  • read claim index
  • lookup blob cids in IPNI to get blob locations
  • read out all the blocks using blob locations + claim index

The downside obviously is 3 sequential roundtrips (query IPNI for DAG CID, fetch claim index, query IPNI for blob CID) before you can read data

As I understand from @gammazero one of IPNI’s selling points is you can publish a single piece of data on many CIDs for trivial cost. So:

Rather than publish the location claim ONLY on the blob CID, is there any reason not to publish it on every block within the blob (or every indexed CID in the blob)?

That would mean, in the case of a sharded DAG, at least one blob would contain the root block, whose location claim would be returned on a query for the dag root CID. So now a traversal would look like:

  • look up dag cid in ipni to get claim index, and first blob location
  • read claim index
  • start downloading immediately from first blob
  • lookup additional blob cids in IPNI to get blob locations
  • finish downloading from other blobs

I would think this isn’t a huge lift for IPNI but it would potentially cut out a round trip in practice.

I’m not sure about our block ordering algorithm, but I wouldn’t be surprised if the blob with the root CID contains many of the blocks you need to start incrementally verifying and deserializing right away.

This would have an additional advantage for bitswap: every block served would be a single IPNI query, since every block lookup would return both a claim index AND a location claim for the specific blob in a sharded DAG that the block is part of.

The primary challenge is that it requires the person publishing the location commitment to IPNI to have access to the claim index, to know all the CIDs they should publish on. I think it’s worth it though, and it would still enable us to handle these as completely separate concerns.

So if a location claim was revoked, it could easily be removed for just the CIDs in the blob affected -- just tell IPNI to delete the entire advertisement for the location claim contextID (@gammazero you can correct me if I'm wrong on this) without affecting the claim index.

@gammazero commented Apr 18, 2024

I am not sure about step one:

look up dag cid in ipni to get claim index, and first blob location

I thought the purpose of getting the claims bundle was to get the blob index so that the client could do range requests to read the parts of whichever blob(s) contains the data they want. If they only need to get data from the second and third blobs, then we should not care about the first blob location. Is this a statistical optimization because we suspect that most of the time a client will want all of the data in the first blob?

Please confirm or clarify how we are using IPNI for step three:

lookup additional blob cids in IPNI to get blob locations

I assume this is for the case where a multihash represents a piece of data that spans multiple blobs, correct? So, there would be a claim (inclusion claim?) in the claims bundle that says that there are some additional blob CIDs (multihashes) associated with this blob. Then the client asks IPNI for the location of those blobs, and downloads the "continuation" blobs.

I am considering whether IPNI is the best way to look up the associated "continuation" blobs, since that is only a 1:1 mapping of multihash to blob location info. It does work, but it means that advertisements are published to advertise a single multihash-to-location-info mapping. IPNI maps huge numbers of multihashes to a single record, with the ability to swap out just that record, so this is somewhat abusing IPNI, and maintaining and publishing the advertisements is not a zero-cost thing.

@hannahhoward (Member) commented Apr 18, 2024

@gammazero what I understand is:

  1. The claim index contains:
// "bag...index"
{
  "index/sharded/dag@0.1": {
    "content": { "/": "bafy..dag" },
    "shards": [
      link([
        // blob multihash
        { "/": { "bytes": "blb...left" } },
        // sliced within the blob
        [
          [{ "/": { "bytes": "block..1"} }, 0, 128],
          [{ "/": { "bytes": "block..2"} }, 129, 256],
          [{ "/": { "bytes": "block..3"} }, 257, 384],
          [{ "/": { "bytes": "block..4"} }, 385, 512]
        ]
      ]),
      link([
        // blob multihash
        { "/": { "bytes": "blb...right" } },
        // sliced within the blob
        [
          [{ "/": { "bytes": "block..5"} }, 0, 128],
          [{ "/": { "bytes": "block..6"} }, 129, 256],
          [{ "/": { "bytes": "block..7"} }, 257, 384],
          [{ "/": { "bytes": "block..8"} }, 385, 512]
        ]
      ])
    ]
  }
}

This gives me everything I need to read the DAG representing bafy..dag EXCEPT the location commitments for blb...left and blb...right.

Currently, to actually read from bafy..dag I now need to go look up a location commitment for blb...left and blb...right.

What I am saying is:

  1. Publish the claims index location in IPNI for every cid in bafy..dag (so context id = bafy..dag, peer/metadata = location of claims index, cids advertised = every cid in bafy..dag)
  2. Separately, publish the location commitment for blb...left in IPNI for all the block cids in blb...left (so context id = blb..left, peer/metadata = location commitment for blb...left, cids advertised = all the cids in blb...left + blb...left itself)
  3. Separately, publish the location commitment for blb...right in IPNI for all the block cids in blb...right (so context id = blb..right, peer/metadata = location commitment for blb...right, cids advertised = all the cids in blb...right + blb...right itself)

Then the lookup for the DAG CID returns both the claims index location and the location commitment for blb...left. This means once I get the claims index I can immediately start doing range reads from blb...left, and if I also want to do range reads from blb...right I can kick off that IPNI query while I'm working on reading from blb...left.
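
A sketch of those three advertisements as data, with a hypothetical publishAdvertisement helper in place of the real IPNI publishing machinery:

// Sketch only: the record shape and publishAdvertisement are assumptions.
declare function publishAdvertisement(ad: { contextID: string, metadata: string, entries: string[] }): Promise<void>

interface ShardInfo { blob: string, locationCommitment: string, blockCids: string[] }

async function publishAll(dagCid: string, claimsIndexLocation: string, shards: ShardInfo[]): Promise<void> {
  // 1. claims index location keyed by the DAG root and every CID in bafy..dag
  await publishAdvertisement({
    contextID: dagCid,
    metadata: claimsIndexLocation,
    entries: [dagCid, ...shards.flatMap(s => s.blockCids)]
  })
  // 2 & 3. one advertisement per blob carrying its own location commitment
  for (const shard of shards) {
    await publishAdvertisement({
      contextID: shard.blob,
      metadata: shard.locationCommitment,
      entries: [shard.blob, ...shard.blockCids]
    })
  }
}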

@alanshaw (Member) commented Apr 18, 2024

@hannahhoward yes that might work... and removes the dependency on querying the existing content claims API for resolution. My concern is the number of adverts we'd be generating - each upload is at minimum 2 adverts and I'm worried IPNI might not be able to keep up with the head.

...we currently pack as many hashes as possible into each advert and we'd be switching to a system where we get an advert per upload. I was thinking the benefits this gives us might be worth it, but now that we're leaning on it for content claims as well we should consider this:

⁉️ Relying on IPNI to serve content claims directly ties time to availability in bitswap/gateways to the time it takes IPNI to index our chain. ...and we do not know at all when any given upload/CID becomes available.

Conversely, right now if users create content claims and publish them to us (i.e. location, inclusion, partition, equivalency) they can read them ALL in a single request like https://claims.web3.storage/claims/bafy...dag?walk=parts,includes,equals and they are available immediately after they are published (aside: we also have a graphql interface). It's a shame to be moving away from these behaviours.

@Gozala (Collaborator, Author) commented Apr 18, 2024

@hannahhoward I think it might work and it would indeed reduce roundtrips; however, it is worth calling out that it moves the lookup from query time to location-publishing time, and it may end up 1:n if there is more than one partition index for the same blob (which can happen if we don't make partitioning deterministic).

Put differently, the cost of doing the extra lookup is still there; we just move it around to optimize reads. What I have been advocating for, and still feel is the better option, is to put something like GraphQL in front of IPNI; that way we gain the same benefits by doing an extra roundtrip on the first query and then caching it. Cache invalidation can also be tied directly to the location commitment TTL. Additionally, this would give all the blob locations in the same roundtrip, not just the first one.

It does not need to be GraphQL necessarily; something that simply performs recursive lookups would probably work too. Inlining the message from Slack describing this:

One other thought that occurred to me is that if IPNI could do a recursive query we could cut down roundtrips significantly. What I mean is: resolve multihashes by key, then resolve multihashes from the first resolve, etc… If you could specify depth, that would allow us to look up by root with depth 2 to resolve blobs, resolve those to locations and indexes, and get all of it in a single roundtrip.
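
A sketch of that depth-bounded recursive lookup; the idea that an IPNI response could carry further multihashes to chase is hypothetical:

declare function queryIPNI(key: string): Promise<{ value: string, next: string[] }>

async function resolveRecursive(key: string, depth: number): Promise<string[]> {
  const { value, next } = await queryIPNI(key)
  if (depth <= 1) return [value]
  const rest = await Promise.all(next.map(k => resolveRecursive(k, depth - 1)))
  return [value, ...rest.flat()]
}

// depth 2 from the DAG root: resolve blobs, then their locations/indexes, in one logical query
// await resolveRecursive(dagRootMultihash, 2)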

@Gozala (Collaborator, Author) commented Apr 18, 2024

⁉️ Relying on IPNI to serve content claims directly ties time to availability in bitswap/gateways to the time it takes IPNI to index our chain. ...and we do not know at all when any given upload/CID becomes available.

Conversely, right now if users create content claims and publish them to us (i.e. location, inclusion, partition, equivalency) they can read them ALL in a single request like https://claims.web3.storage/claims/bafy...dag?walk=parts,includes,equals and they are available immediately after they are published (aside: we also have a graphql interface). It's a shame to be moving away from these behaviours.

I would imagine that we would have our own replica of the IPNI records that we could query first, falling back to the IPNI mainline. In other words, local provides fast access and global provides resilience.
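
A sketch of that local-replica-first lookup; the local endpoint and the response shape are assumptions, and cid.contact stands in for the public IPNI mainline:

declare function queryIndexer(endpoint: string, multihash: string): Promise<string | null>

async function resolveProvider(multihash: string): Promise<string | null> {
  // fast path: our own replica of the records we published
  const local = await queryIndexer('https://indexer.example', multihash)
  if (local !== null) return local
  // resilient path: fall back to the public IPNI mainline
  return queryIndexer('https://cid.contact', multihash)
}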

I also do think that we need a tool like GraphQL, or something similar designed for cutting roundtrips, as opposed to coming up with creative ways to bundle things so as to avoid roundtrips. That is because the latter compromises on flexibility and limits our ability to execute fast:

  1. Every new piece of information needs to be incorporated into a bundle somehow (do we put piece CIDs in also?)
  2. Every time a new piece of information is introduced we need to rebuild all of the index bundles or build some fallback.

It is possible but hard to scale, which is why I think a more holistic solution in the form of GraphQL is better.

@gammazero commented:

I do not think that IPNI will be able to do recursive lookups of any kind with how the double-hashed reader protection works. The purpose of double-hashing is to prevent IPNI from knowing what multihash a client is trying to look up.

Double-hashing works as follows:
The client queries IPNI using a hash of the original multihash, so that IPNI does not know the original multihash the client is looking for. IPNI returns a metadata key that is encrypted using the original multihash. Then the client decrypts that metadata key using the original multihash. The client then uses the decrypted metadata key to request the provider metadata from IPNI.
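
A conceptual sketch of those steps; the real reader-privacy protocol has its own key derivation and encryption scheme, so treat every primitive below as an assumption, not the wire format:

import { createHash, createDecipheriv } from 'node:crypto'

declare function ipniLookup(secondHash: Buffer): Promise<{ encryptedKey: Buffer, iv: Buffer }>
declare function fetchProviderMetadata(metadataKey: Buffer): Promise<unknown>

async function doubleHashedLookup(originalMultihash: Buffer): Promise<unknown> {
  // 1. query by a hash of the multihash, so IPNI never sees the original
  const secondHash = createHash('sha256').update(originalMultihash).digest()
  const { encryptedKey, iv } = await ipniLookup(secondHash)
  // 2. decrypt the returned metadata key with a key only the client can derive
  //    (this derivation is made up for illustration)
  const aesKey = createHash('sha256').update(Buffer.concat([Buffer.from('derive'), originalMultihash])).digest()
  const decipher = createDecipheriv('aes-256-ctr', aesKey, iv)
  const metadataKey = Buffer.concat([decipher.update(encryptedKey), decipher.final()])
  // 3. use the decrypted metadata key to request the provider metadata
  return fetchProviderMetadata(metadataKey)
}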

@gammazero commented:

@hannahhoward I am continuing forward with implementing your idea to separately index the location commitments, mapping all the multihashes in a blob to the location of the blob. We talked about publishing the advertisements for locations and for index on separate chains, but it seems kind of strange to have a separate locations provider and indexes provider when really these are just two different sets of data from the same source (w3up). If index advertisements are ever published by something other than w3s then maybe it makes sense to have a different provider/chain at that point. WDYT?

@Gozala marked this pull request as ready for review on April 22, 2024 16:07
@Gozala merged commit a09ceac into main on Apr 22, 2024
2 checks passed
@Gozala deleted the feat/w3-index branch on April 22, 2024 16:14