
feat: write up ipni+datalog sketch #22

Open · wants to merge 3 commits into base: main
Conversation

@Gozala (Contributor) commented Apr 18, 2024

This moves some of the ideas from the thread storacha/specs#121 into its own RFC and hopefully does a better job of explaining them.

@Gozala requested a review from gammazero April 18, 2024 06:29

1. The client is required to fetch `bag...idx` and parse it before it can start looking up locations for the blobs.

2. Anyone could publish an IPNI advertisement that associates `block..1` with some `who..knows` multihash which may not be a [DAG Index]; the client will only discover that after fetching it and trying to parse it as one.
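To make the sequential dependency in point 1 (and the late failure in point 2) concrete, here is a toy sketch; `parseDagIndex` and `lookupLocations` are illustrative stand-ins, not real APIs:

```typescript
// Sketch of the current client flow. All helpers here are hypothetical,
// standing in for real fetch / parse / location-resolution steps.
type Multihash = string;

interface DagIndex {
  shards: Multihash[];
}

// A toy parser: only payloads tagged as a DAG index parse successfully.
// A malicious advertisement can point `block..1` at arbitrary bytes, and
// the client finds out only at this step, after the fetch.
function parseDagIndex(bytes: Uint8Array): DagIndex | null {
  if (bytes[0] !== 0x01) return null; // not a [DAG Index]
  return { shards: [`shard-${bytes.length}`] };
}

function lookupLocations(index: DagIndex): string[] {
  // Location lookups can only begin after the index is fetched and parsed.
  return index.shards.map((shard) => `location-of-${shard}`);
}
```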
Contributor commented:
This can be controlled if we use our own indexer and only allow known publishers to publish ads. We can even modify it to only allow certain types of index data.

The client can filter out results to remove unknown publishers and unwanted types of metadata. This can be done without having to fully read the results.
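The envelope-level filtering described here could look something like the sketch below; the result shape and the allow-listed values are illustrative, not the actual IPNI find-response schema:

```typescript
// Sketch of client-side filtering of IPNI results by an allow-list of
// publishers and metadata protocols. Field names and values are
// hypothetical, not the exact IPNI schema.
interface ProviderResult {
  publisher: string;        // peer ID of the advertisement publisher
  metadataProtocol: number; // multicodec identifying the metadata type
}

const KNOWN_PUBLISHERS = new Set(["12D3KooW...storacha"]); // placeholder peer ID
const ALLOWED_METADATA = new Set([0x0900]);                // assumed index codec

function filterResults(results: ProviderResult[]): ProviderResult[] {
  // Only the envelope fields are inspected; the metadata payload itself
  // is never fetched or parsed for rejected entries.
  return results.filter(
    (r) =>
      KNOWN_PUBLISHERS.has(r.publisher) &&
      ALLOWED_METADATA.has(r.metadataProtocol)
  );
}
```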

Contributor Author commented:

Yeah, that is a good call; the client could leverage existing trust where possible. I have mostly been interested in an open-ended scenario in which the publisher of the advertisement can be different from the author and bears no accountability for its accuracy. But if the publisher is accountable for advertisements, trust in the publisher could be leveraged to address this concern.

Contributor commented:

I think IPNI should be able to support using UCAN to allow the content provider to authorize a publisher to publish on its behalf. I am going to propose that as a change to the IPNI spec. What is the appropriate type of UCAN to do this?
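A hypothetical shape for such a delegation is sketched below; the capability name `ipni/advertisement/publish` and the DIDs are assumptions, not anything from the IPNI or UCAN specs:

```typescript
// Hypothetical shape of a UCAN delegation letting a content provider
// authorize a publisher to publish IPNI advertisements on its behalf.
interface Capability {
  with: string; // resource: the provider's DID
  can: string;  // ability being delegated
}

interface Delegation {
  issuer: string;   // content provider DID (signs the UCAN)
  audience: string; // publisher DID (may now publish ads)
  capabilities: Capability[];
  expiration: number; // unix seconds
}

const delegation: Delegation = {
  issuer: "did:key:zProvider",   // placeholder DID
  audience: "did:key:zPublisher", // placeholder DID
  capabilities: [
    // The capability name is an assumption for illustration only.
    { with: "did:key:zProvider", can: "ipni/advertisement/publish" },
  ],
  expiration: 1735689600,
};
```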


3. The client cannot query specific relations, meaning the fact that `block..1` is a block of the DAG, and the fact that `blb...left` is a blob containing DAG blocks, are not captured in any way.

4. We also have not captured the relation between `blob..left` and `block..1` in any way, other than that they both relate to `bag...idx`.
Contributor commented:

What relationships should be captured and how would they be used?

Contributor Author commented:

I go into more detail below, but to put it simply, if you look at the example index you can see that:

  1. block..1 is a slice of the blb...left blob.
  2. block..1 is a block of the bafy..dag dag.

When I look up block..1, which in datalog could be something like ["?e", "?relation", "block..1"], I would expect to get back something like this:

[
  ["bafy..dag", "dag@0.1/block", "block..1"],
  ["blb..left", "blob@0.1/slice", "block..1"],
]

But I would also like to be able to query a specific relation like ["?blob", "blob@0.1/slice", "block..1"] to get something like:

[
  ["blb..left", "blob@0.1/slice", "block..1"],
]
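The lookups described above can be sketched as a tiny pattern matcher over a triple store; the facts mirror the example, and everything else is illustrative:

```typescript
// Minimal sketch of triple-pattern matching: terms starting with "?" are
// variables, everything else must match the fact exactly.
type Term = string;
type Triple = [Term, Term, Term];

// Facts taken from the example index discussed above.
const facts: Triple[] = [
  ["bafy..dag", "dag@0.1/block", "block..1"],
  ["blb..left", "blob@0.1/slice", "block..1"],
  ["blb..left", "dag@0.1/shard", "bafy..dag"],
];

const isVariable = (term: Term) => term.startsWith("?");

function query(pattern: Triple, db: Triple[] = facts): Triple[] {
  return db.filter((fact) =>
    fact.every((term, i) => isVariable(pattern[i]) || pattern[i] === term)
  );
}

// query(["?e", "?relation", "block..1"]) matches both block..1 facts;
// query(["?blob", "blob@0.1/slice", "block..1"]) narrows to the slice fact.
```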


If the system drops `blob..left`, it will drop all the records associated with that entity, as they would come from a single IPNI advertisement.

Because we get shards from a single query, we do not need to fetch the [DAG Index] to start resolving location commitments. We could start resolving them right away, and if we gate IPNI with something like GraphQL we could even resolve them in a single roundtrip and cache them for subsequent queries.
Contributor commented:

> Because we get shards from a single query we do not need to fetch [DAG Index] to start resolving location commitments.

A single query to IPNI or to a result cache?

I am not sure I understand. Let's be specific about which key is used to look up which data. What I thought was: if IPNI is queried by a block multihash, the location of the DAG index is returned. The DAG index can be read to get the location commitments, and these can be cached in a separate result cache. A subsequent query for that same block multihash can first hit the cache and get the index and location commitments back in a single response.
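The flow described in this comment could be sketched as follows, with stub functions (`queryIpni`, `readDagIndex`) standing in for the real lookups:

```typescript
// Sketch of the described flow: query IPNI by block multihash, read the
// DAG index for location commitments, then cache everything so a repeat
// query for the same multihash is answered in one response.
interface Resolved {
  indexLocation: string;
  locationCommitments: string[];
}

const cache = new Map<string, Resolved>();

// Hypothetical stubs standing in for real IPNI / index lookups.
const queryIpni = (mh: string) => `index-location-for-${mh}`;
const readDagIndex = (loc: string) => [`commitment-a@${loc}`, `commitment-b@${loc}`];

function resolve(multihash: string): Resolved {
  const cached = cache.get(multihash);
  if (cached) return cached; // single-response fast path

  const indexLocation = queryIpni(multihash);
  const locationCommitments = readDagIndex(indexLocation);
  const result = { indexLocation, locationCommitments };
  cache.set(multihash, result);
  return result;
}
```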



## Relation to Datalog

In datalog, facts are triples of `[entity, attribute, value]`, and you can e.g. run a query like `[?blob, 'dag@0.1/shard', 'bafy..dag']` to find all `?blob`s that have a `shard` relation to `bafy..dag`, which (from our example) will produce
Contributor commented:

Would datalog be used as a result cache for IPNI query results? So datalog queries would only be available after getting IPNI results and then fetching additional related data, such as the blob index and location commitments. Is that right?

Contributor Author commented:

I'm not sure I follow what you're asking here. I think of a datalog query as a composite of multiple queries. The service that receives one can decompose it in order to query a combination of local cache + IPNI, aggregate / filter the results, and send them back to the querying client.

This creates an opportunity to reduce the number of queries to IPNI by caching at several layers:

  1. The query can be hashed (after some normalization to make it name-agnostic), and that hash can be used as a key into a query result cache. If we have a result, we can return it without doing any work.
  2. On a cache miss, the query executor decomposes the query into atomic queries (where some can run concurrently while others can only run after the first set of results is received). Each atomic query can again be looked up in the local cache by its hash; on a cache miss, it can be sent to IPNI.
  3. After the query executor aggregates / filters the results, it can cache them by the query key and return them to the client.
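Layer 1 above hinges on name-agnostic normalization; here is a minimal sketch of what that could mean for a single triple pattern, assuming variables are marked with a leading `?`:

```typescript
// Sketch of layer 1: normalize a datalog pattern so variable *names* do
// not affect the cache key, then use the normalized form as the key.
type Triple = [string, string, string];

function normalize(pattern: Triple): string {
  let next = 0;
  const renames = new Map<string, string>();
  const terms = pattern.map((term) => {
    if (!term.startsWith("?")) return term; // constants pass through
    if (!renames.has(term)) renames.set(term, `?v${next++}`);
    return renames.get(term)!; // same variable always gets the same rename
  });
  // In practice this string would be hashed; the normalized form itself
  // already works as a cache key.
  return JSON.stringify(terms);
}

// ["?e", "?relation", "block..1"] and ["?x", "?y", "block..1"] now share
// one cache entry.
```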

So the worst case scenario would be:

  1. One IPNI lookup for the DAG root.
  2. A set of concurrent slice & commitment lookups per shard.

But once the engine finds responses it can cache them all, and future queries would require no IPNI lookups at all. We could also tie the cache TTL to the TTLs of all the responses; that way we get invalidation and recomputation of joined results without having to manually deal with resource management (like invalidating some indexes when a commitment expires).
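Tying the cache TTL to the underlying responses could be as simple as taking the earliest expiry, sketched here with an assumed `expiresAt` field on each response:

```typescript
// Sketch of tying a joined result's TTL to its inputs: the cached entry
// expires when the shortest-lived underlying response (e.g. a location
// commitment) expires, forcing recomputation instead of manual invalidation.
interface Response {
  value: string;
  expiresAt: number; // unix seconds (field name is an assumption)
}

function cacheTtl(responses: Response[], now: number): number {
  const earliest = Math.min(...responses.map((r) => r.expiresAt));
  return Math.max(0, earliest - now); // already-expired inputs yield 0
}
```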
