
Added experimental index provider feature to go-ipfs #8771

Closed

Conversation


@bkyarger bkyarger commented Mar 8, 2022

Two modes are available via different flags: one imports all existing blocks, and a second handles ongoing advertising of newly added blocks.

No pinning is currently implemented, so there is no protection from GC at this time. Lots of code is borrowed from indexer-provider[1].

There are some hardcoded constants controlling the size of advertisements and chunks; these should probably be moved to configuration if this is to be supported long term.

[1] https://github.com/filecoin-project/index-provider

  • research pinning style design suggestions
  • move component init into core/node
  • add indexProvider field to IpfsNode
  • move to standard datastore
  • move to built-in graphsync, deal with linksystem
  • research dataTransfer context usage / behavior during shutdown
  • research registration with indexer, usage of private keys, whether it is necessary
  • use blockstore, not datastore
  • fix log level on errCh not open
  • use standard linksystem, put on IpfsNode if necessary
  • add mutex to providerBlockStore for concurrency assurances
  • deal with errors in PutMany of providerBlockStore
  • fix order of operations and mutex on Delete of providerBlockStore


welcome bot commented Mar 8, 2022

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the rest) and in its best form. Follow the code contribution guidelines if they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviewed, discussed and approved as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

@dchoi27 dchoi27 requested a review from aschmahmann March 8, 2022 03:57

bkyarger commented Mar 8, 2022

Added fixes for issues from make test_go_lint


bkyarger commented Mar 9, 2022

Rebased on top of #8756 and fixed go-ipfs-config refs

@bkyarger bkyarger changed the title Added experimental index provider feature Added experimental index provider feature to go-ipfs Mar 9, 2022
@bkyarger bkyarger force-pushed the feat/provide-indexer branch 16 times, most recently from ee94964 to 9d3f2ce on March 16, 2022 04:30
@BigLep BigLep added this to the Best Effort Track milestone Mar 18, 2022

@aschmahmann aschmahmann left a comment


@bkyarger I'm glad you were able to make some progress through here and put together a test demonstrating that things mostly work.

I added a bunch of code-related comments as well as a couple about design. The big design one above is mostly a basic info dump of what you might run into and some potential ways around it.

There's still work to do before this can be usable by upstream go-ipfs (mostly covered in the big design comment); however, getting your fork up and running for your use case sooner seems pretty doable.

Great to see this coming along, and I'm happy to help where I can here, although you might need some folks more familiar with the indexer codebase to review the indexer-specific logic and setup.


For the time being I have converted the PR to a draft to indicate it's not ready for merge into master at the moment and to make it easy to identify on our GitHub boards.

Comment on lines +553 to +556
// if idxProviderFull is set, we will provide all existing blocks we have. This
// is meant to be a one-time operation, and not meant to be used at the same time
// as the idxProviderSet - which advertises newly added blocks on a timer
idxProviderNode := &ipfsIdxProviderNode{node}
if idxProviderFullSet {

New components, whether configurable or not, being instantiated in daemon.go is inconsistent with how we normally initialize things, which may lead to issues, as with the instantiation of node.BaseBlocks below.

Instead this logic should live in core/node with the other components. If you want this to be configurable with a flag rather than just a config file option, you can use the IPNS over PubSub (search for ipnsps) example for illustration, since it also has a CLI flag. You can also add an indexer provider field to IpfsNode in case you want to write some extra commands that let you introspect on it, or just want to work with it once it's running.

Happy to give more guidance here if needed.
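
For illustration, here's a minimal sketch of what an fx-wired optional component in core/node could look like, in the spirit of the ipnsps option. The names below (IndexProvider, NewIndexProvider, IndexProviderOption) are made up for the sketch, not the actual go-ipfs API:

```go
package node

import (
	"context"

	"go.uber.org/fx"

	"github.com/libp2p/go-libp2p-core/host"
)

// IndexProvider is a hypothetical wrapper around the indexer-provider engine.
type IndexProvider struct {
	// engine, datastore, linksystem, etc. would live here
}

// NewIndexProvider constructs the provider from dependencies supplied by fx
// and hooks its shutdown into the node lifecycle.
func NewIndexProvider(lc fx.Lifecycle, h host.Host) (*IndexProvider, error) {
	p := &IndexProvider{}
	lc.Append(fx.Hook{
		OnStop: func(ctx context.Context) error {
			// close the provider engine cleanly on daemon shutdown
			return nil
		},
	})
	return p, nil
}

// IndexProviderOption gates the component on a config/flag value, mirroring
// how optional components such as ipnsps are wired up.
func IndexProviderOption(enable bool) fx.Option {
	if !enable {
		return fx.Options()
	}
	return fx.Provide(NewIndexProvider)
}
```

With something like that in place, daemon.go only needs to read the flag and pass it into the options, and commands can reach the running instance through a field on IpfsNode.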

Comment on lines +111 to +118
dataStore, err = leveldb.NewDatastore(dataStorePath, nil)
if err != nil {
providerlog.Errorf("Error creating new leveldb datastore: %v", err)
return err
}
// TODO why can't I use this datastore instead of creating my own? When I try I get a panic in migrations code
// with an empty array being indexed
// dataStore := node.DataStore()

Yep, this should be able to use the standard datastore. Probably moving things to the standard initialization pattern will help you out here.
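
As a tiny sketch of that suggestion (assuming this setup code has access to the *core.IpfsNode; the helper name is made up for illustration):

```go
import (
	datastore "github.com/ipfs/go-datastore"

	"github.com/ipfs/go-ipfs/core"
)

// providerDatastore reuses the node's repo datastore instead of opening a
// separate leveldb store just for the index provider.
func providerDatastore(node *core.IpfsNode) datastore.Datastore {
	return node.Repo.Datastore()
}
```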

Comment on lines +135 to +137
// TODO currently can't use this version since it has its own linksystem - could swap that out
// based on idx provider being enabled, but saving that for a later time
// gs := node.GraphExchange()

This also means that if a user had enabled both graphsync and the index provider, they'd start clobbering each other, since they're both handlers for the same protocol.

// based on idx provider being enabled, but saving that for a later time
// gs := node.GraphExchange()

tp := gstransport.NewTransport(node.PeerHost().ID(), gs)

Not familiar with the go-datatransfer business here; review on this will have to come from @willscott, @hannahhoward, or someone else familiar with it.

providerlog.Debugf("creating link system")
lsys = mkLinkSystem()

gsNet := gsnetwork.NewFromLibp2pHost(node.PeerHost())

All the places you use node. will go away/change if you move this function into the main dependency injection area of the codebase instead of running it after the IpfsNode creation.

Comment on lines +134 to +137
if bs.keys[k] {
delete(bs.keys, k)
}
return bs.backend.DeleteBlock(ctx, k)

out of order + mutex
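
Spelling that out, a sketch of the fixed method might look like the following (the package placement and the mu field are assumptions, since providerBlockStore has no mutex yet; only the fields relevant to DeleteBlock are shown):

```go
package provider

import (
	"context"
	"sync"

	"github.com/ipfs/go-cid"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// providerBlockStore sketch: the mu field guarding the keys map is the
// addition suggested by the review.
type providerBlockStore struct {
	mu      sync.Mutex
	keys    map[cid.Cid]bool
	backend blockstore.Blockstore
}

// DeleteBlock deletes from the backend first and only then drops the key from
// the in-memory set, so a failed backend delete cannot leave the two out of sync.
func (bs *providerBlockStore) DeleteBlock(ctx context.Context, k cid.Cid) error {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	if err := bs.backend.DeleteBlock(ctx, k); err != nil {
		return err
	}
	delete(bs.keys, k)
	return nil
}
```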

github.com/jbenet/go-temp-err-catcher v0.1.0
github.com/jbenet/goprocess v0.1.4
github.com/libp2p/go-doh-resolver v0.4.0
github.com/libp2p/go-libp2p v0.16.0
github.com/libp2p/go-libp2p v0.18.0-rc3

Even for your testing you should use either the released go-libp2p v0.18.0 (not yet tested with go-ipfs pending #8680) or a previous version.

Comment on lines +187 to +188
cmds.StringOption(enableIndexerProviderKwd, "Enable experimental indexer provider feature and provide new Cids that are added, with indexer address"),
cmds.StringOption(indexerProviderFullKwd, "Enable experimental indexer provider feature and provide all locally stored Cids, with indexer address"),

Going to use this area to talk about some potential design issues because I think a thread will be useful here and GitHub doesn't provide them unless you anchor to some code.

My understanding of the design here is that you're planning to run this fork by:

  1. Running ipfs daemon --index-provider-full-import-experiment once, waiting for everything to finish by watching some logs
  2. Killing the node and from then on running ipfs daemon --index-provider-experiment, which will periodically notice any new blocks you have added locally and advertise them

Some things that this misses, some of which you might be ok with and some not. (Note: most if not all of these are likely blockers to merging the code upstream, but we can go a step at a time here)

  • If the node crashes before I've advertised some new data, I will never advertise it unless I run a full import again, which is expensive for both the client and the server due to how the data is batched up (i.e. there is unlikely to be any overlap with the previous advertisements)
  • If I delete data I don't "unadvertise" it. This is an easy enough fix, although the issue above remains
  • If I change my identifier (whether just multiaddrs because of dynamic DNS, or peerIDs due to rotation) then I lose all my advertisements and have to rely on the indexer figuring out that it should drop them
  • If I use the node without running the daemon (e.g. ipfs add while the daemon isn't running) the advertisement will get dropped. Potentially fixable, but at the cost of adding tiny advertisements which IIUC perform poorly with indexer syncing
  • I can only provide everything, not some subset (e.g. pins, roots, etc.)

IIUC a pattern that might be more useful for your use case (a pinning service with lots of data it wants to advertise to the indexers) looks like this:

  • Only advertise pinned data
  • On startup grab the list of pins and diff them against the list of pins you've already advertised
  • Advertise any new ones, de-advertise any missing ones, update the list of pins that have been advertised as you go
    • Note: advertising new ones means walking the DAG, collecting all the CIDs, and advertising them
  • Hook into the pinner (or put a wrapper around it) and, whenever a pin is successfully added or removed, modify the advertisement list

Some advantages:

  • It maps nicely to what the indexer is trying to do which is track groups of objects that want to be advertised/de-advertised together
  • It doesn't leave your state inconsistent on crash or require you to put together something like a journaling recovery system to deal with mismatched state on a per-block level (e.g. new block added/removed but not advertised/de-advertised)
  • It's a less costly diff than diffing your whole blockstore

Some disadvantages:

  • Might stress out the indexer system, since you may have millions of pins that are relatively small, and IIUC the indexer ingestion throughput is limited by needing to walk the linear advertisement list. These millions of small pins are a different workload than, say, a Filecoin storage deal for 32 GB of data
    • Could potentially be worked around by batching up the pins into groups for the indexer, but that means tracking more locally and needing to update these larger advertisements whenever a pin is added/removed from the batch. The updates might not mean transferring that much data, though, depending on how the advertisement DAG is structured
  • You still sort of end up with a journaling system on the pin level
    • Requires storing the list of advertised pins, which is itself a bit like a journaling system
    • Startup costs involve doing a diff on the pinsets, which could potentially be expensive; this could happen in the background if you're ok waiting a little while for the indexer subsystem to become functional
  • Doesn't handle advertising unpinned blocks on your node (including MFS, although that seems handlable separately if needed)

The above are just some suggestions, though; some of the error cases this design leaves unhandled might not be a problem to start with.
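
As a rough sketch of the startup diff step described above (all of the names here are hypothetical; the point is only the shape of the reconciliation, not a real API):

```go
package main

import "context"

// reconcilePins advertises pins that are pinned locally but not yet announced,
// retracts pins that were announced but are no longer pinned, and updates the
// advertised set as it goes.
func reconcilePins(
	ctx context.Context,
	pinned, advertised map[string]bool,
	announce, retract func(ctx context.Context, pin string) error,
) error {
	// advertise pins that are pinned locally but not yet announced
	for p := range pinned {
		if !advertised[p] {
			if err := announce(ctx, p); err != nil {
				return err
			}
			advertised[p] = true
		}
	}
	// retract pins that were announced but are no longer pinned
	for p := range advertised {
		if !pinned[p] {
			if err := retract(ctx, p); err != nil {
				return err
			}
			delete(advertised, p)
		}
	}
	return nil
}
```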

@@ -0,0 +1,420 @@
package main_test

Not sure if this is just a temporary thing for testing, but integration tests should live in test/integration.


// Store latest advertisement published from the chain
providerlog.Infow("Storing advertisement locally", "cid", c.String())
err = dataStore.Put(ctx, datastore.NewKey(latestAdvKey), c.Bytes())

Note: if you wanted to keep your advertisements safe from GC, it seems pretty straightforward to do: you just have to pass the root CID (as found from latestAdvKey) into the list of roots that GC should traverse and mark as safe from GC, because IIUC the advertisements are all linked together in a single DAG.
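
A hypothetical sketch of that, assuming latestAdvKey is the key used by the provider code above; wiring the result into whatever "extra GC roots" hook the node exposes is not shown here:

```go
package provider

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/ipfs/go-datastore"
)

// advertisementGCRoots loads the CID of the latest advertisement from the
// datastore so it can be added to the set of roots GC treats as protected;
// because the advertisements are linked into a single DAG, protecting the
// head protects the whole chain.
func advertisementGCRoots(ctx context.Context, ds datastore.Datastore) ([]cid.Cid, error) {
	raw, err := ds.Get(ctx, datastore.NewKey(latestAdvKey))
	if err == datastore.ErrNotFound {
		return nil, nil // nothing advertised yet, nothing extra to protect
	}
	if err != nil {
		return nil, err
	}
	c, err := cid.Cast(raw)
	if err != nil {
		return nil, err
	}
	return []cid.Cid{c}, nil
}
```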

@aschmahmann aschmahmann marked this pull request as draft March 18, 2022 20:27

BigLep commented May 6, 2022

@bkyarger : what are the next steps here? Are you going to incorporate the feedback, or can this be closed in light of https://github.com/web3-storage/ipfs-elastic-provider

@BigLep BigLep added the need/author-input Needs input from the original author label May 6, 2022

BigLep commented Sep 27, 2022

Closing this since I'm not aware of anyone driving this forward and there hasn't been activity for months.

@BigLep BigLep closed this Sep 27, 2022
@willscott

Yep, instead we're doing this via reframe - a working reframe publisher to indexers for kubo is at ipni/index-provider#271


Jorropo commented Sep 27, 2022

> Yep, instead we're doing this via reframe - a working reframe publisher to indexers for kubo

IMO this is bad because it requires doing diffing on the 24h republish that Kubo does.

We (Kubo maintainers) want to rework the advertising logic anyway because it's doing bad stuff (we have two different codepaths for providing and reproviding, and the first-provide code buffer is crashy), and I think if we have a proper state tracker it would be easier.
Then, if we already track the state of what is pinned, it will be easy to publish to indexers without doing diffing.

@willscott

@Jorropo It would be great to propose an update to the reframe provide path with better semantics that you think kubo could support: https://github.com/ipfs/specs/blob/main/reframe/REFRAME_KNOWN_METHODS.md#provide


Jorropo commented Sep 27, 2022

@willscott I don't want to start working on new reframe changes while the advertise refactor is not even planned yet. I have too many things to do right now.

I have a personal project to make a simple plugin that reads the pinned state and runs an indexer server after we refactor the announcement logic.
It would be more decentralized to publish on a pubsub channel than to an HTTP endpoint (even though, yes, you could just run your own HTTP endpoint, it's more work).

@willscott

We're getting a working version of this implementation with the expectation that publishers run their own endpoint and delegate their kubo reframe publishing to it.
That setup allows for minimal risk and isn't hard to set up for large publishers (e.g. big cluster instances, big publishers).
