
Added experimental index provider feature to go-ipfs #8771

Closed

Conversation


@bkyarger bkyarger commented Mar 8, 2022

Two modes are available via different flags: one imports all existing blocks, and a second handles ongoing advertising of newly added blocks.

No pinning is currently implemented, so there is no protection from GC at this time. Lots of code is borrowed from indexer-provider[1].

There are some hardcoded constants controlling the size of advertisements and chunks; these should probably be moved to configuration if this is to be supported long term.

[1] https://github.com/filecoin-project/index-provider

  • research pinning style design suggestions
  • move component init into core/node
  • add indexProvider field to IpfsNode
  • move to standard datastore
  • move to built-in graphsync, deal with linksystem
  • research dataTransfer context usage / behavior during shutdown
  • research registration with indexer, usage of private keys, whether it is necessary
  • use blockstore, not datastore
  • fix log level on errCh not open
  • use standard linksystem, put on IpfsNode if necessary
  • add mutex to providerBlockStore for concurrency assurances
  • deal with errors in PutMany of providerBlockStore
  • fix order of operations and mutex on Delete of providerBlockStore


welcome bot commented Mar 8, 2022

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the rest) and in its best form. Follow the code contribution guidelines if they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviewed, discussed and approved as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

@dchoi27 dchoi27 requested a review from aschmahmann March 8, 2022 03:57

bkyarger commented Mar 8, 2022

Added fixes for issues from make test_go_lint


bkyarger commented Mar 9, 2022

Rebased on top of #8756 and fixed go-ipfs-config refs

@bkyarger bkyarger changed the title Added experimental index provider feature Added experimental index provider feature to go-ipfs Mar 9, 2022
@bkyarger bkyarger force-pushed the feat/provide-indexer branch 16 times, most recently from ee94964 to 9d3f2ce on March 16, 2022 04:30
@BigLep BigLep added this to the Best Effort Track milestone Mar 18, 2022

@aschmahmann aschmahmann left a comment


@bkyarger I'm glad you were able to make some progress through here and put together a test demonstrating that things mostly work.

I added a bunch of code-related comments as well as a couple about design. The big design one above is mostly a basic info dump of what you might run into and some potential ways around it.

There's still work to do before this can be usable by upstream go-ipfs (mostly covered in the big design comment); however, getting your fork up and running for your use case sooner seems pretty doable.

Great to see this coming along, and I'm happy to help where I can here, although you might need some folks more familiar with the indexer codebase to review the indexer-specific logic and setup.


For the time being I have converted the PR to a draft to indicate it's not ready for merge into master at the moment and to make it easy to identify on our GitHub boards.

Comment on lines +553 to +556
// if idxProviderFull is set, we will provide all existing blocks we have. This
// is meant to be a one-time operation, and not meant to be used at the same time
// as the idxProviderSet - which advertises newly added blocks on a timer
idxProviderNode := &ipfsIdxProviderNode{node}
if idxProviderFullSet {

New components, whether configurable or not, being instantiated in daemon.go is inconsistent with how we normally initialize things, which may lead to issues, as with the instantiation of node.BaseBlocks below.

Instead this logic should live in core/node with the other components. If you want this to be configurable with a flag rather than just a config file option, you can use the IPNS over PubSub (search for ipnsps) example for illustration, since it also has a CLI flag. You can also add an indexer provider field to IpfsNode in case you want to write some extra commands that let you introspect on it, or just want to work with it once it's running.

Happy to give more guidance here if needed.
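
For illustration, here's a minimal sketch of what an fx-wired optional component in core/node could look like, in the spirit of the ipnsps option. The names below (IndexProvider, NewIndexProvider, IndexProviderOption) are made up for the sketch, not the actual go-ipfs API:

```go
package node

import (
	"context"

	"go.uber.org/fx"

	"github.com/libp2p/go-libp2p-core/host"
)

// IndexProvider is a hypothetical wrapper around the indexer-provider engine.
type IndexProvider struct {
	// engine, datastore, linksystem, etc. would live here
}

// NewIndexProvider constructs the provider from dependencies supplied by fx
// and hooks its shutdown into the node lifecycle.
func NewIndexProvider(lc fx.Lifecycle, h host.Host) (*IndexProvider, error) {
	p := &IndexProvider{}
	lc.Append(fx.Hook{
		OnStop: func(ctx context.Context) error {
			// close the provider engine cleanly on daemon shutdown
			return nil
		},
	})
	return p, nil
}

// IndexProviderOption gates the component on a config/flag value, mirroring
// how optional components such as ipnsps are wired up.
func IndexProviderOption(enable bool) fx.Option {
	if !enable {
		return fx.Options()
	}
	return fx.Provide(NewIndexProvider)
}
```

With something like that in place, daemon.go only needs to read the flag and pass it into the options, and commands can reach the running instance through a field on IpfsNode.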

Comment on lines +111 to +118
dataStore, err = leveldb.NewDatastore(dataStorePath, nil)
if err != nil {
providerlog.Errorf("Error creating new leveldb datastore: %v", err)
return err
}
// TODO why can't I use this datastore instead of creating my own? When I try I get a panic in migrations code
// with an empty array being indexed
// dataStore := node.DataStore()

Yep, this should be able to use the standard datastore. Probably moving things to the standard initialization pattern will help you out here.
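
As a tiny sketch of that suggestion (assuming this setup code has access to the *core.IpfsNode; the helper name is made up for illustration):

```go
import (
	datastore "github.com/ipfs/go-datastore"

	"github.com/ipfs/go-ipfs/core"
)

// providerDatastore reuses the node's repo datastore instead of opening a
// separate leveldb store just for the index provider.
func providerDatastore(node *core.IpfsNode) datastore.Datastore {
	return node.Repo.Datastore()
}
```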

Comment on lines +135 to +137
// TODO currently can't use this version since it has its own linksystem - could swap that out
// based on idx provider being enabled, but saving that for a later time
// gs := node.GraphExchange()

This also means that if a user had enabled both graphsync and the index provider, they'd start clobbering each other, since they're both handlers for the same protocol.

// based on idx provider being enabled, but saving that for a later time
// gs := node.GraphExchange()

tp := gstransport.NewTransport(node.PeerHost().ID(), gs)

Not familiar with the go-datatransfer business here; review on this will have to come from @willscott, @hannahhoward, or someone else familiar with it.

providerlog.Debugf("creating link system")
lsys = mkLinkSystem()

gsNet := gsnetwork.NewFromLibp2pHost(node.PeerHost())

All the places you use node. will go away/change if you move this function into the main dependency injection area of the codebase instead of running it after the IpfsNode creation.

Comment on lines +134 to +137
if bs.keys[k] {
delete(bs.keys, k)
}
return bs.backend.DeleteBlock(ctx, k)

out of order + mutex
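
Spelling that out, a sketch of the fixed method might look like the following (the package placement and the mu field are assumptions, since providerBlockStore has no mutex yet; only the fields relevant to DeleteBlock are shown):

```go
package provider

import (
	"context"
	"sync"

	"github.com/ipfs/go-cid"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// providerBlockStore sketch: the mu field guarding the keys map is the
// addition suggested by the review.
type providerBlockStore struct {
	mu      sync.Mutex
	keys    map[cid.Cid]bool
	backend blockstore.Blockstore
}

// DeleteBlock deletes from the backend first and only then drops the key from
// the in-memory set, so a failed backend delete cannot leave the two out of sync.
func (bs *providerBlockStore) DeleteBlock(ctx context.Context, k cid.Cid) error {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	if err := bs.backend.DeleteBlock(ctx, k); err != nil {
		return err
	}
	delete(bs.keys, k)
	return nil
}
```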

github.com/jbenet/go-temp-err-catcher v0.1.0
github.com/jbenet/goprocess v0.1.4
github.com/libp2p/go-doh-resolver v0.4.0
github.com/libp2p/go-libp2p v0.16.0
github.com/libp2p/go-libp2p v0.18.0-rc3

Even for your testing you should use either the released go-libp2p v0.18.0 (not yet tested with go-ipfs pending #8680) or a previous version.

Comment on lines +187 to +188
cmds.StringOption(enableIndexerProviderKwd, "Enable experimental indexer provider feature and provide new Cids that are added, with indexer address"),
cmds.StringOption(indexerProviderFullKwd, "Enable experimental indexer provider feature and provide all locally stored Cids, with indexer address"),

Going to use this area to talk about some potential design issues because I think a thread will be useful here and GitHub doesn't provide them unless you anchor to some code.

My understanding of the design here is that you're planning to run this fork by:

  1. Running ipfs daemon --index-provider-full-import-experiment once, waiting for everything to finish by watching some logs
  2. Killing the node and from then on running ipfs daemon --index-provider-experiment, which will periodically notice any new blocks you have added locally and advertise them

Some things that this misses, some of which you might be ok with and some not. (Note: most if not all of these are likely blockers to merging the code upstream, but we can go a step at a time here)

  • If the node crashes before I've advertised some new data, I will never advertise it unless I run a full import again, which is expensive for both the client and the server due to how the data is batched up (i.e. there is unlikely to be any overlap with the previous advertisements)
  • If I delete data I don't "unadvertise" it. This is an easy enough fix, although the issue above remains
  • If I change my identifier (whether just multiaddrs because of dynamic DNS, or peerIDs due to rotation) then I lose all my advertisements and have to rely on the indexer figuring out that it should drop them
  • If I use the node without running the daemon (e.g. ipfs add while the daemon isn't running) the advertisement will get dropped. Potentially fixable, but at the cost of adding tiny advertisements which IIUC perform poorly with indexer syncing
  • I can only provide everything, not some subset (e.g. pins, roots, etc.)

IIUC a pattern that might be more useful for your use case (a pinning service with lots of data it wants to advertise to the indexers) looks like this:

  • Only advertise pinned data
  • On startup grab the list of pins and diff them against the list of pins you've already advertised
  • Advertise any new ones, de-advertise any missing ones, update the list of pins that have been advertised as you go
    • Note: advertising new ones means walking the DAG, collecting all the CIDs, and advertising them
  • Hook into the pinner (or put a wrapper around it) and, whenever a pin is successfully added or removed, modify the advertisement list

Some advantages:

  • It maps nicely to what the indexer is trying to do which is track groups of objects that want to be advertised/de-advertised together
  • It doesn't leave your state inconsistent on crash or require you to put together something like a journaling recovery system to deal with mismatched state on a per-block level (e.g. new block added/removed but not advertised/de-advertised)
  • It's a less costly diff than diffing your whole blockstore

Some disadvantages:

  • Might stress out the indexer system, since you may have millions of pins that are relatively small, and IIUC the indexer ingestion throughput is limited by needing to walk the linear advertisement list. These millions of small pins are a different workload than, say, a Filecoin storage deal for 32 GB of data
    • Could potentially be worked around by batching up the pins into groups for the indexer, but that means tracking more locally and needing to update these larger advertisements whenever a pin is added/removed from the batch. The updates might not mean transferring that much data, though, depending on how the advertisement DAG is structured
  • You still sort of end up with a journaling system on the pin level
    • Requires storing the list of advertised pins, which is itself a bit like a journaling system
    • Startup costs involve doing a diff on the pinsets, which could potentially be expensive; this could happen in the background if you're ok waiting a little while for the indexer subsystem to become functional
  • Doesn't handle advertising unpinned blocks on your node (including MFS, although that seems handlable separately if needed)

The above are just some suggestions, though; some of the error cases this design leaves unhandled might not be a problem to start with.
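
As a rough sketch of the startup diff step described above (all of the names here are hypothetical; the point is only the shape of the reconciliation, not a real API):

```go
package main

import "context"

// reconcilePins advertises pins that are pinned locally but not yet announced,
// retracts pins that were announced but are no longer pinned, and updates the
// advertised set as it goes.
func reconcilePins(
	ctx context.Context,
	pinned, advertised map[string]bool,
	announce, retract func(ctx context.Context, pin string) error,
) error {
	// advertise pins that are pinned locally but not yet announced
	for p := range pinned {
		if !advertised[p] {
			if err := announce(ctx, p); err != nil {
				return err
			}
			advertised[p] = true
		}
	}
	// retract pins that were announced but are no longer pinned
	for p := range advertised {
		if !pinned[p] {
			if err := retract(ctx, p); err != nil {
				return err
			}
			delete(advertised, p)
		}
	}
	return nil
}
```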

@@ -0,0 +1,420 @@
package main_test

Not sure if this is just a temporary thing for testing, but integration tests should live in test/integration.


// Store latest advertisement published from the chain
providerlog.Infow("Storing advertisement locally", "cid", c.String())
err = dataStore.Put(ctx, datastore.NewKey(latestAdvKey), c.Bytes())

Note: if you wanted to keep your advertisements safe from GC, it seems pretty straightforward to do: you just have to pass the root CID (as found from latestAdvKey) into the list of roots that GC should traverse and mark as safe from GC, because IIUC the advertisements are all linked together in a single DAG.
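
A hypothetical sketch of that, assuming latestAdvKey is the key used by the provider code above; wiring the result into whatever "extra GC roots" hook the node exposes is not shown here:

```go
package provider

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/ipfs/go-datastore"
)

// advertisementGCRoots loads the CID of the latest advertisement from the
// datastore so it can be added to the set of roots GC treats as protected;
// because the advertisements are linked into a single DAG, protecting the
// head protects the whole chain.
func advertisementGCRoots(ctx context.Context, ds datastore.Datastore) ([]cid.Cid, error) {
	raw, err := ds.Get(ctx, datastore.NewKey(latestAdvKey))
	if err == datastore.ErrNotFound {
		return nil, nil // nothing advertised yet, nothing extra to protect
	}
	if err != nil {
		return nil, err
	}
	c, err := cid.Cast(raw)
	if err != nil {
		return nil, err
	}
	return []cid.Cid{c}, nil
}
```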

@aschmahmann aschmahmann marked this pull request as draft March 18, 2022 20:27

BigLep commented May 6, 2022

@bkyarger : what are the next steps here? Are you going to incorporate the feedback, or can this be closed in light of https://github.com/web3-storage/ipfs-elastic-provider

@BigLep BigLep added the need/author-input Needs input from the original author label May 6, 2022

BigLep commented Sep 27, 2022

Closing this since I'm not aware of anyone driving this forward and there hasn't been activity for months.

@BigLep BigLep closed this Sep 27, 2022
@willscott

Yep, instead we're doing this via reframe - a working reframe publisher to indexers for kubo is at ipni/index-provider#271


Jorropo commented Sep 27, 2022

> Yep, instead we're doing this via reframe - a working reframe publisher to indexers for kubo

IMO this is bad because it requires doing diffing on the 24h republish that Kubo does.

We (Kubo maintainers) want to rework the advertising logic anyway because it's doing bad stuff (we have two different codepaths for providing and reproviding, and the first-provide code buffer is crashy), and I think if we have a proper state tracker it would be easier.
Then, if we already track the state of what is pinned, it will be easy to publish to indexers without doing diffing.

@willscott

@Jorropo It would be great to propose an update to the reframe provide path with better semantics that you think kubo could support: https://github.com/ipfs/specs/blob/main/reframe/REFRAME_KNOWN_METHODS.md#provide


Jorropo commented Sep 27, 2022

@willscott I don't want to start working on new reframe changes while the advertise refactor is not even planned yet. I have too many things to do right now.

I have a personal project to make a simple plugin that reads the pinned state and runs an indexer server after we refactor the announcement logic.
It would be more decentralized to publish on a pubsub channel than to an HTTP endpoint (even though, yes, you could just run your own HTTP endpoint, it's more work).

@willscott

We're getting a working version of this implementation with the expectation that publishers run their own endpoint and delegate their kubo reframe publishing to it.
That setup allows for minimal risk and isn't hard to set up for large publishers (e.g. big cluster instances, big publishers).
