
context/tracing in the blockstore/datastore pipeline #6803

Closed
Tracked by #8343
MichaelMure opened this issue Dec 17, 2019 · 27 comments
@MichaelMure
Contributor

At Infura, we have to deal with the occasional performance issue. Even though we have now added tracing in the HTTP handler of go-ipfs-cmds (not upstreamed yet; it's opentracing, and you are aiming for opencensus if I'm not mistaken?), the tracing essentially stops there.

The majority of the requests (and of the perf issues) involve the data pipeline, but it is essentially a black box. To resolve that problem, proper tracing instrumentation there would be very helpful, but that implies adding a Go context to most if not all blockstore and datastore functions.

Is that something you would be interested in pursuing?

cc @dirkmc maybe ?

@MichaelMure MichaelMure added the kind/enhancement A net-new feature or improvement to an existing feature label Dec 17, 2019
@Stebalien
Member

We should be passing contexts through to the datastores but the refactor is going to be quite large and invasive and will affect multiple projects (ipfs, libp2p, filecoin, and probably quite a few more).

We should look into this next year but I don't have time right now to handle all the fallout of this refactor.

This was referenced Jul 22, 2020
@MichaelMure
Contributor Author

MichaelMure commented Jul 22, 2020

master issue:

Datastore:

Blockstore:

IPFS:

libp2p:

@MichaelMure
Contributor Author

Overall it went fairly well, with just go-graphsync needing to be worked around with `context.TODO()`.

Obviously this will need to be merged progressively, tagged, and updated in the upper layers, but it compiles for me and seems to work from a quick test.
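The `context.TODO()` workaround mentioned above can be illustrated like this (the function names are made up, not go-graphsync's actual API): at the seam between migrated and unmigrated code, `context.TODO()` acts as an explicit, greppable placeholder until a real context can be plumbed through.

```go
package main

import (
	"context"
	"fmt"
)

// New-style API: takes a context. (Illustrative, not a real API.)
func fetchBlock(ctx context.Context, key string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "data-for-" + key, nil
}

// Legacy caller that has no context of its own yet. context.TODO()
// documents that a real context should be threaded through here once
// the surrounding code is migrated; it is easy to find and remove later.
func legacyFetch(key string) (string, error) {
	return fetchBlock(context.TODO(), key)
}

func main() {
	v, _ := legacyFetch("k")
	fmt.Println(v)
}
```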

@BigLep
Contributor

BigLep commented May 3, 2021

@MichaelMure : We're sorry this has been open for so long. Do you think you'd be available to work with a PL engineer later in May to get this updated and pushed? We'll figure out the exact schedule if you're available, but we wanted to see if that is possible. Thanks!

@MichaelMure
Contributor Author

MichaelMure commented May 3, 2021

@BigLep I might be but I'll have to see how things evolve on my side.

However, I'm not sure I would be that useful: the problem here is not technical. Ignoring go-graphsync for a minute, coding-wise it's actually fairly easy. It took me a few hours to write the code and open all those PRs. The few packages I didn't touch are very likely to be just as easy.

The problem is coordinating all the projects and their maintainers to land those changes without too many complications, and finding someone to orchestrate that effort. It's not something I can do as an outsider.

@MichaelMure
Contributor Author

@BigLep that said, let me reiterate that this would be a big win for us, and not only for us. We often face performance issues, or simply something not working right, and go-ipfs being a black box makes that hard to understand and fix. Tracing has proved invaluable in other parts of our stack for providing a quality service and reacting quickly to incidents. This is likely the feature that would help us the most.

In addition, this would allow us to give you better feedback from our deployed infrastructure, like precisely pinpointing a performance issue. It's always easier to design or fix things with hard numbers.

@BigLep
Contributor

BigLep commented May 7, 2021

@MichaelMure : thanks! The need makes sense, and agreed that a key dependency is coordination from a maintainer to land these. I'm hoping we secure a time later in the month with someone like @aschmahmann, who will do the merging but can also have you on hand if there are rebases or updates that need to be done along the way. Does that make sense (and please let me know if that's not an effective strategy)?

@BigLep
Contributor

BigLep commented Sep 28, 2021

@MichaelMure : quick update here: I'm trying to get this issue prioritized sooner, as PL is focusing on its own efforts to improve the Gateways it operates. @guseggert is a newer team member we may assign this to. Could you two connect and swap notes about the ways this is intended to be helpful? I'll use that info to make the prioritization case.

@MichaelMure
Contributor Author

Sure, here are a few areas where this could be helpful:

tracing/observability

This would be a massive help for running IPFS in production. Really, I can't overstate that. At the moment, go-ipfs is basically a black box: requests come in, responses come out. What happens in the middle is hard to observe. There are Prometheus metrics, the diag command, and pprof, but those are really surface level. This makes it difficult to address emergency situations, and to understand and plan for the future.

The most critical subsystem in go-ipfs for performance and day-to-day operation is the data pipeline. Having a Go context there would allow tracing instrumentation to be added gradually, and possibly independently. The benefits would be:

  • full visibility into operational activities: timing, request counts, load, delays, inter-dependencies ...
  • the ability to slice and dice data, isolate specific requests or classes of requests, and figure out patterns ...
  • the ability to track errors and problems, isolate their origin, and resolve issues in a far simpler and more efficient manner

Note also that this tracing would not necessarily be limited to go-ipfs: distributed tracing can propagate across the boundaries of connected systems (proxy, backend ...), which adds another dimension of observability.

handling cancellation / reliability

At the moment, there is no cancellation in the data pipeline. This means that once a request is started, nothing will stop it, even if the original request is gone. Fixing that would trim unnecessary fat, reduce load, and improve reliability. It might also prevent a form of attack where heavy processing is triggered with minimal effort.

request tagging / custom handling

Another way this could be helpful is that the Go context can carry metadata about a request, independently of all the layers it goes through. This means that one could, for example, carry a domain-specific logger, or tag a request with an origin or some customer information. At a lower level, that information could be used to prioritize requests, cache differently, route to a specialized backend ...

Importantly, this mechanism allows the system to be extended independently, without being bound by the Protocol Labs roadmap. This would lower your burden and allow more areas to be explored.

engineering feedback / optimisation

In my experience, each time observability improves, new issues are unearthed. Proper instrumentation would give node operators and PL the tooling to discover those long-standing performance/reliability issues. 95% of the work in software engineering is figuring out where a problem comes from. Once that's done and understood, fixing things becomes easy.

node operator autonomy

Observability would allow node operators to more easily figure out what is happening and, in turn, rely less on PL to diagnose issues, reducing the burden on the development team.

Also discussed a bit at ipfs/roadmap#74

@BigLep
Contributor

BigLep commented Sep 29, 2021

Well stated, @MichaelMure. PL will respond back by EOD 2021-10-01.

@BigLep BigLep modified the milestones: go-ipfs 0.13, go-ipfs 0.11 Oct 5, 2021
@BigLep
Contributor

BigLep commented Oct 5, 2021

Apologies for not circling back on this last week, since we did discuss it internally. @guseggert and team are going to pick this up. We're aiming to get this into the next go-ipfs release, 0.11. The first step is to plan how we can merge this in a progressive way.

@BigLep
Contributor

BigLep commented Oct 12, 2021

Will get more of the plan public, but internal scratchpad for thoughts on rolling this out is happening here: https://www.notion.so/protocollabs/Context-Plumbing-2b9fccf60db34ecb980b3068cabb9d50

@guseggert
Contributor

guseggert commented Oct 13, 2021

Update: I'm plumbing these changes through as pseudoversions on branches, to make sure it all works before publishing any new versions.

I will probably add contexts in additional places; e.g., there are some interfaces in go-datastore that should have contexts too, like CheckedDatastore, ScrubbedDatastore, GCDatastore, etc.

Here's the order to update the modules:

  • github.com/ipfs/go-datastore
  • github.com/ipfs/go-ds-badger
  • github.com/ipfs/go-ds-leveldb
  • github.com/libp2p/go-libp2p-peerstore
  • github.com/libp2p/go-libp2p-swarm
  • github.com/libp2p/go-libp2p-autonat
  • github.com/libp2p/go-libp2p-circuit
  • github.com/libp2p/go-libp2p-discovery
  • github.com/libp2p/go-libp2p
  • github.com/libp2p/go-libp2p-noise
  • github.com/ipfs/go-ipfs-ds-help
  • github.com/ipfs/go-ipfs-blockstore
  • github.com/ipfs/go-ipfs-exchange-interface
  • github.com/ipfs/go-ipfs-routing
  • github.com/ipfs/go-bitswap
  • github.com/ipfs/go-ipfs-exchange-offline
  • github.com/ipfs/go-blockservice
  • github.com/ipfs/go-merkledag
  • github.com/ipfs/go-unixfs
  • github.com/ipfs/go-fetcher
  • github.com/ipfs/go-unixfsnode
  • github.com/libp2p/go-libp2p-kbucket
  • github.com/ipfs/go-path
  • github.com/ipfs/go-ipns
  • github.com/libp2p/go-libp2p-xor
  • github.com/ipfs/interface-go-ipfs-core
  • github.com/libp2p/go-libp2p-kad-dht
  • github.com/libp2p/go-libp2p-gostream
  • github.com/libp2p/go-libp2p-pubsub
  • github.com/ipfs/go-ds-flatfs
  • github.com/ipfs/go-ds-measure
  • github.com/ipfs/go-filestore
  • github.com/ipfs/go-graphsync
  • github.com/ipfs/go-ipfs-config
  • github.com/ipfs/go-ipfs-pinner
  • github.com/ipfs/go-ipfs-provider
  • github.com/ipfs/go-mfs
  • github.com/ipfs/go-namesys
  • github.com/ipld/go-car
  • github.com/libp2p/go-libp2p-http
  • github.com/libp2p/go-libp2p-pubsub-router
  • github.com/ipfs/go-ipfs

(script to generate the order: https://gist.githubusercontent.com/guseggert/fe079f793cbea3158538bdaa9f50878b/raw/d87c0ef9f1593dd7ce9acb0b38e003e9f455ba88/gistfile1.txt)
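As background on the pseudoversion approach used here: pointing a consumer module at a dependency's unreleased branch tip (e.g. with `go get github.com/ipfs/go-datastore@feat/context`) makes the go tool synthesize a pseudo-version in go.mod, which can later be replaced by a tagged release. An illustrative go.mod fragment (the module path is real; the base version, date, and commit hash are made up):

```
require github.com/ipfs/go-datastore v0.4.7-0.20211013120000-0123456789ab
```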

@guseggert
Contributor

Update: currently working through libp2p/go-libp2p-swarm, as it depends on an older version of go-libp2p-peerstore, and the newer version has non-trivial backwards-incompatible changes.

(This module was not originally in the list, it was added after I fixed the script to generate the list.)

@BigLep
Contributor

BigLep commented Oct 26, 2021

2021-10-26 note:

  1. @guseggert will update this
  2. We'll link the new PRs
  3. We'll close out the old ones

@guseggert
Contributor

I've plumbed the changes through using pseudoversions on feat/context branches across all the repos, and fixed the issues that came up. I've added contexts to quite a few more interfaces, so I re-did the plumbing work. Now I'm beginning to cut releases and plumb those through. Libp2p has some hole-punching changes in flight, and there are also sharding changes in flight that could cause issues with the rollout here; if I run into any problems, I'm going to plumb through a pseudoversion instead of a release version, and that can be cleaned up separately after the issues are resolved.

@BigLep
Contributor

BigLep commented Oct 26, 2021

Thanks for the update @guseggert ! A couple of things I think would be useful for visibility when you can get to them:

  1. A list of all the repos we're going to touch (and in what order) with checkboxes to show current status?
  2. The PRs next to the list as they get created

@guseggert
Contributor

guseggert commented Oct 27, 2021

I am also adding the versioning workflows to all of these repos, which is taking some time to roll out (rerunning flaky tests, approving PRs, etc.).

Here are the repos (ordered):

@guseggert
Contributor

I've finished the majority of the plumbing; the rest is blocked on two things:

Once those are resolved, I can complete the plumbing and we will be ready to release go-ipfs v0.11.0-RC1

@guseggert
Contributor

I discovered yesterday that go-ipfs-blockstore is using the wrong version of go-ipfs-ds-help: it was inadvertently upgraded to v1, so I need to go back, downgrade go-ipfs-blockstore@v0 to use go-ipfs-ds-help@v0, and then re-plumb.
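A note on why this mismatch is easy to miss: under Go's semantic import versioning, only major versions 2 and above get a distinct import path suffix (`/v2`, `/v3`, ...), so a v0 and a v1 module share the same import path, and an upgrade can silently jump the major version. The fix is to pin the require line back to the v0 series, sketched here as a go.mod fragment (the exact version number is illustrative):

```
require github.com/ipfs/go-ipfs-ds-help v0.1.1 // v0 and v1 share an import path; pin the v0 line
```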

@aschmahmann
Contributor

closed by #8563
