-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
context/tracing in the blockstore/datastore pipeline #6803
Comments
We should be passing contexts through to the datastores but the refactor is going to be quite large and invasive and will affect multiple projects (ipfs, libp2p, filecoin, and probably quite a few more). We should look into this next year but I don't have time right now to handle all the fallout of this refactor. |
Overall it went fairly well with just go-graphsync to be worked around with `context.TODO(). Obviously this will need to be merged progressively, tagged and updated on the upper layers, but it does compile for me and seems to be working from a quick test. |
@MichaelMure : We're sorry this has been open for so long. Do you think you'd be available to work with a PL engineer later in May get this updated and pushed? We'll figure out the exact schedule if you're available, but wanted to see if that is possible. Thanks! |
@BigLep I might be but I'll have to see how things evolve on my side. However I'm not sure I would be so useful: the problem here is not technical. Ignoring go-graphsync for a minute, coding wise it's actually fairly easy. It took me a few hours to code and open all those PRs. The few packages I didn't touch are very likely to be as easy. The problem here is the coordination of all the project and their maintainers to land those changes without too much complications and someone to orchestrate that effort. It's not something I can do as an outsider. |
@BigLep that said, let me reiterate that this would be a big win for us, but not only. We are often facing performance issue or simply something not working right, and go-ipfs being a black box make it hard to understand and fix. Tracing has proved to be invaluable in other part of our stack to provide a quality service and react quickly on incidents. This is likely the feature that would help us the most. In addition, this would allow us to give you a better feedback from our deployed infrastructure, like pin-pointing precisely a performance issue. It's always easier to design or fix things with hard numbers. |
@MichaelMure : thanks! The need makes sense and agreed that a key dependency is the coordination from a maintainer to land these. I'm hoping we secure a time later in the month with someone like @aschmahmann who will do the merging but can also have you on hand if there rebasing/updates that need to be done along the way. Does that make sense (and please let me know if that's not an effective strategy)? |
@MichaelMure : quick update here that I'm trying to see about prioritizing this issue sooner as PL is focusing on its own efforts to improve the Gateways it's operating. @guseggert is a newer team member who may assign this to. Could you two connect and swap notes about the intended ways this will be helpful? I'll use that info for making the prioritization case. |
Sure, here is a few things where that could be helpful: tracing/observabilityThis would be a massive help for running IPFS in production. Really, I can't overstate that. At the moment, go-ipfs is basically a black box: requests comes in, response comes out. What happen in the middle is hard to observe. There is prometheus metrics, the The most critical subsystem in go-ipfs for performance and day to day operation is the data pipeline. Having a go context in there would allow to gradually and possibly independently add tracing instrumentation. The benefits would be:
Note also that this tracing would not necessarily be limited to go-ipfs: distributed tracing allow to propagate this tracing over the boundaries of connected systems (proxy, backend ...) which gives another dimension of observability. handling cancellation / reliabilityAt the moment, there is no cancellation in the data pipeline. This means that once a request is started, nothing will stop it, even if the original request is gone. Fixing that would allow to trim that unnecessary fat, reduce the load and improve reliability. It might also prevent a form of attack where heavy handling is triggered with minimal effort. request tagging / custom handlingAnother way this could be helpful is that the go context allow to carry metadata about the request, independently of all the layers it's going through. This means that one could for example carry over a domain specific logger or tag a request with an origin or some customer information. When reaching a lower level, those information could be used to prioritize requests, do caching differently, route to a specialized backend ... Importantly, this mechanism allows to extend the system independently without being bound by the Protocol Labs road map. This would lower your burdens and allow to explore more areas. engineering feedback / optimisationIn my experience, each time observability improve, new issues are unearthed. Proper instrumentation would give node operator and PL the tooling to discover those long standing performance/reliability issues. 95% of the work in software engineering is to figure out where the problem comes from. Once that's done and understood, fixing stuff becomes easy. node operator autonomyObservability would allow node operator to more easily figure out what is happening and in turn, rely less on PL to diagnose those issues and reduce the burden on the development team. Also discussed a bit at ipfs/roadmap#74 |
Well stated @MichaelMure . PL will respond back by EOD 2021-10-01. |
I missed not circling back on this last week since we did discuss it internally. @guseggert and team are going to pick this up. We're aiming to get this in the next go-ipfs release 0.11. The first step is to make the plan of how we can merge this in a progressive way. |
Will get more of the plan public, but internal scratchpad for thoughts on rolling this out is happening here: https://www.notion.so/protocollabs/Context-Plumbing-2b9fccf60db34ecb980b3068cabb9d50 |
Update: I'm plumbing these changes through as pseudoversions on branches, to make sure it all works before publishing any new versions. I will probably add some contexts in additional places, e.g. there are some interfaces in go-datastore that should have contexts too like CheckedDatastore, ScrubbedDatastore, GCDatastore, etc. Here's the order to update the modules:
(script to generate the order: https://gist.githubusercontent.com/guseggert/fe079f793cbea3158538bdaa9f50878b/raw/d87c0ef9f1593dd7ce9acb0b38e003e9f455ba88/gistfile1.txt) |
Update: currently working through (This module was not originally in the list, it was added after I fixed the script to generate the list.) |
2021-10-26 note:
|
I've plumbed the changes through using pseudoversions on |
Thanks for the update @guseggert ! A couple of things I think would be useful for visibility when you can get to them:
|
I am also adding the versioning workflows to all of these repos, which is taking some time to roll out (rerunning flaky tests, approving PRs, etc.). Here are the repos (ordered):
|
I've finished the majority of the plumbing, the rest are blocked on two things:
Once those are resolved, I can complete the plumbing and we will be ready to release go-ipfs v0.11.0-RC1 |
I discovered yesterday that |
closed by #8563 |
At Infura, we have to deal with the occasional performance issue. Even though we now have added tracing in the HTTP handler of
go-ipfs-cmds
(not upstreamed yet, it's opentracing and you are aiming for opencensus if I'm not mistaken?), the tracing essentially stop there.The majority of the requests (and the perf issues) involve the data pipeline but it is essentially a black box. To resolve that problem, a proper tracing instrumentation there would be very helpful, but that imply adding a go context to most if not all blockstore and datastore functions.
Is that something you would be interested to pursue ?
cc @dirkmc maybe ?
The text was updated successfully, but these errors were encountered: