Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comprehensive tracing #8578

Open
3 of 9 tasks
guseggert opened this issue Dec 2, 2021 · 22 comments · Fixed by #8595
Open
3 of 9 tasks

Add comprehensive tracing #8578

guseggert opened this issue Dec 2, 2021 · 22 comments · Fixed by #8595
Assignees

Comments

@guseggert
Copy link
Contributor

guseggert commented Dec 2, 2021

Master issue for adding comprehensive tracing throughout the go-ipfs codebase.

Some requirements:

  • End users and gateway operators should be able to trace costly work units throughout the various parts of the go-ipfs internals
  • Should fit into a larger distributed tracing story if go-ipfs is part of a larger system
  • Traces all the types of subsystems
    • Synchronous work done as part of requests such as gateway requests
    • Async workers that service sync requests
    • Async workers that aren't involved in synchronous requests
      • DHT background work
      • Work done at daemon startup
    • libp2p
  • Pluggable into different tracing backends
  • Easy to instrument existing un-instrumented code
  • Easy to instrument new code such as plugins, new interface implementations, etc.
  • Random sampling knob to tune perf/observability tradeoff

I plan to implement this against OpenTelemetry. Tactically I want to start at a high level (e.g. gateway reqs) and iteratively work down into the guts, driven by trying to answer questions about real requests so that we are instrumenting the right things.

Some related issues:

Todos:

  • Add tracer provider in go-ipfs w/ exporter configs, and add gateway and CoreAPI tracing
  • Deploy to public gateways
  • Add tracing in subsystems (incomplete list)
    • go-bitswap
    • go-namesys
    • go-datastore
    • go-libp2p-kad-dht
    • libp2p subsystems
  • The Provider API needs a context plumbed through to go-ipfs-provider
@guseggert guseggert self-assigned this Dec 2, 2021
@BigLep
Copy link
Contributor

BigLep commented Dec 6, 2021

@guseggert: a couple of quick questions on this:

  1. Do we have a sense of how much effort from you this will take in terms of ideal dev days/weeks? (I'm asking so can figure out when this may be of benefit for Bifrost)
  2. After this work is done, what additional work will Bifrost need to do before they're able to see tracing data of a specific request?

@guseggert
Copy link
Contributor Author

guseggert commented Dec 6, 2021

  1. We can start getting local barebones traces back with 3 days of dev effort, though they won't go deep into the stack (just the glue in go-ipfs). Also not including the initial collector setup below, which is required for tracing on deployed infra. Adding additional spans deeper in the stack is trivial, with some minor overhead for plumbing that back into go-ipfs...in aggregate it'd likely take a couple weeks to add basic instrumentation to major components like Bitswap, DHT, datastores, other libp2p pieces, etc., and get them all plumbed up into go-ipfs.

  2. I met with @coryschwartz and @frrist last week, we discussed 2) (amongst other things). I'd estimate the effort to integrate this with our infra at 1 week. The consensus was to use Grafana Tempo, which is a hosted OpenTelemetry collector, which Grafana has a UI on top of (see https://grafana.com/docs/tempo/latest/getting-started/tempo-in-grafana/). We'd follow the onboarding instructions here: https://grafana.com/orgs/protocollabs/hosted-traces/76.

@BigLep
Copy link
Contributor

BigLep commented Dec 6, 2021

@guseggert : thanks! With local barebones traces, what kind of metrics will we get?

@guseggert
Copy link
Contributor Author

guseggert commented Dec 6, 2021

We can get timing info and some basic request metadata for BlockStore, BlockService, DAGService, PeerStore, Pinner, and IPNS, possibly others.

Simple example, here's a req I recorded locally showing what's happening as go-ipfs builds an index page for dist.ipfs.io, from a few spans in the gateway request handler and core

2021-12-06T14:50:52

@proto-sid
Copy link

cc: @gmasgras for visibility

@guseggert guseggert linked a pull request Dec 8, 2021 that will close this issue
@guseggert guseggert removed a link to a pull request Dec 8, 2021
@guseggert
Copy link
Contributor Author

Draft PR for gateway tracing: #8595

@guseggert
Copy link
Contributor Author

Netops has a summit this week they are busy with, so things will be slow this week on infra side. @thattommyhall is working on code changes to install a collector on the public gateway hosts.

@BigLep BigLep linked a pull request Dec 16, 2021 that will close this issue
@proto-sid
Copy link

@thattommyhall Could you share the current state of this implementation?

@BigLep BigLep moved this to 🥞 Todo in IPFS Shipyard Team Mar 3, 2022
@BigLep BigLep added this to the go-ipfs 0.13 milestone Mar 3, 2022
@lidel
Copy link
Member

lidel commented Mar 10, 2022

FYI i gathered some feedback around HTTP-request metrics in #8441

@guseggert guseggert moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team Mar 21, 2022
@BigLep
Copy link
Contributor

BigLep commented Mar 24, 2022

2022-03-24 conversation: @guseggert is going to define the done criteria for the current attack at this. He'll also list potential followups.

ipfs/go-datastore#188 is a related PR.

@guseggert guseggert moved this from 🏃‍♀️ In Progress to 🔎 In Review in IPFS Shipyard Team Mar 26, 2022
@guseggert guseggert moved this from 🔎 In Review to 🏃‍♀️ In Progress in IPFS Shipyard Team Mar 26, 2022
@iand
Copy link
Contributor

iand commented Mar 31, 2022

I would suggest decoupling the technical implementation from the infrastructure deployment since having tracing capability earlier is very valuable. I reviewed #8595 and left a few minor comments, but based on my testing I consider that PR essentially done. It's enough to start gathering traces on a small scale and threading tracing code throughout lower level libraries is going to be extremely valuable.

However, getting useful traces at scale on our infrastructure needs more work. I don't think it's feasible or desirable to gather traces from all requests (sample rate 1) due to the volume and the utility of sampling at lower levels is limited if it doesn't include the most interesting samples. I'd like to see some more sampling options added so we can restrict by top level API operation (e.g. Gateway.Request) and by trace time (e.g. only record traces for requests with duration greater than 5s)

I used the go-ipfs branch and added tracing to go-namesys, go-path and go-blockservice to experiment a bit. The results are really useful:

sample trace

bcaeefd4fd6df40ca4f93d8ffaf83a5a

  • Gateway.Request (568.562044ms) http.target=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
    • CoreAPI.ResolvePath (955.569µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
      • go-namesys/resolve.ResolveIPNS (324.512µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
        • go-namesys.mpns.Resolve (313.499µs) name=/ipns/en.wikipedia-on-ipfs.org
          • go-namesys.resolve (307.04µs) name=/ipns/en.wikipedia-on-ipfs.org
            • go-namesys.resolveAsync (25.813µs) name=/ipns/en.wikipedia-on-ipfs.org
              • go-namesys.mpns.resolveOnceAsync (15.89µs) name=/ipns/en.wikipedia-on-ipfs.org
                • go-namesys.DNSResolver.resolveOnceAsync (6.621µs) name=en.wikipedia-on-ipfs.org
                  • go-namesys.DNSResolver.workDomain (224.793µs) name=_dnslink.en.wikipedia-on-ipfs.org.
                  • go-namesys.DNSResolver.workDomain (887.368µs) name=en.wikipedia-on-ipfs.org.
      • go-path.basicResolver.ResolveToLastNode (616.415µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
        • go-path.basicResolver.resolveNodes (272.635µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
          • go-blockservice/Session.GetBlock (36.319µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
          • go-blockservice/Session.GetBlock (32.484µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
          • go-blockservice/Session.GetBlock (25.078µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
          • go-blockservice/Session.GetBlock (23.415µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
          • go-blockservice/Session.GetBlock (21.946µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
    • CoreAPI.BlockAPI.Get (28.283µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
      • CoreAPI.ResolvePath (1.758µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
      • go-blockservice/blockService.GetBlock (20.343µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
    • CoreAPI.ResolvePath (321.899µs) path=/ipns/en.wikipedia-on-ipfs.org
      • go-namesys/resolve.ResolveIPNS (305.296µs) path=/ipns/en.wikipedia-on-ipfs.org
        • go-namesys.mpns.Resolve (298.594µs) name=/ipns/en.wikipedia-on-ipfs.org
          • go-namesys.resolve (294.529µs) name=/ipns/en.wikipedia-on-ipfs.org
            • go-namesys.resolveAsync (16.506µs) name=/ipns/en.wikipedia-on-ipfs.org
              • go-namesys.mpns.resolveOnceAsync (12.215µs) name=/ipns/en.wikipedia-on-ipfs.org
                • go-namesys.DNSResolver.resolveOnceAsync (5.279µs) name=en.wikipedia-on-ipfs.org
                  • go-namesys.DNSResolver.workDomain (209.855µs) name=_dnslink.en.wikipedia-on-ipfs.org.
                  • go-namesys.DNSResolver.workDomain (989.374µs) name=en.wikipedia-on-ipfs.org.
      • go-path.basicResolver.ResolveToLastNode (3.526µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
    • CoreAPI.ResolvePath (358.244µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki
      • go-namesys/resolve.ResolveIPNS (265.91µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki
        • go-namesys.mpns.Resolve (256.527µs) name=/ipns/en.wikipedia-on-ipfs.org
          • go-namesys.resolve (250.682µs) name=/ipns/en.wikipedia-on-ipfs.org
            • go-namesys.resolveAsync (24.89µs) name=/ipns/en.wikipedia-on-ipfs.org
              • go-namesys.mpns.resolveOnceAsync (18.147µs) name=/ipns/en.wikipedia-on-ipfs.org
                • go-namesys.DNSResolver.resolveOnceAsync (6.997µs) name=en.wikipedia-on-ipfs.org
                  • go-namesys.DNSResolver.workDomain (162.87µs) name=_dnslink.en.wikipedia-on-ipfs.org.
                  • go-namesys.DNSResolver.workDomain (651.886µs) name=en.wikipedia-on-ipfs.org.
      • go-path.basicResolver.ResolveToLastNode (80.249µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki
        • go-path.basicResolver.resolveNodes (65.656µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
          • go-blockservice/Session.GetBlock (32.781µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
    • CoreAPI.ResolvePath (942.509µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco
      • go-namesys/resolve.ResolveIPNS (245.561µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco
        • go-namesys.mpns.Resolve (238.159µs) name=/ipns/en.wikipedia-on-ipfs.org
          • go-namesys.resolve (234.365µs) name=/ipns/en.wikipedia-on-ipfs.org
            • go-namesys.resolveAsync (14.864µs) name=/ipns/en.wikipedia-on-ipfs.org
              • go-namesys.mpns.resolveOnceAsync (11.16µs) name=/ipns/en.wikipedia-on-ipfs.org
                • go-namesys.DNSResolver.resolveOnceAsync (4.391µs) name=en.wikipedia-on-ipfs.org
                  • go-namesys.DNSResolver.workDomain (178.332µs) name=_dnslink.en.wikipedia-on-ipfs.org.
                  • go-namesys.DNSResolver.workDomain (302.165µs) name=en.wikipedia-on-ipfs.org.
      • go-path.basicResolver.ResolveToLastNode (682.538µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
        • go-path.basicResolver.resolveNodes (288.607µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
          • go-blockservice/Session.GetBlock (28.631µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
          • go-blockservice/Session.GetBlock (50.985µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
          • go-blockservice/Session.GetBlock (32.146µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
          • go-blockservice/Session.GetBlock (29.617µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
          • go-blockservice/Session.GetBlock (25.647µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
    • CoreAPI.UnixfsAPI.Get (60.648µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
      • CoreAPI.ResolveNode (43.205µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
        • CoreAPI.ResolvePath (2.154µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
        • go-blockservice/Session.GetBlock (24.439µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
    • CoreAPI.UnixfsAPI.Get (778.132µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
      • CoreAPI.ResolveNode (766.786µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
        • CoreAPI.ResolvePath (737.653µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
          • go-namesys/resolve.ResolveIPNS (2.155µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
          • go-path.basicResolver.ResolveToLastNode (721.233µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
            • go-path.basicResolver.resolveNodes (694.256µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
              • go-blockservice/Session.GetBlock (18.787µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
              • go-blockservice/Session.GetBlock (32.486µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
              • go-blockservice/Session.GetBlock (51.608µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
              • go-blockservice/Session.GetBlock (43.861µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
              • go-blockservice/Session.GetBlock (25.929µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
              • go-blockservice/Session.GetBlock (14.212µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
        • go-blockservice/Session.GetBlock (15.004µs) cid=bafybeiejqhbybociss56wpnc7ncgtzysowtqjli6wmmcm4xa3t7b6un5jq
      • go-blockservice/Session.GetBlocks (5.148µs) bafkreic6wwrloxu73uhynnw32o3ivma77rbjave4lcjx74p4qoplixo4wm,bafkreidjt75xhwkxocrga3m6frxwvis4y57ngzoe6chzdfvs4brucbdkzm,bafkreifynzzfajzkwl6igdm5nqd5flvzctvviccbgnwd6dzo4o7e5qsiay

@iand
Copy link
Contributor

iand commented Mar 31, 2022

I forgot to mention that I can contribute time to getting this issue moved forward

Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Apr 4, 2022
@BigLep
Copy link
Contributor

BigLep commented Apr 4, 2022

Reopening this one since it is a larger tracking issue. #8595 was the initial push.

@BigLep BigLep reopened this Apr 4, 2022
@iand
Copy link
Contributor

iand commented Apr 28, 2022

I'm working on adding tracing to go-blockservice and go-path (noting to avoid duplication of effort)

@guseggert
Copy link
Contributor Author

High-level go-bitswap spans: ipfs/go-bitswap#562

@guseggert
Copy link
Contributor Author

DHT: libp2p/go-libp2p-kad-dht#769

@iand
Copy link
Contributor

iand commented May 4, 2022

go-blockservice: ipfs/go-blockservice#91

@iand
Copy link
Contributor

iand commented May 4, 2022

go-path: ipfs/go-path#59

@ianzhang1988
Copy link

Maybe add peer id in those tags. If tracing multiple node, this way we can tell which node is having problem.

@iand
Copy link
Contributor

iand commented Sep 14, 2022

Maybe add peer id in those tags. If tracing multiple node, this way we can tell which node is having problem.

This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id.

@thattommyhall
Copy link
Member

thattommyhall commented Sep 14, 2022 via email

@ianzhang1988
Copy link

This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id.

That said, I think being able to tell the origin of the trace is quite usefull for finding node having problem. Any alternative?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: 🎉 Done
Development

Successfully merging a pull request may close this issue.

7 participants