Add comprehensive tracing #8578

guseggert · 2021-12-02T16:46:49Z

Master issue for adding comprehensive tracing throughout the go-ipfs codebase.

Some requirements:

End users and gateway operators should be able to trace costly work units throughout the various parts of the go-ipfs internals
Should fit into a larger distributed tracing story if go-ipfs is part of a larger system
Traces all the types of subsystems
- Synchronous work done as part of requests such as gateway requests
- Async workers that service sync requests
- Async workers that aren't involved in synchronous requests
  - DHT background work
  - Work done at daemon startup
- libp2p
Pluggable into different tracing backends
Easy to instrument existing un-instrumented code
Easy to instrument new code such as plugins, new interface implementations, etc.
Random sampling knob to tune perf/observability tradeoff

I plan to implement this against OpenTelemetry. Tactically I want to start at a high level (e.g. gateway reqs) and iteratively work down into the guts, driven by trying to answer questions about real requests so that we are instrumenting the right things.

Some related issues:

Todos:

Add tracer provider in go-ipfs w/ exporter configs, and add gateway and CoreAPI tracing
- feat: add basic gateway tracing #8595)
Deploy to public gateways
Add tracing in subsystems (incomplete list)
- go-bitswap
- go-namesys
- go-datastore
- go-libp2p-kad-dht
- libp2p subsystems
The Provider API needs a context plumbed through to go-ipfs-provider

BigLep · 2021-12-06T17:32:40Z

@guseggert: a couple of quick questions on this:

Do we have a sense of how much effort from you this will take in terms of ideal dev days/weeks? (I'm asking so can figure out when this may be of benefit for Bifrost)
After this work is done, what additional work will Bifrost need to do before they're able to see tracing data of a specific request?

guseggert · 2021-12-06T17:57:55Z

We can start getting local barebones traces back with 3 days of dev effort, though they won't go deep into the stack (just the glue in go-ipfs). Also not including the initial collector setup below, which is required for tracing on deployed infra. Adding additional spans deeper in the stack is trivial, with some minor overhead for plumbing that back into go-ipfs...in aggregate it'd likely take a couple weeks to add basic instrumentation to major components like Bitswap, DHT, datastores, other libp2p pieces, etc., and get them all plumbed up into go-ipfs.
I met with @coryschwartz and @frrist last week, we discussed 2) (amongst other things). I'd estimate the effort to integrate this with our infra at 1 week. The consensus was to use Grafana Tempo, which is a hosted OpenTelemetry collector, which Grafana has a UI on top of (see https://grafana.com/docs/tempo/latest/getting-started/tempo-in-grafana/). We'd follow the onboarding instructions here: https://grafana.com/orgs/protocollabs/hosted-traces/76.

BigLep · 2021-12-06T19:33:29Z

@guseggert : thanks! With local barebones traces, what kind of metrics will we get?

guseggert · 2021-12-06T19:48:08Z

We can get timing info and some basic request metadata for BlockStore, BlockService, DAGService, PeerStore, Pinner, and IPNS, possibly others.

Simple example, here's a req I recorded locally showing what's happening as go-ipfs builds an index page for dist.ipfs.io, from a few spans in the gateway request handler and core

proto-sid · 2021-12-06T22:00:55Z

cc: @gmasgras for visibility

guseggert · 2021-12-08T17:56:29Z

Draft PR for gateway tracing: #8595

guseggert · 2021-12-08T17:57:52Z

Netops has a summit this week they are busy with, so things will be slow this week on infra side. @thattommyhall is working on code changes to install a collector on the public gateway hosts.

proto-sid · 2022-01-13T23:39:45Z

@thattommyhall Could you share the current state of this implementation?

lidel · 2022-03-10T12:58:59Z

FYI i gathered some feedback around HTTP-request metrics in #8441

BigLep · 2022-03-24T16:09:07Z

2022-03-24 conversation: @guseggert is going to define the done criteria for the current attack at this. He'll also list potential followups.

ipfs/go-datastore#188 is a related PR.

iand · 2022-03-31T11:34:32Z

I would suggest decoupling the technical implementation from the infrastructure deployment since having tracing capability earlier is very valuable. I reviewed #8595 and left a few minor comments, but based on my testing I consider that PR essentially done. It's enough to start gathering traces on a small scale and threading tracing code throughout lower level libraries is going to be extremely valuable.

However, getting useful traces at scale on our infrastructure needs more work. I don't think it's feasible or desirable to gather traces from all requests (sample rate 1) due to the volume and the utility of sampling at lower levels is limited if it doesn't include the most interesting samples. I'd like to see some more sampling options added so we can restrict by top level API operation (e.g. Gateway.Request) and by trace time (e.g. only record traces for requests with duration greater than 5s)

I used the go-ipfs branch and added tracing to go-namesys, go-path and go-blockservice to experiment a bit. The results are really useful:

sample trace

bcaeefd4fd6df40ca4f93d8ffaf83a5a

Gateway.Request (568.562044ms) http.target=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
- CoreAPI.ResolvePath (955.569µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
  - go-namesys/resolve.ResolveIPNS (324.512µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco/
    - go-namesys.mpns.Resolve (313.499µs) name=/ipns/en.wikipedia-on-ipfs.org
      - go-namesys.resolve (307.04µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.resolveAsync (25.813µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.mpns.resolveOnceAsync (15.89µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.resolveOnceAsync (6.621µs) name=en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.workDomain (224.793µs) name=_dnslink.en.wikipedia-on-ipfs.org.
        
        go-namesys.DNSResolver.workDomain (887.368µs) name=en.wikipedia-on-ipfs.org.
  - go-path.basicResolver.ResolveToLastNode (616.415µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
    - go-path.basicResolver.resolveNodes (272.635µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
      - go-blockservice/Session.GetBlock (36.319µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
      - go-blockservice/Session.GetBlock (32.484µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
      - go-blockservice/Session.GetBlock (25.078µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
      - go-blockservice/Session.GetBlock (23.415µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
      - go-blockservice/Session.GetBlock (21.946µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
- CoreAPI.BlockAPI.Get (28.283µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
  - CoreAPI.ResolvePath (1.758µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
  - go-blockservice/blockService.GetBlock (20.343µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
- CoreAPI.ResolvePath (321.899µs) path=/ipns/en.wikipedia-on-ipfs.org
  - go-namesys/resolve.ResolveIPNS (305.296µs) path=/ipns/en.wikipedia-on-ipfs.org
    - go-namesys.mpns.Resolve (298.594µs) name=/ipns/en.wikipedia-on-ipfs.org
      - go-namesys.resolve (294.529µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.resolveAsync (16.506µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.mpns.resolveOnceAsync (12.215µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.resolveOnceAsync (5.279µs) name=en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.workDomain (209.855µs) name=_dnslink.en.wikipedia-on-ipfs.org.
        
        go-namesys.DNSResolver.workDomain (989.374µs) name=en.wikipedia-on-ipfs.org.
  - go-path.basicResolver.ResolveToLastNode (3.526µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
- CoreAPI.ResolvePath (358.244µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki
  - go-namesys/resolve.ResolveIPNS (265.91µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki
    - go-namesys.mpns.Resolve (256.527µs) name=/ipns/en.wikipedia-on-ipfs.org
      - go-namesys.resolve (250.682µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.resolveAsync (24.89µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.mpns.resolveOnceAsync (18.147µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.resolveOnceAsync (6.997µs) name=en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.workDomain (162.87µs) name=_dnslink.en.wikipedia-on-ipfs.org.
        
        go-namesys.DNSResolver.workDomain (651.886µs) name=en.wikipedia-on-ipfs.org.
  - go-path.basicResolver.ResolveToLastNode (80.249µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki
    - go-path.basicResolver.resolveNodes (65.656µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
      - go-blockservice/Session.GetBlock (32.781µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
- CoreAPI.ResolvePath (942.509µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco
  - go-namesys/resolve.ResolveIPNS (245.561µs) path=/ipns/en.wikipedia-on-ipfs.org/wiki/San_Francisco
    - go-namesys.mpns.Resolve (238.159µs) name=/ipns/en.wikipedia-on-ipfs.org
      - go-namesys.resolve (234.365µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.resolveAsync (14.864µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.mpns.resolveOnceAsync (11.16µs) name=/ipns/en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.resolveOnceAsync (4.391µs) name=en.wikipedia-on-ipfs.org
        
        go-namesys.DNSResolver.workDomain (178.332µs) name=_dnslink.en.wikipedia-on-ipfs.org.
        
        go-namesys.DNSResolver.workDomain (302.165µs) name=en.wikipedia-on-ipfs.org.
  - go-path.basicResolver.ResolveToLastNode (682.538µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
    - go-path.basicResolver.resolveNodes (288.607µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
      - go-blockservice/Session.GetBlock (28.631µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
      - go-blockservice/Session.GetBlock (50.985µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
      - go-blockservice/Session.GetBlock (32.146µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
      - go-blockservice/Session.GetBlock (29.617µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
      - go-blockservice/Session.GetBlock (25.647µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
- CoreAPI.UnixfsAPI.Get (60.648µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
  - CoreAPI.ResolveNode (43.205µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
    - CoreAPI.ResolvePath (2.154µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco
    - go-blockservice/Session.GetBlock (24.439µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
- CoreAPI.UnixfsAPI.Get (778.132µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
  - CoreAPI.ResolveNode (766.786µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
    - CoreAPI.ResolvePath (737.653µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
      - go-namesys/resolve.ResolveIPNS (2.155µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
      - go-path.basicResolver.ResolveToLastNode (721.233µs) path=/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/San_Francisco/index.html
        
        go-path.basicResolver.resolveNodes (694.256µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
        
        go-blockservice/Session.GetBlock (18.787µs) cid=bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
        
        go-blockservice/Session.GetBlock (32.486µs) cid=bafybeihn2f7lhumh4grizksi2fl233cyszqadkn424ptjajfenykpsaiw4
        
        go-blockservice/Session.GetBlock (51.608µs) cid=bafybeiaa552ayufsoxvsl7z5mey2hqi42ihjoerltffptqp4gnnbevvvlq
        
        go-blockservice/Session.GetBlock (43.861µs) cid=bafybeidodlox437tinm6yemiai4hjfzft6qisgxpl7hvme4qawvjmp7loe
        
        go-blockservice/Session.GetBlock (25.929µs) cid=bafybeiaerbb6wucpodhmomzk234n6f3clsmuwcw2ewduyfay4frsa5wwaa
        
        go-blockservice/Session.GetBlock (14.212µs) cid=bafybeichc2vf6yk24rbph7tx5sjzybsz4ztiuzmmvrz4fnikyxtsulzkg4
    - go-blockservice/Session.GetBlock (15.004µs) cid=bafybeiejqhbybociss56wpnc7ncgtzysowtqjli6wmmcm4xa3t7b6un5jq
  - go-blockservice/Session.GetBlocks (5.148µs) bafkreic6wwrloxu73uhynnw32o3ivma77rbjave4lcjx74p4qoplixo4wm,bafkreidjt75xhwkxocrga3m6frxwvis4y57ngzoe6chzdfvs4brucbdkzm,bafkreifynzzfajzkwl6igdm5nqd5flvzctvviccbgnwd6dzo4o7e5qsiay

iand · 2022-03-31T11:35:11Z

I forgot to mention that I can contribute time to getting this issue moved forward

BigLep · 2022-04-04T17:36:58Z

Reopening this one since it is a larger tracking issue. #8595 was the initial push.

iand · 2022-04-28T11:23:35Z

I'm working on adding tracing to go-blockservice and go-path (noting to avoid duplication of effort)

guseggert · 2022-04-28T14:18:32Z

High-level go-bitswap spans: ipfs/go-bitswap#562

guseggert · 2022-04-29T09:43:52Z

DHT: libp2p/go-libp2p-kad-dht#769

iand · 2022-05-04T10:40:29Z

go-blockservice: ipfs/go-blockservice#91

iand · 2022-05-04T10:48:25Z

go-path: ipfs/go-path#59

ianzhang1988 · 2022-09-14T10:19:59Z

Maybe add peer id in those tags. If tracing multiple node, this way we can tell which node is having problem.

iand · 2022-09-14T10:54:21Z

Maybe add peer id in those tags. If tracing multiple node, this way we can tell which node is having problem.

This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id.

thattommyhall · 2022-09-14T11:01:57Z

Absolutely, there are already some warnings in the Grafana cardinality dashboard re: the hydras labeling peerid (they are at 2k), care needed

…

On Wed, 14 Sept 2022, 12:54 Ian Davis, ***@***.***> wrote: Maybe add peer id in those tags. If tracing multiple node, this way we can tell which node is having problem. This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id. — Reply to this email directly, view it on GitHub <#8578 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA7PFF4GGXSIXZ2TY24VATV6GVGRANCNFSM5JHTAWIQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ianzhang1988 · 2022-09-14T11:26:30Z

This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id.

That said, I think being able to tell the origin of the trace is quite usefull for finding node having problem. Any alternative?

guseggert self-assigned this Dec 2, 2021

guseggert linked a pull request Dec 8, 2021 that will close this issue

feat: add basic gateway tracing #8595

Merged

guseggert removed a link to a pull request Dec 8, 2021

feat: add basic gateway tracing #8595

Merged

BigLep linked a pull request Dec 16, 2021 that will close this issue

feat: add basic gateway tracing #8595

Merged

galargh added this to IPFS Shipyard Team Mar 2, 2022

BigLep moved this to 🥞 Todo in IPFS Shipyard Team Mar 3, 2022

BigLep added this to the go-ipfs 0.13 milestone Mar 3, 2022

guseggert moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team Mar 21, 2022

guseggert moved this from 🏃‍♀️ In Progress to 🔎 In Review in IPFS Shipyard Team Mar 26, 2022

guseggert moved this from 🔎 In Review to 🏃‍♀️ In Progress in IPFS Shipyard Team Mar 26, 2022

guseggert modified the milestones: go-ipfs 0.13, Best Effort Track Mar 26, 2022

lidel closed this as completed in #8595 Apr 4, 2022

Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Apr 4, 2022

BigLep reopened this Apr 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add comprehensive tracing #8578

Add comprehensive tracing #8578

guseggert commented Dec 2, 2021 •

edited

Loading

BigLep commented Dec 6, 2021

guseggert commented Dec 6, 2021 •

edited

Loading

BigLep commented Dec 6, 2021

guseggert commented Dec 6, 2021 •

edited

Loading

proto-sid commented Dec 6, 2021

guseggert commented Dec 8, 2021

guseggert commented Dec 8, 2021

proto-sid commented Jan 13, 2022

lidel commented Mar 10, 2022

BigLep commented Mar 24, 2022

iand commented Mar 31, 2022

iand commented Mar 31, 2022

BigLep commented Apr 4, 2022

iand commented Apr 28, 2022

guseggert commented Apr 28, 2022

guseggert commented Apr 29, 2022

iand commented May 4, 2022

iand commented May 4, 2022

ianzhang1988 commented Sep 14, 2022

iand commented Sep 14, 2022

thattommyhall commented Sep 14, 2022 via email

ianzhang1988 commented Sep 14, 2022

Add comprehensive tracing #8578

Add comprehensive tracing #8578

Comments

guseggert commented Dec 2, 2021 • edited Loading

BigLep commented Dec 6, 2021

guseggert commented Dec 6, 2021 • edited Loading

BigLep commented Dec 6, 2021

guseggert commented Dec 6, 2021 • edited Loading

proto-sid commented Dec 6, 2021

guseggert commented Dec 8, 2021

guseggert commented Dec 8, 2021

proto-sid commented Jan 13, 2022

lidel commented Mar 10, 2022

BigLep commented Mar 24, 2022

iand commented Mar 31, 2022

iand commented Mar 31, 2022

BigLep commented Apr 4, 2022

iand commented Apr 28, 2022

guseggert commented Apr 28, 2022

guseggert commented Apr 29, 2022

iand commented May 4, 2022

iand commented May 4, 2022

ianzhang1988 commented Sep 14, 2022

iand commented Sep 14, 2022

thattommyhall commented Sep 14, 2022 via email

ianzhang1988 commented Sep 14, 2022

guseggert commented Dec 2, 2021 •

edited

Loading

guseggert commented Dec 6, 2021 •

edited

Loading

guseggert commented Dec 6, 2021 •

edited

Loading