add metrics in go-libp2p blogpost #77
---
tags:
- metrics
- prometheus
title: Metrics in go-libp2p
description:
date: 2023-06-15
permalink: "/2023-06-15-metrics-in-go-libp2p/"
author: Sukun Tarachandani
---

# Metrics in go-libp2p

## Introduction

libp2p is the core networking component for many projects such as IPFS, Filecoin, the Ethereum Beacon Chain, and more.

We, as maintainers of go-libp2p, want to be able to observe the state of libp2p components and to enable our users to do the same in their production systems.
To that end, we've been adding instrumentation to collect metrics from various components over the last few months.
These metrics have already helped us debug some nuanced go-libp2p issues and develop new features (discussed in detail below).
Today, we'd like to share some of the choices we made and what we learned, and point you to resources that will help you monitor your deployments of go-libp2p.

## Why Prometheus?

We first had to choose a metrics collection and monitoring system. Our options were Prometheus, OpenCensus, and OpenTelemetry. The details of the discussion can be found [here](https://github.com/libp2p/go-libp2p/issues/1356).

To summarise the discussion, we'd observed [performance problems with OpenCensus](https://github.com/libp2p/go-libp2p/issues/1955) due to the large amount of garbage it generated, and OpenTelemetry's metrics API is still unstable as of this writing. In contrast, Prometheus is performant and ubiquitous, which allowed us to add metrics without worrying too much about performance. We also ensured that tracking metrics didn't put too much pressure on the garbage collector by [testing allocations](https://github.com/libp2p/go-libp2p/issues/2060) for all the metrics that we introduced. In addition, we knew that many of our users would prefer Grafana as their visualisation tool, and Grafana has excellent support for visualising Prometheus metrics.

## How users can enable metrics

Metrics have been enabled by default since go-libp2p [v0.26.0](https://github.com/libp2p/go-libp2p/releases/tag/v0.26.0). All you need to do is set up a Prometheus exporter for the collected metrics.

```go
package main

import (
	"log"
	"net/http"

	"github.com/libp2p/go-libp2p"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default Prometheus registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		http.ListenAndServe(":2112", nil) // Any port is fine
	}()

	host, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer host.Close()
	// ...
}
```
Now just point your Prometheus instance to scrape from `:2112/metrics`.

By default, metrics are registered with the default Prometheus Registerer. To use a different Registerer, pass the option `libp2p.PrometheusRegisterer`.

```go
package main

import (
	"log"
	"net/http"

	"github.com/libp2p/go-libp2p"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Use a dedicated registry instead of the default one.
	reg := prometheus.NewRegistry()
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	go func() {
		http.ListenAndServe(":2112", nil) // Any port is fine
	}()

	host, err := libp2p.New(
		libp2p.PrometheusRegisterer(reg),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer host.Close()
	// ...
}
```

<!-- TODO: incorporate this PR: https://github.com/libp2p/go-libp2p/pull/2232 -->

### Discovering what metrics are available

go-libp2p provides metrics and Grafana dashboards for all its major subsystems out of the box. You can check https://github.com/libp2p/go-libp2p/tree/master/dashboards for the available Grafana dashboards. Another great way to discover available metrics is to open the Prometheus UI, type `libp2p_(libp2p-package-name)_`, and browse the autocomplete suggestions. For example, `libp2p_autonat_` gives you the list of all metrics exported from [AutoNAT](https://github.com/libp2p/specs/tree/master/autonat).

<div class="container" style="display:flex; column-gap:10px; justify-content: center; align-items: center;">
<figure>
<img src="../assets/metrics-in-go-libp2p-prometheus-ui.png" width="750">
<figcaption style="font-size:x-small;">
EvtLocalAddressesUpdated
</figcaption>
</figure>
</div>

## How are metrics useful?

I'll share two cases where having metrics was extremely helpful for us in go-libp2p. In the first, metrics helped us debug a memory leak; in the second, adding two new metrics helped us develop a new feature.

### Debugging with metrics

We were excited about adding metrics because it gave us the opportunity to observe exactly what was happening within the system. One of the first systems we added metrics to was the Event Bus. When we added event bus metrics, we immediately saw a discrepancy between two of our metrics, `EvtLocalReachabilityChanged` and `EvtLocalAddressesUpdated`. You can see the details in the [GitHub issue](https://github.com/libp2p/go-libp2p/issues/2046).

<div class="container" style="display:flex; column-gap:10px; justify-content: center; align-items: center;">
<figure>
<img src="../assets/metrics-in-go-libp2p-evtlocalreachabilitychanged.png" width="750">
<figcaption style="font-size:x-small;">
EvtLocalReachabilityChanged
</figcaption>
</figure>
</div>

<div class="container" style="display:flex; column-gap:10px; justify-content: center; align-items: center;">
<figure>
<img src="../assets/metrics-in-go-libp2p-evtlocaladdressesupdated.png" width="750">
<figcaption style="font-size:x-small;">
EvtLocalAddressesUpdated
</figcaption>
</figure>
</div>

Ideally, when a node's reachability changes, its addresses should also change as it tries to obtain a [relay reservation](https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md). This pointed us to an issue with [AutoNAT](https://github.com/libp2p/specs/tree/master/autonat). Upon debugging, we realised that we were emitting reachability changed events when the reachability had not changed and only the address to which the AutoNAT dial succeeded had changed.


The graph for the `EvtLocalProtocolsUpdated` event pointed us to another problem.

<div class="container" style="display:flex; column-gap:10px; justify-content: center; align-items: center;">
<figure>
<img src="../assets/metrics-in-go-libp2p-evtprotocolsupdated.png" width="750">
<figcaption style="font-size:x-small;">
EvtLocalProtocolsUpdated
</figcaption>
</figure>
</div>

A node's supported protocols shouldn't change if its reachability has not changed. Once we became aware of the issue, finding the root cause was simple enough: there was a problem with cleaning up the relay service used in the relay manager. The details of the issue and the subsequent solution can be found [here](https://github.com/libp2p/go-libp2p/issues/2091).

### Development using metrics

In go-libp2p [v0.28.0](https://github.com/libp2p/go-libp2p/releases/tag/v0.28.0) we introduced smart dialing. When connecting to a peer, instead of dialing all of the peer's addresses in parallel, we now prioritise QUIC dials. This significantly reduces dial cancellations and unnecessary load on the network. Check the smart dialing [PR](https://github.com/libp2p/go-libp2p/pull/2260) for more information on the algorithm used and the impact of smart dialing.

Not dialing all addresses in parallel increases the latency of establishing a connection if the first dial doesn't succeed. We wanted to ensure that most connections succeeded with no additional latency. To help us gauge the impact, we added two metrics:

1. Dial ranking delay. This metric tracks the latency in connection establishment introduced by the dial prioritisation logic.
2. Dials per connection. This metric counts the number of addresses dialed before a connection was established with the peer.

Dials per connection measured the benefit of introducing the smart dialing mechanism, and dial ranking delay assured us that the vast majority of dials had no adverse impact on latency.

<div class="container" style="display:flex; column-gap:10px; justify-content: center; align-items: center;">
<figure>
<img src="../assets/metrics-in-go-libp2p-smart-dialing.png" width="750">
<figcaption style="font-size:x-small;">
Smart dialing metrics
</figcaption>
</figure>
</div>

## Resources

Check out our Grafana dashboards: [https://github.com/libp2p/go-libp2p/tree/master/dashboards](https://github.com/libp2p/go-libp2p/tree/master/dashboards)

To create custom dashboards, the [Prometheus](https://prometheus.io/docs/prometheus/latest/querying/basics/) and [Grafana docs](https://grafana.com/docs/grafana/latest/panels-visualizations/) are great resources.

## Get Involved

- If you’d like to get involved and contribute to libp2p, you can reach out to us using these means: [https://libp2p.io/#community](https://libp2p.io/#community)
- If you’re a self-starter and want to start pushing code immediately, feel free to ping the maintainers in any of these help wanted/good first issues: [go-libp2p](https://github.com/libp2p/go-libp2p/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22), [js-libp2p](https://github.com/libp2p/js-libp2p/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22), and [rust-libp2p](https://github.com/libp2p/rust-libp2p/issues?q=is%3Aopen+is%3Aissue+label%3Agetting-started).
- If you want to work in and around libp2p full-time, there are various teams hiring including the implementation teams. See [https://jobs.protocol.ai/jobs?q=libp2p](https://jobs.protocol.ai/jobs?q=libp2p) for opportunities across the [Protocol Labs Network](https://plnetwork.io/).


To learn more about libp2p generally, check out:

- The [libp2p documentation portal](https://docs.libp2p.io/)
- The [libp2p connectivity website](https://connectivity.libp2p.io/)
- The [libp2p curriculum put together by the Protocol Labs Launchpad program](https://curriculum.pl-launchpad.io/curriculum/libp2p/introduction/)

You can reach out to us and stay tuned for our next event announcement by joining our [various communication channels](https://libp2p.io/#community), joining the [discussion forum](https://discuss.libp2p.io/), following us on [Twitter](https://twitter.com/libp2p), or saying hi in the #libp2p-implementers channel in the [Filecoin public Slack](http://filecoin.io/slack).