multi: add htlc stream subscription and routing dashboard #59

Merged: 5 commits, Nov 20, 2020

Conversation

@carlaKC (Contributor) commented Nov 11, 2020

This PR adds a subscription to HTLC events and a routing dashboard. HTLCs are split by a dropdown variable so the dashboard can show the sends/receives/forwards that the node processes. (A rough sketch of the underlying pattern is included at the end of this description.)

There's a lot more that can be done with this subscription, so we stick to some basic questions:

How many {sends/receives/forwards} is my node processing?

Screenshot 2020-11-11 at 16 53 54

How successful are my {sends/receives/forwards}?

Screenshot 2020-11-17 at 13 33 59

Which channels are most used for successful {sends/receives/forwards}?

Screenshot 2020-11-17 at 13 25 54

Screenshot 2020-11-17 at 13 26 00

Which channels are most used for failed {sends/receives/forwards}?

Screenshot 2020-11-17 at 13 26 05
Screenshot 2020-11-17 at 13 26 09

How long is my liquidity locked up by {sends/forwards}?

Screenshot 2020-11-11 at 16 54 03

Screenshot 2020-11-11 at 16 54 07

Why are my {sends/receives/forwards} failing?

Screenshot 2020-11-11 at 16 54 11

Full dashboard layout:

Screenshot 2020-11-17 at 13 26 44
Screenshot 2020-11-17 at 13 26 37
Screenshot 2020-11-17 at 13 26 30

An open question here is whether we want to duplicate the "pending htlcs" graph that we already have on the node dashboard. I can see it being useful to have a forwards-only graph to monitor your traffic and detect any unusual spikes or spamming.

Depends on #58, don't review the first 2 commits.
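
For reference, a rough sketch of the pattern behind this PR, with illustrative names rather than the actual collector code: consume HTLC events from the router subscription and bump a Prometheus counter per event type and outcome.

package collectors

import "github.com/prometheus/client_golang/prometheus"

// htlcEvent is a simplified stand-in for the events delivered by lnd's
// HTLC event subscription (sends, receives and forwards).
type htlcEvent struct {
    eventType string // "send", "receive" or "forward"
    settled   bool
}

// htlcMonitor counts resolved HTLCs by type and outcome.
type htlcMonitor struct {
    resolvedCounter *prometheus.CounterVec
}

func newHtlcMonitor() *htlcMonitor {
    return &htlcMonitor{
        resolvedCounter: prometheus.NewCounterVec(prometheus.CounterOpts{
            Namespace: "lnd",
            Subsystem: "htlcs",
            Name:      "resolved_htlcs",
            Help:      "count of resolved htlcs by type and outcome",
        }, []string{"type", "outcome"}),
    }
}

// consume reads events until the channel is closed, incrementing the
// counter once per resolved HTLC. The counter still needs to be
// registered with a Prometheus registry elsewhere.
func (h *htlcMonitor) consume(events <-chan htlcEvent) {
    for event := range events {
        outcome := "settled"
        if !event.settled {
            outcome = "failed"
        }
        h.resolvedCounter.WithLabelValues(event.eventType, outcome).Inc()
    }
}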

@Roasbeef (Member) left a comment

Excellent work! IMO this is one of the last low-hanging fruits, and one with a lot of leverage w.r.t. giving node operators more insight into what's happening with their node in real time. It can also eventually be used to trigger alerts if it appears that griefing attacks are being launched across the network.

Haven't tried this out yet, but so far just a few comments re redundant counters and richer use of labels.

collectors/htlcs_collector.go: two outdated review threads (resolved)
@@ -59,6 +64,24 @@ func newHtlcMonitor(router lndclient.RouterClient,
        Name: "failed_forwards",
        Help: "count of failed forwards",
    }),
    resolutionTimeHistogram: prometheus.NewHistogram(
        prometheus.HistogramOpts{
Member

Similar comment here re using a vector to add in additional labels. One other useful label for this one in particular (it also applies to the others) would be the payment hash itself. This would let us track MPP usage, possibly probing usage, hash re-use, see how long MPP payments take to resolve on average, etc.
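
As an illustration of the suggestion (names are examples, not necessarily what this PR ends up with), the histogram could be declared as a HistogramVec so resolution times can be sliced by labels such as event type and outcome:

package collectors

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// resolutionTimeHistogram is an illustrative label-carrying variant of the
// resolution-time histogram from the diff above.
var resolutionTimeHistogram = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "lnd",
        Subsystem: "htlcs",
        Name:      "resolution_time_seconds",
        Help:      "time taken to resolve htlcs",
        Buckets:   prometheus.ExponentialBuckets(0.5, 2, 12),
    },
    []string{"type", "outcome"},
)

// observeResolution records how long an HTLC took to resolve, tagged with
// its type ("send", "receive", "forward") and outcome ("settled", "failed").
func observeResolution(htlcType, outcome string, elapsed time.Duration) {
    resolutionTimeHistogram.WithLabelValues(htlcType, outcome).
        Observe(elapsed.Seconds())
}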

Contributor Author

Added some labels here. We don't have the payment hash right now, because it was a bit involved to surface that info in the original lnd PR. Also wondering whether labels would be the best way to track something as variable as a payment hash?

Member

So I'm thinking the payment hash would only be used to group items in aggregation operators. You could do something like count(resolution_ms) > 1 by (payment_hash) (assuming it's a gauge), which would let you track MPP usage over time, since MPP payments are identified by repeated payment hashes.
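
For reference, PromQL attaches the grouping modifier to the aggregation rather than to the comparison, so the query sketched above would read something like the following, assuming the resolution metric is exported with a payment_hash label:

count(resolution_ms) by (payment_hash) > 1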

Member

It could end up being rather unscalable though, since one may end up with so many label values (high cardinality) over time.

collectors/htlcs_collector.go: outdated review thread (resolved)
Switch lndmon to use lndclient, providing the readonly macaroon as our
only macaroon, so that no changes to lndmon setup are required. This
switch also provides us with version checks, which we set to lnd v0.11,
since that is the minimum supported version after recent CVEs.
Update all of our collectors to shut down on failure rather than
silently log. Paired with restarting the lndmon container on exit, this
allows easier detection of persistent issues, and a simple restart when
lnd is temporarily unavailable.
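
A rough sketch of the shutdown-on-failure pattern described in the commit message above (function names are illustrative, not the actual lndmon code): each collector reports its first error on a shared channel and the process exits, so the container's restart policy can take over.

package main

import (
    "errors"
    "log"
)

// collectHtlcs stands in for a collector's run loop; it returns an error
// instead of silently logging when its underlying stream fails.
func collectHtlcs() error {
    return errors.New("htlc event stream terminated")
}

func main() {
    // errChan receives the first error reported by any collector goroutine.
    errChan := make(chan error, 1)

    go func() {
        if err := collectHtlcs(); err != nil {
            select {
            case errChan <- err:
            default:
            }
        }
    }()

    // Exit on the first collector failure; pairing this with a container
    // restart policy surfaces persistent issues and recovers from
    // temporary lnd unavailability.
    err := <-errChan
    log.Fatalf("collector failed, shutting down: %v", err)
}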
@carlaKC force-pushed the 57-lndlcientswitchover branch from 709dee0 to 939973e on November 17, 2020 11:41
@carlaKC requested a review from Roasbeef on November 17, 2020 12:55
@Roasbeef (Member) left a comment

LGTM 🗾

@Roasbeef (Member) commented:

Also thinking it would be useful to create quantile graphs for the resolution latency as well, but this is a great start!
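
If the resolution time is exported as a Prometheus histogram, a quantile panel could be driven by something along these lines (the metric name here is illustrative, depending on what the collector ends up exporting):

histogram_quantile(0.95, sum(rate(lnd_htlcs_resolution_time_seconds_bucket[5m])) by (le))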
