-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi: add htlc stream subscription and routing dashboard #59
multi: add htlc stream subscription and routing dashboard #59
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! IMO this is one of the last low hanging fruits, that also has a lot of leverage w.r.t giving node operators more insight w.r.t what's happening with their node in real-time. It can also eventually be used later on to trigger alerts if it appears that griefing attacks may be being launched across the network.
Haven't tried this out yet, but so far just a few comments re redundant counters, and richer use of labels.
@@ -59,6 +64,24 @@ func newHtlcMonitor(router lndclient.RouterClient, | |||
Name: "failed_forwards", | |||
Help: "count fo failed forwards", | |||
}), | |||
resolutionTimeHistogram: prometheus.NewHistogram( | |||
prometheus.HistogramOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here re a vector to add in additional labels. One other useful labels for this one in particular (also applies to the others), would be adding the payment hash itself. This would let us track MPP usage, possibly probing usage, hash re-use, see how long MPP payments take to resolve on avg, etc, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some labels here. We don't have payment hash rn, because it was a bit involved to surface that info in the original lnd PR. Also wondering whether labels would be the best way to track something as variable as payment hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm thinking the payment hash would only be used to group items in aggregation operators. So you could do something like count(resolution_ms) > 1 by (payment_hash)
(assuming it's a gauge) which would let you track MPP usage over time as they're identified by repeated payment hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could end up being rather unscalable though since one may end up with so many labels over time.
Switch lndmon to use lndclient, providing the readonly macaroon as our only macaroon, so that no changes to lndmon setup are required. This switch also provides us with version checks, which we set to lnd v0.11, since that is the minimum supported version after recent CVEs.
Update all of our collectors to shutdown on failure rather than sliently log. This paired with restarting the lndmon container on exit allows easier detection of persistenet issues, and simple restart when lnd is unavailable temporarily.
709dee0
to
939973e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🗾
Also thinking would be useful to create quantile graphs for the resolution latency as well, but this is a great start! |
This PR adds a subscription to HTLC events and adds a routing dashboard. HTLCs are separated with a dropdown variable to display the dashboard for sends/receives/forwards that the node processes.
There's a lot more than can be done with this subscription, so we stick to some basic questions:
How many {sends/receives/forwards} is my node processing?
How successful are my {sends/receives/forwards}?
Which channels are most used for successful {sends/receives/forwards}?
Which channels are most used for failed {sends/receives/forwards}?
How long is my liquidity locked up by {sends/forwards}?
Why are my {sends/receives/forwards} failing?
Full dashboard layout:
Open question here is whether we want to duplicate the "pending htlcs" that we already have on the node dash. I can see it being useful to have a graph with only forwards to monitor your traffic, and detect any unusual spikes/spamming.
Depends on #58, don't review the first 2 commits.