
SLI metric that tracks availability of nodes and possibly ranges at node liveness level #57071

Closed
joshimhoff opened this issue Nov 24, 2020 · 4 comments
Labels
A-kv-observability · C-investigation (Further steps needed to qualify. C-label will change.) · O-sre (For issues SRE opened or otherwise cares about tracking.)

Comments


joshimhoff commented Nov 24, 2020

There is already a gauge that tracks how many nodes are live at any one time. Since prom scrapes on a fixed interval, & since the metric is a lagging indicator (this is my understanding, though I don't understand very concretely why it lags), issues with node liveness don't appear clearly in the graphs. There are also metrics that track heartbeat success & failure, but what those mean for availability is not at all clear. See these graphs for an example:

[two screenshots of metric graphs from the incident]

Node liveness records are in the DB! The DB uses MVCC technology! Can we compute the availability of each node at the node liveness level from those two facts? Concretely, can we compute uptime for a node over, say, the last 5m & log it to the DB? I'm not sure of the answer to this Q. If the answer is no, I shall learn something from the why. If the answer is yes, the next Q is how to express this in prometheus-speak. I'm puzzling thru that a bit right now.
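To make the prometheus half concrete, here's a rough sketch of the shape I'm imagining (illustration only, not CRDB's actual metrics plumbing; `isNodeLive`, the node list, and the metric name are all hypothetical, and it uses plain client_golang). The idea is to sample liveness much more often than prom scrapes and expose the fraction of live samples over a trailing 5m window as a gauge:

```go
// Rough sketch, illustration only (not CRDB's metrics plumbing): sample node
// liveness at a fine interval and expose the fraction of "live" samples over
// a trailing 5m window as a per-node gauge. isNodeLive and the node list are
// hypothetical stand-ins for the real liveness check.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var uptime5m = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "node_liveness_uptime_5m", // hypothetical metric name
	Help: "Fraction of liveness samples over the last 5m in which the node was live.",
}, []string{"node_id"})

// window is a fixed-size ring buffer of live/not-live observations.
// Only the sampler goroutine touches it, so no locking is needed here.
type window struct {
	samples []bool
	next    int
	full    bool
}

// record stores one observation and returns the live fraction of the window.
func (w *window) record(live bool) float64 {
	w.samples[w.next] = live
	w.next = (w.next + 1) % len(w.samples)
	if w.next == 0 {
		w.full = true
	}
	n := w.next
	if w.full {
		n = len(w.samples)
	}
	liveCount := 0
	for i := 0; i < n; i++ {
		if w.samples[i] {
			liveCount++
		}
	}
	return float64(liveCount) / float64(n)
}

// isNodeLive is a placeholder; in reality this would consult the node's
// liveness record (epoch + expiration) against the current time.
func isNodeLive(nodeID string) bool { return true }

func main() {
	prometheus.MustRegister(uptime5m)

	nodeIDs := []string{"1", "2", "3"} // hypothetical node list
	windows := map[string]*window{}
	for _, id := range nodeIDs {
		windows[id] = &window{samples: make([]bool, 300)} // 300 samples x 1s = 5m
	}

	// Sample once per second, far more often than prom scrapes, so short
	// liveness blips still move the 5m uptime fraction.
	go func() {
		for range time.Tick(time.Second) {
			for _, id := range nodeIDs {
				frac := windows[id].record(isNodeLive(id))
				uptime5m.WithLabelValues(id).Set(frac)
			}
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Graphing `node_liveness_uptime_5m` per node would then show, e.g., ~0.8 if a node was non-live for about a minute of the last 5m, even if it was back up by the time prom scraped.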

We may be able to do something similar for ranges...

If we can compute these two metrics & export them as prometheus metrics, I view them as low-cost-to-implement SLI metrics. The problem with them is that they only care about node liveness; ranges may be unavailable for other reasons. We can solve that more general SLI problem eventually, but it'll cost more to implement. For now, on the CC side at least, we also do some blackbox probing (sending regular test queries to a test DB from an external service that measures uptime).

joshimhoff added the O-sre and A-kv-observability labels Nov 24, 2020

blathers-crl bot commented Nov 24, 2020

Hi @joshimhoff, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.


joshimhoff commented Nov 30, 2020

I have been thinking about this more and I have two additional thoughts:

  1. I am not sure of the value of the proposed metric. See the graphs in the first comment. Would such a metric lead to greater clarity about the pictured incident? There are a bunch of restarts & node liveness heartbeat failures; how much impact was there on the customer? There is the impact from dropping connections at restart time; that certainly won't be captured by this metric (but will be by the CC SQL prober). There is the impact of leaseholders of some ranges going down; that again won't be captured by this metric. Were there also periods of range-level unavailability due to node liveness failures? And would the proposed metric help us answer this Q? I am not sure of the answers, but I think these are the right Qs to ask. It's definitely the case that we need better SLI metrics; we need much greater clarity than we have today about the impact of an outage on the customer.

  2. I have some thoughts about how to export the metric in prom-speak but I'll hold off on that until we agree on the value of such an SLI metric.


knz commented Dec 1, 2020

I think there are two different angles to talk about here:

  1. Focus on things that have impact on the cluster. A blip on liveness may have zero impact on cluster activity, so it would be silly to trigger an alert because of it. What would be interesting instead is to track every time an operation internally failed because some node could not be reached, an RPC timed out, etc. Have counters for these types of failures and monitor those counters directly (rough sketch after this list).

  2. Push edge events instead of polling level values. Every time "something happens" inside cockroachdb it is an edge event: there's an error object, a Go conditional branch is taken, etc. When these things happen, we can push an event to logging or some other external sink where the CC monitoring can pick it up. Then the information is available in real time.
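For item 1, a rough sketch of what I mean (illustration only; the package, metric name, and failure kinds are made up, and CRDB has its own metrics machinery rather than the client_golang used here):

```go
// Illustration only: counters for specific internal failure modes, bumped at
// the call sites that actually failed, so the graphs reflect impact on
// operations rather than liveness blips. All names are hypothetical.
package kvfailures

import "github.com/prometheus/client_golang/prometheus"

// InternalOpFailures counts internal operations that failed because a peer
// was unreachable, an RPC timed out, etc., labeled by failure kind.
var InternalOpFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "kv_internal_op_failures_total", // hypothetical metric name
	Help: "Internal operations that failed due to an unreachable node, RPC timeout, etc.",
}, []string{"kind"}) // e.g. "node_unreachable", "rpc_timeout"

func init() {
	prometheus.MustRegister(InternalOpFailures)
}

// RecordFailure is called from the error paths where an operation actually
// failed, e.g. RecordFailure("rpc_timeout").
func RecordFailure(kind string) {
	InternalOpFailures.WithLabelValues(kind).Inc()
}
```

Monitoring would then look at something like rate(kv_internal_op_failures_total[5m]) per kind, which is directly about operations that failed rather than about liveness heartbeats.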

FWIW there's some logging effort underway, e.g. in #57171, and I could see these events delivered actively on the OPS channel; then you'd have them "in real time" in your fluent collector thanks to #57170.

knz added the C-investigation label Dec 1, 2020

joshimhoff commented Dec 1, 2020

> focus on things that have impact on the cluster.

I agree with this principle with some caveats below.

> A blip on liveness may have zero impact on cluster activity, so it would be silly to trigger an alert because of it.

Monitoring & alerting are not the same. I think understanding the impact of node liveness heartbeat failures, in terms of the percentage availability of a node at the node liveness level, is useful monitoring data, but I wouldn't necessarily want to alert SRE based on it. I agree that SRE should be alerted on impact, not on possible causes of impact. That makes for a much higher signal-to-noise set of alerting rules.

> What would be interesting instead is to track every time an operation internally failed because some node could not be reached, RPC timed out, etc.

Ya, agreed. The same caveats apply re: useful monitoring data not necessarily being a good thing to base alerts on. We should also probe CRDB from the outside, so as to measure impact even when nodes are crashing.
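For reference, the blackbox probing is conceptually something like the sketch below (not the actual CC SQL prober; the connection string, metric name, and intervals are made up): an external process runs a trivial query on a schedule and exports success/failure counters, so a crashing or unreachable cluster shows up as failed probes.

```go
// Illustration only: an external blackbox SQL prober in spirit, not the
// actual CC prober. Connection string, metric name, and intervals are made up.
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CRDB speaks pgwire
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var probes = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "sql_probe_total", // hypothetical metric name
	Help: "Blackbox SQL probes against a test DB, labeled by outcome.",
}, []string{"outcome"})

func main() {
	prometheus.MustRegister(probes)

	// Hypothetical connection string to a dedicated probe database.
	db, err := sql.Open("postgres", "postgresql://prober@crdb:26257/probe_db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Run a trivial query on a schedule; a crashing or unreachable cluster
	// shows up as failed probes regardless of which component caused it.
	go func() {
		for range time.Tick(10 * time.Second) {
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			var one int
			if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
				probes.WithLabelValues("failure").Inc()
			} else {
				probes.WithLabelValues("success").Inc()
			}
			cancel()
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Alerting on a sustained rate of failed probes then measures customer-visible impact, which is the kind of signal SRE actually wants to page on.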

Ya, the more I think about this, the less I think the proposed metric is worth building. Learned some things thinking about it though! Closing.

Want to follow up with more conversations about KV level observability & SLIs. I've been spelunking thru existing CRDB metrics and want to develop a much better mental model around what they mean. And talk more about SLIs that don't exist yet!
