SLI metric that tracks availability of nodes and possibly ranges at node liveness level #57071
Comments
Hi @joshimhoff, please add a C-ategory label to your issue. Check out the label system docs. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
I have been thinking about this more and I have two additional thoughts:
I think there are two different angles to talk about here:
FWIW there's some logging effort, e.g. in #57171, and I could see these events being delivered actively on the OPS channel; you could then have them "in real time" in your fluent collector thanks to #57170.
I agree with this principle with some caveats below.
Monitoring & alerting are not the same. I think understanding the impact of node liveness heartbeat failures in terms of the percentage availability of some node at a node liveness level is useful monitoring data but I wouldn't necessarily want to alert SRE based on this data. I agree that SRE should be alerted on impact, not possible causes of impact. This makes for a much higher signal to noise set of alerting rules.
Ya agreed. Same caveats apply re: useful monitoring data not necessarily being a good thing to base alerts on. Also probe CRDB from the outside so as to measure impact even in case of crashing nodes.

Ya the more I think about this the less I think the proposed metric is worth building. Learned some things thinking about it though! Closing. Want to follow up with more conversations about KV level observability & SLIs. I've been spelunking thru existing CRDB metrics and want to develop a much better mental model around what they mean. And talk more about SLIs that don't exist yet!
There is already a gauge that tracks how many nodes are live at any one time. Since Prometheus scrapes on a fixed interval, & since the metric is a lagging indicator (that is my understanding, though I don't concretely understand why it lags), node liveness issues don't show up clearly in the graphs. There are metrics that track heartbeat success & failure, but what they mean for availability is not at all clear. See these graphs for an example:
Node liveness records are in the DB! The DB uses MVCC technology! Can we compute the availability of each node at a node liveness level from the above two facts? Concretely, can we compute uptime for a node over, let's say, the last 5m & log it to the DB? I'm not sure of the answer to this Q. If the answer is no, I shall learn something from the why. If the answer is yes, the next Q is how to express this in prometheus-speak. I'm puzzling thru that a bit right now.
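To make the "uptime over the last 5m" idea concrete, here is a rough sketch of the arithmetic only. It assumes (and this is a simplification, not how CRDB necessarily models liveness) that each historical version of a node's liveness record can be reduced to a `(written_at, expires_at)` pair in seconds, and that the node counts as live for that span. The function name and the tuple shape are hypothetical; reading the MVCC history itself is not shown.

```python
from typing import Iterable, Tuple


def liveness_availability(
    intervals: Iterable[Tuple[float, float]],
    window_start: float,
    window_end: float,
) -> float:
    """Fraction of [window_start, window_end) covered by at least one interval.

    Each interval is a hypothetical (written_at, expires_at) pair derived from
    one version of a node's liveness record. Overlapping intervals (the usual
    case when heartbeats succeed) are only counted once.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")

    # Clip every interval to the window and drop the ones fully outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in intervals
        if end > window_start and start < window_end
    )

    covered = 0.0
    covered_until = window_start  # everything before this point is accounted for
    for start, end in clipped:
        if end <= covered_until:
            continue  # fully inside time we already counted
        covered += end - max(start, covered_until)
        covered_until = end

    return covered / (window_end - window_start)


# A 300s window with one gap between t=190 and t=220.
records = [(0, 100), (90, 190), (220, 300)]
availability = liveness_availability(records, 0, 300)
```

In this example the node is covered for 270 of the 300 seconds. For the prometheus-speak part, one option (again just an idea, not something that exists today) would be to export the covered seconds as a counter and divide its rate by wall-clock time.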
We may be able to do something similar for ranges...
If we can compute these two metrics & export them as prometheus metrics, I view them as low-cost-to-implement SLI metrics. The problem with them is that they only account for node liveness; ranges may be unavailable for other reasons. We can solve that more general SLI problem eventually, but it'll cost more to implement. For now, on the CC side at least, we also do some blackbox probing (an external service sends regular test queries to a test DB & measures uptime).
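For comparison, the blackbox approach boils down to something like the sketch below. This is not the actual CC prober; `run_probes` and the `probe` callable are hypothetical stand-ins, and a real probe would run a test query against the cluster from outside it.

```python
import time
from typing import Callable


def run_probes(
    probe: Callable[[], bool],
    count: int,
    interval_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Run `probe` `count` times and return the fraction that succeeded.

    A probe that raises is counted as a failure, since from the outside a
    crash and an error look the same: the query did not come back OK.
    """
    if count <= 0:
        raise ValueError("count must be positive")

    successes = 0
    for i in range(count):
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        if ok:
            successes += 1
        if i + 1 < count:
            sleep(interval_s)  # wait between probes, but not after the last one
    return successes / count


# Stand-in probe: every 4th call fails, as if the DB were briefly unreachable.
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    if calls["n"] % 4 == 0:
        raise ConnectionError("simulated outage")
    return True


uptime = run_probes(fake_probe, count=8)
```

Because the prober sits outside the cluster, it still records failures when nodes are crashing, which is the case an in-DB liveness metric cannot cover.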