Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create new "health check" method #2809

Closed
mDuo13 opened this issue Dec 14, 2018 · 5 comments
Closed

Create new "health check" method #2809

mDuo13 opened this issue Dec 14, 2018 · 5 comments
Assignees
Labels
API Change Feature Request Used to indicate requests to add new features Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.

Comments

@mDuo13
Copy link
Collaborator

mDuo13 commented Dec 14, 2018

The server_info command is very cluttered and in many cases it's not easy to diagnose whether the server is healthy or not from that. A simple "health check" method could make it easier to monitor rippled with industry-standard tooling and also easier to diagnose manually as well.

Ideally:

  • It would be an HTTP GET call, to maximize compatibility with common monitoring software. (Maybe using the same port as "crawl"?)
  • It would report a health score ("Healthy", "Warning", or "Critical") depending on several factors:
    • Age of last validated ledger is < 7s (healthy), <=20s (warning) or >20s (critical)
    • Not amendment blocked (critical) and amendments unknown to the server don't currently have a majority in the network (warning)
    • Number of peers is >7 (healthy) <5 (warning unless configured w/ peer_private), or 0 (critical)
    • server_state is full/validating/proposing (healthy), syncing/tracking/connected (warning), or disconnected (critical)
    • Maybe a load_factor based warning, too? Not sure what thresholds to use there.
  • If the health score is not "healthy", it reports which factor(s) are not healthy. If it's healthy, the word "HEALTHY" is the only thing in the response.
@mDuo13 mDuo13 added API Change Feature Request Used to indicate requests to add new features labels Dec 14, 2018
@ximinez
Copy link
Collaborator

ximinez commented Dec 14, 2018

Speaking as an engineer, not an end-user, I think that reporting the values of the individual factors along with the health score could be valuable.
Reasons:

  1. Some people just like to watch the blinkenlites, and having these top factors only may be interesting.
  2. Advanced users may want to monitor how healthy their node is. For example, if I'm running a hub, I would expect hundreds of peers, not just 7. It would be handy to get this information in one place so I can confirm the node thinks itself is healthy, but I can keep an extra watch on the number of peers, and have some local alert if it drops below my own threshold. Yes, an advanced user can get this info from server_info, but including it here, when we also return it when there are problems, saves an RPC call.

@MarkusTeufelberger
Copy link
Collaborator

It might be useful to split this up into "liveness" and "readiness" (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ for example) and I'm not convinced that a server that has constant 6 seconds lag is "healthy".

@garay-daniel
Copy link

Agree with Ed's point. I think this also pushes operators to understand what metrics are important to monitor.

If we make this compatible with crawl, then it would be easy for anyone to scrape this data to build dashboards and monitoring of the network, which would be great.

@MarkusTeufelberger
Copy link
Collaborator

If you want people to build dashboards, implement https://openmetrics.io/ instead of the current statsd-only metrics framework.

I currently collect metrics via server_status and a custom prometheus exporter, so most/all of the data mentioned is anyways already available. It is just clunky to extract.

@carlhua
Copy link
Contributor

carlhua commented Mar 23, 2020

Lets discuss this as part of 1.6. I added a tag to this issue. @mayurbhandary

HowardHinnant added a commit to HowardHinnant/rippled that referenced this issue Apr 16, 2020
* Gives a summary of the health of the node:
  Healthy, Warning, or Critical

* Last validated ledger age:
  <7s is Healthy,
  7s to 20s is Warning
  > 20s is Critcal

* If amendment blocked, Critical

* Number of peers:
  > 7 is Healthy
  1 to 7 is Warning
  0 is Critical

* server state:
  One of full, validating or proposing is Healthy
  One of syncing, tracking or connected is Warning
  All other states are Critical

* load factor:
  <= 100 is Healthy
  101 to 999 is Warning
  >= 1000 is Critical

* If not Healthy, info field contains data that is considered not
  Healthy.

Fixes: XRPLF#2809
HowardHinnant added a commit to HowardHinnant/rippled that referenced this issue Apr 16, 2020
* Gives a summary of the health of the node:
  Healthy, Warning, or Critical

* Last validated ledger age:
  <7s is Healthy,
  7s to 20s is Warning
  > 20s is Critcal

* If amendment blocked, Critical

* Number of peers:
  > 7 is Healthy
  1 to 7 is Warning
  0 is Critical

* server state:
  One of full, validating or proposing is Healthy
  One of syncing, tracking or connected is Warning
  All other states are Critical

* load factor:
  <= 100 is Healthy
  101 to 999 is Warning
  >= 1000 is Critical

* If not Healthy, info field contains data that is considered not
  Healthy.

Fixes: XRPLF#2809
HowardHinnant added a commit to HowardHinnant/rippled that referenced this issue May 11, 2020
* Gives a summary of the health of the node:
  Healthy, Warning, or Critical

* Last validated ledger age:
  <7s is Healthy,
  7s to 20s is Warning
  > 20s is Critcal

* If amendment blocked, Critical

* Number of peers:
  > 7 is Healthy
  1 to 7 is Warning
  0 is Critical

* server state:
  One of full, validating or proposing is Healthy
  One of syncing, tracking or connected is Warning
  All other states are Critical

* load factor:
  <= 100 is Healthy
  101 to 999 is Warning
  >= 1000 is Critical

* If not Healthy, info field contains data that is considered not
  Healthy.

Fixes: XRPLF#2809
@HowardHinnant HowardHinnant added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label May 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Feature Request Used to indicate requests to add new features Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants