[RFC] Extensible Health check framework #4244
Comments
@elfisher @muralikpbhat @reta + others, please provide your inputs. @Bukhtawar Please help me tag the relevant folks.
@Gaganjuneja thanks for the summary, I think it would greatly improve the visibility of the cluster as a whole. Using circuit breakers as a decision maker for fault detection / node eviction (FollowerChecker / LeaderChecker) seems doubtful (not because they are useless, but because of the way circuit breakers are designed). We may look at the Phi Accrual Failure Detector algorithm [1]; as far as I know it is used by Akka Cluster and Cassandra for failure detection and is proven to work well. I personally lean towards the 1st approach (afaik this is more or less how it works now but with predefined checks), but I have difficulty understanding what you mean by:
Are you referring to the case when the follower cannot ping the leader (but the leader can ping the follower)? Wouldn't such nodes fail the follower check since they won't be able to send the response to the leader? [1] https://pdfs.semanticscholar.org/11ae/4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8.pdf
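For reference, here is a minimal, self-contained sketch of the Phi Accrual Failure Detector idea from [1]: phi is derived from the distribution of recent heartbeat inter-arrival times, and a node is suspected once phi exceeds a threshold. All class and method names are illustrative, not part of OpenSearch, and the tail approximation of the normal CDF follows the one commonly used in Akka-style implementations.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sketch of the Phi Accrual Failure Detector (Hayashibara et al.).
 * Names and constants are hypothetical, not part of OpenSearch.
 */
public class PhiAccrualFailureDetector {

    private final Deque<Long> intervals = new ArrayDeque<>(); // recent heartbeat inter-arrival times (ms)
    private final int windowSize = 200;
    private final double threshold; // e.g. 8.0; higher means more conservative
    private long lastHeartbeatMillis = -1;

    public PhiAccrualFailureDetector(double threshold) {
        this.threshold = threshold;
    }

    /** Record a heartbeat arrival and keep a bounded window of inter-arrival intervals. */
    public synchronized void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis > 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) {
                intervals.removeFirst();
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** phi = -log10(probability that a heartbeat arrives later than the time already elapsed). */
    public synchronized double phi(long nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0;
        }
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double variance = intervals.stream()
            .mapToDouble(i -> (i - mean) * (i - mean)).average().orElse(1.0);
        double stdDev = Math.max(Math.sqrt(variance), 1.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        // Exponential tail approximation of the normal CDF, as used in common implementations.
        double y = (elapsed - mean) / stdDev;
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = elapsed > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(Math.max(pLater, 1e-12));
    }

    /** A node is suspected (but not necessarily evicted) once phi crosses the threshold. */
    public boolean isSuspected(long nowMillis) {
        return phi(nowMillis) > threshold;
    }
}
```

The appeal of this approach is that the suspicion level degrades gradually with the observed heartbeat history, rather than flipping on a single missed ping.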
Thanks @reta for your review.
Yes, true. But the idea here is to identify the leader's failure so that the followers can trigger voting if the leader is unreachable.
@Gaganjuneja Thanks for the proposal. This would help in plugging in more health checks easily and surfacing any system issues sooner, before things start breaking in the cluster. In both proposals, the cluster manager (leader) is maintaining / caching the health status of each node and then making a decision. The main difference that I see is whether the cluster manager is piggybacking this health status collection on FollowerChecker or making another transport call. As long as the health status is cached and the payload is not too big, FollowerChecker can simply return it. Regarding the concern that a circuit breaker can't be executed for LeaderChecker: why is that needed? While generating the overall view, the leader would have the health status of its local node as well. Also, I don't see us doing anything special for the leader node in Proposal 2 anyway. Another ask: there should be a health check API exposed to the user as well, so that the user can run the health check on demand against all the nodes or against a specific node.
@shwetathareja thanks for reviewing it. Please find my responses to your queries below.
There are a couple more points -
You are right, let's take a scenario - every node runs the leader checker to know whether the leader is active and healthy; if not, they initiate leader replacement and voting activities. If there is some bug in the plugin and the leader starts responding as unhealthy all the time, then quorum will be lost and the cluster goes to the red state. Yes, it's not resolved in proposal 2; we have actually skipped it for now and are focusing on data nodes only.
Definitely a good call out. I need to check how it's being done as of now so that we can easily plug in there as well. But today only a single action is supported, and that is node eviction. So if unhealthy nodes are going to be evicted, then the health status would be the same as the node available/unavailable status.
Waiting for this, since whole clusters of mine have died due to Stuck I/O detection xD
@Gaganjuneja @Bukhtawar I've changed the release version on this to
@Gaganjuneja @Bukhtawar Reaching out since this is marked as a part of
@Gaganjuneja This issue will be marked for next-release
Tagging it for next release:
@DarshitChanpura - I assume this is not going out in 2.8? Do we have a target release?
I'm not sure. @Gaganjuneja Do you have an expected ETA for this?
Is there any target release for this issue? I see that it's tagged as v2.8.0, but I don't see any associated PRs.
@DarshitChanpura @cwperks I am not able to prioritise this for the 2.8 release.
Thank you @Gaganjuneja. I am removing the version label on this issue. Please add back a version label when this is scheduled.
Is your feature request related to a problem? Please describe.
Feature Request #4052.
Problem Statement
The OpenSearch core framework performs a very basic ping-based health check. The leader and followers ping each other and rely on the assumption that if the ping request is answered within a timeout, the node is healthy. Along with the ping, it also checks disk health by writing a temp file locally on the pinged host. The ping-based health check works well, but it only returns a boolean answer: whether the node is up or not. It doesn't give any signal about the deteriorating health of the host, e.g. if the disk has been performing slowly for the last X minutes, contributing to high latency, low throughput, etc., and affecting operations like replication, search queries, and ingestion. The health check framework would only learn about this once the node stops responding to the ping request. Such a faulty node may stay in the cluster for a very long time as long as it manages to respond to pings, while degrading overall system performance and availability.
Scenarios -
Current State
At present, we perform follower ping, read-only filesystem, and stuck I/O checks.
Leader/Follower checks – The leader pings all the followers periodically to validate their availability and check their health. Followers also check their leader's health through ping requests.
Read-only filesystem detection – While bootstrapping a new node, the system checks whether the filesystem is healthy.
Stuck I/O detection – The node tries to write a temp file to check whether the filesystem is writable. This is done at multiple places/times (a minimal sketch of such a write probe follows after the list below).
Once removed, the node is prevented from participating in any of the usual cluster operations by additionally handling the scenarios below:
1. Joining a new/earlier removed node - While adding a node, the cluster checks for stuck I/O to prevent an unhealthy node from joining back.
2. Pre-vote request - While handling pre-vote requests, the same check is applied to exclude unhealthy nodes from the voting process.
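For illustration, here is a minimal sketch of what a temp-file write probe of this kind can look like. It is not the actual OpenSearch FsHealthService implementation; the class name, timeout handling, and file naming are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Simplified sketch of a temp-file write probe in the spirit of the existing
 * stuck-I/O / read-only filesystem checks. Names are illustrative only.
 */
public final class FsWriteProbe {

    /** Returns true if a small file can be written and fsynced within the timeout budget. */
    public static boolean isPathWritable(Path dataPath, long timeoutMillis) {
        Path tempFile = dataPath.resolve(".health-check-" + System.nanoTime() + ".tmp");
        long start = System.currentTimeMillis();
        try (FileChannel channel = FileChannel.open(
                tempFile,
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE,
                StandardOpenOption.DELETE_ON_CLOSE)) {
            channel.write(ByteBuffer.wrap("probe".getBytes(StandardCharsets.UTF_8)));
            channel.force(true); // fsync so a stuck or read-only device actually surfaces here
            return (System.currentTimeMillis() - start) <= timeoutMillis;
        } catch (IOException e) {
            return false; // read-only filesystem, full disk, or other I/O error
        }
    }
}
```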
Describe the solution you'd like
Proposal
The idea is to introduce an extensible health check framework that, instead of just reacting to node failures, can proactively identify bad/sluggish nodes well in advance and take the necessary action. The health check framework should be generic enough to open the door for users to implement their own health checks as per their needs. There would be different checks for different cloud providers and underlying infrastructures. OpenSearch core should be resilient to all these health check implementations so that a wrong implementation can't bring the entire cluster down.
Approach
The approach is to provide an extension point for plugins to implement health checks. These health check plugins will be called by a new HealthCheckService from the above-mentioned places/times as applicable. I can think of the following primary features for this framework; a minimal sketch of what such an extension point could look like follows after the feature list below.
Features
Plugin Responsibilities
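To make the extension point concrete, below is a minimal, hypothetical sketch of what a health check plugin interface could look like. The interface, enum, and result type are illustrative assumptions, not the actual OpenSearch plugin API.

```java
import java.util.Collections;
import java.util.Map;

/**
 * Hypothetical sketch of a pluggable health check extension point.
 * Interface and type names are illustrative; the real plugin API may differ.
 */
public interface HealthCheckPlugin {

    enum Status { HEALTHY, DEGRADED, UNHEALTHY }

    /** Immutable result: overall verdict plus per-resource details (e.g. "disk" -> "slow writes"). */
    final class HealthCheckResult {
        public final String checkName;
        public final Status status;
        public final Map<String, String> details;

        public HealthCheckResult(String checkName, Status status, Map<String, String> details) {
            this.checkName = checkName;
            this.status = status;
            this.details = Collections.unmodifiableMap(details);
        }
    }

    /** Unique name used to register and report this check. */
    String name();

    /** Runs locally on the node; implementations must be cheap and bounded in time. */
    HealthCheckResult check();
}
```

Keeping the contract this small makes it easy for the core to sandbox misbehaving plugins (timeouts, exception handling) without trusting their implementations.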
Proposed Solution
Solution – 1 (HealthCheck status with PING call)
We should leverage the existing health check path rather than making a new health check call. The request handler should be able to call a NodeHealthService implementation to get the node health. The NodeHealthService implementation should asynchronously call all the health check plugins and keep the node health cached. FollowersChecker would cache the state of all the nodes and use a circuit breaker to decide the node action. It's hard to implement a leader checker circuit breaker, as it runs on all the follower nodes and it would be difficult to fetch the cluster-level state there. A rough sketch of such a cached health service is shown below.
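The sketch below illustrates the cached, asynchronously refreshed node health service described above, reusing the hypothetical HealthCheckPlugin interface sketched earlier. The class name, refresh interval, and error handling are assumptions, not the proposed implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of Solution 1: health status is refreshed in the background and cached,
 * so the follower-check request handler can attach it to the existing ping
 * response without blocking. All names are hypothetical.
 */
public class CachedNodeHealthService {

    private final List<HealthCheckPlugin> plugins;
    private final Map<String, HealthCheckPlugin.HealthCheckResult> latestResults = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public CachedNodeHealthService(List<HealthCheckPlugin> plugins) {
        this.plugins = plugins;
        // Refresh in the background; the ping path only ever reads the cache.
        scheduler.scheduleWithFixedDelay(this::refresh, 0, 30, TimeUnit.SECONDS);
    }

    private void refresh() {
        for (HealthCheckPlugin plugin : plugins) {
            try {
                latestResults.put(plugin.name(), plugin.check());
            } catch (Exception e) {
                // A misbehaving plugin must not break the core health path.
                latestResults.put(plugin.name(), new HealthCheckPlugin.HealthCheckResult(
                    plugin.name(),
                    HealthCheckPlugin.Status.UNHEALTHY,
                    Map.of("error", String.valueOf(e.getMessage()))));
            }
        }
    }

    /** Called from the follower-check handler: cached, never blocks on plugin code. */
    public Map<String, HealthCheckPlugin.HealthCheckResult> currentHealth() {
        return Map.copyOf(latestResults);
    }
}
```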
Pros
Cons
PS - We can also think of applying this extensible health check only to data/follower nodes and skipping the leader nodes. We can discuss the potential health check scenarios for leaders and decide.
Solution – 2 (Separate service which performs the health check and maintains the node health status at the cluster level)
We can define a new service, ClusterHealthCheckService, which will call all the health check plugins for all the nodes and cache their health status. This service will have a cluster-level view of the health of all nodes and help in deciding further action. We can easily detect whether 1) the faulty nodes belong to the same AZ, 2) the cluster state would remain healthy if we evict a node, 3) the health check failures are due to some bug, 4) an entire AZ is down, 5) a node eviction would lead to shard imbalance, etc. This service will give signals to the cluster manager for further node actions. A rough sketch of such cluster-level decision logic is shown below.
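Here is a rough sketch of the kind of cluster-level guard rails this service could apply before signalling an eviction. The class name, the zone/quorum heuristics, and the thresholds are illustrative assumptions only, not part of the proposal itself.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Sketch of the cluster-level decision layer from Solution 2: the cluster
 * manager aggregates per-node health and applies safety guards before
 * signalling an eviction. Names and thresholds are illustrative.
 */
public class ClusterHealthDecider {

    /** nodeId -> availability zone, as known from the cluster state. */
    private final Map<String, String> nodeToZone;

    public ClusterHealthDecider(Map<String, String> nodeToZone) {
        this.nodeToZone = nodeToZone;
    }

    /**
     * Only evict when the failure looks node-local: if every unhealthy node sits
     * in one zone, or too many nodes report unhealthy at once, suspect an
     * infrastructure issue or a buggy check and take no automatic action.
     */
    public Set<String> nodesSafeToEvict(Set<String> unhealthyNodes) {
        if (unhealthyNodes.isEmpty()) {
            return Set.of();
        }
        Set<String> affectedZones = unhealthyNodes.stream()
            .map(nodeToZone::get)
            .collect(Collectors.toSet());
        boolean singleZoneOutage = affectedZones.size() == 1 && unhealthyNodes.size() > 1;
        boolean tooManyFailures = unhealthyNodes.size() > nodeToZone.size() / 3;
        if (singleZoneOutage || tooManyFailures) {
            return Set.of(); // likely AZ outage or plugin bug; defer to operators
        }
        return unhealthyNodes;
    }
}
```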
Pros
Cons
Use case
Grey Disk detection - Initially, we can start with grey disk detection as the first use case. OpenSearch already has a few metrics baked in as part of the Performance Analyzer plugin (https://github.com/opensearch-project/performance-analyzer), and these metrics can be further leveraged by the Root Cause Analyzer (https://github.com/opensearch-project/performance-analyzer-rca) to detect a faulty node having grey I/O failures. We can write a plugin which calls the RCA framework to get the disk-related metrics over a sufficient period of time and decides on the node's disk health. A hedged sketch of such a plugin is shown below.
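The following sketch shows what such a grey-disk plugin could look like, reusing the hypothetical HealthCheckPlugin interface from above. The MetricsProvider interface is a stand-in for whatever the Performance Analyzer / RCA integration would actually expose; it is not their real API, and the window and thresholds are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;

/**
 * Sketch of the grey-disk use case: a health check plugin that inspects a window
 * of disk write-latency samples and flags the node when latency has been
 * persistently high. MetricsProvider is a hypothetical abstraction.
 */
public class GreyDiskHealthCheck implements HealthCheckPlugin {

    /** Hypothetical source of recent per-node disk write-latency samples (ms). */
    public interface MetricsProvider {
        List<Double> recentDiskWriteLatenciesMillis(int windowMinutes);
    }

    private final MetricsProvider metrics;
    private final double latencyThresholdMillis;
    private final int windowMinutes;

    public GreyDiskHealthCheck(MetricsProvider metrics, double latencyThresholdMillis, int windowMinutes) {
        this.metrics = metrics;
        this.latencyThresholdMillis = latencyThresholdMillis;
        this.windowMinutes = windowMinutes;
    }

    @Override
    public String name() {
        return "grey-disk";
    }

    @Override
    public HealthCheckResult check() {
        List<Double> samples = metrics.recentDiskWriteLatenciesMillis(windowMinutes);
        if (samples.isEmpty()) {
            return new HealthCheckResult(name(), Status.HEALTHY, Map.of("samples", "0"));
        }
        long slow = samples.stream().filter(l -> l > latencyThresholdMillis).count();
        double slowFraction = (double) slow / samples.size();
        // Call the disk "grey" only when most samples over the window are slow,
        // so a short burst does not get a healthy node evicted.
        Status status = slowFraction > 0.8 ? Status.UNHEALTHY
            : slowFraction > 0.5 ? Status.DEGRADED
            : Status.HEALTHY;
        return new HealthCheckResult(name(), status, Map.of("slowFraction", String.valueOf(slowFraction)));
    }
}
```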