Enhance the detection mechanism for the unhealthy etcd node #7730
Labels: component/election (Election related logic.), report/customer (Customers have encountered this bug.), type/enhancement (The issue or PR belongs to an enhancement.)

Comments

JmPotato added the type/enhancement and component/election labels on Jan 18, 2024.
ti-chi-bot bot pushed a commit that referenced this issue on Jan 22, 2024:
ref #7730 Move the health checker into a separate file. Signed-off-by: JmPotato <ghzpotato@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue on Jan 30, 2024:
…#7737) ref #7730 Consider the latency while patrolling the healthy endpoints to reduce the effect of slow nodes. Now, there are the following strategies to select and remove unhealthy endpoints:
- Choose only the healthy endpoints within the lowest acceptable latency range.
- An evicted endpoint can rejoin only after it is selected again three consecutive times.
Signed-off-by: JmPotato <ghzpotato@gmail.com>
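A rough Go sketch of those two strategies is shown below. All names, the one-second latency window, and the threshold of three rounds are illustrative assumptions for this sketch, not PD's actual code or configuration:

```go
package healthcheck

import (
	"sort"
	"time"
)

// endpointStats is a hypothetical per-probe record; PD's real health checker
// keeps richer state than this.
type endpointStats struct {
	addr    string
	healthy bool
	latency time.Duration
}

// pickEndpoints illustrates the two strategies from the commit message:
//  1. keep only healthy endpoints whose latency stays within an acceptable
//     window above the fastest healthy endpoint, and
//  2. require an endpoint to be picked for rejoinThreshold consecutive rounds
//     before it is treated as usable again.
// The 1s window and a threshold of 3 are illustrative values; a real
// implementation would also apply the streak requirement only to endpoints
// that were previously evicted.
func pickEndpoints(stats []endpointStats, streak map[string]int, rejoinThreshold int) []string {
	const acceptableGap = time.Second

	// Strategy 1: drop unhealthy endpoints and anything much slower than the
	// fastest healthy endpoint.
	var healthy []endpointStats
	for _, s := range stats {
		if s.healthy {
			healthy = append(healthy, s)
		}
	}
	if len(healthy) == 0 {
		return nil
	}
	sort.Slice(healthy, func(i, j int) bool { return healthy[i].latency < healthy[j].latency })
	fastest := healthy[0].latency

	picked := make(map[string]bool)
	for _, s := range healthy {
		if s.latency-fastest <= acceptableGap {
			picked[s.addr] = true
		}
	}

	// Strategy 2: track how many consecutive rounds each endpoint was picked;
	// only endpoints that reach the threshold rejoin the usable set.
	var usable []string
	for _, s := range stats {
		if picked[s.addr] {
			streak[s.addr]++
			if streak[s.addr] >= rejoinThreshold {
				usable = append(usable, s.addr)
			}
		} else {
			streak[s.addr] = 0
		}
	}
	return usable
}
```

In this sketch the caller would keep the streak map across probe rounds, so the three-consecutive-picks requirement spans successive patrols rather than a single pass.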
Close with #7737.
/found customer
Part of #7499.
In various tests, we consistently observe continuous unavailability when injecting I/O latency and other chaos into the PD leader node, as in #6291. Upon further investigation of the logs, we discovered that the detection and eviction of unhealthy nodes are not always accurate. As a result, problematic etcd nodes can persistently impact our requests because of the round-robin balancer used by the etcd client. During a leader switch in particular, this problem can prevent the PD leader from stabilizing and prolong the election, which hurts availability significantly.
For now, we simply use a 10-second timeout for the health check, which is too loose in some cases. We need a more precise detection mechanism that promptly removes an unhealthy etcd node from the available endpoints and prevents it from rejoining before it has truly recovered.
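For reference, a minimal sketch of such a per-endpoint probe using the etcd clientv3 maintenance Status API is below. The function name and structure are assumptions rather than PD's implementation; the 10-second bound mirrors the timeout mentioned above, and a tighter mechanism would compare the returned latency across endpoints instead of only waiting for the timeout to expire:

```go
package healthcheck

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// probeEndpoint issues one health probe against a single etcd endpoint via
// the maintenance Status API and reports the observed latency together with
// whether the endpoint answered before the deadline.
func probeEndpoint(cli *clientv3.Client, endpoint string) (time.Duration, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	start := time.Now()
	_, err := cli.Status(ctx, endpoint)
	return time.Since(start), err == nil
}
```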