Detect leader health and automatically do failover #6403
Labels: affects-6.5, affects-7.1, severity/major, type/bug, type/enhancement
Comments
nolouch added the type/feature-request label on May 4, 2023.
nolouch added the type/bug and type/enhancement labels and removed the type/feature-request label on May 5, 2023.
ti-chi-bot bot pushed a commit referencing this issue on May 12, 2023: ref #6403 (Signed-off-by: Ryan Leung <rleungx@gmail.com>).
ti-chi-bot pushed four commits to ti-chi-bot/pd referencing this issue on May 12, 2023: two marked "ref tikv#6403" and two marked "close tikv#6403" (Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>).
ti-chi-bot bot added a commit referencing this issue on May 15, 2023: "server: fix the leader cannot election after pd leader lost while etcd leader intact" (#6447) (#6461); close #6403, ref #6447.
ti-chi-bot bot added a commit referencing this issue on May 24, 2023: "server: fix the leader cannot election after pd leader lost while etcd leader intact" (#6447) (#6460); close #6403, ref #6447.
Feature Request
Describe the problem your feature request is related to
PD binds the PD leader to the etcd leader to reduce the conceptual burden for users. Based on this behavior, we hit a problem: the PD leader lease is lost while etcd leadership is not, and the previous PD cannot be elected as leader again because of problems with the leader election. We can simulate the problem by dropping all packets coming from this connection.
The etcd raft heartbeats that keep etcd leadership go to the other nodes, while the PD leader lease keepalive goes directly to the local peer advertise address, so they use completely different connections. In this case, the PD leader is lost, but the other followers cannot elect a new leader because the etcd leader is still the old one. PD then cannot serve, and the cluster is unavailable for a long time until the etcd leader changes.
Describe the feature you'd like
Reduce the unavailable time of the cluster.
Describe alternative solutions you've considered
PD leader health detection
Because all followers watch the PD leader's key, all members actually know who the leader is. We can store the leader member ID and its update time in the memory of all members. Once the leader key's lease is lost, the leader key is deleted because the lease has expired; all members learn this by watching the key, then clear the leader ID and record the update time, resetting the record once a new leader is elected.
The leader record struct could look like the sketch below:
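A minimal sketch, assuming illustrative names of my own choosing (leaderRecord, leaderID, updateTime are not the actual PD fields); it uses only the standard-library sync and time packages:

```go
// Sketch only (illustrative names, not the actual PD code): the
// in-memory record each member keeps about the current PD leader.
type leaderRecord struct {
	mu         sync.RWMutex
	leaderID   uint64    // member ID of the current PD leader; 0 if none
	updateTime time.Time // when the record last changed
}
```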
Members can watch the leader key and handle changes to it accordingly here:
pd/server/server.go
Lines 1428 to 1430 in 46fdd96
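A hedged sketch of what that watch handling could do (the function name watchPDLeader and the helper parseMemberID are assumptions, not PD's actual API; the real watch loop is at the permalink above). It keeps the leaderRecord sketch above up to date, using etcd's clientv3 package plus context and time:

```go
// Sketch: maintain the leaderRecord from a watch on the PD leader key.
// cli is an etcd clientv3.Client; rec is the leaderRecord sketched above.
func watchPDLeader(ctx context.Context, cli *clientv3.Client, rec *leaderRecord, leaderKey string) {
	for resp := range cli.Watch(ctx, leaderKey) {
		for _, ev := range resp.Events {
			rec.mu.Lock()
			switch ev.Type {
			case clientv3.EventTypePut:
				// A PD leader was elected (or rewrote its key):
				// record its member ID and the observation time.
				rec.leaderID = parseMemberID(ev.Kv.Value) // parseMemberID is hypothetical
				rec.updateTime = time.Now()
			case clientv3.EventTypeDelete:
				// The lease expired and etcd deleted the key: clear
				// the leader ID and remember when we lost it.
				rec.leaderID = 0
				rec.updateTime = time.Now()
			}
			rec.mu.Unlock()
		}
	}
}
```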
Resign etcd leader if there is no PD leader for a long time
After knowing the PD leader and its update time, PD members can decide to do a failover by letting etcd hold a new election based on how long the leader has been lost. We can add this logic here:
pd/server/server.go
Lines 1567 to 1582 in 46fdd96
Once we detect that there is an etcd leader but no PD leader for a long time (such as 10 * etcdElectionTimeout), we can let the first follower member, sorted by member ID, trigger an etcd re-election. The interface can use:
pd/pkg/member/member.go
Line 282 in e15b211
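A hedged sketch of that decision, building on the leaderRecord sketch above (the function name, parameters, and the resign callback are assumptions, not PD's actual API; the callback stands in for the member interface at the permalink above):

```go
// Sketch: decide whether this member should force an etcd re-election.
// Meant to be called periodically from the leadership check loop on
// every member.
func maybeResignEtcdLeader(rec *leaderRecord, etcdHasLeader bool,
	myID uint64, sortedFollowerIDs []uint64,
	etcdElectionTimeout time.Duration, resign func() error) error {

	rec.mu.RLock()
	noPDLeader := rec.leaderID == 0
	lostFor := time.Since(rec.updateTime)
	rec.mu.RUnlock()

	// Only act when etcd still has a leader but PD has had none for
	// a long time (the proposed 10 * etcdElectionTimeout).
	if !etcdHasLeader || !noPDLeader || lostFor < 10*etcdElectionTimeout {
		return nil
	}
	// Exactly one member acts: the first follower, sorted by member ID.
	if len(sortedFollowerIDs) == 0 || sortedFollowerIDs[0] != myID {
		return nil
	}
	// Trigger the etcd re-election via the member interface referenced
	// above, passed in here as a callback.
	return resign()
}
```

Letting only the smallest-ID follower act avoids several members racing to move the etcd leadership at the same time.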
ETA
A fix on master within about a week.