-
Notifications
You must be signed in to change notification settings - Fork 726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The number of peers to remove may exceed expectation when timeout happends #1652
Labels
type/bug
The issue is confirmed as a bug.
Comments
shafreeck
changed the title
Send the region when the operator is added
The number of peers to remove may exceed expectation when timeout happends
Jul 31, 2019
shafreeck
added a commit
to shafreeck/pd
that referenced
this issue
Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> There is a bug introduced by tikv#1652, in some case, like adding peers or adding learners, the step is left unfinished if the peer is in pending state, although the conf version has changed, in these cases, the operator will be removed because the controller thought someone has changed the conf version(in fact, it self did). We fix that by checking if the conf version has actually changed by current step, if it is, the operator is not regarded as stale.
shafreeck
added a commit
to shafreeck/pd
that referenced
this issue
Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> There is a bug introduced by tikv#1652, in some case, like adding peers or adding learners, the step is left unfinished if the peer is in pending state, although the conf version has changed, in these cases, the operator will be removed because the controller thought someone has changed the conf version(in fact, it self did). We fix that by checking if the conf version has actually changed by current step, if it is, the operator is not regarded as stale. Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
shafreeck
added a commit
to shafreeck/pd
that referenced
this issue
Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com> There is a bug introduced by tikv#1652, in some case, like adding peers or adding learners, the step is left unfinished if the peer is in pending state, although the conf version has changed, in these cases, the operator will be removed because the controller thought someone has changed the conf version(in fact, it self did). We fix that by checking if the conf version has actually changed by current step, if it is, the operator is not regarded as stale. Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
There is a bug when dispatching operators of a region with timeout occurs.
Imagine that, when a region requires 3 replicas and it just has 4 in fact. PD found this and create an operator A to remove a replica. Unfortunately, the operator A failed because of timeout. Then PD created another operator B to remove a replica. Rather than time out, the worse thing is that B executed successfully and A was successful in fact too. Indeed, A was timed out, but we don't know if it was executed. We removed 2 replicas finally which is not as expected.
The raft to remove a member is safe relying on the conf version. If two operators try to remove a member with the same conf version, only one can succeed.
The problem comes from PD. When there is a region heartbeat from TiKV, PD dispatches the operator of that region, it sends the operator with the region from the heartbeat which may have a newer epoch than when the operator is added, so the old operator will be executed on TiKV because it carries the newest region epoch.
The text was updated successfully, but these errors were encountered: