Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The number of peers to remove may exceed expectation when timeout happends #1652

Closed
shafreeck opened this issue Jul 25, 2019 · 0 comments · Fixed by #1659
Closed

The number of peers to remove may exceed expectation when timeout happends #1652

shafreeck opened this issue Jul 25, 2019 · 0 comments · Fixed by #1659
Labels
type/bug The issue is confirmed as a bug.

Comments

@shafreeck
Copy link
Contributor

shafreeck commented Jul 25, 2019

There is a bug when dispatching operators of a region with timeout occurs.

Imagine that, when a region requires 3 replicas and it just has 4 in fact. PD found this and create an operator A to remove a replica. Unfortunately, the operator A failed because of timeout. Then PD created another operator B to remove a replica. Rather than time out, the worse thing is that B executed successfully and A was successful in fact too. Indeed, A was timed out, but we don't know if it was executed. We removed 2 replicas finally which is not as expected.

The raft to remove a member is safe relying on the conf version. If two operators try to remove a member with the same conf version, only one can succeed.

The problem comes from PD. When there is a region heartbeat from TiKV, PD dispatches the operator of that region, it sends the operator with the region from the heartbeat which may have a newer epoch than when the operator is added, so the old operator will be executed on TiKV because it carries the newest region epoch.

@shafreeck shafreeck added the type/bug The issue is confirmed as a bug. label Jul 25, 2019
@shafreeck shafreeck changed the title Send the region when the operator is added The number of peers to remove may exceed expectation when timeout happends Jul 31, 2019
shafreeck added a commit to shafreeck/pd that referenced this issue Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

There is a bug introduced by tikv#1652, in some case, like adding peers or
adding learners, the step is left unfinished if the peer is in pending
state, although the conf version has changed, in these cases, the
operator will be removed because the controller thought someone has
changed the conf version(in fact, it self did). We fix that by checking
if the conf version has actually changed by current step, if it is,
the operator is not regarded as stale.
shafreeck added a commit to shafreeck/pd that referenced this issue Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

There is a bug introduced by tikv#1652, in some case, like adding peers or
adding learners, the step is left unfinished if the peer is in pending
state, although the conf version has changed, in these cases, the
operator will be removed because the controller thought someone has
changed the conf version(in fact, it self did). We fix that by checking
if the conf version has actually changed by current step, if it is,
the operator is not regarded as stale.

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
shafreeck added a commit to shafreeck/pd that referenced this issue Aug 29, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

There is a bug introduced by tikv#1652, in some case, like adding peers or
adding learners, the step is left unfinished if the peer is in pending
state, although the conf version has changed, in these cases, the
operator will be removed because the controller thought someone has
changed the conf version(in fact, it self did). We fix that by checking
if the conf version has actually changed by current step, if it is,
the operator is not regarded as stale.

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant