Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: fast-fail if the leader changed to the RemovePeer #2530

Merged
merged 8 commits into from
Jun 18, 2020

Conversation

nolouch
Copy link
Contributor

@nolouch nolouch commented Jun 11, 2020

Signed-off-by: nolouch nolouch@gmail.com

What problem does this PR solve?

Fix #2493
If the leader changed cause by TiKV(maybe too busy or network problem), the operator cannot step to the final state, such as

PD:
try to remove peer 24
TiKV:
reject remove peer 24 because it is a leader.

Then the operator needs 10mins to timeout, which may cause the rule sync learner became slow.

What is changed and how it works?

Check List

Tests

  • Unit test

Release note

  • Let the operator fast fail if the leader changed to the RemovePeer

Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch nolouch added component/schedule Scheduling logic. needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. needs-cherry-pick-release-3.0 The PR needs to cherry pick to release-3.0 branch. needs-cherry-pick-release-3.1 The PR needs to cherry pick to release-3.1 branch. labels Jun 11, 2020
@nolouch nolouch requested review from rleungx and HunDunDM June 11, 2020 06:26
Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch
Copy link
Contributor Author

nolouch commented Jun 15, 2020

PTAL @rleungx @HunDunDM

@nolouch nolouch requested a review from disksing June 15, 2020 13:06
@codecov-commenter
Copy link

Codecov Report

Merging #2530 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2530      +/-   ##
==========================================
+ Coverage   77.06%   77.09%   +0.03%     
==========================================
  Files         204      204              
  Lines       22079    22044      -35     
==========================================
- Hits        17015    16995      -20     
+ Misses       3773     3769       -4     
+ Partials     1291     1280      -11     
Impacted Files Coverage Δ
server/schedule/operator_controller.go 79.14% <100.00%> (+0.48%) ⬆️
server/core/test_util.go 77.21% <0.00%> (-18.99%) ⬇️
server/schedulers/shuffle_hot_region.go 63.47% <0.00%> (-11.31%) ⬇️
server/region_syncer/client.go 74.80% <0.00%> (-4.59%) ⬇️
server/grpc_service.go 58.80% <0.00%> (-1.75%) ⬇️
pkg/mock/mockhbstream/mockhbstream.go 89.23% <0.00%> (-1.54%) ⬇️
server/schedulers/label.go 73.84% <0.00%> (-0.40%) ⬇️
client/client.go 70.60% <0.00%> (ø)
server/schedule/operator/kind.go 100.00% <0.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0df431e...e95bd02. Read the comment docs.

func (oc *OperatorController) checkStaleOperator(op *operator.Operator, step operator.OpStep, region *core.RegionInfo) bool {
switch s := step.(type) {
case operator.RemovePeer:
if s.FromStore == region.GetLeader().GetStoreId() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's too tricky to make special judgments on removePeer and leader here. Maybe we can solve it by recording the expected region leader at each step and check if the leader is the same. (like op.ConfVerChanged)

Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
Copy link
Member

@rleungx rleungx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest LGTM

server/schedule/operator/step.go Outdated Show resolved Hide resolved
Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch
Copy link
Contributor Author

nolouch commented Jun 18, 2020

PTAL @disksing @rleungx

@disksing
Copy link
Contributor

/merge

1 similar comment
@nolouch
Copy link
Contributor Author

nolouch commented Jun 18, 2020

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 18, 2020
@ti-srebot
Copy link
Contributor

/run-all-tests

@ti-srebot ti-srebot merged commit ffafc22 into tikv:master Jun 18, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-3.0 failed

@ti-srebot
Copy link
Contributor

cherry pick to release-3.1 failed

@nolouch nolouch deleted the fast-fail branch June 18, 2020 07:44
@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 failed

nolouch added a commit to nolouch/pd that referenced this pull request Jun 18, 2020
ti-srebot pushed a commit that referenced this pull request Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/schedule Scheduling logic. needs-cherry-pick-release-3.0 The PR needs to cherry pick to release-3.0 branch. needs-cherry-pick-release-3.1 The PR needs to cherry pick to release-3.1 branch. needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. status/can-merge Indicates a PR has been approved by a committer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fast timeout operator when leader changed
6 participants