operator: fast-fail if the leader changed to the RemovePeer #2530
Conversation
Signed-off-by: nolouch <nolouch@gmail.com>
Codecov Report
```
@@            Coverage Diff             @@
##           master    #2530      +/-   ##
==========================================
+ Coverage   77.06%   77.09%   +0.03%
==========================================
  Files         204      204
  Lines       22079    22044      -35
==========================================
- Hits        17015    16995      -20
+ Misses       3773     3769       -4
+ Partials     1291     1280      -11
==========================================
```
Continue to review full report at Codecov.
```go
func (oc *OperatorController) checkStaleOperator(op *operator.Operator, step operator.OpStep, region *core.RegionInfo) bool {
	switch s := step.(type) {
	case operator.RemovePeer:
		// Fast-fail: a RemovePeer step aimed at the store that currently
		// holds the leader cannot make progress, so treat the operator as stale.
		if s.FromStore == region.GetLeader().GetStoreId() {
```
I think it's too tricky to make special judgments on RemovePeer and the leader here. Maybe we can solve it by recording the expected region leader at each step and checking whether the leader is the same (like op.ConfVerChanged).
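A rough sketch of that direction, purely illustrative (the `LeaderChecker` interface, `ExpectedLeader` method, and `isStaleStep` helper are hypothetical names, not PD APIs):

```go
// Hypothetical sketch of the suggestion: each step records which store is
// expected to hold the leader while the step executes, and the controller
// checks it generically instead of special-casing RemovePeer.
type LeaderChecker interface {
	// ExpectedLeader returns the store ID that should hold the leader for
	// this step to make progress; 0 means any leader is acceptable.
	ExpectedLeader() uint64
}

// isStaleStep reports whether the region's current leader contradicts the
// step's expectation, analogous to how op.ConfVerChanged detects stale
// configuration versions.
func isStaleStep(step operator.OpStep, region *core.RegionInfo) bool {
	c, ok := step.(LeaderChecker)
	if !ok || c.ExpectedLeader() == 0 {
		return false
	}
	return region.GetLeader().GetStoreId() != c.ExpectedLeader()
}
```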
Signed-off-by: nolouch <nolouch@gmail.com>
The rest LGTM
Signed-off-by: nolouch <nolouch@gmail.com>
/merge
/run-all-tests
cherry pick to release-3.0 failed
cherry pick to release-3.1 failed
cherry pick to release-4.0 failed
Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
What problem does this PR solve?
Fix #2493
If the leader changed because of TiKV (maybe it was too busy or there was a network problem), the operator cannot step to the final state.
Then the operator needs 10 minutes to time out, which may make syncing learners for placement rules slow.
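As an illustration of the fast-fail idea (a minimal sketch, assuming the `RemovePeer` step and accessors shown in the diff above; `shouldFastFail` is a hypothetical helper, not the PR's actual code):

```go
// shouldFastFail reports whether a RemovePeer step targets the store that
// currently holds the leader. Such a step cannot be applied until leadership
// moves, so canceling the operator immediately avoids waiting ~10 minutes
// for the operator timeout.
func shouldFastFail(step operator.OpStep, region *core.RegionInfo) bool {
	rp, ok := step.(operator.RemovePeer)
	if !ok {
		return false
	}
	return rp.FromStore == region.GetLeader().GetStoreId()
}
```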
What is changed and how it works?
Check List
Tests
Release note