do not scale or upgrade tikv at the same time #2705
Conversation
I think PD should have the same logic to avoid the same error.
Force-pushed from 83a184e to e07a472.
pkg/manager/member/tikv_scaler.go (outdated)
```diff
@@ -144,7 +173,7 @@ func (tsd *tikvScaler) ScaleIn(tc *v1alpha1.TidbCluster, oldSet *apps.StatefulSet,
 	}
 	klog.Infof("tikv scale in: set pvc %s/%s annotation: %s to %s",
 		ns, pvcName, label.AnnPVCDeferDeleting, now)
 
+	tc.Status.TiKV.Phase = v1alpha1.ScaleInPhase
```
I think we can't directly set the TiKV phase to `ScaleInPhase` here. During scaling we update both the StatefulSet and the TidbCluster. What if the StatefulSet update succeeds but the TidbCluster update fails? The scaling would actually have happened, but we would not have recorded the state.
I think we should determine the status in the `syncTidbClusterStatus` function, as the upgrading logic does.
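A minimal sketch of what that could look like, assuming the phase is derived from observed state inside the status-sync path. The helper name is hypothetical; `TiKVStsDesiredReplicas` and the phase constants come from the surrounding discussion (`ScalePhase` is the constant this PR later settles on):

```go
// Hypothetical helper: derive the TiKV phase from observed state during
// syncTidbClusterStatus instead of setting it imperatively in the scaler.
func syncTiKVPhase(tc *v1alpha1.TidbCluster, set *apps.StatefulSet) {
	if tc.TiKVStsDesiredReplicas() != *set.Spec.Replicas {
		// Desired replicas diverge from the StatefulSet, so a scale
		// operation is still in progress.
		tc.Status.TiKV.Phase = v1alpha1.ScalePhase
		return
	}
	tc.Status.TiKV.Phase = v1alpha1.NormalPhase
}
```

The appeal of this shape is that the recorded phase can never drift from reality: it is recomputed from what actually exists on every sync.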
We cannot know for certain whether TiKV is scaling in the `syncTidbClusterStatus` function, because the operator may be deleting the store at the very moment we check. We also cannot rely on the store state: if a store is in Tombstone state, we cannot tell whether it belongs to the current Pod or to a previous Pod that was deleted earlier.
We have a retry mechanism for the TidbCluster update. Yes, it can still fail, but we cannot cover every corner case; this PR mitigates the issue in most cases.
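For reference, the retry mechanism mentioned here is typically a conflict-retrying update along these lines. This is a generic client-go sketch, not the operator's actual update code; `getTidbCluster` and `updateTidbCluster` are hypothetical stand-ins:

```go
import "k8s.io/client-go/util/retry"

// Generic sketch of a conflict-retrying status update.
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	latest, err := getTidbCluster(ns, name) // re-fetch to pick up the newest resourceVersion
	if err != nil {
		return err
	}
	latest.Status.TiKV.Phase = v1alpha1.ScaleInPhase
	return updateTidbCluster(latest) // retried only on a conflict error
})
// Even with retries the update can still fail, which is why the reply above
// says this PR mitigates, rather than fully eliminates, the issue.
if err != nil {
	klog.Errorf("failed to record TiKV phase: %v", err)
}
```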
PD is the first component to be handled: if the spec change requires both an upgrade and scaling at the same time, the operator will always upgrade it first.
I think we need a mutex variable to ensure that upgrading and scaling won't happen at the same time. Only after the upgrade or scale operation has successfully acquired the mutex could we perform the action.
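A rough sketch of that idea, treating the recorded phase itself as the mutex. This is purely illustrative; `tryAcquirePhase` is a hypothetical helper, not operator code:

```go
// Hypothetical: only one of upgrade/scale may hold the "mutex" (the phase
// field) at a time; the other operation must wait for NormalPhase.
func tryAcquirePhase(tc *v1alpha1.TidbCluster, want v1alpha1.MemberPhase) bool {
	current := tc.Status.TiKV.Phase
	if current == v1alpha1.NormalPhase || current == want {
		tc.Status.TiKV.Phase = want
		return true
	}
	return false // held by the other operation; retry on a later sync
}
```

Since the controller is level-triggered and stateless across restarts, the lock has to live in the persisted status rather than in an in-memory `sync.Mutex`.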
@DanielZhangQD I'm wondering: if you first scale PD from 5 to 3, it might take about 10 seconds for pd-4 to transfer its leader and for pd-3 to be deleted as a member. Then we follow the sequence below:

0:00 scale PD from 5 to 3
0:02 upgrade the PD image

Will the controller logic prevent PD from upgrading while it is still scaling?
I tested this at the beginning and PD could be upgraded and scaled successfully, but yes, in theory it's possible to hit a similar issue here, so I have updated the code for PD as well.
```go
if upgrading && tc.Status.PD.Phase != v1alpha1.UpgradePhase {
	tc.Status.TiKV.Phase = v1alpha1.UpgradePhase
} else if tc.TiKVStsDesiredReplicas() != *set.Spec.Replicas {
	tc.Status.TiKV.Phase = v1alpha1.ScalePhase
}
```
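For context, the `upgrading` flag above is typically derived from the StatefulSet's revision status. A simplified sketch (the operator's real check also inspects the controller-revision labels on the pods):

```go
// Simplified: a StatefulSet is considered to be upgrading while the observed
// revision still differs from the target update revision.
func statefulSetIsUpgrading(set *apps.StatefulSet) bool {
	return set.Status.CurrentRevision != set.Status.UpdateRevision
}
```

With the `upgrading` branch checked first, a rolling upgrade keeps priority over a pending replica change once it has actually started.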
If we update `tikvSpec` with both an upgrade and a scale change in one update, I think the logic here will consider TiKV to be scaling. That is to say, we will start scaling TiKV before upgrading it. WDYT @DanielZhangQD @cofyc
Yes, it is.
I think this also applies to the PD logic.
I think it's OK whether we scale or upgrade first; the key is not to do both at the same time.
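A minimal sketch of the cross-check this thread converges on, in the spirit of the `tikv_scaler.go` change. The exact signature, imports, and error message are assumptions, not the literal diff:

```go
import (
	"fmt"

	apps "k8s.io/api/apps/v1"

	"github.com/pingcap/tidb-operator/pkg/apis/pingcap/v1alpha1"
)

// Sketch: refuse to scale in while an upgrade is recorded; the mirror-image
// check on the upgrader side keeps the two operations mutually exclusive.
func (tsd *tikvScaler) ScaleIn(tc *v1alpha1.TidbCluster, oldSet, newSet *apps.StatefulSet) error {
	if tc.Status.TiKV.Phase == v1alpha1.UpgradePhase {
		return fmt.Errorf("TidbCluster %s/%s is upgrading TiKV, cannot scale in now", tc.Namespace, tc.Name)
	}
	// ... existing scale-in logic (defer-delete PVC annotation, etc.) ...
	return nil
}
```

Returning an error simply requeues the cluster, so the deferred operation runs on a later sync once the other one finishes.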
/merge
Your auto merge job has been accepted, waiting for:
/run-all-tests
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-1.1 in PR #2770
What problem does this PR solve?
Fix #2631
What is changed and how does it work?
Cross-check during upgrading and scaling to make sure that upgrading and scaling for TiKV do not occur at the same time.
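A hedged sketch of the upgrader-side half of that cross-check (illustrative, not the literal diff):

```go
// Sketch: the upgrader bails out while a scale operation is recorded,
// mirroring the guard in the scaler.
if tc.Status.TiKV.Phase == v1alpha1.ScalePhase {
	return fmt.Errorf("TidbCluster %s/%s is scaling TiKV, cannot upgrade now", tc.Namespace, tc.Name)
}
```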
Check List
Tests
Code changes
Related changes
Does this PR introduce a user-facing change?: