
do not scale or upgrade tikv at the same time #2705

Merged: 8 commits into pingcap:master from scale0, Jun 18, 2020

Conversation

DanielZhangQD (Contributor) opened this pull request:

What problem does this PR solve?

Fix #2631

What is changed and how does it work?

Add cross-checks to the upgrade and scale paths so that upgrading and scaling of TiKV cannot occur at the same time.
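A minimal sketch of the cross-check idea, assuming the phase constants that appear in this PR's diff (ScalePhase, UpgradePhase); the trimmed tikvUpgrader type and the function body are illustrative, not the PR's actual code:

package member

import (
	apps "k8s.io/api/apps/v1"
	"k8s.io/klog"

	"github.com/pingcap/tidb-operator/pkg/apis/pingcap/v1alpha1"
)

// tikvUpgrader is a trimmed stand-in for the real upgrader type.
type tikvUpgrader struct{}

// Upgrade sketches the cross-check: skip the rolling upgrade while TiKV is
// in the middle of a scale operation (the scaler performs the inverse check).
func (u *tikvUpgrader) Upgrade(tc *v1alpha1.TidbCluster, oldSet, newSet *apps.StatefulSet) error {
	if tc.Status.TiKV.Phase == v1alpha1.ScalePhase {
		klog.Infof("tidbcluster [%s/%s] tikv is scaling, skip upgrading", tc.Namespace, tc.Name)
		// Keep the old pod template so the StatefulSet controller does not
		// start rolling pods while stores are still being scaled.
		newSet.Spec.Template = oldSet.Spec.Template
		return nil
	}
	// ... normal rolling-upgrade logic would continue here ...
	return nil
}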

Check List

Tests

  • Unit test
  • E2E test
  • Manual test (add detailed scripts or steps below)
    • Scale out TiKV and then upgrade TiKV
    • Scale in TiKV and then upgrade TiKV
    • Scale in and upgrade TiKV at the same time
    • Upgrade TidbCluster to newer version and scale in TiKV at the same time

Code changes

  • Has Go code change

Related changes

  • Need to cherry-pick to the release branch

Does this PR introduce a user-facing change?:

Do not scale or upgrade tikv at the same time

Yisaer (Contributor) commented on Jun 15, 2020:

I think PD should have the same logic to avoid the same error.

DanielZhangQD force-pushed the scale0 branch 2 times, most recently from 83a184e to e07a472, on June 16, 2020 06:21
DanielZhangQD requested review from Yisaer, weekface, and cofyc and removed the request for Yisaer on June 16, 2020 06:54
@@ -144,7 +173,7 @@ func (tsd *tikvScaler) ScaleIn(tc *v1alpha1.TidbCluster, oldSet *apps.StatefulSe
}
klog.Infof("tikv scale in: set pvc %s/%s annotation: %s to %s",
ns, pvcName, label.AnnPVCDeferDeleting, now)

tc.Status.TiKV.Phase = v1alpha1.ScaleInPhase
Yisaer (Contributor) commented on the diff, Jun 16, 2020:

I think we can't directly set the TiKV phase to ScaleInPhase here. During scaling, we update both the StatefulSet and the TidbCluster. What if the StatefulSet update succeeds but the TidbCluster update fails? The scaling would actually have happened, but we would never have recorded the state.

I think we should derive the status in the syncTidbClusterStatus function, as we already do for upgrading.
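(For reference, a hedged sketch of the revision-based check that "as we already do for upgrading" refers to; the real helper in tidb-operator also inspects each pod's controller-revision label, and this function name is hypothetical:)

import apps "k8s.io/api/apps/v1"

// tikvStatefulSetIsUpgrading is a simplified sketch: a StatefulSet is treated
// as upgrading while its current revision differs from its update revision,
// i.e. some pods still run the old pod template.
func tikvStatefulSetIsUpgrading(set *apps.StatefulSet) bool {
	return set.Status.CurrentRevision != set.Status.UpdateRevision
}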

DanielZhangQD (Author) replied:

We cannot know for certain in the syncTidbClusterStatus function whether TiKV is scaling, because the operator may be deleting the store at the moment we check. We also cannot rely on the store state: if a store is in Tombstone state, we cannot tell whether it belongs to the current Pod or to a previous Pod that was deleted earlier.
We have a retry mechanism for the TidbCluster update, and yes, it can still fail, but we cannot fix every corner case; this PR mitigates the issue in most cases.

DanielZhangQD (Author) commented:

> I think PD should have the same logic to avoid the same error.

PD is the first component to be handled; if the spec change requires both an upgrade and scaling at the same time, the operator will always upgrade PD first.

@DanielZhangQD DanielZhangQD requested a review from Yisaer June 16, 2020 09:42
Yisaer (Contributor) commented on Jun 16, 2020:

I think we need a mutex variable to ensure that upgrading and scaling won't happen at the same time. Only after upgrading or scaling has successfully acquired the mutex could we perform the action.
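(A hypothetical illustration of the mutex idea, not code from this PR; in the merged change, the recorded status phase effectively plays this role:)

// tryAcquire treats the recorded TiKV phase as a lock: an operation may act
// only if the phase is Normal (lock free) or already its own (it holds the
// lock). The phase constants are the real v1alpha1 values; the helper itself
// is hypothetical.
func tryAcquire(tc *v1alpha1.TidbCluster, want v1alpha1.MemberPhase) bool {
	phase := tc.Status.TiKV.Phase
	return phase == v1alpha1.NormalPhase || phase == want
}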

Yisaer (Contributor) commented on Jun 17, 2020:

@DanielZhangQD I'm wondering: if you first scale PD from 5 to 3, it might take about 10 seconds for pd-4 to transfer its leadership and for pd-3 to be deleted as a member.

Then suppose the following sequence:

0:00 Scale PD from 5 to 3
0:02 Upgrade the PD image

Will the controller logic prevent PD from upgrading while it is still scaling?

DanielZhangQD (Author) replied:

> @DanielZhangQD I'm wondering: if you first scale PD from 5 to 3, it might take about 10 seconds for pd-4 to transfer its leadership and for pd-3 to be deleted as a member.
>
> Then suppose the following sequence:
>
> 0:00 Scale PD from 5 to 3
> 0:02 Upgrade the PD image
>
> Will the controller logic prevent PD from upgrading while it is still scaling?

I tested this at the beginning and PD could be upgraded and scaled successfully, but yes, in theory it's possible to hit a similar issue here. I have updated the code for PD as well.

DanielZhangQD (Author) commented:

@Yisaer @cofyc Code updated, PTAL.

Comment on lines 634 to +637
if upgrading && tc.Status.PD.Phase != v1alpha1.UpgradePhase {
tc.Status.TiKV.Phase = v1alpha1.UpgradePhase
} else if tc.TiKVStsDesiredReplicas() != *set.Spec.Replicas {
tc.Status.TiKV.Phase = v1alpha1.ScalePhase
A reviewer (Contributor) commented:

If we update the TiKV spec with both an upgrade and a scale in one update, I think the logic here will consider TiKV to be under Scaling. That is to say, we will start scaling TiKV before upgrading it. WDYT @DanielZhangQD @cofyc

DanielZhangQD (Author) replied:

Yes, it is.

The reviewer (Contributor) replied:

I think this also applies to the PD logic.

DanielZhangQD (Author) replied:

I think it's OK to do either scaling or upgrading first; the key is not to do both at the same time.
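(To tie the thread together, a commented sketch of the decision in the diff above; the final else branch, and the explanation of why a combined change is observed as scaling first, are my reading and are not spelled out in the thread:)

// Simplified sketch of the phase decision from the diff above. `upgrading`
// is derived from StatefulSet/pod revisions, so it only becomes true once a
// new pod template starts rolling out, whereas the replica mismatch is
// visible immediately from the spec; a combined upgrade+scale change is
// therefore observed as scaling first.
if upgrading && tc.Status.PD.Phase != v1alpha1.UpgradePhase {
	// PD is always handled first, so TiKV is only marked as upgrading
	// after PD has finished its own upgrade.
	tc.Status.TiKV.Phase = v1alpha1.UpgradePhase
} else if tc.TiKVStsDesiredReplicas() != *set.Spec.Replicas {
	// A pending scale: record ScalePhase so the upgrader skips its work
	// until scaling completes.
	tc.Status.TiKV.Phase = v1alpha1.ScalePhase
} else {
	tc.Status.TiKV.Phase = v1alpha1.NormalPhase
}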

DanielZhangQD (Author) commented:

/merge

ti-srebot (Contributor) commented:

Your auto merge job has been accepted, waiting for:

  • 2712

ti-srebot (Contributor) commented:

/run-all-tests

@ti-srebot ti-srebot merged commit f129922 into pingcap:master Jun 18, 2020
ti-srebot pushed a commit to ti-srebot/tidb-operator that referenced this pull request Jun 18, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
ti-srebot (Contributor) commented:

cherry pick to release-1.1 in PR #2770

ti-srebot added a commit that referenced this pull request Jun 18, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
Successfully merging this pull request may close the following issue:

upgrade and scaling in tikv at the same time will cause upgrade failure (#2631)