GCE Node Upgrade should respect PodDisruptionBudget #38336

enisoc · 2016-12-07T23:52:01Z

As part of pod safety for StatefulSets, we ensured that kubectl drain uses the eviction subresource so it respects PodDisruptionBudget.

Unfortunately, there is no script or recommended node upgrade process (AFAIK) that actually uses kubectl drain. In particular, the experimental cluster/gce/upgrade.sh script uses GCE rolling updates which simply terminate and recreate node VMs.

This issue is to track the need to provide a script or recommended process for upgrading nodes that actually respects PodDisruptionBudgets, so that the guarantees made by things like StatefulSet actually work during upgrades.

cc @kubernetes/sig-apps

The text was updated successfully, but these errors were encountered:

davidopp · 2016-12-08T08:15:45Z

There is some discussion of this in #32323. It's not trivial. See my comment #32323 (comment)

Actually, I think I'll copy it here, since most of that thread is unrelated.

If this is high-priority I'm happy to discuss these options in more detail.

[copied comment below]

We've talked about two possible solutions to this:

(1) Create "node drain as a service" a.k.a. "move kubectl drain into the server" -- an API object where you specify which nodes you want to drain, with an associated controller that figures out a good "schedule" for draining them subject to PDB and any other constraints, and which does the drain (evictions) and then when a node is drained calls out to some endpoint to do the maintenance action, in this case presumably calling out to some GCP endpoint that will then execute the node upgrade. Today's upgrade.sh logic would be split between this controller and the GCP logic that executes the node upgrade.

(2) Modify the MIG rolling updater so that it is GKE-aware, e.g. drains the nodes before upgrading and takes PodDisruptionBudget into account. Today's upgrade.sh could continue to operate more or less as-is, at least in the sense of relying on GCP to take care of scheduling and executing the upgrades.

Similar issues of choosing the node, draining, and respecting PDB come up elsewhere too (e.g. preemptible VMs, cluster auto-scaler scale down, and rescheduler).

But anyway, you're right, it's not just a simple matter of calling kubectl drain from upgrade.sh

enisoc · 2016-12-08T17:51:41Z

@davidopp Thanks for surfacing that post. I'm not familiar with the available GCE APIs, but would an intermediate approach be possible where we first "unroll" upgrade.sh to do everything in terms of individual VMs (rather than using rolling updates), and then we add kubectl drain/uncordon? Going in arbitrary order (no smart schedule) would at least preserve safety, at the expense of extra pod churn and perhaps a risk of stalling the rolling upgrade. Did you exclude that option because it isn't possible as long as we're running the nodes via MIG?

For the record, I should add some more context for why this is a priority for us in sig-apps. Together with #37984, this means upgrading nodes from 1.5.0 to a future release can (1) take a clustered StatefulSet app below the required quorum to function, and (2) leave it in that stuck state for ~10min (or potentially up to the time it takes to complete a rolling update of all nodes) due solely to voluntary actions (planned maintenance).

In effect, StatefulSets still don't have a viable upgrade path going forward, despite being beta now. I discussed this with @erictune and we're going to look across the set of interacting issues that got us here to see what can be addressed perhaps in a patch release. If there's no easy mitigation for the issue in question on this page (like a script refactoring or a documented safe manual flow for GCE), we'll need to focus elsewhere for now.

davidopp · 2016-12-08T20:33:41Z

I just chatted with @maisem and he says he is working on exactly what you described and should have it done within the next few weeks. I'll let him describe it here, but from chatting with him it does sound like the third option you described.

davidopp · 2016-12-08T20:39:23Z

(BTW I still think we should do "node drain as a service" (option 1 above) eventually, because that would be portable across cloud providers and could even be used by GKE, vs. upgrade.sh which only works on GCE.)

roberthbailey · 2016-12-09T06:02:34Z

/cc @maisem

would an intermediate approach be possible where we first "unroll" upgrade.sh to do everything in terms of individual VMs (rather than using rolling updates), and then we add kubectl drain/uncordon?

This is definitely an improvement over the current shell script.

Did you exclude that option because it isn't possible as long as we're running the nodes via MIG?

This is definitely possible when using a MIG.

roberthbailey · 2016-12-09T06:02:48Z

/cc @kubernetes/sig-cluster-lifecycle (as this is related to 1.5 -> 1.6 upgrades)

davidopp · 2016-12-09T06:14:01Z

This is definitely possible when using a MIG.

Yeah I didn't realize it was possible, but I agree this is definitely the most expedient thing to do short-term.

maisem · 2016-12-12T18:02:57Z

I started working on a safe way to do upgrades a couple of weeks ago but then got sidetracked.
This should allow for a non-disruptive way of upgrading the nodes in a cluster.

1. Create a new Instance Template (we do this already)
2. Change the Managed Instance Group to use the new template.
    gcloud compute instance-groups managed set-instance-template
3. For each node in the Instance Group
    a. Verify the node needs to be updated by checking the current instance template
        This can be done by checking for the "instance-template" key in the instance metadata.
    b. Drain the node
        kubectl drain
    c. Recreate the node
        gcloud compute instance-groups managed recreate-instances
    d. Wait for node to become healthy again

bgrant0607 · 2017-03-23T17:36:46Z

cc @mml

santinoncs · 2017-04-19T15:02:16Z

Hi, anyone knows if GKE upgrade is doing a “kubectl drain” ?

According to its documentation, GKE upgrade uses "kubectl drain" before upgrading an specific node.

But I got a 3 node GKE cluster and a cluster of consul ( 3 pods ) and the consensus was lost, looking for the logs it seems that it halt 2 pods at the same time not respecting the minAvailable parameter of the PodDisruptionBudget resource.

When I perform a drain, it honor the PodDisruptionBudget, so it works.

I am using StatefulSet for the consul deployment.

justinsb · 2017-04-19T16:56:49Z

FYI kops has drain support behind a feature flag, and we do GCE behind a feature flag. I'm not sure whether we do drain on GCE though. We use the drain, delete instance, wait for readiness approach.

roberthbailey · 2017-04-21T21:00:04Z

@santinoncs

Yes, @maisem added a drain step as part of upgrading GKE nodes but it doesn't exec kubectl drain as a subcommand. Since drain is still client side, the implementation may be slightly different than what you get using the kubectl tool. In particular, I believe we don't yet honor the PodDisruptionBudget (as you noticed).

maisem · 2017-04-21T21:14:41Z

As @roberthbailey pointed out we don't honor PodDisruptionBudget yet. However, we do have a high priority work item to fix this soon.

mml · 2017-04-24T19:41:42Z

I will take a whack at this. I would like test coverage too, in the form of one of those fancy e2e upgrade tests.

mml · 2017-04-24T20:53:23Z

I spoke with @krousey about how to add the tests. Basically, we'll add this to the existing list of tests in test/e2e/cluster_upgrade.go. Either modifying, say, the ServiceUpgradeTest to verify it works through a node upgrade, or adding a separate test, which we'll probably do as we want to test several apps with native APIs.

Kris asked the question whether or not this should live in cluster/gce/upgrade.sh (or perhaps a rewrite of it in go). AFAIK, this (kube-up and the whole cluster/ directory) is still the reference implementation for these sorts of things, and the only supported method for GCE clusters. @maisem listed some logic above that doesn't strike me as that hard to implement in shell, anyway.

Kris' concern is that maybe cluster/ is marked for destruction. Is it?

roberthbailey · 2017-04-25T04:00:19Z

Matt -- we'd love to delete the cluster/ directory, but we still don't have a replacement. So for now cluster/gce/upgrade.sh is the only place where I know that we have upgrade test coverage (other than the GKE upgrade test suites).

mml · 2017-05-12T19:31:05Z

Well that's pretty neat for a shell script: https://gist.github.com/mml/1a24775b6fceea4685c84aee2f548f6b

enisoc · 2017-05-12T22:40:17Z

@mml wrote:

The question is, do we want the upgrade script to automatically blow up local storage by default? I can't see how to make it work without this, and at the same time it seems a dangerous default.

It may not be too bad. Currently, I don't think there's any guarantee your pods will come back when the node is upgraded. If the restart takes too long, the pods will get deleted and rescheduled on another node. However, I admit in practice many people may be relying on this unguaranteed behavior. Speaking of which, is the drain going to be opt-in, out-out, or required?

Perhaps we could copy the pattern kubectl drain itself uses, which is "quit if such pods exist and make the user re-run the command with --delete-local-data". I'm not sure what a good UX for that would be though, and it may require fixing the script to be resumable if it isn't already.

Another option could be to make specifying whether to delete local data a required option when upgrading nodes, so you cannot even run it the first time without already having considered what you want to happen.

@kow3ns

Automatic merge from submit-queue Respect PDBs during node upgrades and add test coverage to the ServiceTest upgrade test. This is still a WIP... needs to be squashed at least, and I don't think it's currently passing until I increase the scale of the RC, but please have a look at the general outline. Thanks! Fixes #38336 @kow3ns @bdbauer @krousey @erictune @maisem @davidopp ``` On GCE, node upgrades will now respect PodDisruptionBudgets, if present. ```

Automatic merge from submit-queue (batch tested with PRs 48374, 48524, 48519, 42548, 48615) Add a node upgrade test for etcd statefulset with a PDB. Tests for #38336

Automatic merge from submit-queue StatefulSet upgrade test - replicated database (mysql) **What this PR does / why we need it**: Adds a new upgrade test. The test creates a statefulset with a replicated mysql database. It populates the database and then continually polls the data while performing an upgrade. Ultimately, this PR increases confidence of reliability during upgrades. It helps show that StatefulSets and Pod Disruption Budgets are doing what they're supposed to. Code to pay attention to this was added for #38336. Also vendors in a golang mysql client driver, for use in the test. **Release note**: ```release-note NONE ```

Automatic merge from submit-queue Adding a Cassandra upgrade test Tests for #38336

roberthbailey added area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Dec 9, 2016

bgrant0607 mentioned this issue Mar 23, 2017

DisruptionBudget object to define the max disruption that can be caused to pods #12611

Closed

bgrant0607 added the area/node-lifecycle Issues or PRs related to Node lifecycle label Mar 23, 2017

mml self-assigned this Apr 24, 2017

mml mentioned this issue May 12, 2017

Respect PDBs during node upgrades and add test coverage to the ServiceTest upgrade test. #45748

Merged

k8s-github-robot closed this as completed in #45748 Jun 4, 2017

mml mentioned this issue Jun 16, 2017

Adding a Cassandra upgrade test #47200

Merged

mml mentioned this issue Jun 30, 2017

Add a node upgrade test for etcd statefulset with a PDB. #48374

Merged

bdbauer mentioned this issue Jul 6, 2017

StatefulSet upgrade test - replicated database (mysql) #46738

Merged

k8s-github-robot pushed a commit that referenced this issue Aug 4, 2017

Merge pull request #47200 from maisem/cassandra

325a53a

Automatic merge from submit-queue Adding a Cassandra upgrade test Tests for #38336

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GCE Node Upgrade should respect PodDisruptionBudget #38336

GCE Node Upgrade should respect PodDisruptionBudget #38336

enisoc commented Dec 7, 2016

davidopp commented Dec 8, 2016

enisoc commented Dec 8, 2016

davidopp commented Dec 8, 2016

davidopp commented Dec 8, 2016

roberthbailey commented Dec 9, 2016

roberthbailey commented Dec 9, 2016

davidopp commented Dec 9, 2016

maisem commented Dec 12, 2016

bgrant0607 commented Mar 23, 2017

santinoncs commented Apr 19, 2017 •

edited

Loading

justinsb commented Apr 19, 2017

roberthbailey commented Apr 21, 2017

maisem commented Apr 21, 2017

mml commented Apr 24, 2017

mml commented Apr 24, 2017

roberthbailey commented Apr 25, 2017

mml commented May 12, 2017 •

edited

Loading

enisoc commented May 12, 2017

GCE Node Upgrade should respect PodDisruptionBudget #38336

GCE Node Upgrade should respect PodDisruptionBudget #38336

Comments

enisoc commented Dec 7, 2016

davidopp commented Dec 8, 2016

enisoc commented Dec 8, 2016

davidopp commented Dec 8, 2016

davidopp commented Dec 8, 2016

roberthbailey commented Dec 9, 2016

roberthbailey commented Dec 9, 2016

davidopp commented Dec 9, 2016

maisem commented Dec 12, 2016

bgrant0607 commented Mar 23, 2017

santinoncs commented Apr 19, 2017 • edited Loading

justinsb commented Apr 19, 2017

roberthbailey commented Apr 21, 2017

maisem commented Apr 21, 2017

mml commented Apr 24, 2017

mml commented Apr 24, 2017

roberthbailey commented Apr 25, 2017

mml commented May 12, 2017 • edited Loading

enisoc commented May 12, 2017

santinoncs commented Apr 19, 2017 •

edited

Loading

mml commented May 12, 2017 •

edited

Loading