Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCE Node Upgrade should respect PodDisruptionBudget #38336

Closed
enisoc opened this issue Dec 7, 2016 · 18 comments
Closed

GCE Node Upgrade should respect PodDisruptionBudget #38336

enisoc opened this issue Dec 7, 2016 · 18 comments
Assignees
Labels
area/node-lifecycle Issues or PRs related to Node lifecycle area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@enisoc
Copy link
Member

enisoc commented Dec 7, 2016

As part of pod safety for StatefulSets, we ensured that kubectl drain uses the eviction subresource so it respects PodDisruptionBudget.

Unfortunately, there is no script or recommended node upgrade process (AFAIK) that actually uses kubectl drain. In particular, the experimental cluster/gce/upgrade.sh script uses GCE rolling updates which simply terminate and recreate node VMs.

This issue is to track the need to provide a script or recommended process for upgrading nodes that actually respects PodDisruptionBudgets, so that the guarantees made by things like StatefulSet actually work during upgrades.

cc @kubernetes/sig-apps

@davidopp
Copy link
Member

davidopp commented Dec 8, 2016

There is some discussion of this in #32323. It's not trivial. See my comment #32323 (comment)

Actually, I think I'll copy it here, since most of that thread is unrelated.

If this is high-priority I'm happy to discuss these options in more detail.

[copied comment below]

We've talked about two possible solutions to this:

(1) Create "node drain as a service" a.k.a. "move kubectl drain into the server" -- an API object where you specify which nodes you want to drain, with an associated controller that figures out a good "schedule" for draining them subject to PDB and any other constraints, and which does the drain (evictions) and then when a node is drained calls out to some endpoint to do the maintenance action, in this case presumably calling out to some GCP endpoint that will then execute the node upgrade. Today's upgrade.sh logic would be split between this controller and the GCP logic that executes the node upgrade.

(2) Modify the MIG rolling updater so that it is GKE-aware, e.g. drains the nodes before upgrading and takes PodDisruptionBudget into account. Today's upgrade.sh could continue to operate more or less as-is, at least in the sense of relying on GCP to take care of scheduling and executing the upgrades.

Similar issues of choosing the node, draining, and respecting PDB come up elsewhere too (e.g. preemptible VMs, cluster auto-scaler scale down, and rescheduler).

But anyway, you're right, it's not just a simple matter of calling kubectl drain from upgrade.sh

@enisoc
Copy link
Member Author

enisoc commented Dec 8, 2016

@davidopp Thanks for surfacing that post. I'm not familiar with the available GCE APIs, but would an intermediate approach be possible where we first "unroll" upgrade.sh to do everything in terms of individual VMs (rather than using rolling updates), and then we add kubectl drain/uncordon? Going in arbitrary order (no smart schedule) would at least preserve safety, at the expense of extra pod churn and perhaps a risk of stalling the rolling upgrade. Did you exclude that option because it isn't possible as long as we're running the nodes via MIG?

For the record, I should add some more context for why this is a priority for us in sig-apps. Together with #37984, this means upgrading nodes from 1.5.0 to a future release can (1) take a clustered StatefulSet app below the required quorum to function, and (2) leave it in that stuck state for ~10min (or potentially up to the time it takes to complete a rolling update of all nodes) due solely to voluntary actions (planned maintenance).

In effect, StatefulSets still don't have a viable upgrade path going forward, despite being beta now. I discussed this with @erictune and we're going to look across the set of interacting issues that got us here to see what can be addressed perhaps in a patch release. If there's no easy mitigation for the issue in question on this page (like a script refactoring or a documented safe manual flow for GCE), we'll need to focus elsewhere for now.

@davidopp
Copy link
Member

davidopp commented Dec 8, 2016

I just chatted with @maisem and he says he is working on exactly what you described and should have it done within the next few weeks. I'll let him describe it here, but from chatting with him it does sound like the third option you described.

@davidopp
Copy link
Member

davidopp commented Dec 8, 2016

(BTW I still think we should do "node drain as a service" (option 1 above) eventually, because that would be portable across cloud providers and could even be used by GKE, vs. upgrade.sh which only works on GCE.)

@roberthbailey
Copy link
Contributor

/cc @maisem

would an intermediate approach be possible where we first "unroll" upgrade.sh to do everything in terms of individual VMs (rather than using rolling updates), and then we add kubectl drain/uncordon?

This is definitely an improvement over the current shell script.

Did you exclude that option because it isn't possible as long as we're running the nodes via MIG?

This is definitely possible when using a MIG.

@roberthbailey
Copy link
Contributor

/cc @kubernetes/sig-cluster-lifecycle (as this is related to 1.5 -> 1.6 upgrades)

@roberthbailey roberthbailey added area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Dec 9, 2016
@davidopp
Copy link
Member

davidopp commented Dec 9, 2016

This is definitely possible when using a MIG.

Yeah I didn't realize it was possible, but I agree this is definitely the most expedient thing to do short-term.

@maisem
Copy link

maisem commented Dec 12, 2016

I started working on a safe way to do upgrades a couple of weeks ago but then got sidetracked.
This should allow for a non-disruptive way of upgrading the nodes in a cluster.

1. Create a new Instance Template (we do this already)
2. Change the Managed Instance Group to use the new template.
    gcloud compute instance-groups managed set-instance-template
3. For each node in the Instance Group
    a. Verify the node needs to be updated by checking the current instance template
        This can be done by checking for the "instance-template" key in the instance metadata.
    b. Drain the node
        kubectl drain
    c. Recreate the node
        gcloud compute instance-groups managed recreate-instances
    d. Wait for node to become healthy again 

@bgrant0607
Copy link
Member

cc @mml

@santinoncs
Copy link
Contributor

santinoncs commented Apr 19, 2017

Hi, anyone knows if GKE upgrade is doing a “kubectl drain” ?

According to its documentation, GKE upgrade uses "kubectl drain" before upgrading an specific node.

But I got a 3 node GKE cluster and a cluster of consul ( 3 pods ) and the consensus was lost, looking for the logs it seems that it halt 2 pods at the same time not respecting the minAvailable parameter of the PodDisruptionBudget resource.

When I perform a drain, it honor the PodDisruptionBudget, so it works.

I am using StatefulSet for the consul deployment.

@justinsb
Copy link
Member

FYI kops has drain support behind a feature flag, and we do GCE behind a feature flag. I'm not sure whether we do drain on GCE though. We use the drain, delete instance, wait for readiness approach.

@roberthbailey
Copy link
Contributor

@santinoncs

Yes, @maisem added a drain step as part of upgrading GKE nodes but it doesn't exec kubectl drain as a subcommand. Since drain is still client side, the implementation may be slightly different than what you get using the kubectl tool. In particular, I believe we don't yet honor the PodDisruptionBudget (as you noticed).

@maisem
Copy link

maisem commented Apr 21, 2017

As @roberthbailey pointed out we don't honor PodDisruptionBudget yet. However, we do have a high priority work item to fix this soon.

@mml
Copy link
Contributor

mml commented Apr 24, 2017

I will take a whack at this. I would like test coverage too, in the form of one of those fancy e2e upgrade tests.

@mml mml self-assigned this Apr 24, 2017
@mml
Copy link
Contributor

mml commented Apr 24, 2017

I spoke with @krousey about how to add the tests. Basically, we'll add this to the existing list of tests in test/e2e/cluster_upgrade.go. Either modifying, say, the ServiceUpgradeTest to verify it works through a node upgrade, or adding a separate test, which we'll probably do as we want to test several apps with native APIs.

Kris asked the question whether or not this should live in cluster/gce/upgrade.sh (or perhaps a rewrite of it in go). AFAIK, this (kube-up and the whole cluster/ directory) is still the reference implementation for these sorts of things, and the only supported method for GCE clusters. @maisem listed some logic above that doesn't strike me as that hard to implement in shell, anyway.

Kris' concern is that maybe cluster/ is marked for destruction. Is it?

@roberthbailey
Copy link
Contributor

Matt -- we'd love to delete the cluster/ directory, but we still don't have a replacement. So for now cluster/gce/upgrade.sh is the only place where I know that we have upgrade test coverage (other than the GKE upgrade test suites).

@mml
Copy link
Contributor

mml commented May 12, 2017

Well that's pretty neat for a shell script: https://gist.github.com/mml/1a24775b6fceea4685c84aee2f548f6b

@enisoc
Copy link
Member Author

enisoc commented May 12, 2017

@mml wrote:

The question is, do we want the upgrade script to automatically blow up local storage by default? I can't see how to make it work without this, and at the same time it seems a dangerous default.

It may not be too bad. Currently, I don't think there's any guarantee your pods will come back when the node is upgraded. If the restart takes too long, the pods will get deleted and rescheduled on another node. However, I admit in practice many people may be relying on this unguaranteed behavior. Speaking of which, is the drain going to be opt-in, out-out, or required?

Perhaps we could copy the pattern kubectl drain itself uses, which is "quit if such pods exist and make the user re-run the command with --delete-local-data". I'm not sure what a good UX for that would be though, and it may require fixing the script to be resumable if it isn't already.

Another option could be to make specifying whether to delete local data a required option when upgrading nodes, so you cannot even run it the first time without already having considered what you want to happen.

k8s-github-robot pushed a commit that referenced this issue Jun 4, 2017
Automatic merge from submit-queue

Respect PDBs during node upgrades and add test coverage to the ServiceTest upgrade test.

This is still a WIP... needs to be squashed at least, and I don't think it's currently passing until I increase the scale of the RC, but please have a look at the general outline.  Thanks!

Fixes #38336 

@kow3ns @bdbauer @krousey @erictune @maisem @davidopp 

```
On GCE, node upgrades will now respect PodDisruptionBudgets, if present.
```
k8s-github-robot pushed a commit that referenced this issue Jul 7, 2017
Automatic merge from submit-queue (batch tested with PRs 48374, 48524, 48519, 42548, 48615)

Add a node upgrade test for etcd statefulset with a PDB.

Tests for #38336
k8s-github-robot pushed a commit that referenced this issue Jul 12, 2017
Automatic merge from submit-queue

StatefulSet upgrade test - replicated database (mysql)

**What this PR does / why we need it**:
Adds a new upgrade test. The test creates a statefulset with a replicated mysql database. It populates the database and then continually polls the data while performing an upgrade. 

Ultimately, this PR increases confidence of reliability during upgrades. It helps show that StatefulSets and Pod Disruption Budgets are doing what they're supposed to.  Code to pay attention to this was added for #38336.

Also vendors in a golang mysql client driver, for use in the test.

**Release note**:
```release-note
NONE
```
k8s-github-robot pushed a commit that referenced this issue Aug 4, 2017
Automatic merge from submit-queue

Adding a Cassandra upgrade test

Tests for #38336
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/node-lifecycle Issues or PRs related to Node lifecycle area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
Projects
None yet
Development

No branches or pull requests

8 participants