GCE Node Upgrade should respect PodDisruptionBudget #38336
There is some discussion of this in #32323. It's not trivial. See my comment #32323 (comment). Actually, I think I'll copy it here, since most of that thread is unrelated. If this is high priority I'm happy to discuss these options in more detail.

[copied comment below]

We've talked about two possible solutions to this:

(1) Create "node drain as a service", a.k.a. "move kubectl drain into the server": an API object where you specify which nodes you want to drain, with an associated controller that figures out a good "schedule" for draining them subject to PDB and any other constraints, does the drains (evictions), and then, when a node is drained, calls out to some endpoint to do the maintenance action, in this case presumably a GCP endpoint that will then execute the node upgrade. Today's upgrade.sh logic would be split between this controller and the GCP logic that executes the node upgrade.

(2) Modify the MIG rolling updater so that it is GKE-aware, e.g. drains the nodes before upgrading and takes PodDisruptionBudget into account. Today's upgrade.sh could continue to operate more or less as-is, at least in the sense of relying on GCP to take care of scheduling and executing the upgrades.

Similar issues of choosing the node, draining, and respecting PDB come up elsewhere too (e.g. preemptible VMs, cluster autoscaler scale-down, and the rescheduler). But anyway, you're right, it's not just a simple matter of calling `kubectl drain`.
@davidopp Thanks for surfacing that post. I'm not familiar with the available GCE APIs, but would an intermediate approach be possible where we first "unroll" upgrade.sh to do everything in terms of individual VMs (rather than using rolling updates), and then add a drain step before each VM is replaced?

For the record, I should add some more context for why this is a priority for us in sig-apps. Together with #37984, this means upgrading nodes from 1.5.0 to a future release can (1) take a clustered StatefulSet app below the quorum it requires to function, and (2) leave it in that stuck state for ~10 min (or potentially up to the time it takes to complete a rolling update of all nodes), due solely to voluntary actions (planned maintenance). In effect, StatefulSets still don't have a viable upgrade path going forward, despite now being beta.

I discussed this with @erictune and we're going to look across the set of interacting issues that got us here to see what can be addressed, perhaps in a patch release. If there's no easy mitigation for the issue in question on this page (like a script refactoring or a documented safe manual flow for GCE), we'll need to focus elsewhere for now.
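To make the intermediate approach above concrete, here is a rough sketch of what an unrolled, per-VM flow could look like. This is only an illustration, not anything taken from upgrade.sh or the eventual implementation: the instance group name and zone are placeholders, and the readiness check is simplified.

```bash
#!/usr/bin/env bash
# Hypothetical per-node upgrade loop: drain each node (evictions respect
# PodDisruptionBudgets), recreate its VM from the updated instance template,
# then wait for the node to rejoin before moving on.
set -euo pipefail
GROUP="my-node-group"      # placeholder managed instance group
ZONE="us-central1-a"       # placeholder zone

for node in $(kubectl get nodes -o name | sed 's|^node/||'); do
  # Drain via the eviction API; this blocks while a PDB would be violated.
  kubectl drain "$node" --ignore-daemonsets --delete-local-data --timeout=10m

  # Recreate just this VM so it boots from the new instance template.
  gcloud compute instance-groups managed recreate-instances "$GROUP" \
    --zone "$ZONE" --instances "$node"

  # Wait until the node reports Ready again.
  until kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null \
      | grep -q True; do
    sleep 10
  done

  # Make sure the replacement node is schedulable (drain left it cordoned).
  kubectl uncordon "$node"
done
```

The property that matters here is that each VM is only touched after its node has been drained through the eviction API, so a PodDisruptionBudget can slow the loop down or stop it entirely.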
I just chatted with @maisem and he says he is working on exactly what you described and should have it done within the next few weeks. I'll let him describe it here, but from chatting with him it does sound like the third option you described.
(BTW I still think we should do "node drain as a service" (option 1 above) eventually, because that would be portable across cloud providers and could even be used by GKE, vs. the MIG-based approach.)
/cc @maisem
This is definitely an improvement over the current shell script.
This is definitely possible when using a MIG.
/cc @kubernetes/sig-cluster-lifecycle (as this is related to 1.5 -> 1.6 upgrades)
Yeah, I didn't realize it was possible, but I agree this is definitely the most expedient thing to do short-term.
I started working on a safe way to do upgrades a couple of weeks ago but then got sidetracked.
cc @mml
Hi, does anyone know if the GKE upgrade does a "kubectl drain"? According to its documentation, the GKE upgrade runs "kubectl drain" before upgrading a specific node. But I have a 3-node GKE cluster running a consul cluster (3 pods), and consensus was lost: looking at the logs, it seems the upgrade halted 2 pods at the same time, not respecting the minAvailable parameter of the PodDisruptionBudget resource. When I perform a drain myself, it honors the PodDisruptionBudget, so that works. I am using a StatefulSet for the consul deployment.
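For reference, the kind of budget being described here looks roughly like the following. The name and label selector are placeholders rather than anything from the reporter's manifests, and the API group version reflects what was current at the time (policy/v1beta1):

```bash
# Illustrative only: a PodDisruptionBudget that keeps at least 2 of the 3
# consul pods up, so evictions (as issued by drain) can only take down one
# pod at a time. The "app: consul" label is a placeholder selector.
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: consul-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: consul
EOF
```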
FYI, kops has drain support behind a feature flag, and GCE support is also behind a feature flag. I'm not sure whether we do the drain on GCE, though. We use the drain, delete instance, wait-for-readiness approach.
Yes, @maisem added a drain step as part of upgrading GKE nodes, but it doesn't exec `kubectl drain`.
As @roberthbailey pointed out, we don't honor PodDisruptionBudget yet. However, we do have a high-priority work item to fix this soon.
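Until that lands, a quick way to observe whether the disruption controller would currently allow a voluntary disruption is to look at the budget's status. This assumes the status field is named disruptionsAllowed and uses a placeholder PDB name:

```bash
# List budgets along with how many additional voluntary disruptions are
# allowed right now.
kubectl get pdb

# Or pull just the counter for one budget ("consul-pdb" is a placeholder).
kubectl get pdb consul-pdb -o jsonpath='{.status.disruptionsAllowed}'
```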
I will take a whack at this. I would like test coverage too, in the form of one of those fancy e2e upgrade tests.
I spoke with @krousey about how to add the tests. Basically, we'll add this to the existing list of upgrade tests. Kris asked the question of whether or not this should live there; his concern is that maybe it doesn't belong with those existing tests.
Matt -- we'd love to delete the ...
Well, that's pretty neat for a shell script: https://gist.github.com/mml/1a24775b6fceea4685c84aee2f548f6b
@mml wrote:
It may not be too bad. Currently, I don't think there's any guarantee your pods will come back when the node is upgraded. If the restart takes too long, the pods will get deleted and rescheduled on another node. However, I admit that in practice many people may be relying on this unguaranteed behavior.

Speaking of which, is the drain going to be opt-in, opt-out, or required? Perhaps we could copy the pattern used by kubectl drain. Another option could be to make specifying whether to delete local data a required option when upgrading nodes, so you cannot even run it the first time without already having considered what you want to happen.
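As a point of comparison, this is roughly how kubectl drain itself exposes those choices today; exact flag names may vary by release, and "my-node" is a placeholder node name:

```bash
# Drain refuses to proceed if it would delete pods using emptyDir local
# storage, unless the caller explicitly opts in:
kubectl drain my-node --ignore-daemonsets --delete-local-data

# Without --delete-local-data (and without --force for pods not managed by a
# controller), drain errors out instead of silently discarding anything.
kubectl drain my-node --ignore-daemonsets
```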
Automatic merge from submit-queue

Respect PDBs during node upgrades and add test coverage to the ServiceTest upgrade test. This is still a WIP... needs to be squashed at least, and I don't think it's currently passing until I increase the scale of the RC, but please have a look at the general outline. Thanks!

Fixes #38336

@kow3ns @bdbauer @krousey @erictune @maisem @davidopp

```
On GCE, node upgrades will now respect PodDisruptionBudgets, if present.
```
Automatic merge from submit-queue (batch tested with PRs 48374, 48524, 48519, 42548, 48615)

Add a node upgrade test for etcd statefulset with a PDB. Tests for #38336
Automatic merge from submit-queue

StatefulSet upgrade test - replicated database (mysql)

**What this PR does / why we need it**: Adds a new upgrade test. The test creates a statefulset with a replicated mysql database. It populates the database and then continually polls the data while performing an upgrade. Ultimately, this PR increases confidence of reliability during upgrades. It helps show that StatefulSets and Pod Disruption Budgets are doing what they're supposed to. Code to pay attention to this was added for #38336. Also vendors in a golang mysql client driver, for use in the test.

**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue

Adding a Cassandra upgrade test. Tests for #38336
As part of pod safety for StatefulSets, we ensured that kubectl drain uses the eviction subresource so it respects PodDisruptionBudget.
Unfortunately, there is no script or recommended node upgrade process (AFAIK) that actually uses `kubectl drain`. In particular, the experimental cluster/gce/upgrade.sh script uses GCE rolling updates, which simply terminate and recreate node VMs.

This issue is to track the need to provide a script or recommended process for upgrading nodes that actually respects PodDisruptionBudgets, so that the guarantees made by things like StatefulSet actually work during upgrades.
cc @kubernetes/sig-apps
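For context, "uses the eviction subresource" means an API interaction roughly like the following. The namespace and pod name are placeholders, and this assumes kubectl proxy is listening on its default port:

```bash
# In another terminal: kubectl proxy
# POST an Eviction to the pod's eviction subresource. Unlike a plain DELETE of
# the pod, the request is rejected (HTTP 429) if it would violate a
# PodDisruptionBudget. "default" and "consul-0" are placeholders.
curl -s -X POST \
  http://127.0.0.1:8001/api/v1/namespaces/default/pods/consul-0/eviction \
  -H 'Content-Type: application/json' \
  -d '{
        "apiVersion": "policy/v1beta1",
        "kind": "Eviction",
        "metadata": {"name": "consul-0", "namespace": "default"}
      }'
```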