recommend draining the node before updating kubelet #43053
Conversation
/cc @micahhausler
/assign @derekwaynecarr @SergeyKanzhelev @mrunalp
/cc @liggitt
```diff
@@ -178,8 +178,7 @@ Pre-requisites:
 Optionally upgrade `kubelet` instances to **{{< skew currentVersion >}}** (or they can be left at **{{< skew currentVersionAddMinor -1 >}}**, **{{< skew currentVersionAddMinor -2 >}}**, or **{{< skew currentVersionAddMinor -3 >}}**)

 {{< note >}}
-Before performing a minor version `kubelet` upgrade, [drain](/docs/tasks/administer-cluster/safely-drain-node/) pods from that node.
-In-place minor version `kubelet` upgrades are not supported.
+Before performing any `kubelet` upgrade, [drain](/docs/tasks/administer-cluster/safely-drain-node/) pods from that node.
```
I don't think we should say this here or require this. There are two aspects:
- kubelet compatibility
- effectively fixing bugs

This document is focused on kubelet compatibility. Kubelet should work properly when restarted against state persisted by another patch version of the kubelet.
Fixing some bugs requires restarting containers. That seems like a useful thing to note in releases where a bugfix requiring a container restart is released, but I would not impose that requirement on all patch updates.
My concern is that, for this example, there are theoretically pods in the wild that are running with no seccomp enforcement but with a known "good" kubelet version, even though users followed the current documentation, which doesn't suggest that a node drain is necessary.

> That seems like a useful thing to note in releases where a bugfix requiring a container restart is released, but I would not impose that requirement on all patch updates.

I think a careful evaluation of the question "Will running containers require a restart?" for every backported fix is possible, but it's a much larger burden than recommending draining the node in all cases. For example, it was missed in this case.
Users can still choose not to drain the node, but it's a risk assessment they make at that point.
How do you feel about suggesting draining the node, something like:
- Before performing a minor version `kubelet` upgrade, [drain](/docs/tasks/administer-cluster/safely-drain-node/) pods from that node.
- In-place minor version `kubelet` upgrades are not supported.
- For in-place patch `kubelet` upgrades, draining the node is the safest approach but not strictly necessary.
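The recommended sequence could be carried out roughly as follows (a sketch only; the node name and the upgrade mechanism are placeholders and depend on how kubelet was installed):

```shell
# Sketch of a drain-then-upgrade workflow. "node-1" and the package
# command are illustrative placeholders, not part of the PR itself.

# 1. Cordon and drain the node so running pods are evicted elsewhere.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# 2. Upgrade the kubelet binary (mechanism varies by distro/installer).
sudo systemctl stop kubelet
sudo apt-get install -y kubelet   # or replace the binary directly
sudo systemctl start kubelet

# 3. Return the node to service.
kubectl uncordon node-1
```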
I think we should keep this document focused on the kubelet compatibility requirements, and make it unambiguous that kubelet is compatible with runtime state persisted by other patch versions within the same minor.
@tzneal what are your plans for this PR, given the feedback so far?
I'm going to find a different place in the docs to put this wording. I don't think that the danger of in-place updates is expressed anywhere in the documentation currently, which leaves users vulnerable to CVEs they think they've mitigated.
Force-pushed from 2064dfa to 2282d06
Added a warning in the upgrade guide that draining the node may be necessary to resolve CVEs or bugs.
/lgtm
You could also propose a change to https://kubernetes.io/docs/concepts/security/security-checklist/ (maybe another PR, I'm not sure what works best).
I'm thinking of adding an item to say [words to the effect] that there are no nodes where:
- some Pod x is running
- the version of the kubelet that started Pod x is more than one minor version less than the cluster's version of Kubernetes
Hard to express though, isn't it? I initially assumed that `systemctl stop kubelet` would shut down workloads, but that's not the case, and this is moderately surprising. So it's worth documenting a bit more IMO.
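The observation above can be checked on a test node (a sketch; it assumes root access on a disposable node with a CRI runtime and `crictl` installed):

```shell
# Sketch: verify that stopping the kubelet does NOT stop running containers.
# Run only on a test node.
sudo systemctl stop kubelet
sudo crictl ps        # previously started containers still show as Running
sudo systemctl start kubelet

# Each node's *current* kubelet version (not the version that started any
# given pod) is reported in nodeInfo and can be compared to the control plane:
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```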
LGTM label has been added. Git tree hash: d44750628ce2b4b7ec89104a489485252a3d02a4
This change clarifies that the node should be drained even when performing patch upgrades of kubelet.
Force-pushed from 2282d06 to 20cfb80
Agree regarding documenting more. Is there a good way to determine which version of kubelet started the pod? For the security page, maybe a new section on upgrading? The page I modified was what was linked in the CVE, so I figured it was the best place to start.
/lgtm
LGTM label has been added. Git tree hash: 994a79b40398adcd1216fe62de0a94ca665114a4
I don't think it's generally possible. The kubelet may have restarted since, and the other end of the CRI socket is a black box as far as Kubernetes is concerned.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: sftim. The full list of commands accepted by this bot can be found here; the pull request process is described here.
The Kubernetes docs state that minor upgrades of kubelet require draining the node:
It's also been stated, but not in documentation, that patch upgrades don't require draining the node:
My contention is that in-place patch upgrades of kubelet are currently unsafe, and we should clarify the Kubernetes documentation to explicitly state that you must drain the node in all circumstances when updating kubelet.
Why?
I suspect there are other examples of when in-place kubelet upgrades fail, but the most recent one I've found is a failure to resolve a CVE for running pods. A recent bug in kubelet allowed pods to bypass seccomp enforcement. A CVE (CVE-2023-2431) was published and the announcement was sent to the community. A fix for this CVE was backported into the 1.27.2 patch release of kubelet.
Without this docs change, users may assume that in-place updating of kubelet will resolve this CVE. It will, but only for newly admitted pods. An in-place update will not terminate any running pods, allowing them to continue to bypass seccomp enforcement.
To demonstrate the failure, I launched a pod with an empty string `localhostProfile` on a node with kubelet 1.27.0. I then stopped kubelet and replaced its binary with the 1.27.6 patch version before restarting it. The pod is still running on the node with no seccomp enforcement, but any vulnerability scan that looks at the kubelet version will indicate that it's patched and not susceptible to the CVE.
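A reproduction along these lines might look like the following sketch (the pod name and image are illustrative, not from the PR; the key detail is a `Localhost` seccomp profile with an empty `localhostProfile`, the condition behind CVE-2023-2431):

```shell
# Hypothetical reproduction sketch; requires a cluster with a node
# running an affected kubelet (e.g. 1.27.0).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: ""   # empty string: the CVE-2023-2431 condition
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF

# After swapping the kubelet binary in place and restarting the service,
# the pod is still running without seccomp enforcement:
kubectl get pod seccomp-demo -o wide
```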