From 1e4a0dd942bf259ca658446d8b2698303266574f Mon Sep 17 00:00:00 2001 From: Rodrigo Campos Date: Thu, 11 Jan 2024 16:12:18 +0100 Subject: [PATCH 01/10] KEP-127: Add pod range configuration Signed-off-by: Rodrigo Campos --- keps/sig-node/127-user-namespaces/README.md | 25 +++++++++++++++------ 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/127-user-namespaces/README.md b/keps/sig-node/127-user-namespaces/README.md index 3899788d1f6..86fe8b720f8 100644 --- a/keps/sig-node/127-user-namespaces/README.md +++ b/keps/sig-node/127-user-namespaces/README.md @@ -19,6 +19,7 @@ - [Pod.spec changes](#podspec-changes) - [CRI changes](#cri-changes) - [Support for pods](#support-for-pods) + - [Configuration of ranges](#configuration-of-ranges) - [Handling of volumes](#handling-of-volumes) - [Example of how idmap mounts work](#example-of-how-idmap-mounts-work) - [Example without idmap mounts](#example-without-idmap-mounts) @@ -315,10 +316,21 @@ The picked range will be stored under a file named `userns` in the pod folder (by default it is usually located in `/var/lib/kubelet/pods/$POD/userns`). This way, the Kubelet can read all the allocated mappings if it restarts. -During alpha, to make sure we don't exhaust the host UID namespace, we will -limit the number of pods using user namespaces to `min(maxPods, 1024)`. This -leaves us plenty of host UID space free and this limits is probably never hit in -practice. See the [Unresolved section](#unresolved) for more details on this. +#### Configuration of ranges + +If no configuration is present, the kubelet will use a sane default: + + * `0-65535`: reserved for the host processes and files + * `65536-<65536 * maxPods>`: reserved for pods + +The kubelet will detect if a configuration is present by using the `getsubids` binary. It will query +for ranges allocated to the "kubelet" user. + +If the kubelet user doesn't exist or `getsubids` is not installed, the kubelet will use the sane +default aforementioned. 
In any other case, it will return an error. + +By using `getsubids` we make sure the kubelet interacts fine with other programs in the host that +also allocate ranges for user namespaces (be it /etc/subuid or centralized systems such as FreeIPA). ### Handling of volumes @@ -1017,9 +1029,8 @@ and validate the declared limits? The kubelet is spliting the host UID/GID space for different pods, to use for their user namespace mapping. The design allows for 65k pods per node, and the -resource is limited in the alpha phase to the min between maxPods per node -kubelet setting and 1024. This guarantees we are not inadvertly exhausting the -resource. +resource is limited to maxPods per node (maxPods currently defaults to 110, so it is unlikely we will +reach the host limit). For container runtimes, they might use more disk space or inodes to chown the rootfs. This is if they chose to support this feature without relying on new From 739f80bb346e505087ca665d1a5c8c4f6987dd06 Mon Sep 17 00:00:00 2001 From: Rodrigo Campos Date: Thu, 11 Jan 2024 17:03:29 +0100 Subject: [PATCH 02/10] KEP-127: Fix feature gate name Signed-off-by: Rodrigo Campos --- keps/sig-node/127-user-namespaces/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-user-namespaces/README.md b/keps/sig-node/127-user-namespaces/README.md index 86fe8b720f8..d330089452c 100644 --- a/keps/sig-node/127-user-namespaces/README.md +++ b/keps/sig-node/127-user-namespaces/README.md @@ -690,7 +690,7 @@ well as the [existing list] of feature gates. --> - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: UserNamespacesStatelessPodsSupport + - Feature gate name: UserNamespacesPodsSupport - Components depending on the feature gate: kubelet, kube-apiserver ###### Does enabling the feature change any default behavior? 
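Reviewer note on the default layout added by PATCH 01/10 ("Configuration of ranges"): the arithmetic can be sketched as below. This is a minimal illustration, not kubelet code. The function name `default_pod_range` and the notion of a 0-based pod "slot" are hypothetical, and the sketch assumes each pod gets one consecutive block of 65536 IDs starting right after the host block; the text does not pin down how the kubelet picks a block or the exact upper bound of the pod range.

```python
from typing import Tuple

# Mapping length per pod, as stated in the KEP ("The mapping length will be 65536").
USERNS_LENGTH = 65536


def default_pod_range(slot: int, max_pods: int = 110) -> Tuple[int, int]:
    """Return (first_host_id, length) for a 0-based pod slot.

    Host IDs 0-65535 stay reserved for host processes and files, so the
    first pod block starts at 65536. The slot-based allocation is an
    assumption made for illustration only.
    """
    if not 0 <= slot < max_pods:
        raise ValueError(f"slot {slot} is outside 0..{max_pods - 1}")
    first_host_id = USERNS_LENGTH * (slot + 1)
    return first_host_id, USERNS_LENGTH
```

Under these assumptions, slot 0 maps container IDs 0-65535 to host IDs 65536-131071, which matches the "UID 65536" example used elsewhere in the KEP for root inside the container.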
From be037de4810569c8136047c102a7d95959e76655 Mon Sep 17 00:00:00 2001 From: Rodrigo Campos Date: Thu, 11 Jan 2024 17:07:26 +0100 Subject: [PATCH 03/10] KEP-127: Update PRR for beta Signed-off-by: Rodrigo Campos Signed-off-by: Giuseppe Scrivano --- keps/sig-node/127-user-namespaces/README.md | 217 +++++++++++++++++++- 1 file changed, 209 insertions(+), 8 deletions(-) diff --git a/keps/sig-node/127-user-namespaces/README.md b/keps/sig-node/127-user-namespaces/README.md index d330089452c..578511dcba2 100644 --- a/keps/sig-node/127-user-namespaces/README.md +++ b/keps/sig-node/127-user-namespaces/README.md @@ -377,7 +377,7 @@ what we expect, the kubelet needs to chown the file to a UID that is mapped into the pod's userns (e.g. UID 65536 in this example), as that is what root inside the container is mapped to in the user namespace. -We tried this before, but several limitations were hit. See the +We tried this before, but several limitations were hit. See the [alternatives section](#dont-use-idmap-mounts-and-rely-chown-all-the-files-correctly) for more details on the limitations we hit. @@ -765,6 +765,18 @@ This section must be completed when targeting beta to a release. ###### How can a rollout or rollback fail? Can it impact already running workloads? +The rollout is just a feature flag on the kubelet and the kube-apiserver. + +If one API server is upgraded while others aren't, the pod will be accepted (if the apiserver is >= +1.25). If it is scheduled to a node that the kubelet has the feature flag activated and the node +meets the requirements to use user namespaces, then the pod will be created with the namespace. If +it is scheduled to a node that has the feature disabled, it will be scheduled without the user +namespace. + +On a rollback, pods created while the feature was active (created with user namespaces) will have to +be restarted to be re-created without user namespaces. Just a re-creation of the pod will do the +trick. 
+ @@ -807,6 +834,7 @@ previous answers based on experience in the field. ###### How can an operator determine if the feature is in use by workloads? +Check if any pod has the pod.spec.HostUsers field set to false. - [ ] Events - - Event Reason: + - Event Reason: - [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details: + - Condition name: + - Other field: +- [x] Other (treat as last resort) + - Details: check pods with pod.spec.HostUsers field set to false, and see if they are in RUNNING + state. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? +If a node meets all the requirements, there should be no change to existing SLO/SLIs. + +If a container runtime wants to support old kernels, it can have a performance impact, though. For +more details, see the question: + "Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?" + @@ -864,6 +905,17 @@ Pick one more of these and delete the rest. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? +No. + +This feature is using yet another namespace when creating a pod. If the pod creation fails (by +an error on the kubelet or returned by the container runtime), a clear error is returned to the +user. The feedback on this is very direct to the user actions. + +A metric like "errors returned in pods with user namespaces enabled" can be very noisy, as the error +can be completely unrelated (image pull secret errors, configmap referenced and not defined, any +other container runtime error, etc.). We can't see any metric that can be helpful, as the user has a +very direct feedback already. + @@ -126,8 +126,8 @@ Here we use UIDs, but the same applies for GIDs. inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. 
- Increase pod to pod isolation by allowing to use non-overlapping mappings - (UIDs/GIDs) whenever possible. IOW, if two containers runs as user X, they run - as different UIDs in the node and therefore are more isolated than today. + (UIDs/GIDs) whenever possible. In other words: if two containers run as user + X, they run as different UIDs in the node and therefore are more isolated than today. - Allow pods to have capabilities (e.g. `CAP_SYS_ADMIN`) that are only valid in the pod (not valid in the host). - Benefit from the security hardening that user namespaces provide against some @@ -291,7 +291,7 @@ message Mount { ### Support for pods Make pods work with user namespaces. This is activated via the -bool `pod.spec.HostUsers`. +bool `pod.spec.hostUsers`. The mapping length will be 65536, mapping the range 0-65535 to the pod. This wide range makes sure most workloads will work fine. Additionally, we don't need to @@ -403,7 +403,7 @@ If the pod wants to read who is the owner of file `/vol/configmap/foo`, now it will see the owner is root inside the container. This is due to the IDs transformations that the idmap mount does for us. -In other words, we can make sure the pod can read files instead of chowning them +In other words: we can make sure the pod can read files instead of chowning them all using the host IDs the pod is mapped to, by just using an idmap mount that has the same mapping that we use for the pod user namespace. @@ -469,7 +469,7 @@ something else to this list: - What about windows or VM container runtimes, that don't use linux namespaces? We need a review from windows maintainers once we have a more clear proposal. We can then adjust the needed details, we don't expect the changes (if any) to be big. 
- IOW, in my head this looks like this: we merge this KEP in provisional state if + In my head this looks like this: we merge this KEP in provisional state if we agree on the high level idea, with @giuseppe we do a PoC so we can fill-in more details to the KEP (like CRI changes, changes to container runtimes, how to configure kubelet ranges, etc.), and then the Windows folks can review and we @@ -686,7 +686,7 @@ well as the [existing list] of feature gates. --> - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: UserNamespacesPodsSupport + - Feature gate name: UserNamespacesSupport - Components depending on the feature gate: kubelet, kube-apiserver ###### Does enabling the feature change any default behavior? @@ -733,7 +733,7 @@ Pods will have to be re-created to use the feature. We will add. -We will test for when the field pod.spec.HostUsers is set to true, false +We will test for when the field pod.spec.hostUsers is set to true, false and not set. All of this with and without the feature gate enabled. We will also unit test that, if pods were created with the new field @@ -766,7 +766,7 @@ The rollout is just a feature flag on the kubelet and the kube-apiserver. If one API server is upgraded while others aren't, the pod will be accepted (if the apiserver is >= 1.25). If it is scheduled to a node that the kubelet has the feature flag activated and the node meets the requirements to use user namespaces, then the pod will be created with the namespace. If -it is scheduled to a node that has the feature disabled, it will be scheduled without the user +it is scheduled to a node that has the feature disabled, it will be created without the user namespace. On a rollback, pods created while the feature was active (created with user namespaces) will have to @@ -787,7 +787,7 @@ will rollout across nodes. On Kubernetes side, the kubelet should start correctly. 
-On the node runtime side, a pod created with pod.spec.HostUsers=false should be on RUNNING state if +On the node runtime side, a pod created with pod.spec.hostUsers=false should be in RUNNING state if all node requirements are met. +#### Kubelet and Kube-apiserver skew + +The apiserver and kubelet feature gate enablement works fine in any combination: + +1. If the apiserver has the feature gate enabled and the kubelet doesn't, then the pod will show + that field and the kubelet will ignore it. Then, the pod is created without user namespaces. +2. If the apiserver has the feature gate disabled and the kubelet enabled, the pod won't show this + field and therefore the kubelet won't act on a field that isn't shown. The pod is created without + user namespaces. + +The kubelet can still create pods with user namespaces if static pods are configured with +pod.spec.hostUsers and the kubelet has the feature gate enabled. + +If the kube-apiserver doesn't support the feature at all (< 1.25), a pod with userns will be +rejected. + +If the kubelet doesn't support the feature (< 1.25), it will ignore the pod.spec.hostUsers field. + +#### Kubelet and container runtime skews + Some definitions first: - New kubelet: kubelet with CRI proto files that includes the changes proposed in this KEP. @@ -794,6 +819,9 @@ We will also unit test that, if pods were created with the new field pod.specHostUsers, then if the featuregate is disabled all works as expected (no user namespace is used). +We will add tests exercising the `switch` of the feature gate itself (what happens +if the feature gate is disabled after objects have been written with the new field). +
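Reviewer note on the "Kubelet and Kube-apiserver skew" text above: the feature gate combinations can be summarized as a small decision function. This is a hedged sketch, not the kubelet or kube-apiserver implementation. The name `pod_uses_userns` and its parameters are hypothetical, and the sketch assumes the node meets all requirements for user namespaces and that both components are new enough (>= 1.25) to know the field.

```python
from typing import Optional


def pod_uses_userns(
    host_users: Optional[bool],
    apiserver_gate: bool,
    kubelet_gate: bool,
    static_pod: bool = False,
) -> bool:
    """Return True if the pod would be created with a user namespace.

    host_users mirrors pod.spec.hostUsers: False requests a user
    namespace, True or None (unset) keeps the host user namespace.
    """
    if static_pod:
        # Static pods are read by the kubelet directly, so the
        # apiserver gate does not affect what the kubelet sees.
        visible = host_users
    else:
        # With the apiserver gate disabled, the pod does not show the field.
        visible = host_users if apiserver_gate else None
    # The kubelet only acts on the field when its own gate is enabled.
    return kubelet_gate and visible is False
```

For example, `pod_uses_userns(False, apiserver_gate=True, kubelet_gate=False)` models case 1 in the list and `pod_uses_userns(False, apiserver_gate=False, kubelet_gate=True)` models case 2; both evaluate to `False`, meaning the pod is created without user namespaces.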