
KEP-2400: Graduate swap to Beta 1 #3957

Merged 1 commit into kubernetes:master on Jun 15, 2023

Conversation

@harche (Contributor) commented Apr 17, 2023

  • One-line PR description: Graduate swap to Beta 1
  • Other comments:

@k8s-ci-robot:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 17, 2023
@harche harche force-pushed the swap_beta branch 2 times, most recently from 31c6e9c to f205be1 Compare April 17, 2023 19:49
@iholder101 (Contributor) left a comment:

Thanks a lot @harche! Great stuff!
See inline comments.

Also, do you think it's necessary to link the POCs [1][2] in the KEP / PR description?

- For each container, divide its memory request by the `TotalMemory`. The result will be the proportion of requested memory for that container within the pod. Let's call this value `RequestedMemoryProportion`.

3. **Calculate the total swap memory available:**
- Determine the total swap memory available in the system. If the total swap memory is equal to the total physical memory, you can skip this step. Otherwise, calculate the proportion of available swap memory compared to the total physical memory. Let's call this value `SwapMemoryProportion`.
Contributor:

nit:
What do you think about dropping "If the total swap memory is equal to the total physical memory, you can skip this step"? Although unlikely in practice, in theory the available swap memory might also be bigger than the available memory on the node.

Perhaps something like "divide the available swap memory by the total physical memory" is more generalized and complete, and covers all cases.


#### Example 2 - Configured swap memory on the host is NOT equal to the physical memory

If the total physical memory is 4 GB and the total swap memory available is 2 GB, we can adjust the example 1 using the same approach:
Contributor:

nit:
What do you think about making this example a bit more realistic? Maybe we could show a total memory of 400 GB and a swap size of 30 GB?

Not sure it's important, though.

@@ -300,6 +369,8 @@ and/or workloads in a number of different scenarios.
Since swap provisioning is out of scope of this proposal, this enhancement
poses low risk to Kubernetes clusters that will not enable swap.

Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general Linux configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.
Contributor:

"it is a general Linux configuration concern"
As this isn't a Linux-specific issue, maybe we can go with "it is a general OS configuration concern" instead?

Comment on lines 615 to 616
- Better understand relationship of swap with memory QoS in cgroup v2
(particularly `memory.high` usage).
Contributor:

Better understand relationship of swap with memory QoS in cgroup v2 (particularly memory.high usage).

Do we want to remove this line?
Is there something missing regarding that?

keps/sig-node/2400-node-swap/README.md (thread resolved)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 24, 2023
@harche harche changed the title Graduate swap to Beta 1 KEP-2400: Graduate swap to Beta 1 Apr 24, 2023
### Enable Swap Support only for Burstable QoS Pods
Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.

Swap access is granted only for pods of Burstrable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed during node pressure. In addition, they are unpredictable, so it is hard to assess how much swap memory is a reasonable amount to allocate for them.
Member:

Nit: Burstrable typo

2. **Determine the swap proportion for each container:**
- For each container, divide its memory request by the `TotalMemory`. The result will be the proportion of requested memory for that container within the pod. Let's call this value `RequestedMemoryProportion`.

3. **Calculate the total swap memory available:**
Contributor:

Do we set any aside for the system (kernel + systemd + kubelet + runtime, etc.)?

Member:

kubernetes/kubernetes#105271
This should be based on the allocatable swap after it is supported in system-reserved.

Contributor (author):

There are a few questions about the way system reserved for swap is implemented there.

But we can indirectly set aside some swap space away from the pods. I have tried to document that here.

@pacoxu (Member) commented Apr 25, 2023

/assign

- Add swap memory to the Kubelet stats api.
- Determine a set of metrics for node QoS in order to evaluate the performance
of nodes with and without swap enabled.
- Better understand relationship of swap with memory QoS in cgroup v2
Member:

For cgroup v2, memory.swap.high may be helpful at the node level. We could set it at the node level using a factor, like the other KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos does for memory.

Contributor:

Yeah, makes sense to align with v2 QoS work.

Contributor:

I don't think that using swap throttling is the right approach here.

There are significant differences between memory.high and memory.swap.high and they are used in a completely different manner.

memory.high is described as "the main mechanism to control memory usage." and "when hit, it throttles allocations by forcing them into direct reclaim to work off the excess" [1].

However, memory.swap.high is described as [1]:

Swap usage throttle limit.  If a cgroup's swap usage exceeds
this limit, all its further allocations will be throttled to
allow userspace to implement custom out-of-memory procedures.

marks a point of no return for the cgroup. It is NOT
designed to manage the amount of swapping a workload does
during regular operation. Compare to memory.swap.max, which
prohibits swapping past a set amount, but lets the cgroup
continue unimpeded as long as other memory can be reclaimed.

Healthy workloads are not expected to reach this limit.

In addition, the Linux kernel commit message that introduced memory.swap.high notes [2] that "that is not to say that the knob itself is equivalent to memory.high" and that "slowing misbehaving tasks down gradually allows userspace oom killers or other protection mechanisms to react".


Since we don't implement any user-space OOM killers, IIUC reaching the memory.swap.high limit would freeze the processes in the cgroup forever.

This makes sense, as swap memory isn't reclaimed as often as RAM, and doing so is much more expensive. If processes in a cgroup reach the memory.swap.max boundary, they simply have to keep newly allocated memory in RAM. If memory.swap.high is defined, as the KEP suggests, the processes would instead reach a point where they freeze up, forced to reclaim memory to continue their operation.

The bottom line is that processes that are out of memory should reclaim ram memory, not swap memory.

[1] https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst
[2] https://lore.kernel.org/linux-mm/20200602044952.g4tpmtOiL%25akpm@linux-foundation.org/

Contributor (author):

"Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react"

This is IMO a good thing. We have seen many issues in the past where the workload consumed memory so fast that the kernel didn't get adequate time to react. In that scenario, the entire system would just freeze and the node would eventually transition into the NOT_READY state in k8s.

Contributor:

The two lines I'm most concerned about, and don't fully understand, are:

"marks a point of no return for the cgroup."

What does "point of no return" actually mean here? Does it mean that a user-space procedure needs to resume the cgroup's execution, or else it won't ever be able to allocate memory?

"It is NOT designed to manage the amount of swapping a workload does during regular operation."

What are "regular operations"?

It worries me that this description is very different from the description of memory.high, which is said to be "the main mechanism to control memory usage".

- Container A: (20 GB) / (30 GB) = 2/3
- Container B: (10 GB) / (30 GB) = 1/3

Step 3: Calculate the total swap memory available: Since the total swap memory (2 GB) is not equal to the total physical memory (4 GB), the `SwapMemoryProportion` will be 2 GB / 4 GB = 1/2.
Member:

If swap can be reserved in --system-reserved of kubelet, this should be based on that input.

Contributor (author):

There are a few concerns about how it is implemented in kubernetes/kubernetes#105271 (comment); however, I have added a section to the KEP on how we can effectively set aside some swap space for system-critical daemons.
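The proportional calculation quoted in this thread can be sketched as follows. This is a minimal illustration under the assumption that each container's swap limit is its share of the pod's memory requests multiplied by the total swap available (equivalently, its request proportion times the `SwapMemoryProportion` times physical memory); the function and names are illustrative, not the kubelet's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// containerSwapLimits apportions the node's swap among a pod's containers.
// Assumption (matching the worked example above): each container gets
// (its memory request / sum of requests) * total swap available.
// All quantities are in bytes; names are illustrative only.
func containerSwapLimits(requests []int64, totalSwap int64) []int64 {
	var sum int64
	for _, r := range requests {
		sum += r
	}
	limits := make([]int64, len(requests))
	for i, r := range requests {
		proportion := float64(r) / float64(sum) // RequestedMemoryProportion
		limits[i] = int64(math.Round(proportion * float64(totalSwap)))
	}
	return limits
}

func main() {
	const gib = int64(1) << 30
	// Example 2's numbers: containers requesting 20 GiB and 10 GiB, 2 GiB swap.
	limits := containerSwapLimits([]int64{20 * gib, 10 * gib}, 2*gib)
	fmt.Printf("container A: %d MiB, container B: %d MiB\n", limits[0]>>20, limits[1]>>20)
}
```

With the example's 2/3 and 1/3 proportions, container A receives twice the swap of container B, and together they never exceed the node's total swap.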

@harche harche force-pushed the swap_beta branch 2 times, most recently from 68e4f60 to f775c0a Compare May 4, 2023 18:44
@harche harche marked this pull request as ready for review May 4, 2023 18:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 6, 2023
- Remove support for setting an unlimited amount of swap (including the [swapBehavior](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-MemorySwapConfiguration) flag of the Kubelet), as any workload with aggressive memory allocation can bring down a node when there are no limits on swap usage.

#### Beta 2
- Add support for controlling swap consumption at the container level [via cgroups] in Pod API in [container resources](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L2443). More specifically add a new [ResourceName](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L5522) `swap`. This will make sure we stay consistent with other resources like `cpu`, `memory` etc.
Contributor:

I would drop this since we don't really want to commit to adding this to the pod API yet. We can evaluate that and then decide.

Contributor:

Done 👍

@iholder101 (Contributor) commented Jun 14, 2023

@SergeyKanzhelev Could you kindly have another look? KEP freeze is within 1 day.

I think we basically agree on everything; there are only two minor things [1][2] left to discuss.

Since we basically already agree, can we try to merge this in time?

- Improve coverage for appropriate scenarios in testgrid.
- Remove support for setting an unlimited amount of swap (including the [swapBehavior](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-MemorySwapConfiguration) flag of the Kubelet), as any workload with aggressive memory allocation can bring down a node when there are no limits on swap usage.
Member:

I don't quite follow how unlimited swap would bring the node down compared to a many-GB limit. Unlimited swap is useful for scenarios where the node is essentially used for one big Pod doing some heavy processing that requires memory-mapping a large region without a lot of random access. That scenario still works with limiting as proposed, proportional to the node limit, but I would question the reasoning suggested here.

Contributor (author):

Indeed, your point is valid. I will revise the KEP to eliminate that concern. Should we encounter any issues regarding unlimited swap during our testing phase, we will promptly investigate and address them as needed.


Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.

Additionally, an end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
Member:

How can we guarantee that system reserved can be set to zero so no system processes will ever use swap? Can we add this as an explicit goal?

@iholder101 (Contributor) commented Jun 15, 2023:

With @pacoxu's PR we can reserve some swap space for the system, but reserving 0 swap for the system does not mean that the system cannot use swap; it just means there won't be a swap space dedicated to it.

no system processes will ever use swap

What do you mean by "system processes"? Do you mean non-kubernetes processes?
If so - I'm not sure how we can guarantee that unless we can mess with the node's non-kubernetes cgroups hierarchies, and I assume that we can't do that.

Do you refer to Kubernetes system non-containerized processes? If so, do they live under /sys/fs/cgroup/kubepods?
If so - that's a problem since this is the top cgroup hierarchy for Kubernetes, and limiting swap at that level would affect all of the containers.

In the comment below, you wrote:

configure k8s so system daemonsets are not swapped

So here we are talking about Kubernetes containers. How can "system" daemonsets be identified? Why is it wrong to allocate swap to them just like any other container?

@pacoxu

Contributor (author):

@SergeyKanzhelev we can set memory.swap.max to 0 on the system reserved cgroup (in our case that usually happens to be /sys/fs/cgroup/system.slice), and if we set memory.swap.max to max on the burstable pods slice (which in our case usually happens to be /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice), we can make sure that no system processes will ever use swap.

As long as the system reserved cgroup is set up so that it does not end up becoming the parent of the cgroups where burstable pods run, this arrangement works fine. If someone configures their cgroups so that the system cgroup is a parent of the burstable pods cgroup, this will silently fail, because a child cgroup cannot have memory.swap.max set to max while its parent is set to 0.

Contributor (author):

We can add a check in kubelet during startup to verify that the burstable pods cgroup is not a child of the system reserved cgroup. That should take care of it.

I will update the KEP with this info.
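Such a startup check could look roughly like this; `isDescendant` is a hypothetical helper and the paths are illustrative, not part of the actual kubelet patch:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isDescendant reports whether child is nested (at any depth) under parent
// in the cgroup filesystem hierarchy, based purely on the paths.
func isDescendant(parent, child string) bool {
	rel, err := filepath.Rel(filepath.Clean(parent), filepath.Clean(child))
	if err != nil {
		return false
	}
	// "." means the same cgroup; a leading ".." means child is outside parent.
	return rel != "." && rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	systemReserved := "/sys/fs/cgroup/system.slice"
	burstable := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice"
	if isDescendant(systemReserved, burstable) {
		fmt.Println("invalid: burstable pods cgroup is nested under the system reserved cgroup")
	} else {
		fmt.Println("ok: burstable pods cgroup is outside the system reserved cgroup")
	}
}
```

With the default slice layout shown, the check passes; if burstable pods were placed under system.slice, kubelet startup could fail fast instead of silently capping their swap at 0.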


Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.

Additionally end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
Member:

Suggested change:
- Additionally end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
+ Additionally end user may decide to disable swap completely for a Pod or a container in beta 1 by making Pod guaranteed or set request == limit for a container. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.

Contributor (author):

Updated. Thanks.

@SergeyKanzhelev (Member) left a comment:

Overall lgtm. Let's add a note to make sure we can configure k8s so that system daemonsets are not swapped. This will address the security concerns.

Also, I'm not sure about the reasoning for disabling unlimited swap; I left a comment about it. If it is indeed a problem, I'm fine with removing this option.

Signed-off-by: Harshal Patil <harpatil@redhat.com>
Co-authored-by: Itamar Holder <iholder@redhat.com>
@harche (Contributor, author) commented Jun 15, 2023

Thanks @SergeyKanzhelev for your review. I have added a note to make sure system daemons aren't swapped as well as dropped the point about removing the support for unlimited swap.


Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.

To guarantee that system daemons are not swapped, the kubelet must set `memory.swap.max` to `0` within the system reserved cgroup. Moreover, to make sure that burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with burstable pods is not nested under the cgroup designated for system reserved.
Contributor:

In other words: kubeletconfig.CgroupRoot can't be a child of kubeletconfig.SystemReservedCgroup.

As an implementation note: this should probably be added to the kubelet config documentation.
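A hypothetical KubeletConfiguration fragment illustrating the constraint (the field names exist in the v1beta1 kubelet config API, but the values here are examples only; `cgroupRoot` must not be nested under `systemReservedCgroup`):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupRoot: "/kubepods"               # must NOT be a child of systemReservedCgroup
systemReservedCgroup: "/system.slice"
failSwapOn: false                     # required to run the kubelet with swap enabled
memorySwap:
  swapBehavior: LimitedSwap           # example value
```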

@SergeyKanzhelev (Member) left a comment:

/lgtm
/approve

thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@mrunalp (Contributor) commented Jun 15, 2023

Thanks @SergeyKanzhelev

/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, mrunalp, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 9e9e591 into kubernetes:master Jun 15, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 15, 2023