Add a blog-post for Swap graduating to Beta
Signed-off-by: Itamar Holder <iholder@redhat.com>
iholder101 committed Jul 19, 2023
1 parent d97fba7 commit 932ea12
Showing 1 changed file with 191 additions and 0 deletions.
191 changes: 191 additions & 0 deletions content/en/blog/_posts/2023-07-18-swap-beta1-graduation/index.md
@@ -0,0 +1,191 @@
---
layout: blog
title: "Kubernetes 1.28: NodeSwap graduates to Beta1"
date: 2023-07-18
slug: swap-beta1-1.28-2023
---

**Author:** Itamar Holder (Red Hat)

The 1.22 release introduced Alpha support for configuring swap memory usage for
Kubernetes workloads on a per-node basis. Now, in release 1.28, swap configuration
graduates to Beta1 with many new improvements.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems.
This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization
when swap memory was involved. As a result, swap support was deemed out of scope in the initial
design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory
was detected on a node.

In version 1.22, the swap feature was initially introduced in its Alpha stage. This represented
a significant advancement, providing users with the opportunity to experiment with the swap
feature for the first time. However, as an Alpha version, it was not fully developed and had
several issues, including inadequate support for cgroup v2, insufficient metrics and summary
API statistics, inadequate testing, and more.

Swap in Kubernetes has numerous [use cases](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#user-stories)
for a wide range of users. As a result, significant effort has gone into making
the Beta version of swap more stable, robust, and user-friendly, and into addressing the
Alpha issues mentioned above. This represents a crucial step towards achieving the
goal of fully supporting swap in Kubernetes in a controlled, reliable, and predictable manner.

## How do I use it?
To use swap on a node where it has already been provisioned, enable the
`NodeSwap` feature gate on the kubelet. In addition, either set the `failSwapOn`
configuration setting to `false`, or pass `--fail-swap-on=false` on the kubelet
command line.

You can also set the `memorySwap.swapBehavior` option to define how a node uses swap memory. For instance,

```yaml
memorySwap:
  swapBehavior: UnlimitedSwap
```
The available configuration options for `swapBehavior` are:
- `UnlimitedSwap` (default): Kubernetes workloads can use as much swap memory as they
  request, up to the system limit.
- `LimitedSwap`: Swap usage by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to use swap.

If configuration for `memorySwap` is not specified and the feature gate is
enabled, by default the kubelet will apply the same behaviour as the
`UnlimitedSwap` setting.

Note that `NodeSwap` is supported for **cgroup v2** only. cgroup v1 is no longer supported.
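
Before changing the kubelet configuration, it can help to confirm that the node meets both preconditions mentioned above: swap is provisioned and the node runs cgroup v2. The following is a minimal Python sketch and is not part of Kubernetes; the helper names are hypothetical. It assumes the standard Linux locations `/proc/meminfo` (for the `SwapTotal` field) and `/sys/fs/cgroup/cgroup.controllers` (a file that is present at the root of a cgroup v2 unified hierarchy).

```python
import os


def swap_total_kib(meminfo_text: str) -> int:
    """Return the SwapTotal value (in KiB) from /proc/meminfo-style text, or 0 if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like: "SwapTotal:       8388604 kB"
            return int(line.split()[1])
    return 0


def is_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """cgroup v2 (unified hierarchy) exposes cgroup.controllers at its root."""
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))


def node_ready_for_swap(meminfo_text: str, cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """True when swap is provisioned and the node uses cgroup v2."""
    return swap_total_kib(meminfo_text) > 0 and is_cgroup_v2(cgroup_root)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print("ready for NodeSwap:", node_ready_for_swap(f.read()))
```

This only checks the host; the feature gate and `failSwapOn` setting still need to be configured on the kubelet as described above.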

## How is the swap limit being determined with LimitedSwap?
The configuration of swap memory, including its limitations, presents a significant
challenge. Not only is it prone to misconfiguration, but as a system-level property, any
misconfiguration could potentially compromise the entire node rather than just a specific
workload. To mitigate this risk and ensure the health of the node, we have implemented
Swap in Beta with automatic configuration of limitations.

With `LimitedSwap`, Pods that do not fall under the Burstable QoS classification (i.e.
`BestEffort`/`Guaranteed` QoS Pods) are prohibited from utilizing swap memory.
`BestEffort` QoS Pods exhibit unpredictable memory consumption patterns and lack
information regarding their memory usage, making it difficult to determine a safe
allocation of swap memory. Conversely, `Guaranteed` QoS Pods are typically employed for
applications that rely on the precise allocation of resources specified by the workload,
with memory being immediately available. To maintain the aforementioned security and node
health guarantees, these Pods are not permitted to use swap memory when `LimitedSwap` is
in effect.

Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
* `nodeTotalMemory`: The total amount of physical memory available on the node.
* `totalPodsSwapAvailable`: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
* `containerMemoryRequest`: The container's memory request.

Swap limitation is configured as:
`(containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable`.

In other words, a container's swap limit is its share of the swap available to Pods,
where that share equals the container's memory request divided by the node's total
physical memory.

It is important to note that, for containers within Burstable QoS Pods, it is possible to
opt out of swap usage by specifying memory requests that are equal to memory limits.
Containers configured in this manner will not have access to swap memory.
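
The rules in this section can be summarized in a short sketch. The Python function below is illustrative only and is not the kubelet's implementation; the function name, its parameters, and the integer rounding are assumptions made for the example. It returns zero for non-Burstable Pods and for containers whose memory request equals their memory limit, and otherwise applies the formula above.

```python
from typing import Optional

GIB = 1024 ** 3


def limited_swap_bytes(
    qos_class: str,
    container_memory_request: int,
    container_memory_limit: Optional[int],
    node_total_memory: int,
    total_pods_swap_available: int,
) -> int:
    """Swap limit in bytes for one container under LimitedSwap (illustrative)."""
    # Only Burstable QoS Pods may use swap.
    if qos_class != "Burstable":
        return 0
    # A memory request equal to the memory limit opts the container out of swap.
    if container_memory_limit is not None and container_memory_request == container_memory_limit:
        return 0
    # (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable,
    # computed with integers; rounding down is an assumption of this sketch.
    return container_memory_request * total_pods_swap_available // node_total_memory


# Example: a node with 32 GiB of memory and 8 GiB of swap available to Pods.
print(limited_swap_bytes("Burstable", 4 * GIB, 8 * GIB, 32 * GIB, 8 * GIB) // GIB)  # 1
```

In this example, a container that requests 4 GiB of memory receives 4/32 of the 8 GiB of swap, which is 1 GiB.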

## How does it work?
There are a number of possible ways that one could envision swap use on a node.
When swap is already provisioned and available on a node,
[we have proposed](https://github.com/kubernetes/enhancements/blob/9d127347773ad19894ca488ee04f1cd3af5774fc/keps/sig-node/2400-node-swap/README.md#proposal)
that the kubelet should be configurable such that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory
to Kubernetes workloads by default.
- You can configure the kubelet to specify swap utilization for the entire
node.

Swap configuration on a node is exposed to a cluster admin via the
[`memorySwap` in the KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/).
As a cluster administrator, you can specify the node's behaviour in the
presence of swap memory by setting `memorySwap.swapBehavior`.

The kubelet uses the CRI (Container Runtime Interface) API to direct the container runtime to
configure specific cgroup v2 parameters (such as `memory.swap.max`) so that the desired
swap configuration applies to a container. The container runtime is then responsible for
writing these settings to the container-level cgroup.
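
To see the result of this on a node, you can read the `memory.swap.max` file of a container's cgroup. In cgroup v2, this file holds either the literal `max` (no limit) or a byte count. The Python sketch below is a minimal illustration; the helper names are hypothetical, and the path to a container's cgroup directory depends on the container runtime and cgroup driver, so it is left as a parameter.

```python
import os
from typing import Optional


def parse_swap_max(text: str) -> Optional[int]:
    """Parse memory.swap.max content: None means unlimited, otherwise a byte count."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_swap_max(cgroup_dir: str) -> Optional[int]:
    """Read memory.swap.max from the given cgroup v2 directory."""
    with open(os.path.join(cgroup_dir, "memory.swap.max")) as f:
        return parse_swap_max(f.read())
```

A value of `0` means the container cannot use swap at all, which is what you would expect to see for containers that are not eligible for swap under `LimitedSwap`.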

## How can I monitor swap?
A notable deficiency in the Alpha version was the inability to monitor and introspect swap
usage. This issue has been addressed in the Beta version introduced in Kubernetes 1.28, which now
provides the capability to monitor swap usage through several different methods.

In the Beta version, the kubelet collects
[node-level metric statistics](https://kubernetes.io/docs/reference/instrumentation/node-metrics/),
which can be accessed at the `/metrics/resource` and `/stats/summary` endpoints. This allows users to
monitor swap usage and remaining swap memory when using `LimitedSwap`. Additionally, a
`machine_swap_bytes` metric has been added to cadvisor to show the total physical swap capacity of the
machine.
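
As a rough illustration of consuming these endpoints, the Python sketch below filters swap-related samples out of text in the Prometheus exposition format, which is what `/metrics/resource` returns. It is a simplified parser for illustration, not a complete implementation of the format. The metric names and values in `SAMPLE` are assumptions chosen for the example (the exact names exposed by your kubelet version may differ), and fetching the text from the kubelet, which requires authentication, is left out.

```python
from typing import List, Tuple

# Hypothetical excerpt of a /metrics/resource response (names and values are illustrative).
SAMPLE = """\
# HELP node_memory_working_set_bytes Current working set of the node in bytes
node_memory_working_set_bytes 2147483648 1690000000000
node_swap_usage_bytes 1048576 1690000000000
container_swap_usage_bytes{container="app",namespace="default",pod="web-0"} 524288 1690000000000
"""


def swap_samples(exposition: str) -> List[Tuple[str, str, float]]:
    """Return (metric name, raw label string, value) for metrics whose name contains 'swap'."""
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        if "swap" not in name:
            continue
        # The first field after the name/labels is the value; a timestamp may follow.
        samples.append((name, labels, float(tail.split()[0])))
    return samples


for name, labels, value in swap_samples(SAMPLE):
    print(name, labels, value)
```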

## Caveats
Having swap available on a system reduces predictability. Swap's performance is
worse than regular memory, sometimes by many orders of magnitude, which can
cause unexpected performance regressions. Furthermore, swap changes a system's
behaviour under memory pressure, and applications cannot directly control what
portions of their memory usage are swapped out. Since enabling swap permits
greater memory usage for workloads in Kubernetes that cannot be predictably
accounted for, it also increases the risk of noisy neighbours and unexpected
packing configurations, as the scheduler cannot account for swap memory usage.

The performance of a node with swap memory enabled depends on the underlying
physical storage. When swap memory is in use, performance will be significantly
worse in an I/O operations per second (IOPS) constrained environment, such as a
cloud VM with I/O throttling, when compared to faster storage mediums like
solid-state drives or NVMe.

As such, we do not advocate the utilization of swap memory for workloads or
environments that are subject to performance constraints. Furthermore, it is
recommended to employ `LimitedSwap`, as this significantly mitigates the risks
posed to the node.

Cluster administrators and developers should benchmark their nodes and applications
before using swap in production scenarios, and [we need your help](#how-do-i-get-involved) with that!

### Security risk
Enabling swap on a system without encryption poses a security risk, as critical information,
such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains
access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is
recommended to use encrypted swap. However, handling encrypted swap is not within the scope of
kubelet; rather, it is a general OS configuration concern and should be addressed at that level.
It is the administrator's responsibility to provision encrypted swap to mitigate this risk.

Furthermore, as previously mentioned, with `LimitedSwap` the user has the option to completely
disable swap usage for a container by specifying memory requests that are equal to memory limits.
This will prevent the corresponding containers from accessing swap memory and eliminate any
associated risks of information exposure.

## Looking ahead
The Kubernetes 1.28 release introduces Beta support for swap memory on nodes,
and we will continue to work towards general availability in future releases. This
will include:

* Adding the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
* Adding support for controlling swap consumption at the Pod level via cgroups.
  * This point is still under discussion.
* Collecting feedback from test use cases.
  * We will consider introducing new configuration modes for swap, such as a
    node-wide swap limit for workloads.

## How can I learn more?
You can review the current [documentation](https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory)
on the Kubernetes website.

For more information, and to assist with testing and provide feedback, please
see [KEP-2400](https://github.com/kubernetes/enhancements/issues/4128) and its
[design proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md).

Feel free to reach out to me, Itamar Holder (**@iholder101** on Slack and GitHub)
if you'd like to help or ask further questions.

## How do I get involved?
Your feedback is always welcome! SIG Node [meets regularly](https://github.com/kubernetes/community/tree/master/sig-node#meetings)
and [can be reached](https://github.com/kubernetes/community/tree/master/sig-node#contact)
via [Slack](https://slack.k8s.io/) (channel **#sig-node**), or the SIG's
[mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node).


