Add a blog-post for Swap graduating to Beta
Signed-off-by: Itamar Holder <iholder@redhat.com>
1 parent d97fba7, commit 932ea12
Showing 1 changed file with 191 additions and 0 deletions.
content/en/blog/_posts/2023-07-18-swap-beta1-graduation/index.md
@@ -0,0 +1,191 @@
---
layout: blog
title: "Kubernetes 1.28: NodeSwap graduates to Beta1"
date: 2023-07-18
slug: swap-beta1-1.28-2023
---

**Author:** Itamar Holder (Red Hat)

The 1.22 release introduced Alpha support for configuring swap memory usage for
Kubernetes workloads on a per-node basis. Now, in release 1.28, swap configuration
graduates to Beta1 with many new improvements.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems.
This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization
when swap memory was involved. As a result, swap support was deemed out of scope in the initial
design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory
was detected on a node.

Version 1.22 introduced the swap feature in its Alpha stage. This was
a significant advancement, giving users the opportunity to experiment with the swap
feature for the first time. However, as an Alpha version, it was not fully developed and had
several issues, including inadequate support for cgroup v2, insufficient metrics and summary
API statistics, limited testing, and more.

Swap in Kubernetes has numerous [use cases](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#user-stories)
for a wide range of users. As a result, significant effort has been put into making
the Beta version of swap more stable, robust, and user-friendly, and into addressing the
Alpha issues mentioned above. This represents a crucial step towards achieving the
goal of fully supporting swap in Kubernetes in a controlled, reliable, and predictable manner.

## How do I use it?
To use swap memory on a node where it has already been provisioned, enable the
`NodeSwap` feature gate on the kubelet.
In addition, either set the `failSwapOn` configuration setting to `false` or pass
`--fail-swap-on=false` on the kubelet command line.

You can configure the `memorySwap.swapBehavior` option to define how a node uses swap memory. For example:

```yaml
memorySwap:
  swapBehavior: UnlimitedSwap
```
The available configuration options for `swapBehavior` are:
- `UnlimitedSwap` (default): Kubernetes workloads can use as much swap memory as they
  request, up to the system limit.
- `LimitedSwap`: Swap usage by Kubernetes workloads is limited. Only Pods of Burstable QoS are permitted to use swap.

If configuration for `memorySwap` is not specified and the feature gate is
enabled, by default the kubelet will apply the same behaviour as the
`UnlimitedSwap` setting.

Note that `NodeSwap` is supported for **cgroup v2** only. cgroup v1 is no longer supported.

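Putting the settings above together, a kubelet configuration file that enables swap might look like the following minimal sketch. It assumes swap is already provisioned on the node, and the choice of `LimitedSwap` is only an illustration.

```yaml
# Minimal sketch of a kubelet configuration file that enables swap.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to start on a node that has swap enabled.
failSwapOn: false
featureGates:
  # Enable the NodeSwap feature gate.
  NodeSwap: true
memorySwap:
  # Illustrative choice; leave memorySwap unset to get the UnlimitedSwap behaviour.
  swapBehavior: LimitedSwap
```
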
## How is the swap limit being determined with LimitedSwap?
Configuring swap memory, including its limits, is a significant
challenge. Not only is it prone to misconfiguration, but as a system-level property, any
misconfiguration could potentially compromise the entire node rather than just a specific
workload. To mitigate this risk and ensure the health of the node, the Beta implementation
of swap configures these limits automatically.

With `LimitedSwap`, Pods that do not fall under the Burstable QoS classification (i.e.
`BestEffort`/`Guaranteed` QoS Pods) are prohibited from utilizing swap memory.
`BestEffort` QoS Pods exhibit unpredictable memory consumption patterns and lack
information regarding their memory usage, making it difficult to determine a safe
allocation of swap memory. Conversely, `Guaranteed` QoS Pods are typically employed for
applications that rely on the precise allocation of resources specified by the workload,
with memory being immediately available. To maintain the aforementioned security and node
health guarantees, these Pods are not permitted to use swap memory when `LimitedSwap` is
in effect.

Before describing how the swap limit is calculated, it helps to define the following terms:
* `nodeTotalMemory`: The total amount of physical memory available on the node.
* `totalPodsSwapAvailable`: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
* `containerMemoryRequest`: The container's memory request.

Swap limitation is configured as:
`(containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable`.

In other words, a container may use the same fraction of the swap available to Pods
as the fraction of the node's total physical memory that it requests.

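As a worked example of the formula above, consider the following sketch of a Burstable Pod. The node sizes in the comments (32Gi of physical memory, 8Gi of swap available to Pods), the Pod and container names, and the placeholder image are all assumptions chosen for illustration.

```yaml
# Hypothetical Burstable Pod: the memory request is lower than the memory limit.
apiVersion: v1
kind: Pod
metadata:
  name: swap-demo                      # hypothetical name
spec:
  containers:
  - name: app                          # hypothetical name
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        memory: 4Gi                    # containerMemoryRequest
      limits:
        memory: 8Gi
# Assumed node: nodeTotalMemory = 32Gi, totalPodsSwapAvailable = 8Gi.
# Swap limit under LimitedSwap: (4Gi / 32Gi) * 8Gi = 1Gi for this container.
```
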
Note that containers within Burstable QoS Pods can
opt out of swap usage by specifying memory requests that are equal to memory limits.
Containers configured in this manner will not have access to swap memory.

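The opt-out described above can be sketched as follows. The Pod remains Burstable because the second container's memory request is below its limit and no CPU resources are set. The Pod name, container names, and placeholder image are hypothetical.

```yaml
# Hypothetical Burstable Pod in which one container opts out of swap.
apiVersion: v1
kind: Pod
metadata:
  name: swap-opt-out-demo              # hypothetical name
spec:
  containers:
  - name: no-swap                      # hypothetical name
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 1Gi                    # request == limit: no swap for this container
  - name: may-swap                     # hypothetical name
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 2Gi                    # request < limit: may use swap under LimitedSwap
```
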
## How does it work?
There are a number of possible ways that one could envision swap use on a node.
When swap is already provisioned and available on a node,
[we have proposed](https://github.com/kubernetes/enhancements/blob/9d127347773ad19894ca488ee04f1cd3af5774fc/keps/sig-node/2400-node-swap/README.md#proposal)
the kubelet should be able to be configured such that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory
  to Kubernetes workloads by default.
- You can configure the kubelet to specify swap utilization for the entire
  node.

Swap configuration on a node is exposed to a cluster admin via the
[`memorySwap` in the KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/).
As a cluster administrator, you can specify the node's behaviour in the
presence of swap memory by setting `memorySwap.swapBehavior`.

The kubelet uses the CRI (Container Runtime Interface) API to direct the container runtime to
configure specific cgroup v2 parameters (such as `memory.swap.max`) so that a container
gets the desired swap configuration. The container runtime is then responsible for
writing these settings to the container-level cgroup.

## How can I monitor swap?
A notable deficiency in the Alpha version was the inability to monitor and introspect swap
usage. The Beta version introduced in Kubernetes 1.28 addresses this issue and
lets you monitor swap usage in several ways.

The Beta version of the kubelet now collects
[node-level metric statistics](https://kubernetes.io/docs/reference/instrumentation/node-metrics/),
which can be accessed at the `/metrics/resource` and `/stats/summary` endpoints. This allows users to
monitor swap usage and remaining swap memory when using `LimitedSwap`. Additionally, a
`machine_swap_bytes` metric has been added to cAdvisor to show the total physical swap capacity of the
machine.

## Caveats
Having swap available on a system reduces predictability. Swap's performance is
worse than regular memory, sometimes by many orders of magnitude, which can
cause unexpected performance regressions. Furthermore, swap changes a system's
behaviour under memory pressure, and applications cannot directly control what
portions of their memory usage are swapped out. Since enabling swap permits
greater memory usage for workloads in Kubernetes that cannot be predictably
accounted for, it also increases the risk of noisy neighbours and unexpected
packing configurations, as the scheduler cannot account for swap memory usage.

The performance of a node with swap memory enabled depends on the underlying
physical storage. When swap memory is in use, performance will be significantly
worse in an I/O operations per second (IOPS) constrained environment, such as a
cloud VM with I/O throttling, when compared to faster storage mediums like
solid-state drives or NVMe.

As such, we do not recommend using swap memory for workloads or
environments that are subject to performance constraints. Furthermore, we
recommend using `LimitedSwap`, as this significantly mitigates the risks
posed to the node.

Cluster administrators and developers should benchmark their nodes and applications
before using swap in production scenarios, and [we need your help](#how-do-i-get-involved) with that!

### Security risk
Enabling swap on a system without encryption poses a security risk, as critical information,
such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains
access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is
recommended to use encrypted swap. However, handling encrypted swap is not within the scope of
kubelet; rather, it is a general OS configuration concern and should be addressed at that level.
It is the administrator's responsibility to provision encrypted swap to mitigate this risk.

Furthermore, as previously mentioned, with `LimitedSwap` the user has the option to completely
disable swap usage for a container by specifying memory requests that are equal to memory limits.
This will prevent the corresponding containers from accessing swap memory and eliminate any
associated risks of information exposure.

## Looking ahead
The Kubernetes 1.28 release introduces Beta support for swap memory on nodes,
and we will continue to work towards graduating the feature to general availability (GA)
in future releases. This will include:

* Adding the ability to set a system-reserved quantity of swap from what the kubelet detects on the host.
* Adding support for controlling swap consumption at the Pod level via cgroups.
  * This point is still under discussion.
* Collecting feedback from user test cases.
  * We will consider introducing new configuration modes for swap, such as a
    node-wide swap limit for workloads.

## How can I learn more?
You can review the current [documentation](https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory)
on the Kubernetes website.

For more information, and to assist with testing and provide feedback, please
see [KEP-2400](https://github.com/kubernetes/enhancements/issues/4128) and its
[design proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md).

Feel free to reach out to me, Itamar Holder (**@iholder101** on Slack and GitHub)
if you'd like to help or ask further questions.

## How do I get involved?
Your feedback is always welcome! SIG Node [meets regularly](https://github.com/kubernetes/community/tree/master/sig-node#meetings)
and [can be reached](https://github.com/kubernetes/community/tree/master/sig-node#contact)
via [Slack](https://slack.k8s.io/) (channel **#sig-node**), or the SIG's
[mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node).