kubelet base environment fixes #3240

Merged 4 commits on May 19, 2023

2 changes: 1 addition & 1 deletion images/base/files/etc/default/kubelet
@@ -1 +1 @@
-KUBELET_EXTRA_ARGS=--fail-swap-on=false
+KUBELET_EXTRA_ARGS=--runtime-cgroups=/system.slice/containerd.service

Review comment (Member, Author):

This is the cgroup in which containerd operates (in kind, and in most cases since there's a standard in-tree service spec).

GKE nodes set this; I noticed it while debugging the runc issues, which reminded me that we don't set it here.
The SIG Node GCE containerd CI jobs set it too.

https://github.com/kubernetes/test-infra/blob/6928f7825c26cd52dd2db97298db1882d6609584/config/jobs/kubernetes/sig-node/containerd.yaml#L20

https://github.com/kubernetes/kubernetes/blob/15a1f9a39db95a7f83ce5463a328566411d99d51/cluster/gce/util.sh#L823
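
One way to confirm that the path passed to `--runtime-cgroups` matches where containerd actually runs is to read the process's cgroup file. The following is a minimal sketch, not part of this PR: `cgroup_path_from` is a hypothetical helper, and it assumes the usual `/proc/<pid>/cgroup` line formats (`0::<path>` on cgroup v2, `<id>:name=systemd:<path>` for the systemd hierarchy on v1).

```shell
# Hypothetical helper (not part of kind): print the cgroup path found in
# /proc/<pid>/cgroup-style content read from stdin.
# Prefers the cgroup v2 entry ("0::<path>") and falls back to the
# v1 "name=systemd" hierarchy when no v2 entry is present.
cgroup_path_from() {
  local line v2="" v1=""
  while IFS= read -r line; do
    case "${line}" in
      0::*) v2="${line#0::}" ;;
      *:name=systemd:*) v1="${line#*:name=systemd:}" ;;
    esac
  done
  printf '%s\n' "${v2:-${v1}}"
}

# Usage on a node (assumes containerd is running and pidof is installed):
#   cgroup_path_from < "/proc/$(pidof containerd)/cgroup"
```

If the printed path differs from the value in `KUBELET_EXTRA_ARGS`, the flag is probably pointing at the wrong cgroup for that environment.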

9 changes: 4 additions & 5 deletions images/base/files/etc/systemd/system/containerd.service
@@ -3,27 +3,26 @@
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
-After=network.target
+After=network.target local-fs.target
# disable rate limiting
StartLimitIntervalSec=0

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Restart=always
RestartSec=1

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=1
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
1 change: 1 addition & 0 deletions images/base/files/etc/systemd/system/kubelet.service
@@ -19,6 +19,7 @@ RestartSec=1s
CPUAccounting=true
MemoryAccounting=true
Slice=kubelet.slice
+KillMode=process

[Install]
WantedBy=multi-user.target
@@ -15,4 +15,4 @@ ExecStartPre=/bin/sh -euc "if [ -f /sys/fs/cgroup/cgroup.controllers ]; then cre
# This eventually leads to kubelet failing to start, see: https://github.com/kubernetes-sigs/kind/issues/2323
ExecStartPre=/bin/sh -euc "if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi"
ExecStart=
-ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --cgroup-root=/kubelet
+ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
43 changes: 29 additions & 14 deletions images/base/files/usr/local/bin/entrypoint
@@ -181,8 +181,8 @@ fix_mount() {
   mount --make-rshared /
 }

-# helper used by fix_cgroup
-mount_kubelet_cgroup_root() {
+# helper used by mount_kubelet_cgroup_root
+mount_kubelet_cgroup_root_subsystem() {
   local cgroup_root=$1
   local subsystem=$2
   if [ -z "${cgroup_root}" ]; then
@@ -201,6 +201,22 @@ mount_kubelet_cgroup_root() {
   mount --bind "${subsystem}/${cgroup_root}" "${subsystem}/${cgroup_root}"
 }

+# helper used by fix_cgroup
+mount_kubelet_cgroup_root() {

Review comment (Member, Author):

This is pulled out from the end of fix_cgroup so it can be reused on cgroup v1 both with and without cgroupns.

We're responsible for setting up the cgroup when we use --cgroup-root.

We previously did that for the "v1, no cgroupns, nested in another cluster" case, but we'll need to keep --cgroup-root=kubelet until we can phase it out later, following #3241.

+  local cgroup_subsystems=$1
+  echo "${cgroup_subsystems}" |
+  while IFS= read -r subsystem; do
+    mount_kubelet_cgroup_root_subsystem /kubelet "${subsystem}"
+    mount_kubelet_cgroup_root_subsystem /kubelet.slice "${subsystem}"
+  done
+  # workaround for hosts not running systemd
+  # we only do this for kubelet.slice because it's not relevant when not using
+  # the systemd cgroup driver
+  if [[ ! "${cgroup_subsystems}" = */sys/fs/cgroup/systemd* ]]; then
+    mount_kubelet_cgroup_root_subsystem /kubelet.slice /sys/fs/cgroup/systemd
+  fi
+}
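
To see which cgroup directories the new helper would touch without mounting anything, the iteration can be reproduced as a dry run. This is only a sketch under stated assumptions: `plan_kubelet_cgroup_mounts` is a hypothetical name, it prints `<cgroup root> <subsystem>` pairs instead of calling `mount_kubelet_cgroup_root_subsystem`, and it uses a portable `case` match in place of the `[[ ... ]]` test.

```shell
# Hypothetical dry run of the loop in mount_kubelet_cgroup_root:
# prints "<cgroup root> <subsystem mount>" pairs instead of bind-mounting.
plan_kubelet_cgroup_mounts() {
  local cgroup_subsystems=$1
  printf '%s\n' "${cgroup_subsystems}" |
  while IFS= read -r subsystem; do
    printf '%s %s\n' /kubelet "${subsystem}"
    printf '%s %s\n' /kubelet.slice "${subsystem}"
  done
  # Mirror the workaround for hosts not running systemd: add the systemd
  # hierarchy for kubelet.slice only when it is not already in the list.
  case "${cgroup_subsystems}" in
    */sys/fs/cgroup/systemd*) ;;
    *) printf '%s %s\n' /kubelet.slice /sys/fs/cgroup/systemd ;;
  esac
}

# Usage with a newline-separated list, as produced by findmnt in fix_cgroup:
#   plan_kubelet_cgroup_mounts "$(printf '/sys/fs/cgroup/cpu\n/sys/fs/cgroup/memory')"
```

Each subsystem yields two pairs, one for `/kubelet` and one for `/kubelet.slice`; the extra systemd pair appears only when the list does not already contain `/sys/fs/cgroup/systemd`.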

 fix_cgroup() {
   if [[ -f "/sys/fs/cgroup/cgroup.controllers" ]]; then
     echo 'INFO: detected cgroup v2'
@@ -226,8 +242,17 @@ fix_cgroup() {
   # current process. this tells us what cgroup-path the container is in.
   local current_cgroup
   current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
+  local cgroup_subsystems
+  cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep -F "${current_cgroup}" | awk '{print $2}')
   if [ "$current_cgroup" = "/" ]; then
-    echo "INFO: cgroupns detected, no need to fix cgroups"
+    echo "INFO: detected cgroupns"
+    # kubelet will try to manage cgroups / pods that are not owned by it when
+    # "nesting" clusters, unless we instruct it to use a different cgroup root.
+    # We do this, and when doing so we must fixup this alternative root
+    # currently this is hardcoded to be /kubelet
+    # under systemd cgroup driver, kubelet appends .slice
+    mount --make-rprivate /sys/fs/cgroup
+    mount_kubelet_cgroup_root "${cgroup_subsystems}"
     return
   fi
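
The cgroupns check above depends on what the `grep | cut` pipeline extracts from `/proc/self/cgroup`. As a rough illustration, the same pipeline can be wrapped so it reads sample content from stdin; `cpu_cgroup_path` is a hypothetical name used only for this sketch, and the sample lines in the usage comments are made up.

```shell
# Hypothetical wrapper around the pipeline used in fix_cgroup. It reads
# /proc/<pid>/cgroup-style content from stdin instead of /proc/self/cgroup
# and prints the path of the cgroup v1 hierarchy that includes "cpu".
cpu_cgroup_path() {
  grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' | cut -d: -f3
}

# Inside a cgroup namespace the process sees itself at the root:
#   printf '4:cpu,cpuacct:/\n' | cpu_cgroup_path
# Without one, the container's full path on the host is visible:
#   printf '4:cpu,cpuacct:/docker/abc\n' | cpu_cgroup_path
```

The regular expression matches `cpu` as a whole controller name, alone or in a comma-separated list, so a `cpuset` line is not picked up by mistake.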

@@ -310,17 +335,7 @@ fix_cgroup() {
   # currently this is hardcoded to be /kubelet
   # under systemd cgroup driver, kubelet appends .slice
   mount --make-rprivate /sys/fs/cgroup
-  echo "${cgroup_subsystems}" |
-  while IFS= read -r subsystem; do
-    mount_kubelet_cgroup_root /kubelet "${subsystem}"
-    mount_kubelet_cgroup_root /kubelet.slice "${subsystem}"
-  done
-  # workaround for hosts not running systemd
-  # we only do this for kubelet.slice because it's not relevant when not using
-  # the systemd cgroup driver
-  if [[ ! "${cgroup_subsystems}" = */sys/fs/cgroup/systemd* ]]; then
-    mount_kubelet_cgroup_root /kubelet.slice /sys/fs/cgroup/systemd
-  fi
+  mount_kubelet_cgroup_root "${cgroup_subsystems}"
 }

 fix_machine_id() {