Can't use hugepage if UserNamespaceSupport is enabled #1380

Closed
ikwork opened this issue Dec 18, 2023 · 17 comments
Comments

@ikwork

ikwork commented Dec 18, 2023

Hi,

I'm using "hostUsers: false" in my pods, to use the feature UserNamespacesSupport
https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/

All other volumeMounts are working fine except hugepages.

I'm using crun version 1.9 (also tried version 1.12) with crio 1.28.1

root@ubuntu:~# k get node -o wide
NAME     STATUS   ROLES           AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
ubuntu   Ready    control-plane   7d10h   v1.28.4   10.54.41.172   <none>        Ubuntu 23.10   6.5.0-14-generic   cri-o://1.28.1

root@ubuntu:~# crun --version
crun version 1.9
commit: a538ac4ea1ff319bcfe2bf81cb5c6f687e2dc9d3
rundir: /run/user/0/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL

root@ubuntu:~# cat /proc/meminfo | grep -i ugep
AnonHugePages:      4096 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   49152
HugePages_Free:    45190
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

root@ubuntu:~# k get po | grep test4
test4-859cb57c87-shsms            0/1     CreateContainerError   0                8m30s

root@ubuntu:~# k describe po test4-859cb57c87-shsms
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Warning  Failed          7m8s (x8 over 8m41s)    kubelet            Error: container create failed: mount_setattr `/hugepages`: Invalid argument

My pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: test
  name: test4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      imagePullSecrets:
      - name: regcred
      hostUsers: false
      securityContext:
        fsGroup: 10000
      containers:
      - image: nginx
        name: nginx
        command:
        - "/bin/bash"
        - "-c"
        - "tail -f /dev/null"
        resources:
          limits:
            hugepages-2Mi: 1Gi
            cpu: "1"
            memory: 1Gi
          requests:
            hugepages-2Mi: 1Gi
            cpu: "1"
            memory: 1Gi
        securityContext:
          runAsUser: 10001
        volumeMounts:
          - name: hugepage
            mountPath: /hugepages
      volumes:
        - name: hugepage
          emptyDir:
            medium: HugePages

Error from kubelet

Dec 18 22:14:25 ubuntu kubelet[294806]: I1218 22:14:25.682661  294806 kubelet_pods.go:161] "Creating hosts mount for container" pod="default/test4-859cb57c87-shsms" containerName="nginx" podIPs=["10.85.5.252"] path=true
Dec 18 22:14:25 ubuntu kubelet[294806]: E1218 22:14:25.788376  294806 kuberuntime_manager.go:1261] container &Container{Name:nginx,Image:bash,Command:[/bin/bash -c tail -f /dev/null],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},hugepages-2Mi: {{1073741824 0} {<nil>} 1Gi BinarySI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Requests:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},hugepages-2Mi: {{1073741824 0} {<nil>} 1Gi BinarySI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:hugepage,ReadOnly:false,MountPath:/hugepages,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cm-mount,ReadOnly:false,MountPath:/etc/cm,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:sec-mount,ReadOnly:false,MountPath:/etc/sec,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:local-persistent-storage,ReadOnly:false,MountPath:/mnt,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cache-volume,ReadOnly:false,MountPath:/cache,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-vx7fr,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*10001,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod test4-859cb57c87-shsms_default(8ba891d8-7514-4bb1-904e-a2f1700f8d7e): 
CreateContainerError: container create failed: mount_setattr `/hugepages`: Invalid argument
@giuseppe
Member

Would it be possible for you to grab the /run/user/0/crun/$CONTAINER_ID/config.json file and upload it here? Some invalid setting might be passed to the mount, since we are probably setting it up as an idmapped mount.

What kernel are you using? It needs to support idmapped mounts for tmpfs (kernel >= 6.3).
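The kernel requirement can be checked by comparing the release string from `uname -r` with 6.3, the release in which tmpfs gained idmapped mount support. This is only a minimal sketch: `kernel_at_least` is a hypothetical helper name, and it looks at the major.minor prefix only, so it cannot account for distro backports.

```python
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a `uname -r` style string is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


# Release string taken from the `uname -r` output posted in this thread.
print(kernel_at_least("6.5.0-14-generic", 6, 3))  # True
```

On a live node the argument would come from `platform.release()`, which returns the same string as `uname -r`.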

@ikwork
Author

ikwork commented Dec 22, 2023

Hi @giuseppe,

Sorry for the late response.

The container is not running at all, so I can't get the container id for it.

The kernel I'm using is 6.5

# k get po | grep test4
test4-79767494fb-w47dx            0/1     CreateContainerError   0                 32s

# k describe po test4-79767494fb-w47dx | tail -6
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       112s                default-scheduler  Successfully assigned default/test4-79767494fb-w47dx to ubuntu
  Normal   AddedInterface  111s                multus             Add eth0 [10.85.6.3/16] from crio
  Normal   Pulled          2s (x11 over 111s)  kubelet            Container image "bash" already present on machine
  Warning  Failed          2s (x11 over 111s)  kubelet            Error: container create failed: mount_setattr `/hugepages`: Invalid argument

root@ubuntu:~# crictl pods | grep test4-79767494fb-w47dx
182453b8701f4       2 minutes ago       Ready               test4-79767494fb-w47dx                         default             0                   (default)

root@ubuntu:~# crictl ps | grep 182453b8701f4 | wc -l
0

More info about my setup

# cat /etc/crio/crio.conf.d/10-crun.conf
[crio.runtime]
default_runtime = "crun"

[crio.runtime.runtimes.crun]
allowed_annotations = [
    "io.containers.trace-syscall",
]

# uname -r
6.5.0-14-generic

# k get node -o wide
NAME     STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
ubuntu   Ready    control-plane   10d   v1.28.4   10.54.41.172   <none>        Ubuntu 23.10   6.5.0-14-generic   cri-o://1.28.1

# crio config |grep crun
INFO[2023-12-22 11:42:38.698570514Z] Starting CRI-O, version: 1.28.1, git: eda470f7f503d9f40a9aa2a02e45f0878ed6fc61(dirty)
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
default_runtime = "crun"
# It also requires at least runc 1.1.0 or crun 0.19 which support the notifier
[crio.runtime.runtimes.crun]
runtime_path = "/usr/local/bin/crun"
runtime_root = "/run/crun"

When I launch the same pod without the hostUsers (user namespace) feature, it works and here's the info:

root@ubuntu:/etc/crio# ps aux | grep test4-7c8cc7b8f7-qp7ht
root     2121501  0.0  0.0   2156  1120 ?        Ss   11:45   0:00 /usr/local/bin/conmon -b /run/containers/storage/overlay-containers/14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6/userdata -c 14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6 --exit-dir /var/run/crio/exits -l /var/log/pods/default_test4-7c8cc7b8f7-qp7ht_7746ecfa-5076-4264-83bb-4789b9d8394b/nginx/0.log --log-level info -n k8s_nginx_test4-7c8cc7b8f7-qp7ht_default_7746ecfa-5076-4264-83bb-4789b9d8394b_0 -P /run/containers/storage/overlay-containers/14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6/userdata -r /usr/local/bin/crun --runtime-arg --root=/run/crun --socket-dir-path /var/run/crio --syslog -u 14df8f0ee231c289b06d441f6fc94fa66bfed5aa3b40962bba3ebd48b153eda6 -s


But even in a running pod/container, I can't see the config.json file:

root@ubuntu:/etc/crio# crictl pods | grep test4-7c8cc7b8f7-qp7ht
03d9f43eaccd3       2 minutes ago       Ready               test4-7c8cc7b8f7-qp7ht                         default             0                   (default)
root@ubuntu:/etc/crio# crictl ps | grep 03d9f43eaccd3
14df8f0ee231c       f208c2c858b15de928c27e2a010fa60ca1de880563fa9abce98d5e0a19f8b48d   2 minutes ago       Running             nginx                         0                   03d9f43eaccd3       test4-7c8cc7b8f7-qp7ht
root@ubuntu:/etc/crio# cat /run/user/0/crun/14df8f0ee231c/config.json
cat: /run/user/0/crun/14df8f0ee231c/config.json: No such file or directory
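
The `conmon` command line shown earlier passes the container's bundle directory with `-b`, and crun is started with `--root=/run/crun`, so `/run/user/0/crun` is not where this setup keeps its state. Assuming CRI-O writes the generated OCI spec into the bundle directory (an assumption, not confirmed in this thread), the file to look for would be `<bundle>/config.json`. A minimal sketch that derives that path from a `conmon` command line follows. `bundle_config_path` and the shortened container ID `abc123` are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def bundle_config_path(conmon_cmdline: str) -> str:
    """Return <bundle>/config.json, where <bundle> is conmon's -b argument."""
    argv = shlex.split(conmon_cmdline)
    # Walk adjacent (flag, value) pairs and pick the value that follows -b.
    for flag, value in zip(argv, argv[1:]):
        if flag == "-b":
            return str(PurePosixPath(value) / "config.json")
    raise ValueError("no -b argument found in conmon command line")


# Hypothetical, shortened command line modelled on the `ps aux` output above.
cmdline = (
    "/usr/local/bin/conmon "
    "-b /run/containers/storage/overlay-containers/abc123/userdata "
    "-c abc123 -r /usr/local/bin/crun"
)
print(bundle_config_path(cmdline))
```

For a container that fails at creation time, the bundle may already have been cleaned up by the time you look, so the file is not guaranteed to exist.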

@ikwork
Author

ikwork commented Dec 22, 2023

I can get the config.json in the following way, let me know if that's of any help:

root@ubuntu:/etc/crio# crun spec

root@ubuntu:/etc/crio# cat config.json
{
        "ociVersion": "1.0.0",
        "process": {
                "terminal": true,
                "user": {
                        "uid": 0,
                        "gid": 0
                },
                "args": [
                        "sh"
                ],
                "env": [
                        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                        "TERM=xterm"
                ],
                "cwd": "/",
                "capabilities": {
                        "bounding": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ],
                        "effective": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ],
                        "inheritable": [
                        ],
                        "permitted": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ],
                        "ambient": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ]
                },
                "rlimits": [
                        {
                                "type": "RLIMIT_NOFILE",
                                "hard": 1024,
                                "soft": 1024
                        }
                ],
                "noNewPrivileges": true
        },
        "root": {
                "path": "rootfs",
                "readonly": true
        },
        "hostname": "crun",
        "mounts": [
                {
                        "destination": "/proc",
                        "type": "proc",
                        "source": "proc"
                },
                {
                        "destination": "/dev",
                        "type": "tmpfs",
                        "source": "tmpfs",
                        "options": [
                                "nosuid",
                                "strictatime",
                                "mode=755",
                                "size=65536k"
                        ]
                },
                {
                        "destination": "/dev/pts",
                        "type": "devpts",
                        "source": "devpts",
                        "options": [
                                "nosuid",
                                "noexec",
                                "newinstance",
                                "ptmxmode=0666",
                                "mode=0620",
                                "gid=5"
                        ]
                },
                {
                        "destination": "/dev/shm",
                        "type": "tmpfs",
                        "source": "shm",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "mode=1777",
                                "size=65536k"
                        ]
                },
                {
                        "destination": "/dev/mqueue",
                        "type": "mqueue",
                        "source": "mqueue",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev"
                        ]
                },
                {
                        "destination": "/sys",
                        "type": "sysfs",
                        "source": "sysfs",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "ro"
                        ]
                },
                {
                        "destination": "/sys/fs/cgroup",
                        "type": "cgroup",
                        "source": "cgroup",
                        "options": [
                                "nosuid",
                                "noexec",
                                "nodev",
                                "relatime",
                                "ro"
                        ]
                }
        ],
        "linux": {
                "resources": {
                        "devices": [
                                {
                                        "allow": false,
                                        "access": "rwm"
                                }
                        ]
                },
                "namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "type": "network"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        }
                ],
                "maskedPaths": [
                        "/proc/acpi",
                        "/proc/asound",
                        "/proc/kcore",
                        "/proc/keys",
                        "/proc/latency_stats",
                        "/proc/timer_list",
                        "/proc/timer_stats",
                        "/proc/sched_debug",
                        "/sys/firmware",
                        "/proc/scsi"
                ],
                "readonlyPaths": [
                        "/proc/bus",
                        "/proc/fs",
                        "/proc/irq",
                        "/proc/sys",
                        "/proc/sysrq-trigger"
                ]
        }
}

@giuseppe
Member

I can get the config.json in the following way, let me know if that's of any help:

thanks, that is the default config.json file generated by crun so it doesn't help in this case.

What is the underlying file system? The file system might not support idmapped mounts

@ikwork
Author

ikwork commented Dec 22, 2023

Sure,

Is there anything specific I should check?

root@ubuntu:~# df -h -T /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4  484G   75G  409G  16% /

root@ubuntu:~# df -h -T /var/lib/kubelet/
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4  484G   76G  408G  16% /

@giuseppe
Member

this is probably an issue in Kubernetes, and we should not be using idmapped mounts with the hugetlb mount.

@rata have you ever seen this issue before?

@rata
Contributor

rata commented Dec 27, 2023

@giuseppe no, I haven't seen this before. I've been playing locally, and it seems that huge pages use the hugetlbfs file-system, so we can't idmap that filesystem and the pod fails to start. It seems that simple.

IMHO the path forward might be to document this in kube (although we say the fs needs to be supported, it might not be clear that huge pages use a different fs), improve the errors in crun/runc, and add support in Linux for idmapped mounts on hugetlbfs.

@giuseppe what do you think?

Here is a more detailed version of what I did to conclude what I've just said.

I've created a repro locally, based on the pod from this issue (with small adjustments, since the image pull secret and the like don't exist on my side). You can configure huge pages as explained here: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/. Then start the pod without userns first, to check that it really works as expected.

With userns it fails to start with this error (this is using containerd and runc from main, but something similar should happen with crio and crun):

  Warning  Failed     27s (x3 over 47s)  kubelet, 127.0.0.1  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/volumes/kubernetes.io~empty-dir/hugepage: invalid argument: unknown

When checking on the host, the source is a hugetlbfs:

# mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,nosuid,nodev,relatime,pagesize=2M)
nodev on /var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/volumes/kubernetes.io~empty-dir/hugepage type hugetlbfs (rw,relatime,pagesize=2M)
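
The same check can be scripted by reading `/proc/self/mountinfo` and looking up the filesystem type behind a mount point. The sketch below is illustrative only: `fstype_of` is a hypothetical helper, the mount IDs and device numbers in the sample are made up, and the set of filesystems that reject idmapped mounts contains only hugetlbfs because that is the one identified in this thread at the time of writing.

```python
from typing import Optional

# Per this thread, hugetlbfs could not be idmapped at the time of writing.
NOT_IDMAPPABLE = {"hugetlbfs"}


def fstype_of(mount_point: str, mountinfo_text: str) -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None."""
    found = None
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount, fields after it are per-superblock.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = left.split()
        # Field 5 (index 4) is the mount point; spaces in it appear as \040.
        if len(fields) >= 5 and fields[4] == mount_point:
            found = right.split()[0]  # a later entry shadows an earlier one
    return found


# Sample text in mountinfo format; IDs and device numbers are invented.
sample = (
    "29 1 252:1 / / rw,relatime shared:1 - ext4 /dev/vda1 rw\n"
    "100 29 0:50 / /dev/hugepages rw,nosuid,nodev,relatime shared:20"
    " - hugetlbfs hugetlbfs rw,pagesize=2M\n"
)
fs = fstype_of("/dev/hugepages", sample)
print(fs, "cannot be idmapped" if fs in NOT_IDMAPPABLE else "not flagged")
```

On a real node, pass the contents of `/proc/self/mountinfo` and the source path of the volume instead of the sample text.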

I've also captured the config.json, although there isn't anything very interesting in it, just the /hugepages bind mount whose source fs is hugetlbfs:

{
  "ociVersion": "1.1.0+dev",
  "process": {
    "user": {
      "uid": 10001,
      "gid": 0,
      "additionalGids": [
        0,
        10000
      ]
    },
    "args": [
      "/bin/bash",
      "-c",
      "tail -f /dev/null"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "HOSTNAME=test4-5ff77f866-mrv44",
      "NGINX_VERSION=1.25.3",
      "NJS_VERSION=0.8.2",
      "PKG_RELEASE=1~bookworm",
      "KUBERNETES_PORT=tcp://10.0.0.1:443",
      "KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443",
      "KUBERNETES_PORT_443_TCP_PROTO=tcp",
      "KUBERNETES_PORT_443_TCP_PORT=443",
      "KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1",
      "KUBERNETES_SERVICE_HOST=10.0.0.1",
      "KUBERNETES_SERVICE_PORT=443",
      "KUBERNETES_SERVICE_PORT_HTTPS=443"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FSETID",
        "CAP_FOWNER",
        "CAP_MKNOD",
        "CAP_NET_RAW",
        "CAP_SETGID",
        "CAP_SETUID",
        "CAP_SETFCAP",
        "CAP_SETPCAP",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT",
        "CAP_KILL",
        "CAP_AUDIT_WRITE"
      ],
      "effective": [
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FSETID",
        "CAP_FOWNER",
        "CAP_MKNOD",
        "CAP_NET_RAW",
        "CAP_SETGID",
        "CAP_SETUID",
        "CAP_SETFCAP",
        "CAP_SETPCAP",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT",
        "CAP_KILL",
        "CAP_AUDIT_WRITE"
      ],
      "permitted": [
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FSETID",
        "CAP_FOWNER",
        "CAP_MKNOD",
        "CAP_NET_RAW",
        "CAP_SETGID",
        "CAP_SETUID",
        "CAP_SETFCAP",
        "CAP_SETPCAP",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT",
        "CAP_KILL",
        "CAP_AUDIT_WRITE"
      ]
    },
    "oomScoreAdj": -997
  },
  "root": {
    "path": "rootfs"
  },
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620",
        "gid=5"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    },
    {
      "destination": "/hugepages",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/volumes/kubernetes.io~empty-dir/hugepage",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    },
    {
      "destination": "/etc/hosts",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/etc-hosts",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    },
    {
      "destination": "/dev/termination-log",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/containers/nginx/e397e419",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    },
    {
      "destination": "/etc/hostname",
      "type": "bind",
      "source": "/var/lib/containerd-rata/io.containerd.grpc.v1.cri/sandboxes/660a9463a8b8a6038c5b5f0aa915ee28b76a092c134558759865ab67444700bd/hostname",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    },
    {
      "destination": "/etc/resolv.conf",
      "type": "bind",
      "source": "/var/lib/containerd-rata/io.containerd.grpc.v1.cri/sandboxes/660a9463a8b8a6038c5b5f0aa915ee28b76a092c134558759865ab67444700bd/resolv.conf",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "bind",
      "source": "/run/containerd-rata/io.containerd.grpc.v1.cri/sandboxes/660a9463a8b8a6038c5b5f0aa915ee28b76a092c134558759865ab67444700bd/shm",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },
    {
      "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/68cca114-92e8-4916-abf8-7e36d2f221a6/volumes/kubernetes.io~projected/kube-api-access-xpmck",
      "options": [
        "rbind",
        "rprivate",
        "ro"
      ],
      "uidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ],
      "gidMappings": [
        {
          "containerID": 0,
          "hostID": 4068605952,
          "size": 65536
        }
      ]
    }
  ],
  "annotations": {
    "io.kubernetes.cri.container-name": "nginx",
    "io.kubernetes.cri.container-type": "container",
    "io.kubernetes.cri.image-name": "docker.io/library/nginx:latest",
    "io.kubernetes.cri.sandbox-id": "660a9463a8b8a6038c5b5f0aa915ee28b76a092c134558759865ab67444700bd",
    "io.kubernetes.cri.sandbox-name": "test4-5ff77f866-mrv44",
    "io.kubernetes.cri.sandbox-namespace": "default",
    "io.kubernetes.cri.sandbox-uid": "68cca114-92e8-4916-abf8-7e36d2f221a6"
  },
  "linux": {
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 4068605952,
        "size": 65536
      }
    ],
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 4068605952,
        "size": 65536
      }
    ],
    "resources": {
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        }
      ],
      "memory": {
        "limit": 1073741824,
        "swap": 1073741824
      },
      "cpu": {
        "shares": 1024,
        "quota": 100000,
        "period": 100000
      },
      "unified": {
        "memory.oom.group": "1",
        "memory.swap.max": "0"
      }
    },
    "cgroupsPath": "/kubepods/pod68cca114-92e8-4916-abf8-7e36d2f221a6/13389b90b8cb8e34e660a79472ac79d336dd8926ab6977067de8e67993c1f8af",
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc",
        "path": "/proc/14711/ns/ipc"
      },
      {
        "type": "uts",
        "path": "/proc/14711/ns/uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "network",
        "path": "/proc/14711/ns/net"
      },
      {
        "type": "user",
        "path": "/proc/14711/ns/user"
      },
      {
        "type": "cgroup"
      }
    ],
    "maskedPaths": [
      "/proc/asound",
      "/proc/acpi",
      "/proc/kcore",
      "/proc/keys",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/proc/scsi",
      "/sys/firmware"
    ],
    "readonlyPaths": [
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}
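
To see which mounts in a spec like this one are requested as idmapped mounts, it is enough to look for mount entries that carry `uidMappings` or `gidMappings`. A minimal sketch, with `idmapped_mounts` as a hypothetical helper name and a sample spec trimmed down from the config.json above:

```python
def idmapped_mounts(spec: dict) -> list:
    """Return destinations of mounts that carry uidMappings or gidMappings."""
    return [
        mount["destination"]
        for mount in spec.get("mounts", [])
        if mount.get("uidMappings") or mount.get("gidMappings")
    ]


# Trimmed-down sample based on the spec above (the source path is shortened).
mapping = [{"containerID": 0, "hostID": 4068605952, "size": 65536}]
spec = {
    "mounts": [
        {"destination": "/proc", "type": "proc", "source": "proc"},
        {
            "destination": "/hugepages",
            "type": "bind",
            "source": "/path/to/hugepage-volume",
            "uidMappings": mapping,
            "gidMappings": mapping,
        },
        {"destination": "/dev/shm", "type": "bind", "source": "/path/to/shm"},
    ]
}
print(idmapped_mounts(spec))  # ['/hugepages']
```

For a real file, load it with `json.load` and pass the resulting dict. Combined with the filesystem type of each mount source, this shows which mount the runtime will fail on.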

@rata
Contributor

rata commented Dec 27, 2023

IMHO the path forward might be to document this in kube (although we say the fs needs to be supported, it might not be clear that huge pages use a different fs), improve the errors in crun/runc, and add support in Linux for idmapped mounts on hugetlbfs.

I've opened PRs to runc and crun to improve the error message they show. I think that makes it clear enough that we don't need more docs on the kube side. The remaining thing would be to add support for idmapped mounts to hugetlbfs.

What do others think?

@rhatdan
Member

rhatdan commented Dec 28, 2023

SGTM

@giuseppe
Member

thanks for the patch. I am closing the issue because there is not much more to do from the OCI runtime.

idmapped mounts support must be added to the hugetlb file system in the kernel for this to work

@giuseppe
Member

giuseppe commented Dec 29, 2023

I'll propose giuseppe/linux@3592ce4 upstream to add idmapped mounts support to hugetlbfs

@giuseppe
Member

@brauner is giuseppe/linux@3592ce4 something you could pull to your tree (assuming you are fine with it) or should I submit it to lkml + hugetlb maintainers?

@brauner

brauner commented Jan 2, 2024 via email

@ikwork
Author

ikwork commented Jan 2, 2024

@giuseppe thank you so much for this

btw, I tried the same with older kernel 5.14.0-70.30.1.el9_0.x86_64 (with the older k8s feature gate UserNamespacesStatelessPodsSupport) and hugepages seem to be working fine.

(The rest of the mounts, such as ConfigMaps and Secrets, are not supported yet on this version, however.)

[root@node1 ]# k get node -o wide
NAME    STATUS   ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                              KERNEL-VERSION                CONTAINER-RUNTIME
node1   Ready    <none>   3h58m   v1.25.3   10.228.41.164   <none>        Red Hat Enterprise Linux 9.3 (Plow)   5.14.0-70.30.1.el9_0.x86_64   containerd://1.7.11

[root@node1 ]# k get po test4-6d77d9f8d-5pb7l -o yaml | grep -i "hostUser\|huge"
        hugepages-2Mi: 100Mi
        hugepages-2Mi: 100Mi
    - mountPath: /hugepages
      name: hugepage
  hostUsers: false
      medium: HugePages
    name: hugepage

[root@node1 ~]# k exec -it test4-6d77d9f8d-5pb7l sh
# cat /proc/meminfo | grep -i hugepa
AnonHugePages:   4911104 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
# mount | grep hugepages
nodev on /hugepages type hugetlbfs (rw,relatime,seclabel,pagesize=2M)
# id
uid=0(root) gid=0(root) groups=0(root),10(wheel),10000
# sleep 44554455 &
# ps aux | grep 4455
  31 root      0:00 sleep 44554455
# exit
# ps aux | grep 44554455
196608    406429  0.0  0.0   2484   920 ?        S    06:54   0:00 sleep 4455445
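
The host-side UID 196608 for a process that is root inside the container is what a user namespace mapping produces: each container ID is shifted by the offset between `containerID` and `hostID`. A minimal sketch of that arithmetic, using the mapping format from the OCI config.json posted earlier in this thread. `to_host_id` is a hypothetical helper, and the mapping for this RHEL node (`hostID` 196608) is an assumption inferred from the `ps` output, since it is not shown here.

```python
def to_host_id(container_id: int, mappings: list) -> int:
    """Translate a container UID/GID to the host ID via OCI-style mappings."""
    for m in mappings:
        offset = container_id - m["containerID"]
        if 0 <= offset < m["size"]:
            return m["hostID"] + offset
    raise ValueError(f"ID {container_id} is not mapped")


# Assumed mapping for this node, inferred from the host-side `ps` output.
assumed = [{"containerID": 0, "hostID": 196608, "size": 65536}]
print(to_host_id(0, assumed))  # 196608

# Mapping taken from the config.json posted earlier in the thread.
from_spec = [{"containerID": 0, "hostID": 4068605952, "size": 65536}]
print(to_host_id(10001, from_spec))  # 4068615953
```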

@rata
Contributor

rata commented Jan 2, 2024

@ikwork yes, older k8s releases don't require idmapped mounts, which is why it worked. Going forward we do require them, because there were some limitations and concerns from other SIGs, and we need them anyway for persistent volumes.

@giuseppe
Member

giuseppe commented Jan 8, 2024

Yeah, I can take that!

thanks! I'll send you the patch by email

@ikwork
Author

ikwork commented Jan 10, 2024

Thanks @giuseppe and @brauner for the fix and resolutions.

I would really like to try this. Any chance this may be available in a kernel release soon?
