seccomp: Can't find entry on tid_real #2508

rst0git · 2024-11-01T18:32:06Z

Steps to reproduce the issue:

Create a Pod with the following manifest:

apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 101
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: counter
      image: busybox:latest
      command:
        - "/bin/sh"
        - "-c"
        - "i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done"
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
      resources:
        requests:
          memory: "1Gi"
          cpu: "1"
        limits:
          memory: "1Gi"
          cpu: "1"

Create a container checkpoint using the kubelet Checkpoint API

Describe the results you received:
CRIU fails with the following error:

(00.122028) Collecting pidns 9/501089
(00.122037) Error (criu/seccomp.c:61): seccomp: Can't find entry on tid_real 501089
(00.122039) Error (criu/seccomp.c:278): seccomp: Can't collect filter on leader tid_real 501089
(00.122098) net: Unlock network
(00.122100) Running network-unlock scripts
(00.122101) 	RPC
(00.196082) cuda_plugin: finished cuda_plugin stage 0 err -1
(00.196155) Unfreezing tasks into 1
(00.196157) 	Unseizing 501089 into 1
(00.196165) Error (compel/src/lib/infect.c:418): Unable to detach from 501089: No such process
(00.196174) Error (criu/cr-dump.c:2111): Dumping FAILED

dump.log


rst0git · 2024-11-02T13:40:11Z

@avagin, using `git bisect`, I was able to confirm that this error is caused by the changes introduced in #2475.
If we remove /usr/lib/criu/cuda_plugin.so, the error does not appear.

When `check_freezer_cgroup()` returns a non-zero value, `goto err` jumps to `return ret`. However, the value of `ret` has been set to `0` in the lines above, so CRIU does not handle the error properly.

This problem is related to checkpoint-restore#2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
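The error-handling pattern described in this commit message can be illustrated with a minimal sketch. This is not the CRIU source: `check_freezer_stub()`, `collect_buggy()` and `collect_fixed()` are hypothetical names, and the stub simply always fails so that the difference between the two shapes is visible.

```c
/* Hypothetical stand-in for check_freezer_cgroup(); it always reports failure. */
static int check_freezer_stub(void)
{
	return -1;
}

/* Buggy shape: the failing call's result is tested but never stored in ret. */
static int collect_buggy(void)
{
	int ret;

	ret = 0; /* an earlier step succeeded and left ret at 0 */

	if (check_freezer_stub())
		goto err;

	return 0;
err:
	return ret; /* ret is still 0, so the caller sees success */
}

/* Fixed shape: the result is stored in ret before jumping to the error label. */
static int collect_fixed(void)
{
	int ret;

	ret = 0; /* same earlier step as above */

	ret = check_freezer_stub();
	if (ret)
		goto err;

	return 0;
err:
	return ret; /* propagates the non-zero error code */
}
```

With the stub failing, the buggy variant reports success to its caller while the fixed variant propagates the error, which is the behavior change the commit message describes.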

Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container rootfs changes. In this case, the container is frozen before invoking CRIU. Once CRIU successfully completes, a copy of the container rootfs diff is saved, and then the container is unfrozen. To enable GPU checkpointing support with these runtimes, we need to unfreeze the cgroup and restore it to its original state at the end.

Fixes: checkpoint-restore#2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:

- `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
- `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
- `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`.

Fixes: checkpoint-restore#2508
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
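The "unfreeze, then return the cgroup to its original state" idea from this commit message can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: `struct freezer_sim`, `sim_prepare()` and `sim_restore()` are hypothetical names, no real cgroup files are touched, and the actual CRIU implementation may be structured differently.

```c
#include <stdbool.h>

/* Simulated freezer states; a real freezer cgroup is driven through cgroup files. */
enum freezer_state { THAWED, FROZEN };

struct freezer_sim {
	enum freezer_state state; /* current state of the simulated cgroup */
	enum freezer_state saved; /* state recorded before the dump started */
	bool saved_valid;         /* true while a saved state is pending restore */
};

/* Before the dump: remember the original state, then thaw the cgroup. */
static void sim_prepare(struct freezer_sim *f)
{
	f->saved = f->state;
	f->saved_valid = true;
	f->state = THAWED;
}

/* After the dump: put the cgroup back into whatever state it had before. */
static void sim_restore(struct freezer_sim *f)
{
	if (!f->saved_valid)
		return; /* nothing was saved, so nothing to restore */

	f->state = f->saved;
	f->saved_valid = false;
}
```

In this sketch, a cgroup that a container runtime froze before the dump is thawed for the duration of the dump and frozen again afterwards, while a cgroup that was never frozen is left thawed.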

rst0git added the bug label Nov 2, 2024

rst0git mentioned this issue Nov 4, 2024

seize: fix error handling for check_freezer_cgroup #2513

Merged

rst0git mentioned this issue Nov 4, 2024

seize: enable support for frozen containers #2514

Merged

avagin closed this as completed in #2514 Nov 12, 2024

avagin closed this as completed in f8f0e1d Nov 12, 2024
