
Correct crun's behavior when it runs inside a cgroup v2 container #923

Closed
skepticoitusInteruptus opened this issue May 19, 2022 · 34 comments · Fixed by #931

@skepticoitusInteruptus

skepticoitusInteruptus commented May 19, 2022

Description

When run inside a cgroup v2 container, crun attempts an apparently cgroup v1-like operation.

My spike:

  • Evaluate the behavior parity of the runc and crun container runtimes for cgroup v2 support.1

My tests:

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm

/system.me # podman run -it skepticoital/mem_limit:hmmm

/system.me # dmesg | grep -i killed

Steps to reproduce the issue:

  1. Using Docker from a physical host machine with cgroup v2 support2, run a rootful Alpine container that is, itself, also configured with cgroup v2 support:3
docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm
  2. Inside the container, observe the container's enabled controllers:
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids
  3. Observe the container's cgroup.type:
cat /sys/fs/cgroup/cgroup.type
domain threaded
  4. Observe there are controller interface files in the container's /sys/fs/cgroup dir:4
ls /sys/fs/cgroup/
...cpu.max...hugetlb.1GB.max...io.stat...memory.max...pids.max...rdma.max
  5. Observe that the value in bytes of the container's /sys/fs/cgroup/memory.max file equals the value in megabytes of the host's docker run -m switch (a quick arithmetic check follows this list):
cat /sys/fs/cgroup/memory.max
1073741824
  6. Exercise crun's interaction with the container's cgroup v2 memory controller:
podman run -it skepticoital/mem_limit:hmmm

  7. Observe that crun chokes, reporting EOPNOTSUPP (guessing: it wants to modify its parent cgroup?):5
WARN[0005] Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup path /libpod_parent/conmon: write /sysfs/cgroup/cgroup.subtree_control: operation not supported
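
For reference, the arithmetic behind step 5 (docker's -m 1024m expressed in bytes):

echo $((1024 * 1024 * 1024))    # 1073741824, i.e. 1024 MiB
cat /sys/fs/cgroup/memory.max   # 1073741824 inside the container, as shown above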

Describe the results you received:

Error: OCI runtime error: crun: writing file `/sysfs/cgroup/cgroup.subtree_control`: Not supported

Describe the results you expected:

  • To behave — out of the box — the way runc behaves under identical constraints (see Workaround)

Additional information you deem important:

  • The root of cgroup v2's hierarchy is the host machine (i.e., where the kernel is)
  • The kernel considers my system.me container process to be an immediate child of the root cgroup
  • Control Group v2 — The Linux Kernel Admin Guide:
    • "When a process forks a child process, the new process is born into the cgroup that the forking process belongs to at the time of the operation…" — Processes

    • "Marking a cgroup threaded makes it join the resource domain of its parent as a threaded cgroup…The root…serves as the resource domain for the entire subtree…" — Threads

    • "Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled…" — Enabling and Disabling

    • "…the controller interface files - anything which doesn't start with ‚cgroup.‘ are owned by the parent rather than the cgroup itself" — Enabling and Disabling

    • "Resources are distributed top-down…" — Top-down Constraint

    • "…only domain cgroups which don't contain any processes can have domain controllers enabled in their ‚cgroup.subtree_control‘ files" — No Internal Process Constraint

    • "To control resource distribution of a cgroup, the cgroup must create children and transfer all its processes to the children before enabling controllers in its ‚cgroup.subtree_control‘ file" — No Internal Process Constraint

  • My Alpine distro is not inited by systemd6
  • My spike is not for a rootless container use case
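
For illustration, a minimal generic sketch of that last constraint (the cgroup name is hypothetical; it assumes a plain domain cgroup that still holds the shell):

cd /sys/fs/cgroup/example                 # hypothetical domain cgroup containing our shell
echo +memory > cgroup.subtree_control     # refused: the cgroup still has member processes
mkdir init
echo $$ > init/cgroup.procs               # move the only process into a leaf child
echo +memory > cgroup.subtree_control     # now accepted: the children get memory.max etc.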

Workaround

  • By configuring podman to replace crun with runc in the container's /etc/containers/containers.conf, the container's podman run... test behaves as expected:
podman run -it skepticoital/mem_limit:hmmm

Allocated = 0 to 1 MB
Allocated = 1 to 2 MB
...
Allocated = 420 to 421 MB
...
Allocated = 933 to 934 MB
<Killed>
...
dmesg | grep -i killed
...
Memory cgroup out of memory: Killed process 42 (mem_limit) total-vm:962004kB...

Output of podman version:

podman version 4.1.0

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-r1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: ad24dda9f2b11fd974e510713e0923f810ea19c6'
  cpuUtilization:
    idlePercent: 99.67
    systemPercent: 0.22
    userPercent: 0.12
  cpus: 4
  distribution:
    distribution: alpine
    version: 3.16.0_alpha20220328
  eventLogger: file
  hostname: 76aab6ccf14e
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.10.102.1-microsoft-standard-WSL2
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 10450141184
  memTotal: 12926758912
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.5-r0
    path: /usr/bin/crun
    version: |-
      crun version 1.4.5
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-r0
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.2
  swapFree: 4294967296
  swapTotal: 4294967296
  uptime: 19h 41m 19.05s (Approximately 0.79 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 269490393088
  graphRootUsed: 4095729664
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.0
  Built: 1652313655
  BuiltTime: Thu May 12 00:00:55 2022
  GitCommit: 6c6d79e5cd2b9dd69b78913a88c062126ff5e11c
  GoVersion: go1.18.2
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.0

Package info:

apk list podman
podman-4.1.0-r1 x86_64 {podman} (Apache-2.0) [installed]

Additional environment details:

uname -a

Linux 10bald424b38 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 Linux
cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.16.0_alpha20220328
PRETTY_NAME="Alpine Linux edge"
...




1 Relates to Podman issue #14236

2 The host's root cgroup MUST have all 6 cgroup v2 controllers enabled

2 The child cgroup will inherit hugetlb, io, memory, and rdma from the host

4 "…enabling [a controller] creates the controller's interface files in the child cgroups…" — The Linux Kernel Control Group v2

5 "Operations which fail due to invalid topology use EOPNTSUPP as the errno…" — Threads

6 Considering systemd as a dependency is off the table

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 24, 2022

Hey @giuseppe, @n1hility, @mheon, @rhatdan 👋  #nudge

If there's anything you fellahs can fill me in on (insights, corrections, advice?), please holler.

If there are any questions I need to answer, shoot.

I look forward to being able to close whatever gaps there might be in my understanding of the issue I'm observing.

TIA.

@giuseppe
Member

runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type (that is the memory controller).

Not sure what the entrypoint in your image is doing and why you need it, but if I do something like:

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm
# mkdir /sys/fs/cgroup/init
# echo 1 > /sys/fs/cgroup/init/cgroup.procs
# podman run -it skepticoital/mem_limit:hmmm
Allocated 1049 to 1050 MB
Done!
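
To make that check concrete, a hedged sketch against the container described in the OP (whose root is domain threaded): only threaded controllers can be delegated further from a thread root, which appears to be why the write to cgroup.subtree_control in the OP fails.

cat /sys/fs/cgroup/cgroup.type                        # domain threaded (step 3 of the OP)
echo +cpu    > /sys/fs/cgroup/cgroup.subtree_control  # accepted: cpu supports threaded cgroups
echo +memory > /sys/fs/cgroup/cgroup.subtree_control  # refused: memory is domain-only; the kernel reports
                                                      # invalid-topology failures as EOPNOTSUPP ("Not supported")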

giuseppe added a commit to giuseppe/crun that referenced this issue May 26, 2022
if moving a process fails with EOPNOTSUPP, then change the target
cgroup type to threaded and attempt the migration again.

Closes: containers#923

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
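
A rough shell rendering of what that commit message describes (crun implements this in C; the path below comes from the warning in the OP and $pid is a placeholder):

target=/sys/fs/cgroup/libpod_parent/conmon
if ! echo "$pid" > "$target/cgroup.procs" 2>/dev/null; then   # the real code retries only on EOPNOTSUPP
    echo threaded > "$target/cgroup.type"                     # change the target cgroup type to threaded
    echo "$pid" > "$target/cgroup.procs"                      # attempt the migration again
fi
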
@giuseppe
Member

giuseppe commented May 26, 2022

opened a PR: #931

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 26, 2022

Thanks for looking into this @giuseppe 👍

"...runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type (that is the memory controller)..."

I'm sorry. It's not clear to me right now how that is applicable to my spike. If you have handy a URL to a resource you could share that explains that feature, I'd appreciate that. TIA.

"…Not sure what the entrypoint in your image is doing…"

It's a kind of very simple init. It's doing two things:

  1. Execute a script that does the same thing my Step 6 here does: "Switch the stock Alpine 3.15's default cgroup filesystem from its original v1 support to v2"
  2. Drop into /bin/sh in the container

"…why you need it…"

I need it for the setup step for my test. I need my test to…

  • "Evaluate the behavior parity of the runc and crun container runtimes for cgroup v2 support."1

The way my test does that evaluation is to execute a simple C program that mocks memory load.2
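
(For readers without the image: a crude shell stand-in for what mem_limit does, i.e. allocate and hold a fixed amount of memory, is the classic tail trick below; it is not the actual mem-limit.c.)

head -c $((1050 * 1024 * 1024)) /dev/zero | tail > /dev/null   # tail buffers ~1050 MB since the stream has no newlines
echo "exit status: $?"                                         # 137 (SIGKILL) expected under a 1024m memory.max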

For my test, a PASS would be if the out-of-memory killer kills the mem_limit process before mem_limit allocated more than the amount of memory specified by the -m values of the outermost docker run command; 1024m in my original example above.

...Allocated 1049 to 1050 MB...

As described in the "Workaround" section of my OP above, the expected outcome is that the process is killed by cgroup v2 resource control and never reaches Done!

In other words, the expectation is that the -m 1024m limit set on your outermost podman run (and on my original docker run) is applied to the nested podman run that attempts to allocate more than the prescribed 1024m.

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Done!

The outcome you're reporting there would be a FAIL in the test scenario described in my OP, given that 1050 MB is greater than -m 1024m.

So my questions at this point are:

  1. How unreasonable are the above expectations?
  2. Is the use case that my reproducer attempts to model, atypical in your opinion?

TIA for your answers @giuseppe.





 1 Apologies for being redundant and quoting myself

 2 The skepticoital:mem_limit container allocates memory up to a hard coded total of 1050 MB

@giuseppe
Copy link
Member

how is the memory allocated by the C program?

@skepticoitusInteruptus
Author

"...how is the memory allocated by the C program?..."

To see the complete, original implementation, do a Ctrl+F for mem-limit.c on this page.1

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Also, I would expect that with your --entrypoint /bin/sh there, the Alpine container that's instantiated would be running with Alpine's default cgroup v1. My spike is about evaluating v2.




 1 The skepticoital:mem_limit container allocates memory up to a hard coded total of 1050 MB instead of 50

@n1hility
Member

n1hility commented May 26, 2022

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Also, I would expect that with your --entrypoint /bin/sh there, the Alpine container that's instantiated would be running with Alpine's default cgroup v1. My spike is about evaluating v2.

The kernel and the mount on the host are what determine cgroup v1 vs v2, not the container. The container bootstrap just creates a cgroup namespace of whatever the host has.
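
A quick way to confirm that from inside any container is to look at the mount table rather than the image contents:

grep cgroup /proc/self/mounts   # one cgroup2 line on a v2 host; a list of per-controller cgroup (v1) lines otherwise
cat /proc/self/cgroup           # pure v2 shows a single unified entry, e.g. 0::/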

@skepticoitusInteruptus
Author

"...The kernel and mount on the host is what determines cgroupv1 vs v2 not the container..."

TIL 🎓

U Rawk, @n1hility 🎸

@skepticoitusInteruptus
Author

Say @flouthoc, @n1hility? 👋

To give @giuseppe a break from my nagging questions, I'd be cool with either of you fellahs fielding, on his behalf, these questions that are still outstanding…

  1. How unreasonable are the above expectations?
  2. Is the use case that my reproducer attempts to model, atypical in your opinion?

TIA.

@flouthoc
Collaborator

Hi @skepticoitusInteruptus

How unreasonable are the above expectations?

After reading the context in the issue above, I think yes: in any case the memory usage of the nested container should be capped by the parent container, although I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared; still, in the worst case I think the max memory will always be capped by what is provided by the parent container.

A small example should verify this

sudo podman run --memory 500m --memory-swap 500m --rm -it --privileged quay.io/containers/podman:latest bash
# Inside the container
[root@7e0a58f2e066 /]# podman run --rm -it progrium/stress --vm 1 --vm-bytes 600M --timeout 1s
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 629145600 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: FAIL: [1] (416) <-- worker 2 got signal 9
stress: WARN: [1] (418) now reaping child worker processes
stress: FAIL: [1] (422) kill error: No such process
stress: FAIL: [1] (452) failed run completed in 1s

But since I am not entirely sure about the nested cgroup v2 behavior here I'll wait for others to confirm.

@giuseppe
Member

For my test, a PASS would be if the out-of-memory killer kills the mem_limit process before mem_limit allocated more than the amount of memory specified by the -m values of the outermost docker run command; 1024m in my original example above.

sorry my mistake. I forgot to specify the --memory-swap option.

If I specify that, as in the original report, then I get your expected result:

$ podman run -it skepticoital/mem_limit:hmmm || echo failed
...
Allocated 986 to 987 MB
Allocated 987 to 988 MB
Allocated 988 to 989 MB
Allocated 989 to 990 MB
failed

@skepticoitusInteruptus
Author

Hey @flouthoc 👋

Ahhh! So this comment must be what you were referring to in our discussion? I didn't see this until two minutes ago. My apologies for my confusion 😕

although I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared

Even given this (from my OP)?


2. Inside the container, observe the container's enabled controllers:

cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids
  3. Observe the container's cgroup.type:
cat /sys/fs/cgroup/cgroup.type
domain threaded
  4. Observe there are controller interface files in the container's /sys/fs/cgroup dir:4
ls /sys/fs/cgroup/
...cpu.max...hugetlb.1GB.max...io.stat...memory.max...pids.max...rdma.max
  5. Observe that the value in bytes of the container's /sys/fs/cgroup/memory.max file equals the value in megabytes of the host's docker run -m switch:
cat /sys/fs/cgroup/memory.max
1073741824

That's just to note the evidence that convinced me, at least, that the nested container in my OP is correctly configured for cgroup v2.

A small example should verify this

sudo podman run --memory 500m --memory-swap 500m --rm -it --privileged quay.io/containers/podman:latest bash
…

Awesome! I will try that myself at some point. For now though I'll note, for the sake of completeness, that my original command above ran docker run… instead.

As for my second question:

"Is the use case that my reproducer attempts to model, atypical in your opinion?"

I guess I'll just have to be happy with my own speculative answer: No, it's not atypical.

@skepticoitusInteruptus
Author

Hey 👋

If I specify that, as in the original report, then I get your expected result:

$ podman run -it skepticoital/mem_limit:hmmm || echo failed
...
Allocated 986 to 987 MB
Allocated 987 to 988 MB
Allocated 988 to 989 MB
Allocated 989 to 990 MB
failed

That's awesome, @giuseppe! I will try that (podman run -m 1024…) myself at some point.

In the meantime though I'll note, for the sake of completeness, that my original reproducer above ran docker run -m 1024… instead.

Might that difference be enough to result in me getting the error I reported above and you not getting that same error with podman run -m 1024… as the outer container?

Error: OCI runtime error: crun: writing file `/sysfs/cgroup/cgroup.subtree_control`: Not supported

@n1hility
Member

Hey @flouthoc 👋

Ahhh! So this comment must be what you were referring to in our discussion? I didn't see this until two minutes ago. My apologies for my confusion 😕

although I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared

Even given this (from my OP)?

Per earlier discussion, there is no need to unmount and remount /sys/fs/cgroup: it's a cgroup namespace, so it's already mounted for you. BTW the reason you end up with a threaded domain is that your script enables the cpu controller without first moving your init process out of the cgroup it is in, triggering the 'no internal process constraint'.
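
A few reads show whether that has happened in the container's namespace root (the values here are the ones reported in the OP):

cat /sys/fs/cgroup/cgroup.type              # domain threaded
cat /sys/fs/cgroup/cgroup.subtree_control   # cpuset cpu pids
cat /sys/fs/cgroup/cgroup.procs             # non-empty: processes still sit directly in the root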

"Is the use case that my reproducer attempts to model, atypical in your opinion?"

I guess I'll just have to be happy with my own speculative answer: No, it's not atypical.

Hard to answer this one. I don't follow what your use-case is. I understand that you are nesting containers and testing memory limiting when nested, but you haven't mentioned the use-case behind it.

@skepticoitusInteruptus
Author

Hey 👋

"…BTW The reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'…"

TIL even more 🎓

"…I don't follow what your use-case is…"

OK if I refer you to item number 7 on my menu of reasons to keep schtum?

I'm contractually obligated to limit what I reveal in public forums about my org's use cases, to only what is strictly sufficient to resolve an issue.

I sincerely want to reciprocate and be as helpful to you @n1hility as you have been to me, though.

That's why I feel bad that "cgroup v2-controlled nested containers" isn't sufficient for you.

So hopefully sharing this link with you will suffice. That issue lists docker and podman commands that are more or less similar to ours.

I just happened to stumble across that issue without even looking for it.

I imagine if I were to proactively search for them, I might find more convincing evidence that similar "cgroup v2-controlled nested containers" use cases are not all that novel after all.

"…you haven't mentioned the use-case behind it…"

You'll just have to take my word for it, @n1hility. The abstractions I shared in my reproducers are pretty decent representative models of the problems we need to solve; in my opinion they are, anyway.

@n1hility
Member

n1hility commented May 28, 2022

Hey 👋

"…BTW The reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'…"

TIL even more 🎓

"…I don't follow what your use-case is…"

OK if I refer you to item number 7 on my menu of reasons to keep schtum?

I'm contractually obligated to limit what I reveal in public forums about my org's use cases, to only what is strictly sufficient to resolve an issue.

Ah sure, been there. In that case I can give a general answer. While there are certainly legitimate cases to nest container engines, life is simpler if you can avoid it and stick to a single flat engine namespace on your host. An example where it can make sense is a CI infrastructure where your host OS image is locked down, but you want a different container engine to orchestrate your tests. One thing to keep in mind is that nested cgroup trees allow limits to overcommit, so setting a limit on the parent can lead to surprising results (containers that are within their individual limit can get killed because the total exceeds the parent limit).
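
A hedged sketch of the overcommit point, with hypothetical cgroup names and limits (it assumes +memory is already enabled in the relevant cgroup.subtree_control files):

echo $((1024*1024*1024)) > /sys/fs/cgroup/parent/memory.max           # parent capped at 1 GiB
mkdir -p /sys/fs/cgroup/parent/child_a /sys/fs/cgroup/parent/child_b
echo $((768*1024*1024)) > /sys/fs/cgroup/parent/child_a/memory.max    # each child is individually under 1 GiB ...
echo $((768*1024*1024)) > /sys/fs/cgroup/parent/child_b/memory.max    # ... but 768M + 768M exceeds the parent's cap
# at roughly 1 GiB of combined usage the OOM killer fires inside the parent's
# subtree even though neither child exceeded its own limit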

@skepticoitusInteruptus
Author

Thanks @giuseppe, @n1hility and @flouthoc (cc: @rhatdan)

I've now reread all of your responses with a lot more attention to detail than I had time to do yesterday afternoon when I first read them.

Other follow-on questions arose after reading your responses. I will spare you and not bombard you with all of those questions today though.

Instead, I will just ask you all one single follow-on question. Before I ask the question though:

Context

Given that I...

  • Refactor the skepticoital/system.me container to not do anything whatsoever regarding configuring|initializing|unmounting|mounting cgroups
  • Run the refactored skepticoital/system.me container with the following very specific command:
docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm
  • Inside the Docker-instantiated container constructed from skepticoital/system.me, run the following very specific command:
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

@n1hility
Member

n1hility commented May 28, 2022

Thanks @giuseppe, @n1hility and @flouthoc (cc: @rhatdan)

you're welcome!

docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm


  • Inside the Docker-instantiated container constructed from skepticoital/system.me, run the following very specific command:

/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1

Change this to add the following before your podman command

mkdir /sys/fs/cgroup/init
echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

The first two commands move your process from the namespace root to a leaf node, which prevents the internal process constraint from flipping your cgroup to domain threaded. Domain is really what you want here, and it has the added benefit of not requiring a release with #931 (which allows crun to operate when domain threaded is in use)

Alternatively instead of relocating your process you can disable cgroups usage by podman for the nested podman since the parent is enforcing the 2048 in this policy configuration. It's the creation of the cgroup and enabling subtree controllers that triggers the flip (when the process is in the root namespace)

podman run --cgroups disabled --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

In this mode you will observe that allocating over 2048 will kill the container, since its part of the docker group that has the 2048 limit.

@skepticoitusInteruptus
Author

Change this to add the following before your podman command

echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

That's fantabulous @n1hility 👍

I will give that a shot at some point later.

🎓 In the meantime, please can I get you to edu-muh-cate me on this:

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

@skepticoitusInteruptus
Author

Ahhh. So sorry. I need to clean my glasses.

Just saw this...

"…In this mode you will observe that allocating over 2048 will kill the container, since its part of the docker group that has the 2048 limit…"

@n1hility
Member

Change this to add the following before your podman command

echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

That's fantabulous @n1hility 👍

I will give that a shot at some point later.

🎓 In the meantime, please can I get you to edu-muh-cate me on this:

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

Without either of the options I mentioned (note I posted an edit adding an alternative of disabling cgroups with podman, which I forgot to mention), you will get a failure, because without #931 crun will fail attempting to create a domain child under a domain threaded root (cgroups disallows this). After #931 it should work, but it will be less ideal since you will be using domain threaded when you don't really need it - it adds additional restrictions / semantics.

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 28, 2022

Howdy do, @n1hility 👋

"…In this mode you will observe that allocating over 2048 1024 will kill the container, since its part of the docker group that has the 2048 1024 limit…"

I fixed (what I presume is) a typo for you there.

Given, verbatim, all of the refactors,1 preconditions, setup and very specific commands I listed in that Context section above, this is the actual outcome I observe…

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmmm
…
/system.me # mkdir /sys/fs/cgroup/init
/system.me # cat /sys/fs/cgroup/init/cgroup.procs
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs
…
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s
…
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 2147483648 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: dbug: [1] <-- worker 2 signalled normally
stress: info: [1] successful run completed in 1s
/system.me # dmesg | grep -i killed
/system.me # 

TL;DR: Given that the nested Podman's --vm-bytes 2048M is greater than the outer Docker's -m 1024m I expected to observe something like @flouthoc's kill error

…
stress: FAIL: [1] (416) <-- worker 2 got signal 9
stress: WARN: [1] (418) now reaping child worker processes
stress: FAIL: [1] (422) kill error: No such process
stress: FAIL: [1] (452) failed run completed in 1s

And/or something like…

…
dmesg | grep -i killed
…
Memory cgroup out of memory: Killed process 42 (mem_limit) total-vm:962004kB...

Or are my expectations mistaken?

crun issue 923 screencast #0




 1 The tag for the refactored reproducer image has four ms: skepticoital/system.me:hmmmm

@n1hility
Member

Howdy do, @n1hility 👋

"…In this mode you will observe that allocating over 2048 1024 will kill the container, since its part of the docker group that has the 2048 1024 limit…"

I fixed (what I presume is) a typo for you there.

Yes sorry, I should have said 1024

Given, verbatim, all of the refactors,1 preconditions, setup and very specific commands I listed in that Context section above, this is the actual outcome I observe…

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmmm
…
/system.me # mkdir /sys/fs/cgroup/init
/system.me # cat /sys/fs/cgroup/init/cgroup.procs
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs
…
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s
…
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 2147483648 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: dbug: [1] <-- worker 2 signalled normally
stress: info: [1] successful run completed in 1s
/system.me # dmesg | grep -i killed
/system.me # 

TL;DR: Given that the nested Podman's --vm-bytes 2048M is greater than the outer Docker's -m 1024m I expected to observe something like @flouthoc's kill error

It's probably not running long enough for it to get killed. Try bumping timeout to 10s or something like that. If a process temporarily allocates over but releases quickly it may survive. That other mem_limit container you had earlier in the thread that allocates and holds should be more reliable at demonstrating a limit kill.

@skepticoitusInteruptus
Author

Hey @giuseppe 👋

"…runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type…"

If you have handy a URL to a resource you could share that explains that feature

One of my favorite old Greek sayings is, "The gods help those who help themselves" 🇬🇷

Please correct me if I've guessed wrong that one of these is what you were referring to:

@n1hility
Member

@skepticoitusInteruptus did bumping the timeout and trying mem_limit work for you?

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 31, 2022

"…did bumping the timeout and trying mem_limit work for you?…"

TL;DR

If I don't do anything whatsoever to initialize|configure|umount|mount cgroups in the skepticoital/system.me image, the -m 1024m and --memory-swap 1024m limits are totally ignored by any process running inside the outer Docker container.1


I have a hunch2 about what might be preventing those limits from being applied. But, I want to be careful not to (mis)lead y'all by the power of suggestion.

So if you and @giuseppe or @flouthoc were to independently arrive at that same hunch from each of your own investigations, that would be super helpful. Not just to me, personally; to the entire community!

Personally though, you all's help would certainly increase my confidence about what the next test cases of my investigation might be.

Instead of overworking y'all with reading, I'll share this recording; if it's any help to you fellahs at all:

crun issue 923 demo #1

Speaking of help, I gotta say: I hope Red Hat appreciate how unique your helpfulness in the containers issue trackers is @n1hility and are paying you your much deserved big ~~bugs~~ bucks 🥇

I know I can't express my ❤️-felt appreciation of your help, often enough. On this and 14236.

Muchas Thankyas es millionas 💯




 1 On an Alpine host in WSL2 on Windows 10; configured like here

 2 Based on the output of cat /proc/self/mounts I highlight in the recording

@n1hility
Member

n1hility commented May 31, 2022

@skepticoitusInteruptus Ah ha! I see the problem (thanks for the animated walkthrough that was helpful - and the kind words). Here is what is happening. When you run docker and it complains about the swap limit, what's happening is that, much like the podman issue in containers/podman#14236, it can't detect whether or not swap limiting should be employed, and falls back to not adjusting it, leaving it at max. Then when you run anything that exceeds the limit, it will just use swap instead of killing the process. (Note that once a podman release is available that includes containers/podman#14308, podman will correctly detect swap in spite of being in the root cgroup). To work around the docker scenario, you need a similar workaround as discussed in containers/podman#14236: create an initial cgroup of some kind on the host and run the docker command there.

From that point on, once in the container, you should see that both /sys/fs/cgroup/memory.max and memory.swap.max reflect the values you are passing to docker.
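
For the numbers used in this thread (-m 1024m --memory-swap 1024m), that should look like:

cat /sys/fs/cgroup/memory.max        # 1073741824
cat /sys/fs/cgroup/memory.swap.max   # 0, i.e. memory-swap minus memory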

@skepticoitusInteruptus
Author

Hey @n1hility 👋

"…create an initial cgroup of some kind on the host and run the docker command there…"

Correct'O'mondo!

That's a step I would've had to do in order to observe the original outcome I reported in my OP above.

It was my oversight not listing it as a "Step to reproduce". So much for my "very specific commands". Right? 😅

Also, so much for my hunch.1 That cgroup /sys/fs/cgroup cgroup2 … mount looks sketchy to me. I intend to take that one up with the WSL2 team at some point.

…If I don't do anything whatsoever to initialize|configure|umount|mount cgroups in the skepticoital/system.me image…

I think there is something worth sharing about that…

  • If I do initialize|configure|umount|mount cgroups in the skepticoital/system.me image (when I remember to run docker in a non-root cgroup before-hand), then when running in the outer Docker container inited by system.me, I've observed that I do get the expected Killed outcome; even though the nested Podman is running in a domain threaded subcgroup.2

"…When you run docker and it complains about the swap limit … it can't detect whether or not swap limiting should be employed, and falls-back to not adjusting it, leaving it at max…"

That "WARNING:…" from Docker about swap is the next little duckie in my sights 🦆 🦆 🦆 🦆 🔫

"…podman will correctly detect swap in spite of being in the root cgroup…"

I'll spare you Podman bros my questions about how reasonable it would be to expect Docker to do that too.

Brace yerselves @kolyshkin, @AkihiroSuda, @thaJeztah, … and the rest of you Docker bros and sisters 😁





 1 Based on the output of cat /proc/self/mounts I highlight in the recording

 2 Only works if Podman's runtime is runc though

@n1hility
Member

n1hility commented May 31, 2022

If I do initialize|configure|umount|mount cgroups in the skepticoital/system.me image (when I remember to run docker in a non-root cgroup before-hand), then when running in the outer Docker container inited by system.me, I've observed that I do get the expected Killed outcome; even though the nested Podman is running in a domain threaded subcgroup.2

@skepticoitusInteruptus I just checked this and want to confirm we are seeing the same thing. With all of the above, podman on crun does work with limits. Does this work for you?:

PS C:\Users\jason> wsl --shutdown
PS C:\Users\jason> wsl -d Alpine
WIN10PC:/mnt/c/Users/jason# umount /sys/fs/cgroup/unified/
WIN10PC:/mnt/c/Users/jason# umount /sys/fs/cgroup
WIN10PC:/mnt/c/Users/jason# mount -t cgroup2 cgroup /sys/fs/cgroup
WIN10PC:/mnt/c/Users/jason# mkdir /sys/fs/cgroup/init
WIN10PC:/mnt/c/Users/jason# echo +memory > /sys/fs/cgroup/cgroup.subtree_control
WIN10PC:/mnt/c/Users/jason# echo $$ > /sys/fs/cgroup/init/cgroup.procs
WIN10PC:/mnt/c/Users/jason# dockerd > /dev/null 2>&1 &
WIN10PC:/mnt/c/Users/jason# docker run -it -m 100M --memory-swap 100M --privileged --entrypoint /bin/sh skepticoital/system.me:hmmm

No warning since we created the cgroup, and values are now what we expect (note that cgroup swap max = container memory-swap - memory)

/system.me # cat /sys/fs/cgroup/memory.max
104857600
/system.me # cat /sys/fs/cgroup/memory.swap.max
0

Set up our cgroup to ensure that we don't get converted to domain threaded:

/system.me # mkdir /sys/fs/cgroup/init
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs

Now run nested podman using your mem limit container:

 /system.me # podman run skepticoital/mem_limit:hmmm
-snipped-
Allocated 44 to 45 MB
Allocated 45 to 46 MB
Allocated 46 to 47 MB
Allocated /system.me # echo $?
137
 dmesg | grep kill
[  178.622193] podman invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[  178.622248]  oom_kill_process.cold+0xb/0x10
[  178.622371] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=init,mems_allowed=0,oom_memcg=/docker/1fd0c04db8fe22ff55da0a157e626582f6302a1cf5a1e8579c721da54cef0822,task_memcg=/docker/1fd0c04db8fe22ff55da0a157e626582f6302a1cf5a1e8579c721da54cef0822/libpod_parent/libpod-eaa7d0e0ecb25fa0da4d48fe6bb840e1bec91065dd40d679ac36ec0e3d621e48,task=mem_limit,pid=490,uid=0

Rerun nested mem limit with a shell and double check the cgroup type

/system.me # podman run -it --entrypoint /bin/sh skepticoital/mem_limit:hmmm
/ # cat /sys/fs/cgroup/cgroup.type
domain
/ # /mem_limit
Starting ...
Allocated 0 to 1 MB
-snipped-
Allocated 73 to 74 MB
Allocated 74 to 75 MB
Killed
/ # cat /sys/fs/cgroup/cgroup.type
domain
/ # exit

Double check we used crun

/system.me # podman ps -a
CONTAINER ID  IMAGE                                  COMMAND     CREATED             STATUS                       PORTS       NAMES
70581260adac  docker.io/skepticoital/mem_limit:hmmm              About a minute ago  Exited (127) 33 seconds ago              nostalgic_darwin

/system.me # podman inspect nostalgic_darwin | grep OCIRuntime
          "OCIRuntime": "crun",

@skepticoitusInteruptus
Author

"…Does this work for you?:…"

Nah. But I have no idea why it doesn't 😕 If anything in this first recording jumps out at you, please holler…

crun issue 923 demo #2

What does work for me are the steps I listed and reported in my OP.1

This second recording demonstrates those steps and the expected outcome…2

crun issue 923 demo #3

And last, but not least, after deleting the cgroup dir I created in the previous recording, I create it again afresh.3

Then I follow all your steps you just listed…4

crun issue 923 demo #4





 1 With the Workaround of replacing crun with runc

 2 Using the 1st DockerHub skepticoital/system.me:hmmm image; I do umount|mount cgroups

 3 I don't umount|mount cgroups in this local skepticoital/system.me:hmmm image

 4 Worked without replacing crun with runc

@n1hility
Member

n1hility commented Jun 1, 2022

"…Does this work for you?:…"

Nah. But I have no idea why it doesn't 😕 If anything in this first recording jumps out at you, please holler…

Ah looks like an early step echo +memory > /sys/fs/cgroup/cgroup.subtree_control somehow got transposed to echo +memory > /sys/fs/cgroup/init/cgroup.subtree_control

What does work for me are the steps I listed and reported in my OP.1

This second recording demonstrates those steps and the expected outcome…2

Cool. So once #931 lands in a release, threaded will work on crun. BTW, to add more color to the limitations: once you have a threaded controller you cannot create cgroups below it that reference non-threaded controllers like the memory controller, so anything that might create another container-like construct, or some manual cgroup usage in a container, might not behave as expected or might error.

And last, but not least, after deleting the cgroup dir I created in the previous recording, I create it again afresh.3

Then I follow all your steps you just listed…4

Excellent 🎉

@skepticoitusInteruptus
Author

"…Ah looks like an early step echo +memory > /sys/fs/cgroup/cgroup.subtree_control somehow got transposed to echo +memory > /sys/fs/cgroup/init/cgroup.subtree_control"

Oops! "I see!", said the blind man 👓

…
WIN10PC:/mnt/c/Users/jason# mount -t cgroup2 **cgroup** /sys/fs/cgroup
…

Q: What is the intent of specifying cgroup there instead of cgroup2?

I'm sure there must be an advantage of doing it that way instead of the way I do it (mount -t cgroup2 cgroup2 …).

It's surprising to me that they're not both cgroup2. What's the effective difference?

TIA.

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented Jun 1, 2022

Q: What is the intent of specifying cgroup there instead of cgroup2?

Eventually got around to reading the man pages for mount(8)

"…The proc filesystem is not associated with a special device, and when mounting it, an arbitrary keyword - for example, proc - can be used instead of a device specification…"

So I suppose that, presuming the cgroup2 file system type is a so-called pseudo file system type (like proc is), I'm gonna go out on a limb and guess …

A: Same difference.

Six of one, half a dozen of the other type deal ❔

@n1hility
Member

n1hility commented Jun 1, 2022

Q: What is the intent of specifying cgroup there instead of cgroup2?

Eventually got around to reading the man pages for mount(8)

"…The proc filesystem is not associated with a special device, and when mounting it, an arbitrary keyword - for example, proc - can be used instead of a device specification…"

So I suppose that, presuming the cgroup2 file system type is a so-called pseudo file system type (like proc is), I'm gonna go out on a limb and guess …

A: Same difference.

Six of one, half a dozen of the other type deal ❔

Yes, that's right. The device spec can be named anything for the same reason. The FS mount location (/sys/fs/cgroup) is the contract/API point that everything looks for.
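
In other words, any one of these is equivalent for a cgroup v2 mount; only the mount point matters to consumers:

mount -t cgroup2 cgroup  /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
mount -t cgroup2 none    /sys/fs/cgroup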
