support cgroup v2 (unified hierarchy) #654

Closed
sols1 opened this issue Mar 17, 2016 · 37 comments

@sols1

sols1 commented Mar 17, 2016

cgroup v2 (unified hierarchy) is now official in 4.5:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8

cgroup v2 should have more sensible behavior:

https://www.youtube.com/watch?v=PzpG40WiEfM

moby/moby#16238

@cyphar
Member

cyphar commented Mar 22, 2016

cgroupv2 still doesn't support many of the cgroup controllers we need for runc. The most important one is the device "cgroup", which is a hard requirement for security. As far as I can see, CPU still hasn't been implemented either. Also, many of the other cgroups provide us with protections against other resource exhaustion attacks.

@sols1
Author

sols1 commented Apr 27, 2016

It is possible to do cgroup v2 for some controllers and cgroup v1 for others, which are still not available for cgroup v2.

Memory is the most difficult resource to manage and that's what is fixed in cgroup v2.

The device cgroup seems fairly straightforward to convert to cgroup v2: add device permissions to the existing single hierarchy.

@cyphar
Member

cyphar commented Apr 27, 2016

The other issue is that we need to be running on a distribution which supports cgroupv2 as the default setup with systemd (which is essentially none of them). We can't really use cgroupv2 otherwise because it would require either:

  • Moving all of the processes in the system to the v2 equivalent. But because of the internal node (and threadgroup) constraints this won't be pretty and we'd be changing distro policy.
  • Moving just the subtree to the v2 equivalent. While this is technically allowed, the documentation makes it clear that it's a development tool and shouldn't be used for production purposes.

For me, one of the biggest benefits of cgroupv2 is that cgroup namespaces make more sense on v2. Unfortunately, cgroup namespaces don't implement features that would make them useful at the moment (see #774 and #781). So there's that.

And yes, we can use both v2 and v1 at the same time, but that doesn't make the implementation any nicer (now we'd have to use two managers with two different "cgroup paths").
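To make the "two different cgroup paths" point concrete, here is a minimal sketch (not runc code) that parses the contents of /proc/<pid>/cgroup. On a system that mixes v1 and v2, each v1 hierarchy has its own line and path, while the v2 hierarchy appears as the single entry with hierarchy ID 0 and an empty controller list. The names `CgroupEntry`, `parse_proc_cgroup` and `split_v1_v2` are made up for illustration.

```python
from typing import NamedTuple, Optional


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: tuple  # empty for the v2 (unified) entry
    path: str


def parse_proc_cgroup(text: str) -> list:
    """Parse the contents of /proc/<pid>/cgroup into CgroupEntry records."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        # Split at most twice because the path itself may contain colons.
        hid, ctrls, path = line.split(":", 2)
        controllers = tuple(c for c in ctrls.split(",") if c)
        entries.append(CgroupEntry(int(hid), controllers, path))
    return entries


def split_v1_v2(entries):
    """Return (v1 path per controller, v2 path or None)."""
    v1 = {}
    v2: Optional[str] = None
    for e in entries:
        if e.hierarchy_id == 0 and not e.controllers:
            # The unified hierarchy: one path for all v2 controllers.
            v2 = e.path
        else:
            # A v1 hierarchy (including named ones such as "name=systemd").
            for c in e.controllers:
                v1[c] = e.path
    return v1, v2
```

A manager that supports both versions would have to track the per-controller v1 paths and the single v2 path separately, which is the extra complexity described above.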

@rodionos

rodionos commented May 1, 2016

For context, Ubuntu 16 LTS is on kernel version 4.4
https://wiki.ubuntu.com/XenialXerus/ReleaseNotes#Linux_kernel_4.4

@sols1
Author

sols1 commented May 24, 2016

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this but for practical production use of containers resource management is a big issue and memory is the most difficult resource to manage because of its "non-renewable" nature so to speak.

For example, Parallels/Virtuozzo used containers in production for 10+ years and they ended up backporting memory cgroup v2 to the old kernel that they used (RHEL6, if I'm not mistaken).

Also, as far as I understand Google used containers in production for a long time and they had some kernel patches to deal with memory accounting and management.

@cyphar
Member

cyphar commented May 24, 2016

@sols1

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this but ...

cgroup namespaces was a benefit of cgroupv2 😉. The general issue with cgroupv2 is that there just aren't enough controllers enabled for us to be able to use it properly (at a minimum, we'd need the freezer and device cgroups), and using both cgroupv2 and cgroupv1 together will make the implementation more complicated than it needs to be. On the plus side, we don't need the net_* controllers in cgroupv2 (they won't ever be added to cgroupv2) because you can now specify iptables rules by cgroup path (which AFAIK is namespaced by cgroups).

I'd be happy to work on kernel patches to add support for the controllers, but I'd recommend pushing upstream to get more controllers enabled for cgroupv2 -- they just aren't feature complete for us right now and I don't feel good about adding hacks to our cgroup management implementation to deal with cgroupv2's shortcomings.

but for practical production use of containers resource management is a big issue and memory is the most difficult resource to manage because of its "non-renewable" nature so to speak.

I understand, but there's also the problem that I'm not sure how we could test our use of cgroupv2 because systemd uses the cgroupv1 hierarchy on almost every distribution (I tried to switch to cgroupv2 on my laptop while my system was running -- it did not end well).

@justincormack
Contributor

@cyphar we are in the merge window for 4.9, which will be the next LTS, so it is getting quite late to get support in for the next few years for most distros. Any chance of looking at the kernel patches?

I am happy to help with testing; it should be fairly easy on Alpine Linux, as it does not use systemd and so can change more easily.

@sols1
Author

sols1 commented Oct 19, 2016

RancherOS (https://github.com/rancher/os) is another option. It does not use systemd and even systemd emulation was removed AFAIK.

@cyphar
Member

cyphar commented Oct 20, 2016

I haven't really had a chance to work on kernel patches recently. However, I did try a few months ago to implement freezer so it worked with cgroupv2 -- as far as I can tell it's not really that trivial to do. Namely, there are some edge cases that made the handling unclear. And I looked at the devices code, but it's quite a bit more complicated than the freezer code.

I might take a look sometime next month, but I can't really guarantee anything (I've been swamped quite recently).

@hustcat
Contributor

hustcat commented Nov 29, 2016

Buffered I/O throttling is another of the biggest benefits of cgroupv2.

@rhatdan
Contributor

rhatdan commented Jan 9, 2017

Rawhide just moved to cgroup v2, causing docker/runc to blow up.

https://bugzilla.redhat.com/show_bug.cgi?id=1411286

docker run -ti fedora bash
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"cgroup\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay/e1432a26e33bebbc27619c9802d9218f3da8938b7f1696ca9be0890a2e75ac65/merged\\\\\\\" at \\\\\\\"/sys/fs/cgroup\\\\\\\" caused \\\\\\\"no subsystem for mount\\\\\\\"\\\"\"\n".

uname -r

4.10.0-0.rc2.git4.1.fc26.x86_64
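A failure like the one above could be anticipated by checking how cgroups are mounted before trying to mount per-controller v1 hierarchies. The sketch below is only a heuristic, not what runc does: it classifies the text of /proc/mounts using the "unified" / "hybrid" / "legacy" terminology that systemd uses, based purely on whether `cgroup` (v1) and `cgroup2` (v2) filesystem types are present. The function name `cgroup_mode` is hypothetical.

```python
def cgroup_mode(mounts_text: str) -> str:
    """Classify a /proc/mounts dump as 'unified', 'hybrid', 'legacy' or 'none'.

    Assumption: the presence of both filesystem types is treated as hybrid,
    which is a simplification of systemd's actual hybrid layout.
    """
    has_v1 = False
    has_v2 = False
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        fields = line.split()
        if len(fields) < 3:
            continue
        fstype = fields[2]
        if fstype == "cgroup":
            has_v1 = True
        elif fstype == "cgroup2":
            has_v2 = True
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "unified"
    if has_v1:
        return "legacy"
    return "none"
```

Typical use would be `cgroup_mode(open("/proc/mounts").read())`; a result of "unified" means there are no v1 controller hierarchies for a v1-only runtime to bind-mount.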

stefanberger pushed a commit to stefanberger/runc that referenced this issue Sep 8, 2017
config: Bring "unique... within this map" back together
stefanberger pushed a commit to stefanberger/runc that referenced this issue Sep 8, 2017
This condition landed in 27a05de (Add text about extensions,
2016-06-26, opencontainers#510) with subsequent wording tweaks in 3f0440b
(config.md: add empty limit for key of annotations, Dec 28 10:35:19
2016, opencontainers#645) and 2c8feeb (config: Bring "unique... within this map"
back together, 2017-01-12, opencontainers#654).  However, since eeaccfa (glossary:
Make objects explicitly unordered and forbid duplicate names,
2016-09-27, opencontainers#584) we forbid duplicate keys on *all* objects (not just
annotations), so this PR removes the redundant annotation-specific
condition.

Signed-off-by: W. Trevor King <wking@tremily.us>
@webczat

webczat commented Oct 6, 2017

Isn't the cpu controller merged for 4.14 already?

@cyphar
Member

cyphar commented Oct 7, 2017

4.14 isn't out yet 😉. CPU and memory have been merged, but there are still some disagreements over some bits (I still have to read through some patches I saw on the ML).

@brauner (from the LXC project) gave a nice talk about the more generic issues about cgroupv2: https://www.youtube.com/watch?v=P6Xnm0IhiSo .

@webczat

webczat commented Oct 7, 2017 via email

@sargun

sargun commented Nov 13, 2017

4.14 is out now.

@cyphar
Member

cyphar commented Nov 14, 2017

My reservations about cgroupv2's shortcomings (and the issues with the "hybrid" mode of operation) still hold. Not to mention that (last I tried) I wasn't able to get a system to boot with cgroupv2 enabled -- which doesn't bode well for testing any of that code.

@redbaron

Is there any news/development regarding cgroups v2?

@cyphar
Member

cyphar commented May 25, 2018

Not really. freezer/devices are still not enabled on cgroupv2 and there are still arguments about the threaded mode of operation that was merged in 4.14.

@sargun

sargun commented May 25, 2018 via email

@cyphar
Member

cyphar commented Jun 3, 2018

You don't need it, but you do want it. The main problem is that we'd still need to have a hybrid mode (an idea I've always felt uncomfortable with).

@sargun

sargun commented Sep 24, 2018

@cyphar For users who do not use freezer (because they have PID namespaces) and who aren't trying to take live snapshots, do you think it's reasonable to have cgroupv2 support, and be able to have runc use the cgroupv2 "alternate" mode?

@cyphar
Member

cyphar commented Sep 25, 2018

I don't mind having a pure-cgroupv2 implementation, but I don't think it would be ultimately useful. As far as I know, no distribution actually uses cgroupv2 controllers "for real" (to be fair, we are also probably the reason it hasn't happened yet). Unfortunately, I think that we must have a hybrid implementation, otherwise we won't be able to implement the cgroup parts of the OCI spec fully on ordinary systems (I mean, we can error out and that's compliant, but it's not correct). Maybe for a first step pure-cgroupv2 would be fine, but I'm not 100% sure on that.

But my main concern is that this actually is going to be harder than you might think to implement. @brauner gave a talk about this last year, specifically in the context of LXC and container runtimes in general. The no-internal-process constraint in particular means that container runtimes will have to do a very large amount of dodgy things in order to be able to run containers inside a new cgroup (you have to move the processes from any parent cgroups into a new leaf-node). In addition, subtree_control gives you quite a few headaches because some parent cgroup could limit your ability to create new

In the Docker case this won't be as awful (though it will still be bad) because you can just create a new cgroup at /docker/FOO which will avoid some of the internal-process constraint issues (it's very unlikely that the cgroup is completely unused and so / will not be a leaf node). But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2 -- especially since in cgroupv2 they have the same problem as us with the internal-process constraint.
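As a rough illustration of the no-internal-process constraint discussed above: in cgroup v2, a non-root cgroup can only distribute domain controllers to its children (via `cgroup.subtree_control`) when it has no member processes of its own. The sketch below checks that precondition by reading the cgroup's interface files. It is a simplified assumption-laden example (it ignores threaded mode and delegation limits imposed by parent cgroups), and the helper names are hypothetical.

```python
import os


def _read_tokens(path: str) -> list:
    """Return the whitespace-separated tokens in a file, or [] if it is missing."""
    try:
        with open(path) as f:
            return f.read().split()
    except FileNotFoundError:
        return []


def can_enable_subtree_control(cgroup_dir: str, is_root: bool = False) -> bool:
    """True if controllers could be enabled for children of cgroup_dir.

    The root cgroup is exempt from the constraint. Any other cgroup must
    have an empty cgroup.procs, so its processes would first have to be
    moved into a leaf child.
    """
    if is_root:
        return True
    procs = _read_tokens(os.path.join(cgroup_dir, "cgroup.procs"))
    return not procs


def subtree_control_request(controllers) -> str:
    """Format the string written to cgroup.subtree_control, e.g. '+cpu +memory'."""
    return " ".join("+" + c for c in controllers)
```

A runtime that finds `can_enable_subtree_control()` returning False for its parent cgroup is in exactly the situation described above: it has to relocate the existing processes into a new leaf before it can create a constrained child for the container.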

@alban
Contributor

alban commented Sep 25, 2018

we won't be able to implement the cgroup parts OCI spec fully on ordinary systems

I agree, the current OCI spec has been written with cgroup-v1 in mind... the device cgroup and the network classID are tied to cgroup-v1.

In cgroup-v2, the same features can be achieved with some equivalents for device cgroup and net_cls, but that's a different API.

So in my opinion, the OCI spec would need an update for cgroup-v2... either include some cgroup-v2 concepts or be abstracted.

But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2

Do you refer to the systemd in the container, on the host, or using the container runtime systemd-nspawn?

For reference, systemd (on the host) supports 3 options for container runtimes with cgroup-v2.

@sargun

sargun commented Sep 25, 2018

Yeah, I think there are two threads here:

  1. We need to change the OCI spec to accommodate cgroupv2, and not be as "prescriptive" about how cgroups are implemented.
  2. We need a cgroupv2 engine

I think that the engine should ideally have pluggable backends. The first one should probably just make RPCs to systemd to create slices and scopes. For example, in our system today, we run all containers under /containers.slice. I can imagine something like this:

/containers.slice/..
        (The following scopes are created by systemd with Delegate=true)
        /container-1.scope (Resource constraints exist here)
        /container-2.scope

It might make sense for us to do our own cgroup control eventually, but given how poorly systemd plays with others, and how much investment goes into it, I see no reason to reinvent the wheel.
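One way to picture the slice/scope layout above is with the `systemd-run` command line rather than direct RPCs to systemd; that substitution is mine, chosen only because it is easy to show. The sketch builds an argument list that asks systemd to run a command in a delegated scope under `containers.slice`, and returns the cgroup path where the scope would be expected on a unified hierarchy. The function name `scope_command` is hypothetical, and the path calculation assumes a top-level slice mounted under /sys/fs/cgroup.

```python
def scope_command(name, command, slice_name="containers.slice"):
    """Build a systemd-run invocation for a delegated scope.

    Returns (argv, expected_cgroup_path). Nothing is executed here.
    """
    unit = f"{name}.scope"
    argv = [
        "systemd-run",
        "--scope",                     # create a transient scope unit
        f"--unit={unit}",              # e.g. container-1.scope
        f"--slice={slice_name}",       # place it under containers.slice
        "--property=Delegate=yes",     # let the runtime manage the subtree
        "--",
    ] + list(command)
    # Assumption: top-level slice on a unified hierarchy at /sys/fs/cgroup.
    cgroup_path = f"/sys/fs/cgroup/{slice_name}/{unit}"
    return argv, cgroup_path
```

For example, `scope_command("container-1", ["sleep", "60"])` yields an argv that could be passed to `subprocess.run`, with resource constraints then applied to the returned scope path.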

@arianvp

arianvp commented Jun 23, 2019

FYI systemd-nspawn actually implements the OCI spec since the latest release and I assume it works with cgroups v2 (I would be surprised if it didn't) so perhaps that isn't such a large blocker as we thought? https://github.com/systemd/systemd/blob/916f595c7cbe5dd5028a23a17a245ef19e8f6a29/NEWS#L628

@cyphar
Member

cyphar commented Aug 25, 2019

Fedora 31 is switching to cgroupv2 entirely (and will start using crun as a result -- because it supports cgroupv2). I guess now it's do or die (work is being done in #2114).

@mrunalp
Contributor

mrunalp commented Aug 25, 2019

@cyphar @giuseppe @filbranden and others interested in this, I think we need to settle on what we want to do in the runtime spec soon. I am leaning towards having a separate cgroupv2 struct in the spec and then allowing conversion in runc if needed. wdyt?
We can add this to the agenda for the OCI call if we think it is better discussed in sync there.

@timchenxiaoyu

Does cgroup v2 support limiting page cache usage?

@Werkov

Werkov commented Oct 11, 2019

@timchenxiaoyu Yes. In the sense that v1 controller limits page cache too.
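To see how much page cache is charged to a cgroup under either version, one can read its `memory.stat` file. The sketch below parses that file's text and returns the page cache figure, assuming the usual key names: `file` on cgroup v2 and `cache` on cgroup v1. The function name `page_cache_bytes` is made up for this example.

```python
def page_cache_bytes(memory_stat_text: str):
    """Return the page cache bytes reported in a memory.stat dump, or None."""
    stats = {}
    for line in memory_stat_text.splitlines():
        parts = line.split()
        # Each line is "<key> <value>"; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    if "file" in stats:        # cgroup v2 key
        return stats["file"]
    return stats.get("cache")  # cgroup v1 key, or None if absent
```

Because page cache is charged to the cgroup, it counts toward the memory limit in both versions, which is the sense of the answer above.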

@AkihiroSuda
Member

@cyphar I think this can be closed, and now we should create separate issues for the remaining tasks.

@crosbymichael
Member

OK, I'll close this and we will work out of individual issues for the remaining tasks.

@Jamlee

Jamlee commented Oct 25, 2019

So, where is the new issue about cgroup?

@AkihiroSuda
Member

AkihiroSuda commented Oct 25, 2019

Basic support for cpu, cpuset, memory, pids, io (blkio), and freezer controllers is already done. (#2113)

The major remaining issues are:

I think maintainers should set cgroup2 labels so that people can easily find them.

Maybe we should also discuss the design of Manager and Subsystem structs: #2148 (comment)

@Keruspe

Keruspe commented Oct 25, 2019 via email

@AkihiroSuda
Member

I wrote a blog post about this: https://medium.com/nttlabs/cgroup-v2-596d035be4d7

There is no official milestone, but I think it will be almost feature-complete when #2144 and #2149 get merged.

@AkihiroSuda
Member

Rootful mode seems almost feature-complete now.

Rootless still doesn't work for cgroup2: #2163

paralin added a commit to skiffos/buildroot that referenced this issue Nov 11, 2019
Docker fails to start with "Devices cgroup isn't mounted." According to the
systemd documentation:

  systemd now defaults to the "unified" cgroup hierarchy setup during
  build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default.
  Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream
  production distributions might want to continue to use
  -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately
  the popular container managers have not caught up with the kernel API changes.

Changing this option to "hybrid" or "legacy" fixes the Docker startup.

Reference: opencontainers/runc#654

Signed-off-by: Christian Stewart <christian@paral.in>
buildroot-auto-update pushed a commit to buildroot/buildroot that referenced this issue Nov 11, 2019
Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243.
According to the systemd documentation:

  systemd now defaults to the "unified" cgroup hierarchy setup during
  build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default.
  Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream
  production distributions might want to continue to use
  -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately
  the popular container managers have not caught up with the kernel API changes.

Changing this option to "hybrid" or "legacy" fixes the Docker startup.

Reference: opencontainers/runc#654

Signed-off-by: Christian Stewart <christian@paral.in>
Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
@AkihiroSuda
Member

I compiled the list of leftover TODOs: #2209

Joseph-Conley pushed a commit to LairdCP/wb-buildroot that referenced this issue Jun 10, 2020
Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243.
According to the systemd documentation:

  systemd now defaults to the "unified" cgroup hierarchy setup during
  build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default.
  Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream
  production distributions might want to continue to use
  -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately
  the popular container managers have not caught up with the kernel API changes.

Changing this option to "hybrid" or "legacy" fixes the Docker startup.

Reference: opencontainers/runc#654

Signed-off-by: Christian Stewart <christian@paral.in>
Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
tpgxyz added a commit to OpenMandrivaAssociation/systemd that referenced this issue Jun 17, 2020
Last blocker got fixed, so updating runc should be last thing to do.
opencontainers/runc#654
tpgxyz added a commit to OpenMandrivaAssociation/systemd that referenced this issue May 7, 2021
Last blocker got fixed, so updating runc should be last thing to do.
opencontainers/runc#654
jhatler pushed a commit to rfpros/wb-buildroot that referenced this issue Feb 2, 2023
Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243.
According to the systemd documentation:

  systemd now defaults to the "unified" cgroup hierarchy setup during
  build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default.
  Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream
  production distributions might want to continue to use
  -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately
  the popular container managers have not caught up with the kernel API changes.

Changing this option to "hybrid" or "legacy" fixes the Docker startup.

Reference: opencontainers/runc#654

Signed-off-by: Christian Stewart <christian@paral.in>
Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>