support cgroup v2 (unified hierarchy) #654
Comments
cgroupv2 still doesn't support many of the cgroup controllers we need for runc. The most important one is the device "cgroup", which is a hard requirement for security. As far as I can see, CPU still hasn't been implemented either. Also, many of the other cgroups provide us with protections against other resource exhaustion attacks. |
It is possible to do cgroup v2 for some controllers and cgroup v1 for others, which are still not available for cgroup v2. Memory is the most difficult resource to manage and that's what is fixed in cgroup v2. The device cgroup seems to be fairly straightforward to convert to cgroup v2: add device permissions to existing single hierarchy. |
The other issue is that we need to be running on a distribution which supports cgroupv2 as the default setup with systemd (which is essentially none of them). We can't really use cgroupv2 otherwise because it would require either:
For me, one of the biggest benefits of cgroupv2 is that cgroup namespaces make more sense on v2. Unfortunately, cgroup namespaces don't implement features that would make them useful at the moment (see #774 and #781). So there's that. And yes, we can use both v2 and v1 at the same time, but that doesn't make the implementation any nicer (now we'd have to use two managers with two different "cgroup paths"). |
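To illustrate the "two different cgroup paths" problem, here is a minimal Python sketch (not runc's code) that splits the contents of /proc/<pid>/cgroup into per-controller v1 paths and the single v2 path. It relies only on the documented line format "hierarchy-ID:controller-list:cgroup-path", where the v2 entry has hierarchy ID 0 and an empty controller list. The function name and the sample paths are made up for the example.

```python
def split_cgroup_paths(proc_cgroup_text):
    """Split /proc/<pid>/cgroup content into v1 and v2 cgroup paths.

    Hypothetical helper for illustration only.
    """
    v1_paths = {}        # controller (or name=...) -> path in its v1 hierarchy
    unified_path = None  # path in the single v2 hierarchy, if present
    for line in proc_cgroup_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            # The v2 (unified) entry: ID 0 and no controller list.
            unified_path = path
        else:
            # A v1 hierarchy may carry several co-mounted controllers.
            for name in controllers.split(","):
                v1_paths[name] = path
    return v1_paths, unified_path


# Made-up sample resembling a hybrid host.
sample = """\
5:devices:/docker/abc
4:cpu,cpuacct:/docker/abc
1:name=systemd:/system.slice/docker.service
0::/system.slice/docker.service
"""
v1, v2 = split_cgroup_paths(sample)
print(v1["devices"], v2)
```

On a hybrid host a manager would have to track both results at once, which is the awkwardness described above.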
For context, Ubuntu 16 LTS is on kernel version 4.4 |
Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this, but for practical production use of containers, resource management is a big issue, and memory is the most difficult resource to manage because of its "non-renewable" nature, so to speak. For example, Parallels/Virtuozzo used containers in production for 10+ years and they ended up backporting memory cgroup v2 to the old kernel that they used (RHEL6, if I'm not mistaken). Also, as far as I understand, Google used containers in production for a long time and they had some kernel patches to deal with memory accounting and management. |
cgroup namespaces was a benefit of cgroupv2 😉. The general issue with cgroupv2 is that there just aren't enough controllers enabled for us to be able to use it properly (at a minimum, we'd need the I'd be happy to work on kernel patches to add support for the controllers, but I'd recommend pushing upstream to get more controllers enabled for cgroupv2 -- they just aren't feature complete for us right now and I don't feel good about adding hacks to our cgroup management implementation to deal with cgroupv2's shortcomings.
I understand, but there's also the problem that I'm not sure how we could test our use of cgroupv2 because systemd uses the cgroupv1 hierarchy on almost every distribution (I tried to switch to cgroupv2 on my laptop while my system was running -- it did not end well). |
@cyphar we are in the merge window for 4.9, which will be the next LTS, so it is getting quite late to get support in for the next few years for most distros. Any chance of looking at the kernel patches? I am happy to help with testing; it should be fairly easy on Alpine Linux, as it does not use systemd and so can switch more easily. |
RancherOS (https://github.com/rancher/os) is another option. It does not use systemd and even systemd emulation was removed AFAIK. |
I haven't really had a chance to work on kernel patches recently. However, I did try a few months ago to implement I might take a look sometime next month, but I can't really guarantee anything (I've been swamped quite recently). |
Buffered I/O throttling is another of the biggest benefits of cgroupv2. |
Rawhide just moved to cgroup v2, causing docker/runc to blow up. https://bugzilla.redhat.com/show_bug.cgi?id=1411286
uname -r: 4.10.0-0.rc2.git4.1.fc26.x86_64 |
config: Bring "unique... within this map" back together
This condition landed in 27a05de (Add text about extensions, 2016-06-26, opencontainers#510) with subsequent wording tweaks in 3f0440b (config.md: add empty limit for key of annotations, Dec 28 10:35:19 2016, opencontainers#645) and 2c8feeb (config: Bring "unique... within this map" back together, 2017-01-12, opencontainers#654). However, since eeaccfa (glossary: Make objects explicitly unordered and forbid duplicate names, 2016-09-27, opencontainers#584) we forbid duplicate keys on *all* objects (not just annotations), so this PR removes the redundant annotation-specific condition. Signed-off-by: W. Trevor King <wking@tremily.us>
isn't cpu controller merged for 4.14 already? |
4.14 isn't out yet 😉. CPU and memory have been merged, but there are still some disagreements over some bits (I still have to read through some patches I saw on the ML). @brauner (from the LXC project) gave a nice talk about the more generic issues with cgroupv2: https://www.youtube.com/watch?v=P6Xnm0IhiSo . |
I thought that the cpu controller is merged, so there are no more questions.
The memory controller was there for a longer time, wasn't it?
4.14 is not out, but it is at rc, so probably nothing significant can change.
|
4.14 is out now. |
My reservations about |
Is there any news/development regarding cgroups v2? |
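Not really. freezer is still not enabled on cgroupv2 and there are still arguments about the threaded mode of operation that was merged in 4.14. |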
You don't need freezer.
On Thu, May 24, 2018 at 9:35 PM, Aleksa Sarai wrote:
> Not really. freezer is still not enabled on cgroupv2 and there are still
> arguments about the threaded mode of operation that was merged in 4.14.
|
You don't need it, but you do want it. The main problem is that we'd still need to have a hybrid mode (an idea I've always felt uncomfortable with). |
@cyphar For users who do not use freezer (because they have PID namespaces) and they aren't trying to take live snapshots, do you think it's reasonable to have cgroupv2 support, and be able to have runc use the cgroupv2 "alternate" mode? |
I don't mind having a pure-cgroupv2 implementation, but I don't think it would be ultimately useful. As far as I know, no distribution actually uses cgroupv2 controllers "for real" (to be fair, we are also probably the reason it hasn't happened yet). I unfortunately think that we must have a hybrid implementation, otherwise we won't be able to implement the cgroup parts of the OCI spec fully on ordinary systems (I mean, we can error out and that's compliant but it's not correct). Maybe for a first step pure-cgroupv2 would be fine but I'm not 100% on that. But my main concern is that this actually is going to be harder than you might think to implement. @brauner gave a talk about this last year, specifically in the context of LXC and container runtimes in general. The no-internal-process constraint in particular means that container runtimes will have to do a very large number of dodgy things in order to be able to run containers inside a new cgroup (you have to move the processes from any parent cgroups into a new leaf-node). In addition, In the Docker case this won't be as awful (though it will still be bad) because you can just create a new cgroup at |
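The leaf-node shuffle that the no-internal-process constraint forces can be sketched as an ordered plan of filesystem operations. This is a rough illustration in Python, not how runc or LXC do it: the function name, the "leaf" directory name, and the example path are hypothetical, and it ignores threaded mode and the root cgroup's exemption. It only builds the plan, so it can be inspected without touching a real cgroupfs.

```python
import posixpath


def plan_leaf_migration(parent, pids, controllers, leaf_name="leaf"):
    """Return ordered (op, path, value) steps for emptying a v2 cgroup.

    Hypothetical planner: processes are first moved into a new leaf so
    that controllers can then be enabled for the parent's children.
    """
    leaf = posixpath.join(parent, leaf_name)
    steps = [("mkdir", leaf, None)]
    # cgroup.procs accepts one PID per write, so emit one step per PID.
    for pid in pids:
        steps.append(("write", posixpath.join(leaf, "cgroup.procs"), str(pid)))
    # Only once the parent holds no processes can its subtree_control
    # enable controllers, e.g. "+memory +pids".
    enable = " ".join("+" + name for name in controllers)
    steps.append(
        ("write", posixpath.join(parent, "cgroup.subtree_control"), enable)
    )
    return steps


# Example with an assumed path and made-up PIDs.
for step in plan_leaf_migration("/sys/fs/cgroup/containers", [10, 11],
                                ["memory", "pids"]):
    print(step)
```

The ordering is the point: enabling controllers before the parent is empty is what the kernel refuses, so the moves have to come first.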
I agree, the current OCI spec has been written with cgroup-v1 in mind... the device cgroup and the network classID are tied to cgroup-v1. In cgroup-v2, the same features can be achieved with some equivalents for the device cgroup and net_cls, but that's a different API. So in my opinion, the OCI spec would need an update for cgroup-v2... either include some cgroup-v2 concepts or be abstracted.
Do you refer to the systemd in the container, on the host, or using the container runtime systemd-nspawn? For reference, systemd (on the host) supports 3 options for container runtimes with cgroup-v2. |
Yeah, I think there are two threads here:
I think that the engine should ideally have pluggable backends. The first one should probably just make RPCs to systemd to create slices and scopes. For example, in our system today, we run all containers under /containers.slice. I can imagine something like this:
It might make sense for us to do our own cgroup control eventually, but given how poorly systemd plays with others, and how much investment goes into it, I see no reason to reinvent the wheel. |
FYI |
Fedora 31 is switching to cgroupv2 entirely (and will start using |
@cyphar @giuseppe @filbranden and others interested in this, I think we need to settle on what we want to do in the runtime spec soon. I am leaning towards having a separate cgroupv2 struct in the spec and then allowing conversion in runc if needed. wdyt? |
Does cgroup v2 support limiting page cache usage? |
@timchenxiaoyu Yes, in the same sense that the v1 controller limits page cache too. |
@cyphar I think this can be closed and now we should create separate issues for remaining tasks |
OK, I'll close this and we will work from the remaining individual issues. |
So, where is the new issue about cgroup? |
Basic support for The major remaining issues are:
I think maintainers should set Maybe we should also discuss the design of Manager and Subsystem structs: #2148 (comment) |
What about a milestone?
|
I wrote a blog post about this: https://medium.com/nttlabs/cgroup-v2-596d035be4d7 There is no official milestone, but I think it will be almost feature-complete when #2144 and #2149 get merged. |
Rootful mode seems almost feature-complete now. Rootless still doesn't work for cgroup2: #2163 |
Docker fails to start with "Devices cgroup isn't mounted." According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in>
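As a rough illustration of the three setups named in the commit message above, the following Python sketch classifies a host from the text of /proc/mounts by looking at the filesystem type column ("cgroup" for v1, "cgroup2" for v2). It is a simplification under stated assumptions: it treats any mix of the two as "hybrid", which matches systemd's layout but would also match a manually mounted cgroup2 on a legacy host. The function name and sample lines are made up.

```python
def classify_cgroup_setup(proc_mounts_text):
    """Classify a host as 'unified', 'hybrid', 'legacy' or 'none'.

    Hypothetical helper; input is the text of /proc/mounts, whose
    fields are: device, mountpoint, fstype, options, dump, pass.
    """
    has_v1 = False
    has_v2 = False
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        fstype = fields[2]
        if fstype == "cgroup":
            has_v1 = True
        elif fstype == "cgroup2":
            has_v2 = True
    if has_v2 and not has_v1:
        return "unified"
    if has_v2 and has_v1:
        return "hybrid"
    if has_v1:
        return "legacy"
    return "none"


# Made-up sample lines for a unified host.
unified_sample = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec 0 0\n"
print(classify_cgroup_setup(unified_sample))
```

A "Devices cgroup isn't mounted" style failure corresponds to the "unified" case here, where no v1 devices hierarchy exists at all.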
Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243. According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in> Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
I compiled the list of leftover TODOs: #2209 |
Last blocker got fixed, so updating runc should be last thing to do. opencontainers/runc#654
cgroup v2 (unified hierarchy) is now official in 4.5:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8
cgroup v2 should have more sensible behavior:
https://www.youtube.com/watch?v=PzpG40WiEfM
moby/moby#16238