Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Implement systemd-specific per-cgroup support (+ proof-of-concept "devices" and "memory") #1991

Closed
wants to merge 3 commits into from

Conversation

filbranden
Copy link
Contributor

This PR is trying to accomplish two things:

  1. Define a new interface that will allow subsystems/controllers to
    implement systemd-based configuration, by using systemd directives
    rather than writing directly to the cgroup subtree.

  2. Add a systemd-based implementation to the "devices" subsystem, to
    illustrate how it is meant to be used.

The initial point I'd like to make here is towards discussing (1) as an idea and whether the Go abstractions/interfaces are appropriate here or whether we should move things around.

Consider part (2) to be really a draft and not really finished (even though it actually works to a large extent, the D-Bus messages are correct and that has been tested to do what's expected.)

I tested this with Podman using:

$ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 echo hello

And also bringing up a container and checking the contents of "device.list" in the cgroup subtree:

$ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 sleep 1h
$ cat /sys/fs/cgroup/devices/machine.slice/libpod-12fc7bd62fd6*/devices.list
c 10:200 rwm
c 5:2 rwm
c 136:* rwm
c 5:1 rwm
c 1:9 rwm
c 1:5 rwm
c 5:0 rwm
c 1:7 rwm
c 1:8 rwm
c 1:3 rwm
b *:* m
c *:* m

This matches the output of devices.list when using the official "runc" binary, only difference being the lines are inverted in order (again, we can fix that on a second step.)

Querying systemd for this unit also works as expected:

$ systemctl show libpod-12fc7bd62fd66ff62fa1b045c2d717c7b2076c072c20de14f5c1ad86b78865eb.scope -p DevicePolicy -p DeviceAllow
DevicePolicy=strict
DeviceAllow=/dev/net/tun rwm
DeviceAllow=/dev/ptmx rwm
DeviceAllow=char-136 rwm
DeviceAllow=/dev/console rwm
DeviceAllow=/dev/urandom rwm
DeviceAllow=/dev/zero rwm
DeviceAllow=/dev/tty rwm
DeviceAllow=/dev/full rwm
DeviceAllow=/dev/random rwm
DeviceAllow=/dev/null rwm
DeviceAllow=block-* m
DeviceAllow=char-* m

/cc @cyphar @crosbymichael @mrunalp

Also @rhatdan

Maybe @sargun might be interested, since he was working on #1708 and this here is a step towards supporting cgroupv2 in unified hierarchy in runc/libcontainer.

Also @AkihiroSuda who has been working on rootless/usernetes and has been asking about this.

Cheers!
Filipe

@rhatdan
Copy link
Contributor

rhatdan commented Feb 20, 2019

@mrunalp PTAL

@mrunalp
Copy link
Contributor

mrunalp commented Feb 26, 2019

I like the approach of subsystem returning systemd properties 👍 Looking at the rest in more detail. Thanks!

@filbranden filbranden changed the title [RFC] Implement systemd-specific per-cgroup support, add it to "devices" cgroup [RFC] Implement systemd-specific per-cgroup support (+ proof-of-concept "devices" and "memory") Mar 10, 2019
@filbranden
Copy link
Contributor Author

So I have just broken this down into two separate commits, one for the infrastructure of detecting subsystems that support systemd and the second for handling the devices cgroup.

Furthermore, on the second commit handling devices, I did a small fix to the implementation, when we have major/minor, I'm using the syntax /dev/{block,char}/<major>:<minor>, which is supported by systemd.

I also added a third commit, handling the memory cgroup.

The memory cgroup was quite interesting, since a lot changed in cgroupv2 and many of the options are no longer available. (Also see related issue opencontainers/runtime-spec#1005 I opened about the OCI spec for memory and comparing it to how memory controller works in cgroupv2.)

In doing the split, my objective was to keep the original behavior whenever the "legacy" or "hybrid" hierarchy was in use, so I split those into a separate SetCgroupv1() method that we can still keep calling.

I tested this locally and it's working. Granted, it needs a lot more testing, we want to address that, but I'd like to get some feedback on the code itself and the methods I'm introducing. I'm happy to do changes to the API, but I'd like to have some feedback of what would look good to you.

@cyphar, @crosbymichael Would you mind taking a look at the code and see if you'd like to have structural changes on where this code goes and whether the function prototypes look OK to you?

Very specifically, this part here (which essentially determines how all the rest works):

type systemdSubsystem interface {
       // Returns a list of systemd properties to manage the underlying cgroups.
       ToSystemdProperties(cgroup *configs.Cgroup) ([]systemdDbus.Property, error)
       // Set the cgroupv1 attributes only, which can be used when in hybrid or legacy mode
       // to keep setting properties that are not (and will not be) managed by systemd.
       SetCgroupv1(path string, cgroup *configs.Cgroup) error
}

Thanks!
Filipe

Define a new interface that will allow subsystems/controllers to
implement systemd-based configuration, by using systemd directives
rather than writing directly to the cgroup subtree.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
I tested this with Podman using:

  $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 echo hello

And also bringing up a container and checking the contents of
"device.list" in the cgroup subtree:

  $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 sleep 1h
  $ cat /sys/fs/cgroup/devices/machine.slice/libpod-12fc7bd62fd6*/devices.list
  c 10:200 rwm
  c 5:2 rwm
  c 136:* rwm
  c 5:1 rwm
  c 1:9 rwm
  c 1:5 rwm
  c 5:0 rwm
  c 1:7 rwm
  c 1:8 rwm
  c 1:3 rwm
  b *:* m
  c *:* m

This matches the output of devices.list when using the official "runc"
binary, only difference being the lines are inverted in order (again, we
can fix that on a second step.)

Querying systemd for this unit also works as expected:

  $ systemctl show libpod-12fc7bd62fd66ff62fa1b045c2d717c7b2076c072c20de14f5c1ad86b78865eb.scope -p DevicePolicy -p DeviceAllow
  DevicePolicy=strict
  DeviceAllow=/dev/net/tun rwm
  DeviceAllow=/dev/ptmx rwm
  DeviceAllow=char-136 rwm
  DeviceAllow=/dev/console rwm
  DeviceAllow=/dev/urandom rwm
  DeviceAllow=/dev/zero rwm
  DeviceAllow=/dev/tty rwm
  DeviceAllow=/dev/full rwm
  DeviceAllow=/dev/random rwm
  DeviceAllow=/dev/null rwm
  DeviceAllow=block-* m
  DeviceAllow=char-* m

v2: Now using systemd support for /dev/{char,block}/<major>:<minor>
instead, which results into these properties being set instead:

  DevicePolicy=strict
  DeviceAllow=/dev/char/10:200 rwm
  DeviceAllow=/dev/char/5:2 rwm
  DeviceAllow=char-136 rwm
  DeviceAllow=/dev/char/5:1 rwm
  DeviceAllow=/dev/char/1:9 rwm
  DeviceAllow=/dev/char/1:5 rwm
  DeviceAllow=/dev/char/5:0 rwm
  DeviceAllow=/dev/char/1:7 rwm
  DeviceAllow=/dev/char/1:8 rwm
  DeviceAllow=/dev/char/1:3 rwm
  DeviceAllow=block-* m
  DeviceAllow=char-* m

Unclear whether we should really be using this unconditionally, or only
in the cases where a device name hasn't been supplied through d.Path.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
This is a preliminary implementation setting systemd properties rather
than writing to the cgroup files directly.

Given systemd doesn't support most of the cgroupv1 properties, these
continue being set using a new SetCgroupv1() method, which keeps
compatibility in case the system is using "legacy" or "hybrid" hierarchy
(but which can be skipped when using "unified" hierarchy, at the cost of
change in behavior.)

Tested with podman:

  $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run --memory 1g --memory-reservation 100m --memory-swap 2g --memory-swappiness 50 -t fedora:29 sleep 1h

Inspecting the systemd unit:

  $ systemctl show libpod-30a387de72953df5368645405c04e7aa7253fc31be2ef73f334e2753f652e34e.scope -p MemoryLow -p MemoryMax -p MemorySwapMax -p MemoryLimit
  MemoryLow=104857600
  MemoryMax=1073741824
  MemorySwapMax=2147483648
  MemoryLimit=1073741824

And looking at the files which are set in the cgroup tree directly:

  $ cat /sys/fs/cgroup/memory/machine.slice/libpod-30a387de72953df5368645405c04e7aa7253fc31be2ef73f334e2753f652e34e.scope/memory.swappiness
  50

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
@filbranden
Copy link
Contributor Author

I have now rebased this on top of #2014, so the Travis CI checks will include a run of localintegration using the systemd cgroup driver.

@filbranden
Copy link
Contributor Author

@crosbymichael Can you please do a code review and take a look at the API? I'm happy to rework the code to change the internal API, just looking for some feedback to see what you think would be acceptable here... Thanks!

@filbranden
Copy link
Contributor Author

Closing this one, since this approach will not work with older systemd versions (Travis-CI running systemd v229 on Ubuntu Xenial 16.04 shows that clearly.) See #2007 for a more lengthy update on the proposed new approach. And also #2047 which starts to refactor the code to allow for that switch.

Cheers!
Filipe

@filbranden filbranden closed this May 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants