
Improve libcontainer's systemd cgroup driver #2007

Open
filbranden opened this issue Mar 10, 2019 · 3 comments

Comments

@filbranden (Contributor)

I'm opening this issue to track improvements to the systemd cgroup driver that I and some others are working on.

The main motivations to improve the systemd cgroup driver are:

  • It's a path towards cgroupv2 support, since what the systemd API exports matches the cgroupv2 feature set. systemd also supports three hierarchy setups ("legacy", "hybrid" and "unified") to help Linux distributions and administrators make a gradual migration to v2.

  • To support the "rootless" effort of running a container manager without root privileges. Projects such as usernetes are making strides in that direction. They rely on systemd to manage cgroups, since writing to the cgroup tree directly would require root privileges.

  • To minimize future breakage from changes in the kernel implementation and in systemd versions. The systemd API encoding cgroupv2 attributes is expected to be stable and to be maintained even if future changes get introduced. New restrictions may start being imposed to ensure the hierarchy works as expected (non-authoritative agents shouldn't be able to "hijack" resources from the machine), and doing delegation correctly is a key step in preventing DoS attacks through the cgroup tree.

The steps I have in mind for this series of refactorings are:

  1. Set systemd unit properties rather than writing to the cgroup tree. PR [RFC] Implement systemd-specific per-cgroup support (+ proof-of-concept "devices" and "memory") #1991 is my WIP towards that goal. The idea is to extend the API so it can encode systemd property settings that reflect the configuration, and then migrate all controllers to set systemd properties rather than writing directly to the cgroup subtree (a sketch of what this looks like follows this list).

  2. Support working under a unified hierarchy. I have a branch which does most of that (assuming step 1 has been addressed; I'm taking some shortcuts there). Setting properties goes through systemd, so paths don't matter there. Reading stats goes through the tree, but it's easy to detect which hierarchy is in use and to adjust paths to match the correct one.

  3. Create sub-cgroups of a delegated systemd scope, to pass to containers. This follows the recommendations from systemd, which are in turn based on the cgroupv2 implementation (particularly the workings of cgroup.subtree_control), so having libcontainer support this delegation protocol would definitely be advantageous (a second sketch after this list shows the idea).
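
For step 1, here is a minimal sketch of the direction, not code from #1991: the unit name, slice and limit values are made up, and it assumes a systemd recent enough to know the cgroupv2-style properties (MemoryMax, CPUWeight, TasksMax). It uses github.com/coreos/go-systemd/dbus to start a transient scope whose resource limits are expressed as unit properties instead of writes to the cgroup tree:

```go
package cgroupsketch

import (
	"fmt"

	systemd "github.com/coreos/go-systemd/dbus"
	dbus "github.com/godbus/dbus"
)

// startScopeWithLimits adopts pid into a transient scope and encodes the
// resource limits as systemd unit properties. systemd translates those
// properties into cgroupv1 or cgroupv2 attributes for whichever hierarchy it
// is managing, so the caller never touches the cgroup tree directly.
func startScopeWithLimits(pid int) error {
	conn, err := systemd.New() // connect to the system bus
	if err != nil {
		return err
	}
	defer conn.Close()

	props := []systemd.Property{
		systemd.PropDescription("libcontainer example container"),
		systemd.PropSlice("machine.slice"),
		// Process to adopt into the new scope.
		{Name: "PIDs", Value: dbus.MakeVariant([]uint32{uint32(pid)})},
		// Resource limits as unit properties (example values).
		{Name: "MemoryMax", Value: dbus.MakeVariant(uint64(256 * 1024 * 1024))},
		{Name: "TasksMax", Value: dbus.MakeVariant(uint64(1024))},
		{Name: "CPUWeight", Value: dbus.MakeVariant(uint64(500))},
	}

	statusCh := make(chan string, 1)
	if _, err := conn.StartTransientUnit("example-container.scope", "replace", props, statusCh); err != nil {
		return err
	}
	if status := <-statusCh; status != "done" {
		return fmt.Errorf("starting transient unit failed: %s", status)
	}
	return nil
}
```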
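
And for step 3, a minimal sketch of the delegation protocol, assuming the scope was started with Delegate=yes and that scopePath points at its cgroup directory (the path, the sub-cgroup name and the controller set are all illustrative):

```go
package cgroupsketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// createDelegatedSubCgroup creates a leaf cgroup under a delegated scope,
// moves pid into it, and then enables controllers for the scope's subtree.
// The order matters on cgroupv2: a cgroup may only enable controllers in
// cgroup.subtree_control once it no longer has member processes of its own.
func createDelegatedSubCgroup(scopePath, name string, pid int) error {
	leaf := filepath.Join(scopePath, name)
	if err := os.Mkdir(leaf, 0o755); err != nil && !os.IsExist(err) {
		return err
	}

	// Move the process down into the leaf first.
	if err := os.WriteFile(filepath.Join(leaf, "cgroup.procs"),
		[]byte(fmt.Sprintf("%d\n", pid)), 0o644); err != nil {
		return err
	}

	// The scope itself is now process-free, so controllers can be enabled
	// for its children.
	return os.WriteFile(filepath.Join(scopePath, "cgroup.subtree_control"),
		[]byte("+memory +pids +io"), 0o644)
}
```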

One large concern here is testing. I was talking to @mrunalp about this and he offered to help. Since running this setup in a container can be challenging, he suggested using an external test runner, possibly one that monitors GitHub PRs and reports results back to GitHub. We'll be working on this testing and will report back here when we have more details. (We might open separate issues to track testing on systemd as well.)

I'm planning to join the OCI call on Wednesday March 13th (also to discuss related opencontainers/runtime-spec#1005 and opencontainers/runtime-spec#1002), so I'm happy to talk more about this libcontainer systemd cgroup driver improvement proposal.

@filbranden (Contributor, Author)

The testing introduced in #2014 runs on Travis CI, which uses Ubuntu Xenial and its systemd v229 (much older than the version I was testing with). That made it clear that it's hard for libcontainer to make heavier use of systemd properties in StartTransientUnit while maintaining compatibility with older versions of systemd, which is an objective we want to preserve.

So I'm changing my approach slightly. Instead of refactoring the current code, I'll write a new implementation that works only on the unified cgroup hierarchy, and use that code only when the unified hierarchy is detected, falling back to the current code when the hybrid or legacy hierarchy is in use.
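
For reference, a minimal sketch of that detection, assuming golang.org/x/sys/unix (the helper name is made up here; in libcontainer it would live wherever the driver selection happens):

```go
package cgroupsketch

import "golang.org/x/sys/unix"

// isCgroup2UnifiedMode reports whether /sys/fs/cgroup itself is a cgroup2
// mount, which is only true on the "unified" hierarchy. On "hybrid" systems
// the cgroup2 mount lives at /sys/fs/cgroup/unified instead, and on "legacy"
// systems there is no cgroup2 mount at all.
func isCgroup2UnifiedMode() (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false, err
	}
	return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}
```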

Backward compatibility of properties is still a relevant issue, so I opened systemd/systemd#12460 requesting a better solution for it. True, that will only be available in systemd v243 or v244, but no mainstream Linux distribution has switched to the unified hierarchy by default yet, and by the time they do they'll probably ship a systemd version that includes that code, so we should be OK. (In any case, the libcontainer code will implicitly check that a systemd version with that feature is running, so while "unified cgroup hierarchy" is the trigger, there are other components that need to be in place for all of this to work correctly.)

This also helps answer the question of having all the features supported in cgroup2 mode (such as the cpuset and freezer controllers), since we will only activate the new implementation once all the components are in place (kernel 5.2 for freezer support in cgroup2, systemd v244 or maybe v245 for the D-Bus RPCs that expose cpuset and freezer, etc.). Again, "unified cgroup hierarchy" is the trigger, but there are more dependencies that need to be in place for it to work.

So I'll close #1991 for now, since the final approach will be different. I opened #2047 to push the constructor of the systemd cgroup driver into the systemd module itself; that way, when there are two separate implementations (one for "legacy" and another for "unified"), that point can serve as the place to detect which one to use.

Thanks!
Filipe

@AkihiroSuda (Member)

What's the next step?

@AkihiroSuda (Member)

@filbranden
For unified mode, I think we only care about really recent versions of systemd.
Maybe #1991 can be reopened?
