-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libcontainer: map PidsLimit to systemd's TasksMax property #1917
Conversation
Currently runc applies PidsLimit restriction by writing directly to cgroup's pids.max, without notifying systemd. As a consequence, when the later updates the context of the corresponding scope, pids.max is reset to the value of systemd's TasksMax property. This can be easily reproduced this way (I'm using "postfix" here just an example, any unrelated but existing service will do): # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h` # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max 111 # systemctl disable --now postfix # systemctl enable --now postfix # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max max This patch adds TasksAccounting=true and TasksMax=PidsLimit to the properties sent to systemd. Signed-off-by: Sergio Lopez <slp@redhat.com>
LGTM, but I would like you to note that for a very long time systemd has caused more problems when you tell it about things. We had plenty of fun bugs when systemd would un-isolate a process's resource constraints purely because we told systemd to enable some settings. There is a reason we default to |
newProp("TasksAccounting", true), | ||
newProp("TasksMax", uint64(c.Resources.PidsLimit))) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this work if we have an older version of systemd that doesnt support TasksAccounting or TasksMax ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's why TasksAccounting is only added along TasksMax if PidsLimit is greater than zero, instead of adding it unconditionally at the same time we add MemoryAccounting and co. That way, on (quite) old distros, the only breakage is that you won't be able to use "pids-limit".
On the other hand, TasksMax/TasksAccounting was added in systemd 227, which was released back in 2015-10-07. According to Distrowatch, Ubuntu 16.04 LTS has 229 (I think 14.04 LTS still uses upstart?), Debian "stretch" already has 232, and OpenSUSE 42.2 uses 228.
Older releases aren't probably well-suited to be used as a container platform for other, more important reasons (i.e. buggy aufs/overlay support), so I think we're playing safe here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this works the same as supplying a property in a *.service
file, then it should be harmless to include -- older systemd
versions should simply warn that it's unknown and ignore it (by design, specifically to handle this very quandry).
See https://www.freedesktop.org/software/systemd/man/systemd.unit.html, especially:
Unit files may contain additional options on top of those listed here. If systemd encounters an unknown option, it will write a warning log message but continue loading the unit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, it fails with an error like this:
# docker run -ti --pids-limit 111 busybox /bin/sh
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\"\n".
This comes from a RHEL7.2 VM with systemd-219-19.el7.x86_64.
Of course, you can still create containers if you don't use "pids-limit":
# docker run -ti busybox /bin/sh
/ # uname -a
Linux b216dfef5457 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 GNU/Linux
/ #
And I still think distros that old aren't actually suitable as reliable container platforms for reasons way more important than missing TasksMax/TasksAccounting.
Honestly, I can't comment about other properties, but PidsLimit needs to be sync'ed with systemd's TasksMax, or users will see the limit is randomly not enforced. |
My point was mainly that I don't recommend ever using the systemd cgroup driver because systemd has (in the past, quite a few times) un-isolated containers when using it (not by changing properties but by moving containers into a parent cgroup, bypassing the cgroup limit set). I know it's used by default on RHEL systems, but it is an issue I hope you've mitigated in other ways ( |
Last call for pre-1.0 merges! |
opencontainers/runc#1917 Signed-off-by: Sergio Lopez <slp@redhat.com>
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.
This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):
This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.
Signed-off-by: Sergio Lopez slp@redhat.com