Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libcontainer: map PidsLimit to systemd's TasksMax property #1917

Merged
merged 1 commit into from
Nov 13, 2018
Merged

Conversation

slp
Copy link
Contributor

@slp slp commented Oct 24, 2018

Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez slp@redhat.com

Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez <slp@redhat.com>
@cyphar
Copy link
Member

cyphar commented Oct 27, 2018

LGTM, but I would like you to note that for a very long time systemd has caused more problems when you tell it about things. We had plenty of fun bugs when systemd would un-isolate a process's resource constraints purely because we told systemd to enable some settings. There is a reason we default to cgroupfs entirely now-a-days.

Approved with PullApprove

newProp("TasksAccounting", true),
newProp("TasksMax", uint64(c.Resources.PidsLimit)))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this work if we have an older version of systemd that doesnt support TasksAccounting or TasksMax ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's why TasksAccounting is only added along TasksMax if PidsLimit is greater than zero, instead of adding it unconditionally at the same time we add MemoryAccounting and co. That way, on (quite) old distros, the only breakage is that you won't be able to use "pids-limit".

On the other hand, TasksMax/TasksAccounting was added in systemd 227, which was released back in 2015-10-07. According to Distrowatch, Ubuntu 16.04 LTS has 229 (I think 14.04 LTS still uses upstart?), Debian "stretch" already has 232, and OpenSUSE 42.2 uses 228.

Older releases aren't probably well-suited to be used as a container platform for other, more important reasons (i.e. buggy aufs/overlay support), so I think we're playing safe here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works the same as supplying a property in a *.service file, then it should be harmless to include -- older systemd versions should simply warn that it's unknown and ignore it (by design, specifically to handle this very quandry).

See https://www.freedesktop.org/software/systemd/man/systemd.unit.html, especially:

Unit files may contain additional options on top of those listed here. If systemd encounters an unknown option, it will write a warning log message but continue loading the unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, it fails with an error like this:

# docker run -ti --pids-limit 111  busybox /bin/sh
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\"\n".

This comes from a RHEL7.2 VM with systemd-219-19.el7.x86_64.

Of course, you can still create containers if you don't use "pids-limit":

# docker run -ti  busybox /bin/sh
/ # uname -a
Linux b216dfef5457 3.10.0-514.26.2.el7.x86_64 #1 SMP Fri Jun 30 05:26:04 UTC 2017 x86_64 GNU/Linux
/ #

And I still think distros that old aren't actually suitable as reliable container platforms for reasons way more important than missing TasksMax/TasksAccounting.

@slp
Copy link
Contributor Author

slp commented Oct 30, 2018

LGTM, but I would like you to note that for a very long time systemd has caused more problems when you tell it about things. We had plenty of fun bugs when systemd would un-isolate a process's resource constraints purely because we told systemd to enable some settings. There is a reason we default to cgroupfs entirely now-a-days.

Honestly, I can't comment about other properties, but PidsLimit needs to be sync'ed with systemd's TasksMax, or users will see the limit is randomly not enforced.

@cyphar
Copy link
Member

cyphar commented Nov 3, 2018

My point was mainly that I don't recommend ever using the systemd cgroup driver because systemd has (in the past, quite a few times) un-isolated containers when using it (not by changing properties but by moving containers into a parent cgroup, bypassing the cgroup limit set). I know it's used by default on RHEL systems, but it is an issue I hope you've mitigated in other ways (Delegate=yes was supposed to fix it, but for at least a year we still had issues).

@cyphar
Copy link
Member

cyphar commented Nov 13, 2018

Last call for pre-1.0 merges!
/cc @opencontainers/runc-maintainers

@cyphar cyphar added this to the 1.0.0 milestone Nov 13, 2018
@mrunalp
Copy link
Contributor

mrunalp commented Nov 13, 2018

LGTM

Approved with PullApprove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants