-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for cgroup namespace #1916
Conversation
For #1884, We worked on a repo and I have verified that this fixes the issue. Without the patch, you can repo in about 2-5min, with the patch, it runs for hours without error. I'm confident that this fixes the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some nits...
libcontainer/nsenter/nsexec.c
Outdated
|
||
/* Send the init_func pid back to our parent. */ | ||
/* | ||
* * Send the init_func pid and the pid of the first child back to our parent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment block mishap..
libcontainer/nsenter/nsexec.c
Outdated
@@ -201,7 +207,8 @@ static void update_setgroups(int pid, enum policy_t setgroup) | |||
* open(2) or write(2) will return ENOENT. This is fine. | |||
*/ | |||
if (errno != ENOENT) | |||
bail("failed to write '%s' to /proc/%d/setgroups", policy, pid); | |||
bail("failed to write '%s' to /proc/%d/setgroups", | |||
policy, pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
harder to see what's new when the diff includes lots of odd extra spaces and new lines throughout
libcontainer/nsenter/nsexec.c
Outdated
* with clone() because there were some old kernel versions where | ||
* clone(CLONE_PARENT | CLONE_NEWPID) was broken, so we'll just do | ||
* it the long way. | ||
if (config.cloneflags & CLONE_NEWCGROUP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment mishap
80dc628
to
637df62
Compare
This is a very simple implementation because it doesn't require any configuration unlike the other namespaces, and in its current state it only masks paths. This feature is available in Linux 4.6+ and is enabled by default for kernels compiled with CONFIG_CGROUP=y. Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also there's a lot of whitespace and wrapping changes -- were these added by accident?
So as far as the whitespace, I thought we all agreed on using |
Resolved the comments, let me know if you want me to revert the formatting changes. |
Max line length seems changed in recent indent? |
@AkihiroSuda I think it has always been 80chars |
Whats the next steps here? I have verified the issues that this fixes and it would be nice to get this in. Let me know what changes you want or not |
// TODO: should not be the responsibility to call here | ||
p.manager.Destroy() | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block is odd because we do most of the same thing earlier. Shouldnt we only need to apply the cgroup namespace and wait for child here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean, just this defer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, basically most of the block except the new part about setting up cgroup namespace and wait for first child to exit. Here is the current block we have just a few lines above: https://github.com/opencontainers/runc/blob/master/libcontainer/process_linux.go#L276-L295
Now this is interesting. We certainly put the parent into cgroup before we spawning the child accoring to our current code, so i wonder where the leak can be. Also #1884 is about runc exec
, shouldnt we apply the same thing in setnsProcess
too ? I think that is the root cause of #1884, isnt it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the old Apply
code should be removed since there's not much point in applying it twice -- maybe everything including the intelRdt
cleanup as well as the comments should be just wholesale moved here... Not sure.
We certainly put the parent into cgroup before we spawning the child accoring to our current code, so i wonder where the leak can be.
I'm not sure that's correct -- we have err := p.cmd.Start()
before the cgroup application code. Though, I find it hard to believe that runc init
can do all of its setup and reach the Go code before Go runs through 5 lines of code (especially since Linux's scheduler is biased to the currently-running process -- so theoretically the parent process should get more work done in the short race after fork+exec).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having the Apply
after we have PIDs in the cgroup breaks memory.kmem.limit_in_bytes
which can be set only when the cgroup is empty.
Is it safe to remove the second Apply
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand why there are two identical defer
blocks in the same start()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin I saw #2238 is suggesting to remove one
It's always been 80chars, but I think when I ran the formatting it was with 120 characters (which is becoming a more common "sane" width). My only complaint with the formatting is that it makes stuff harder to read (function arguments are wrapped in odd ways and so on -- something that you don't see in actual Linux code). I would exclude them and we can re-establish a format in a separate PR. |
Sounds good to me, i'll make the changes. |
Cgroup namespace can be configured in `config.json` as other namespaces. Here is an example: ``` "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" }, { "type": "cgroup" } ], ``` Note that if you want to run a container which has shared cgroup ns with another container, then it's strongly recommended that you set proper `CgroupsPath` of both containers(the second container's cgroup path must be the subdirectory of the first one). Or there might be some unexpected results. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Ok, pushed the formatting changes |
I think we should merge this for 1.0, its a new feature but the implementation also fixes a nasty bug under stress |
Yeah, I agree. |
will merge once pull approve returns. |
It seems that this change introduced a problem with the systemd cgroupfs backend. I've added this snippet to the default "runc spec" config file (under "memory": {
"kernel": 41943040,
"disableOOMKiller": false
} now when I try to run a container I get: # runc --systemd-cgroup run foo
container_linux.go:344: starting container process caused "process_linux.go:311: applying cgroup configuration for process caused \"failed to set memory.kmem.limit_in_bytes, because either tasks have already joined this cgroup or it has children\"" |
Fix for opencontainers/runc#1884 is in opencontainers/runc#1916 - problem described in 1884 is independent of kubernetes version. - happens with kernel >= v4.8 - fixed in runc version v1.0.0-rc6 and above ( opencontainers/runc@9a3a8a5 ) - which got pulled into containerd v1.3.0-beta.0 and above ( containerd/containerd@97dd5df#diff-4061fcef378a6d912e14e2ce162a1995)
Carrying #781. Thanks for @cyphar 's work. 😄
Closes #1184
Fixes #1884
Cgroup namespace can be configured in
config.json
as othernamespaces. Here is an example:
Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper
CgroupsPath
of both containers(the second container's cgrouppath must be the subdirectory of the first one). Or there might be
some unexpected results.
Signed-off-by: Yuanhong Peng pengyuanhong@huawei.com