Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for cgroup namespace #1916

Merged
merged 2 commits into from
Nov 13, 2018
Merged

Conversation

crosbymichael
Copy link
Member

Carrying #781. Thanks for @cyphar 's work. 😄

Closes #1184
Fixes #1884

Cgroup namespace can be configured in config.json as other
namespaces. Here is an example:

"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper CgroupsPath of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng pengyuanhong@huawei.com

@crosbymichael
Copy link
Member Author

For #1884, We worked on a repo and I have verified that this fixes the issue. Without the patch, you can repo in about 2-5min, with the patch, it runs for hours without error. I'm confident that this fixes the issue.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits...


/* Send the init_func pid back to our parent. */
/*
* * Send the init_func pid and the pid of the first child back to our parent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment block mishap..

@@ -201,7 +207,8 @@ static void update_setgroups(int pid, enum policy_t setgroup)
* open(2) or write(2) will return ENOENT. This is fine.
*/
if (errno != ENOENT)
bail("failed to write '%s' to /proc/%d/setgroups", policy, pid);
bail("failed to write '%s' to /proc/%d/setgroups",
policy, pid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

harder to see what's new when the diff includes lots of odd extra spaces and new lines throughout

libcontainer/nsenter/nsexec.c Show resolved Hide resolved
* with clone() because there were some old kernel versions where
* clone(CLONE_PARENT | CLONE_NEWPID) was broken, so we'll just do
* it the long way.
if (config.cloneflags & CLONE_NEWCGROUP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment mishap

This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.

This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there's a lot of whitespace and wrapping changes -- were these added by accident?

libcontainer/nsenter/nsexec.c Show resolved Hide resolved
libcontainer/nsenter/nsexec.c Outdated Show resolved Hide resolved
libcontainer/process_linux.go Outdated Show resolved Hide resolved
@crosbymichael
Copy link
Member Author

So as far as the whitespace, I thought we all agreed on using indent -linux for our C files so that they stay consistent, same as gofmt.

@crosbymichael
Copy link
Member Author

Resolved the comments, let me know if you want me to revert the formatting changes.

@AkihiroSuda
Copy link
Member

Max line length seems changed in recent indent?

@crosbymichael
Copy link
Member Author

@AkihiroSuda I think it has always been 80chars

@crosbymichael
Copy link
Member Author

Whats the next steps here? I have verified the issues that this fixes and it would be nice to get this in. Let me know what changes you want or not

// TODO: should not be the responsibility to call here
p.manager.Destroy()
}
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block is odd because we do most of the same thing earlier. Shouldnt we only need to apply the cgroup namespace and wait for child here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean, just this defer?

Copy link
Contributor

@dqminh dqminh Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, basically most of the block except the new part about setting up cgroup namespace and wait for first child to exit. Here is the current block we have just a few lines above: https://github.com/opencontainers/runc/blob/master/libcontainer/process_linux.go#L276-L295

Now this is interesting. We certainly put the parent into cgroup before we spawning the child accoring to our current code, so i wonder where the leak can be. Also #1884 is about runc exec, shouldnt we apply the same thing in setnsProcess too ? I think that is the root cause of #1884, isnt it ?

Copy link
Member

@cyphar cyphar Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the old Apply code should be removed since there's not much point in applying it twice -- maybe everything including the intelRdt cleanup as well as the comments should be just wholesale moved here... Not sure.

We certainly put the parent into cgroup before we spawning the child accoring to our current code, so i wonder where the leak can be.

I'm not sure that's correct -- we have err := p.cmd.Start() before the cgroup application code. Though, I find it hard to believe that runc init can do all of its setup and reach the Go code before Go runs through 5 lines of code (especially since Linux's scheduler is biased to the currently-running process -- so theoretically the parent process should get more work done in the short race after fork+exec).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having the Apply after we have PIDs in the cgroup breaks memory.kmem.limit_in_bytes which can be set only when the cgroup is empty.

Is it safe to remove the second Apply?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why there are two identical defer blocks in the same start() method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin I saw #2238 is suggesting to remove one

@cyphar
Copy link
Member

cyphar commented Oct 31, 2018

It's always been 80chars, but I think when I ran the formatting it was with 120 characters (which is becoming a more common "sane" width). My only complaint with the formatting is that it makes stuff harder to read (function arguments are wrapped in odd ways and so on -- something that you don't see in actual Linux code). I would exclude them and we can re-establish a format in a separate PR.

@crosbymichael
Copy link
Member Author

Sounds good to me, i'll make the changes.

Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael
Copy link
Member Author

Ok, pushed the formatting changes

@cyphar
Copy link
Member

cyphar commented Nov 13, 2018

LGTM, and I hope we can get this in 1.0 but if we can't we can always do it in the next release. I've tested it out and it appears to work for most basic uses.


Last call for pre-1.0 merges!
/cc @opencontainers/runc-maintainers

Approved with PullApprove

@crosbymichael
Copy link
Member Author

I think we should merge this for 1.0, its a new feature but the implementation also fixes a nasty bug under stress

@cyphar
Copy link
Member

cyphar commented Nov 13, 2018

Yeah, I agree.

@mrunalp
Copy link
Contributor

mrunalp commented Nov 13, 2018

LGTM

Approved with PullApprove

@mrunalp
Copy link
Contributor

mrunalp commented Nov 13, 2018

will merge once pull approve returns.

@mrunalp mrunalp merged commit 4769cdf into opencontainers:master Nov 13, 2018
@giuseppe
Copy link
Member

giuseppe commented Jan 4, 2019

It seems that this change introduced a problem with the systemd cgroupfs backend.

I've added this snippet to the default "runc spec" config file (under linux/resources):

      "memory": {
        "kernel": 41943040,
        "disableOOMKiller": false
      }

now when I try to run a container I get:

# runc --systemd-cgroup run foo
container_linux.go:344: starting container process caused "process_linux.go:311: applying cgroup configuration for process caused \"failed to set memory.kmem.limit_in_bytes, because either tasks have already joined this cgroup or it has children\""

dims added a commit to dims/image-builder that referenced this pull request Oct 14, 2019
Fix for opencontainers/runc#1884
is in opencontainers/runc#1916

- problem described in 1884 is independent of kubernetes version.
- happens with kernel >= v4.8
- fixed in runc version v1.0.0-rc6 and above ( opencontainers/runc@9a3a8a5 )
- which got pulled into containerd v1.3.0-beta.0 and above ( containerd/containerd@97dd5df#diff-4061fcef378a6d912e14e2ce162a1995)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants