Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix systemd cgroup driver's Apply #3782

Merged
merged 4 commits into from
Apr 3, 2023

Conversation

kolyshkin
Copy link
Contributor

This fixes the wrong logic of systemd cgroup manager's Apply method. For more details, see commit messages and #3780.

Fixes: #3780
Fixes: #3760

@kolyshkin kolyshkin requested a review from a team March 23, 2023 19:16
@kolyshkin kolyshkin added the backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 label Mar 23, 2023
@kolyshkin
Copy link
Contributor Author

The failure with "runc ps" on CentOS 9 is gone as expected.

This failure:

not ok 168 update cgroup cpu.idle

is a separate issue not addressed by this PR.

@@ -356,7 +356,7 @@ function setup() {
[ "$output" = "ok" ]
}

@test "runc run/create should warn about a non-empty cgroup" {
@test "runc run/create should error for a non-empty cgroup" {
Copy link
Member

@AkihiroSuda AkihiroSuda Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking change?
Probably fine for v1.2, but not sure backportable to v1.1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To begin with, it never worked anyway, because if the systemd unit exists, the error is ignored and the new container is never added to the unit (and to the cgroup). To repeat, it never worked anyway.

We can try making it work in 1.1 though. The fix would be very different from this one, something like "if startUnit returned UnitExists error, call setUnitProperties with properties of PIDs=[new pid]".

I am not sure that this will work (maybe, maybe not -- it's complicated. I am also unsure if we want to go that route at all -- I mean trying to fix something that never worked anyway.

In this version (1.2.x), I think this is the way it should be done.

In 1.1, we can discuss it later (for the 1.1.6 I guess).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reiterate -- let's concentrate on how can we fix it in main branch for now, and think about 1.1 backport later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this a bit -- I think we can still allow shared cgroup when fs cgroup driver is used, keeping the warning, and error out in case of systemd cgroup driver. This is sort of a breaking change, but since

  1. the functionality never worked correctly (UnitExists error was ignored, and container was not placed into the proper systemd unit and/or cgroup), and
  2. will be deprecated in runc 1.2,
  3. implementing such a feature (adding a container to an existing systemd unit) is not very easy,

it makes little sense in trying to do that.

In particular, this test can be changed to look for a warning in case of an fs cgroup driver, and error in case of systemd cgroup driver.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin That sounds good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as described above in #3806

@kolyshkin kolyshkin added this to the 1.1.6 milestone Mar 28, 2023
@kolyshkin kolyshkin requested a review from AkihiroSuda March 28, 2023 17:50
AkihiroSuda
AkihiroSuda previously approved these changes Mar 28, 2023
Move error handling earlier, removing "if err == nil" block.

No change of logic.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit d223e2a ("Ignore error when starting transient unit
that already exists" modified the code handling errors from startUnit
to ignore UnitExists error.

Apparently it was done so that kubelet can create the same pod slice
over and over without hitting an error (see [1]).

While it works for a pod slice to ensure it exists, it is a gross bug
to ignore UnitExists when creating a container. In this case, the
container init PID won't be added to the systemd unit (and to the
required cgroup), and as a result the container will successfully
run in a current user cgroup, without any cgroup limits applied.

So, fix the code to only ignore UnitExists if we're not adding a process
to the systemd unit. This way, kubelet will keep working as is, but
runc will refuse to create containers which are not placed into a
requested cgroup.

[1] opencontainers#1124

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case a systemd unit fails (for example, timed out or OOM-killed),
systemd keeps the unit. This prevents starting a new container with
the same systemd unit name.

The fix is to call reset-failed in case UnitExists error is returned,
and retry once.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit d08bc0c ("runc run: warn on non-empty cgroup") introduced
a warning when a container is started in a non-empty cgroup. Such
configuration has lots of issues.

In addition to that, such configuration is not possible at all when
using the systemd cgroup driver.

As planned, let's promote this warning to an error, and fix the test
case accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin

This comment was marked as outdated.

@kolyshkin
Copy link
Contributor Author

Rebased. The only CI failure to expect here is centos-stream-9 one from #3786 (not ok NN update cgroup cpu.idle), being fixed/mitigated by #3788.

@kolyshkin
Copy link
Contributor Author

1.1 backport, more conservative: #3806.

@kolyshkin kolyshkin added backport/1.1-done A PR in main branch which has been backported to release-1.1 and removed backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels Apr 3, 2023
mmerkes added a commit to mmerkes/cgroups that referenced this pull request Apr 23, 2023
As done in this runc PR
opencontainers/runc#3782,
UnitExists errors are no longer ignored when starting units. If it
already exists, we attempt to reset the failed unit and retry once.
Otherwise, we fail. See containerd#279
for more context.

Signed-off-by: Matt Merkes <matt.merkes@gmail.com>
mmerkes added a commit to mmerkes/cgroups that referenced this pull request May 20, 2023
As done in opencontainers/runc#3782,
UnitExists errors are no longer ignored when starting units.
This change makes the logic more robust like in runc:

1. Attempts to reset a failed unit if it already exists
2. Verifies via the unit status that it successfully starts
3. Waits longer for unit to start
4. Continues to ignore unit existing when pid is -1 to
accommodate kubelet use case
5. Otherwise, returns an error if it already exists

Signed-off-by: Matt Merkes <matt.merkes@gmail.com>
mmerkes added a commit to mmerkes/cgroups that referenced this pull request May 20, 2023
As done in opencontainers/runc#3782,
UnitExists errors are no longer ignored when starting units.
This change makes the logic more robust like in runc:

1. Attempts to reset a failed unit if it already exists
2. Verifies via the unit status that it successfully starts
3. Waits longer for unit to start
4. Continues to ignore unit existing when pid is -1 to
accommodate kubelet use case
5. Otherwise, returns an error if it already exists

Signed-off-by: Matt Merkes <matt.merkes@gmail.com>
mmerkes added a commit to mmerkes/cgroups that referenced this pull request Jun 28, 2023
As done in opencontainers/runc#3782,
UnitExists errors are no longer ignored when starting units.
This change makes the logic more robust like in runc:

1. Attempts to reset a failed unit if it already exists
2. Verifies via the unit status that it successfully starts
3. Waits longer for unit to start
4. Continues to ignore unit existing when pid is -1 to
accommodate kubelet use case
5. Otherwise, returns an error if it already exists

Signed-off-by: Matt Merkes <matt.merkes@gmail.com>

Update .gitignore

Signed-off-by: Matt <matt.merkes@gmail.com>
jchorl added a commit to jchorl/cgroups that referenced this pull request Sep 21, 2024
Before this commit, creating a cgroup would silently ignore timeouts and
carry on. Concretely, this caused cases where a cgroup failed to create,
but the caller doesn't realize and ends up looking for files that should
exist (e.g. cgroups.controllers), only to find they don't exist. It's
very difficult as a caller to deal with this case, where NewSystemd
succeeds but the group doesn't exist.

The origins of this code seem to trace back to an initial implementation
written 5+ years ago:
containerd@5efa14e#diff-3331981e4ac06a8d9b06e91842b7f2759c7af3b65287e489a88385948d311ebdR672

runc added roughly the same logic here to deal with the same issue:
opencontainers/runc#3782

Now, containerd will also error if a cgroup cannot be created within the
timeout window.

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>
jchorl added a commit to jchorl/cgroups that referenced this pull request Sep 25, 2024
Before this commit, creating a cgroup would silently ignore timeouts and
carry on. Concretely, this caused cases where a cgroup failed to create,
but the caller doesn't realize and ends up looking for files that should
exist (e.g. cgroups.controllers), only to find they don't exist. It's
very difficult as a caller to deal with this case, where NewSystemd
succeeds but the group doesn't exist.

The origins of this code seem to trace back to an initial implementation
written 5+ years ago:
containerd@5efa14e#diff-3331981e4ac06a8d9b06e91842b7f2759c7af3b65287e489a88385948d311ebdR672

runc added roughly the same logic here to deal with the same issue:
opencontainers/runc#3782

Now, containerd will also error if a cgroup cannot be created within the
timeout window.

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1-done A PR in main branch which has been backported to release-1.1 impact/changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runc systemd cgroup driver logic is wrong centos-stream-9 CI is failing for the main branch
3 participants