Fix systemd.Apply() to check for DBus error before waiting on a channel. #1772

filbranden · 2018-03-31T07:47:46Z

The channel was introduced in #1683 to work around a race condition. However, the error check on StartTransientUnit ignores the error for an already existing unit, and in that case DBus sends no notification, so waiting on the channel hangs.

Later, PR #1754 added a timeout, which worked around the issue, but we can fix this correctly by only waiting on the channel when there is no error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where this situation occurs. The bug entry at Red Hat's bugzilla mentions calling this code from inside a container; it is unclear whether an existing container was in use, so I am not sure whether this change would have fixed that bug as well.

/assign @mrunalp @hqhq -> Please review, as you reviewed the original PRs too.
/cc @cyphar -> Code review on the original PRs.
/cc @vikaschoudhary16 -> Authored the original PRs.
/cc @derekwaynecarr @sjenning -> Cc'd on original PRs and comments on Red Hat's bugzilla entry.

The channel was introduced in opencontainers#1683 to work around a race condition. However, the check for error in StartTransientUnit ignores the error for an already existing unit, and in that case there will be no notification from DBus (so waiting on the channel will make it hang.)

Later PR opencontainers#1754 added a timeout, which worked around the issue, but we can fix this correctly by only waiting on the channel when there is no error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358 mentions calling this code from inside a container, it's unclear whether an existing container was in use or not, so not sure whether this would have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>

filbranden · 2018-04-10T16:13:59Z

Please give this one some attention... I'd say it's fixing an obvious bug (after you get to see it) and in my testing it did fix the hangs on Kubelet startup...

Thanks!
Filipe

mrunalp · 2018-04-10T17:09:33Z

LGTM

crosbymichael · 2018-04-10T18:14:29Z

LGTM

Make the channel buffered so that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when it tries to report the result of the operation. This should address issue opencontainers#1780, and it is a follow-up to PR opencontainers#1683, PR opencontainers#1754 and PR opencontainers#1772. Signed-off-by: Filipe Brandenburger <filbranden@google.com>

PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that makes Kubelet startup hang when using the systemd cgroup driver (by adding a timeout), and PR opencontainers/runc#1772 then fixes that bug by checking the proper error status before waiting on the channel. PR opencontainers/runc#1776 checks whether Delegate works in slices, which keeps the libcontainer systemd cgroup driver working on systemd v237+. PR opencontainers/runc#1781 makes the channel buffered, so if we time out waiting on the channel, the updater will not block trying to write to it when there are no longer any consumers.

@derekwaynecarr

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Update libcontainer to include PRs with fixes to systemd cgroup driver

**What this PR does / why we need it**:
PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that makes Kubelet startup hang when using systemd cgroup driver (by adding a timeout) and further PR opencontainers/runc#1772 fixes that bug by checking the proper error status before waiting on the channel. PR opencontainers/runc#1776 checks whether Delegate works in slices, which keeps libcontainer systemd cgroup driver working on systemd v237+.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #61474

**Special notes for your reviewer**:
/assign @derekwaynecarr
cc @vikaschoudhary16 @sjenning @adelton @mrunalp

**Release note**:
```release-note
NONE
```

PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that makes Kubelet startup hang when using the systemd cgroup driver (by adding a timeout), and PR opencontainers/runc#1772 then fixes that bug by checking the proper error status before waiting on the channel. PR opencontainers/runc#1776 checks whether Delegate works in slices, which keeps the libcontainer systemd cgroup driver working on systemd v237+. PR opencontainers/runc#1781 makes the channel buffered, so if we time out waiting on the channel, the updater will not block trying to write to it when there are no longer any consumers.

opencontainers/runc#1683 opencontainers/runc#1754 opencontainers/runc#1772 opencontainers/runc#1781 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

filbranden mentioned this pull request Mar 31, 2018

Update libcontainer to include PRs with fixes to systemd cgroup driver kubernetes/kubernetes#61926

Merged

filbranden mentioned this pull request Apr 9, 2018

Detect whether Delegate is available on both slices and scopes #1776

Merged

filbranden force-pushed the systemd1 branch from 71ecfe6 to 8ab251f on April 9, 2018 18:52

crosbymichael merged commit 3cbb2fa into opencontainers:master Apr 10, 2018

filbranden mentioned this pull request Apr 14, 2018

Making systemd StartTransientUnit synchronous (mini post-mortem on that) #1780

Open

filbranden mentioned this pull request Apr 14, 2018

Make channel for StartTransientUnit buffered #1781

Merged

filbranden mentioned this pull request Jun 12, 2018

Use uint64 for resources to keep consistency with runtime-spec projectatomic/runc#10

Closed

mrunalp mentioned this pull request Jun 12, 2018

cgroups: Backport of upstream fixes around starting units projectatomic/runc#12

Merged

mrunalp mentioned this pull request Jun 12, 2018

cgroups: Backport of upstream fixes around starting units projectatomic/runc#13

Merged

filbranden deleted the systemd1 branch February 7, 2019 01:46

kolyshkin mentioned this pull request Mar 23, 2023

runc systemd cgroup driver logic is wrong #3780

Closed
