Making systemd StartTransientUnit synchronous (mini post-mortem on that) #1780
I've just skimmed this issue quite lightly.

All of that being said, I have a more important question to ask -- why are people using the systemd cgroup driver so heavily? What is the use-case? Why does the ordinary cgroupfs driver not work for you? The reason I ask this (and I still haven't gotten an answer to this question, even though I bring it up every time we have to deal with more systemd bugs) is that systemd is very notorious for being an incredibly "bad sport" when we try to use it in the way we do; it has had many bugs in the past that broke us.

So it would really help to know what problem the systemd cgroup code fixes that is not fixed by cgroupfs (and it should be noted that the systemd cgroup code actually uses the cgroupfs driver for most of its operations -- because the systemd interfaces are useless for us).
So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when they try to report the result of the operation. This should address Issue opencontainers#1780 and it's a follow up for PR opencontainers#1683, PR opencontainers#1754 and PR opencontainers#1772. Signed-off-by: Filipe Brandenburger <filbranden@google.com>
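To make the mechanism in that commit message concrete, here is a minimal sketch, assuming the `coreos/go-systemd/dbus` API (`StartTransientUnit` takes a completion channel). The function name, unit name handling and one-second timeout are illustrative, not the exact runc code:

```go
package systemdcgroup

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// startUnitAndWait sketches the pattern described above: ask systemd to
// create a transient unit, then wait (with a timeout) for the job result.
// The capacity-1 buffer is the key detail: if we give up waiting, systemd's
// eventual result is parked in the buffer instead of blocking the sender.
func startUnitAndWait(conn *systemdDbus.Conn, unitName string, props []systemdDbus.Property) error {
	statusChan := make(chan string, 1) // buffered so a late write never blocks

	if _, err := conn.StartTransientUnit(unitName, "replace", props, statusChan); err != nil {
		return err
	}

	select {
	case status := <-statusChan:
		if status != "done" {
			return fmt.Errorf("failed to create unit %s: got %q", unitName, status)
		}
	case <-time.After(time.Second):
		// Timed out; if the result ever arrives, it lands in the buffer
		// and is simply discarded along with the channel.
		fmt.Printf("timed out waiting for systemd to create unit %s\n", unitName)
	}
	return nil
}
```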
Hi @cyphar,

There are several reasons to use the systemd driver. Regardless, some projects and distros are already using the systemd cgroup driver, and that's unlikely to change; I still think we need to support them.

While I agree with some of your points about difficulties with the systemd cgroup driver, I think part of the problem is here in the libcontainer driver. It's quite buggy, as you can see. I'm planning to try to address that situation; I think that could alleviate the problems you describe a lot.

Anyway, the point is that we do have users of it, so let's try to at least fix the bugs. I'm open to a larger process of evaluating how architectural changes might make it even more reliable, and I'm currently planning to perhaps undertake such an effort.

Cheers,
Filipe
By "the same cgroup interfaces" are you referring to the systemd helpers (because in reality there is only one set of cgroup interfaces offered by the kernel -- and it's definitely not the systemd ones 😉)?
I agree that cgroupv2 is going to be a very large issue, though I guess I have a less optimistic view of how that is going to pan out. (Also, systemd broke all container runtimes in one release because they pushed cgroupv2 incorrectly and too early. If that was the opening act, I'm a little worried about what's going to follow it.)
We do need to support people using the code we already have, yes. But that doesn't mean we should ignore the fact that systemd has caused many more issues with the systemd cgroup driver than with the cgroupfs one -- if people can switch away, they should.
I disagree (though I agree that the libcontainer driver is pretty awful in its current state). The bugs you're referring to in the original post all stem from bugs in systemd or D-Bus which resulted in messages not being passed to us, and then later attempts to fix the broken fixes. Whether or not we should've handled this edge case before is up for debate, but let's say that was an example of libcontainer broken-ness.

This bug is quite minor in comparison to the bugs I was referring to, which are entirely on the systemd side of things. If you like, I can pull up our Bugzilla entries where we had several reproducible cases in which creating a TransientUnit would result in systemd moving our container outside of the cgroup path we placed it in (and set limits on) and into a different cgroup that was effectively unconfined. There were several other bugs with similarly confusing effects, and all of them were due to systemd moving processes around because we told it about them.
Opening an issue here so we can try to understand the sequence of events, the current situation and what to do next.
cc @derekwaynecarr @vikaschoudhary16 @sjenning @mrunalp
The cgroup directories were being created with `os.MkdirAll` (which creates the directories, if they don't exist.)

a. I believe this approach is appropriate. Waiting for the operation to complete before proceeding is the correct behavior here.

b. The problem with PR #1683 ("Fix race against systemd") was that the StartTransientUnit call breaks on errors only when `err != nil && !isUnitExists(err)`, so in the case where the unit already exists, blocking on the channel is a mistake, since there's no pending operation from systemd. (Personal note: this would have been unlikely to happen in a language with proper exception handling.)

This broke `kubelet` (Kubernetes), since it does call libcontainer on the same cgroup multiple times, and the later calls trigger the `isUnitExists(err)` situation that now makes the libcontainer code block on a channel that will never receive any message. So `kubelet` startup gets stuck there. (BTW, I spent a couple of days trying to troubleshoot that myself too, seeing `kubelet` mysteriously fail to initialize when using the systemd cgroup driver.)
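To make the failure in (b) above concrete, here is a sketch of the guard that avoids blocking when the unit already exists. `applyUnit` is a hypothetical name, and `isUnitExists` is simplified for illustration (the real runc helper inspects the D-Bus error name):

```go
package systemdcgroup

import (
	"strings"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// isUnitExists is a simplified stand-in for the helper mentioned above.
func isUnitExists(err error) bool {
	return err != nil && strings.Contains(err.Error(), "UnitExists")
}

// applyUnit shows the guard that was missing: only wait on statusChan when
// StartTransientUnit actually queued a job. If the unit already exists,
// systemd queues nothing and never writes to the channel, so blocking on it
// is exactly what hung kubelet.
func applyUnit(conn *systemdDbus.Conn, unitName string, props []systemdDbus.Property) error {
	statusChan := make(chan string, 1)
	_, err := conn.StartTransientUnit(unitName, "replace", props, statusChan)
	switch {
	case err == nil:
		<-statusChan // a job was queued; waiting for its result is safe
	case isUnitExists(err):
		// Unit already exists: there is no pending operation, so do not wait.
	default:
		return err
	}
	return nil
}
```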
a. A timeout is probably appropriate; blocking forever is really not good. I'm not sure one second is long enough for a timeout, though. I also think that assuming everything went well in this case is not necessarily appropriate. Perhaps this can be done better here, maybe by retrying the call (which should be idempotent, since creating the same unit multiple times will actually get into the `isUnitExists` case we've discussed previously.)

b. This PR also introduced a bug: now, if the timeout occurs, no one is ever reading from that channel again, which means that later on, when the job completes, the code writing to it will block forever. (See `startJob()`, `jobComplete()` and the `jobListener` lock in `coreos/go-systemd/dbus`.)

a. I think we need to address the actual bug in 7(b).
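To make 7(b) concrete, here is a self-contained sketch of the deadlock pattern in plain Go (no D-Bus involved): the reader gives up after a timeout, and a later send on the unbuffered channel blocks forever. As the issue notes, in the real library that blocked send sits on the job-completion path, which makes it worse; a capacity-1 buffer is enough to make the late send harmless.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered, like the channel before the fix. Changing this to
	// make(chan string, 1) is enough to let the late send complete.
	statusChan := make(chan string)

	// Stands in for the job result being delivered after the caller's
	// one-second timeout has already expired.
	go func() {
		time.Sleep(2 * time.Second)
		statusChan <- "done" // blocks forever: nobody will ever read this channel again
		fmt.Println("result delivered")
	}()

	select {
	case s := <-statusChan:
		fmt.Println("job finished:", s)
	case <-time.After(1 * time.Second):
		// We stop listening here, so the send above can never complete.
		fmt.Println("timed out; assuming success")
	}

	time.Sleep(3 * time.Second)
	// With the unbuffered channel, "result delivered" is never printed.
}
```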
a. I'd appreciate knowing more details about this, including which upstream patch was used to fix this behavior.

b. Regardless of the change in behavior in systemd, I still think the intent of the PR in (5) was correct: the call to StartTransientUnit should be blocking, and even if the race is benign, we should avoid any race between systemd and runc/libcontainer manipulating the cgroup tree.

a. Once again, I don't think that's appropriate, since even if it's benign, a race condition is still there and could come back to bite us in the future. Having the "create cgroup" operation be synchronous is the appropriate behavior here. In any other situation where we're creating a resource and then consuming it, I'm sure we would want the same.

b. Rather than reverting everything, I think we should address the bug in 7(b) and fix it.

a. I think that's a bug too, and it should be addressed.
My proposal here is that, instead of rolling back, we fix the actual bugs that still exist. I still assert that keeping the synchronous behavior of StartTransientUnit is the proper approach.

Regarding the potential fixes (or workarounds) for 7(b):
1. Use `statusChan := make(chan string, 1)`, which buffers one write to the channel. That way, the write will not block even if we stopped listening on it due to a timeout. This is very localized and doesn't really have bad side effects, even when timeouts occur. We could also increase the timeout from 1s to something a bit higher. I think this would probably be a good start.

2. Patch `coreos/go-systemd/dbus`, moving the write to the channel to outside the lock. (There's no harm in doing that outside the lock.) We might further consider doing that write from a goroutine, so if it blocks we don't really get stuck. (The downside there is that we risk leaking goroutines in this situation.) See the sketch after this list.

3. Change `coreos/go-systemd/dbus` to add the ability to cancel a blocking channel in case of a timeout, so no write is done when it is not needed. But this looks like it might be quite intrusive to add, and it has some races of its own that would have to be dealt with, so I'm not sure whether it's actually worth it...

4. Handle this logic in `coreos/go-systemd/dbus` instead of here. (Not really sure, need to investigate.) We probably don't need that right away, but long term, it's something we should probably consider.

I think that's all for now... I hope this describes the situation fairly accurately; otherwise, please leave comments (we can come back and edit the description to fix anything if there are mistakes.)
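As a concrete illustration of option (2) above (with option (1)'s buffered channel noted in passing), here is a hypothetical sketch of a completion handler that never blocks the dispatcher. None of the names here come from the real `coreos/go-systemd` code; it only shows the shape of the idea:

```go
package jobsketch

import "sync"

// reportJobResult illustrates option (2): the lock protects only the map
// lookup, and the send happens outside the critical section, from its own
// goroutine, so a reader that timed out and went away can no longer wedge
// the dispatcher. If the destination channel is also buffered (option 1),
// the goroutine finishes immediately and nothing leaks.
func reportJobResult(mu *sync.Mutex, jobs map[string]chan<- string, job, result string) {
	mu.Lock()
	ch, ok := jobs[job]
	if ok {
		delete(jobs, job)
	}
	mu.Unlock()

	if !ok {
		return
	}
	go func() { ch <- result }()
}
```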
Cheers,
Filipe