podman.service: use sdnotify #7312

vrothberg · 2020-08-13T10:58:15Z

Commit 2b6dd3f set the KillMode of podman.service to the systemd
default, which ultimately led to the problem that systemd kills
*all* processes inside the unit's cgroup, and hence all containers,
whenever the service is stopped.

Fix it by setting the type to sdnotify and the KillMode to process.
`podman system service` sends the necessary notify messages when
NOTIFY_SOCKET is set and unsets it right afterwards to prevent the
backend and container runtimes from jumping in and sending messages
as well.

Fixes: #7294
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>

vrothberg · 2020-08-13T10:59:14Z

@martinpitt @goochjj PTAL

vrothberg · 2020-08-13T11:08:31Z

/hold

vrothberg · 2020-08-13T11:21:10Z

This looks good to me now. Podman will now be stopped, as systemd knows its PID. Also, systemd is not complaining about old conmons hanging around but actually reports them. Below you can see what I'm referring to. I ran a bunch of containers, waited for the service to time out, and then did an image list to fire up the service again:

[root@nebuchadnezzar libpod]# systemctl status podman.service
● podman.service - Podman API Service
     Loaded: loaded (/usr/local/lib/systemd/system/podman.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2020-08-13 13:18:14 CEST; 3s ago
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)
   Main PID: 1198502 (podman)
      Tasks: 27 (limit: 19009)
     Memory: 39.0M
        CPU: 191ms
     CGroup: /system.slice/podman.service
             ├─1197016 /usr/bin/conmon --api-version 1 -c b3ef44b91d1108f23c04f19b61b59c8ec1d6f57918ccca6a84bc64bf86eac834 
             ├─1197240 /usr/bin/conmon --api-version 1 -c fb6e0ae46293a97bf166dc3a76e17f549b949b28a6d66601ca27b220ce9431ec 
             ├─1197955 /usr/bin/conmon --api-version 1 -c da360f41e2d854e4c12b0f061b230c04a5f4f7e1abc408e60ef15ec1efc57c9c 
             ├─1198029 /usr/bin/conmon --api-version 1 -c d931f5e0bea1dc2acad2c1c32bedad7cc5b5571d4b5048a1f5d5fbae2ba76d60 
             ├─1198096 /usr/bin/conmon --api-version 1 -c 1b84665fa739d408146bfbf03ab775c9bc18c7b648e13d848bd832dcfb432312 
             ├─1198162 /usr/bin/conmon --api-version 1 -c 1ba81c25316fb91e2f585969a6cc6a2df4f9352fa3e44fa931fe8981f5f89308 
             ├─1198229 /usr/bin/conmon --api-version 1 -c aaabfd54a0fd3e3513ab91c73aaf91e9184df9f420150c1b2e9cd5a2a013d893 
             └─1198502 /usr/local/bin/podman system service

rhatdan · 2020-08-13T13:19:31Z

/lgtm

mheon · 2020-08-13T13:26:14Z

LGTM

vrothberg · 2020-08-13T13:35:50Z

Please hold off on merging until we have more acks on it.

rhatdan · 2020-08-13T14:01:06Z

I thought you had two. But you have control anyways.

giuseppe · 2020-08-13T14:46:21Z

> This looks good to me now. Podman will now be stopped, as systemd knows its PID. Also, systemd is not complaining about old conmons hanging around but actually reports them. Below you can see what I'm referring to. I ran a bunch of containers, waited for the service to time out, and then did an image list to fire up the service again:

why do we have conmon processes in the podman.service cgroup? I thought we moved them to a separate cgroup

vrothberg · 2020-08-13T14:59:14Z

> why do we have conmon processes in the podman.service cgroup? I thought we moved them to a separate cgroup

Could it be because we're unsetting NOTIFY_SOCKET?

giuseppe · 2020-08-13T15:17:44Z

> Could it be because we're unsetting NOTIFY_SOCKET?

that should not happen (at least I don't see we are using it internally in Podman). I don't think the issue is introduced with this change. Or are those old containers that are configured with --cgroup-mode=split?

vrothberg · 2020-08-13T15:18:50Z

> > Could it be because we're unsetting NOTIFY_SOCKET?
>
> that should not happen (at least I don't see we are using it internally in Podman). I don't think the issue is introduced with this change. Or are those old containers that are configured with --cgroup-mode=split?

They are created with the remote client. Maybe we're missing that in the remote code paths?

giuseppe · 2020-08-13T15:45:32Z

> They are created with the remote client. Maybe we're missing that in the remote code paths?

this patch seems to do the trick:

diff --git a/pkg/api/server/server.go b/pkg/api/server/server.go
index 18b48a3f6..2a3e44e04 100644
--- a/pkg/api/server/server.go
+++ b/pkg/api/server/server.go
@@ -153,6 +153,7 @@ func (s *APIServer) Serve() error {
        signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
        errChan := make(chan error, 1)
 
+       os.Unsetenv("INVOCATION_ID")
        go func() {
                <-s.idleTracker.Done()
                logrus.Debugf("API Server idle for %v", s.idleTracker.Duration)

vrothberg · 2020-08-13T15:54:22Z

> this patch seems to do the trick:

Tested successfully, thanks @giuseppe !

vrothberg · 2020-08-13T15:56:02Z

All right, now that we've ironed out the last remaining fart, I am good to merge. Thanks all!

giuseppe

LGTM

openshift-ci-robot · 2020-08-13T15:57:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

goochjj · 2020-08-13T16:35:11Z

LGTM

TomSweeneyRedHat · 2020-08-13T17:44:53Z

The change LGTM, but it looks like you may have encountered a real error in the tests.

mheon · 2020-08-13T17:49:11Z

F31 looks like one of our remote flakes...

/lgtm
/hold

edsantiago · 2020-08-13T17:50:43Z

yes, it's the #7195 flake

TomSweeneyRedHat · 2020-08-13T17:55:21Z

Thanks, @QiWang19's #7311 suffers the same issue.

edsantiago · 2020-08-13T19:50:22Z

/hold

investigating something. Please do not merge yet.

edsantiago · 2020-08-13T20:48:18Z

All clear. The problem I was looking at turned out to be caused by something else.

rhatdan · 2020-08-13T20:57:58Z

/hold cancel

openshift-ci-robot requested review from baude and TomSweeneyRedHat August 13, 2020 10:58

openshift-ci-robot added the `approved` label Aug 13, 2020

vrothberg force-pushed the fix-7294 branch 3 times, most recently from 9bb10d3 to dc12ec7 August 13, 2020 11:01

giuseppe reviewed Aug 13, 2020

vrothberg force-pushed the fix-7294 branch from dc12ec7 to 32d6675 August 13, 2020 11:03

openshift-ci-robot added the `do-not-merge/hold` label Aug 13, 2020

openshift-ci-robot assigned rhatdan Aug 13, 2020

openshift-ci-robot added the `lgtm` label Aug 13, 2020

edsantiago changed the title ~~podman.service: use sdnotiy~~ podman.service: use sdnotify Aug 13, 2020

edsantiago reviewed Aug 13, 2020

vrothberg force-pushed the fix-7294 branch from 32d6675 to c9a3508 August 13, 2020 14:43

openshift-ci-robot removed the `lgtm` label Aug 13, 2020

vrothberg force-pushed the fix-7294 branch from c9a3508 to 0f4e2be August 13, 2020 15:54

giuseppe approved these changes Aug 13, 2020

edsantiago mentioned this pull request Aug 13, 2020

Containers started using socket-activated APIv2 die from systemd activation timeout #7294

Closed

openshift-ci-robot assigned mheon Aug 13, 2020

openshift-ci-robot added the `lgtm` label Aug 13, 2020

edsantiago mentioned this pull request Aug 13, 2020

CI flake: podman-remote: no output from container (and "does not exist in database"?) #7195

Closed

openshift-ci-robot removed the `do-not-merge/hold` label Aug 13, 2020

openshift-merge-robot merged commit 81499a5 into containers:master Aug 13, 2020

edsantiago mentioned this pull request Aug 13, 2020

system tests: enable sdnotify tests #7317

Merged

github-actions bot added the `locked - please file new issue/PR` label Sep 24, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
