
podman run -d: hangs when $NOTIFY_SOCKET is set #7316

Closed · edsantiago opened this issue Aug 13, 2020 · 22 comments · Fixed by #11246

Labels: kind/bug (Categorizes issue or PR as related to a bug.), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@edsantiago (Member):

# export NOTIFY_SOCKET=/tmp/mypodmansocket
# socat unix-recvfrom:"$NOTIFY_SOCKET",fork system:"(cat;echo)" &
[1] 104770
# podman run -d --sdnotify=container alpine sh -c 'sleep 10'
2020/08/13 16:50:03 socat[104857] E sendto(8, 0x55de50ff26b0, 14, 0, AF=1 "<anon>", 0): Transport endpoint is not connected
95fb97931bbf3c3b773dddff07f9fea80b341293fb39e6d63e3becbd3937099f
2020/08/13 16:50:13 socat[104864] E sendto(8, 0x55de50ff26b0, 14, 0, AF=1 "<anon>", 0): Transport endpoint is not connected

Note the ten-second difference between the socat timestamps; that is because the container runs sleep 10. Change it to sleep 30 and you get a 30-second delay.

What I expected: since this is run -d (detached), I expected podman to detach immediately and let the container deal with sdnotify.

(How I found this: trying to run systemd-notify in a fedora:latest container. Ha ha, silly me; they removed systemd-notify from that image.) Ergo, I think this is counterintuitive behavior when a user has a container that never makes it to sdnotify. I really don't expect podman run -d to hang forever.
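
For contrast, here's a sketch of what I'd expect (assuming an image that still ships systemd-notify, e.g. fedora:31): a container that actually sends the readiness message lets podman detach as soon as it arrives:

# podman run -d --sdnotify=container fedora:31 sh -c 'systemd-notify --ready; sleep 30'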

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Aug 13, 2020
@edsantiago (Member Author):

Oooh! Try running podman ps or even podman info while there's a hung podman run -d. They too will hang. (podman images is fine.)

@mheon (Member) commented Aug 13, 2020:

Is this the same issue as #6688?

@edsantiago (Member Author):

Yes, it looks like a different manifestation of the same problem. I'd say "feel free to close", but given how badly #6688 has been neglected, I'm going to leave it open as a reminder that this really is an unpleasant and unacceptable bug.

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 14, 2020
Oops. PR containers#6693 (sdnotify) added tests, but they were disabled
due to broken crun on f31. I tried for three weeks to get a
magic CI:IMG PR to update crun on the CI VMs ... but in that
time I forgot to actually enable those new tests.

This PR removes a 'skip', replacing it with a check that systemd
is running plus one more to make sure our runtime is crun. It
looks like sdnotify just doesn't work on Ubuntu (it hangs), and
my guess is that it's a crun/runc issue.

I also changed the test image from fedora:latest to :31, because,
sigh, fedora:latest removed the systemd-notify tool.

WARNING WARNING WARNING: the symptom of a missing systemd-notify
is that podman will hang forever, not even stopped by the timeout
command in podman_run! (Filed: containers#7316). This means that if the
sdnotify-in-container test ever fails, the symptom will be that
Cirrus itself will time out (2 hours?). This is horrible. I
don't know what to do about it other than push for a fix for 7316.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@giuseppe (Member):

I don't think it is a bug. podman run waits for the container to notify when it is ready. If the container is never ready, what should we do? It is not even Podman's fault at this point; the OCI runtime is handling the NOTIFY_SOCKET.
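
For context: the sd_notify protocol is just a datagram written to the unix socket named by $NOTIFY_SOCKET. A minimal sketch of what a cooperating container would send to unblock the runtime (assuming socat is available in the image):

# echo -n "READY=1" | socat - UNIX-SENDTO:"$NOTIFY_SOCKET"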

@edsantiago (Member Author):

I'm fine with the container hanging. I'm not fine with podman ps or podman info or other podman commands hanging.

@rhatdan (Member) commented Aug 17, 2020:

Yes, I think we need a way to work around the lock.

@mheon (Member) commented Aug 17, 2020:

We have a way forward via containers/conmon#182 if we can get it landed

@giuseppe (Member):

> We have a way forward via containers/conmon#182 if we can get it landed

I am fine with the solution proposed there, but we would still have the locking issue when --sdnotify=podman is used.
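
Roughly, the idea in containers/conmon#182, as a conceptual sketch rather than conmon's actual code: listen on a per-container notify socket and forward each datagram up to the host's socket, so the OCI runtime never blocks waiting for READY. (The socket path here is hypothetical.)

# socat UNIX-RECVFROM:/run/ctr-notify.sock,fork UNIX-SENDTO:"$NOTIFY_SOCKET" &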

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 18, 2020
Some CI tests are hanging, timing out in 60 or 120 minutes.
I wonder if it's containers#7316, the bug where all podman commands
hang forever if NOTIFY_SOCKET is set?

Signed-off-by: Ed Santiago <santiago@redhat.com>
@rhatdan (Member) commented Sep 11, 2020:

Looks like containers/conmon#182 is ready to go in, but it has not been looked at for 20 days.

github-actions bot: A friendly reminder that this issue had no activity for 30 days.

edsantiago added a commit to edsantiago/libpod that referenced this issue Oct 14, 2020
 - run --userns=keep-id: confirm that $HOME gets set (containers#8013)

 - inspect: confirm that JSON output is a sane number of
   lines (10 or more), not an unreadable one-liner (containers#8011
   and containers#8021). Do so with image, pod, network, volume
   because the code paths might be different.

 - cgroups: confirm that 'run' preserves cgroup manager (containers#7970)

 - sdnotify: reenable tests, and hope CI doesn't hang. This
   test was disabled on August 18 because CI jobs were hanging
   and timing out. My suspicion was that it was containers#7316, which
   in turn seems to have hinged on conmon containers#182. The latter
   was merged on Sep 16, so let's cross our fingers and see
   what happens.

Also: remove inaccurate warning from a networking test.

And, wow, fix is_cgroupsv2(); it has never actually worked.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@rhatdan (Member) commented Dec 24, 2020:

I am continuing to try to get this to pass CI in #8508

github-actions bot: A friendly reminder that this issue had no activity for 30 days.

github-actions bot commented Mar 8, 2021: A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member Author):

Issue still present, fc02d16

github-actions bot: A friendly reminder that this issue had no activity for 30 days.

github-actions bot: A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member Author):

Still present in 3bdbe3c

github-actions bot commented Jul 5, 2021: A friendly reminder that this issue had no activity for 30 days.

github-actions bot commented Aug 5, 2021: A friendly reminder that this issue had no activity for 30 days.

@vrothberg (Member):

IMHO, the only way to prevent hanging is to change the default of --sdnotify from "container" to "ignore"; runc is blocking until it receives the READY message from the container.
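
In the meantime, one can opt out explicitly: with --sdnotify=ignore, podman does not pass NOTIFY_SOCKET down to the OCI runtime, so run -d returns immediately:

# podman run -d --sdnotify=ignore alpine sh -c 'sleep 10'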

@vrothberg (Member):

Or well, once #11246 is merged.

@edsantiago (Member Author):

Confirmed: I no longer see this bug when I apply #11246.

vrothberg pushed a commit to vrothberg/libpod that referenced this issue Aug 20, 2021
This leverages conmon's ability to proxy the SD-NOTIFY socket. This prevents the locking caused by the OCI runtime blocking while waiting for SD-NOTIFY messages, and instead passes the messages directly up to the host.

NOTE: Also re-enable the auto-update tests, which had been disabled due to flakiness. With this change, Podman properly integrates into systemd.

Fixes: containers#7316
Signed-off-by: Joseph Gooch <mrwizard@dok.org>
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023