CI flake: podman-remote: no output from container (and "does not exist in database"?) #7195
Here's another one, completely different test, but I have a feeling it's related:
This time there's no |
I have a hunch, with zero supporting evidence, that this is another instance:
If my hunch is correct, the common factor might be multiple |
Here's a sort of reproducer:
# while :;do echo -n .;x=$(podman-remote run alpine echo hi);if [ "$x" != "hi" ]; then echo FAILED: "'$x'";fi;done
...can only attach to created or running containers: container state improper
FAILED: ''
...............can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
......................................can only attach to created or running containers: container state improper
FAILED: ''
............................can only attach to created or running containers: container state improper
FAILED: ''
....can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.^C
I say "sort of" because I haven't seen the "container state improper" error message before. But the symptom sure is the same: I expected output, got an empty string. This is podman @ master with #7312 applied, running the server on the systemd socket. I can test on master itself if necessary.
Wow! I was going to add a
# while :;do x=$(podman-remote run alpine echo hi);if [ "$x" = "hi" ]; then echo -n .;else echo FAILED: "'$x'";fi;sleep 1;done
can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/c5f9b16912deabef1bd471289a4afadeb4b0afe839b8c2670f3103cedc1f6419/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
can only attach to created or running containers: container state improper
FAILED: ''
.......can only attach to created or running containers: container state improper
FAILED: ''
.read unixpacket @->/var/run/libpod/socket/564f09a28aa8fe025e162c96d9f53b04fbb155c5133b163eb610a3f0390f7d23/attach: read: connection reset by peer
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/cbc2d70b3e45cf221f5707ccc38e788c617cbcdf07e71fe6e8341bd5a9745a9d/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
......can only attach to created or running containers: container state improper
FAILED: ''
Note that this one easily reproduces the |
cirrus-flake-summarize just reported this:
I can't reproduce on my laptop. I have a hunch that this is the same problem. What I find most interesting is that the podman command did not fail.
Here's a ginkgo flake that appears to be the same underlying problem: all three failures are in the same test (run user capabilities), but each one fails in a different
I'm posting this because I've been feeling a little sensitive about it being only the system tests flaking. I think the e2e tests are also seeing this, but their one-two-three-retry thing may be masking it.
There are now three instances of "80 podman pod create - hashtag AllTheOptions".
In all cases, exit status is zero.
There's something bothering me about this. In reviewing my flake logs I see various instances of flakes in the ginkgo tests that could be this problem -- but I'm not seeing any evidence of the scope we've seen over the last two weeks. It has gotten to the point where it's impossible to pass CI, even with me spending my day pressing the Re-run button. What has changed?
I am thinking that if @mheon does not fix this soon, we need to disable these remote tests in order to get the backlog of PRs merged.
Agreed. I nearly suggested that in Planning yesterday; the commitment to fixing flakes dissuaded me.
I think I am close, but it keeps developing new and exciting failure modes to convince me otherwise.
Re-pushed. The exec exit codes issue should theoretically be fixed.
If it is, this can finally be closed...
Our previous flow was to perform a hijack before passing a connection into Libpod, and then Libpod would attach to the container's attach socket and begin forwarding traffic. A problem emerges: we write the attach header as soon as the attach completes. As soon as we write the header, the client assumes that all is ready, and sends a Start request. This Start may be processed *before* we successfully finish attaching, causing us to lose output.
The solution is to handle hijacking inside Libpod. Unfortunately, this requires a downright extensive refactor of the Attach and HTTP Exec StartAndAttach code. I think the result is an improvement in some places (a lot more errors will be handled with a proper HTTP error code, before the hijack occurs), but other parts, like the relocation of printing container logs, are just *bad*. Still, we need this fixed now to get CI back into good shape...
Fixes containers#7195
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
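To make the ordering concrete, here is a minimal Go sketch of the idea in that commit message. It is not Podman's actual code: the route, the `attachSocketPath` helper, and the exact upgrade header are made up for illustration. It only shows the key invariant of the fix: connect to the container's attach socket *before* hijacking the HTTP connection and writing the success header, so a Start request triggered by that header cannot race ahead of the attach.

```go
package main

// Illustrative sketch only, not Podman's implementation.
import (
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
)

// attachSocketPath is a hypothetical helper; the real server derives the
// path from the container's state.
func attachSocketPath(r *http.Request) string {
	return "/var/run/libpod/socket/" + r.URL.Query().Get("id") + "/attach"
}

func attachHandler(w http.ResponseWriter, r *http.Request) {
	// 1. Attach to the container first. If this fails, nothing has been
	//    hijacked yet, so we can still return a proper HTTP error code.
	containerConn, err := net.Dial("unixpacket", attachSocketPath(r))
	if err != nil {
		http.Error(w, err.Error(), http.StatusConflict)
		return
	}

	// 2. Only now take over the client connection.
	hj, ok := w.(http.Hijacker)
	if !ok {
		containerConn.Close()
		http.Error(w, "connection does not support hijacking", http.StatusInternalServerError)
		return
	}
	clientConn, buf, err := hj.Hijack()
	if err != nil {
		containerConn.Close()
		return
	}
	defer clientConn.Close()
	defer containerConn.Close()

	// 3. Write the "attach ready" header. The client may send Start the
	//    instant it sees this, but the attach is already established, so
	//    no early output can be lost.
	fmt.Fprint(buf, "HTTP/1.1 101 UPGRADED\r\nConnection: Upgrade\r\nUpgrade: tcp\r\n\r\n")
	buf.Flush()

	// 4. Proxy traffic in both directions until either side closes.
	go io.Copy(containerConn, buf)     // client -> container (stdin)
	io.Copy(clientConn, containerConn) // container -> client (stdout/stderr)
}

func main() {
	http.HandleFunc("/attach", attachHandler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```

The design point is simply that step 1 happens before step 3; in the old flow the header effectively went out before the attach was in place, which is the window where `echo hi` output could vanish.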
- pause test: enable when rootless + cgroups v2 (was previously disabled for all rootless)
- run --pull: now works with podman-remote (in containers#7647, thank you @jwhonce)
- various other run/volumes tests: try reenabling
It looks like containers#7195 was fixed (by containers#7451? I'm not sure if I'm reading the conversation correctly). Anyway, remove all the skip()s on 7195. Only time will tell if it's really fixed.
Also:
- new test for podman image tree --whatrequires (because TIL). Doesn't work with podman-remote.
Signed-off-by: Ed Santiago <santiago@redhat.com>
This is a bad report. I have no reproducer nor any real sense for what's going on.
I'm seeing consistent flakes in #7111. The failing test is always "podman run : user namespace preserved root ownership", which is simply a quick loop of podman run commands. The last set of failures all looked like:
In two of the three most recent failures, in teardown, there's a podman rm -a -f that barfs with:
Logs: fedora 32, fedora 31, special testing rootless.
In an even earlier run, in special_testing_rootless, there was a different error in a different test:
The socket above is a conmon one. I don't know if this is a conmon problem.
The common factor seems to be --userns=keep-id.