
CI flake: podman-remote: no output from container (and "does not exist in database"?) #7195

Closed
edsantiago opened this issue Aug 3, 2020 · 14 comments · Fixed by #7451

Labels: flakes, kind/bug, locked - please file new issue/PR, remote
Comments

@edsantiago

This is a bad report. I have no reproducer nor any real sense for what's going on.

I'm seeing consistent flakes in #7111 . The failing test is always "podman run : user namespace preserved root ownership" which is simply a quick loop of podman run commands. The last set of failures all looked like:

[+0136s] # # podman-remote --url ... run --rm --user=100 --userns=keep-id quay.io/libpod/alpine_labels:latest stat -c %u:%g:%n /etc
[+0136s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0136s] # #|     FAIL: run  --user=100 --userns=keep-id (/etc)   <<<--- these flags are not always the same
[+0136s] # #| expected: '0:0:/etc'
[+0136s] # #|   actual: ''
[+0136s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
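
A minimal sketch of what that test does (not the actual BATS code, and the flag combinations here are illustrative). It loops over podman run invocations and checks that /etc is still reported as root-owned inside the container:

for opts in "" "--user=100" "--userns=keep-id" "--user=100 --userns=keep-id"; do
    actual=$(podman-remote run --rm $opts quay.io/libpod/alpine_labels:latest \
                 stat -c %u:%g:%n /etc)
    if [ "$actual" != "0:0:/etc" ]; then
        echo "FAIL: run $opts (/etc): expected '0:0:/etc', actual '$actual'"
    fi
done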

In two of the three most recent failures, in teardown, there's a podman rm -a -f that barfs with:

[+0136s] # Error: container d0ea34aaeffe07cc3e3f7f79933372a2bab825ef97e17c580eb1e1e94b2ac7e7 does not exist in database: no such container

Logs: fedora 32, fedora 31, special testing rootless.

In an even earlier run, in special_testing_rootless, there was a different error in a different test:

[+0257s] not ok 74 podman volume with --userns=keep-id
[+0257s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman.fpYZ0P run --rm -v /tmp/podman_bats.LsZK3v/volume_O8zRoMsGmt:/vol:z quay.io/libpod/alpine_labels:latest stat -c %u:%s /vol/myfile
[+0257s] # read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer
[+0257s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0257s] # #|     FAIL: w/o keep-id: stat(file in container) == root
[+0257s] # #| expected: '0:0'
[+0257s] # #|   actual: 'read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer'
[+0257s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The socket above is a conmon one. I don't know if this is a conmon problem.

The common factor seems to be --userns=keep-id.

@edsantiago edsantiago added the flakes, kind/bug, and remote labels Aug 3, 2020
@edsantiago

Here's another one, completely different test, but I have a feeling it's related:

# $ podman-remote --url unix:/tmp/podman.n9RjqF start --attach --interactive c5cb2c8d5b2f3e9e04e3bbdab278bd2826fb60e9906d86c18b9e2ef829c7c848
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: output from podman-start on created ctr
# #| expected: '0OeSILw3HySH9MVHmIn5N0aeZpDkC6w1lF5LvQqC'
# #|   actual: ''
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This time there's no --userns=keep-id, but the symptom is the same: the test expected output, and none was received.

@lsm5 lsm5 self-assigned this Aug 6, 2020
@edsantiago

I have a hunch, with zero supporting evidence, that this is another instance:

# #|     FAIL: SELinux role should always be system_r
# #| expected: '.*_u:system_r:.*'
# #|   actual: ''

If my hunch is correct, the common factor might be multiple podman run commands in quick succession. After a while the server gets tired and cranky, refusing to return any more stdout until it gets a nap.

@edsantiago

Here's a sort of reproducer:

# while :;do echo -n .;x=$(podman-remote run alpine echo hi);if [ "$x" != "hi" ]; then echo FAILED: "'$x'";fi;done
...can only attach to created or running containers: container state improper
FAILED: ''
...............can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
......................................can only attach to created or running containers: container state improper
FAILED: ''
............................can only attach to created or running containers: container state improper
FAILED: ''
....can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.^C

I say "sort of" because I haven't before seen the "container state improper" error message. But the symptom sure is the same: I expected output, got empty string.

This is podman @ master with #7312 applied, running server on the systemd socket. I can test on master itself if necessary.
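
For anyone trying to reproduce: roughly what "running server on the systemd socket" means here. The socket path below is the stock root default; rootless setups use $XDG_RUNTIME_DIR/podman/podman.sock instead.

# systemctl start podman.socket
# podman-remote --url unix:/run/podman/podman.sock run alpine echo hi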

@edsantiago

Wow! I was going to add a sleep 1 to the BATS tests, in hopes of minimizing the flake, because it's hitting almost every CI run... but that made it worse, and gave me an even better reproducer:

# while :;do x=$(podman-remote run alpine echo hi);if [ "$x" = "hi" ]; then echo -n .;else echo FAILED: "'$x'";fi;sleep 1;done
can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/c5f9b16912deabef1bd471289a4afadeb4b0afe839b8c2670f3103cedc1f6419/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
can only attach to created or running containers: container state improper
FAILED: ''
.......can only attach to created or running containers: container state improper
FAILED: ''
.read unixpacket @->/var/run/libpod/socket/564f09a28aa8fe025e162c96d9f53b04fbb155c5133b163eb610a3f0390f7d23/attach: read: connection reset by peer
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/cbc2d70b3e45cf221f5707ccc38e788c617cbcdf07e71fe6e8341bd5a9745a9d/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
......can only attach to created or running containers: container state improper
FAILED: ''

Note that this one easily reproduces the read unixpacket error. HTH.

@edsantiago

I think this (from #7314) might be another instance of the bug: basically the common factor is "podman-remote, was expecting output, got nothing".

@edsantiago

cirrus-flake-summarize just reported this:

# # podman-remote --url unix:/tmp/podman.k2CiCr run --rm --pod mypod quay.io/libpod/alpine_labels:latest hostname
# can only attach to created or running containers: container state improper
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: --hostname set the hostname
# #| expected: '6sjaufqrbk.sxjb7jypwe.net'
# #|   actual: 'can only attach to created or running containers: container state improper'
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I can't reproduce on my laptop. I have a hunch that this is the same problem. What I find most interesting is that the podman command did not fail: run_podman checks the exit status and would've barfed before the string check. So we're seeing "container state improper" output (no way to know whether it went to stdout or stderr), none of the expected output, and an exit status of zero.
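
To illustrate why the zero exit status matters, here's a simplified sketch of the run_podman flow (assumed for illustration, not the actual test helper): the exit-status check comes first, so reaching the string comparison at all means podman-remote exited 0 despite producing the wrong output.

run_podman() {
    output=$($PODMAN "$@" 2>&1); status=$?    # $PODMAN: path to the binary under test
    if [ "$status" -ne 0 ]; then
        echo "FAIL: exit status $status from: $PODMAN $*" >&2
        return 1      # bail out here, before any caller ever compares $output
    fi
}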

@edsantiago

Here's a ginkgo flake that appears to be the same underlying problem: all three failures are in the same test (run user capabilities), but each one fails in a different grep:

  1. third "podman run" is empty
  2. first ...
  3. fourth ...

I'm posting this because I've been feeling a little sensitive about it being only the system tests flaking. I think the e2e tests are also seeing this, but their one-two-three-retry behavior may be masking it.

@edsantiago

There are now three instances of podman run --pod mypod ... flaking with "container state improper", as reported above. The instances are:

80 podman pod create - hashtag AllTheOptions

In all cases, exit status is zero.

@edsantiago

There's something bothering me about this. In reviewing my flake logs I see various instances of flakes in the ginkgo tests that could be this problem -- but I'm not seeing any evidence of the scope we've seen over the last two weeks. It has gotten to the point where it's impossible to pass CI, even with me spending my day pressing the Re-run button. What has changed?

@rhatdan

rhatdan commented Aug 27, 2020

I am thinking that if @mheon does not fix this soon, we need to disable these remote tests in order to get the backlog of PRs merged.

@edsantiago

Agreed. I nearly suggested that in Planning yesterday; the commitment to fixing flakes dissuaded me.

@mheon

mheon commented Aug 27, 2020 via email

@mheon

mheon commented Aug 27, 2020

Re-pushed. The exec exit codes issue should theoretically be fixed.

@mheon

mheon commented Aug 27, 2020

If it is, this can finally be closed...

mheon added a commit to mheon/libpod that referenced this issue Aug 27, 2020
Our previous flow was to perform a hijack before passing a
connection into Libpod, and then Libpod would attach to the
container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the
attach completes. As soon as we write the header, the client
assumes that all is ready, and sends a Start request. This Start
may be processed *before* we successfully finish attaching,
causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately,
this requires a downright extensive refactor of the Attach and
HTTP Exec StartAndAttach code. I think the result is an
improvement in some places (a lot more errors will be handled
with a proper HTTP error code, before the hijack occurs) but
other parts, like the relocation of printing container logs, are
just *bad*. Still, we need this fixed now to get CI back into
good shape...

Fixes containers#7195

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 28, 2020
 - pause test: enable when rootless + cgroups v2
   (was previously disabled for all rootless)

 - run --pull: now works with podman-remote
   (in containers#7647, thank you @jwhonce)

 - various other run/volumes tests: try reenabling
   It looks like containers#7195 was fixed (by containers#7451? I'm not
   sure if I'm reading the conversation correctly).
   Anyway, remove all the skip()s on 7195. Only time
   will tell if it's really fixed.

Also:

 - new test for podman image tree --whatrequires
   (because TIL). Doesn't work with podman-remote.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023