Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reenable remote system tests #7111

Conversation

edsantiago
Copy link
Member

podman-remote is in better shape now. Let's see what needs
to be done to reenable remote system tests.

Signed-off-by: Ed Santiago santiago@redhat.com

@edsantiago edsantiago force-pushed the reenable_remote_system_tests branch 4 times, most recently from eb216a0 to f27d71e Compare July 29, 2020 16:56
@edsantiago
Copy link
Member Author

This is uncomfortable: the remote test timed out but with no actual log of how far it got:

+ make remotesystem
# Wipe existing config, database, and cache: start with clean slate.
rm -f -rf /home/some13225dude/.local/share/containers /home/some13225dude/.config/containers
# Start podman server using tmp socket; loop-wait for it;
# test podman-remote; kill server, clean up tmp socket file.
# podman server spews copious unhelp
Timed out!

Restarted in hopes that it's a purely-coincidental infrastructure flake.

@edsantiago
Copy link
Member Author

@cevich I need help here please. The failures look like real timeouts, but the logs are useless (incomplete). I have to assume there's a buffering problem, as if stdout/stderr aren't getting flushed properly. Does that sound familiar? Can you think of any way we can get full logs?

@cevich
Copy link
Member

cevich commented Jul 30, 2020

@edsantiago taking a peek...

@cevich
Copy link
Member

cevich commented Jul 30, 2020

@edsantiago this is a tricky one! The main thing that stands out to me here is the size of the onion. Maybe try peeling back a layers? For example, the wait/retry loop in a makefile is two complexities wrapped in an enigma. I know make has it's fingers into the stdio and exit-code pie, what happens w/o make being involved?

@edsantiago edsantiago force-pushed the reenable_remote_system_tests branch 3 times, most recently from 63e9478 to d7bcb71 Compare July 30, 2020 23:44
@edsantiago
Copy link
Member Author

I think the Ubuntu issue is solved: Ubuntu /bin/sh is not bash, so [[ and &> weren't working. Those are resolved, and Ubuntu 19 & 20 tests now passed.

I see two alarming failures in special_testing_rootless:

[+0257s] not ok 74 podman volume with --userns=keep-id
[+0257s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman.fpYZ0P run --rm -v /tmp/podman_bats.LsZK3v/volume_O8zRoMsGmt:/vol:z quay.io/libpod/alpine_labels:latest stat -c %u:%s /vol/myfile
[+0257s] # read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer
[+0257s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0257s] # #|     FAIL: w/o keep-id: stat(file in container) == root
[+0257s] # #| expected: '0:0'
[+0257s] # #|   actual: 'read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer'
[+0257s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-------------------------------------

[+0297s] not ok 88 sensitive mount points are masked without --privileged
[+0297s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman.fpYZ0P run --rm quay.io/libpod/alpine_labels:latest stat -c%n:%F:%h:%T:%t /dev/null /proc/acpi /proc/kcore /proc/keys /proc/latency_stats /proc/timer_list /proc/sched_debug /proc/scsi /sys/firmware /sys/fs/selinux /sys/dev
[+0297s] # can only attach to created or running containers: container state improper
[+0297s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0297s] # #| FAIL: can only attach to created or running containers: Unknown file type ' container state improper'
[+0297s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The first: where is that /run/user/.../socket coming from??? The podman command is very clearly invoked with --url. What is going on here?

The second, well, I'm just going to disable that test for podman-remote - but what is happening here too? It's just a podman run, no attach anywhere. I think this one is a race condition, but I'm not going to dive into it now. I want to document these now and hope we don't ignore them.

@edsantiago edsantiago force-pushed the reenable_remote_system_tests branch from d7bcb71 to b51cd63 Compare July 31, 2020 01:26
@edsantiago
Copy link
Member Author

Okay, I don't think podman-remote is ready for prime time. These failures are weird. Giving up for the week.

@cevich
Copy link
Member

cevich commented Jul 31, 2020

Ubuntu /bin/sh is not bash

I can't tell you the number of times this catches me...a genuine PITA for no good reason IMHO...or at least no good reason I'm aware of. I'm glad to read that you got past the odd hanging/garbage problem.

The podman command is very clearly invoked with --url. What is going on here?

Also odd, I wonder if we're running into our old IFS bats bug again? IIRC, that was causing command-line args to not get passed through properly. In any case, it should be possible reasonably easy to run the command manually and see if it reproduces, no?

I think this one is a race condition

That occurred to me as well, like perhaps it's exiting just at the wrong time. Could it be due to a hiccup talking to quay.io maybe? Otherwise, given it's proximity to the first failure, I wonder if the two are somehow related. Is it possible the remote-end got wedged or crashed in some way causing both failures?

@edsantiago
Copy link
Member Author

edsantiago commented Aug 3, 2020

Just a marker to list the issues found and filed (so far) by this PR: #7116 #7117 #7119 #7122 #7123 #7128 #7135 #7136 #7137 #7139 #7195

podman-remote is in better shape now. Let's see what needs
to be done to reenable remote system tests.

 - logs test: skip multilog, it doesn't work remote

 - diff test: use -l only when local, not with remote

 - many other tests: skip_if_remote, with 'FIXME: pending #xxxx'
   where xxxx is a filed issue.

Unrelated: added new helper to skip_if_remote and _if_rootless,
where we check if the source message includes "remote"/"rootless"
and insert it if missing. This is a minor usability enhancement
to make it easier to understand at-a-glance why a skip triggers.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago
Copy link
Member Author

This is ready for review - or at least as close as I can get it. Unfortunately, I just don't think podman-remote is ready for the world: it's not reliable. This PR has seen a history of flakes, most particularly what I believe is a race condition in which the client never sees output (#7195). Until that is resolved, I think it's a bad idea to enable podman-remote tests in CI: we will get a ton of flakes. But I also think it's a bad idea to leave podman-remote untested. Comments welcome.

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
@baude @jwhonce @mheon PTAL

@vrothberg
Copy link
Member

Thanks, @edsantiago! Your findings are concerning and suggest to dedicate some extra attention to podman-remote soon.

@rhatdan
Copy link
Member

rhatdan commented Aug 4, 2020

/approve
@edsantiago I don't really understand the comments, are you saying you don't want this PR merged?

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2020
@edsantiago
Copy link
Member Author

@rhatdan I'm ambivalent. If we merge it, there is going to be a lot of developer pain because of frequent flakes and reruns. But I also don't see any incentive to fix podman-remote, so if we don't merge it I can see us going another six months without testing and with more bugs creeping into podman-remote.

I lean toward merging, I just want everyone to know that it's going to be painful if we do. (Until the flake bug is fixed).

@rhatdan
Copy link
Member

rhatdan commented Aug 4, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2020
@openshift-merge-robot openshift-merge-robot merged commit 6aed107 into containers:master Aug 4, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants