Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pasta: port range forwarding: mismatch between data sent and received #17287

Closed
edsantiago opened this issue Jan 30, 2023 · 20 comments · Fixed by #17759
Closed

pasta: port range forwarding: mismatch between data sent and received #17287

edsantiago opened this issue Jan 30, 2023 · 20 comments · Fixed by #17759
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. pasta pasta(1) bugs or features rootless

Comments

@edsantiago
Copy link
Member

not ok 434 podman networking with pasta(1) - TCP port range forwarding, IPv4, loopback
...
$ podman run --net=pasta -p [127.0.0.1]:5776-5778:5776-5778/tcp quay.io/libpod/testimage:20221018 sh -c for port in $(seq 5776 5778); do                              socat -u TCP4-LISTEN:${port},bind=[127.0.0.1] STDOUT &                          done; wait
2023/01/26 17:23:46 socat[119766] E connect(8, AF=2 127.0.0.1:5777, 16): Interrupted system call
xx
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Mismatch between data sent and received
#| expected: = 'xxx'
#|   actual:   'xx'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Also happens in a different test, same symptom (xxx vs xx), same socat EINTR

@edsantiago edsantiago added flakes Flakes from Continuous Integration rootless pasta pasta(1) bugs or features labels Jan 30, 2023
@Luap99
Copy link
Member

Luap99 commented Jan 30, 2023

Sounds like socat needs to handle EINTR. Not sure if there is an option for this.

@edsantiago
Copy link
Member Author

Here's another one, f37 rootless, and in this one the test times out at 60m. Same thing in the first link above, I just hadn't noticed. @sbrivio-rh PTAL.

@sbrivio-rh
Copy link
Collaborator

Sorry, I didn't notice about this issue right away, I'm taking a look.

Somewhat interestingly, socat always fails (gets interrupted) on the third connection (out of three): the expected data includes a number of x characters, each one echoed back by a separate connection, and we're always getting two (xx instead of xxx).

What interrupts socat here (and whether socat should just handle EINTR) is still a mystery to me at the moment.

@sbrivio-rh
Copy link
Collaborator

I can't reproduce this on Fedora 37. I'm looking for a way to make the tests robust to socat getting interrupted on connect(), but I don't have anything working yet.

Alternatively we could try to find out what signal socat is getting, and from whom, but I guess we can't easily strace things on CI systems, right?

I'm wondering, meanwhile, if we should consider a mitigation such as issuing just two connections (instead of three) for any test covering forwarding ranges of (TCP) ports, something like this:

diff --git a/test/system/505-networking-pasta.bats b/test/system/505-networking-pasta.bats
index ab68e1d9d..27c602836 100644
--- a/test/system/505-networking-pasta.bats
+++ b/test/system/505-networking-pasta.bats
@@ -381,11 +381,11 @@ function teardown() {
 }
 
 @test "podman networking with pasta(1) - TCP port range forwarding, IPv4, tap" {
-    pasta_test_do 4 tap      tcp 3 0 "port"      1
+    pasta_test_do 4 tap      tcp 2 0 "port"      1
 }
 
 @test "podman networking with pasta(1) - TCP port range forwarding, IPv4, loopback" {
-    pasta_test_do 4 loopback tcp 3 0 "port"      1
+    pasta_test_do 4 loopback tcp 2 0 "port"      1
 }
 
 @test "podman networking with pasta(1) - Translated TCP port forwarding, IPv4, tap" {
@@ -397,11 +397,11 @@ function teardown() {
 }
 
 @test "podman networking with pasta(1) - TCP translated port range forwarding, IPv4, tap" {
-    pasta_test_do 4 tap      tcp 3 1 "port"      1
+    pasta_test_do 4 tap      tcp 2 1 "port"      1
 }
 
 @test "podman networking with pasta(1) - TCP translated port range forwarding, IPv4, loopback" {
-    pasta_test_do 4 loopback tcp 3 1 "port"      1
+    pasta_test_do 4 loopback tcp 2 1 "port"      1
 }
 
 @test "podman networking with pasta(1) - Address-bound TCP port forwarding, IPv4, tap" {
@@ -431,7 +431,7 @@ function teardown() {
 }
 
 @test "podman networking with pasta(1) - TCP port range forwarding, IPv6, tap" {
-    pasta_test_do 6 tap      tcp 3 0 "port"      1
+    pasta_test_do 6 tap      tcp 2 0 "port"      1
 }
 
 @test "podman networking with pasta(1) - TCP port range forwarding, IPv6, loopback" {
@@ -447,11 +447,11 @@ function teardown() {
 }
 
 @test "podman networking with pasta(1) - TCP translated port range forwarding, IPv6, tap" {
-    pasta_test_do 6 tap      tcp 3 1 "port"      1
+    pasta_test_do 6 tap      tcp 2 1 "port"      1
 }
 
 @test "podman networking with pasta(1) - TCP translated port range forwarding, IPv6, loopback" {
-    pasta_test_do 6 loopback tcp 3 1 "port"      1
+    pasta_test_do 6 loopback tcp 2 1 "port"      1
 }
 
 @test "podman networking with pasta(1) - Address-bound TCP port forwarding, IPv6, tap" {

@edsantiago, @Luap99, can this be tried in a CI run? Or should I submit a pull request before I know if it's sufficient or not?

@edsantiago
Copy link
Member Author

@sbrivio-rh the only way to try that in CI is to submit a PR. You can mark it DRAFT, WIP, and/or DO NOT MERGE for safety.

@sbrivio-rh
Copy link
Collaborator

sbrivio-rh commented Feb 6, 2023

@edsantiago, thanks for clarifying. I submitted a pull request (#17380), CI currently fails with:

ERROR: Failed: /var/tmp/go/src/github.com/containers/podman/.github/actions/check_cirrus_cron/cron_failures.sh with output '::group::Posting GraphQL Query and checking result query: { "query": "query { ownerRepository(platform: \"LINUX\", owner: \"containers\", name: \"podman\") { cronSettings { name lastInvocationBuild { id status } } }}" } ::error file=/var/tmp/go/src/github.com/containers/podman/.github/actions/check_cirrus_cron/cron_failures.sh,line=49::Query result did not pass filter '.data.ownerRepository.cronSettings': '{"data":{"ownerRepository":null}}' ::endgroup:: ' (/var/tmp/go/src/github.com/containers/podman/.github/actions/check_cirrus_cron/test.sh:54 in main())

I can retrigger that a few times I guess? If you want to speed this up maybe you can also skip this (unrelated?) part of the CI there.

@Luap99
Copy link
Member

Luap99 commented Feb 6, 2023

@sbrivio-rh CI is broken at the moment, you need to wait until we fix the issue then likely rebase.

@github-actions
Copy link

github-actions bot commented Mar 9, 2023

A friendly reminder that this issue had no activity for 30 days.

@sbrivio-rh
Copy link
Collaborator

The issue itself is not happening anymore, because I hid it with #17380, but I'd like to revert that and see if the original issue still occurs.

@edsantiago
Copy link
Member Author

Bummer

[+1306s] not ok 453 podman networking with pasta(1) - TCP translated port range forwarding, IPv4, loopback
...
#|     FAIL: Mismatch between data sent and received
#| expected: = xxx
#|   actual:   xx

@sbrivio-rh
Copy link
Collaborator

Bummer

Gah, sorry. I shouldn't have gone ahead with it without actually understanding the test failure. I filed #17808 to restore the previous "safe" behaviour.

To actually investigate this I can probably submit a merge request with the relevant test steps wrapped by strace. I'll do that in a bit.

@edsantiago
Copy link
Member Author

Incidentally, the flake can still happen with the '2' setting: remote f37 rootless.

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@dgibson
Copy link
Collaborator

dgibson commented Jun 28, 2023

I had a look at this, but so far I haven't managed to reproduce it. I reverted @sbrivio-rh 's workaround patch, but the tests all still pass for me. This is with git podman with #19021 applied, and git pasta with a fix for bug 61 applied.

I'm wondering if one of the other fixes we've made has also fixed this as a side effect. Is anyone else able to reproduce this with current versions?

@Luap99
Copy link
Member

Luap99 commented Jun 28, 2023

Usually these flakes only manifest in our CI because it just so much slower than anything run locally which trigger race conditions.
The easiest is to just revert the workaround and see if it fails in CI, @edsantiago can you apply the revert of #17808 to your no flakes retry testing PR. I assume that should give us a result quickly if it still fails.

@sbrivio-rh
Copy link
Collaborator

sbrivio-rh commented Jun 28, 2023

I'm wondering if one of the other fixes we've made has also fixed this as a side effect. Is anyone else able to reproduce this with current versions?

Not really, but I never managed to reproduce this on my setup either (not even by adding delays in tcp_conn_from_tap() in pasta). I guess the only reasonable course of action if this happens again is to have a look at how EINTR is handled on connect() in socat (should _xioopen_connect() handle EINTR in a similar way as EINPROGRESS?)

@dgibson
Copy link
Collaborator

dgibson commented Jun 28, 2023

Usually these flakes only manifest in our CI because it just so much slower than anything run locally which trigger race conditions. The easiest is to just revert the workaround and see if it fails in CI, @edsantiago can you apply the revert of #17808 to your no flakes retry testing PR. I assume that should give us a result quickly if it still fails.

Drat, that's going to make this very hard to track down.

@edsantiago
Copy link
Member Author

I've re-reverted #17808 in my hammer-CI PR (#17831). Two runs so far this morning, unfortunately the flake has not triggered. I will report back if it does.

@edsantiago
Copy link
Member Author

Flake seen today in the wild, not in my PR (that is: with the lower 2 threshold, vs 3). remote f38 rootless. Timeout (hang) at end suggests that BATS is waiting for a zombie process.

[+0896s] not ok 479 podman networking with pasta(1) - TCP translated port range forwarding, IPv4, loopback
         # (from function `assert' in file test/system/helpers.bash, line 774,
         #  from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 156,
         #  in test file test/system/505-networking-pasta.bats, line 403)
         #   `pasta_test_do 4 loopback tcp 2 1 "port"      1' failed
         # [12:26:15.158453057] $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_pSdI rm -t 0 --all --force --ignore
         # [12:26:15.265938796] 20a86057e92d9cb7b5c878aa3cf3b9ec0bc0c345cd2423c1e20aa3db7db9b586
         # [12:26:15.280041470] $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_pSdI ps --all --external --format {{.ID}} {{.Names}}
         # [12:26:15.378607693] $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_pSdI images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # [12:26:15.461813445] quay.io/libpod/testimage:20221018 f5a99120db64
         # [12:26:15.740046832] $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_pSdI run --net=pasta -p [127.0.0.1]:5909-5910:5910-5911/tcp quay.io/libpod/testimage:20221018 sh -c for port in $(seq 5910 5911); do                              socat -u TCP4-LISTEN:${port},bind=[127.0.0.1] STDOUT &                          done; wait
         # 2023/06/27 12:26:16 socat[102788] E connect(8, AF=2 127.0.0.1:5909, 16): Interrupted system call
         # [12:26:17.065663636] x
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: Mismatch between data sent and received
         # #| expected: = xx
         # #|   actual:   x
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@dgibson
Copy link
Collaborator

dgibson commented Jun 29, 2023

Ok, that's interesting. Sounds like the fact it always failed on the third one before was just a quirk of timing, not anything fundamental.

And based on that, I was able to reproduce locally, by increasing the range size to 1000. So, I have something concrete I can debug now.

@dgibson
Copy link
Collaborator

dgibson commented Jun 29, 2023

Ok, I think I know what's going on. It looks like a bug in socat to me, but there is a pretty easy workaround.

Viewing the problem under strace proved surprisingly tricky: I wasn't able to strace the entire bats run, because once I had a large enough number of ports that it reproduced frequently, strace slowed it down enough to just grind to a halt. But with a variety of tricks I eventually managed it.

It appears that the connect() in the "client" socat is being interrupted from a SIGCHLD as the printf x it invoked to generate the input exits. As I say, that seems like a bug in socat, I guess it's some of race.

The workaround is not to use an EXEC: address for socat, but use an OPEN: address, creating a 1-byte temporary file to send instead of generating it on fly with each socat invocation.

I have a draft patch doing that, and I'm running it through some testing now. I have had the test (with 1000 ports) time out once, but I haven't seen it hit the EINTR problem so far.

dgibson added a commit to dgibson/podman that referenced this issue Jun 29, 2023
…rward tests"

This reverts commit c2a24ab, which
itself reverted 1c08f2e, which
reverted e33f4e0.

The original e33f4e0 "pasta: Use two connections instead of three
in TCP range forward tests" was a workaround to avoid intermittent
errors in CI where the pasta networking port range forwarding tests
would fail.  It was reverted and unreverted when we thought we'd fixed
the problem, but that turned out not to be the case.

We're now much more confident that we've genuinely found and fixed (or
at least, worked around) the underlying problem, so we revert it again.

Link: containers#17287

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. pasta pasta(1) bugs or features rootless
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants