
copying system image from manifest list: writing/storing blob: happened during read: unexpected EOF #17193

Open
edsantiago opened this issue Jan 23, 2023 · 16 comments
Labels
flakes Flakes from Continuous Integration


@edsantiago
Member

New flake, two instances, both in the same PR at a similar clock time:

# podman-remote --url unix:/tmp/podman_tmp_0PCR pull quay.io/libpod/testimage:multiimage
Trying to pull quay.io/libpod/testimage:multiimage...
Getting image source signatures
Copying blob sha256:9afcdfe780b4ea44cc52d22e3f93ccf212388a90370773571ce034a62e14174e
Error: copying system image from manifest list: writing blob: storing blob to file "/var/tmp/storage4172392440/1": happened during read: unexpected EOF
[ rc=125 (** EXPECTED 0 **) ]

and

 podman [options] pull quay.io/libpod/systemd-image:20230106
         Trying to pull quay.io/libpod/systemd-image:20230106...
         Getting image source signatures
         Copying blob sha256:aa55548baa7f0ff25533b19213c56de832034f22989b884d6ca968c6151dd572
         Error: copying system image from manifest list: writing blob: storing blob to file "/var/tmp/storage3799877059/1": happened during read: unexpected EOF

(I'm not tagging this remote: even though the integration-test failure says remote, the failure actually happened while pulling cache images, which I think is done using non-remote podman.)

@edsantiago added the flakes (Flakes from Continuous Integration) label Jan 23, 2023
@flouthoc
Collaborator

Saw multiple occurrences of this flake while working on #16297 as well.

@vrothberg
Member

vrothberg commented Jan 24, 2023

Looks like a registry flake to me. @mtrmac, I think we need to tweak how we catch EOFs during retry. I did not investigate any further than guessing that https://github.com/containers/common/blob/main/pkg/retry/retry.go#L76-L79 is not enough.
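For illustration only, a minimal sketch of the kind of EOF check being discussed; the function name and the exact matching rules are assumptions, not the actual c/common/pkg/retry code:

```go
package retrysketch

import (
	"errors"
	"io"
	"net"
	"syscall"
)

// isRetryableReadError is a hypothetical predicate: it reports whether an
// error seen while reading a blob looks like a transient network failure
// worth retrying. The real logic in c/common/pkg/retry differs.
func isRetryableReadError(err error) bool {
	if err == nil {
		return false
	}
	// Premature end of the response body ("unexpected EOF").
	if errors.Is(err, io.ErrUnexpectedEOF) || errors.Is(err, io.EOF) {
		return true
	}
	// Connection reset by the peer.
	if errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	// Network-level timeouts.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return false
}
```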

@mtrmac
Collaborator

mtrmac commented Jan 24, 2023

Yes; it’s already on a short-term list as https://issues.redhat.com/browse/RUN-1636.

@mtrmac
Collaborator

mtrmac commented Jan 25, 2023

containers/image#1816 is a WIP trying to improve this.

Note #17221, filed early to collect Podman production data. If that doesn’t catch the flake (BODYREADER DEBUG in the logs), it would be useful to keep retrying until it does.

@edsantiago
Member Author

Just the last four days:

@edsantiago
Member Author

Last five days

@vrothberg
Member

Help is on the way. Let me vendor c/image directly now since @mtrmac's latest changes are expected to fix it.

vrothberg added a commit to vrothberg/libpod that referenced this issue Feb 7, 2023
To fix the registry flakes we see in containers#17193, make c/image more tolerant
toward unexpected EOFs from registries.

Fixes: containers#17193
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Feb 7, 2023
To fix the registry flakes we see in containers#17193, make c/image more tolerant
toward unexpected EOFs from registries.

Also pin docker/docker to v20.10.23+incompatible, as the newer version doesn't
compile; imagebuilder needs to be fixed first.

Fixes: containers#17193
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@mtrmac
Collaborator

mtrmac commented Feb 9, 2023

#17221 was merged; that should help, but we do see failures that don’t fit the current heuristic.

@edsantiago
Member Author

Same failure, but with a new error message (seen in f36 rootless). Is this helpful?

Error: copying system image from manifest list: writing blob: 
      storing blob to file "/var/tmp/storage2897374505/1": happened during read:
       (heuristic tuning data: last retry 0, current offset 752828; 77.493 ms total, 0.629 ms since progress): unexpected EOF

@mtrmac
Collaborator

mtrmac commented Feb 13, 2023

I don’t know of a good way to collect those “heuristic tuning data” messages; in principle they could show us something unexpected. (Real telemetry is… a thought.)

So far all of those have been something like the above: a failure very soon after starting to fetch data, less than a megabyte into the stream (a megabyte being the current cut-off). That’s probably a clear enough indication that we need to allow a retry in that case, so more instances of that kind of failure (last retry 0, small total, small “since progress”) would not make a difference. Other kinds of failure messages would still be interesting.
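To make the heuristic discussed above concrete, here is a sketch under stated assumptions: the field names and the 1 MiB threshold below are illustrative, not the actual c/image bodyReader code.

```go
package bodyreadersketch

import "time"

// retryDecision captures the inputs mentioned in the "heuristic tuning data"
// message above. All names here are illustrative.
type retryDecision struct {
	retries       int           // how many times this read was already restarted
	currentOffset int64         // bytes successfully read so far
	totalTime     time.Duration // time since the read started
	sinceProgress time.Duration // time since the last byte arrived
}

const oneMiB = 1 << 20

// shouldRetryRead sketches the idea: allow a retry on unexpected EOF if the
// failure happened early in the stream (under ~1 MiB) and we have not
// retried too many times already.
func shouldRetryRead(d retryDecision) bool {
	if d.retries >= 3 {
		return false
	}
	// A failure very soon after starting to fetch data, less than a
	// megabyte into the stream, is the case discussed above.
	return d.currentOffset < oneMiB
}
```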

@mtrmac
Collaborator

mtrmac commented Feb 13, 2023

That’s probably a clear enough indication that we need to allow a retry in that case

containers/image#1847 / #17492

@edsantiago
Member Author

Here's one I don't know how to categorize: a TLS handshake failure. It's cdn03.quay, but not a DNS lookup error, so not #16973. If it doesn't belong here, lmk and I'll shove it into #17288 ("Other").

not ok 538 pull image into a full storage
...
$ buildah --root=/tmp/buildah-test pull --signature-policy /usr/share/buildah/test/system/./policy.json alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
parsing image configuration: Get "https://cdn03.quay.io/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
     ?X-Amz-Algorithm=AWS4-HMAC-SHA256
     &X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230206%2Fus-east-1%2Fs3%2Faws4_request
     &X-Amz-Date=20230206T124402Z
     &X-Amz-Expires=600
     &X-Amz-SignedHeaders=host
     &X-Amz-Signature=ce946b86fe8265c8db2dc7695e0758d104a584b385f122433a02cd670bdc1fb6
     &cf_sign=cqeGovWnO5c8L7E3AnZxOTbW%2FTX01Zl%2BiePEnxn7BA7DseHg8wde58MdN3joVi43OlROpIN6RjWGAzn3t0PxI1GN6iTdINMSK7Ngh12EDjvgGLqjr%2FTyDZQ%2BU1Ppb%2BvRuvh%2BwbysezFTFasli%2BUInIVOXFFa9uGUb%2BdAv4FTp6Q3UI8EyHNzycTbcvUgESsX9MOAhu4gUyd3vWqpdNQWRzP1XPBENSfZyHjezr5MMH3A2DzdnTVgu0v8U7iQD06KAdL7ZBH06PXr0ALyOg9Xs7UqAzUpHctyTS9oyuHGbz9nPw3fPGY29Mn1r7PUNE6%2F6wIGNlfJQVsiYLTur73QjQ%3D%3D
     &cf_expiry=1675688042
     &region=us-east-1": remote error: tls: handshake failure

@mtrmac
Collaborator

mtrmac commented Feb 14, 2023

From a not-very-careful check, I’d categorize that as “other”: it is the remote server successfully sending a “handshake failure” TLS alert record, i.e. it’s not an EOF we see from the server. (It may be possible that the connection was disrupted in fundamentally the same way, causing the server to see an EOF from us and to respond with a “handshake failure” alert. I’m not sure.)

@edsantiago
Member Author

I'm seeing a lot of flakes now that say "502 Bad Gateway" instead of "happened during read":

Error: reading blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749: fetching blob: received unexpected HTTP status: 502 Bad Gateway

Is this a new variant of this same issue? Is it #16973 (quay flakes)? Or should I file a new issue?

@mtrmac
Collaborator

mtrmac commented Mar 15, 2023

  • “happened during read” means we successfully connected to the registry, and got status 200; the failure happened while reading the response body.
  • “happened during read … unexpected EOF” then means the body ended prematurely (IIRC violating the Transfer-Encoding: Chunked framing, in particular)
  • This 502 Bad Gateway means the registry responded with this code, instead of status 200.

So, 502 is a different behavior. (We don’t see inside the registry to understand whether it’s the same cause, and it might not matter.)

For context, currently there are two levels of retries:

  • A c/common/pkg/retry. AFAICS that should, by default, retry each pull (as a whole) three times. There should be a warning-level message Failed, retrying in %s ... (%d/%d). Error: %v.
  • A c/image retry. This applies only to the “happened during read” failures, and it is restricted specifically to “unexpected EOF” and “connection reset”. (The idea is that the top-level c/common/pkg/retry drives retries for the pull as a whole, and lower-level libraries should not retry so that we don’t multiply the number of attempts; the “happened during read” failures are an exception because a retry at the higher level means re-downloading the whole layer, while the c/image retry can possibly continue where it left off; see the sketch after this list.)
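A sketch of that lower-level resume idea, assuming a plain HTTP Range request; the function and its error handling are illustrative, not the actual c/image bodyReader:

```go
package resumesketch

import (
	"fmt"
	"net/http"
)

// resumeBlobRead sketches the lower-level retry from the list above: instead
// of re-downloading the whole layer, ask the registry to continue from the
// byte offset we already have, using a standard HTTP Range request. The real
// c/image code is more involved (it also verifies the resumed data).
func resumeBlobRead(client *http.Client, url string, offset int64) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	// 206 Partial Content means the server resumed where we left off;
	// anything else means we cannot safely continue mid-stream.
	if resp.StatusCode != http.StatusPartialContent {
		resp.Body.Close()
		return nil, fmt.Errorf("server did not honor the range request: %s", resp.Status)
	}
	return resp, nil
}
```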

Looking at the code, it seems that a 502 doesn’t trigger the c/image retry (as it shouldn’t), but it also doesn’t trigger c/common/pkg/retry (because the error returned by c/image doesn’t have a recognizable Go type).

So, if the failures don’t trigger a Failed, retrying in warning, please file that separately; we should export the error type in c/image, and handle it in c/common.

(This is under the assumption that retrying could help the operation succeed. I’m not sure about that, nor how to reasonably measure that. Retrying will very likely increase the load on the server, possibly making the failure harder to recover from.)
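Purely as a hypothetical sketch of that suggestion (the type name, fields, and predicate below are illustrative, not the current c/image or c/common API):

```go
package retrystatussketch

import (
	"errors"
	"fmt"
	"net/http"
)

// UnexpectedHTTPStatusError is a hypothetical exported error type that
// c/image could return when a registry responds with an unexpected status
// such as 502 Bad Gateway.
type UnexpectedHTTPStatusError struct {
	StatusCode int
}

func (e *UnexpectedHTTPStatusError) Error() string {
	return fmt.Sprintf("received unexpected HTTP status: %d %s",
		e.StatusCode, http.StatusText(e.StatusCode))
}

// shouldRetryPull sketches how a top-level retry helper (in the spirit of
// c/common/pkg/retry) could recognize that exported type and retry on
// server-side errors like 502/503 while leaving client errors (4xx) alone.
func shouldRetryPull(err error) bool {
	var statusErr *UnexpectedHTTPStatusError
	if errors.As(err, &statusErr) {
		return statusErr.StatusCode >= 500 && statusErr.StatusCode < 600
	}
	return false
}
```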
