RFE: Retry individual layer copies #1145

Closed
Gottox opened this issue Nov 28, 2020 · 23 comments
Labels
kind/feature A request for, or a PR adding, new functionality

Comments


Gottox commented Nov 28, 2020

/kind bug

I'm unable to pull certain images from Docker Hub.

Steps to reproduce the issue:

  1. podman pull octoprint/octoprint

  2. wait

  3. get the following error:

  read tcp 192.168.155.178:46000->104.18.123.25:443: read: connection reset by peer
Error: unable to pull octoprint/octoprint: 1 error occurred:
	* Error writing blob: error storing blob to file "/var/tmp/storage241651785/6": read tcp 192.168.155.178:46000->104.18.123.25:443: read: connection reset by peer

Describe the results you received:

After a while I get a TCP connection reset error. This is the complete output of the run:

~ podman pull octoprint/octoprint
Trying to pull docker.io/octoprint/octoprint...
Getting image source signatures
Copying blob dc97f433b6ed done  
Copying blob cb732bb8dce0 done  
Copying blob bb79b6b2107f done  
Copying blob d8634511c1f0 done  
Copying blob 0065b4712c38 done  
Copying blob e4ab79a0ba11 done  
Copying blob 3d27de0ca1e3 done  
Copying blob 093359127cd2 done  
Copying blob 978d8fd02815 done  
Copying blob b8abb99ca1cc done  
Copying blob e1cde2378a2b done  
Copying blob 1ee9298ab334 done  
Copying blob 35e30c3f3e2b [======================================] 2.6MiB / 2.6MiB
Copying blob 38a288cfa675 [====================================>-] 287.0KiB / 293.3KiB
Copying blob 7fcd1d230ec9 done  
Copying blob 30e377ec29ea done  
Getting image source signatures
Copying blob dc97f433b6ed done  
Copying blob d8634511c1f0 done  
Copying blob cb732bb8dce0 done  
Copying blob bb79b6b2107f done  
Copying blob 3d27de0ca1e3 done  
Copying blob 35e30c3f3e2b done  
Copying blob 0065b4712c38 [======================================] 1.8MiB / 1.8MiB
Copying blob e4ab79a0ba11 done  
Copying blob 978d8fd02815 done  
Copying blob 093359127cd2 [======================================] 11.4MiB / 11.4MiB
Copying blob 1ee9298ab334 done  
Copying blob b8abb99ca1cc [======================================] 1.3MiB / 1.4MiB
Copying blob e1cde2378a2b [=============================>--------] 1.8KiB / 2.2KiB
Copying blob 38a288cfa675 [====================================>-] 288.0KiB / 293.3KiB
Copying blob 7fcd1d230ec9 done  
Copying blob 30e377ec29ea done  
Getting image source signatures
Copying blob bb79b6b2107f done  
Copying blob dc97f433b6ed done  
Copying blob 3d27de0ca1e3 [======================================] 242.1MiB / 242.1MiB
Copying blob 35e30c3f3e2b [======================================] 2.6MiB / 2.6MiB
Copying blob 0065b4712c38 done  
Copying blob cb732bb8dce0 [======================================] 10.1MiB / 10.1MiB
Copying blob d8634511c1f0 done  
Copying blob e4ab79a0ba11 done  
Copying blob 093359127cd2 done  
Copying blob 978d8fd02815 done  
Copying blob 1ee9298ab334 [======================================] 42.3MiB / 42.3MiB
Copying blob b8abb99ca1cc done  
Copying blob e1cde2378a2b done  
Copying blob 38a288cfa675 done  
Copying blob 7fcd1d230ec9 [======================================] 183.9KiB / 184.1KiB
Copying blob 30e377ec29ea done  
Getting image source signatures
Copying blob 35e30c3f3e2b done  
Copying blob 3d27de0ca1e3 done  
Copying blob d8634511c1f0 done  
Copying blob bb79b6b2107f done  
Copying blob dc97f433b6ed done  
Copying blob cb732bb8dce0 [======================================] 10.1MiB / 10.1MiB
Copying blob 0065b4712c38 done  
Copying blob e4ab79a0ba11 [======================================] 1.7MiB / 1.7MiB
Copying blob 093359127cd2 done  
Copying blob 978d8fd02815 done  
Copying blob 1ee9298ab334 done  
Copying blob b8abb99ca1cc [======================================] 1.4MiB / 1.4MiB
Copying blob e1cde2378a2b done  
Copying blob 7fcd1d230ec9 [====================================>-] 179.7KiB / 184.1KiB
Copying blob 30e377ec29ea done  
Copying blob 38a288cfa675 [====================================>-] 287.0KiB / 293.3KiB
  read tcp 192.168.155.178:46000->104.18.123.25:443: read: connection reset by peer
Error: unable to pull octoprint/octoprint: 1 error occurred:
	* Error writing blob: error storing blob to file "/var/tmp/storage241651785/6": read tcp 192.168.155.178:46000->104.18.123.25:443: read: connection reset by peer

Describe the results you expected:

The image should be pulled successfully
Additional information you deem important (e.g. issue happens only occasionally):

Smaller images may take an unusual amount of time, but they're pulled successfully:

~ podman pull busybox            
Trying to pull docker.io/library/busybox...
Getting image source signatures
Copying blob 5f5dd3e95e9f [====================================>-] 734.5KiB / 746.7KiB
Getting image source signatures
Copying blob 5f5dd3e95e9f done  
Copying config dc3bacd8b5 done  
Writing manifest to image destination
Storing signatures
dc3bacd8b5ea796cea5d6070c8f145df9076f26a6bc1c8981fd5b176d37de843

Output of podman version:

~ podman version
Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Thu Jan  1 01:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

podman info --debug 
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: Unknown
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.21, commit: unknown'
  cpus: 8
  distribution:
    distribution: '"void"'
    version: unknown
  eventLogger: file
  hostname: rootkit
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 1065536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 1065536
      size: 65536
  kernel: 5.9.8_1
  linkmode: dynamic
  memFree: 232652800
  memTotal: 16423014400
  ociRuntime:
    name: runc
    package: Unknown
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.1.6
      commit: a995c1642ee9a59607dccf87758de586b501a800
      libslirp: 4.3.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 33961799680
  swapTotal: 34359734272
  uptime: 10h 48m 41.26s (Approximately 0.42 days)
registries:
  search:
  - docker.io
store:
  configFile: /home/tox/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 0
    stopped: 3
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/tox/.local/share/containers/storage
  graphStatus: {}
  imageStore:
    number: 44
  runRoot: /run/user/1000
  volumePath: /home/tox/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Thu Jan  1 01:00:00 1970
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

~ xbps-query podman
architecture: x86_64
filename-sha256: 4644867b3b093c0f2fbc3c95b70dffb44176bb6b14dfba44e43f10128e4b1d65
filename-size: 20MB
homepage: https://podman.io/
install-date: 2020-10-16 12:20 CEST
installed_size: 48MB
license: Apache-2.0
maintainer: Cameron Nemo <cnemo@tutanota.com>
metafile-sha256: 3a4c5202a0133b8ffe927dc9a9f4e732a782f3c7f1f28afce60f0baa73f887b5
pkgname: podman
pkgver: podman-2.1.1_1
repository: https://alpha.de.repo.voidlinux.org/current
run_depends:
	runc>=0
	conmon>=0
	cni-plugins>=0
	slirp4netns>=0
	containers.image>=0
	glibc>=2.29_1
	libgpgme>=1.12.0_2
	libassuan>=2.0.1_1
	libgpg-error>=1.6_1
	libseccomp>=2.0.0_1
	device-mapper>=2.02.110_1
shlib-requires:
	libpthread.so.0
	libgpgme.so.11
	libassuan.so.0
	libgpg-error.so.0
	libseccomp.so.2
	librt.so.1
	libdevmapper.so.1.02
	libc.so.6
short_desc: Simple management tool for containers and images
source-revisions: podman:b990889240
state: installed

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Local box with voidlinux x86_64 glibc


Gottox commented Nov 28, 2020

After debugging a bit more, I see the same errors with Docker too. But in contrast to Podman, Docker retries failed downloads, so this error was hidden.


rhatdan commented Nov 28, 2020

Podman is supposed to retry 3 times on network failures now. @QiWang19 Correct?
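
For reference, callers such as Podman and Skopeo wrap the whole pull in containers/common's retry helper. A minimal sketch of that pattern follows; the exact helper names and option fields have changed across c/common versions, so treat retry.RetryIfNecessary and retry.RetryOptions as an approximation, and note that this restarts the entire operation rather than resuming individual layers:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/containers/common/pkg/retry"
)

// pullImage stands in for the actual c/image copy.Image call a caller makes.
func pullImage(ctx context.Context) error {
	return errors.New("read: connection reset by peer") // simulated transient failure
}

func main() {
	ctx := context.Background()
	opts := retry.RetryOptions{
		MaxRetry: 3,               // "retry 3 times on network failures"
		Delay:    2 * time.Second, // pause between attempts
	}
	// The helper re-runs the operation when the error looks retryable
	// (which errors should qualify is exactly the debate in the comments below).
	if err := retry.RetryIfNecessary(ctx, func() error {
		return pullImage(ctx)
	}, &opts); err != nil {
		fmt.Println("pull failed after retries:", err)
	}
}
```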


mheon commented Nov 28, 2020 via email


Gottox commented Dec 6, 2020

The root cause is somewhere between Vodafone DE and Docker Hub. Nevertheless, Podman could handle these issues better than it currently does.


rhatdan commented Dec 7, 2020

@mtrmac @vrothberg Could we do better?

vrothberg commented:

Auto-retries seem like a recurring issue. Maybe that is something c/image could handle. Many users have reported that a given issue does not occur on Docker but does on Podman, so it seems that Docker may have found a sweet spot of retries, making it slightly more robust to transient network failures.

@mtrmac what do you think?


mtrmac commented Dec 8, 2020

I guess? It’s useful, probably possible, OTOH not quite trivial:

  • The retry logic should be conservative and safe: retry only on specific clearly identified errors where retrying makes sense; the current retry logic has turned out to be too broad in a few cases. (I don’t know how far to take this… why is “connection reset by peer”, caused by a TCP packet protected by checksum, not a legitimate final error state, but a reason to retry? It probably is a reason to retry but it’s not trivially obvious from first principles. Earlier there was a report from someone having a DNS server authoritatively reply “name does not exist”, that was rather too much I think. There are also some reports of io.EOF failures on slow networks, which are logic bugs somewhere but it is so far unknown where.) (See the sketch after this list.)
  • Even with best effort, different callers may have different preferences. IIRC c/common/retry was created partially because defining a good API looked difficult (or was it just because there were also retry reasons higher up the stack? maybe both). Hopefully just being very conservative could keep c/image out of that business.
  • c/image must account for callers’ timeouts/retries somehow, so that 5 retries doesn’t turn into 25. It should be easy… enough… (assuming a strict version sync of the two? hopefully not an API) to change c/common/pkg/retry not to retry the same causes yet again. I don’t know what that means for non-GitHub/containers callers, if anything. Maybe we just make the change and let them deal with that, I guess… Or drastically improve our release notes, we should do that anyway.
  • Retrying the whole copyLayer pipeline might be a fairly invasive change — both in copy and more worryingly in transports that are currently guaranteed that layers arrive via PutLayer one at a time, in order. Or it might turn out to be easy.
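
To make the first bullet concrete, here is a minimal, hypothetical Go sketch of such a conservative allow-list; it is not the actual c/image classification code, just an illustration of retrying only on a short list of clearly identified transport errors:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// isRetryableReadError is a hypothetical allow-list classifier: it returns true
// only for errors that clearly indicate the transfer was cut off mid-stream,
// and false for everything else (DNS "no such host", local out-of-disk-space,
// authentication failures, ...), which should remain final.
func isRetryableReadError(err error) bool {
	switch {
	case errors.Is(err, io.ErrUnexpectedEOF):
		// The body ended before the announced length was fully read.
		return true
	case errors.Is(err, syscall.ECONNRESET):
		// "read: connection reset by peer"
		return true
	default:
		// Timeouts are usually worth one more attempt; any other error is not.
		var netErr net.Error
		return errors.As(err, &netErr) && netErr.Timeout()
	}
}

func main() {
	// A wrapped ECONNRESET, as the net package would surface it:
	err := &net.OpError{Op: "read", Err: os.NewSyscallError("read", syscall.ECONNRESET)}
	fmt.Println(isRetryableReadError(err))                                   // true
	fmt.Println(isRetryableReadError(errors.New("no space left on device"))) // false
}
```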


Gottox commented Dec 10, 2020

Retry logic is pretty easy to do in a safe way: just retry if any data has been received.


mtrmac commented Dec 10, 2020

That fails if we run out of disk space while storing the pulled data (but aborting causes the temporary copy to be deleted), a case we have already encountered.


Gottox commented Dec 10, 2020

I'm not familiar with Podman's source yet. I'll look into it soon.


mtrmac commented Dec 10, 2020

Another way to say this: non-targeted heuristics like this are exactly the kind of rule I think should be avoided; target only “specific, clearly identified errors”.

github-actions commented:

A friendly reminder that this issue had no activity for 30 days.

vrothberg transferred this issue from containers/podman on Feb 11, 2021
vrothberg commented:

Moved the issue over to containers/image, where the work needs to happen.


grunlab commented Nov 24, 2021

I have exactly the same issue:

From the same host, with both docker and podman available and using the same proxy settings (company proxy), pulling the image mariadb:10.6.5, for example, succeeds using docker and fails using podman:

Docker:

# https_proxy=http://<user>:<password>@<proxy_server>:<proxy_port> docker pull docker.io/mariadb:10.6.5
Trying to pull repository docker.io/library/mariadb ...
10.6.5: Pulling from docker.io/library/mariadb
7b1a6ab2e44d: Pull complete
034655750c88: Pull complete
f0b757a2a0f0: Pull complete
4bbcce26bc5e: Pull complete
04f220ee9266: Pull complete
89c8a77f7842: Pull complete
d1de5652303b: Pull complete
ef669123e59e: Pull complete
e5cec468d3a6: Pull complete
b14b1ba1d651: Pull complete
Digest: sha256:0f04ae6f30c5a3295fb7cc9be5780c15ff21d6028f999b19f5803114c1e8559e
Status: Downloaded newer image for docker.io/mariadb:10.6.5

Podman:

# https_proxy=http://<user>:<password>@<proxy_server>:<proxy_port> podman pull docker://docker.io/mariadb:10.6.5
Trying to pull docker.io/library/mariadb:10.6.5...
Getting image source signatures
Copying blob 034655750c88 done
Copying blob 04f220ee9266 done
Copying blob 89c8a77f7842 done
Copying blob 4bbcce26bc5e done
Copying blob f0b757a2a0f0 done
Copying blob 7b1a6ab2e44d [>-------------------------------------] 775.2KiB / 27.2MiB
Copying blob ef669123e59e done
Copying blob d1de5652303b done
Copying blob e5cec468d3a6 done
Copying blob b14b1ba1d651 done
WARN[0030] failed, retrying in 1s ... (1/3). Error: Error writing blob: error storing blob to file "/var/tmp/storage528270247/6": error happened during read: read tcp 10.123.105.131:33740->10.126.240.5:3128: read: connection reset by peer

I noticed that when I launch the podman pull several times consecutively, it is always the copy of the same layer, 7b1a6ab2e44d, which is interrupted and never completes!?

mtrmac changed the title from “read: connection reset by peer” to “RFE: Retry individual layer copies” on Nov 25, 2021

grunlab commented Nov 25, 2021

I've tried to pull the image using skopeo, specifying a format (OCI or v2s2) ... same issue ... so this is not linked to the format:

  • Format OCI:
# https_proxy=http://<user>:<password>@<proxy_server>:<proxy_port> skopeo copy --format oci docker://docker.io/mariadb:10.6.5 containers-storage:mariadb:10.6.5
Getting image source signatures
Copying blob 04f220ee9266 done
Copying blob 034655750c88 done
Copying blob 89c8a77f7842 done
Copying blob 4bbcce26bc5e done
Copying blob 7b1a6ab2e44d [>-------------------------------------] 734.3KiB / 27.2MiB
Copying blob f0b757a2a0f0 done
Copying blob d1de5652303b done
Copying blob ef669123e59e done
Copying blob b14b1ba1d651 done
Copying blob e5cec468d3a6 done
FATA[0026] Error writing blob: error storing blob to file "/var/tmp/storage576120080/5": error happened during read: unexpected EOF
  • Format v2s2:
# https_proxy=http://<user>:<password>@<proxy_server>:<proxy_port> skopeo copy --format v2s2 docker://docker.io/mariadb:10.6.5 containers-storage:mariadb:10.6.5
Getting image source signatures
Copying blob 04f220ee9266 done
Copying blob 4bbcce26bc5e done
Copying blob f0b757a2a0f0 done
Copying blob 034655750c88 done
Copying blob 89c8a77f7842 done
Copying blob 7b1a6ab2e44d [>-------------------------------------] 750.3KiB / 27.2MiB
Copying blob d1de5652303b done
Copying blob ef669123e59e done
Copying blob b14b1ba1d651 done
Copying blob e5cec468d3a6 done
FATA[0027] Error writing blob: error storing blob to file "/var/tmp/storage713825139/6": error happened during read: unexpected EOF

When pulling the image using docker in debug mode, we can see that docker is resuming (so even better than just retrying) the copy of the layer 7b1a6ab2e44d:

# https_proxy=http://<user>:<password>@<proxy_server>:<proxy_port> docker -D --log-level debug pull docker.io/library/mariadb:10.6.5
Trying to pull repository docker.io/library/mariadb ...
10.6.5: Pulling from docker.io/library/mariadb
7b1a6ab2e44d: Pull complete
034655750c88: Pull complete
f0b757a2a0f0: Pull complete
4bbcce26bc5e: Pull complete
04f220ee9266: Pull complete
89c8a77f7842: Pull complete
d1de5652303b: Pull complete
ef669123e59e: Pull complete
e5cec468d3a6: Pull complete
b14b1ba1d651: Pull complete
Digest: sha256:0f04ae6f30c5a3295fb7cc9be5780c15ff21d6028f999b19f5803114c1e8559e
Status: Downloaded newer image for docker.io/mariadb:10.6.5
# journalctl -u docker | grep resume
Nov 25 14:46:59 fr0-vsiaas-9358 dockerd-current[73793]: time="2021-11-25T14:46:59.622406615+01:00" level=debug msg="attempting to resume download of \"sha256:7b1a6ab2e44dbac178598dabe7cff59bd67233dba0b27e4fbd1f9d4b3c877a54\" from 653653 bytes"

So there is certainly an issue at the proxy level (cache?), but resuming the download of the impacted layer allows the image pull to complete.
It would be nice to implement the same feature at the cri-o/podman/buildah/skopeo level.
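
For illustration, resuming like Docker does boils down to re-requesting the blob with an HTTP Range header starting at the offset already written, and accepting only a 206 Partial Content reply. A minimal sketch, assuming the registry or proxy honors Range requests; blobURL and destPath are placeholders, and a real client would also re-verify the blob digest over the completed file:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// resumeBlobDownload appends the missing tail of blobURL to destPath.
func resumeBlobDownload(blobURL, destPath string) error {
	f, err := os.OpenFile(destPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}
	offset := info.Size() // bytes already on disk from the interrupted attempt

	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return err
	}
	// Ask the server for the remainder of the blob only.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// 206 means the server honored the Range header; a plain 200 would mean
	// it is sending the whole blob again from byte 0.
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("server did not honor Range request: %s", resp.Status)
	}
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Placeholder values, for illustration only.
	if err := resumeBlobDownload("https://registry.example/v2/library/mariadb/blobs/sha256:...", "/var/tmp/layer-7b1a6ab2e44d"); err != nil {
		fmt.Println("resume failed:", err)
	}
}
```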


mtrmac commented Nov 25, 2021

(--format only affects the format used to write to the destination (possibly causing a conversion), so it can’t have any effect on how the source is read.)


djnotes commented Feb 15, 2022

I've never seen this issue on Fedora, but I'm seeing it a lot in WSL 2.
The annoying thing is that this error happens when around 90% of a layer has been downloaded, and then it starts all over again, re-downloading over 100MB of data!

My podman info:

podman info --debug:
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.0-2.fc35.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    variant: container
    version: "35"
  eventLogger: file
  hostname: Windows
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.16.3-microsoft-standard-WSL2
  linkmode: dynamic
  logDriver: journald
  memFree: 11843911680
  memTotal: 13316186112
  ociRuntime:
    name: crun
    package: crun-1.4.2-1.fc35.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.2
      commit: f6fbc8f840df1a414f31a60953ae514fa497c748
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /tmp/podman-run-1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.x86_64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4294967296
  swapTotal: 4294967296
  uptime: 44m 29.25s
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/mehdi/.config/containers/storage.conf
  containerStore:
    number: 6
    paused: 0
    running: 0
    stopped: 6
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.7.1-2.fc35.x86_64
      Version: |-
        fusermount3 version: 3.10.5
        fuse-overlayfs: version 1.7.1
        FUSE library version 3.10.5
        using FUSE kernel interface version 7.31
  graphRoot: /home/mehdi/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 5
  runRoot: /tmp/podman-run-1000/containers
  volumePath: /home/mehdi/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 1638999907
  BuiltTime: Wed Dec 8 13:45:07 2021
  GitCommit: ""
  GoVersion: go1.16.8
  OsArch: linux/amd64
  Version: 3.4.4

Update: This does happen on Fedora too. Very irritating. It gets close to fully downloading several images and then starts over from zero percent.

TomSweeneyRedHat commented:

I just wanted to note that an RFE bugzilla has been created for this issue too. https://bugzilla.redhat.com/show_bug.cgi?id=2009877


amirgon commented Jul 17, 2022

I'm seeing a lot of “error happened during read: unexpected EOF” failures when pulling large images (5GB) with podman pull on Fedora CoreOS.

Docker is more robust when pulling the same images. It completes without errors, resuming layers (as mentioned above by @grunlab) when it encounters network issues.

Not being able to reliably pull large images is a serious deal-breaker and may affect the choice between Podman and Docker.
I would really like to find some solution to this issue.


gimiki commented Oct 25, 2022

I recently encountered the same problem when pulling from a private registry, with both podman (Error: writing blob: storing blob to file "/var/tmp/storage3484117728/4": happened during read: unexpected EOF) and skopeo.
It seems to occur particularly when I am on relatively slow connections with large layers.

I also tried downloading the layer with wget; I get the following error, but the download is resumed, as with docker pull.

> wget https://<private_repo>/<image_layer_path> -O /tmp/big_layer
...
2022-10-25 12:50:36 (540 KB/s) - Connection closed at byte 95550581. Retrying.


--2022-10-25 12:50:37--  (try: 2)  https://<private_repo>/<image_layer_path>
Connecting to <private_repo> (<private_repo>)|<private_repo_ip>|:443... connected.
HTTP request sent, awaiting response... 206
...

If I retry, the connection doesn't close at the same byte.

❯ wget --version
GNU Wget 1.21.2 built on linux-gnu.
❯ podman --version
podman version 3.4.4
❯ podman run -it docker://quay.io/skopeo/stable:latest --version
skopeo version 1.10.0


amirgon commented Nov 28, 2022

We are using Sonatype Nexus as a private container registry.
By default, Nexus times out connections after 30 seconds.
When pulling containers with large layers this timeout sometimes expires and the connection is closed.

Docker is smart enough to resume the download, so after a retry or two it completes the download without the user noticing any problem.
But Podman restarts the download instead of resuming it, so naturally Podman hits the same timeout on each retry and eventually fails with “error happened during read: unexpected EOF” or a similar error.

The workaround for us was to increase Nexus timeout by setting nexus.httpclient.connectionpool.idleTime=600s in /opt/sonatype-work/nexus3/etc/nexus.properties and restarting Nexus.
Jetty idle timeout might also need to be increased.

Hopefully this could be useful for someone in the same situation.


djnotes commented Nov 28, 2022 via email

mtrmac added the kind/feature label on Dec 9, 2022

mtrmac commented Feb 20, 2023

With #1816 and #1847, c/image will resume layer pulls after “unexpected EOF” and “connection reset by peer” errors, limited to cases where there has been enough progress, or enough time has passed, since the previous attempt. This should hopefully alleviate the most frequent cases without interfering too much with any existing higher-level retry logic.
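
In rough terms, that gating can be pictured as follows; this is a hypothetical sketch only, and the thresholds and structure of the actual #1816/#1847 code differ:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical thresholds, for illustration only.
const (
	minProgressToResume = 1 * 1024 * 1024  // only resume if we received meaningfully more data...
	minDurationToResume = 30 * time.Second // ...or if the failed attempt at least ran for a while
)

// shouldResume gates another ranged attempt so that a connection that dies
// instantly every time does not loop forever, while genuine mid-transfer
// failures on large layers are resumed instead of restarted.
func shouldResume(bytesSinceLastAttempt int64, attemptStarted time.Time) bool {
	return bytesSinceLastAttempt >= minProgressToResume ||
		time.Since(attemptStarted) >= minDurationToResume
}

func main() {
	start := time.Now().Add(-5 * time.Second)
	fmt.Println(shouldResume(4*1024*1024, start)) // true: plenty of new data arrived
	fmt.Println(shouldResume(10, start))          // false: barely any progress, give up
}
```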

mtrmac closed this as completed on Feb 20, 2023
akalenyu added a commit to akalenyu/containerized-data-importer that referenced this issue Aug 28, 2023
Hopefully the resume on EOFs gives us enough resiliency as mentioned in containers/image#1145 (comment).

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
kubevirt-bot pushed a commit to kubevirt/containerized-data-importer that referenced this issue Aug 28, 2023
…Fs (#2874)

Hopefully the resume on EOFs gives us enough resiliency as mentioned in containers/image#1145 (comment).

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>