
Intermittent Hangs on Registry Push #1568

Closed
dgershman opened this issue Apr 6, 2023 · 13 comments

dgershman commented Apr 6, 2023

Environment

Device and OS: AMD64 / Ubuntu 22.04
App version: 0.25.2
Kubernetes distro being used: K3D K3S v1.25.7+k3s1
Other: Big Bang 1.57.1

Steps to reproduce

  1. zarf package deploy zarf-package-big-bang-example-amd64-1.57.1.tar.zst --confirm --log-level=trace
  2. Occasionally the process gets hung up on "Updating image ..."

Expected result

That the push wouldn't get hung up, and would continue along.

Actual Result

Things get hung up and/or sometimes fail or time out.

Visual Proof (screenshots, videos, text, etc)

Screenshot 2023-04-06 at 4 31 31 PM

  DEBUG   2023-04-06T20:20:34Z  -  crane.Push() /tmp/zarf-1329914535/images:registry1.dso.mil/ironbank/fluxcd/kustomize-controller:v0.34.0 -> 127.0.0.1:34857/ironbank/fluxcd/kustomize-controller:v0.34.0)
└ (/home/runner/work/zarf/zarf/src/pkg/packager/deploy.go:403)
  ⠦  Updating image registry1.dso.mil/ironbank/fluxcd/kustomize-controller:v0.34.0 (2m5s))
  ⠧  Updating image registry1.dso.mil/ironbank/fluxcd/kustomize-controller:v0.34.0 (2m5s)
  DEBUG   2023-04-06T20:22:44Z  -  "msg"="error creating error stream for port 34857 -> 5000: Timeout occurred\n" "error"=null
└ (/home/runner/go/pkg/mod/github.com/go-logr/logr@v1.2.4/funcr/funcr.go:183)
  DEBUG   2023-04-06T20:24:01Z  -  "msg"="error creating error stream for port 34857 -> 5000: Timeout occurred\n" "error"=null
└ (/home/runner/go/pkg/mod/github.com/go-logr/logr@v1.2.4/funcr/funcr.go:183)
  DEBUG   2023-04-06T20:24:03Z  -  "msg"="error creating error stream for port 34857 -> 5000: Timeout occurred\n" "error"=null
└ (/home/runner/go/pkg/mod/github.com/go-logr/logr@v1.2.4/funcr/funcr.go:183)
  DEBUG   2023-04-06T20:33:33Z  -  "msg"="lost connection to pod\n" "error"=null.0 (13m13s)
└ (/home/runner/go/pkg/mod/github.com/go-logr/logr@v1.2.4/funcr/funcr.go:183)

Full logs:

full-logs.txt

Severity/Priority

There is a workaround: keep retrying until the process succeeds.

Additional Context

  • I don't believe this is specifically related to the bigbang component.
  • I have seen this happen on zarf init as well, specifically when pushes are happening to the registry.
  • I tested to see if the tunnel was still active during the time of the error, and it was.

Screenshot 2023-04-06 at 4 30 24 PM

github-project-automation bot moved this to New Requests in Zarf Project Board Apr 6, 2023
dgershman changed the title from "Intermittent Hangs on Registry PUsh" to "Intermittent Hangs on Registry Push" Apr 6, 2023
dgershman commented Apr 11, 2023

An interesting piece of information, I think: we see this issue when connected to the console (using either SSM or SSH) of an EC2 instance. When running this locally and pointed at the same K8S cluster, we have no issues.

dgershman commented:

I think this may have something to do with using k3d. I'm not seeing the intermittent issues with k3s.

Racer159 self-assigned this Apr 18, 2023
Racer159 added this to the v0.26.1 milestone Apr 18, 2023
Racer159 added a commit that referenced this issue Apr 18, 2023
…ushes (#1590)

## Description

This PR creates a tunnel per image push (making it easier to implement
concurrency - we may do that in this PR if we can confirm that the issues
are mitigated), moves the CRC from the image name to the tag, and changes
the UI to use a progress bar instead of a spinner for better user feedback
(the per-push tunnel pattern is sketched after this PR description).

## Related Issue

Relates to #1568 , #1433, #1218, #1364

This also will make #1594 slightly easier.

(See aws/containers-roadmap#853)

Fixes: #1541

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [X] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed
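
To make the per-push tunnel idea above concrete, here is a minimal sketch, assuming a hypothetical `openRegistryTunnel` helper in place of Zarf's actual port-forward plumbing (illustrative only, not the real implementation):

```go
package images

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
	v1 "github.com/google/go-containerregistry/pkg/v1"
)

// openRegistryTunnel is a hypothetical stand-in for whatever port-forwards
// localhost to the in-cluster registry service. It returns the local
// endpoint (e.g. "127.0.0.1:34857") and a cleanup function.
func openRegistryTunnel() (endpoint string, closeFn func(), err error) {
	// ... establish a fresh port-forward to the zarf registry here ...
	return "127.0.0.1:34857", func() {}, nil
}

// pushImages opens a fresh tunnel for every image push instead of reusing
// one long-lived tunnel for the whole batch, so a dropped port-forward
// only affects a single image and can be retried in isolation.
func pushImages(imgs map[string]v1.Image) error {
	for ref, img := range imgs {
		endpoint, closeTunnel, err := openRegistryTunnel()
		if err != nil {
			return fmt.Errorf("unable to open tunnel for %s: %w", ref, err)
		}

		dst := fmt.Sprintf("%s/%s", endpoint, ref)
		err = crane.Push(img, dst)
		closeTunnel()
		if err != nil {
			return fmt.Errorf("unable to push %s: %w", ref, err)
		}
	}
	return nil
}
```

The trade-off is extra tunnel setup and teardown per image, but it keeps a single stuck or lost port-forward from hanging the entire push step.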
Racer159 modified the milestones: v0.26 (m2), v0.26 (m3) Apr 20, 2023
Racer159 commented:

@dgershman have you gotten a chance to try v0.26.0 and are you seeing any issues with that and k3d?

dgershman commented:

Still having issues with K3D and tried with v0.26.1. It times out on the registry push step. This is the latest response:

     ERROR:  Failed to deploy package: unable to deploy all components in this Zarf Package: unable to deploy
             component bigbang: unable to push images to the registry: Head
             "http://127.0.0.1:46657/v2/ironbank/opensource/kubernetes/kubectl/blobs/sha256:b4f2e7ad23124dea1bfd0a239be05aaa5c7ddad088a5039b7cf799eddcfa278a":
             dial tcp 127.0.0.1:46657: connect: connection refused
     ERROR:  Failed to deploy package: unable to deploy all components in this Zarf Package: unable to deploy
             component bigbang: unable to push images to the registry: Head
             "http://127.0.0.1:46657/v2/ironbank/opensource/kubernetes/kubectl/blobs/sha256:b4f2e7ad23124dea1bfd0a239be05aaa5c7ddad088a5039b7cf799eddcfa278a":
             dial tcp 127.0.0.1:46657: connect: connection refused
     ERROR:  Failed to deploy package: unable to deploy all components in this Zarf Package: unable to deploy
             component bigbang: unable to push images to the registry: Head
             "http://127.0.0.1:46657/v2/ironbank/opensource/kubernetes/kubectl/blobs/sha256:b4f2e7ad23124dea1bfd0a239be05aaa5c7ddad088a5039b7cf799eddcfa278a":
             dial tcp 127.0.0.1:46657: connect: connection refused
     ERROR:  Failed to deploy package: unable to deploy all components in this Zarf Package: unable to deploy
             component bigbang: unable to push images to the registry: Head
             "http://127.0.0.1:46657/v2/ironbank/opensource/kubernetes/kubectl/blobs/sha256:b4f2e7ad23124dea1bfd0a239be05aaa5c7ddad088a5039b7cf799eddcfa278a":
             dial tcp 127.0.0.1:46657: connect: connection refused

Racer159 commented:

Thanks for the update @dgershman. The SSH note might be an interesting / good lead to look into as well - some, but not all, had reported that this was happening on EC2 VMs over SSH. I had thought it was bad networking and tested for that locally, but potentially there is something more going on with driving image uploads from an SSH session. I'll see if I can keep digging.

Racer159 commented:

@dgershman I'm doing some more testing on this, but if you get a chance, does the code in #1721 solve the hanging for you? (Note that parts of the push/pull of OCI Zarf packages are broken on that PR - not sure if that is in your flow, though - it won't affect local Zarf packages.)

dgershman commented:

Putting this on my todo list for today.

dgershman commented:

I ran one test so far that looked good. Will do two more tomorrow.

dgershman commented:

I think we are good!

Racer159 commented May 19, 2023

OK, going to clean up the PR and then roll back one change we made that didn't fix anything (and at this point only serves to slow down image pushes). Will leave this open though, in case this doesn't fully solve the issue - the main thought in the PR is that the multithreaded jobs in crane are either overwhelming the resource limits of the registry pod and/or stepping on each other and resulting in a stuck state.

Racer159 added a commit that referenced this issue May 22, 2023
## Description

Creating this PR to test the performance impact of setting crane jobs to
`1` (which has initially proven effective in resolving the issue of
registry pushes hanging; the change is sketched after this PR description).

## Related Issue

Relates to #1568
Fixes #1656
Fixes #1734

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [X] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed

---------

Co-authored-by: Cole Winberry <86802655+mike-winberry@users.noreply.github.com>
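
For reference, the jobs change above boils down to capping crane's layer-upload concurrency. A minimal sketch, assuming a go-containerregistry release that exposes a `WithJobs` option through the `crane` package (illustrative only, not Zarf's actual code):

```go
package images

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
	v1 "github.com/google/go-containerregistry/pkg/v1"
)

// pushWithSingleJob uploads each image with crane's concurrent layer
// uploads capped at one job, trading some speed for not hitting the
// registry pod (or its port-forward) with many parallel blob uploads.
func pushWithSingleJob(imgs map[string]v1.Image, registry string) error {
	for ref, img := range imgs {
		dst := fmt.Sprintf("%s/%s", registry, ref)
		// WithJobs(1) is assumed here; older go-containerregistry
		// releases may not expose this option.
		if err := crane.Push(img, dst, crane.WithJobs(1)); err != nil {
			return fmt.Errorf("unable to push %s: %w", ref, err)
		}
	}
	return nil
}
```

Serial layer uploads are slower, but they line up with the hypothesis above that concurrent crane jobs were overwhelming the registry pod or stepping on each other and leaving pushes in a stuck state.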
Racer159 commented:

Moving this back a milestone to give the fixes in v0.27.0 some time to bake with the community to see if this is truly fixed.

Racer159 commented Jun 5, 2023

Closing as fixed for now - we can reopen if it presents itself again

Racer159 closed this as completed Jun 5, 2023
ranimbal commented Oct 24, 2023

We are still seeing this issue in zarf version 0.29.2. We have a multi-node cluster on AWS EC2 running RKE2 v1.26.9+rke2r1. Our package size is about 2.9 GB, and the "zarf package deploy ..." command stalls and hangs about 80% of the time or so. A retry usually works fine.

Here are a few things that we noticed after some extensive testing:

  • this issue is not seen on a single-node EC2 RKE2 cluster; it seems to occur only on multi-node clusters.
  • our zarf docker registry is backed by S3. The issue is always seen in this case, but only on a multi-node cluster.
  • if we back the registry with the default PVC (instead of S3), the issue is not seen at all. Since data transfer to S3 is slower than to the EBS-backed PVC, maybe this extra time causes the problem to appear?
  • disabling or enabling the zarf docker registry HPA doesn't seem to matter either way.

Can you please re-open this issue, or let me know if I should create a new one? Thanks.
