Define optimal multi-distro HA solution for the in-cluster registry #375

jeff-mccoy · 2022-03-10T05:19:37Z

Currently Zarf uses a helm chart to deploy a docker registry. This chart utilizes deployments with an optional PVC for persistence. We need to define a better solution for making the registry highly-available and fault-tolerant. This issue exists to facilitate that discussion.

Concerns:

Storage classes are cluster/distro-specific and a default storage class does not always exist
Using a simple deployment with multiple replicas requires a RWX access mode to tolerate Node failure/taints
Docker registry does support S3/Minio object storage, but that introduces a new complex dependency (Minio) or a vendor cloud dependency (AWS)
Pushing to a svc vs every running pod without RWX or S3-backed storage will create an inconsistent state of image availability. In other words one pod would have the pushed image, the others would not

Considerations:

Docker registry implements the OCI Distribution Spec API, which makes it trivial to synchronize registries with tools we already have in Zarf, such as Crane.
For HA we don't need a daemonset such as Kube-Fledged as 3+ registries distributed across nodes (if 3 nodes are available) would still provide HA
We have solved the initial lack of a registry issue with Distro-Agnostic Zarf Registry Bootstrap #329, so we can assume a seed registry will already exist in the cluster during bootstrapping
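As a rough illustration of the synchronization idea in the considerations above, the sketch below computes which image references a replica is missing relative to a seed registry. It is only a sketch: the `Catalog` type, the `missing_refs` helper, and the sample data are hypothetical, and a real implementation would list repositories and tags through the registry API (and would likely compare digests rather than tags, since tags can be re-pointed).

```python
from typing import Dict, List, Set

# Hypothetical in-memory view of a registry: repository name -> set of tags.
Catalog = Dict[str, Set[str]]


def missing_refs(source: Catalog, target: Catalog) -> List[str]:
    """Return repo:tag references present in source but absent from target."""
    refs = []
    for repo in sorted(source):
        # Tags the target does not have yet for this repository.
        for tag in sorted(source[repo] - target.get(repo, set())):
            refs.append(f"{repo}:{tag}")
    return refs


# Illustrative data only; these are not images Zarf necessarily ships.
seed = {"library/nginx": {"1.21", "1.22"}, "library/redis": {"6.2"}}
replica = {"library/nginx": {"1.21"}}
to_copy = missing_refs(seed, replica)
print(to_copy)
```

Each returned reference would then be handed to whatever copy tool is in use; the diff itself is independent of that tool.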

jeff-mccoy · 2022-03-23T02:25:36Z

A. Require RWX StorageClass for HA

Pros:

Simplest deployment complexity for Zarf
Requires only single push of images to the registry
Protected against pod eviction / node failure

Cons:

Places HA burden on consumers
Could discourage HA which could be a real problem
May not be realistic for all systems
Not all RWX options are equally reliable, e.g. NFS

C. Use Statefulset combined with InitContainer for automatic image syncing in-cluster

Pros:

Very portable
No resource overhead or deployment complexity to manage
May not require any StorageClass or at least not RWX
Requires only single push of images to the registry
Leverages only standard OCI capabilities

Cons:

Additional code must be written to handle pod-to-pod syncing using Crane

Details:

Deploy 3+ pods with a Statefulset for predictable naming / rolling of pods
Push the image updates to pod-0
Start the pod roll (reverse-order)
Each pod (except pod-0, which syncs from any other pod in case of eviction) runs an init container first to sync to pod-0 using the zarf injector go binary; this prevents nodes from trying to pull from that pod before it's updated
After all pods are updated, continue zarf package deploy
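The roll order in the details above can be sketched as a small planning function. This is a minimal sketch under stated assumptions: `plan_rollout` and the `registry` pod name prefix are hypothetical, and picking pod-1 as the sync source for pod-0 is just one way to satisfy "any other pod".

```python
from typing import List, Tuple


def plan_rollout(replicas: int, name: str = "registry") -> List[Tuple[str, str]]:
    """Return (pod, sync_source) pairs in the order pods would be rolled.

    Pods roll in reverse ordinal order. Every pod except pod-0 syncs from
    pod-0, which received the pushed images. Pod-0 rolls last and, in this
    sketch, syncs from pod-1 (an assumption; any other pod would do).
    """
    if replicas < 2:
        raise ValueError("at least two replicas are needed for peer syncing")
    plan = []
    for ordinal in range(replicas - 1, -1, -1):
        source = 0 if ordinal != 0 else 1
        plan.append((f"{name}-{ordinal}", f"{name}-{source}"))
    return plan


for pod, source in plan_rollout(3):
    print(f"{pod} syncs from {source}")
```

With three replicas this yields registry-2 and registry-1 syncing from registry-0, followed by registry-0 syncing from registry-1.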

Eliminated based on conversations with @runyontr / @YrrepNoj and comments below:

B. Deploy minio or use existing S3-compatible storage

Pros:

Standard charts exist to deploy minio
Well-established / mature option
Requires only single push of images to the registry
Protected against pod eviction / node failure

Cons:

Largest resource footprint
Non-trivial helm chart
May introduce additional cost/latency if using cloud object storage
Greatly complicates cloud deployments with need for IAM/S3 Bucket/Policies as a pre-req if using S3

D. Use existing helm chart registry deployment and push copies of the images to each pod

Pros:

Very portable
No resource overhead or deployment complexity to manage
Leverages only standard OCI capabilities

Cons:

Requires pushing images O(n) times based on the number of registry pods deployed; some optimizations exist due to the way OCI Distribution works
Does not resolve pod eviction / node failure

Details:

Instead of using a service binding for zarf connect, bind to each pod individually and update every registry pod
After all pods are updated, continue zarf package deploy
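To make the O(n) cost of option D and the optimization mentioned in its cons more concrete, here is a minimal sketch that counts how many blobs each registry pod would still need for one image, assuming a push skips blobs the pod already holds. The helper name, pod names, and digests are all hypothetical.

```python
from typing import Dict, Set


def uploads_per_pod(image_blobs: Set[str], pod_blobs: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count the blobs each registry pod still needs for one image."""
    return {pod: len(image_blobs - have) for pod, have in pod_blobs.items()}


# Illustrative digests only.
image = {"sha256:base", "sha256:app", "sha256:config"}
pods = {
    "registry-0": {"sha256:base"},
    "registry-1": set(),
    "registry-2": {"sha256:base", "sha256:app", "sha256:config"},
}
counts = uploads_per_pod(image, pods)
print(counts, sum(counts.values()))
```

The total still grows with the number of pods, which is the O(n) concern; blob reuse only reduces the constant.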

E. Use Trow for the registry

Pros:

External project working on this problem much longer
Interesting P2P syncing solution
Clever ways to manage host/TLS via hostPath

Cons:

Project is still alpha
Uses R/W hostPath, which violates many K8s security recommendations
Entire codebase is a custom Rust app vs using the well-established Docker Registry
Primarily only one main developer
Designed to use an ingress and subdomain for TLS registration

F. Use Uber Kraken

Pros:

Proved at massive scale
Very mature
Powerful P2P syncing system
Extreme HA

Cons:

Very complex / resource heavy for our needs
Still requires S3 or similar object storage backend
Requires daemonset + backend storage system

Racer159 · 2022-03-28T22:42:14Z

C seems to be the most elegant solution (even though custom stuff would need to be written). You could try C short term and then work with the Trow dev to improve that product. Relying on S3/minio for a core service is kinda a big con IMO. I have seen that go pretty poorly with minio in the past and a platform provided s3 is obviously not available on the other side of most air gaps.

jeff-mccoy · 2022-03-30T02:27:35Z

Thanks @Racer159! After chatting with some team members we are going to focus on options A - D for now.

anoncam · 2022-04-01T18:29:52Z

B. I think people will have the most familiarity operating with this model. I also think IAM is not a con; it would merely need to be purposeful, which also alludes to multiple people being aware of the resource utilization.

I also empathize with the perspective that this is inherently an operational risk if you are unable to modify account-level IAM or are dependent on others. I go back to my initial takeaway that seat belts aren't always bad.

In my view it is the most straightforward approach. Worth a test to gain insight on concerns of latency. I would also want to know how integral are the mechanics as compared to the functional requirement? Does the implementation greatly drive any subset of users one way or another?

Just some thoughts...

mikhailswift · 2022-04-01T18:34:49Z

C or D would probably be the way I'd lean, with a preference toward C. I think the code that would be necessary to write would be simple enough and the benefit of the portability is great.

Agree with @Racer159 on the S3/Minio comments in regards to B, and there could be edge cases with how to handle situations where perhaps the consumer may already have a HA storage solution setup in cluster and is now burdened with maintenance on another one. A could perhaps make sense with an optional deployable component that provides a minio deployment that satisfies the PVC requirement? Suppose that's somewhat of a hybrid of A+B

jeff-mccoy · 2022-04-01T18:40:57Z

Another piece of data--with zarf in a connected/semi-connected environment, HA is essential as not having HA would actually reduce fault-tolerance as it's very likely any other external registry would already be highly-available.

jeff-mccoy · 2022-04-01T19:20:53Z

Another idea to use IPFS instead of object storage has been mentioned and even some POCs might be worth considering, but I think those would have to be solutions down the road because they are still POC: https://github.com/ipdr/ipdr and a more direct POC https://github.com/joshrwolf/ripfs.

tonybutt · 2022-04-01T19:28:44Z

As others have stated, C/D. The amount of overhead code that would be added seems well worth the portability you would gain. C is likely the more standard way to implement such a feature, with D being its quicker/dirtier counterpart imo.

@jeff-mccoy IPFS is a fantastic idea and is the HA of HAist things you could implement, and you get all the great things that come with that: tamper evidence, etc. As a nerd I'd prefer to see this as the end solution.

Ultimately, to solve the issues in the near term, go with C or D depending on the level of effort desired to complete the solution. Could always start with D and iterate to C.

salt-mountain · 2022-04-04T03:41:34Z

Not sure if this is the odd man out opinion, but I lean B or C.

B makes sense to me because I'd like to think that MinIO would support the majority, if not all, of the deployment targets. I also like the idea of being able to leverage a mature helm chart for storage and their docs as well.

C, if I'm understanding it correctly, sounds interesting. My concern with custom solutions like this becomes a question of "how difficult will it be to debug if it goes haywire?" An initContainer sounds more of a hassle to deal with than fixing values in a Helm chart that has documentation.

jeff-mccoy · 2022-04-04T03:51:45Z

@salt-mountain thanks for the feedback. Regarding C: I actually discussed this exact issue with another engineer last week. Leaning on an initContainer made sense to me because I could use my normal K8s troubleshooting flow for debugging vs some custom orchestration behavior. I think it's fair to say we'd need to prove that out, but I would say either initContainer or a smart readinessProbe + waiting to start serving the registry until the registry is updated could work too. The biggest issue I see is not when we are actually updating images, but if a node fails: we will need that pod to spin up on another node and would not want it to report ready until it was fully up-to-date.
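The "do not report ready until fully up-to-date" behavior can be sketched as a simple gate that an init step or readiness check might run. This is a hedged sketch: `is_up_to_date`, `wait_until_synced`, and the injected fetch callables are hypothetical, and how digests are actually listed from each registry is left out.

```python
import time
from typing import Callable, Set


def is_up_to_date(local: Set[str], source: Set[str]) -> bool:
    """True when every digest served by the source is also held locally."""
    return source <= local


def wait_until_synced(
    fetch_local: Callable[[], Set[str]],
    fetch_source: Callable[[], Set[str]],
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll until the local registry holds everything the source has.

    Returns False if the local registry is still behind after all attempts,
    in which case the caller should keep the pod out of service.
    """
    for _ in range(attempts):
        if is_up_to_date(fetch_local(), fetch_source()):
            return True
        time.sleep(delay_seconds)
    return False
```

An init container built on this would exit non-zero when the function returns False, which keeps the failure visible through normal pod status.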

brandtkeller · 2022-04-05T02:48:32Z

My .02 - although not adding any significant/new opinion.

C sounds like a logical starting point for portability without added layers of complexity baked in (i.e., MinIO/S3 or RWX-compliant storage).

That said, I'd be curious about playing devil's advocate for a solution that possibly enables a combination of A/C. I believe utilizing RWX storage if available would be optimal. Only sticking to C might be an unnecessary burden for those who have the infrastructure available. (Although it wouldn't be that big of a burden)

I have seen orchestration that includes multiple configurations, which could be tailored to a case such as this: deployments with RWX storage or statefulsets with a syncing mechanism. Lots of details to sort out there, and I'm excluding the maintenance required to support multiple configs (and maybe a greater discussion about whether that should be done at all).

I thought it may be worth entertaining. I think we all know of many growing spaces that will require air-gap processes with cloud infrastructure available that would benefit from being able to use that infrastructure.

andrewg-xyz · 2022-04-05T23:04:21Z

andrewg-xyz · 2022-04-05T23:13:35Z

Agree with C as best option. D would cause issues, as mentioned O(n). A,B require something to happen beyond zarf to meet the pre-reqs.

For option A, what would Zarf do for the PVCs?

RothAndrew · 2022-04-05T23:31:26Z

Start with A, but do it in a way that the user is able to choose if they want HA or not. If I'm on a single-node K3s cluster I don't even need HA since it is mostly pointless. They are only required to provide RWX if they choose "HA-mode" -- should be super easy and quick
Iterate by offering either B or C for "HA-mode", while preserving existing functionality of allowing the user to provide their own RWX if they want to

Between B and C I think I'm more in favor of B, but C doesn't give me heartburn either. If we went with B we might be able to offer the ability to choose the in-cluster Minio or utilize external S3 if the user is capable of it.

neoakris · 2022-04-06T21:33:57Z

BLUF/summary of the wall of text:

Option A could solve the problem in all cases except bare metal. If you need a solution for 100% of cases, then narrow down to option C/D (like how you eliminated options E & F earlier).

Option B sounds terrible to me for the following reasons:

You've introduced a complex dependency that you don't have a lot of control over.
minio isn't the easiest thing to upgrade over time so you gain maintenance overhead.
You still have to worry about HA-minio, so instead of solving the problem you've just pushed the problem down to a lower turtle.

There's a lot to like about option A from a KISS perspective: (The big 3 CSPs now support RWX storage classes)

(the original AWS EBS in tree storage class didn't support RWX), AWS EBS CSI driver does support RWX.
Azure & GCP storage classes supported RWX a while ago.

The one flaw of option A is bare metal:

One option would be to say RWX = bare metal cluster prerequisite and call it a day, this has a slight annoyance in that it's not solving a problem. (if you go option A this is probably in your best interest though)
It might be possible to use something like rancher longhorn or rook-ceph as a cloud agnostic storage class supporting RWX, but both add a significant amount of overhead compared to options C/D:
- rook-ceph, and rancher longhorn both have config specific edge cases
- both require extra CPU/Ram just to exist
- although longhorn/rook are a complex dependency, at least it'd be possible to exercise a degree of control over them; the only problem is that committing to that creates the yak-shaving problem of yet another thing to maintain and create an upgrade path for.

When I first read option C & D, my gut told me those sound like clever hacky duct tape solutions, but there's some advantage to pursuing them:

If you embed the hacky duct tape solution in a minimalist go container, you might end up with something that can be a timeless solution, likely to be CVE free and not needing to be maintained / upgraded over time: install once, and it wouldn't be utterly crazy to go a year without upgrading. (Option A/B will create work in the form of upgrade maintenance)

neoakris · 2022-04-06T21:47:25Z

It's worth pointing out that you left out requirements in your original ask for input, being clear on the problem trying to be solved might help further narrow down choices. I'll give some example requirements gathering questions that might help make the point clearer:

Are you trying to bootstrap...
- a long lived registry?
- a registry only used by the platform? (Example: the consumer would never push their own custom built images to it.)
- a registry that the consumer might push their own custom built images to?
Could DR be prioritized over HA?
Example: Let's say that if the non-HA single-node registry goes away, automation could be used to spin up a replacement from code + artifacts in 1 hour. Maybe that's sufficient, especially if paired with kube-fledged. Also, what was wrong with kube-fledged?

An example of why these questions matter:

If it's a registry used only by the platform and controlled 100% by automation, then you could do this simple pattern.
In 100% of installs the following setup occurs:
1 single node non HA registry + 1 node port / load balancer entry for it.
Then HA is enabled in a 2nd step by creating a 2nd single node non HA registry, using the same bootstrap automation that was used to create the 1st, and then update the node port / load balancer to point to both.

jeff-mccoy · 2022-04-06T22:03:21Z

Thanks @neoakris. This registry is the docker registry that zarf installs on a zarf init. The way it works right now is upon deploying to a cluster, a helm postRender stage performs image mutation to point to the local nodePort exposing the registry. The issue with this is if the registry dies or if the node is drained, we need a way to:

ensure it can come back up and downtime would potentially create an infinite wait if you have the registry waiting for itself to come back up (imagePullPolicy can help here, but it's not going to work if it's a new node)
not lose images / data the cluster needs to serve itself
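The image mutation mentioned above (rewriting image references so they point at the in-cluster registry) can be illustrated with a small helper. This is only a sketch of the general idea, not Zarf's actual postRender logic: `rewrite_image` is hypothetical, the `127.0.0.1:31999` address is an assumed NodePort endpoint, and the host-detection rule is the common convention that a first path component containing a dot or colon (or equal to `localhost`) is a registry host.

```python
def rewrite_image(image: str, registry: str = "127.0.0.1:31999") -> str:
    """Point an image reference at the given registry, keeping its path and tag.

    The default registry address is an assumption made for illustration.
    """
    first, sep, rest = image.partition("/")
    # Treat the first component as a registry host only if it looks like one.
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_host else image
    return f"{registry}/{path}"


for ref in ["nginx:1.21", "docker.io/library/nginx:1.21", "ghcr.io/org/app@sha256:abc"]:
    print(ref, "->", rewrite_image(ref))
```

Because every workload then pulls through the rewritten address, the registry behind that address becomes the single dependency the two points above are concerned with.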

For single node deployments (we call them appliance mode), I agree you certainly could just keep all your zarf packages and destroy/recreate, and that might be sufficient. So that should be a consideration, especially since by default we use k3s and the local-path-provisioner, which is really just a folder on disk, so if the node dies we probably have bigger problems.

The registry will need to be long-lived (as long as the cluster lives) and the default deployment does not expose the registry for consumer use, though could be exposed with an ingress--we just don't do that in the standard configuration. The registry is meant to serve the cluster it is deployed to.

Kube-fledged is an interesting solution, but solves a slightly different problem. Much like Kraken you still need some form of reliable registry as a source-of-truth. It is also a good bit more complex, increases storage needs by caching images on every node and only has one main developer based out of India, which may be a problem for some of our US Dept of Defense users unfortunately.

jeff-mccoy · 2022-04-06T22:04:46Z

Regarding your final point on multiple non-HA registries, that's another option too I believe (and basically what option D is trying to do), but my concern remains around what happens when a node dies and the pod spins back up on a new node with no existing registry data.

jeff-mccoy · 2022-04-06T22:06:57Z

Between C and D, D actually feels more duct-tape-like to me. C would leverage the OCI distribution spec to sync registries, which is definitely battle-tested and used everywhere in prod. The initContainer component could be changed to some smarter init steps on the main container too; I just like the idea of using the initContainer instead because it makes it very easy to see where things are in the rollover process and where an issue exists with normal K8s tooling.

jeff-mccoy · 2022-04-07T06:36:43Z

Moved option D to eliminated based on discussions above.

Going to eliminate option B as well: B concerns me for the external dependencies that are (ironically) more complex for S3 vs EFS and not universal across the cloud providers. Minio also concerns me because of some of the reasons stated above: complexity of the deployment/upgrade, concern with conflicting other deployments of Minio, overall size, troubleshooting issues.

Looking at option A, I do think it is nice that we can basically push the problem to someone else, but I also think we lose a lot by no longer requiring just a KUBECONTEXT. Some quick google-fu on the 3 major cloud providers indicates that additional IaC or manual steps are required on the cloud-provider side before those options will work. Note AKS does have an HA NFS v4.1 option that doesn't seem to require additional configuration.

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver
https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html
https://docs.microsoft.com/en-us/azure/aks/azure-files-csi#use-a-persistent-volume-with-private-azure-files-storage-private-endpoint

Since we have the ability to override storageclass now during init, I think A could be a great override behavior.

Someone on LinkedIn mentioned NFSv4 as a good candidate as it's already supported by K8s, but HA NFS still looks pretty complicated. That leaves us with a combo of C with an override for A. We could also explore IPFS for this, but I think that's another layer of complexity right now with a lot of unknowns around Day 2 operations.

neoakris · 2022-04-07T13:49:30Z

sounds like good analysis; however, before you completely throw out option D. (multiple non-HA registries being load balanced.)
I want to make explicit some mildly outside the box thinking that may not have occurred to you.

multiple non-HA registries + LB = dead simple KISS, reliable, and maintainable.
It should be able to work great, especially since you confirmed for me that this long-lived registry is not intended to be consumed by customers, meaning it's predictable enough that it can be recreated from scratch.

Tell the user to label nodes as registry nodes as a pre-req
Let zarf init, bootstrap an identical copy of a non-HA registry on every node
Kubernetes does have long lived pet infrastructure, they're called master nodes, you could install the registry on those, and then you wouldn't even have to label registry nodes as a pre-req. (this wouldn't work for AKS/EKS, but could* work for bare metal/DIY)
Your concern: "what happens when a node dies and the pod spins back up on a new node with no existing registry data." What comes next is the point I wanted to make explicit as semi outside the box thinking. (admittedly it's more applicable to bare metal/DIY).
- Start off with 2 non-HA registries installed on 2 nodes labeled as registry nodes.
- You lose one or 2 of your registry nodes
- The outside the box thinking I thought you might have missed is that you're thinking from the perspective of the cluster being able to heal itself from within. What I wanted to point out was that zarf init's 10GB tar ball thing that initially bootstrapped them in the first place. You could document that admins should keep that installation artifact handy as it can be used as an external to the cluster method of restoring the registry node in the case of failure. (And since you're talking about HA, there's likely some fault tolerance where if 1 fails the cluster's still healthy, while an out of band repair occurs.)

This may be worth considering:

bare metal scenario could use non-HA registries being load balanced on master nodes. (*You probably already see it but I am introducing a failure scenario in the form of disk pressure bringing down the control plane, so maybe only deploy to worker nodes labeled as registry nodes vs master nodes.)
CSP (cloud service provider) scenario could favor RWX.

Otherwise you may have narrowed down to option C as a universal solution.

bburky · 2022-06-08T16:02:37Z

Some of this has already been mentioned, but I wanted to highlight that: HA is more than just object storage, and please consider the risk and complexity of an in-cluster implementation.

If a cloud managed registry service is available, use it. (ECR, ACR, etc)
- It will already be HA, out of cluster and fully managed. Likely shares no points of failure with the cluster itself.
The single-node-cluster use case doesn't need HA at all.
- A very simple registry implementation may be best.
The on-premises, non-cloud, fully disconnected use case.
- This actually needs HA for production use.
- Please evaluate the risk of in-cluster vs out-of-cluster: you will need to ensure your registry is HA distributed across multiple node pools or a bad Kubernetes upgrade can take down the registry and the whole cluster with it.
  - It is harder to manage a non-k8s out-of-cluster system, but has fewer points of failure.
- HA is more than just HA storage. For example, a fully featured registry will likely require a database for authentication. You have to be sure that everything it depends on is HA too (LoadBalancers, HTTP, DB, DNS, any other compute/networking/storage/etc) or an inaccessible registry can take down the cluster.

I currently use ECR and mirror images for a Big Bang installation to it. It has generally been smooth.

I've implemented an out-of-cluster registry once for an on-premise proof of concept before. One note: when doing Cluster API, you'll have a management cluster in addition to the main cluster that also needs a registry. It's simpler if these can share a registry (but not required). My use case was proof of concept and did not use a production quality registry.

brandtkeller · 2022-06-18T03:07:53Z

I was doing some reading on the replication process implemented by openebs.

OpenEBS creates a Micro-service for each Distributed Persistent volume using one of its engines - Mayastor, cStor or Jiva.

The Stateful Pod writes the data to the OpenEBS engines that synchronously replicate the data to multiple nodes in the cluster. The OpenEBS engine itself is deployed as a pod and orchestrated by Kubernetes. When the node running the Stateful pod fails, the pod will be rescheduled to another node in the cluster and OpenEBS provides access to the data using the available data copies on other nodes.

It may not be applicable - but the replication process might provide insight into a future implementation? or it might not and this can be disregarded 😄 .

JasonvanBrackel · 2022-09-15T13:46:34Z

Looking through this I'll vote along side @Racer159 C seems elegant, simple and the only con is we do work. I'm ok with us doing work.

I also feel like going the A direction is going to put a lot of work on the user, and will likely lead to more debugging down the line of the type of "My does funny things with Zarf, HELP!"

…754)

## Description

This PR introduces the ability to connect to an already existing (and reachable) Container Registry and/or Git Repository during the `zarf init` command.

Closes #570 (Support using an external git server)
Closes #560 (Support using an external registry)

This implementation will serve as a good midway point on having a fully HA in-cluster registry #375.

## PR Feature List

- Added several flags to the `init` command to support using an external git repository
- Added several flags to the `init` command to support using an external container registry
- Update `zarf connect registry` to direct to `{HOST}/v2/_catalog` (this was confusing some other people since it would originally seem like the registry was returning an empty page)
- Add utility function to create a tunnel to a service URL
- Created slightly better regexp for replacing the host from a `containerImage` url
- semi-refactored the `zarf package deploy` logic

## Breaking Changes List

- We are changing the structure of the names of repos & containers we are pushing (we are simplifying the name and adding a sha1 hash of the original name to the end of the name)

Co-authored-by: Wayne Starr <Racer159@users.noreply.github.com>
Co-authored-by: Megamind <882485+jeff-mccoy@users.noreply.github.com>

willswire · 2023-05-01T16:51:19Z

Inline with @RothAndrew's comment:

Start with A, but do it in a way that the user is able to choose if they want HA or not. If I'm on a single-node K3s cluster I don't even need HA since it is mostly pointless. They are only required to provide RWX if they choose "HA-mode" -- should be super easy and quick

I've just submitted PR #1664 which does exactly this; disabled HPA by default, and when it is enabled, requires a RWX-compatible StorageClass for the PVC (we just create our own and pass the name to Zarf Init).

Our team identified an issue where the Zarf Registry doesn't scale when doing Node AMI upgrades, which led to the contribution towards this effort.

jeff-mccoy · 2023-05-03T16:23:03Z

HPA does not require RWX for node-attached storage. What storage provider was requiring RWX?

willswire · 2023-05-03T17:20:17Z

The PDB for the Zarf Registry requires a pod available at all times, but when a rolling node update takes place (for example upgrading EKS node AMIs), an additional registry pod cannot be created on a different node because RWO only allows same-node access.

We’re using EFS to solve this problem by allowing multiple Registry pods to access the same volume - which required RWX for the PVC
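The interaction described above (a PDB that requires one available pod, combined with a volume that can only be used from one node) can be modeled in a few lines. This is a simplified sketch, not Kubernetes scheduler logic: both function names are hypothetical, and it follows the same-node reading of RWO used in this comment.

```python
def replacement_can_start(access_mode: str, volume_node: str, target_node: str) -> bool:
    """Whether a second pod on target_node could mount the registry volume."""
    if access_mode == "ReadWriteMany":
        return True
    if access_mode == "ReadWriteOnce":
        # RWO: only pods on the node where the volume is attached can use it.
        return volume_node == target_node
    raise ValueError(f"unsupported access mode: {access_mode}")


def drain_is_blocked(
    access_mode: str,
    draining_node: str,
    other_node: str,
    min_available: int = 1,
    running: int = 1,
) -> bool:
    """A drain is blocked when evicting the pod would violate the PDB and no
    replacement can start on another node first."""
    replacement_ok = replacement_can_start(access_mode, draining_node, other_node)
    return running - 1 < min_available and not replacement_ok


print(drain_is_blocked("ReadWriteOnce", "node-a", "node-b"))
print(drain_is_blocked("ReadWriteMany", "node-a", "node-b"))
```

Under these assumptions the RWO case blocks the drain while the RWX case does not, which matches the behavior reported during the node AMI upgrade.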

## Description

When attempting to upgrade our EKS Node AMIs in AWS, we noticed that the Zarf Registry deployment was unable to horizontally scale across nodes which needed to restart. We believe the culprit is the `accessMode` specification for the PersistentVolumeController. In order for multiple pods to have access to the same PersistentVolume, the `accessMode` must be set to "ReadWriteMany".

~~This PR proposes that when autoscaling is enabled for the Zarf Registry, the `accessMode` is set to "ReadWriteMany" by default; when autoscaling is disabled, it is set to "ReadWriteOnce". Due to the additional work required (i.e. using an existing PersistentVolumeController with a storage class compatible with RWX), we also propose that `autoscaling` be disabled by default.~~

This PR exposes the `REGISTRY_PVC_ACCESS_MODE` variable for the `zarf-registry` portion of the init package.

## Related Issue

- Relates to #375

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed

## Description

- Provided fields for affinity and toleration of the registry pod
- Added s3 irsa configurability for the docker registry helm chart if using a compatible image

## Related Issue

Relates to #375

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/.github/CONTRIBUTING.md#developer-workflow) followed

---------

Co-authored-by: Zack A <zack-is-cool@users.noreply.github.com>
Co-authored-by: corang <jordan@defenseunicorns.com>
Co-authored-by: Zack Annexstein <zannexstein@gmail.com>
Co-authored-by: Lucas Rodriguez <lucas.rodriguez@defenseunicorns.com>
Co-authored-by: Lucas Rodriguez <lucas.rodriguez9616@gmail.com>
Co-authored-by: razzle <razzle@defenseunicorns.com>
Co-authored-by: Austin Abro <37223396+AustinAbro321@users.noreply.github.com>

RothAndrew added this to Zarf Project Board Mar 10, 2022

RothAndrew moved this to New Requests in Zarf Project Board Mar 10, 2022

jeff-mccoy added this to the Zarf GA milestone Mar 10, 2022

jeff-mccoy moved this from New Requests to Planned in Zarf Project Board Mar 14, 2022

jeff-mccoy moved this from Planned to Ready to Start in Zarf Project Board Mar 22, 2022

jeff-mccoy added packager labels Mar 23, 2022

jeff-mccoy pinned this issue Apr 1, 2022

jeff-mccoy changed the title ~~Define optimal multi-distro HA solution for the Docker Registry~~ Define optimal multi-distro HA solution for the in-cluster registry Apr 4, 2022

JasonvanBrackel moved this from Backlog to Doing Now in Zarf Project Board Sep 15, 2022

JasonvanBrackel unpinned this issue Sep 15, 2022

JasonvanBrackel mentioned this issue Sep 20, 2022

Make Zarf More Resilient to Infrastructure and Kubernetes Related Issues #752

Closed

YrrepNoj mentioned this issue Sep 23, 2022

Feature: Support Using Zarf With an External Registry and Repository #754

Merged

Racer159 removed this from the Zarf GA milestone Apr 18, 2023

willswire mentioned this issue May 1, 2023

Expose PVC accessMode as variable #1664

Merged

5 tasks

brianrexrode mentioned this issue Jul 10, 2023

Improve the experience when rolling nodes that have the Zarf registry deployed to them (to include private-registry secret updates) #1715

Closed

AbrohamLincoln mentioned this issue Aug 27, 2023

Expose additional registry settings #1993

Closed

brandtkeller mentioned this issue Aug 28, 2023

Zarf dependence on existing StorageClass #1995

Closed

bburky mentioned this issue Nov 15, 2023

Improve security of zarf registry NodePort #2146

Open

eddiezane added this to Zarf (old) Mar 4, 2024

lucasrod16 mentioned this issue Apr 15, 2024

feat: config to enable resilient registry #2440

Merged

2 tasks

salaxander added this to Zarf Jul 22, 2024

github-project-automation bot moved this to Backlog in Zarf Jul 22, 2024

salaxander removed the status in Zarf Jul 22, 2024

salaxander moved this to Triage in Zarf Sep 10, 2024

ntwkninja mentioned this issue Oct 22, 2024

Enable AWS IRSA auth for registry - Migrate from Docker V2 Registry to V3 #3124

Open

schristoff removed packager labels Nov 8, 2024
