Periodic loss of secrets #83

Closed
mmiller1 opened this issue Mar 13, 2018 · 18 comments

Comments

@mmiller1

This is mostly a question, maybe a bug report. The question: how does the controller interact with Secrets that were already created and carry an owner reference to the SealedSecret CRD? My reason for asking, which might make this a bug report, is that on multiple occasions I've had in-use Secrets disappear from my clusters after something bad happens to the controller, e.g. loss of a node, unexpected deletion of the controller pod, etc. I'm not strictly filing this as a bug because I have zero logs from anything indicating sealed-secrets is the culprit, just a hunch.

@arapulido
Contributor

Thanks for the question. @jjo have we seen anything like this in our clusters?

@anguslees
Contributor

anguslees commented Mar 16, 2018

Ouch, that sounds unpleasant. Do the Secrets come back again by themselves, or do you have to take some action?

The controller will delete the Secret if it thinks the SealedSecret has been deleted[1], and kubernetes will delete (garbage collect) the Secret if it thinks the SealedSecret has been deleted (because of the configured controllerRef[2]). Yes, these two are redundant on modern kubernetes, and we should remove the explicit code in sealed-secrets controller ;)

[1] https://github.com/bitnami-labs/sealed-secrets/blob/v0.6.0/cmd/controller/controller.go#L175
[2] https://github.com/bitnami-labs/sealed-secrets/blob/v0.6.0/pkg/apis/sealed-secrets/v1alpha1/sealedsecret_expansion.go#L109

Note [1] should leave characteristic messages in the sealed-secrets controller log that you can look for. [2] is done by the kube-controller-manager, so the sealed-secrets controller never even sees it happen (it only watches SealedSecrets, not Secrets atm for better or worse).
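
For anyone less familiar with [2], here is a minimal sketch, not the controller's actual code, of how a controller owner reference on the generated Secret ties it to its SealedSecret so that the garbage collector can remove it; the function name and the API group/version string are assumptions for illustration:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// secretForSealedSecret builds the unsealed Secret with a controller owner
// reference pointing at its SealedSecret. With Controller set, the garbage
// collector in kube-controller-manager deletes this Secret whenever it
// decides the referenced SealedSecret no longer exists.
func secretForSealedSecret(name, namespace string, ownerUID types.UID, data map[string][]byte) *corev1.Secret {
	controller := true
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "bitnami.com/v1alpha1", // assumed group/version
				Kind:       "SealedSecret",
				Name:       name,
				UID:        ownerUID,
				Controller: &controller,
			}},
		},
		Data: data,
	}
}
```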

I think both of these approaches will only trigger if a GET of the SealedSecret returns a successful "not found" response, which a) shouldn't depend on the health of the sealed-secrets controller itself and b) would indicate a severe bug in apiserver/etcd if it were true. So .. I believe you, I'm just saying I don't understand what went wrong yet ;)

As for treatment of existing Secrets, the controller is currently very dumb since the assumption is that the process is idempotent and it is always safe to re-decrypt and recreate the Secret from the SealedSecret. So it just creates/replaces any existing Secret once for each existing SealedSecret at controller startup and then whenever any add/update to the SealedSecret is observed[3].

[3] https://github.com/bitnami-labs/sealed-secrets/blob/v0.6.0/cmd/controller/controller.go#L48
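
That create-or-replace step from [3] boils down to something like the following client-go sketch (simplified, with assumed names; not the controller's exact code):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOrReplaceSecret writes the freshly decrypted Secret, overwriting any
// existing copy. Because decryption is deterministic, re-running this on
// every SealedSecret add/update (and once at startup) is safe.
func createOrReplaceSecret(ctx context.Context, c kubernetes.Interface, secret *corev1.Secret) error {
	_, err := c.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		_, err = c.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	}
	return err
}
```

Leaving resourceVersion unset on the Update skips the optimistic-concurrency check, which matches the "always safe to re-decrypt and recreate" assumption, and is exactly what the ResourceVersion action item below tightens up.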

Do you churn/update SealedSecrets frequently by any chance? ... Or if it happens on node failure, perhaps there's a race with a second sealed-secrets controller that has just started up... I don't immediately see any races in the code that would result in a Secret being deleted incorrectly, or created with incorrect contents. I can see it might be possible to get a Secret created from an older version of a SealedSecret if there is a race between multiple controllers, and currently the controller will give up on updates after a certain number of retries with only a hard-to-notice error message in the logs.

Hrm. We can do a few things to make the update code more robust, and we should definitely log events on the SealedSecret detailing errors and successful updates during unsealing. That should make any issues a lot clearer.

Thanks for the bug report. It would be super helpful if you manage to capture the sealed-secrets controller logs after you notice a missing Secret (presumably from the newly restarted controller, if you can't get to the old controller node anymore). If you find any lines mentioning the missing Secret name, they will greatly help in understanding what the controller thinks is going on. Without a reproducible test case, I'm just shooting in the dark as to whether I've fixed anything :(

Action items (for the GitHub PR):

  • Record info/error events on the SealedSecret
  • Enforce ResourceVersion checks in the update loop and retry when a race is detected (see the sketch below)
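
For the second item, a minimal sketch of what a ResourceVersion-checked update with conflict retry could look like in client-go; the function and variable names are assumptions, not existing sealed-secrets code:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateSecretWithRetry re-reads the live Secret, applies the desired data,
// and lets the API server's ResourceVersion check reject stale writes;
// RetryOnConflict re-runs the closure when such a race is detected.
func updateSecretWithRetry(ctx context.Context, c kubernetes.Interface, ns, name string, desired map[string][]byte) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		current, err := c.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		current.Data = desired // decrypted contents from the SealedSecret
		_, err = c.CoreV1().Secrets(ns).Update(ctx, current, metav1.UpdateOptions{})
		return err // a Conflict error here triggers another attempt
	})
}
```

The API server rejects the Update with a Conflict error whenever the Secret changed between the Get and the Update, which is exactly the kind of race described above.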

@mmiller1
Author

mmiller1 commented Mar 16, 2018

Thanks for the response. We have improved our logging situation drastically since the last time it occurred, so if it does happen there's a better chance we'll find something. However, the severity of this issue has caused us to stop using sealed secrets on our busier clusters, I'll keep it running in our development environment and post back if it happens again.

edit: In answer to your first question, the secrets do not come back on their own.

@jbianquetti-nami
Contributor

Hi,

I've recently updated to the latest version of sealed-secrets on our 5 clusters and I'm unable to reproduce the issue.

To monitor when the event happens, I've added kubewatch support for secrets and webhook notifications (vmware-archive/kubewatch#94). This way, we'll be able to detect when Secrets vanish and have more info for debugging.

Can you give it a try?
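
If kubewatch isn't convenient, even a bare-bones client-go watch along these lines would leave a timestamped trace of Secret deletions to correlate with audit logs. This is a standalone sketch with assumed names, not part of sealed-secrets:

```go
package sketch

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// logSecretDeletions watches Secrets in one namespace and logs every
// DELETED event, so a vanished Secret leaves a timestamped trace.
func logSecretDeletions(ctx context.Context, c kubernetes.Interface, namespace string) error {
	w, err := c.CoreV1().Secrets(namespace).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.Type != watch.Deleted {
			continue
		}
		if s, ok := ev.Object.(*corev1.Secret); ok {
			log.Printf("secret %s/%s was deleted", s.Namespace, s.Name)
		}
	}
	return nil
}
```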

@floriankoch

Hi, I have seen this behaviour 3 times in our clusters (4 different ones): the Secrets magically go away. I think I have seen GC runs around that time in the apiserver audit logs.
The SealedSecret is not affected.

Restarting the sealed-secrets-controller pod fixes the problem.

@anguslees
Contributor

I have seen this behaviour 3 times in our clusters

Wow. Which k8s version(s)? (just for tracking)

GC runs are a likely suspect, since there might be some upstream bug/race with CRDs and GC. It would be great to collect whatever relevant logs/traces you have (and are willing to share, even privately), and I can pursue this with k8s upstream. If it is a race, I would expect to see it happen around kube-controller-manager restarts or some other "reset" of the kube-controller-manager's internal state.

@anguslees
Contributor

Regardless of the underlying cause, we can workaround this in sealed-secrets in a few ways:

  • Periodically (every hour?) recreate Secrets, even though there was no change in the SealedSecret. This still means the Secret goes away for a while, but it will come back automatically, eventually.

    Causes some additional load on the apiserver for each SealedSecret, not just changes to SealedSecrets (a potential issue for crazy-big clusters).

  • Watch created Secrets, and recreate them immediately if they go away "unexpectedly". This also deals with humans deleting the Secret but not the SealedSecret.

    This either means lots of outstanding watch requests (one per SealedSecret), or a global watch of all Secrets. So far I have tried to avoid reading Secrets into sealed-secrets controller unnecessarily, because each of them is yet another place to attack the confidential data - so streaming all your cluster's secrets through the controller is a bit icky.

For background context, the sealed-secrets controller does a once-off bonus decryption of all the SealedSecrets at process startup (because it will have missed change events while it was not running). This is why restarting the controller recreates any missing Secrets.

If I had to pick a workaround, I think I'd choose periodic decrypts (first option above), with the period exposed as a flag (0=disable).
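
For the first option, client-go informers already have a resync knob, so a sketch of the wiring could look like the following; the flag name and the use of the core Secret informer are assumptions for illustration (the real controller would hang this off its SealedSecret informer):

```go
package sketch

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// resyncPeriod is the hypothetical flag discussed above: 0 disables the
// periodic re-unseal; any other value re-delivers every object to the
// update handler on that interval, even if nothing changed.
var resyncPeriod = flag.Duration("resync-period", time.Hour, "how often to re-unseal all SealedSecrets (0 to disable)")

// newSecretInformer shows the idea with the core Secret informer; the
// SealedSecret informer would be wired up the same way.
func newSecretInformer(c kubernetes.Interface, onResync func(obj interface{})) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(c, *resyncPeriod)
	inf := factory.Core().V1().Secrets().Informer()
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// UpdateFunc also fires on periodic resyncs, with old == new.
		UpdateFunc: func(oldObj, newObj interface{}) { onResync(newObj) },
	})
	return inf
}
```

Resyncs re-deliver every cached object to UpdateFunc on the chosen interval, so a vanished Secret would come back within one period, at the cost of the extra apiserver traffic mentioned above.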

@floriankoch

floriankoch commented Feb 8, 2019

OK, I have more info:

Kubernetes version 1.12.4

And it does not affect all SealedSecrets in the cluster; it affects exactly one!

The one that has problems is a bit special: it's our default TLS ingress certificate
(we cannot use the option for this provided by nginx ingress, due to a provider limitation).
We create the Secret in one namespace, and then use a kubectl watch to detect Secret changes and copy the Secret into all other namespaces.
(https://boxboat.com/2018/07/02/kubernetes-nginx-ingress-tls-secrets-all-namespaces/)

For whatever reason, the Secret (not the SealedSecret) in the initial namespace goes away, and because of the watch, in all other namespaces too.

Unfortunately, the API audit logs are not available for this timeframe.

I'll see how I can set up monitoring and better debugging for this.

@mmiller1
Author

mmiller1 commented Feb 8, 2019

I do not have logs or any other helpful information, other than that when I was using sealed secrets we were running Kubernetes 1.9, and it was also our SSL certs used by nginx-ingress controllers that were vanishing.

@alice-sawatzky
Contributor

Jumping in blind here: might it be related to the Secret type? Is that SSL cert the only SealedSecret you're using that's not Opaque?

@floriankoch

@alice-sawatzky no, all types are Opaque, and there are more TLS certificates than this one.

@floriankoch

floriankoch commented Mar 21, 2019

OK, now I am sure that our problem is a specific one.
The reason for our loss was the way we copied the Secrets, and only the copied ones are deleted; other sealed Secrets survive this.
The problem always happens when a worker restarts or something similar happens and the Kubernetes controller comes into play.

So this is **not** a generic sealed-secrets-controller bug.

@ddanboyle

Adding some more information as we saw one of these a couple of weeks ago...

We were running version 0.7 of the product in 7 different Azure AKS Kubernetes clusters, 3 of which are production with customers. We've been using sealed secrets for a while now.

We noticed on our preprod cluster that the entire block of sealed Secrets had disappeared. Other secrets tied to the ingress controller, container registries, etc. in that namespace still existed, but all the sealed Secrets for THAT namespace were gone.

I went and restarted the sealed-secrets container in kube-system and they magically reappeared, probably because the SealedSecret resources for each Secret were still there.

Some of these clusters have a lot of usage; our preprod cluster was in a lull and not taking any updates for 2 months or so.

First time this happened for us, so I didn't capture logs or anything, we just got things going again...

I have an hourly process that looks across all our environments for Secret deletions and sends an alert if that happens.

@mkmik
Collaborator

mkmik commented Oct 2, 2019

What k8s version are you running in those clusters?

@mkmik
Collaborator

mkmik commented Oct 2, 2019

but all the sealed secrets for that namespace were gone.

(emphasis mine)

I'm not sure if you mean that all the Secret resources derived (i.e. decrypted) from SealedSecrets were gone, or that the SealedSecrets resources themselves were gone?

Could it be related to #224? (I.e. some controller is deleting the secrets?)

Anyway, a lot of fixes have been made since v0.7.x, so please consider upgrading. There is a potentially relevant fix in #127, #110, or #183, but I'm just speculating.

Let me know if you need help with the migration strategy. Feel free to reach out in our Slack channel (see the README for a pointer).

@ddanboyle

Follow on from my previous post. I have since upgraded my dev clusters to 0.9.2, so we are current.

I also have a build that runs hourly and looks through 30+ namespaces for a missing Secret that I know should never be missing. I also have code that restarts the sealed secrets container when this problem is found.

The build triggered last week... It doesn't trigger very often, but it happened again, even under the new version. These are dev environments with a lot of loading and reloading of deployments, but the secrets themselves rarely change....

What logs or things do I need to save so we can get to the bottom of this problem?
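
Independent of logs, the kind of hourly check described above can go straight against the API. A sketch using the dynamic client follows; the SealedSecret group/version/resource here is an assumption, so check the CRD in your cluster:

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

// sealedSecretGVR is assumed; verify it against the CRD in your cluster.
var sealedSecretGVR = schema.GroupVersionResource{Group: "bitnami.com", Version: "v1alpha1", Resource: "sealedsecrets"}

// missingSecrets lists every SealedSecret in the cluster and reports those
// without a matching Secret of the same name and namespace.
func missingSecrets(ctx context.Context, dyn dynamic.Interface, c kubernetes.Interface) ([]string, error) {
	var missing []string
	list, err := dyn.Resource(sealedSecretGVR).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, ss := range list.Items {
		_, err := c.CoreV1().Secrets(ss.GetNamespace()).Get(ctx, ss.GetName(), metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			missing = append(missing, fmt.Sprintf("%s/%s", ss.GetNamespace(), ss.GetName()))
		} else if err != nil {
			return nil, err
		}
	}
	return missing, nil
}
```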

@github-actions
Contributor

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the Stale label Jan 28, 2022
@juan131
Collaborator

juan131 commented Feb 3, 2022

Closing issue in favor of #224

@juan131 juan131 closed this as completed Feb 3, 2022