Periodic loss of secrets #83
Thanks for the question. @jjo, have we seen anything like this in our clusters?
Ouch, that sounds unpleasant. Do the Secrets come back again by themselves, or do you have to take some action?

The controller will delete the Secret if it thinks the SealedSecret has been deleted[1], and Kubernetes will delete (garbage collect) the Secret if it thinks the SealedSecret has been deleted (because of the configured controllerRef[2]). Yes, these two are redundant on modern Kubernetes, and we should remove the explicit code in the sealed-secrets controller ;)

[1] https://github.com/bitnami-labs/sealed-secrets/blob/v0.6.0/cmd/controller/controller.go#L175

Note [1] should leave characteristic messages in the sealed-secrets controller log that you can look for. [2] is done by the kube-controller-manager, so the sealed-secrets controller never even sees it happen (it only watches SealedSecrets, not Secrets atm, for better or worse).

I think both of these approaches will only trigger if a GET of the SealedSecret returns a successful "not found" response, which a) shouldn't depend on the health of the sealed-secrets controller itself and b) would indicate a severe bug in apiserver/etcd if it were true. So .. I believe you, I'm just saying I don't understand what went wrong yet ;)

As for treatment of existing Secrets, the controller is currently very dumb, since the assumption is that the process is idempotent and it is always safe to re-decrypt and recreate the Secret from the SealedSecret. So it just creates/replaces any existing Secret once for each existing SealedSecret at controller startup, and then whenever any add/update to the SealedSecret is observed[3].

[3] https://github.com/bitnami-labs/sealed-secrets/blob/v0.6.0/cmd/controller/controller.go#L48

Do you churn/update SealedSecrets frequently by any chance? ... Or if it happens on node failure, perhaps there's a race with a second sealed-secrets controller that has just started up... I don't immediately see any races in the code that would result in a Secret being deleted incorrectly, or created with incorrect contents. I can see it might be possible to get a Secret created from an older version of a SealedSecret if there is a race between multiple controllers, and currently the controller will give up on updates after a certain number of retries with only a hard-to-notice error message in the logs. Hrm.

We can do a few things to make the update code more robust, and we should definitely log events on the SealedSecret detailing errors and successful updates during unsealing. That should make any issues a lot clearer.

Thanks for the bug report. It would be super helpful if you manage to capture the sealed-secrets controller logs after you notice a missing Secret (presumably from the newly restarted controller, if you can't get to the old controller node anymore). If you find any lines mentioning the missing Secret name, they will greatly help in understanding what the controller thinks is going on. Without a reproducible test case, I'm just shooting in the dark as to whether I've fixed anything :(

Action items (for the GitHub PR):
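To make mechanism [2] above concrete, here is a minimal sketch (not the controller's actual code; the helper name, struct shape, and group/version are assumptions based on the description above) of how an unsealed Secret can carry a controllerRef back to its SealedSecret, so that kube-controller-manager's garbage collector deletes the Secret when the SealedSecret disappears:

```go
// Illustrative only: tying a derived Secret to its owning SealedSecret via an
// ownerReference / controllerRef.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// buildSecret creates the Secret that will hold the decrypted data.
// `owner` stands in for the SealedSecret object; the group/version/kind are
// assumed here, check the CRD definition for the real values.
func buildSecret(owner metav1.Object, plaintext map[string][]byte) *corev1.Secret {
	ref := metav1.NewControllerRef(owner, schema.GroupVersionKind{
		Group:   "bitnami.com",
		Version: "v1alpha1",
		Kind:    "SealedSecret",
	})
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      owner.GetName(),
			Namespace: owner.GetNamespace(),
			// Because of this controllerRef, deleting the SealedSecret makes
			// the Kubernetes garbage collector delete this Secret as well,
			// independently of the sealed-secrets controller's own cleanup.
			OwnerReferences: []metav1.OwnerReference{*ref},
		},
		Data: plaintext,
	}
}
```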
Thanks for the response. We have improved our logging situation drastically since the last time it occurred, so if it does happen there's a better chance we'll find something. However, the severity of this issue has caused us to stop using sealed secrets on our busier clusters. I'll keep it running in our development environment and post back if it happens again.

Edit: In answer to your first question, the secrets do not come back on their own.
Hi, I've recently updated to the latest version of sealed-secrets on our 5 clusters and I'm unable to reproduce the issue. To monitor when the event happens, I've added kubewatch support for secrets and webhook notifications (vmware-archive/kubewatch#94). This way, we'll be able to detect when secrets vanish and have more info for debugging. Can you give it a try?
Hi, I have seen this behaviour 3 times in our clusters (4 different ones): the secrets magically go away. I think I have seen GC runs around this time in the apiserver audit logs. Restarting the sealed-secrets-controller pod fixes the problem.
Wow. Which k8s version(s)? (just for tracking)

GC runs are a likely suspect, since there might be some upstream bug/race with CRDs and GC. It would be great to collect whatever relevant logs/traces you have (and are willing to share, even privately), and I can pursue this with k8s upstream. If it is a race, I would expect to see it happen around kube-controller-manager restarts or some other "reset" of the kube-controller-manager's internal state.
Regardless of the underlying cause, we can work around this in sealed-secrets in a few ways:
For background context, the sealed-secrets controller does a once-off bonus decryption of all the SealedSecrets at process startup (because it will have missed change events while it was not running). This is why restarting the controller recreates any missing Secrets.

If I had to pick a workaround, I think I'd choose periodic decrypts (first option above), with the period exposed as a flag (0=disable).
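A rough sketch of what that periodic-decrypt workaround could look like (the flag and function names here are assumptions, not an existing controller option), reusing the idempotent "unseal everything" pass that already runs at startup:

```go
// Sketch: re-run the startup unseal pass on a timer; 0 disables the behaviour.
package example

import (
	"flag"
	"time"
)

var reUnsealPeriod = flag.Duration("re-unseal-period", 0,
	"how often to re-decrypt all SealedSecrets and recreate their Secrets (0 = disabled)")

// runPeriodicUnseal invokes unsealAll every period until stop is closed.
// unsealAll stands in for the controller's existing startup decryption pass,
// which is assumed to be idempotent (safe to repeat).
func runPeriodicUnseal(unsealAll func(), stop <-chan struct{}) {
	if *reUnsealPeriod <= 0 {
		return // workaround disabled
	}
	ticker := time.NewTicker(*reUnsealPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			unsealAll() // recreates any Secret that has gone missing since the last pass
		case <-stop:
			return
		}
	}
}
```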
OK, I have more info: Kubernetes version 1.12.4. And it does not affect all SealedSecrets in the cluster, it affects exactly one! The one that has problems is a bit special: it's our default TLS ingress certificate. For whatever reason, the Secret (not the SealedSecret) in the initial namespace goes away, and because of the watch, in all other namespaces too. Unfortunately, the API audit logs are not available for this timeframe. I'll see how I can set up monitoring and better debugging for this.
I do not have logs or any other helpful information, other than that when I was using sealed secrets we were running Kubernetes 1.9, and it was also our SSL certs used by nginx-ingress controllers that were vanishing.
Jumping in blind here: might it be related to the Secret type? Is that SSL cert the only SealedSecret you're using that's not Opaque?
@alice-sawatzky no, all types are Opaque, and there are more TLS certificates than just this one.
OK, now I am sure that our problem is a specific one. So this is not a generic sealed-secrets-controller bug.
Adding some more information, as we saw one of these a couple of weeks ago...

We were running version 0.7 of the product in 7 different Azure AKS Kubernetes clusters, 3 of which are production with customers. We've been using sealed secrets for a while now. We noticed on our preprod cluster that the entire block of sealed secrets had disappeared. Other secrets tied to the ingress controller, container registries, etc. in that namespace still existed, but all the sealed secrets for THAT namespace were gone. I went and restarted the sealed-secrets container in kube-system and they magically reappeared, probably because the SealedSecret custom resources for each secret were still there.

Some of these clusters have a lot of usage; our preprod cluster was in a lull and not taking any updates for 2 months or so. This was the first time this happened for us, so I didn't capture logs or anything, we just got things going again... I now have an hourly process that looks across all our environments for a secret deletion and alerts if that happens.
What k8s version are you running in those clusters?
(emphasis mine) I'm not sure if you mean that all the Secret resources derived (i.e. decrypted) from SealedSecrets were gone, or that the SealedSecret resources themselves were gone? Could it be related to #224? (I.e. some controller is deleting the secrets?)

Anyway, a lot of fixes have been made since v0.7.x; please consider upgrading. There is a potentially relevant fix in #127, or in #110, or #183, but I'm just speculating.

Let me know if you need help with the migration strategy. Feel free to reach out in our Slack channel (see the README for a pointer).
Follow-up from my previous post. I have since upgraded my dev clusters to 0.9.2, so we are current. I also have a build that runs hourly, looks through 30+ namespaces, and checks for a missing secret that I know should never be missing. I also have code that restarts the sealed secrets container when this problem is found.

The build triggered last week... It doesn't trigger very often, but it happened again, even under the new version. These are dev environments with a lot of loading and reloading of deployments, but the secrets themselves rarely change. What logs or things do I need to save so we can get to the bottom of this problem?
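For anyone wanting a similar canary check, here is a sketch of an hourly job along the lines described above, assuming client-go and in-cluster credentials; the namespace and secret names are placeholders, and the alert/restart step is left to your own tooling:

```go
// Sketch of a canary check: verify that Secrets which should always exist are
// still present, and flag any that are missing.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// namespace -> secret name pairs that should never be missing (placeholders).
	canaries := map[string]string{
		"team-a": "default-tls",
		"team-b": "registry-creds",
	}

	for ns, name := range canaries {
		_, err := client.CoreV1().Secrets(ns).Get(context.TODO(), name, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			// Missing: alert and/or restart the sealed-secrets controller,
			// which re-unseals everything at startup.
			fmt.Printf("MISSING: secret %s/%s\n", ns, name)
		case err != nil:
			fmt.Printf("error checking %s/%s: %v\n", ns, name, err)
		}
	}
}
```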
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Closing issue in favor of #224
This is mostly a question, maybe a bug report. The question being: how does the controller interact with Secrets already created that have an owner reference to the SealedSecret CRD? My reason for asking, which might be a bug report, is that on multiple occasions I've had in-use Secrets disappear from my clusters after something bad happens to the controller, e.g. loss of a node, unexpected deletion of the controller pod, etc. I'm not strictly filing this as a bug because I have zero logs from anything indicating sealed-secrets is the culprit, just a hunch.