Random failure of helm-controller to get last release revision #2074
Comments
At first sight this looks like the helm-controller Pod lost access rights on some API resources. Could you check if anything around RBAC has changed at the time these failures started to happen?
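For anyone checking this, one way to verify the RBAC from outside the pod is a minimal `kubectl auth can-i` impersonation check (a sketch: `<release-namespace>` is a placeholder, and impersonating a service account requires your own user to have impersonation rights):

```sh
# Check whether the helm-controller service account can still list the
# Secrets that Helm uses as release storage in a given namespace.
kubectl auth can-i list secrets \
  --namespace <release-namespace> \
  --as=system:serviceaccount:flux-system:helm-controller
```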
No, there were clearly no configuration changes. Because if there had been, a simple deployment restart would not have helped. But you are right,
Same for me.
Same here, fixed by a restart.
Seeing the same issue, resolved by a helm-controller pod restart after months of uptime.
Seems that Helm can't list secrets to find the release storage, as if the helm-controller service account lost its privileges. But if that was the case, then all the other API queries should've failed before it reached the helm function. Maybe these HelmReleases have `spec.serviceAccountName` set?
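One quick way to check that across a cluster (a sketch assuming the standard `helm.toolkit.fluxcd.io` HelmRelease CRD; `custom-columns` prints `<none>` when the field is unset):

```sh
# List every HelmRelease together with its spec.serviceAccountName;
# <none> means the release runs with the helm-controller's own account.
kubectl get helmreleases --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SERVICE_ACCOUNT:.spec.serviceAccountName'
```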
We've just experienced the same issue: no changes to the RBAC for the cluster, and none of the HelmReleases define a service account name. Very strange.
I'm not sure if this is cause or correlation, but someone with more experience might enlighten me. I restarted the helm-controller as suggested by others here, and then we noticed that the certificate for our Multus DaemonSet in our EKS cluster had expired, preventing the controller from spinning up again. Restarting the Multus DaemonSet regenerated the certs, the helm-controller spun back up, and everything was resolved.
Page 540: https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf
The helm-controller's pod was 91 days old when this problem happened. Restarting the pod and refreshing the service account's token brought it back to normal.
@abstractpaper this feels like an EKS bug: kubelet failed to renew the token and Flux ended up using one that had expired. Can you please see the troubleshooting guide here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md#troubleshooting
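As a rough way to check for the stale-token symptom described in that KEP's troubleshooting section (a sketch that assumes your user is allowed to read the API server's `/metrics` endpoint):

```sh
# A non-zero, growing counter means some workload keeps authenticating with
# a projected token that is past its refresh period (i.e. never reloaded it).
kubectl get --raw /metrics | grep serviceaccount_stale_tokens_total

# Restarting the deployment remounts a freshly projected token.
kubectl -n flux-system rollout restart deployment helm-controller
```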
Same here, it was fixed by restarting the pod after 110 days of uptime.
@migspedroso which version of Flux are you using? We fixed the stale token issue for helm-controller in v0.31.
I can confirm this issue is still present at:
@Siebjee this was fixed back in May in fluxcd/helm-controller#480. You need to upgrade the Flux controllers.
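For anyone landing here, the upgrade path looks roughly like this (a sketch assuming a `flux bootstrap github` install; adjust for GitLab, Terraform, or manifest-based setups, and the owner/repository/path values are placeholders):

```sh
# Show which controller versions are currently running in the cluster.
flux check

# Re-running bootstrap with an up-to-date flux CLI upgrades the controllers
# in flux-system to the versions shipped with that CLI.
flux bootstrap github --owner=<org> --repository=<repo> --path=<cluster-path>
```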
Heh, I think I missed that part on this cluster :D
Had the same issue with my cluster. A restart of the helm-controller pod resolved the issue.
I thought I'd drop it here in case someone finds this thread:
I'm facing the same issue. I tried restarting the helm-controller pod, but it didn't help. If anyone has a different solution, please share it here.
Same for me, after that we ran a
Describe the bug
Hi guys,
We run 20+ k8s clusters with workloads managed by Flux. Recently I observed that on three environments, starting at different dates and times, all the Helm releases got stuck upgrading and Flux started to throw the following alert for each HelmRelease:
The quick way to fix that was to bounce the helm-controller: `k rollout restart deployment -n flux-system helm-controller`. I had to fix all environments quickly, as those were production ones. Have you observed this problem before, or do you have any idea why it happens and, more importantly, how to prevent it from happening?
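For completeness, the workaround amounts to roughly the following (a sketch; the release name and namespace are placeholders, and the manual reconcile just avoids waiting for the next interval):

```sh
# Bounce the controller and wait for the new pod to become ready.
kubectl -n flux-system rollout restart deployment helm-controller
kubectl -n flux-system rollout status deployment helm-controller

# Nudge a stuck release instead of waiting for its next reconciliation.
flux reconcile helmrelease <name> -n <namespace>
```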
Steps to reproduce
N/A
Expected behavior
N/A
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
13.3
Flux check
N/A
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct