
Random failure of helm-controller to get last release revision #2074

Closed

XtremeAI opened this issue Nov 11, 2021 · 19 comments
Labels
area/helm Helm related issues and pull requests

Comments

@XtremeAI

XtremeAI commented Nov 11, 2021

Describe the bug

Hi guys,

We run 20+ k8s clusters with workloads managed by Flux. Recently I observed that on three environments, starting at different dates and times, all the Helm releases got stuck upgrading and Flux started to throw the following alert for each HelmRelease:

helmrelease/<hr-name>.flux-system
reconciliation failed: failed to get last release revision: query: failed to query with labels: Unauthorized

The quick way to fix it was to bounce the helm-controller: k rollout restart deployment -n flux-system helm-controller. I had to fix all environments quickly, as they were production ones.
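For reference, the expanded form of that command (assuming k is an alias for kubectl):

# Restart the helm-controller so it comes back with a fresh Pod
kubectl rollout restart deployment -n flux-system helm-controller
# Wait for the new Pod to become ready
kubectl rollout status deployment -n flux-system helm-controller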

Have you observed this problem before, or do you have any ideas why this happens and, more importantly, how to prevent it from happening?

Steps to reproduce

N/A

Expected behavior

N/A

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

13.3

Flux check

N/A

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@makkes
Member

makkes commented Nov 13, 2021

At first sight this looks like the helm-controller Pod lost access rights on some API resources. Could you check if anything around RBAC has changed at the time these failures started to happen?
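For example, a quick sketch of such a check, assuming the default Flux installation where the controller runs under the helm-controller service account in flux-system:

# Verify the RBAC grant is still in place for the helm-controller
# service account (use the namespace of a failing HelmRelease if it
# differs from flux-system)
kubectl auth can-i list secrets \
  --as=system:serviceaccount:flux-system:helm-controller \
  -n flux-system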

@makkes added the area/helm label Nov 13, 2021
@XtremeAI
Author

No, there were clearly no configuration changes; if there had been, a simple deployment restart would not have helped. But you are right: Unauthorized looks like the helm-controller suddenly lost access to something, and apparently this error message comes from a Helm operation.

@starteleport

starteleport commented Dec 8, 2021

Same for me, helm-controller pod restart fixed the problem.

@miph86

miph86 commented Jan 10, 2022

Same here, fixed by restart

@zmpeg

zmpeg commented Jan 13, 2022

Seeing the same issue resolved by helm pod restart after months of uptime.

@stefanprodan
Member

At first sight this looks like the helm-controller Pod lost access rights on some API resources.

Seems that Helm can't list secrets to find the release storage, as if the helm-controller service account lost its privileges. But if that was the case, then all the other API queries should've failed before it reached the helm function.

Maybe these HelmReleases have spec.ServiceAccountName specified?
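A quick way to check that across the cluster (a sketch; the JSONPath simply prints spec.serviceAccountName for every HelmRelease, empty when unset):

# List namespace/name and the configured serviceAccountName for all HelmReleases
kubectl get helmrelease -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.serviceAccountName}{"\n"}{end}'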

@Alan01252

Alan01252 commented Feb 16, 2022

We've just experienced the same issue, no changes to the RBAC for the cluster, and none of the helmreleases define a service account name. Very strange.

@Alan01252

I'm not sure whether this is causation or just correlation, but someone with more experience might enlighten me. I restarted the helm-controller as suggested by others here, and then we noticed that the certificate for our Multus DaemonSet in our EKS cluster had expired, preventing the controller from spinning up again.

Restarting the Multus DaemonSet regenerated the certs, the helm-controller spun back up, and everything was resolved.

@alfoudari

At first sight this looks like the helm-controller Pod lost access rights on some API resources.

Seems that Helm can't list secrets to find the release storage, as if the helm-controller service account lost its privileges. But if that was the case, then all the other API queries should've failed before it reached the helm function.

Maybe these HelmReleases have spec.ServiceAccountName specified?

Page 540: https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf

You see these errors if your service account token has expired on a 1.21 or later cluster.

As mentioned in the Kubernetes 1.21 (p. 69) and 1.22 (p. 67) release notes, the BoundServiceAccount token feature that graduated to beta in 1.21 improves the security of service account tokens by allowing workloads running on Kubernetes to request JSON web tokens that are audience, time, and key bound. Service account tokens now have an expiration of one hour. To enable a smooth migration of clients to the newer time-bound service account tokens, Kubernetes adds an extended expiry period to the service account token over the default one hour. For Amazon EKS clusters, the extended expiry period is 90 days. Your Amazon EKS cluster's Kubernetes API server rejects requests with tokens older than 90 days.

The helm-controller's pod was 91 days old when this problem happened. Restarting the pod, which refreshed the service account token, brought it back to normal.
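For anyone on EKS 1.21 or later, a quick way to spot this (a sketch; the label selector assumes the standard Flux manifests) is to compare the helm-controller Pod's age with the 90-day limit:

# The AGE column shows how long the helm-controller Pod has been running;
# on EKS the API server rejects service account tokens older than 90 days
kubectl get pod -n flux-system -l app=helm-controller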

@stefanprodan
Member

stefanprodan commented May 4, 2022

@abstractpaper this feels like an EKS bug: the kubelet failed to renew the token and Flux ended up using one that had expired.

Can you please see the troubleshooting guide here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md#troubleshooting

@migspedroso

Same here, it was fixed by restarting the pod after 110 days of uptime.

@stefanprodan
Member

@migspedroso which version of Flux are you using? We fixed the stale token issue for helm-controller in v0.31

@Siebjee

Siebjee commented Sep 28, 2022

I can confirm this issue is still present with:

flux: v0.33.0
helm-controller: v0.16.0
image-automation-controller: v0.20.0
image-reflector-controller: v0.16.0
kustomize-controller: v0.20.2
notification-controller: v0.21.0
source-controller: v0.21.2

@stefanprodan
Member

@Siebjee this was fixed back in May in fluxcd/helm-controller#480. You need to upgrade the Flux controllers.
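You can confirm which controller versions are actually deployed with the Flux CLI, e.g.:

# Reports the health of the Flux controllers and the container image
# (including version tag) each one is running
flux check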

@Siebjee

Siebjee commented Sep 28, 2022

Heh, I think I missed that part on this cluster :D

@MKnichal

Had the same issue with my cluster.

flux version
flux: v0.35.0
helm-controller: v0.20.1
kustomize-controller: v0.24.4
notification-controller: v0.23.4
source-controller: v0.24.3

Restart of the helm-controller pod resolved the issue.

@Cajga

Cajga commented Feb 19, 2024

I thought I'd drop this here in case someone finds this thread:
If you are using Flux multi-tenancy (spec.serviceAccountName is defined in the HelmRelease), the helm-controller requires RW access to Secrets in the namespace where the HelmRelease gets installed (as the Helm release metadata is stored in a Secret).
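A minimal sketch of the extra RBAC this implies, assuming the access is granted to the tenant service account referenced by spec.serviceAccountName (tenant-ns and tenant-sa are placeholder names):

# Allow the tenant service account to manage the Secrets that Helm
# uses for release storage in the tenant namespace
kubectl create role helm-release-storage -n tenant-ns \
  --verb=get,list,watch,create,update,patch,delete \
  --resource=secrets
kubectl create rolebinding helm-release-storage -n tenant-ns \
  --role=helm-release-storage \
  --serviceaccount=tenant-ns:tenant-sa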

@abhishekmotifworks

I'm facing the same issue. I tried restarting the helm-controller pod, but it didn't help. If anyone has a different solution, please share it here.

@apuleyo3

apuleyo3 commented Jun 5, 2024

Same for me, helm-controller pod restart fixed the problem.

Same for me; after that we ran a flux reconcile, and that was it.
