Multus CNI - Certificate expires #1868
Hi @Alan01252, the multus daemonset config file used in this image is the same as the one in the upstream repo, except for the part where we make vpc-cni the primary interface. Will check and get back to you.
Thanks @cgchinmay, it's entirely possible I am confused. It's this file specifically that gets recreated when the pod restarts and had the 90-day problem.
@Alan01252 I checked the multus documentation and there was nothing to indicate any explicit expiry being set. But you should be able to reproduce this even with the upstream multus daemonset; as far as I know, vpc-cni isn't playing any role here. This could also be specific to your cluster setup. If you can share your EKS cluster ARN, I can check more details on the token refresh.
Thanks for sharing the cluster ARN over email. For future reference you can use this address: k8s-awscni-triage@amazon.com. From the logs, the issue seems to be with the flux-system/helm-controller pods. I see that the issue started when the pod for this deployment (flux-system/helm-controller) was deleted at 2022-02-16T10:30:34.659541Z and recreated. I am guessing you restarted multus and the issue got resolved.
This is entirely correct :) I believe this pod failed to start because of the issue with multus; restarting the pod afterwards proceeded without issue.
Based on the timeline in the API server logs, I see that the helm-controller pod was first deleted and recreated. As soon as the pod was deleted, multus reported unauthorized errors. So these unauthorized errors in multus could be the reason the helm-controller pod was restarted; I would like to confirm this.
I'm curious as to your hypothesis. I think I could have restarted any pod at this point and it would have failed to start due to multus being unable to access the API. However, you're right that I deleted this pod because it too was reporting authorization issues; see this issue. I should have posted this error earlier too. I just assumed (perhaps wrongly) that multus was the cause, given that restarting it fixed everything, but perhaps there was something else?
This error is expected, which is why your helm-controller was not able to restart correctly. I am trying to understand the unauthorized part. This is what I need to verify: does the unauthorized error occur when you recreate the helm-controller pod? Could you run the eks-log-collector script and send the logs to k8s-awscni-triage@amazon.com? Thanks.
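For reference, a typical way to run the collector script on the affected worker node (script location per the awslabs/amazon-eks-ami repo; verify the path before use, as it may have moved):

```sh
# Download and run the EKS log collector on the worker node hosting the affected pods.
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh
# The script bundles kubelet, CNI, and system logs into a tarball under /var/log/.
```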
Restarting now is fine.
I'll get those logs sent over! Thanks again for your patience investigating this!
I was unable to un-tar the log file that you sent. Instead, could you confirm:
I haven't been able to find any known issue around multus. From the [linked issue](fluxcd/flux2#2074) it does appear that something changed on the RBAC side for helm-controller, so you had to restart the multus pod as well, which regenerated the service account token. I am still researching to find a link between this helm-controller pod restart and the multus service account token. About this:
How/where are you using it in your cluster?
Hi @cgchinmay, the tar file I sent extracts fine for me. I've double-checked what I sent, and even Gmail can see the contents, so I hope it's good! Let me know if not and I can send it again. I don't think there's any link between multus and the helm chart, other than that once multus is installed, pods require multus to be functioning correctly in order to start up. I've tried to look for logs from the other running pods during the time frame, but no other pods are showing any issue in this namespace or any other. Re Let's Encrypt, I've confused things there: we're not using it, but I was wondering whether something in EKS might be, which could have caused this?
Could you resend the file? I tried the above command, but the file shows up as 0 bytes and I cannot extract it. I'm using:
Based on your explanation, I think this error is limited and has to do with the incorrect cleanup of the helm-controller pod. I will take a look at the logs and let you know if I find anything.
Thanks @cgchinmay, I've resent it. To clarify, this is what we are currently hypothesizing:
If you could share your thoughts on the "somehow" part, that would be appreciated; I can't work out what restarting multus would do that would be specific to a single pod restarting. If we see this again I'll definitely try restarting other pods too! :)
So the reason we see this only for the helm-controller pod is that it was being torn down, and that's when multus had to communicate with the API server. That explains why we see the Unauthorized error for this one pod in this case. We do have similar issues, like this one: #1831.
@cgchinmay wondering if you've come across anything to fix this? Thanks
@envybee based on the logs and information gathered so far, I didn't find anything to suggest that the multus token expires after 90 days. There is an update to the service account token in Kubernetes 1.21, but it is valid for 1 year. You can read more about this here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md. As for this particular issue, I don't think multus token expiry is the cause. The token did become invalid, but only for the helm-controller pod; it wasn't expired.
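If you want to check the token lifetime yourself, here is a rough way to decode the issued-at and expiry claims of the projected token; the pod name is just an example from this cluster, and jq is assumed to be installed locally.

```sh
# Grab the payload segment of the projected service account token inside the multus pod.
payload=$(kubectl -n kube-system exec kube-multus-ds-zwh6w -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token | cut -d '.' -f 2)

# JWTs use unpadded base64url, so restore padding and the standard alphabet before decoding.
pad=$(( (4 - ${#payload} % 4) % 4 ))
padding=$(printf '%*s' "$pad" '' | tr ' ' '=')

# iat/exp are Unix timestamps; their difference is how long the API server issued the token
# for, while kubelet refreshes the mounted copy much more frequently.
printf '%s%s' "$payload" "$padding" | tr '_-' '/+' | base64 -d | jq '{iat, exp}'
```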
Based on the findings, I think we can close this issue. I did not find anything in multus or the control plane that would cause certificates to expire after 90 days. Please reopen if the issue reoccurs, and do collect logs and send them to k8s-awscni-triage@amazon.com.
A comment in the linked ticket sounds very much like the reason behind this failure. |
Yes, it's even now emailed out by AWS :) We got an automated email about this:

As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format <cluster ARN> | namespace:serviceaccount:
arn:aws:eks:eu-west-2::cluster/prod- | kube-system:multus
@Alan01252 Thanks for the update. I am reopening the issue and will update once I have more information on the required fix.
As per this doc, you can identify pods that are using stale tokens: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
Hi @Alan01252, I took another look at the API server audit logs for your "prod-telemetry" cluster. The last request using a stale token was at 2022-05-17T12:19:40.900Z (UTC), because the token mounted at /var/run/secrets/kubernetes.io/serviceaccount/token in the requesting multus pod was older than 1 hour. There had been such cases since April 20th, which is why you got an email from AWS. However, these stale token requests seem to have stopped now. I also inspected your multus pod token at /var/run/secrets/kubernetes.io/serviceaccount/token and could see that it is regularly getting updated every hour. This is how it is supposed to be as of K8s 1.21, provided multus is using client-go > v0.15.7, which is indeed the case. Given the most recent behavior, I don't think you should run into this issue again, but I will keep a watch on the API server logs, and you can do so as well with the doc above.
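One way to spot-check that hourly refresh yourself, assuming the multus image ships basic shell utilities and using an example pod name:

```sh
# The projected token is delivered through a timestamped ..<date> directory and a ..data
# symlink; a recent timestamp shows kubelet has rotated the token lately.
kubectl -n kube-system exec kube-multus-ds-zwh6w -- \
  ls -la /var/run/secrets/kubernetes.io/serviceaccount/
```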
Thanks @cgchinmay, this is really interesting. I'm now trying to work out what, if anything, we changed between getting that email and now; I really can't think of anything. Also, are you able to confirm that this file, /etc/cni/net.d/multus.d/multus.kubeconfig, and the multus user token inside it have nothing to do with the above? Thanks again for the thorough investigation and response; as always it's very much appreciated! :)
The token used for communication with the API server is mounted at /var/run/secrets/kubernetes.io/serviceaccount/token inside the multus pod. As long as this token is updated every hour, your requests shouldn't go stale, regardless of what token is present in multus.kubeconfig. Based on my current observations, the token inside the multus pod and the token inside multus.kubeconfig are only in sync when you create or restart multus. So I believe the token is mounted from this file only the first time; I need to confirm this against the upstream multus code. After 1 hour, the token inside the pod gets updated independently, with no change to the token inside multus.kubeconfig. So if there were a scenario where you restarted the multus pod and it used an older multus.kubeconfig file, then you would have a stale token inside the multus pod. But I was not able to reproduce such a scenario in my tests; on every restart or create, multus.kubeconfig also gets an updated token.
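A quick way to compare the two tokens described above; the pod name is an example, and the second command has to be run on the worker node itself:

```sh
# Tail of the token the pod actually mounts (kubelet rotates this roughly hourly).
kubectl -n kube-system exec kube-multus-ds-zwh6w -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token | tail -c 20; echo

# Tail of the token embedded in the CNI kubeconfig, which is only rewritten when the
# multus pod is created or restarted. Run this on the worker node.
sudo grep 'token:' /etc/cni/net.d/multus.d/multus.kubeconfig | tr -d ' "' | tail -c 20; echo
```

Right after a multus restart the two tails should match; an hour later only the mounted token should have changed.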
Amazing :) Thank you @cgchinmay!!! I'm happy if you want to close this; I think this provides a good reference for other people, and that all the i's have been dotted and t's crossed.
@Alan01252 I did see 4-5 requests, all around 2022-05-19T11:38:39.307Z, made by pod kube-multus-ds-zwh6w using a stale token. I could also see kubelet requesting a new token for the same pod around that time. It's not possible to see the actual token value, but based on the elapsed-time field for the token, it was 20 days old. You also have a multus secret created 20 days ago, and the token for this secret doesn't refresh. Earlier I saw this with another multus pod, and the elapsed time matched the age of the multus secret. So it makes me believe that the pod used the token from the multus secret while making the request. To confirm this, you could recreate the multus service account, thus regenerating the multus secret, and we can verify from the API logs whether this is indeed the case. Also, this is an intermittent issue. The elapsed time was less than 90 days, so the API server didn't deny the requests; it just marks them as stale. I will try to gather more information on why we are seeing such intermittent stale requests. It could be kubelet- or multus-specific, I am not sure yet, but this is the pattern I see.
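For reference, a rough sketch of that check-and-recreate step; the service account name and namespace are taken from this thread, and multus-daemonset.yml is a placeholder for whatever manifest was originally applied:

```sh
# Check the age of the multus service account and its token secret
# (clusters before Kubernetes 1.24 create a long-lived secret automatically).
kubectl -n kube-system get serviceaccount multus
kubectl -n kube-system get secrets | grep multus

# Recreate the service account so a fresh secret is generated, then re-apply the
# original multus manifest to restore the service account and daemonset.
kubectl -n kube-system delete serviceaccount multus
kubectl apply -f multus-daemonset.yml
```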
Hi @Alan01252, the stale token is actually coming from the multus.kubeconfig file. Initially I thought it was coming from the secret associated with the multus service account, as the multus.kubeconfig file and the service account were created around the same time. But I see that you had restarted multus, which recreated the multus.kubeconfig file, and I am able to correlate the stale token's age with the multus.kubeconfig file. It would be good to open an issue on upstream multus to understand how it fetches this token. Closing this issue on this repo.
|
What happened:
Certificates for multus CNI expired after 90 days
Attach logs
This results in the following in the log:
What you expected to happen:
Multus daemon set to refresh certificates without requiring a restart
How to reproduce it (as minimally and precisely as possible):
Install multus daemon set and wait 90 days
Anything else we need to know?:
Environment: