
Multus CNI - Certificate expires #1868

Closed
Alan01252 opened this issue Feb 16, 2022 · 29 comments

@Alan01252

What happened:

Certificates for multus cni expired after 90 days

Attach logs

This results in the following error in the log:

[screenshot of the log output]

What you expected to happen:

Multus daemon set to refresh certificates without requiring a restart

How to reproduce it (as minimally and precisely as possible):

Install multus daemon set and wait 90 days

Anything else we need to know?:

Environment:

  • Kubernetes version 1.21
  • CNI Version multus-cni:v3.7.2-eksbuild.1
@Alan01252 Alan01252 added the bug label Feb 16, 2022
@cgchinmay
Contributor

Hi @Alan01252, the multus daemonset config file used in this image is the same as in the upstream repo, except for the part where we make vpc-cni the primary interface. I will check and get back to you.

@cgchinmay cgchinmay self-assigned this Feb 16, 2022
@Alan01252
Author

Alan01252 commented Feb 16, 2022

Thanks @cgchinmay, it's entirely possible I am confused.

[screenshot attached]

It's this file specifically that gets recreated when the pod restarts and that had the 90-day problem.

@cgchinmay
Contributor

@Alan01252 I checked the multus documentation and there was nothing to indicate any explicit expiry being set.

But you should be able to repro this even with the upstream multus daemonset; as far as I know, vpc-cni isn't playing any role here. This could also be specific to your cluster setup.

If you can share your eks cluster arn, I can check more details on the token refresh.

@cgchinmay
Contributor

Thanks for sharing the cluster ARN over email. For future reference you can use this address: k8s-awscni-triage@amazon.com

From the logs, the issue seems to be with the flux-system/helm-controller pods. I see that the issue started when the pod for this deployment (flux-system/helm-controller) was deleted at 2022-02-16T10:30:34.659541Z and recreated.

I am guessing you restarted multus and the issue got resolved.
Could you restart flux-system/helm-controller and see if this issue happens again?

@Alan01252
Author

I am guessing you restarted multus and the issue got resolved.

This is entirely correct :) I believe this pod failed to start because of the issue with multus; restarting the pod afterwards proceeded without issue.

@cgchinmay
Contributor

cgchinmay commented Feb 18, 2022

I am guessing you restarted multus and the issue got resolved.

This is entirely correct :) I believe this pod failed to start because of the issue with multus; restarting the pod afterwards proceeded without issue.

So, based on the timeline in the apiserver logs, I see that the helm-controller pod was first deleted and recreated. As soon as the pod was deleted, multus reported unauthorized errors. So these unauthorized errors in multus could be the source of the helm-controller pod being restarted; I would like to confirm this.
I feel it might not be a multus token issue but something to do with this particular pod.

@Alan01252
Author

Alan01252 commented Feb 18, 2022

I'm curious as to your hypothesis.

I think I could have restarted any pod at this point and it would have failed to start due to multus being unable to access the API. However, you're right in that I deleted this pod because it too was reporting authorization issues; see this issue:

fluxcd/flux2#2074

I should have posted this error earlier too.

[screenshot attached]

I just assumed (perhaps wrongly) that multus was the cause (due to the fact that restarting it fixed everything), but perhaps there was something else?

@cgchinmay
Contributor

cgchinmay commented Feb 18, 2022

This error is expected, which is why your helm-controller was not able to restart correctly. I am trying to understand the unauthorized part. This is what I need to verify:

1. Does the unauthorized error occur when you recreate the helm-controller pod?
2. If it doesn't, then 90 days has some significance here which we need to find out. As far as I know, multus does not set any expiry for its service account token.
3. But if the multus service account token does expire after 90 days, then you should see unauthorized errors with other pods too, not just helm-controller.

Could you run the eks-log-collector script and send the logs to k8s-awscni-triage@amazon.com? Thanks.

@Alan01252
Author

1. Does the unauthorized error occur when you recreate the helm-controller pod?

Restarting now is fine.

  2. Agreed, this part is confusing me! The more research I do, the less sure I am about my original 90-day assertion; perhaps it was just a coincidence that the failure occurred after 90 days, although it feels like too much of a coincidence. The only thing I can think of is that Let's Encrypt certs default to 90 days, but I don't think those are used anywhere in EKS? Feels too coincidental, doesn't it?

  3. Annoyingly, this is the only one I restarted, as all the others were running okay (although I didn't inspect the logs of all the pods).

I'll get those logs sent over!

Thanks again for your patience investigating this!

@cgchinmay
Contributor

I was unable to un-tar the log file that you sent. Instead, could you confirm:

  1. Whether the issue was limited to only the helm-controller pods?
  2. Whether you saw this issue with other pods using multus, in the flux-system namespace or any other namespace?

I haven't been able to find any known issue around multus. From the linked issue (fluxcd/flux2#2074) it does appear that something changed on the RBAC side for helm-controller, so you had to restart the multus pod as well, which regenerated the service account token. I am still researching to find a link between this helm-controller pod restart and the multus service account token.

About this:

The only thing I can think of is that Let's Encrypt certs default to 90 days, but I don't think those are used anywhere in EKS? Feels too coincidental, doesn't it?

How/where are you using it in your cluster?

@Alan01252
Author

Hi @cgchinmay, the tar file I sent extracts fine for me with tar -xvf filename?

I've double-checked what I sent, and even Gmail can see the contents, so I hope it's good! Let me know if not and I can send it again.

I don't think there's any link between multus and the helm chart, other than that once multus is installed the pods require multus to be functioning correctly in order to start up.

I've tried to look for logs from the other running pods during the time frame, but no other pods are showing any issue in this namespace or any other namespace.

Re Let's Encrypt: I've confused things there. We're not using it, but I was wondering if something in EKS might be, which could have caused this?

@cgchinmay
Contributor

Could you resend the file? I tried the above command, but the file shows up as 0 bytes and I cannot extract it. I'm using macOS 10.15.7.

Based on your explanation, I think this error is limited in scope and has to do with the incorrect cleanup of the helm-controller pod. I will take a look at the logs and let you know if I find anything.

@Alan01252
Author

Thanks @cgchinmay, I've resent it.

To clarify, this is what we are currently hypothesizing:

  1. 90 days is a red herring
  2. The helm controller reporting unauthorized also somehow caused rebuilding the helm pod to fail, and this was solved "somehow" by restarting multus?

If you could share your thoughts on the "somehow", that would be appreciated; I can't work out what restarting multus would do that would be specific to a single pod restarting.

If we see this again I'll definitely try restarting other pods too! :)

@cgchinmay
Contributor

So the reason we see this only for the helm-controller pod is that it was being torn down, and that's when multus had to communicate with the API server. That explains why we see the Unauthorized error for this one pod in this case. We do have similar issues, like this one: #1831.
The other issue also started after ~100 days and on a 1.21 cluster. We are talking to the EKS dataplane team and I will have more information.

@envybee

envybee commented Mar 18, 2022

@cgchinmay wondering if you've come across anything to fix this? Thanks

@cgchinmay
Contributor

cgchinmay commented Mar 19, 2022

@envybee based on the logs and information gathered so far, I didn't find anything to hint that the multus token expires after 90 days. There is an update to the service account token in Kubernetes 1.21, but it's valid for 1 year. You can read more about this here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md

Regarding this particular issue, I don't think multus token expiry is the cause. The token did become invalid, but only for the helm-controller pod; it wasn't expired.
I checked the CloudWatch audit logs and there was a failure during this recreation (update) of the helm-controller replica, due to a conflict in resource version, and only after that did I see the unauthorized errors. So the issue seems to be caused by, or specific to, the helm-controller behavior.
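
For anyone who wants to sanity-check the token lifetime on their own cluster: a service account token is a JWT, so you can decode its claims and inspect the issue/expiry timestamps directly. Below is a minimal sketch in Python (an illustration only, assuming the default projected token path inside a pod; it decodes without verifying the signature):

```python
import base64
import json
from datetime import datetime, timezone

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def decode_claims(token: str) -> dict:
    """Decode the JWT payload (second segment) without verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

with open(TOKEN_PATH) as f:
    # Bound service account tokens carry iat/exp claims.
    claims = decode_claims(f.read().strip())

issued = datetime.fromtimestamp(claims["iat"], tz=timezone.utc)
expires = datetime.fromtimestamp(claims["exp"], tz=timezone.utc)
print("issued at:", issued)
print("expires:  ", expires)
print("lifetime: ", expires - issued)
```

The decoded exp shows how long the control plane considers the token valid (upstream extends this to a year, though managed platforms may use a shorter window), while the kubelet rotates the mounted file roughly every hour; a token older than an hour is what the audit logs flag as stale.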

@cgchinmay
Contributor

Based on the findings, I think we can close this issue. I did not find anything in multus or the control plane that would cause certificates to expire after 90 days. Please reopen if the issue reoccurs, and do collect logs and send them to k8s-awscni-triage@amazon.com.

@github-actions

github-actions bot commented Apr 4, 2022

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@Alan01252
Author

Alan01252 commented May 4, 2022

A comment in the linked ticket sounds very much like the reason behind this failure.

fluxcd/flux2#2074 (comment)

@Alan01252
Author

Yes, it's now even emailed out by AWS :)

Note: we got an automated email about this from AWS!

As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format : |namespace:serviceaccount

arn:aws:eks:eu-west-2::cluster/prod-|kube-system:multus
arn:aws:eks:eu-west-2::cluster/prod-**|flux-system:helm-controller

@cgchinmay
Contributor

cgchinmay commented May 13, 2022

@Alan01252 Thanks for the update. I am reopening the issue and will update once I have more information on the required fix.

@cgchinmay cgchinmay reopened this May 13, 2022
@cgchinmay
Contributor

cgchinmay commented May 13, 2022

As per this doc: https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
I verified from the audit logs and did see these two accounts flagged for using stale tokens.
As per the doc, applications must be using a client-go SDK of v0.15.7 or later to auto-renew the token every hour.
Multus already uses a client-go version above v0.15.7, and the same was the case for helm-controller. I do see that a new version of helm-controller addresses this issue. I need to check what changes are being introduced there to address it. A similar fix would be needed on the upstream multus repo.
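
For context on what that client-go behavior amounts to: instead of reading the projected token once at startup and caching it for the life of the process, newer clients re-read it from disk periodically so kubelet rotations get picked up. This is not the actual client-go code, just a rough sketch of the pattern in Python (default projected token path assumed):

```python
import time

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

class FileTokenSource:
    """Re-reads the projected token from disk at most once per refresh interval,
    instead of caching it for the lifetime of the process."""

    def __init__(self, path: str = TOKEN_PATH, refresh_seconds: int = 60):
        self.path = path
        self.refresh_seconds = refresh_seconds
        self._token = None
        self._read_at = 0.0

    def token(self) -> str:
        now = time.monotonic()
        if self._token is None or now - self._read_at > self.refresh_seconds:
            with open(self.path) as f:
                self._token = f.read().strip()
            self._read_at = now
        return self._token

# Usage: fetch the token on every request so a kubelet-rotated token
# is picked up without restarting the pod.
source = FileTokenSource()
headers = {"Authorization": f"Bearer {source.token()}"}
```

An application that copies the token into its own config file at startup and keeps using that copy would sidestep this refresh, which is essentially the situation with multus.kubeconfig discussed later in this thread.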

@cgchinmay
Contributor

cgchinmay commented May 18, 2022

Hi @Alan01252 I took another look at the API server audit logs for your "prod-telemetry" cluster, and the last request using a stale token was at 2022-05-17T12:19:40.900Z (UTC), as the token mounted at /var/run/secrets/kubernetes.io/serviceaccount/token in the requesting multus pod was older than 1 hour. There have been such cases since April 20th, which is why you got an email from AWS.

However, these stale token requests seem to have stopped now. I also inspected your multus pod token located at /var/run/secrets/kubernetes.io/serviceaccount/token and could see that it is regularly getting updated every hour. This is how it is supposed to be as of K8s 1.21, provided multus is using client-go > v0.15.7, which is indeed the case.

Given the most recent behavior, I don't think you should run into this issue again, but I will keep a watch on the API server logs, and you can do so as well with this doc.
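
If anyone wants to confirm the rotation on their own pods, a quick check (just a sketch, assuming the default projected token path) is to watch the mounted file's modification time; with rotation working, the age should never grow much past an hour:

```python
import os
import time
from datetime import datetime, timezone

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

while True:
    # The projected volume swaps the file atomically, so the mtime
    # reflects the last time the kubelet refreshed the token.
    age_min = (time.time() - os.stat(TOKEN_PATH).st_mtime) / 60
    print(f"{datetime.now(timezone.utc).isoformat()} token file age: {age_min:.0f} min")
    time.sleep(600)  # check every 10 minutes
```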

@Alan01252
Author

Thanks @cgchinmay, this is really interesting. I'm now trying to work out what, if anything, we changed between getting that email and now. I really can't think of anything.

Also, are you able to please confirm that this file

/etc/cni/net.d/multus.d/multus.kubeconfig

and the multus user token inside it have nothing to do with the above?

Thanks again for the thorough investigation and response, as always it's very much appreciated! :)

@cgchinmay
Contributor

cgchinmay commented May 18, 2022

The token used for communication with the API server is mounted at /var/run/secrets/kubernetes.io/serviceaccount/token inside the multus pod. As long as this token is updated every hour, your requests shouldn't go stale, regardless of what token is present in multus.kubeconfig.

Based on current observation, the token inside the multus pod and the token inside multus.kubeconfig are only in sync when you create/restart multus. So I believe the token is mounted from this file only the first time; I need to confirm this with the upstream multus code. But after 1 hour, the token inside the pod gets updated independently, with no change to the token inside multus.kubeconfig.

So if there were a scenario where you restarted the multus pod and it used an older multus.kubeconfig file, then you would have a stale token inside the multus pod. But I was not able to produce such a scenario in my tests; on every restart or create, multus.kubeconfig also gets an updated token.
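
One way to test that hypothesis is to compare the "issued at" claim of the token embedded in multus.kubeconfig with the projected token the kubelet keeps rotating. A rough sketch (the paths are the defaults mentioned above and assume both files are visible from where you run it, e.g. inside the multus pod with the host CNI directory mounted; the kubeconfig parsing is deliberately naive):

```python
import base64
import json
import re
from datetime import datetime, timezone

PROJECTED_TOKEN = "/var/run/secrets/kubernetes.io/serviceaccount/token"
MULTUS_KUBECONFIG = "/etc/cni/net.d/multus.d/multus.kubeconfig"

def issued_at(token: str):
    """Return the iat claim of a JWT, or a note if the claim is missing."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    ts = claims.get("iat")
    return datetime.fromtimestamp(ts, tz=timezone.utc) if ts else "no iat claim (legacy token)"

with open(PROJECTED_TOKEN) as f:
    projected = f.read().strip()

# Naively pull the "token:" line out of the kubeconfig written at multus
# startup; a real script should parse the YAML properly.
with open(MULTUS_KUBECONFIG) as f:
    match = re.search(r"token:\s*\"?([A-Za-z0-9._-]+)", f.read())

print("projected token issued at: ", issued_at(projected))
if match:
    print("kubeconfig token issued at:", issued_at(match.group(1)))
    print("same token:", match.group(1) == projected)
```

If the kubeconfig token's issue time stays pinned at the last multus restart while the projected token keeps moving forward, that matches the stale-token pattern discussed in the later comments.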

@Alan01252
Author

Alan01252 commented May 18, 2022

Amazing :) Thank you @cgchinmay!!! I'm happy if you want to close this; I think it provides a good reference for other people, and all the i's have been dotted and t's crossed.

@cgchinmay
Contributor

cgchinmay commented May 20, 2022

@Alan01252 I did see 4-5 requests, all around 2022-05-19T11:38:39.307Z, made by pod kube-multus-ds-zwh6w using a stale token. I could also see the kubelet requesting a new token for the same pod around that time. It's not possible to see the actual token value, but based on the elapsed-time field for the token, it was 20 days old.

You also have a multus secret created 20 days ago, and the token for this secret doesn't refresh. Earlier I saw this with another multus pod, and the elapsed time matched the age of the multus secret. So it makes me believe that the pod used the token from the multus secret while making the request. To confirm this, you could recreate the multus service account, thus regenerating the multus secret, and we can verify from the API logs whether this is indeed the case.

Also, this is an intermittent issue. The elapsed time was less than 90 days, so the API server didn't deny the requests; it just marked them as stale. I will try to gather more information on why we are seeing such intermittent stale requests. It could be kubelet- or multus-specific, I am not sure yet. But this is the pattern I see.

@cgchinmay
Contributor

cgchinmay commented Jun 10, 2022

@Alan01252 Hi, the stale token is actually coming from the multus.kubeconfig file. Initially I thought it was coming from the secret associated with the multus service account, as the multus.kubeconfig file and the service account were created around the same time. But I see that you had restarted multus, which recreated the multus.kubeconfig file, and I am able to correlate the age of the stale tokens with the age of the multus.kubeconfig file. It would be good to open an issue on upstream multus and understand how it fetches this token. Closing this issue on this repo.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
