
Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized; until aws-node pod restart. #1831

Closed
malpania opened this issue Jan 29, 2022 · 14 comments

@malpania

malpania commented Jan 29, 2022

On nodes more than 100 days old, when certain pods such as nats or the Datadog agent get scheduled, they get stuck in ContainerCreating. Node sizes vary and the subnet has sufficient IPs available. Newer nodes (say 20-30 days old) do not have this issue; if these pods get scheduled onto recently launched nodes, they come up fine. The nodes are not Spot instances; we have a scheduling policy that places pods only on On-Demand nodes.
We have 2 different EKS clusters running in different accounts: one is over 100 days old and the other is 90 days old. This problem started happening recently, from 26 Jan 2022 onwards; prior to that we never had this issue.

The error messages in the pod events are as follows:

  1. nats pod event history.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "551807b9b2e97601e6779a8435b7650d6a54b0c11292c4e6a63365659d0dc846" network for pod "nats-2": networkPlugin cni failed to set up pod "nats-2_nats" network: Unauthorized

  2. Datadog pod event history after restart.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0aa5c3567cf0438bae62b6a9f61567bb66cd33b17ffa533fd0c9341639e864e9" network for pod "datadog-l8jct": networkPlugin cni failed to set up pod "datadog-l8jct_datadog-system" network: Unauthorized

**Workaround**
I tried the solution from #59: restarting the aws-node pod on the node where the issue happened resolved it.

Attached logs
ipamd.log, and similar entries in plugin.log:

plugin.log:{"level":"error","ts":"2022-01-27T17:42:48.076Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.xxx/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T17:42:48.116Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.167/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:08:21.965Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.167/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:15:30.154Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.157/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:16:24.779Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.32/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:18:48.461Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.32/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:18:49.321Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.208/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:26:50.593Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for 

Datadog pod log (a running pod started failing, and after a restart it was stuck):

2022-01-28 00:38:32 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "'NoneType' object is not subscriptable", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 901, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 336, in check\n self.pod_tags_by_pvc = self._create_pod_tags_by_pvc(self.pod_list)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 253, in _create_pod_tags_by_pvc\n for pod in pods['items']:\nTypeError: 'NoneType' object is not subscriptable\n"}]
2022-01-28 00:38:33 UTC | CORE | WARN | (pkg/tagger/local/tagger.go:206 in pull) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized
2022-01-28 00:38:33 UTC | CORE | WARN | (pkg/tagger/local/tagger.go:206 in pull) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized
2022-01-28 00:38:34 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/kubelet.go:122 in func1) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized

How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it, as it started appearing suddenly.

Environment:

  • EKS version: 1.21
  • CNI: amazon-k8s-cni:v1.9.0
  • AWS-supplied default EKS images
@malpania malpania added the bug label Jan 29, 2022
@malpania malpania changed the title Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized until aws-node pod restart. Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized; until aws-node pod restart. Jan 29, 2022
@achevuru achevuru self-assigned this Feb 9, 2022
@achevuru
Contributor

@malpania Was there an AMI upgrade on these old nodes? So even a previously running Datadog pod on this node started failing as well? I see from the logs shared above that that pod is also hitting the Unauthorized error. Did you check the resource usage on these nodes?

@malpania
Author

malpania commented Feb 15, 2022

Thanks @achevuru for your response.

**Was there an AMI upgrade on these old nodes?**
I just checked all the AMIs.
The old node (125 days old now) is using:

EKS Version - v1.21.4-eks-033ce7e
AMI ID : [ami-01a46e1c21f3c7ab2]
AMI NAME: amazon-eks-node-1.21-v20211008

The latest node (35 days old now) is using:

EKS Version - v1.21.5-eks-bc4871b
AMI ID: ami-0918791b0c01fc977
AMI NAME: amazon-eks-node-1.21-v20211117

**Even a previously running Datadog pod on this node started failing as well?**
Yes, only the Datadog pod was showing the Unauthorized error; all other pods were working fine.

**Did you check the resource usage on these nodes?**
Resource usage on these nodes is very low; here is a screenshot from the Rancher UI.

[Screenshot: Rancher UI, 2022-02-15 15:22, showing low resource usage on the nodes]

@achevuru
Contributor

@malpania Is this an EKS cluster? This looks like some sort of cert expiry/SA token expiry, and these pods are not able to access certain resources (pods in this case). Restarting the CNI daemonset probably regenerated the SA token and restored access to these resources. This is not a CNI issue; we should check what contributed to the Unauthorized errors for the CNI/Datadog pods.

100 days also reminds me that certs signed by providers like Let's Encrypt come with a default 90-day expiry. Not sure if you are using it in any way, but I thought it was worth calling out.

@malpania
Author

malpania commented Mar 2, 2022

Is this an EKS cluster?
Yes, it is.
It happened again yesterday (Mar 1, 2022) on 2 nodes; the nodes were 99 days old. We are investigating and will check today whether any SA token had expired. It's our prod cluster; we don't use Let's Encrypt.

@achevuru: I have captured the zip file after running /opt/cni/bin/aws-cni-support.sh; let me know if you would like me to send it to you by email.

@envybee

envybee commented Mar 21, 2022

@achevuru wondering if you've come across anything regarding this? We're starting to see this happen in multiple production EKS clusters and just wanted to check.

@achevuru
Contributor

achevuru commented Mar 21, 2022

@envybee EKS/K8S 1.21 has the BoundServiceAccountToken feature enabled by default, so ServiceAccount tokens are time- and audience-bound. One hour is the default expiry time, but EKS 1.21 clusters have the migration flag enabled, so new tokens will be honored for a period of 1 year.

If you're using VPC CNI 1.9.x+, you shouldn't be affected by this issue with regard to the VPC CNI pods. If you're on an older CNI version, you can upgrade to the latest version. If your application pods are running into this, check whether they read the refreshed token periodically.

@malpania Sorry, I missed your reply. Were you able to figure out what contributed to the expired SA tokens? Also, was it a brand new EKS 1.21 cluster, or did you upgrade your existing EKS clusters to 1.21? You can send your logs to k8s-awscni-triage@amazon.com.
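
To illustrate what "reading the refreshed token periodically" means, here is a minimal Go sketch (illustrative only, not the CNI's or any SDK's actual code). It re-reads the projected service-account token file before each API-server request instead of caching the value read at start-up; clients that cache the token at start-up are the ones that eventually hit 401 Unauthorized once the bound token rotates. The in-cluster URL and the skipped TLS verification are assumptions made to keep the sketch short.

```go
// Sketch: always use the freshest projected SA token when calling the API server.
// Intended to run inside a pod; not production code.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// Standard path of the projected (bound) service-account token inside a pod.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// freshToken re-reads the token file on every call, so a token rotated by the
// kubelet is picked up automatically instead of a stale cached copy being reused.
func freshToken() (string, error) {
	b, err := os.ReadFile(tokenPath)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	token, err := freshToken()
	if err != nil {
		panic(err)
	}

	// InsecureSkipVerify keeps the example short; a real client should trust
	// the cluster CA bundle mounted next to the token.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET", "https://kubernetes.default.svc/api", nil)
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A 401 here would suggest the token being presented is stale or expired.
	fmt.Println("API server responded with:", resp.Status)
}
```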

@malpania
Author

malpania commented Mar 22, 2022

Hi @achevuru, we have upgraded our cluster from 1.19 to 1.20 and are now running 1.21. We are running CNI version 1.9.0.
Another interesting finding: we cannot run kubectl get ... on the broken node; it returns Unauthorized. On other nodes it works fine.

@malpania
Author

We have raised a similar issue at istio/istio#38077.

@jbilliau-rcd

I am having this exact same issue on multiple clusters... it happens around twice a week, which is REALLY annoying. Running EKS 1.21, CNI 1.10.1, managed node groups. When this happens, NO pods can spawn at all, since the aws-node CNI is unable to set up networking for them. I've collected logs and created support tickets to no avail... if you check the aws-node daemonset logs, there are no errors or anything; everything seems fine.

@achevuru
Contributor

@jbilliau-rcd Are you using the Security Groups Per Pod feature? aws-node (CNI) shouldn't have any issue setting up new pods even if it doesn't have access to the API server. As called out above, K8S/EKS 1.21 has the BoundServiceAccountToken feature enabled by default, and applications dependent on old K8S client SDKs that don't periodically read the refreshed token will run into 401 Unauthorized. If you're on v1.7.x or earlier VPC CNI versions, you can upgrade to v1.8.0+.

Please refer to - https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.21
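
If it helps with debugging, below is a rough sketch (not an official troubleshooting tool) that decodes the projected service-account token mounted in a pod on an affected node and prints its iat/exp claims. That way you can tell whether the token on disk has actually expired or whether something else is producing the 401. The token path is the standard projected mount; everything else here is an assumption for illustration.

```go
// Sketch: print the issued-at and expiry claims of the mounted bound SA token.
// Run inside a pod on the affected node to see if the token on disk is stale.
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	raw, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	// A JWT is header.payload.signature; the claims live in the middle part.
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		panic("token does not look like a JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}

	var claims struct {
		Iat int64 `json:"iat"` // issued at (unix seconds)
		Exp int64 `json:"exp"` // expiry (unix seconds)
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}

	fmt.Println("issued: ", time.Unix(claims.Iat, 0))
	fmt.Println("expires:", time.Unix(claims.Exp, 0))
	fmt.Println("expired now?", time.Now().After(time.Unix(claims.Exp, 0)))
}
```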

@jbilliau-rcd

@achevuru nope, not using the security group feature at all, and we are already on 1.10.1 of the CNI. It happened on 1.7.5, and we thought upgrading would fix it, but it is still happening. As for an application being dependent on an old k8s client SDK, this has nothing to do with our applications; the pods themselves won't even attempt to schedule because of this error, so the application code isn't even being called at this point. It just happened again last night, and restarting the aws-node daemonset fixes it:

Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m39s (x4669 over 41h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d2a873fd7bdaf35975f71bf7d7efc711e371860c040e49bf7d20ee860a5463e7" network for pod "beta-fetch-scheduler-batch-27535920-tjswh": networkPlugin cni failed to set up pod "beta-fetch-scheduler-batch-27535920-tjswh_beta-fetch-scheduler" network: Unauthorized

I'm not quite sure I understand this BoundServiceAccountToken thing... it says here that they expire by default in 1 hour; I assume they are then rotated? https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/

@jbilliau-rcd

Found this issue on the Istio GitHub; maybe aws-node isn't actually the problem and is a red herring?

istio/istio#38077

@malpania
Author

This issue is really Istio-related. Closing this, as the linked ticket has to be fixed by the Istio team.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
