
Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized; until aws-node pod restart. #1831

Closed
malpania opened this issue Jan 29, 2022 · 14 comments

@malpania

malpania commented Jan 29, 2022

On nodes more than 100 days old, when certain pods such as nats or the Datadog agent get scheduled, they get stuck in ContainerCreating. Node sizes vary and the subnet has sufficient IPs available. Newer nodes (say 20-30 days old) do not have this issue; if these pods get scheduled onto recently launched nodes, they come up fine. The nodes are not Spot instances; we have a scheduling policy that places pods only on On-Demand nodes.
We have 2 different EKS clusters running in different accounts: one is over 100 days old and the other is 90 days old. This problem started happening recently, from 26 Jan 2022 onwards; prior to that we never had this issue.

The error messages in the pod events are as follows:

  1. nats pod event history.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "551807b9b2e97601e6779a8435b7650d6a54b0c11292c4e6a63365659d0dc846" network for pod "nats-2": networkPlugin cni failed to set up pod "nats-2_nats" network: Unauthorized

  2. Datadog pod event history after restart.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0aa5c3567cf0438bae62b6a9f61567bb66cd33b17ffa533fd0c9341639e864e9" network for pod "datadog-l8jct": networkPlugin cni failed to set up pod "datadog-l8jct_datadog-system" network: Unauthorized

**Workaround**
I tried the solution from #59: restarting the aws-node pod on the node where the issue happened resolved it.

Attached logs
ipamd.log, and similar entries in plugin.log:

plugin.log:{"level":"error","ts":"2022-01-27T17:42:48.076Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.xxx/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T17:42:48.116Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.167/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:08:21.965Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.167/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:15:30.154Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.157/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-27T18:16:24.779Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.32/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:18:48.461Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.32/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:18:49.321Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for xxx.xx.xxx.208/32, no such process"}
plugin.log:{"level":"error","ts":"2022-01-28T16:26:50.593Z","caller":"driver/driver.go:421","msg":"delete NS network: failed to delete host route for 

Datadog pod log (a running pod started failing, and after a restart it was stuck):

2022-01-28 00:38:32 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "'NoneType' object is not subscriptable", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 901, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 336, in check\n self.pod_tags_by_pvc = self._create_pod_tags_by_pvc(self.pod_list)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 253, in _create_pod_tags_by_pvc\n for pod in pods['items']:\nTypeError: 'NoneType' object is not subscriptable\n"}]
2022-01-28 00:38:33 UTC | CORE | WARN | (pkg/tagger/local/tagger.go:206 in pull) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized
2022-01-28 00:38:33 UTC | CORE | WARN | (pkg/tagger/local/tagger.go:206 in pull) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized
2022-01-28 00:38:34 UTC | CORE | ERROR | (pkg/autodiscovery/listeners/kubelet.go:122 in func1) | couldn't fetch "podlist": unexpected status code 401 on https://xxx.xx.xxx.179:10250/pods: Unauthorized

How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it, as it started appearing suddenly.

Environment:

  • EKS version: 1.21
  • CNI: amazon-k8s-cni:v1.9.0
  • AWS-supplied default EKS images
@malpania malpania added the bug label Jan 29, 2022
@malpania malpania changed the title Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized until aws-node pod restart. Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized; until aws-node pod restart. Jan 29, 2022
@achevuru achevuru self-assigned this Feb 9, 2022
@achevuru
Contributor

@malpania Was there an AMI upgrade on these old nodes? So even a previously running Datadog pod on this node started failing as well? I see from the logs shared above that that pod is also hitting the Unauthorized error. Did you check the resource usage on these nodes?

@malpania
Author

malpania commented Feb 15, 2022

Thanks @achevuru for your response.

**Was there an AMI upgrade on these old nodes?**
I just checked all the AMIs.
The old node (125 days old now) is using:

EKS Version - v1.21.4-eks-033ce7e
AMI ID : [ami-01a46e1c21f3c7ab2]
AMI NAME: amazon-eks-node-1.21-v20211008

The latest node (35 days old now) is using:

EKS Version - v1.21.5-eks-bc4871b
AMI ID: ami-0918791b0c01fc977
AMI NAME: amazon-eks-node-1.21-v20211117

**Even a previously running Datadog pod on this node started failing as well?**
Yes, only the Datadog pod was showing the Unauthorized error; all other pods were working fine.

**Did you check the resource usage on these nodes?**
Resource usage on these nodes is very low; here is a screenshot from the Rancher UI.

[Screenshot: Rancher UI, 2022-02-15 15:22, showing low resource usage on the nodes]

@achevuru
Contributor

@malpania Is this an EKS cluster? This looks like some sort of cert expiry/SA token expiry, and these pods are not able to access certain resources (pods in this case). Restarting the CNI daemonset probably regenerated the SA token and restored access to these resources. This is not a CNI issue; we should check what contributed to the Unauthorized errors for the CNI/Datadog pods.

100 days also reminds me that certs signed by providers like Let's Encrypt come with a default 90-day expiry. Not sure if you are using it in any way, but I thought it was worth calling out.

@malpania
Author

malpania commented Mar 2, 2022

Is this an EKS cluster?
Yes, it is.
It happened again yesterday (Mar 1, 2022) on 2 nodes; the nodes were 99 days old. We are investigating and will check today whether any SA token had expired. It's our prod cluster; we don't use Let's Encrypt.

@achevuru: I have captured the zip file after running /opt/cni/bin/aws-cni-support.sh; let me know if you would like me to send it to you by email.

@envybee

envybee commented Mar 21, 2022

@achevuru wondering if you've come across anything regarding this? We're starting to see this happen in multiple production EKS clusters and just wanted to check.

@achevuru
Contributor

achevuru commented Mar 21, 2022

@envybee EKS/K8S 1.21 has the BoundServiceAccountToken feature enabled by default, so ServiceAccount tokens are time- and audience-bound. One hour is the default expiry time, but EKS 1.21 clusters have the migration flag enabled, so new tokens will be honored for a period of 1 year.

If you're using VPC CNI 1.9.x+, you shouldn't be affected by this issue with regard to the VPC CNI pods. If you're on an older CNI version, you can upgrade to the latest version. If your application pods are running into this, check whether they read the refreshed token periodically.

@malpania Sorry, I missed your reply. Were you able to figure out what contributed to the expired SA tokens? Also, was it a brand new EKS 1.21 cluster, or did you upgrade your existing EKS clusters to 1.21? You can send your logs to k8s-awscni-triage@amazon.com.
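
To illustrate what "reading the refreshed token periodically" means, here is a minimal Go sketch (illustrative only, not the CNI's or any SDK's actual code). It re-reads the projected service-account token file before each API-server request instead of caching the value read at start-up; clients that cache the token at start-up are the ones that eventually hit 401 Unauthorized once the bound token rotates. The in-cluster URL and the skipped TLS verification are assumptions made to keep the sketch short.

```go
// Sketch: always use the freshest projected SA token when calling the API server.
// Intended to run inside a pod; not production code.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// Standard path of the projected (bound) service-account token inside a pod.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// freshToken re-reads the token file on every call, so a token rotated by the
// kubelet is picked up automatically instead of a stale cached copy being reused.
func freshToken() (string, error) {
	b, err := os.ReadFile(tokenPath)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	token, err := freshToken()
	if err != nil {
		panic(err)
	}

	// InsecureSkipVerify keeps the example short; a real client should trust
	// the cluster CA bundle mounted next to the token.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET", "https://kubernetes.default.svc/api", nil)
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A 401 here would suggest the token being presented is stale or expired.
	fmt.Println("API server responded with:", resp.Status)
}
```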

@malpania
Author

malpania commented Mar 22, 2022

Hi @achevuru, we have upgraded our cluster from 1.19 to 1.20 and are now running 1.21. We are running CNI version 1.9.0.
Another interesting finding: we cannot run kubectl get ... on the broken node; it returns Unauthorized. On other nodes it works fine.

@malpania
Author

We have raised a similar issue at istio/istio#38077.

@jbilliau-rcd

I am having this exact same issue on multiple clusters... it happens around twice a week, which is REALLY annoying. Running EKS 1.21, CNI 1.10.1, managed node groups. When this happens, NO pods can spawn at all, since the aws-node CNI is unable to set up networking for them. I've collected logs and created support tickets to no avail... if you check the aws-node daemonset logs, there are no errors or anything; everything seems fine.

@achevuru
Contributor

@jbilliau-rcd Are you using the Security Groups Per Pod feature? aws-node (CNI) shouldn't have any issue setting up new pods even if it doesn't have access to the API server. As called out above, K8S/EKS 1.21 has the BoundServiceAccountToken feature enabled by default, and applications dependent on old K8S client SDKs that don't periodically read the refreshed token will run into 401 Unauthorized. If you're on v1.7.x or earlier VPC CNI versions, you can upgrade to v1.8.0+.

Please refer to - https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.21
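
If it helps with debugging, below is a rough sketch (not an official troubleshooting tool) that decodes the projected service-account token mounted in a pod on an affected node and prints its iat/exp claims. That way you can tell whether the token on disk has actually expired or whether something else is producing the 401. The token path is the standard projected mount; everything else here is an assumption for illustration.

```go
// Sketch: print the issued-at and expiry claims of the mounted bound SA token.
// Run inside a pod on the affected node to see if the token on disk is stale.
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	raw, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	// A JWT is header.payload.signature; the claims live in the middle part.
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		panic("token does not look like a JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}

	var claims struct {
		Iat int64 `json:"iat"` // issued at (unix seconds)
		Exp int64 `json:"exp"` // expiry (unix seconds)
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}

	fmt.Println("issued: ", time.Unix(claims.Iat, 0))
	fmt.Println("expires:", time.Unix(claims.Exp, 0))
	fmt.Println("expired now?", time.Now().After(time.Unix(claims.Exp, 0)))
}
```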

@jbilliau-rcd

@achevuru nope, not using the security group feature at all, and we are already on 1.10.1 of the CNI. It happened on 1.7.5, and we thought upgrading would fix it, but it is still happening. As for an application being dependent on an old k8s client SDK, this has nothing to do with our applications; the pods themselves won't even attempt to schedule because of this error, so the application code isn't even being called at this point. It just happened again last night, and restarting the aws-node daemonset fixes it:

Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m39s (x4669 over 41h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d2a873fd7bdaf35975f71bf7d7efc711e371860c040e49bf7d20ee860a5463e7" network for pod "beta-fetch-scheduler-batch-27535920-tjswh": networkPlugin cni failed to set up pod "beta-fetch-scheduler-batch-27535920-tjswh_beta-fetch-scheduler" network: Unauthorized

I'm not quite sure I understand this BoundServiceAccountToken thing... it says here that they expire by default in 1 hour; I assume they are then rotated? https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/

@jbilliau-rcd

Found this issue on the Istio GitHub; maybe aws-node isn't actually the problem and is a red herring?

istio/istio#38077

@malpania
Author

This issue is really Istio-related. Closing this, as the linked ticket has to be fixed by the Istio team.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
