Pods stuck in ContainerCreating due to CNI Failing to Assign IP to Container Until aws-node is deleted #59

Closed
druidsbane opened this issue Apr 10, 2018 · 177 comments

@druidsbane

druidsbane commented Apr 10, 2018

On a node that is only 3 days old, all containers scheduled onto it get stuck in ContainerCreating. This is an m4.large node. The AWS console shows that it has the maximum number of private IPs reserved, so there isn't a problem getting resources. There are no pods running on the node other than daemon sets. All new nodes that came up after cordoning this node came up fine as well. This is a big problem because the node is considered Ready and is accepting pods despite the fact that it can't launch any.

The resolution: once I deleted the aws-node pod for this host from the kube-system namespace, all the stuck containers came up. The version used is amazon-k8s-cni:0.1.4
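In case it helps anyone else, this is roughly what I did (a minimal sketch, assuming the aws-node DaemonSet pods carry the usual k8s-app=aws-node label; <node-name> is a placeholder for the affected node):

kubectl -n kube-system get pods -l k8s-app=aws-node -o wide | grep <node-name>
kubectl -n kube-system delete pod <aws-node-pod-from-the-output-above>

The DaemonSet then recreates the pod on the node.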

In addition to trying to fix it, is there any mechanism for the aws-node process to have a health check and either get killed and restarted or drain and cordon a node if failures are detected? Even as an option?

I have left the machine running in case any more logs are needed. The logs on the host show:
skipping: failed to "CreatePodSandbox" for <Pod Name> error. The main reason was: failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-67484c8f8-xj96c_default\" network: add cmd: failed to assign an IP address to container"

Other choice error logs are:

kernel: IPVS: Creating netns size=2104 id=32780
kernel: IPVS: Creating netns size=2104 id=32781
kernel: IPVS: Creating netns size=2104 id=32782
kubelet: E0410 17:59:24.393151    4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.393185    4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.427733    4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.428095    4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506935    4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506962    4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509609    4070 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509661    4070 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509699    4070 kuberuntime_manager.go:647] createPodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509771    4070 pod_workers.go:186] Error syncing pod 8808ea13-3ce6-11e8-815e-02a9ad89df3c ("<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)"), skipping: failed to "CreatePodSandbox" for "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-69dfc9984b-v8dw9_default\" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable"
@liwenwu-amazon
Contributor

@druidsbane can you log onto this node and run /opt/cni/bin/aws-cni-support.sh and send me (liwenwu@amazon.com) /var/log/aws-routed-eni/aws-cni-support.tar.gz? thanks

@liwenwu-amazon
Contributor

@druidsbane from the log, it looks like you hit issue #18. Here is a snippet of the logs from ipamd.log.2018-04-10-17:

2018-04-10T17:59:17Z [INFO] Send AddNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: no available IP addresses
2018-04-10T17:59:17Z [INFO] Received AddNetwork for NS /proc/5025/ns/net, Pod clamav-bb688497d-hrrln, NameSpace default, Container b30274f12c192d65d0e716b8b3178221cfdeaa8f116fc505c9916d02a883775d, ifname eth0
2018-04-10T17:59:17Z [DEBUG] AssignIPv4Address: IP address pool stats: total:18, assigned 18
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-c10169f7 that do not have available addresses
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-ad147c9b that do not have available addresses
2018-04-10T17:59:17Z [INFO] DataStore has no available IP addresses
2018-04-10T17:59:17Z [INFO] Send AddNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: no available IP addresses
2018-04-10T17:59:17Z [INFO] Received AddNetwork for NS /proc/5140/ns/net, Pod lasso-engine-api-67484c8f8-thtlc, NameSpace default, Container 0e8ad12946302e7bbca9d97aa160ce3b0a97466d5c33052d34a73123dcc60d2b, ifname eth0
2018-04-10T17:59:17Z [DEBUG] AssignIPv4Address: IP address pool stats: total:18, assigned 18
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-c10169f7 that do not have available addresses
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-ad147c9b that do not have available addresses
2018-04-10T17:59:17Z [INFO] DataStore has no available IP addresses
2018-04-10T17:59:17Z [INFO] Send AddNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: no available IP addresses
2018-04-10T17:59:17Z [INFO] Received AddNetwork for NS /proc/5104/ns/net, Pod frontend-merge-request-nginx-84959cf4bc-cpvw7, NameSpace default, Container e5b38c10611d8a5eedbf4156a374a21ed74dd467ba861e10ded526ed627c446d, ifname eth0
2018-04-10T17:59:17Z [DEBUG] AssignIPv4Address: IP address pool stats: total:18, assigned 18
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-c10169f7 that do not have available addresses
2018-04-10T17:59:17Z [DEBUG] AssignPodIPv4Address, skip ENI eni-ad147c9b that do not have available addresses
2018-04-10T17:59:17Z [INFO] DataStore has no available IP addresses

We have a PR (#31) and a proposal (#26) to address this.

@druidsbane
Author

Great, interesting to see the details on how to solve this. I'll follow those two and hopefully they get merged soon so we can test before they're GA!

@liwenwu-amazon
Contributor

Here is the root cause of this problem:
when the following 2 events happen almost simultaneously:

  1. Release of an IP address for a deleted Pod ( kubelet ->(delete) -> CNI ->L-IPAM)
  2. Allocation of an unused IP address for a new Pod (kubelet ->(add) -> CNI -> L-IPAM)

L-IPAM can immediately reassign the IP address which was just released to the new Pod, instead of obeying the pod cooling-period design. This can cause the CNI to fail to set up the routing table for the newly added Pod. When this failure happens, kubelet will NOT invoke a delete Pod CNI call. Since the CNI never releases this IP back in the failure case, the IP is leaked.

Here are snippets from plug.log.xxx:

 // kubelet -> CNI to delete a Pod
2018-04-11T20:08:01Z [INFO] Received CNI del request: ContainerID(1eb7ad292495633a6a68876a741604164321bab048552302856f1a53ab9aeb74) Netns(/proc/17283/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=venn-frontend-v2-7f6c64df86-5vl2v;K8S_POD_INFRA_CONTAINER_ID=1eb7ad292495633a6a68876a741604164321bab048552302856f1a53ab9aeb74) Path(/opt/aws-cni/bin:/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2018-04-11T20:08:01Z [INFO] Delete toContainer rule for 192.168.109.167/32

// kubelet -> cni to add a new Pod
2018-04-11T20:08:01Z [INFO] Received CNI add request: ContainerID(86e77ed0a6917d1401695b4f8281ad23384630d9e9a3a679fbd8703f4579336f) Netns(/proc/20315/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=site-monitor-7b58cf584b-l2cfm;K8S_POD_INFRA_CONTAINER_ID=86e77ed0a6917d1401695b4f8281ad23384630d9e9a3a679fbd8703f4579336f) Path(/opt/aws-cni/bin:/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2018-04-11T20:08:01Z [INFO] Received add network response for pod site-monitor-7b58cf584b-l2cfm namespace default container 86e77ed0a6917d1401695b4f8281ad23384630d9e9a3a679fbd8703f4579336f: 192.168.109.167, table 0
2018-04-11T20:08:01Z [ERROR] Failed SetupPodNetwork for pod site-monitor-7b58cf584b-l2cfm namespace default container 86e77ed0a6917d1401695b4f8281ad23384630d9e9a3a679fbd8703f4579336f: setup NS network: failed to add host route: file exists
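When this happens, the stale host route/rule for the reused IP can be checked directly on the node; a quick sketch, using the 192.168.109.167 address from the log above:

ip route show | grep 192.168.109.167
ip rule list | grep 192.168.109.167

If the route or rule is still present after the old pod is gone, that matches the "file exists" error above.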

@liwenwu-amazon liwenwu-amazon self-assigned this Apr 17, 2018
@incognick

I'm having a similar problem with a pod not coming up, with the same FailedCreatePodSandBox reason.

E0418 19:09:56.012799    3852 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd: failed to assign an IP address to container
Apr 18 19:09:56 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:56.012841    3852 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd:
Apr 18 19:09:56 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:56.012855    3852 kuberuntime_manager.go:647] createPodSandbox for pod "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd
Apr 18 19:09:56 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:56.012904    3852 pod_workers.go:186] Error syncing pod 873acabd-4339-11e8-a92c-0609dc29c0b0 ("fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)"), skipping: failed to "CreatePodSandbox" for "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609d
Apr 18 19:09:56 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: W0418 19:09:56.398522    3852 pod_container_deletor.go:77] Container "4ababa2ccded34032f99ee3e06cd0452a37ecc91539f8e33d8180f1e52e7de41" not found in pod's containers
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.029012    3852 cni.go:259] Error adding network: add cmd: failed to assign an IP address to container
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.029040    3852 cni.go:227] Error while adding to cni network: add cmd: failed to assign an IP address to container
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.130779    3852 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd: failed to assign an IP address to container
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.130822    3852 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd:
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.130836    3852 kuberuntime_manager.go:647] createPodSandbox for pod "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "fluentd-pxks9_kube-system" network: add cmd
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: E0418 19:09:57.130886    3852 pod_workers.go:186] Error syncing pod 873acabd-4339-11e8-a92c-0609dc29c0b0 ("fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609dc29c0b0)"), skipping: failed to "CreatePodSandbox" for "fluentd-pxks9_kube-system(873acabd-4339-11e8-a92c-0609d
Apr 18 19:09:57 ip-172-20-19-97.us-west-2.compute.internal kubelet[3852]: W0418 19:09:57.457326    3852 pod_container_deletor.go:77] Container "70d490486d4b9d1d8e8550b26f03801b465623b28ea302371da845972801ab2a" not found in pod's containers

@liwenwu-amazon I can provide the same logs if you wish. Thank you!

@liwenwu-amazon
Contributor

@incognick can you log onto this node and run /opt/cni/bin/aws-cni-support.sh and send me (liwenwu@amazon.com) /var/log/aws-routed-eni/aws-cni-support.tar.gz? thanks

@edwize

edwize commented Apr 25, 2018

I also got stuck in ContainerCreating:
Warning FailedCreatePodSandBox 16m (x58161 over 17h) kubelet, ip-172-x.x.x.us-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "wordpress-xxxx-vmtkg_default" network: add cmd: failed to assign an IP address to container

And the fix was to reload the "aws-node" pod on the node that had stopped issuing IPs.

@liwenwu-amazon
Contributor

PR #62 should fix this issue. Please re-open if the issue happens again.

@stafot

stafot commented Jun 20, 2018

Having similar failures on AWS EKS.

@max-rocket-internet
Contributor

I also have this on EKS. Very simple situation (1 cluster node, 1 simple deployment). Just scale the deployment from 1 replica to 20 and I get pods that are stuck in ContainerCreating:

your-name-app1-7d8f55b547-2k7kh                             0/1       ContainerCreating   0          3m
your-name-app1-7d8f55b547-4m849                             1/1       Running             0          3m
your-name-app1-7d8f55b547-7wr5g                             0/1       ContainerCreating   0          3m
your-name-app1-7d8f55b547-7xg6j                             0/1       ContainerCreating   0          3m
your-name-app1-7d8f55b547-b97k7                             0/1       ContainerCreating   0          3m
your-name-app1-7d8f55b547-bh7r9                             0/1       ContainerCreating   0          3m
your-name-app1-7d8f55b547-bs4mj                             1/1       Running             0          3m
your-name-app1-7d8f55b547-bvqcb                             1/1       Running             0          3m

Events from the pod:

Events:
  Type     Reason                  Age               From                                 Message
  ----     ------                  ----              ----                                 -------
  Normal   Scheduled               3m                default-scheduler                    Successfully assigned your-name-app1-7d8f55b547-7xg6j to ip-10-0-2-206.ec2.internal
  Normal   SuccessfulMountVolume   3m                kubelet, ip-10-0-2-206.ec2.internal  MountVolume.SetUp succeeded for volume "default-token-4z94m"
  Warning  FailedCreatePodSandBox  3m (x12 over 3m)  kubelet, ip-10-0-2-206.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "your-name-app1-7d8f55b547-7xg6j_default" network: add cmd: failed to assign an IP address to container
  Normal   SandboxChanged          3m (x12 over 3m)  kubelet, ip-10-0-2-206.ec2.internal  Pod sandbox changed, it will be killed and re-created.

@max-rocket-internet
Contributor

@liwenwu-amazon can we reopen this issue?

@liwenwu-amazon
Contributor

liwenwu-amazon commented Jun 20, 2018

@max-rocket-internet can you collect node-level debugging info by running

/opt/cni/bin/aws-cni-support.sh

#collect
/var/log/aws-routed-eni/aws-cni-support.tar.gz

you can send it to me (liwenwu@amazon.com) or attach it to this issue.

In addition, I have a few more questions:

  • What's the instance type of this node?
  • Do you see this the first time you scale up? How long did you wait? Right now, if your instance requires more ENIs, ipamD needs to call the EC2 API to allocate and attach an ENI to your instance. This can take some time, depending on how busy your VPC is (one way to watch the ENI attachments is sketched below).
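A sketch of one way to watch the ENI attachments while this is happening (assumes the AWS CLI is configured somewhere with EC2 read access; <instance-id> is a placeholder):

aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status,SecondaryIPs:length(PrivateIpAddresses)}'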

thanks,

@stafot

stafot commented Jun 21, 2018

@liwenwu-amazon
At first thanks for your answer.

  • We are currently using 4-5 t2.medium instances with the eks-worker-v20 AMI. (This is not the production environment.)
  • The issue seems to be triggered by pods created by CronJob resources (but once it starts happening it affects all kinds of pods, of course).
    I don't feel comfortable sharing all the logs you asked for because they contain a lot of sensitive application info (volume mounts, pods, containers used, etc.). I did find some errors in the logs, though, which are probably related:
2018-06-21T07:59:05Z [INFO] DataStore has no available IP addresses
2018-06-21T07:59:05Z [INFO] Send AddNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: no available IP addresses
......
2018-06-21T07:59:22Z [DEBUG] IP pool stats: total = 15, used = 14, c.currentMaxAddrsPerENI = 6, c.maxAddrsPerENI = 6


2018-06-21T07:59:22Z [DEBUG] Start increasing IP Pool size
2018-06-21T07:59:22Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3

Thanks in advance for any answer.

@max-rocket-internet
Contributor

max-rocket-internet commented Jun 21, 2018

Hi @liwenwu-amazon

I think I've made some mistake somewhere in the Terraform code. Our TF code is largely based on this but I've built a second cluster based on this module and it does not have this problem.

Both clusters are using same subnets, AMI and t2.medium node instance type with a single node.

I see this in the /var/log/aws-routed-eni/ipamd.log.* log files:

2018-06-21T14:31:50Z [DEBUG] AssignIPv4Address: IP address pool stats: total:15, assigned 15
2018-06-21T14:31:50Z [DEBUG] AssignPodIPv4Address, skip ENI eni-79cb08e4 that does not have available addresses
2018-06-21T14:31:50Z [DEBUG] AssignPodIPv4Address, skip ENI eni-294083b4 that does not have available addresses
2018-06-21T14:31:50Z [DEBUG] AssignPodIPv4Address, skip ENI eni-bde82b20 that does not have available addresses
2018-06-21T14:31:50Z [INFO] DataStore has no available IP addresses

To reproduce it, I just create the most basic deployment with 1 replica. Then scale the deployment to 30 replicas and then it happens.
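Roughly (the deployment name and image below are placeholders for illustration, not our real workload):

kubectl create deployment scale-test --image=nginx
kubectl scale deployment scale-test --replicas=30
kubectl get pods -o wide | grep ContainerCreating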

Even if it is a configuration error on my part, it might be nice to know what the problem is since it seems a few people hit this error and can't work out where things went wrong.

do you see these 1st time you scale up?

Yes.

How long have you wait ?

Maximum 60 seconds.

I've emailed you the logs.

@max-rocket-internet
Contributor

2018-06-21T14:31:48Z [DEBUG] Skipping increase IPPOOL due to max ENI already attached to the instance : 3

BTW I know t2.medium is limited to 3 network interfaces but I still can't see how one cluster doesn't have this error but the other does when they are both using a single t2.medium node.

@stafot

stafot commented Jun 21, 2018

@max-rocket-internet the way you describe it, it seems like something is wrong in this example userdata. Our cluster TF code also uses it as a base, with adjustments for our custom needs.

@max-rocket-internet
Contributor

@stafot I thought this too so I checked the userdata from both clusters and they are the same except for sed -i s,MAX_PODS,100,g. One cluster is set to 17 and the other is set to 100. So perhaps it's not a userdata issue.

@stafot

stafot commented Jun 21, 2018

@max-rocket-internet I have 100 too on mine, while the example has 20. So if this is the only difference, something else is probably the root cause.
Thanks for sharing your thoughts again.

@liwenwu-amazon
Contributor

@max-rocket-internet is it possible that you are running into issue #18?
A few more questions:

  • What's the version of the CNI? You can find out by:

    • searching "ipamd.log.xxx" for "Starting L-IPAMD x.x.x ...", or
    • running kubectl get ds aws-node -n kube-system -o yaml
  • Also, can you find out how many pods have been scheduled to run on this node:

kubectl get pod --all-namespaces -o wide | grep <node-ip> | wc -l

@stafot

stafot commented Jun 21, 2018

@liwenwu-amazon CNI version: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:1.0.0
Failures happen on all nodes: 3 nodes run 9 pods and 2 nodes run 11.

@liwenwu-amazon
Contributor

@stafot a few more questions:

  • Can you check the ipamD datastore by running the following, and do you see any discrepancy there?
curl http://localhost:61678/v1/pods | python -m json.tool
  • If you restart ipamD, does the problem go away?
  • Also, can you share or send me (liwenwu@amazon.com) the ipamD metrics?
curl http://localhost:61678/metrics

@max-rocket-internet
Contributor

max-rocket-internet commented Jun 22, 2018

Thanks for the reply @liwenwu-amazon

is it possible that you are running into issue #18 ?

Yes, it looks very similar or identical but I don't know the cause so can't tell for sure.

what's the version of cni?

We are using eks-worker-v20 AMI (ami-dea4d5a1). The container image is 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:1.0.0

also can you find out how many pods have been scheduled to run on this node

33 but only 17 Running, the rest are ContainerCreating

@max-rocket-internet
Contributor

@liwenwu-amazon we should be able to run more than 17 pods on a t2.medium instance, right? I noticed that the other TF module has a hard-coded list of pod limits per instance type:
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L9
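For what it's worth, plugging the standard EC2 limits for a t2.medium (3 ENIs, 6 IPv4 addresses per ENI, per the EC2 instance-type documentation rather than anything in this thread) into the usual formula without custom networking lands exactly on 17:

maxPods = numENIs * (IPv4PerENI - 1) + 2 = 3 * (6 - 1) + 2 = 17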

@stafot

stafot commented Jun 22, 2018

@liwenwu-amazon @max-rocket-internet It seems that I have the same problem: with 15 pods it runs fine, but after that it starts misbehaving and pods get stuck.

@siddharthshah3030

@jayanthvn

I am not able to reproduce it right now

Also, since it's production I can't do much testing there.
Will update when the issue happens again.

Thanks

@tdmalone

tdmalone commented Oct 18, 2020

@jayanthvn Just chiming in to say we're another user also experiencing this.

It has happened on 2 of our 5 EKS clusters (all in separate AWS accounts), 784 times in the last 7 days (according to metrics). Most of the time it only reports once on a container before resolving, but 138 times it has reported a second time; so far we have not gone beyond 4 reports on a single container before it resolves.

While I can't answer all of the questions above at the moment I can answer these:

  1. I'm unsure of exact steps to reproduce but it's likely as pods are scaling up and down; there is also one CronJob that it occasionally reports on (but not always)
  2. All 5 of our clusters are in ap-southeast-2 and the two experiencing this issue are suffixed -test and -internal (in different accounts)
  3. No, with AWS VPC CNI v1.6.3 (and all clusters are on k8s 1.17 eks.2)
  4. Instance type is t3.medium; number of pods change from time to time but we generally fill it up to the max pods possible for the instance type

5 & 6. I will try to remember to get these next time it happens

@jayanthvn
Contributor

Hi @tdmalone

Sorry to hear you are experiencing this so often. Kindly share the logs and cluster ARN with me the next time you hit the issue.

  • Cluster ARN
  • logs - sudo bash /opt/cni/bin/aws-cni-support.sh
  • kubectl describe output of aws-node

Thank you!

@dallasmarlow
Contributor

I'm also experiencing this issue w/ an EKS cluster running 1.18 w/ CNI version 1.7.5. I compiled some of the relevant logs here: https://gist.github.com/dallasmarlow/445c926ea15d0dba71725a44bb9295f0

@jayanthvn
Contributor

Hi @dallasmarlow

Based on the logs you have attached, I see there are only 3 IPs attached to one of the ENIs and the other ENI has no secondary IPs. Can you please check in the EC2 console whether the ENIs and secondary IPs are attached to the instance? Also, can you please share the instance ID?

{"level":"info","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:773","msg":"Added ENI(eni-0c06846cc125fe51a)'s IP 100.64.3.113 to datastore"}
{"level":"info","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:773","msg":"Added ENI(eni-0c06846cc125fe51a)'s IP 100.64.1.101 to datastore"}
{"level":"info","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:773","msg":"Added ENI(eni-0c06846cc125fe51a)'s IP 100.64.7.253 to datastore"}
{"level":"debug","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:763","msg":"IP Address Pool stats: total: 3, assigned: 0"}
{"level":"debug","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:628","msg":"Successfully increased IP pool, total: 3, used: 0"}
{"level":"debug","ts":"2020-11-14T17:02:45.211Z","caller":"ipamd/ipamd.go:641","msg":"IP pool stats: total = 3, used = 0, c.maxIPsPerENI = 3"}
{"level":"debug","ts":"2020-11-14T17:02:47.049Z","caller":"sdk/informer-sync.go:88","msg":"Handle ENIConfig Add/Update: us-east-2a, [sg-0bca3189f9aded496], subnet-0bc1fa1b8ce84159d"}
{"level":"debug","ts":"2020-11-14T17:02:47.050Z","caller":"sdk/informer-sync.go:88","msg":"Handle ENIConfig Add/Update: us-east-2b, [sg-0bca3189f9aded496], subnet-02d3c9b96d47a66d7"}
{"level":"debug","ts":"2020-11-14T17:02:47.051Z","caller":"sdk/informer-sync.go:88","msg":"Handle corev1.Node: ip-10-200-100-10.us-east-2.compute.internal, map[node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true], map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:t3a.small beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-east-2 failure-domain.beta.kubernetes.io/zone:us-east-2a kubernetes.io/arch:amd64 kubernetes.io/hostname:ip-10-200-100-10.us-east-2.compute.internal kubernetes.io/os:linux node.kubernetes.io/instance-type:t3a.small topology.kubernetes.io/region:us-east-2 topology.kubernetes.io/zone:us-east-2a vpc.amazonaws.com/eniConfig:us-east-2a]"}
{"level":"debug","ts":"2020-11-14T17:02:47.878Z","caller":"sdk/informer-sync.go:88","msg":"Handle corev1.Node: ip-10-200-100-10.us-east-2.compute.internal, map[node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true], map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:t3a.small beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-east-2 failure-domain.beta.kubernetes.io/zone:us-east-2a kubernetes.io/arch:amd64 kubernetes.io/hostname:ip-10-200-100-10.us-east-2.compute.internal kubernetes.io/os:linux node.kubernetes.io/instance-type:t3a.small topology.kubernetes.io/region:us-east-2 topology.kubernetes.io/zone:us-east-2a vpc.amazonaws.com/eniConfig:us-east-2a]"}
{"level":"debug","ts":"2020-11-14T17:02:47.893Z","caller":"sdk/informer-sync.go:88","msg":"Handle corev1.Node: ip-10-200-100-10.us-east-2.compute.internal, map[node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true], map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:t3a.small beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-east-2 failure-domain.beta.kubernetes.io/zone:us-east-2a kubernetes.io/arch:amd64 kubernetes.io/hostname:ip-10-200-100-10.us-east-2.compute.internal kubernetes.io/os:linux node.kubernetes.io/instance-type:t3a.small topology.kubernetes.io/region:us-east-2 topology.kubernetes.io/zone:us-east-2a vpc.amazonaws.com/eniConfig:us-east-2a]"}
{"level":"info","ts":"2020-11-14T17:02:51.333Z","caller":"rpc/rpc.pb.go:486","msg":"Received AddNetwork for NS /proc/5333/ns/net, Sandbox a1391ea1df4a7d46bde8b5e1a0ff7c22d6f2d1f0d77b528cee1d502fe720fe68, ifname eth0"}
{"level":"debug","ts":"2020-11-14T17:02:51.333Z","caller":"rpc/rpc.pb.go:486","msg":"AddNetworkRequest: ClientVersion:\"v1.7.5\" K8S_POD_NAME:\"coredns-66bc8b7b7b-hlzsh\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"a1391ea1df4a7d46bde8b5e1a0ff7c22d6f2d1f0d77b528cee1d502fe720fe68\" ContainerID:\"a1391ea1df4a7d46bde8b5e1a0ff7c22d6f2d1f0d77b528cee1d502fe720fe68\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/5333/ns/net\" "}
{"level":"debug","ts":"2020-11-14T17:02:51.333Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignIPv4Address: IP address pool stats: total: 3, assigned 0"}
{"level":"debug","ts":"2020-11-14T17:02:51.334Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignPodIPv4Address: ENI eni-0a62d2dbdda3de253 does not have available addresses"}
{"level":"info","ts":"2020-11-14T17:02:51.334Z","caller":"datastore/data_store.go:499","msg":"AssignPodIPv4Address: Assign IP 100.64.3.113 to sandbox aws-cni/a1391ea1df4a7d46bde8b5e1a0ff7c22d6f2d1f0d77b528cee1d502fe720fe68/eth0"}
{"level":"debug","ts":"2020-11-14T17:02:51.334Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 10.200.100.0/26"}
{"level":"debug","ts":"2020-11-14T17:02:51.334Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 100.64.0.0/20"}
{"level":"info","ts":"2020-11-14T17:02:51.334Z","caller":"rpc/rpc.pb.go:486","msg":"Send AddNetworkReply: IPv4Addr 100.64.3.113, DeviceNumber: 1, err: <nil>"}
{"level":"info","ts":"2020-11-14T17:02:51.584Z","caller":"rpc/rpc.pb.go:486","msg":"Received AddNetwork for NS /proc/5385/ns/net, Sandbox a2872b49cd4affdb306b84a948affc73b5498858fa12a00df971b067c7e89ce1, ifname eth0"}
{"level":"debug","ts":"2020-11-14T17:02:51.584Z","caller":"rpc/rpc.pb.go:486","msg":"AddNetworkRequest: ClientVersion:\"v1.7.5\" K8S_POD_NAME:\"metrics-server-7949d47784-wzmv4\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"a2872b49cd4affdb306b84a948affc73b5498858fa12a00df971b067c7e89ce1\" ContainerID:\"a2872b49cd4affdb306b84a948affc73b5498858fa12a00df971b067c7e89ce1\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/5385/ns/net\" "}
{"level":"debug","ts":"2020-11-14T17:02:51.584Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignIPv4Address: IP address pool stats: total: 3, assigned 1"}
{"level":"debug","ts":"2020-11-14T17:02:51.584Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignPodIPv4Address: ENI eni-0a62d2dbdda3de253 does not have available addresses"}
{"level":"info","ts":"2020-11-14T17:02:51.584Z","caller":"datastore/data_store.go:499","msg":"AssignPodIPv4Address: Assign IP 100.64.1.101 to sandbox aws-cni/a2872b49cd4affdb306b84a948affc73b5498858fa12a00df971b067c7e89ce1/eth0"}
{"level":"debug","ts":"2020-11-14T17:02:51.584Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 10.200.100.0/26"}
{"level":"debug","ts":"2020-11-14T17:02:51.584Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 100.64.0.0/20"}
{"level":"info","ts":"2020-11-14T17:02:51.584Z","caller":"rpc/rpc.pb.go:486","msg":"Send AddNetworkReply: IPv4Addr 100.64.1.101, DeviceNumber: 1, err: <nil>"}

@dallasmarlow
Contributor

dallasmarlow commented Nov 14, 2020

Hello @jayanthvn, the instance ID for this kubelet host is i-01d5a3fe17d09bc48. The instance does show that it currently has 3 secondary IP addresses attached to the ENI, from the subnet configured via the ENIConfig CRD for this topology.kubernetes.io/zone / AZ. The subnet (subnet-0bc1fa1b8ce84159d) shows that there are 2039 free addresses left.
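For reference, this is roughly how I'm checking the custom-networking setup (a sketch; ENIConfig here is the CRD from the VPC CNI custom networking docs, and <node-name> is a placeholder):

kubectl get eniconfigs -o yaml
kubectl describe node <node-name> | grep eniConfig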

@jayanthvn
Contributor

Will check the instance id and get back to you.

VPC CIDR -

{"level":"debug","ts":"2020-11-14T17:02:51.334Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 100.64.0.0/20"}

ENI CIDR -

"level":"debug","ts":"2020-11-14T17:02:45.207Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI: eni-0c06846cc125fe51a, MAC 02:88:46:38:7d:3c, device 1"}
{"level":"debug","ts":"2020-11-14T17:02:45.207Z","caller":"awsutils/awsutils.go:589","msg":"Found CIDR 100.64.0.0/21 for ENI 02:88:46:38:7d:3c"}
{"level":"debug","ts":"2020-11-14T17:02:45.208Z","caller":"awsutils/awsutils.go:589","msg":"Found IP addresses [100.64.7.96 100.64.3.113 100.64.1.101 100.64.7.253] on ENI 02:88:46:38:7d:3c"}

@dallasmarlow
Contributor

dallasmarlow commented Nov 14, 2020

@jayanthvn that instance was scaled down by an ASG, but it appears as though i-0b2609c9deb3a523c is in the same state and I updated the ASG to retain the instance. I have observed that roughly 50% of the kubelet hosts (using the latest EKS AMI and t3a.small instances) come online with this issue after scaling several nodes and testing.

https://gist.github.com/dallasmarlow/0dd2aecc0d16b4aad4dc74f22b3de03d

@sarbajitdutta

I have the same issue right now.
Cluster version 1.18
amazon-k8s-cni-init:v1.7.5
amazon-k8s-cni:v1.7.5

(screenshot attached)

@dallasmarlow
Contributor

@sarbajitdutta can you provide some logs from a host stuck in this state?
@jayanthvn did you ever have a chance to look into this issue?

@jayanthvn
Contributor

Hi @dallasmarlow and @sarbajitdutta

Can you please share the steps to reproduce, collect the logs using sudo bash /opt/cni/bin/aws-cni-support.sh, and email them to me at varavaj@amazon.com?

Thank you!

@jayanthvn
Contributor

@dallasmarlow - Never mind, I found the logs attached here: https://gist.github.com/dallasmarlow/0dd2aecc0d16b4aad4dc74f22b3de03d.

@jayanthvn
Contributor

Hi @dallasmarlow

You are using a t3a.small instance with custom networking. A t3a.small (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) has 2 ENIs and 4 IPs per ENI. With custom networking the primary ENI is not used, so you will have only one ENI with 3 usable IPs, since the first of the 4 IPs is reserved. Hence you can schedule only 3 pods.

{"level":"debug","ts":"2020-11-14T21:44:52.264Z","caller":"ipamd/ipamd.go:852","msg":"IP pool stats: total = 3, used = 1, c.maxIPsPerENI = 3"}
{"level":"debug","ts":"2020-11-14T21:44:52.264Z","caller":"ipamd/ipamd.go:482","msg":"IP pool is too low: available (2) < ENI target (1) * addrsPerENI (3)"}
{"level":"debug","ts":"2020-11-14T21:44:52.264Z","caller":"ipamd/ipamd.go:483","msg":"Starting to increase IP pool size"}
{"level":"debug","ts":"2020-11-14T21:44:52.264Z","caller":"ipamd/ipamd.go:709","msg":"Skip the primary ENI for need IP check"}
{"level":"debug","ts":"2020-11-14T21:44:52.264Z","caller":"ipamd/ipamd.go:483","msg":"Skipping ENI allocation as the max ENI limit of 2 is already reached (accounting for 0 unmanaged ENIs and 0 trunk ENIs)"}

Also, have you set the max pods value [Section 7 here: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html]? Can you please share the kubelet max pods value configured on the instance?

cat /etc/kubernetes/kubelet/kubelet-config.json | grep max
  "maxPods": 29

This should be the math to set the max pods -

maxPods = (numInterfaces - 1) * (maxIpv4PerInterface - 1) + 2 

Subtracting 1 from numInterfaces - Primary ENI is not used with custom networking
Subtracting 1 from maxIpv4PerInterface - Primary IP is reserved
Adding 2 as a constant to account for aws-node and kube-proxy as both use hostNetwork.
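Plugging in a t3a.small with custom networking (2 ENIs, 4 IPv4 addresses per ENI, as cited above):

maxPods = (2 - 1) * (4 - 1) + 2 = 5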

@dallasmarlow
Contributor

@jayanthvn Thank you for looking into this. The kubelet hosts are configured to accept a maximum of 8 pods, which I now see will not work for this instance type due to the ENI limitations you described. It looks like the reason this sometimes works and other times doesn't is a race with the other pods being scheduled (which for my testing cluster include 2 coredns pods and a metrics-server pod). I will update the max pods configuration to 5 pods for this instance type and try again.

@jayanthvn
Contributor

Thanks for confirming @dallasmarlow :)

@lowjoel

lowjoel commented Dec 24, 2020

Hey @jayanthvn I've run aws-cni-support.sh and opened ticket 7782332001 for this, thanks!

@jayanthvn
Contributor

jayanthvn commented Dec 24, 2020

Thanks @lowjoel, we will look into it. Do you have managed add-ons?

@lowjoel

lowjoel commented Dec 25, 2020

@jayanthvn AFAIK no - I didn't know that existed until I searched for the announcement after seeing your comment 😂

@jayanthvn
Contributor

This issue seems to be resolved. So closing this for now.

@vladikk

vladikk commented Mar 4, 2021

@jayanthvn Hey! In which version of the CNI has the issue been resolved? I cannot find anything related in the changelog.

@lowjoel

lowjoel commented Mar 12, 2021

@vladikk I can't remember the exact patch version, but taking the latest 1.7 series should fix it.

If memory serves, #1315 fixed it, so v1.7.7.

@ameena007

Hi, this is still happening.

K8s version: EKS

Major:"1", Minor:"18+"

https://stackoverflow.com/questions/66797082/aws-eks-getting-error-networkplugin-cni-failed-to-set-up-pod

@jayanthvn
Contributor

Hi @ameena007

Can you please share logs from the instance where you are seeing this issue? You can run the below script and mail the output to me at varavaj@amazon.com.

sudo bash /opt/cni/bin/aws-cni-support.sh

@ameena007

Sure @jayanthvn

@NagarajBM

Hey,

The issue is still happening on AWS EKS.

cluster ARN = arn:aws:eks:us-east-1:531074996011:cluster/Cuttle

kubernetes version = 1.21
Platform version = eks.2
coredns version = v1.8.4-eksbuild.1
kube-proxy version = v1.21.2-eksbuild.2
vpc-cni version = v1.9.1-eksbuild.1

I have an AWS EKS cluster and while deploying pods I'm getting this error. "failed to assign an IP address to container"

My Current container setup:-

  1. A cluster with single node group.
  2. The node group with 2 nodes and "c5.2xlarge" instance type.
  3. The node group is launched in a subnet which is not used for any other resources and has a CIDR of "10.0.1.0/24"

Error:
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0707507c6ea3d0a38b1222ebc9acfb22bf9fc8ac5f5fa73ebd272fb107324f60" network for pod "automation-archival-hub-integration-service-6dfc4b4999-v94nq": networkPlugin cni failed to set up pod "automation-archival-hub-integration-service-6dfc4b4999-v94nq_cuttle-automation" network: add cmd: failed to assign an IP address to container.

Extra details:-

  1. The "c5.2x large" can have 58 pods running in it. since only 4 network interface can be attached to the above instance type.
  2. Our cluster has more than 75 pods.
  3. When the 2nd node is launched, only 2 network interfaces are created by the cluster and attached to the node, since each network interface allows fewer than 15 secondary IPs.
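As a sanity check on item 1 (using the usual formula without custom networking; 4 ENIs and 15 IPv4 addresses per ENI are the standard c5.2xlarge limits, taken from the EC2 documentation rather than this thread):

maxPods = numENIs * (IPv4PerENI - 1) + 2 = 4 * (15 - 1) + 2 = 58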

Thanks,
Nagaraj

@srini-ram
Contributor

@NagarajBM - The error message captured in this issue is a generic top-level message and can be attributed to various underlying root causes. It would help if you could open a new GitHub issue; to debug this further we would need the IPAMD logs from the node that has a container stuck in the "ContainerCreating" state. You can attach the logs to the GitHub issue or send them to varavaj@amazon.com, ramabad@amazon.com & achevuru@amazon.com. Here is the command to collect logs from the impacted node:

sudo bash /opt/cni/bin/aws-cni-support.sh
