Pods stuck in ContainerCreating due to CNI Failing to Assign IP to Container Until aws-node is deleted #59
Comments
@druidsbane can you log onto this node and run /opt/cni/bin/aws-cni-support.sh and send me (liwenwu@amazon.com) /var/log/aws-routed-eni/aws-cni-support.tar.gz? Thanks!
@druidsbane from the log, it looks like you hit issue #18. Here is a snippet of the logs from ipamd.log.2018-04-10-17:
Great, interesting to see the details on how to solve this. I'll follow those two and hopefully they get merged soon so we can test before they're GA!
Here is the root cause of this problem:
L-IPAM can reassign an IP address that was just released to a new Pod immediately, instead of obeying the pod cooling-period design. This can cause CNI to fail to set up the routing table for the newly added Pod. When this failure happens, kubelet will NOT invoke a delete-Pod CNI call. Since CNI never releases this IP back in the case of failure, the IP is leaked. Here is a snippet from plug.log.xxx:
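To illustrate the cooling-period design described above, here is a minimal, hypothetical Go sketch of an address pool that skips recently released IPs. The names (`addressPool`, `cooldownPeriod`) and the 30-second value are illustrative assumptions, not the actual ipamd implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// cooldownPeriod is an illustrative value; the real cooling period is a
// design detail of ipamd, not something defined here.
const cooldownPeriod = 30 * time.Second

type addressInfo struct {
	inUse      bool
	releasedAt time.Time
}

// addressPool is a toy stand-in for the L-IPAM datastore.
type addressPool struct {
	addrs map[string]*addressInfo
}

// Release marks an IP as free and records when it was released.
func (p *addressPool) Release(ip string) {
	if a, ok := p.addrs[ip]; ok {
		a.inUse = false
		a.releasedAt = time.Now()
	}
}

// Assign hands out a free IP, skipping any address that is still inside
// its cooling period, so a just-released IP is not reused immediately.
func (p *addressPool) Assign() (string, error) {
	now := time.Now()
	for ip, a := range p.addrs {
		if a.inUse {
			continue
		}
		if now.Sub(a.releasedAt) < cooldownPeriod {
			continue // still cooling down, skip it
		}
		a.inUse = true
		return ip, nil
	}
	return "", errors.New("no IP available outside the cooling period")
}

func main() {
	pool := &addressPool{addrs: map[string]*addressInfo{
		"10.0.0.10": {},
		"10.0.0.11": {},
	}}
	ip, _ := pool.Assign()
	pool.Release(ip)
	// Asking again immediately should not return the same IP.
	next, err := pool.Assign()
	fmt.Println(ip, next, err)
}
```

The point is simply that a released address only becomes assignable again after the cooldown, so a pod added right after a delete cannot race the routing-table cleanup for the old pod.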
I'm having a similar problem with a pod not coming up for the same reason. @liwenwu-amazon I can provide the same logs if you wish. Thank you!
@incognick can you log onto this node and run /opt/cni/bin/aws-cni-support.sh and send me (liwenwu@amazon.com) /var/log/aws-routed-eni/aws-cni-support.tar.gz? Thanks!
I also got pods stuck in ContainerCreating. The fix was to reload the "aws-node" pod on the node which had stopped issuing IPs.
PR #62 should fix this issue. Please re-open if the issue happens again.
Having similar failures on
I also have this on EKS. Very simple situation (1 cluster node, 1 simple deployment). Just scale the deployment from 1 replica to 20 and I get pods that are stuck as ContainerCreating.
Events from the pod:
@liwenwu-amazon can we reopen this issue?
@max-rocket-internet can you collect node-level debugging info by running /opt/cni/bin/aws-cni-support.sh? You can send it to me (liwenwu@amazon.com) or attach it to this issue. In addition, I have a few more questions:
Thanks,
@liwenwu-amazon
Thanks in advance for any answer.
I think I've made some mistake somewhere in the Terraform code. Our TF code is largely based on this, but I've built a second cluster based on this module and it does not have this problem. Both clusters are using the same subnets and AMI, and I see this in the
To reproduce it, I just create the most basic deployment with 1 replica. Then scale the deployment to 30 replicas and then it happens. Even if it is a configuration error on my part, it might be nice to know what the problem is since it seems a few people hit this error and can't work out where things went wrong.
Yes.
Maximum 60 seconds. I've emailed you the logs.
BTW I know
@max-rocket-internet the way you described it, it seems like something is wrong with this example.
@stafot I thought this too, so I checked the userdata from both clusters and they are the same except for
@max-rocket-internet I have 100 too on mine, while the example has 20. So if that is the only difference, probably something else is the root cause.
@max-rocket-internet is it possible that you are running into issue #18?
@liwenwu-amazon CNI version:
@stafot a few more questions:
Thanks for the reply @liwenwu-amazon
Yes, it looks very similar or identical but I don't know the cause so can't tell for sure.
We are using eks-worker-v20 AMI (ami-dea4d5a1). The container image is
33, but only 17 are running.
@liwenwu-amazon we should be able to run more than 17 pods on a
@liwenwu-amazon @max-rocket-internet It seems that I have the same problem: with 15 pods it runs fine, but after that pods start misbehaving and getting stuck.
I am not able to reproduce it right now. Also, since it's production I can't do much testing there. Thanks.
@jayanthvn Just chiming in to say we're another user also experiencing this. It has happened on 2 of our 5 EKS clusters (all in separate AWS accounts), 784 times in the last 7 days (according to metrics). Most of the time it only reports once on a container before resolving, but 138 times it has reported a second time; so far we have not gone beyond 4 reports on the one container before it is resolved. While I can't answer all of the questions above at the moment, I can answer these:
5 & 6. I will try to remember to get these next time it happens.
Hi @tdmalone, sorry to hear you are experiencing this often. Kindly share the logs and cluster ARN once you hit the issue next time.
Thank you!
I'm also experiencing this issue w/ an EKS cluster running 1.18 w/ CNI version 1.7.5. I compiled some of the relevant logs here: https://gist.github.com/dallasmarlow/445c926ea15d0dba71725a44bb9295f0
Based on the logs you have attached, I see there are only 3 IPs attached to one of the ENIs and the other ENI has no secondary IPs. Can you please check on the EC2 console whether the ENIs and secondary IPs are attached to the instance? Also, can you please share the instance ID?
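As a side note, the ENI and secondary-IP attachment that the comment above asks to verify in the EC2 console can also be checked programmatically. A hedged sketch using the AWS SDK for Go (the instance ID below is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// Placeholder instance ID; substitute the affected node's ID.
	instanceID := "i-0123456789abcdef0"

	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("attachment.instance-id"),
			Values: []*string{aws.String(instanceID)},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print each attached ENI and how many private IPs it carries,
	// which is the information the comment above asks to verify.
	for _, eni := range out.NetworkInterfaces {
		fmt.Printf("%s: %d private IPs\n",
			aws.StringValue(eni.NetworkInterfaceId),
			len(eni.PrivateIpAddresses))
	}
}
```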
Hello @jayanthvn the instance ID for this kubelet host is |
Will check the instance ID and get back to you.
VPC CIDR -
ENI CIDR -
@jayanthvn that instance was scaled down by an ASG, but it appears as though https://gist.github.com/dallasmarlow/0dd2aecc0d16b4aad4dc74f22b3de03d |
@sarbajitdutta can you provide some logs from a host stuck in this state?
Hi @dallasmarlow and @sarbajitdutta, can you please share the steps to repro and collect the logs using /opt/cni/bin/aws-cni-support.sh? Thank you!
@dallasmarlow - Never mind, I found the logs attached here: https://gist.github.com/dallasmarlow/0dd2aecc0d16b4aad4dc74f22b3de03d.
You are using a t3a.small instance with custom networking. A t3a.small (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) has 2 ENIs and 4 IPs per ENI. With custom networking the primary ENI will not be used, so you will have only one ENI and 3 usable IPs, since the first IP out of the 4 is reserved. Hence you can schedule only 3 pods.
Also, have you set the max pods value? [Section 7 here: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html]. Can you please share the kubelet max pods value configured on the instance?
This should be the math to set the max pods:
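As a rough illustration of that arithmetic (my own sketch using the commonly documented ENI-based formula, with the primary ENI excluded under custom networking as explained above):

```go
package main

import "fmt"

// maxPods estimates the pod capacity of a node from its ENI limits.
// The commonly documented formula is ENIs * (IPs per ENI - 1) + 2;
// with custom networking the primary ENI is not used for pods, so
// one ENI is subtracted first. Treat this as an approximation.
func maxPods(enis, ipsPerENI int, customNetworking bool) int {
	usableENIs := enis
	if customNetworking {
		usableENIs-- // primary ENI is excluded under custom networking
	}
	return usableENIs*(ipsPerENI-1) + 2
}

func main() {
	// t3a.small: 2 ENIs, 4 IPv4 addresses per ENI (per the EC2 ENI table).
	fmt.Println(maxPods(2, 4, true))  // 5 with custom networking
	fmt.Println(maxPods(2, 4, false)) // 8 without custom networking
}
```

For a t3a.small with custom networking this yields 5, which matches the max-pods value adopted in the next comment.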
@jayanthvn Thank you for looking into this. The kubelet hosts are configured to accept a maximum of 8 pods, which I now see will not work for this instance type due to the ENI limitations you described. It looks like the reason this sometimes works and other times doesn't is a race condition with the other pods being scheduled (which for my testing cluster include 2 coredns pods and a metrics-server pod). I will update the max pods configuration to 5 pods for this instance type and try again.
Thanks for confirming @dallasmarlow :)
Hey @jayanthvn, I've run aws-cni-support.sh and opened ticket 7782332001 for this, thanks!
Thanks @lowjoel, we will look into it. Do you have managed addons?
@jayanthvn AFAIK no - I didn't know that existed until I searched for the announcement after seeing your comment 😂
This issue seems to be resolved. So closing this for now. |
@jayanthvn Hey! In which version of the CNI has the issue been resolved? I cannot find anything related in the changelog.
Hi, this is still persistent. K8s version: EKS Major:"1", Minor:"18+",
Hi @ameena007, can you please share logs from the instance where you are seeing this issue? You can run /opt/cni/bin/aws-cni-support.sh and mail the output to me at varavaj@amazon.com.
Sure @jayanthvn |
Hey, the issue is still happening on AWS EKS.
Cluster ARN = arn:aws:eks:us-east-1:531074996011:cluster/Cuttle
Kubernetes version = 1.21
I have an AWS EKS cluster and while deploying pods I'm getting this error: "failed to assign an IP address to container".
My current container setup:
Error:
Extra details:
Thanks,
@NagarajBM - The error message captured in this issue is a generic top-level message and can be attributed to various underlying root causes. It would help if you could open a new GitHub issue; to debug this further we would need the IPAMD logs from the node which has the container stuck in the "ContainerCreating" state. You can attach the logs in the GitHub issue or send them to varavaj@amazon.com, ramabad@amazon.com & achevuru@amazon.com. Here is the command to collect logs from the impacted node:
On a node that is only 3 days old, all containers scheduled to be created on this node get stuck in `ContainerCreating`. This is on an `m4.large` node. The AWS console shows that it has the maximum number of private IPs reserved, so there isn't a problem getting resources. There are no pods running on the node other than daemon sets. All new nodes that came up after cordoning this node came up fine as well. This is a big problem because the node is considered `Ready` and is accepting pods despite the fact that it can't launch any.

The resolution: once I deleted `aws-node` on the host from the `kube-system` namespace, all the stuck containers came up. The version used is `amazon-k8s-cni:0.1.4`.

In addition to trying to fix it, is there any mechanism for the `aws-node` process to have a health check and either get killed and restarted, or drain and cordon the node if failures are detected? Even as an option?

I have left the machine running in case any more logs are needed. The logs on the host show a `skipping: failed to "CreatePodSandbox" for <Pod Name>` error. The main reason was: `failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-67484c8f8-xj96c_default\" network: add cmd: failed to assign an IP address to container`

Other choice error logs are: