
CI: eks cluster creation fails with "The maximum number of internet gateways/VPCs has been reached." #1171

Open
orfeas-k opened this issue Dec 3, 2024 · 1 comment
Labels
bug Something isn't working

Comments


orfeas-k commented Dec 3, 2024

Bug Description

As seen in this CI run (https://github.com/canonical/bundle-kubeflow/actions/runs/12130614512/job/33821313708#step:11:47), EKS cluster creation fails with "The maximum number of internet gateways/VPCs has been reached."
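
For context, a likely cause (my assumption; the run itself only shows the quota error): each eksctl cluster creates its own VPC and internet gateway, and AWS accounts default to a quota of 5 of each per region, so clusters leaked by earlier CI runs can exhaust the quota. A quick way to check for leftovers in the region the CI uses:

    # list clusters left over from previous runs
    eksctl get cluster --region eu-central-1

    # count existing VPCs and internet gateways against the per-region quota (default 5)
    aws ec2 describe-vpcs --region eu-central-1 --query 'length(Vpcs)'
    aws ec2 describe-internet-gateways --region eu-central-1 --query 'length(InternetGateways)'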

To Reproduce

Rerun the CI with all 3 supported versions.

Environment

EKS 1.29

Relevant Log Output

2024-12-03 00:39:55 [ℹ]  eksctl version 0.196.0
2024-12-03 00:39:55 [ℹ]  using region eu-central-1
2024-12-03 00:39:55 [ℹ]  subnets for eu-central-1a - public:192.168.0.0/19 private:192.168.64.0/19
2024-12-03 00:39:55 [ℹ]  subnets for eu-central-1b - public:192.168.32.0/19 private:192.168.96.0/19
2024-12-03 00:39:55 [ℹ]  nodegroup "ng-d06bd84e" will use "ami-015db95d8173273e9" [Ubuntu2004/1.29]
2024-12-03 00:39:55 [ℹ]  using Kubernetes version 1.29
2024-12-03 00:39:55 [ℹ]  creating EKS cluster "kubeflow-test-latest" in "eu-central-1" region with managed nodes
2024-12-03 00:39:55 [ℹ]  1 nodegroup (ng-d06bd84e) was included (based on the include/exclude rules)
2024-12-03 00:39:55 [ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2024-12-03 00:39:55 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 00:39:55 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 00:39:55 [ℹ]  CloudWatch logging will not be enabled for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 00:39:55 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 00:39:55 [ℹ]  default addons coredns, vpc-cni, kube-proxy were not specified, will install them as EKS addons
2024-12-03 00:39:55 [ℹ]  
2 sequential tasks: { create cluster control plane "kubeflow-test-latest", 
    2 sequential sub-tasks: { 
        2 sequential sub-tasks: { 
            1 task: { create addons },
            wait for control plane to become ready,
        },
        create managed nodegroup "ng-d06bd84e",
    } 
}
2024-12-03 00:39:55 [ℹ]  building cluster stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:39:56 [ℹ]  deploying stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:26 [ℹ]  waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
2024-12-03 00:40:27 [!]  AWS::EC2::EIP/NATIP: DELETE_IN_PROGRESS
Error: failed to create cluster "kubeflow-test-latest"
2024-12-03 00:40:27 [!]  AWS::IAM::Role/ServiceRole: DELETE_IN_PROGRESS
2024-12-03 00:40:27 [✖]  AWS::IAM::Role/ServiceRole: CREATE_FAILED – "Resource creation cancelled"
2024-12-03 00:40:27 [✖]  AWS::EC2::EIP/NATIP: CREATE_FAILED – "Resource creation cancelled"
2024-12-03 00:40:27 [✖]  AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED – "Resource handler returned message: \"The maximum number of internet gateways has been reached. (Service: Ec2, Status Code: 400, Request ID: a97f20de-a1fa-4fd2-8a2f-f83ef2ccfaf9)\" (RequestToken: 933fc990-1bed-f543-4a69-ac24808072f5, HandlerErrorCode: ServiceLimitExceeded)"
2024-12-03 00:40:27 [✖]  AWS::EC2::VPC/VPC: CREATE_FAILED – "Resource handler returned message: \"The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: d75bf6ea-669a-4723-b056-cfa10de61ad8)\" (RequestToken: 99f79551-eabf-93fe-f5f8-4a65e599edb6, HandlerErrorCode: GeneralServiceException)"
2024-12-03 00:40:27 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2024-12-03 00:40:27 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=eu-central-1 --name=kubeflow-test-latest'
2024-12-03 00:40:27 [✖]  ResourceNotReady: failed waiting for successful resource state
Error: Process completed with exit code 1.
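
If stale CI clusters are indeed the cause, deleting them should free the quota; a sketch follows (the cluster name is a placeholder, not taken from this run). Alternatively, the quota itself can be inspected and an increase requested:

    # delete a leftover cluster together with its CloudFormation stacks (VPC, IGW, ...)
    eksctl delete cluster --region eu-central-1 --name <leftover-cluster-name>

    # inspect the "VPCs per Region" quota and request an increase if needed
    # (L-F678F1CE is the quota code I believe applies here; verify before use)
    aws service-quotas get-service-quota --service-code vpc --quota-code L-F678F1CE --region eu-central-1
    aws service-quotas request-service-quota-increase --service-code vpc --quota-code L-F678F1CE --desired-value 10 --region eu-central-1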

Additional Context

No response

orfeas-k added the bug (Something isn't working) label on Dec 3, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6636.

This message was autogenerated
