
Failed node update will leave autoscaler in disabled state #25

Open
Jasper-Ben opened this issue Oct 9, 2023 · 1 comment · May be fixed by #27
Labels: bug (Something isn't working)

Comments

Jasper-Ben (Member)

When an eks-rolling-update job fails, the previous cluster state is not automatically restored; instead, manual intervention is required:

iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     InstanceId i-026bce300ffa7d8d0 is node ip-10-208-33-228.eu-central-1.compute.internal in kubernetes land
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     Draining worker node with kubectl drain ip-10-208-33-228.eu-central-1.compute.internal --ignore-daemonsets --delete-emptydir-data --timeout=300s...
iris-devops-rolling-node-update-manual-42b-hrvhd node/ip-10-208-33-228.eu-central-1.compute.internal already cordoned
iris-devops-rolling-node-update-manual-42b-hrvhd error: unable to drain node "ip-10-208-33-228.eu-central-1.compute.internal" due to error:cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg, continuing command...
iris-devops-rolling-node-update-manual-42b-hrvhd There are pending nodes to be drained:
iris-devops-rolling-node-update-manual-42b-hrvhd  ip-10-208-33-228.eu-central-1.compute.internal
iris-devops-rolling-node-update-manual-42b-hrvhd cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 INFO     Node not drained properly. Exiting
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    ('Rolling update on ASG failed', 'ci-runner-kas-20230710121010942300000012')
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    *** Rolling update of ASG has failed. Exiting ***
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    AWS Auto Scaling Group processes will need resuming manually
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    Kubernetes Cluster Autoscaler will need resuming manually

Most notably, the Kubernetes Cluster Autoscaler is left scaled down to 0 replicas. This is an issue, as our workloads (especially CI) heavily depend on functioning auto-scaling.
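For context, manual recovery currently amounts to something like the following minimal sketch. This is not part of eks-rolling-update; the ASG name is taken from the log above, and the `cluster-autoscaler` Deployment name, `kube-system` namespace, and replica count of 1 are assumptions that depend on how the autoscaler is installed:

```python
# Hypothetical manual-recovery sketch: resume the suspended ASG scaling
# processes and scale the Cluster Autoscaler back up.
import boto3
from kubernetes import client, config

ASG_NAME = "ci-runner-kas-20230710121010942300000012"  # ASG name from the log above

# Resume all suspended scaling processes on the ASG.
boto3.client("autoscaling").resume_processes(AutoScalingGroupName=ASG_NAME)

# Scale the cluster-autoscaler Deployment back to 1 replica
# (name/namespace/replicas are assumptions; adjust for your installation).
config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment_scale(
    name="cluster-autoscaler",
    namespace="kube-system",
    body={"spec": {"replicas": 1}},
)
```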

Jasper-Ben added the bug label on Oct 9, 2023
Jasper-Ben (Member, Author) commented:

These exceptions should cause an uncordon on the affected nodes:

https://github.com/deinstapel/eks-rolling-update/blob/master/eksrollup/lib/k8s.py#L195-L198

@martin31821 please look into it. Thx 🙂
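For illustration, a minimal sketch of the suggested behaviour. The helper names (`drain_node`, `uncordon_node`, `resume_asg_processes`, `enable_cluster_autoscaler`) are hypothetical and not the actual eks-rolling-update API; the point is only that a failed drain should roll back rather than exit with everything left disabled:

```python
# Sketch only: on a failed drain, restore the node and the scaling machinery
# before surfacing the error, instead of leaving the autoscaler disabled.
def update_node(node_name, asg_name):
    try:
        drain_node(node_name)           # hypothetical helper wrapping `kubectl drain`
    except Exception:
        uncordon_node(node_name)        # put the node back into service
        resume_asg_processes(asg_name)  # re-enable suspended ASG processes
        enable_cluster_autoscaler()     # scale the Cluster Autoscaler back up
        raise                           # still report the failure to the caller
```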

martin31821 linked a pull request (#27) on Apr 17, 2024 that will close this issue