
Failed node update will leave autoscaler in disabled state #25

Open
Jasper-Ben opened this issue Oct 9, 2023 · 1 comment · May be fixed by #27
Labels: bug (Something isn't working)

Comments

Jasper-Ben (Member)

When an eks-rolling-update job fails, the previous cluster state is not automatically restored; instead, manual intervention is required:

iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     InstanceId i-026bce300ffa7d8d0 is node ip-10-208-33-228.eu-central-1.compute.internal in kubernetes land
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,138 INFO     Draining worker node with kubectl drain ip-10-208-33-228.eu-central-1.compute.internal --ignore-daemonsets --delete-emptydir-data --timeout=300s...
iris-devops-rolling-node-update-manual-42b-hrvhd node/ip-10-208-33-228.eu-central-1.compute.internal already cordoned
iris-devops-rolling-node-update-manual-42b-hrvhd error: unable to drain node "ip-10-208-33-228.eu-central-1.compute.internal" due to error:cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg, continuing command...
iris-devops-rolling-node-update-manual-42b-hrvhd There are pending nodes to be drained:
iris-devops-rolling-node-update-manual-42b-hrvhd  ip-10-208-33-228.eu-central-1.compute.internal
iris-devops-rolling-node-update-manual-42b-hrvhd cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 INFO     Node not drained properly. Exiting
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    ('Rolling update on ASG failed', 'ci-runner-kas-20230710121010942300000012')
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    *** Rolling update of ASG has failed. Exiting ***
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    AWS Auto Scaling Group processes will need resuming manually
iris-devops-rolling-node-update-manual-42b-hrvhd 2023-10-09 10:25:13,990 ERROR    Kubernetes Cluster Autoscaler will need resuming manually

Most notably, the Kubernetes Cluster Autoscaler is left scaled down to 0 replicas. This is an issue, as our workloads (especially CI) heavily depend on functioning auto-scaling.
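For context, manual recovery currently amounts to something like the following minimal sketch. This is not part of eks-rolling-update; the ASG name is taken from the log above, and the `cluster-autoscaler` Deployment name, `kube-system` namespace, and replica count of 1 are assumptions that depend on how the autoscaler is installed:

```python
# Hypothetical manual-recovery sketch: resume the suspended ASG scaling
# processes and scale the Cluster Autoscaler back up.
import boto3
from kubernetes import client, config

ASG_NAME = "ci-runner-kas-20230710121010942300000012"  # ASG name from the log above

# Resume all suspended scaling processes on the ASG.
boto3.client("autoscaling").resume_processes(AutoScalingGroupName=ASG_NAME)

# Scale the cluster-autoscaler Deployment back to 1 replica
# (name/namespace/replicas are assumptions; adjust for your installation).
config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment_scale(
    name="cluster-autoscaler",
    namespace="kube-system",
    body={"spec": {"replicas": 1}},
)
```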

Jasper-Ben added the bug label on Oct 9, 2023
Jasper-Ben (Member, Author) commented:

These exceptions should cause an uncordon on the affected nodes:

https://github.com/deinstapel/eks-rolling-update/blob/master/eksrollup/lib/k8s.py#L195-L198

@martin31821 please look into it. Thx 🙂
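For illustration, a minimal sketch of the suggested behaviour. The helper names (`drain_node`, `uncordon_node`, `resume_asg_processes`, `enable_cluster_autoscaler`) are hypothetical and not the actual eks-rolling-update API; the point is only that a failed drain should roll back rather than exit with everything left disabled:

```python
# Sketch only: on a failed drain, restore the node and the scaling machinery
# before surfacing the error, instead of leaving the autoscaler disabled.
def update_node(node_name, asg_name):
    try:
        drain_node(node_name)           # hypothetical helper wrapping `kubectl drain`
    except Exception:
        uncordon_node(node_name)        # put the node back into service
        resume_asg_processes(asg_name)  # re-enable suspended ASG processes
        enable_cluster_autoscaler()     # scale the Cluster Autoscaler back up
        raise                           # still report the failure to the caller
```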

martin31821 linked a pull request (#27) on Apr 17, 2024 that will close this issue