Cluster autoscaler gets stuck trying to remove expired spot instances #3255
Comments
I had noticed #2235, which was to fix a lack of compute availability when scaling up, and didn't think it was related to this specific issue (getting stuck removing already expired spot instances), but I've just noticed the following change:

```go
if err := m.asgCache.DeleteInstances(instances); err != nil {
	return err
}
klog.V(2).Infof("Some ASG instances might have been deleted, forcing ASG list refresh")
return m.forceRefresh()
```

This looks like it might cover this issue as well, but I'm not 100% sure as I've not spent much time looking just yet. Unfortunately the PR/commit doesn't seem to justify why that change is necessary there, when the PR was seemingly meant to solve the other issue of lack of capacity in a node pool when scaling up.
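To make the intent of that refresh a bit more concrete, here is a purely conceptual sketch (these are not cluster-autoscaler's real types; `asgCache`, `deleteAndRefresh` and `listInstances` are made-up names): if the cached instance list isn't rebuilt after deletions, the autoscaler can keep acting on instances that no longer exist, which is the kind of loop described in this issue.

```go
package main

import "fmt"

// asgCache is a stand-in for a cached view of an ASG's instances.
type asgCache struct {
	instances []string
}

// deleteAndRefresh requests termination of some instances and then immediately
// re-lists the ASG, so instances that were already terminated (e.g. reclaimed
// spot capacity) drop out of the cache instead of being retried on every loop.
func (c *asgCache) deleteAndRefresh(toDelete []string, listInstances func() []string) {
	for _, id := range toDelete {
		fmt.Println("requesting termination of", id) // stand-in for the real API call
	}
	// Forced refresh: rebuild the cached view from the source of truth.
	c.instances = listInstances()
}

func main() {
	cache := &asgCache{instances: []string{"i-aaa", "i-bbb", "i-ccc"}}
	// Pretend i-bbb was already reclaimed by AWS; the fresh listing no longer returns it.
	cache.deleteAndRefresh([]string{"i-bbb"}, func() []string { return []string{"i-aaa", "i-ccc"} })
	fmt.Println("cache after refresh:", cache.instances)
}
```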
Having had a look at where your repeated log line comes from, I can't see why the change you highlight would cause this behaviour; it looks to me like the problem is in the call to the AWS API. How long did you see it stuck in this loop before terminating the pod?
Sorry for the late reply. AWS support finally spotted the issue in my setup. We're using termination lifecycle hooks so that we can replace the ASG and have it block on destroying the ASG while there are still pods running on it (using https://github.com/VirtusLab/kubedrainer), so cluster autoscaler keeps issuing the terminate call for instances that have already gone away.

I've always felt it's odd that the ASG will still trigger the termination lifecycle hook when the relevant instance has already been terminated. I've asked AWS support about that behaviour, but I suspect it isn't likely to change.

I've temporarily removed the lifecycle hooks to get past this issue for now and will have to handle replacing ASGs outside of used hours or architect around it, but I'm also considering whether the CA should be checking if the instance is already terminated and, if so, just skipping the terminate call.

Is there a pattern that others are using with termination lifecycle hooks to help replace ASGs? Or another way to replace instances in ASGs in a non-disruptive fashion?
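Roughly what I'm imagining for that skip, as a sketch only (this is not the autoscaler's actual code path; `terminateIfStillPresent` is a hypothetical helper built on aws-sdk-go): ask the ASG whether it still knows about the instance before issuing the terminate call, and skip it if AWS has already reclaimed it.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// terminateIfStillPresent only calls TerminateInstanceInAutoScalingGroup when the
// ASG still reports the instance, so an instance that AWS already reclaimed
// (e.g. an expired spot instance) is skipped instead of being retried forever.
func terminateIfStillPresent(svc *autoscaling.AutoScaling, instanceID string) error {
	out, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return err
	}
	if len(out.AutoScalingInstances) == 0 {
		// The ASG no longer knows about this instance; nothing left to terminate.
		fmt.Printf("instance %s already gone, skipping terminate call\n", instanceID)
		return nil
	}
	_, err = svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String(instanceID),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	return err
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	// Example instance ID only; replace with a real one.
	_ = terminateIfStillPresent(svc, "i-0123456789abcdef0")
}
```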
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Currently running cluster autoscaler v1.15.5 (k8s.gcr.io/cluster-autoscaler:v1.15.5) on AWS EKS 1.15.

Earlier today we had a couple of spot instances terminated through lack of capacity. After this the cluster autoscaler got caught in a loop trying to terminate these instances and was unable to schedule new instances for pending pods.
Looking at the logs showed the same instance termination message repeated over and over, while normally we only see a single line for terminating an EC2 instance; that repetition also shows the extent of the loop.

Throughout this we had lots of pending pods (being scheduled by GitLab CI), but grepping the cluster autoscaler logs for those pods didn't return any log lines, whereas normally a pending pod that gets a new instance spun up for it has a corresponding log line.
Killing the cluster autoscaler pod and allowing the ReplicaSet to bring it back fixed the issue.
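For anyone else hitting this, the "kill the pod" workaround just deletes the cluster autoscaler pod so its ReplicaSet recreates it with fresh in-memory state. A client-go sketch of that is below; the kube-system namespace and the app=cluster-autoscaler label are assumptions for a typical deployment, and a plain kubectl delete pod does the same thing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the local kubeconfig; adjust for in-cluster or CI usage.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Namespace and label selector are guesses for a typical cluster-autoscaler deployment.
	pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app=cluster-autoscaler",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Deleting the pod lets the ReplicaSet recreate it with clean in-memory state.
		if err := clientset.CoreV1().Pods("kube-system").Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("deleted", p.Name)
	}
}
```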
I didn't see an existing issue for this (either open or closed), so I'm not sure if this has been fixed in a newer version of the cluster autoscaler or if others have run into this before.