
Cluster autoscaler gets stuck trying to remove expired spot instances #3255

Closed · tomelliff opened this issue Jun 30, 2020 · 12 comments
Labels: area/cluster-autoscaler, lifecycle/rotten

@tomelliff commented Jun 30, 2020

Currently running cluster autoscaler v1.15.5 (k8s.gcr.io/cluster-autoscaler:v1.15.5) on AWS EKS 1.15.

Earlier today we had a couple of spot instances terminated due to lack of spot capacity. After this, the cluster autoscaler got caught in a loop trying to terminate those instances and stopped spinning up new instances for pending pods.

Looking at the logs, I saw:

...
I0630 10:28:53.861119       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:04.557199       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:16.060817       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:27.058916       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:38.557495       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:49.460021       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:00.369102       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:11.465990       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:23.258690       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:34.070006       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:44.667331       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:56.378597       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:07.369924       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:18.271079       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:29.664671       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:40.657104       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:51.667858       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:02.762352       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:13.865707       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:24.858137       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:35.957325       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:47.157091       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:58.157318       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:33:08.957205       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:33:19.856901       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e

whereas normally we see only a single line per terminated EC2 instance:

$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | grep -vE "(i-00f4781a15e825157|i-01e14d10268725a9e)"
I0630 08:36:44.250191       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0599925c7453cb77a
I0630 08:36:44.654045       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-091cc949feabc0b73
I0630 08:36:45.054044       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-010c0a4900ce2c26e
I0630 08:36:45.449514       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01fac0178f50e6b5a
I0630 08:36:45.648628       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0749a02b515a9d336
I0630 08:36:45.848301       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0affb7d11e4d6387b
I0630 09:16:29.043485       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-06af8d3b29ae1dfe7
I0630 09:16:29.452285       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a31d7f5c24ed9060
I0630 09:16:29.945195       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a8a65b142810e2a7
I0630 09:16:30.242093       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-018e2e9ea629ffb99
I0630 09:16:30.444145       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0f80e9de2e9b3603e
I0630 09:16:30.642077       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0aef774d292d3ecc3
I0630 09:16:30.843216       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0f9cc17f4802fa75c
I0630 09:16:31.145552       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0c56aab5ffcf3507b
I0630 09:16:31.342740       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0990918fe260dd3a2
I0630 09:16:39.743539       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-099feb2d81b9d67c7
I0630 09:16:40.142727       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-018e8040085c387e5
I0630 09:31:47.346911       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-05887dd5ed6958312
I0630 09:31:47.953168       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-02c0a99dbed87b5d5
I0630 09:31:48.146730       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-02ab819f1517c6865
I0630 09:31:48.350052       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e063f6a010fcee9
I0630 09:31:48.553408       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-06ce5b7d5b25c0df5
I0630 09:31:48.746812       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-065fa480170597585
I0630 09:31:48.949777       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a10bbba7d0c87514
I0630 09:35:12.946723       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0cdd06b6666d6e5f0
I0630 09:37:14.951930       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-072e8c44317d481b1
I0630 09:37:25.044884       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-086986642958ed4f8
I0630 09:37:35.245076       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0657aba7d4bfac5ed
I0630 09:37:45.345607       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-00543fefe5b14ec77
I0630 09:38:35.846284       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-05793bfed4348b8ee
I0630 09:40:07.348593       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0bc96f6bb674cd567
I0630 09:40:37.845122       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-00acbb2fbfa4633c6
I0630 09:43:50.546003       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-091bbc4bda9c943e5
I0630 09:45:01.842578       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-073ed01b83fbc582d

The extent of the loop is also shown here:

$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | wc -l
389
$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | grep -vE "(i-00f4781a15e825157|i-01e14d10268725a9e)" | wc -l
34

Throughout this we had lots of pending pods (GitLab CI runner jobs), but grepping the cluster autoscaler logs for those pods returned nothing. Normally a pending pod that gets a new instance spun up for it produces a log line similar to this:

I0630 10:50:42.658282       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"runner-abc123456-project-10-concurrent-abc123456", UID:"cb43c54a-997c-4900-a922-8607761486e2", APIVersion:"v1", ResourceVersion:"13124937", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{test-cpu-2xl-rested-guinea2020062319442341660000000a 11->13 (max: 30)}]

Killing the cluster autoscaler pod and allowing its ReplicaSet to bring it back fixed the issue.

I didn't see an existing issue for this (either open or closed), so I'm not sure whether this has been fixed in a newer version of the cluster autoscaler or whether others have run into it before.

@tomelliff (Author)

I had noticed #2235, which was meant to fix a lack of compute availability when scaling up, and didn't think it was related to this specific issue (getting stuck removing already expired spot instances), but I've just noticed the following change:

https://github.com/kubernetes/autoscaler/pull/2235/files#diff-ade7b95627ea0dd6b6f4deee7f24fa7eR280-R284

	if err := m.asgCache.DeleteInstances(instances); err != nil {
		return err
	}
	klog.V(2).Infof("Some ASG instances might have been deleted, forcing ASG list refresh")
	return m.forceRefresh()

This looks like it might cover this issue as well, but I'm not 100% sure as I haven't spent much time looking yet.

Unfortunately the PR/commit doesn't explain why that change was necessary there, given the PR was seemingly meant to solve the other issue of lack of capacity in a node pool when scaling up.

@gjtempleton (Member)

Having had a look at where your repeated log line comes from, I can't see why the change you highlight would cause this behaviour. It looks to me like the call to the AWS API TerminateInstanceInAutoScalingGroup is not returning an error, so the autoscaler just carries on.
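
Roughly speaking, the call in question looks something like this (a minimal aws-sdk-go sketch, not the actual cluster-autoscaler code; the instance ID is just your example):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Ask the ASG to terminate the instance and decrement desired capacity.
	// If the API accepts the request, err is nil and we only get a scaling
	// activity back, so a caller that checks only for an error will happily
	// retry the same instance on every loop.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-01e14d10268725a9e"),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("TerminateInstanceInAutoScalingGroup failed: %v", err)
	}
	fmt.Println("activity:", aws.StringValue(out.Activity.Description))
}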

How long did you see it stuck in this loop before terminating the pod?

@tomelliff (Author)

Sorry for the late reply.

AWS support finally spotted the issue in my setup. We're using termination lifecycle hooks so that we can replace the ASG and have it block on destroying the ASG while there are still pods running on it (using https://github.com/VirtusLab/kubedrainer). Because of this, the cluster autoscaler keeps calling TerminateInstanceInAutoScalingGroup over and over, and the AWS API just keeps saying "yep, gonna terminate that instance" until the lifecycle hook times out, because the thing handling the lifecycle hook runs on the instance itself, so the instance is already terminated before the lifecycle hook is ever issued by the ASG.
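
For context, the hook is an ordinary ASG termination lifecycle hook, roughly like this (a minimal aws-sdk-go sketch; the ASG name, hook name, timeout and default result are placeholders, not our real values):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// A termination lifecycle hook holds an instance in Terminating:Wait until
	// the hook is completed or the heartbeat timeout expires. In our case the
	// agent that completes the hook (kubedrainer) runs on the instance itself,
	// so once the instance is already dead the hook can only time out.
	_, err := svc.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String("example-asg"),            // placeholder
		LifecycleHookName:    aws.String("drain-before-terminate"), // placeholder
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(300),         // placeholder timeout in seconds
		DefaultResult:        aws.String("CONTINUE"), // what the ASG does on timeout
	})
	if err != nil {
		log.Fatalf("failed to create lifecycle hook: %v", err)
	}
}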

I've always felt that it's odd that the ASG will still trigger the termination lifecycle hook when the relevant instance has already been terminated and have asked about that behaviour with AWS support but suspect it isn't likely to change.

I've temporarily removed the lifecycle hooks to get past this issue for now, and will have to handle replacing ASGs outside of the hours the cluster is in use, or architect around it. I'm also considering whether the CA should check if the instance is already terminated and, if so, skip the TerminateInstanceInAutoScalingGroup call, decrement the ASG size, and move on (a rough sketch of that check is below). Although I don't think that would currently help, because the refresh would still load the instance in the ASG when it calls forceRefresh.
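
Something along these lines is what I mean (just a sketch against aws-sdk-go, not a patch; alreadyTerminating and the skip logic are hypothetical):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// alreadyTerminating reports whether the ASG already considers the instance to
// be on its way out (or no longer knows about it at all), in which case the
// terminate call could be skipped.
func alreadyTerminating(svc *autoscaling.AutoScaling, instanceID string) (bool, error) {
	out, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return false, err
	}
	if len(out.AutoScalingInstances) == 0 {
		return true, nil // no longer known to the ASG
	}
	switch aws.StringValue(out.AutoScalingInstances[0].LifecycleState) {
	case "Terminating", "Terminating:Wait", "Terminating:Proceed", "Terminated":
		return true, nil
	}
	return false, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	id := "i-01e14d10268725a9e"

	gone, err := alreadyTerminating(svc, id)
	if err != nil {
		log.Fatalf("describe failed: %v", err)
	}
	if gone {
		fmt.Printf("%s is already terminating/terminated; skipping terminate call\n", id)
		return
	}
	if _, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String(id),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		log.Fatalf("terminate call failed: %v", err)
	}
}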

Is there a pattern that others are using for termination lifecycle hooks to help replace ASGs? Or another way to replace instances in ASGs in a non-disruptive fashion?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@unixfox commented Jan 11, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2021
@unixfox commented May 11, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to the /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
