
Cluster autoscaler gets stuck trying to remove expired spot instances #3255

Closed · tomelliff opened this issue Jun 30, 2020 · 12 comments
Labels: area/cluster-autoscaler, lifecycle/rotten

@tomelliff commented Jun 30, 2020

Currently running cluster autoscaler v1.15.5 (k8s.gcr.io/cluster-autoscaler:v1.15.5) on AWS EKS 1.15.

Earlier today we had a couple of spot instances terminated due to lack of spot capacity. After this, the cluster autoscaler got caught in a loop trying to terminate those instances and stopped spinning up new instances for pending pods.

Looking at the logs, I saw:

...
I0630 10:28:53.861119       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:04.557199       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:16.060817       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:27.058916       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:38.557495       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:29:49.460021       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:00.369102       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:11.465990       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:23.258690       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:34.070006       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:44.667331       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:30:56.378597       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:07.369924       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:18.271079       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:29.664671       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:40.657104       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:31:51.667858       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:02.762352       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:13.865707       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:24.858137       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:35.957325       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:47.157091       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:32:58.157318       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:33:08.957205       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e
I0630 10:33:19.856901       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e14d10268725a9e

whereas normally we see only a single line per terminated EC2 instance:

$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | grep -vE "(i-00f4781a15e825157|i-01e14d10268725a9e)"
I0630 08:36:44.250191       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0599925c7453cb77a
I0630 08:36:44.654045       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-091cc949feabc0b73
I0630 08:36:45.054044       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-010c0a4900ce2c26e
I0630 08:36:45.449514       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01fac0178f50e6b5a
I0630 08:36:45.648628       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0749a02b515a9d336
I0630 08:36:45.848301       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0affb7d11e4d6387b
I0630 09:16:29.043485       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-06af8d3b29ae1dfe7
I0630 09:16:29.452285       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a31d7f5c24ed9060
I0630 09:16:29.945195       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a8a65b142810e2a7
I0630 09:16:30.242093       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-018e2e9ea629ffb99
I0630 09:16:30.444145       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0f80e9de2e9b3603e
I0630 09:16:30.642077       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0aef774d292d3ecc3
I0630 09:16:30.843216       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0f9cc17f4802fa75c
I0630 09:16:31.145552       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0c56aab5ffcf3507b
I0630 09:16:31.342740       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0990918fe260dd3a2
I0630 09:16:39.743539       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-099feb2d81b9d67c7
I0630 09:16:40.142727       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-018e8040085c387e5
I0630 09:31:47.346911       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-05887dd5ed6958312
I0630 09:31:47.953168       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-02c0a99dbed87b5d5
I0630 09:31:48.146730       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-02ab819f1517c6865
I0630 09:31:48.350052       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-01e063f6a010fcee9
I0630 09:31:48.553408       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-06ce5b7d5b25c0df5
I0630 09:31:48.746812       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-065fa480170597585
I0630 09:31:48.949777       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0a10bbba7d0c87514
I0630 09:35:12.946723       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0cdd06b6666d6e5f0
I0630 09:37:14.951930       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-072e8c44317d481b1
I0630 09:37:25.044884       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-086986642958ed4f8
I0630 09:37:35.245076       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0657aba7d4bfac5ed
I0630 09:37:45.345607       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-00543fefe5b14ec77
I0630 09:38:35.846284       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-05793bfed4348b8ee
I0630 09:40:07.348593       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-0bc96f6bb674cd567
I0630 09:40:37.845122       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-00acbb2fbfa4633c6
I0630 09:43:50.546003       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-091bbc4bda9c943e5
I0630 09:45:01.842578       1 auto_scaling_groups.go:279] Terminating EC2 instance: i-073ed01b83fbc582d

The extent of the loop is also shown here:

$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | wc -l
389
$ kubectl logs -n kube-system cluster-autoscaler-76b5996b6-dcszm | grep Terminating | grep -vE "(i-00f4781a15e825157|i-01e14d10268725a9e)" | wc -l
34

Throughout this we had lots of pending pods (GitLab CI runner jobs), but grepping the cluster autoscaler logs for those pods returned nothing. Normally a pending pod that gets a new instance spun up for it produces a log line similar to this:

I0630 10:50:42.658282       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"runner-abc123456-project-10-concurrent-abc123456", UID:"cb43c54a-997c-4900-a922-8607761486e2", APIVersion:"v1", ResourceVersion:"13124937", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{test-cpu-2xl-rested-guinea2020062319442341660000000a 11->13 (max: 30)}]

Killing the cluster autoscaler pod and allowing its ReplicaSet to bring it back fixed the issue.

I didn't see an existing issue for this (either open or closed), so I'm not sure whether this has been fixed in a newer version of the cluster autoscaler or whether others have run into it before.

@tomelliff (Author)

I had noticed #2235, which was meant to fix a lack of compute availability when scaling up, and didn't think it was related to this specific issue (getting stuck removing already expired spot instances), but I've just noticed the following change:

https://github.com/kubernetes/autoscaler/pull/2235/files#diff-ade7b95627ea0dd6b6f4deee7f24fa7eR280-R284

	if err := m.asgCache.DeleteInstances(instances); err != nil {
		return err
	}
	klog.V(2).Infof("Some ASG instances might have been deleted, forcing ASG list refresh")
	return m.forceRefresh()

This looks like it might cover this issue as well, but I'm not 100% sure as I haven't spent much time looking yet.

Unfortunately the PR/commit doesn't explain why that change was necessary there, given the PR was seemingly meant to solve the other issue of lack of capacity in a node pool when scaling up.

@gjtempleton (Member)

Having had a look at where your repeated log line comes from, I can't see why the change you highlight would cause this behaviour. It looks to me like the call to the AWS API TerminateInstanceInAutoScalingGroup is not returning an error, so the autoscaler just carries on.
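
Roughly speaking, the call in question looks something like this (a minimal aws-sdk-go sketch, not the actual cluster-autoscaler code; the instance ID is just your example):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Ask the ASG to terminate the instance and decrement desired capacity.
	// If the API accepts the request, err is nil and we only get a scaling
	// activity back, so a caller that checks only for an error will happily
	// retry the same instance on every loop.
	out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-01e14d10268725a9e"),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("TerminateInstanceInAutoScalingGroup failed: %v", err)
	}
	fmt.Println("activity:", aws.StringValue(out.Activity.Description))
}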

How long did you see it stuck in this loop before terminating the pod?

@tomelliff (Author)

Sorry for the late reply.

AWS support finally spotted the issue in my setup. We're using termination lifecycle hooks so that we can replace the ASG and have it block on destroying the ASG while there are still pods running on it (using https://github.com/VirtusLab/kubedrainer). Because of this, the cluster autoscaler keeps calling TerminateInstanceInAutoScalingGroup over and over, and the AWS API just keeps saying "yep, gonna terminate that instance" until the lifecycle hook times out, because the thing handling the lifecycle hook runs on the instance itself, so the instance is already terminated before the lifecycle hook is ever issued by the ASG.
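
For context, the hook is an ordinary ASG termination lifecycle hook, roughly like this (a minimal aws-sdk-go sketch; the ASG name, hook name, timeout and default result are placeholders, not our real values):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// A termination lifecycle hook holds an instance in Terminating:Wait until
	// the hook is completed or the heartbeat timeout expires. In our case the
	// agent that completes the hook (kubedrainer) runs on the instance itself,
	// so once the instance is already dead the hook can only time out.
	_, err := svc.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String("example-asg"),            // placeholder
		LifecycleHookName:    aws.String("drain-before-terminate"), // placeholder
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(300),         // placeholder timeout in seconds
		DefaultResult:        aws.String("CONTINUE"), // what the ASG does on timeout
	})
	if err != nil {
		log.Fatalf("failed to create lifecycle hook: %v", err)
	}
}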

I've always felt that it's odd that the ASG will still trigger the termination lifecycle hook when the relevant instance has already been terminated and have asked about that behaviour with AWS support but suspect it isn't likely to change.

I've temporarily removed the lifecycle hooks to get past this issue for now, and will have to handle replacing ASGs outside of the hours the cluster is in use, or architect around it. I'm also considering whether the CA should check if the instance is already terminated and, if so, skip the TerminateInstanceInAutoScalingGroup call, decrement the ASG size, and move on (a rough sketch of that check is below). Although I don't think that would currently help, because the refresh would still load the instance in the ASG when it calls forceRefresh.
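
Something along these lines is what I mean (just a sketch against aws-sdk-go, not a patch; alreadyTerminating and the skip logic are hypothetical):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// alreadyTerminating reports whether the ASG already considers the instance to
// be on its way out (or no longer knows about it at all), in which case the
// terminate call could be skipped.
func alreadyTerminating(svc *autoscaling.AutoScaling, instanceID string) (bool, error) {
	out, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return false, err
	}
	if len(out.AutoScalingInstances) == 0 {
		return true, nil // no longer known to the ASG
	}
	switch aws.StringValue(out.AutoScalingInstances[0].LifecycleState) {
	case "Terminating", "Terminating:Wait", "Terminating:Proceed", "Terminated":
		return true, nil
	}
	return false, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	id := "i-01e14d10268725a9e"

	gone, err := alreadyTerminating(svc, id)
	if err != nil {
		log.Fatalf("describe failed: %v", err)
	}
	if gone {
		fmt.Printf("%s is already terminating/terminated; skipping terminate call\n", id)
		return
	}
	if _, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String(id),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		log.Fatalf("terminate call failed: %v", err)
	}
}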

Is there a pattern that others are using for termination lifecycle hooks to help replace ASGs? Or another way to replace instances in ASGs in a non-disruptive fashion?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@unixfox commented Jan 11, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2021
@unixfox commented May 11, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to the /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
