AWS Ondemand not scaled up if Spot requests remain "Open" #1795

Closed
sc250024 opened this issue Mar 14, 2019 · 13 comments

Labels
area/provider/aws Issues or PRs related to aws provider
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

sc250024 (Contributor) commented Mar 14, 2019

Greetings,

I'm running cluster-autoscaler v1.3.8 in a Kubernetes v1.11.8 cluster in AWS, created using Kops. I'm not sure whether this falls under a feature request, a bug, or a misconfiguration on my part, so apologies in advance if it's a misconfiguration. Basically, for our pre-production cluster, I want to run spot instances as much as possible. I have the following node groups (and thus ASGs); this is what they look like when the cluster is running normally (a sketch of one of the instance groups follows the table):

ASG Name                   Instances  Desired  Min  Max
nodes-ondemand-eu-west-1a  0          0        0    10
nodes-ondemand-eu-west-1b  0          0        0    10
nodes-ondemand-eu-west-1c  0          0        0    10
nodes-spot-eu-west-1a      2          2        2    10
nodes-spot-eu-west-1b      2          2        2    10
nodes-spot-eu-west-1c      2          2        2    10
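
For context, each of these node groups is a kops InstanceGroup. A rough sketch of one of the spot groups is below; the machine type and max price are made up for illustration, and only the sizes and subnet match the table above:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-k8s-cluster.com
  name: nodes-spot-eu-west-1a
spec:
  machineType: m5.large   # illustrative only
  maxPrice: "0.10"        # illustrative only; setting maxPrice is what makes the group launch spot instances
  minSize: 2
  maxSize: 10
  subnets:
  - eu-west-1a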

I also have k8s-spot-rescheduler running to take pods from any ondemand nodes that get provisioned and move them to spot instances, so that the ondemand nodes can be removed. However, lately there always seem to be spot requests that are open but not yet fulfilled:

[Screenshot, "Screen Shot 2019-03-14 at 14 55 19": EC2 console showing spot instance requests in the "open" state]

This is normal behavior in AWS, but the problem is that the cluster autoscaler does not use the ondemand node groups at all. This is the log output I see during this situation:

cluster-autoscaler aws_manager.go:148] Refreshed ASG list, next refresh after 2019-03-14 13:59:04.235594224 +0000 UTC m=+241976.799746087
cluster-autoscaler clusterstate.go:542] Readiness for node group nodes-spot-eu-west-1a.my-k8s-cluster.com not found
cluster-autoscaler utils.go:541] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
cluster-autoscaler static_autoscaler.go:274] No unschedulable pods
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler scale_down.go:643] No candidates for scale down
cluster-autoscaler clusterstate.go:324] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com
cluster-autoscaler clusterstate.go:380] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com
cluster-autoscaler clusterstate.go:324] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com

The issue now is that pods remain unscheduled because there's no capacity, but the cluster-autoscaler treats the "open" spot requests as if they had already taken care of the scaling, so it never falls back to the ondemand node groups.
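
For anyone trying to reproduce this, here is roughly how I check the mismatch from the CLI (a sketch rather than my exact commands; adjust the region to your setup):

# Pods that stay Pending because no node has capacity
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Spot requests that are still "open", i.e. requested but not yet fulfilled
aws ec2 describe-spot-instance-requests \
  --region eu-west-1 \
  --filters Name=state,Values=open \
  --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}' \
  --output table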

Any ideas on how to achieve both (prefer spot, but fall back to ondemand when spot requests stay open)? I'm using the following configuration flags; a note on the ASG discovery tags follows the list:

- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/my-k8s-cluster.com
- --balance-similar-node-groups
- --expander=random
- --logtostderr=true
- --scale-down-delay-after-add=5m
- --scale-down-delay-after-delete=5m
- --scale-down-delay-after-failure=5m
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --stderrthreshold=info
- --v=3
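
On the discovery side, the ondemand ASGs are tagged for auto-discovery as well. A sketch of how such a tag pair can be applied to one of them (ASG name taken from the table above; the Value fields are placeholders, since as far as I understand auto-discovery in the asg:tag= form only matches on the tag keys):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=nodes-ondemand-eu-west-1a,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=nodes-ondemand-eu-west-1a,ResourceType=auto-scaling-group,Key=kubernetes.io/cluster/my-k8s-cluster.com,Value=owned,PropagateAtLaunch=true"
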
jomar83 commented Mar 14, 2019

Second that.

We're facing the same issue, but we're running spot instances only: the same taints across multiple different instance groups in 3 AZs. We expected failover to the next ASG if a spot request couldn't be fulfilled.

Jeffwan (Contributor) commented Mar 15, 2019

AWS spot instances are not supported in CA, and I have not tested multiple node groups with onDemand and spot together, so the behavior is unpredictable. I can help check.

@814HiManny

#1133 describes this problem

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2019
@okgolove

It seems my issue is closely related to this one:
#2165

@jaypipes (Contributor)

/label aws

@jaypipes (Contributor)

I believe the underlying issue in this bug has been addressed by #2235, which is now merged. Can we close this issue out?

@jaypipes (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2019
@jaypipes (Contributor)

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Aug 20, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2019
Jeffwan (Contributor) commented Nov 18, 2019

As Jay mentioned, #2235 addressed this issue. Please pick a release that includes this improvement (all releases cut after September, starting from 1.12.x, include this change).
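
If it helps, a quick way to check which image a cluster is currently running (assuming the deployment is named cluster-autoscaler and lives in kube-system, as in the flags above):

kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'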

Jeffwan (Contributor) commented Nov 18, 2019

/close

@k8s-ci-robot (Contributor)

@Jeffwan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
