AWS Ondemand not scaled up if Spot requests remain "Open" #1795

Closed
sc250024 opened this issue Mar 14, 2019 · 13 comments

Labels
area/provider/aws Issues or PRs related to aws provider
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

sc250024 (Contributor) commented Mar 14, 2019

Greetings,

I'm running cluster-autoscaler v1.3.8 in a Kubernetes v1.11.8 cluster in AWS, created using Kops. I'm not sure whether this falls under a feature request, a bug, or a misconfiguration on my part, so apologies in advance if it's a misconfiguration. Basically, for our pre-production cluster, I want to run spot instances as much as possible. I have the following node groups (and thus ASGs); this is what they look like when the cluster is running normally (a sketch of one of the instance groups follows the table):

ASG Name                   Instances  Desired  Min  Max
nodes-ondemand-eu-west-1a  0          0        0    10
nodes-ondemand-eu-west-1b  0          0        0    10
nodes-ondemand-eu-west-1c  0          0        0    10
nodes-spot-eu-west-1a      2          2        2    10
nodes-spot-eu-west-1b      2          2        2    10
nodes-spot-eu-west-1c      2          2        2    10
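
For context, each of these node groups is a kops InstanceGroup. A rough sketch of one of the spot groups is below; the machine type and max price are made up for illustration, and only the sizes and subnet match the table above:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-k8s-cluster.com
  name: nodes-spot-eu-west-1a
spec:
  machineType: m5.large   # illustrative only
  maxPrice: "0.10"        # illustrative only; setting maxPrice is what makes the group launch spot instances
  minSize: 2
  maxSize: 10
  subnets:
  - eu-west-1a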

I also have k8s-spot-rescheduler running to take pods from any ondemand nodes that get provisioned and move them to spot instances, so that the ondemand nodes can be removed. However, lately there always seem to be spot requests that are open but not yet fulfilled:

[Screenshot, "Screen Shot 2019-03-14 at 14 55 19": EC2 console showing spot instance requests in the "open" state]

This is normal behavior in AWS, but the problem is that the cluster autoscaler does not use the ondemand node groups at all. This is the log output I see during this situation:

cluster-autoscaler aws_manager.go:148] Refreshed ASG list, next refresh after 2019-03-14 13:59:04.235594224 +0000 UTC m=+241976.799746087
cluster-autoscaler clusterstate.go:542] Readiness for node group nodes-spot-eu-west-1a.my-k8s-cluster.com not found
cluster-autoscaler utils.go:541] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
cluster-autoscaler static_autoscaler.go:274] No unschedulable pods
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler utils.go:498] Skipping ip-10-YY-XXX-ZZZ.eu-west-1.compute.internal - node group min size reached
cluster-autoscaler scale_down.go:643] No candidates for scale down
cluster-autoscaler clusterstate.go:324] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com
cluster-autoscaler clusterstate.go:380] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com
cluster-autoscaler clusterstate.go:324] Failed to find readiness information for nodes-spot-eu-west-1a.my-k8s-cluster.com

The issue now is that pods remain unscheduled because there's no capacity, but the cluster-autoscaler treats the "open" spot requests as if they had already taken care of the scaling, so it never falls back to the ondemand node groups.
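
For anyone trying to reproduce this, here is roughly how I check the mismatch from the CLI (a sketch rather than my exact commands; adjust the region to your setup):

# Pods that stay Pending because no node has capacity
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Spot requests that are still "open", i.e. requested but not yet fulfilled
aws ec2 describe-spot-instance-requests \
  --region eu-west-1 \
  --filters Name=state,Values=open \
  --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}' \
  --output table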

Any ideas on how to achieve both (prefer spot, but fall back to ondemand when spot requests stay open)? I'm using the following configuration flags; a note on the ASG discovery tags follows the list:

- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/my-k8s-cluster.com
- --balance-similar-node-groups
- --expander=random
- --logtostderr=true
- --scale-down-delay-after-add=5m
- --scale-down-delay-after-delete=5m
- --scale-down-delay-after-failure=5m
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --stderrthreshold=info
- --v=3
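
On the discovery side, the ondemand ASGs are tagged for auto-discovery as well. A sketch of how such a tag pair can be applied to one of them (ASG name taken from the table above; the Value fields are placeholders, since as far as I understand auto-discovery in the asg:tag= form only matches on the tag keys):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=nodes-ondemand-eu-west-1a,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=nodes-ondemand-eu-west-1a,ResourceType=auto-scaling-group,Key=kubernetes.io/cluster/my-k8s-cluster.com,Value=owned,PropagateAtLaunch=true"
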
jomar83 commented Mar 14, 2019

Second that.

We're facing the same issue, but we're running spot instances only: the same taints across multiple different instance groups in 3 AZs. We expected failover to the next ASG if a spot request couldn't be fulfilled.

Jeffwan (Contributor) commented Mar 15, 2019

AWS spot instances are not supported in CA, and I have not tested multiple node groups with onDemand and spot together, so the behavior is unpredictable. I can help check.

@814HiManny

#1133 describes this problem

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2019
@okgolove

It seems my issue is closely related to this one:
#2165

@jaypipes (Contributor)

/label aws

@jaypipes (Contributor)

I believe the underlying issue in this bug has been addressed by #2235, which is now merged. Can we close this issue out?

@jaypipes (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2019
@jaypipes (Contributor)

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Aug 20, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2019
Jeffwan (Contributor) commented Nov 18, 2019

As Jay mentioned, #2235 addressed this issue. Please pick a release that includes this improvement (all releases cut after September, starting from 1.12.x, include this change).
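
If it helps, a quick way to check which image a cluster is currently running (assuming the deployment is named cluster-autoscaler and lives in kube-system, as in the flags above):

kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'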

Jeffwan (Contributor) commented Nov 18, 2019

/close

@k8s-ci-robot (Contributor)

@Jeffwan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
