
Stop waiting for upcoming nodes from unhealthy node groups #1980

Conversation


@mvisonneau mvisonneau commented May 2, 2019

This is an attempt to solve a long-standing issue that prevents CA from failing over to healthy node groups when a scale-up has occurred on an unhealthy one. This is particularly painful when working with spot-based ASGs, which are very prone to these kinds of disruptions.

References in #1795 and #1133

If people want to try it out easily, I published a Docker release: docker.io/mvisonneau/cluster-autoscaler:1.14.2-fix_scale_up_unavail_node_groups

It seems to work perfectly fine for my use case (a minimal sketch of the idea follows the events below):

Events:
  Type     Reason            Age                    From                Message
  ----     ------            ----                   ----                -------
  Normal   TriggeredScaleUp  52m (x7 over 54m)      cluster-autoscaler  pod triggered scale-up: [{sb1-in-k8s-worker-spot-r5.2xlarge-b 0->1 (max: 20)}]
  Warning  FailedScheduling  4m10s (x195 over 60m)  default-scheduler   0/9 nodes are available: 6 Insufficient cpu, 9 Insufficient memory.
  Normal   TriggeredScaleUp  52s                    cluster-autoscaler  pod triggered scale-up: [{sb1-in-k8s-worker-spot-m5.4xlarge-b 0->1 (max: 20)}]

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 2, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: aleksandra-malinowska

If they are not already assigned, you can assign the PR to them by writing /assign @aleksandra-malinowska in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Maxime VISONNEAU <maxime.visonneau@gmail.com>
@MaciekPytel
Contributor

I don't think this solves the issue. A single failed scale-up won't make a NodeGroup unhealthy, it will likely take multiple retries before it gets there. And just because a NodeGroup is unhealthy doesn't mean a node that is already being created will fail.

There is already plenty of logic to deal with this scenario. I think the problem here is that AWS cloudprovider implementation doesn't use it. I've put some details in #1996 (comment).
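
For context, a rough sketch of what per-node-group backoff after failed scale-ups could look like; the names below (Backoff, RegisterFailure, IsBackedOff) are hypothetical and not the actual cluster-autoscaler or cloudprovider API:

    // Sketch only: keep node groups with recently failed scale-ups out of the
    // scale-up candidates until a backoff window expires.
    package main

    import (
    	"fmt"
    	"time"
    )

    type backoffEntry struct {
    	until    time.Time
    	duration time.Duration
    }

    type Backoff struct {
    	entries map[string]backoffEntry
    	initial time.Duration
    	max     time.Duration
    }

    func NewBackoff() *Backoff {
    	return &Backoff{
    		entries: map[string]backoffEntry{},
    		initial: 5 * time.Minute,
    		max:     30 * time.Minute,
    	}
    }

    // RegisterFailure grows the backoff (up to max) for a node group whose
    // scale-up failed, e.g. an ASG that could not obtain spot capacity.
    func (b *Backoff) RegisterFailure(nodeGroup string, now time.Time) {
    	d := b.initial
    	if e, ok := b.entries[nodeGroup]; ok && e.duration*2 <= b.max {
    		d = e.duration * 2
    	} else if ok {
    		d = b.max
    	}
    	b.entries[nodeGroup] = backoffEntry{until: now.Add(d), duration: d}
    }

    // IsBackedOff reports whether a node group should be skipped as a
    // scale-up candidate right now.
    func (b *Backoff) IsBackedOff(nodeGroup string, now time.Time) bool {
    	e, ok := b.entries[nodeGroup]
    	return ok && now.Before(e.until)
    }

    func main() {
    	b := NewBackoff()
    	now := time.Now()
    	b.RegisterFailure("sb1-in-k8s-worker-spot-r5.2xlarge-b", now)
    	fmt.Println(b.IsBackedOff("sb1-in-k8s-worker-spot-r5.2xlarge-b", now)) // true
    	fmt.Println(b.IsBackedOff("sb1-in-k8s-worker-spot-m5.4xlarge-b", now)) // false
    }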

@mvisonneau
Author

Indeed @MaciekPytel, I did eventually run into this issue :( I'll try to take a closer look based on your comments.

@mvisonneau mvisonneau closed this May 8, 2019
@Jeffwan
Contributor

Jeffwan commented May 10, 2019

There is already plenty of logic to deal with this scenario. I think the problem here is that AWS cloudprovider implementation doesn't use it. I've put some details in #1996 (comment).

Hi @MaciekPytel, is there anywhere I can look at the logic you mentioned? I'd love to add it on the AWS cloudprovider side.

@MaciekPytel
Contributor

The implementation is already in progress in #2008 - let's continue discussion there.
