WIP: cluster autoscaler integration with machine API #2653

Closed
295 changes: 295 additions & 0 deletions keps/sig-autoscaling/0000-kep-template.md
---
kep-number: 0
title: My First KEP
authors:
- "@janedoe"
owning-sig: sig-xxx
participating-sigs:
- sig-aaa
- sig-bbb
reviewers:
- TBD
- "@alicedoe"
approvers:
- TBD
- "@oscardoe"
editor: TBD
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional
see-also:
- KEP-1
- KEP-2
replaces:
- KEP-3
superseded-by:
- KEP-100
---

# Integrating cluster autoscaler (CA) with cluster-api

The *filename* for the KEP should include the KEP number along with the title.
The title should be lowercased and spaces/punctuation should be replaced with `-`.
As the KEP is approved and an official KEP number is allocated, the file should be renamed.

To get started with this template:
1. **Pick a hosting SIG.**
Make sure that the problem space is something the SIG is interested in taking up.
KEPs should not be checked in without a sponsoring SIG.
1. **Allocate a KEP number.**
Do this by (a) taking the next number in the `NEXT_KEP_NUMBER` file and (b) incrementing that number.
Include the updated `NEXT_KEP_NUMBER` file in your PR.
1. **Make a copy of this template.**
Name it `NNNN-YYYYMMDD-my-title.md` where `NNNN` is the KEP number that was allocated.
1. **Fill out the "overview" sections.**
This includes the Summary and Motivation sections.
These should be easy if you've preflighted the idea of the KEP with the appropriate SIG.
1. **Create a PR.**
Assign it to folks in the SIG that are sponsoring this process.
1. **Merge early.**
Avoid getting hung up on specific details and instead aim to get the goal of the KEP merged quickly.
The best way to do this is to just start with the "Overview" sections and fill out details incrementally in follow-on PRs.
View anything marked as `provisional` as a working document that is subject to change.
Aim for single topic PRs to keep discussions focused.
If you disagree with what is already in a document, open a new PR with suggested changes.

The canonical place for the latest set of instructions (and the likely source of this file) is [here](/keps/0000-kep-template.md).

The `Metadata` section above is intended to support the creation of tooling around the KEP process.
This will be a YAML section that is fenced as a code block.
See the KEP process for details on each of these items.

## Table of Contents

TBD

## Summary

TBD

## Motivation

- delegate the responsibility of managing cloud providers out of the cluster autoscaler
- in general, every tool carries its own implementation of a cloud provider layer to communicate with the cloud provider API (duplicated logic), which is exactly what the cluster-api project aims to consolidate



This section is for explicitly listing the motivation, goals and non-goals of this KEP.
Describe why the change is important and the benefits to users.
The motivation section can optionally provide links to [experience reports][] to demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports

### Goals

* cluster autoscaler is capable of scaling the cluster through the machine API

### Non-Goals

* cluster autoscaler is able to autoprovision new node groups through the machine API
* cluster autoscaler is able to estimate node resource requirements through the machine API
> **Review comment (Contributor):** Estimating node resources is part of the CA implementation, not a user-visible feature. The user-facing result of this point is that scale-to/from-0 won't be supported.

> **Review comment (Member):** Scaling the MachineSet/MachineDeployment to/from 0 could be a really interesting feature. AFAIK the autoscaler currently relies on Node objects to estimate node capacity. We are introducing the concept of MachineClass in cluster-api, where the capacity-related details (e.g. capacity, allocatable) could be stored in a separate CRD called MachineClass, and a corresponding MachineDeployment/MachineSet does not necessarily need to exist. Essentially, the MachineSet/MachineDeployment would only hold a reference to the MachineClass.

> **Review comment (Contributor):** I think having MachineClass should be enough to support scale-to/from-0 easily, provided you can construct a Node object easily from it. It doesn't have to be perfect, but it needs to provide the information used by scheduling predicates, such as labels, taints, local volumes, etc.

> **Review comment (Contributor, author):**
>
> > Estimating node resources is part of the CA implementation, not a user-visible feature. The user-facing result of this point is that scale-to/from-0 won't be supported.
>
> The estimation itself is; the way of providing the information used to generate the template is not. I wanted to point out that the machine API will provide the information, and the CA will use a node template populated with that information to estimate requirements.
>
> I agree this is information no user should be aware of. However, one cannot get cloud-provider-specific information unless one knows how to interpret the provider config. The same holds for machine classes, unless we generalize the concepts of machine types, security groups, ... and define a reasonable mapping from the generalized concepts onto cloud-specific resources.

* cluster autoscaler is able to use at least one pricing model on top of the machine API
> **Review comment (Member):** If I understand correctly, the pricing model in the autoscaler is currently config-rule based; if this information could also be embedded into MachineClasses, we could achieve price-based autoscaling as well.

> **Review comment:** /cc @venezia


## Proposal

Generalize the concept of node group autoscaling at the level of the cloud provider
by integrating with the machine API layer of the [sigs.k8s.io/cluster-api](https://github.com/kubernetes-sigs/cluster-api) project to
build a cloud-provider-free implementation of the cluster autoscaling mechanism
([#1050](https://github.com/kubernetes/autoscaler/issues/1050)).

This KEP also suggests how to:
* perform node autoprovisioning at the level of the machine API ([node autoprovisioning
  upstream proposal](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/node_autoprovisioning.md)).
* perform node resource estimation for cluster autoscaling at the level of the machine API.
* build various pricing models on top of the machine API.

Cluster autoscaling builds on the concept of node groups, which effectively
translates to cloud-provider-specific autoscaling groups or scale sets.
In the case of the machine API, a node group translates into a machine set
(living as a native Kubernetes object).
> **Review comment (Member):** Shouldn't this be MachineDeployment? Otherwise MachineDeployments cannot be used in conjunction with CA.


### Node group (scaling group)

The autoscaler currently supports the `AWS`, `Azure`, `GCE` and `GKE` providers,
plus `kubemark`, another cloud provider implementation that allows creating a
hollow Kubernetes cluster with a huge number of nodes on top of a small set of
real machines (for scale-testing purposes).
Each of the currently supported cloud providers has its own concept of scaling groups:
- AWS has autoscaling groups
- Azure has virtual machine scale sets
- GCE/GKE have managed instance groups

> **Review comment:** Cluster API currently doesn't support GKE, as I understand it. The integration of the autoscaler with Cluster API should be optional or opt-in so that the autoscaler continues to be usable on GKE.

> **Review comment (Contributor):** My understanding is that the implementation in Cluster Autoscaler would be done by adding a new "Cluster API cloudprovider". Existing cloudprovider integrations won't be impacted.


An autoscaling group allows the number of instances within the cloud provider to be
increased or decreased flexibly and natively based on the actual cluster workload.

TODO(jchaloup): describe advantages of auto scaling groups, why they are beneficial, etc.

### Machine API

The machine API (part of the [cluster-api project](https://github.com/kubernetes-sigs/cluster-api/))
allows a machine to be specified declaratively, without saying how to provision the underlying instance.
Machines can be grouped into a machine set, which allows a machine template to be specified
and, by setting a number of replicas, creates the required number of machines (followed
by provisioning of the underlying instances).
With a bit of glue, one can represent an autoscaling group with a machine set.
However, the current implementation of the machineset controller needs to be improved
to support autoscaling use cases, e.g. by defining a delete strategy to pick the appropriate machine
to delete from the machine set ([#75](https://github.com/kubernetes-sigs/cluster-api/issues/75)).

> **Review comment:** It's crucial to the Cluster Autoscaler logic that it is able to name a specific machine to be deleted in case of scale-down. A delete strategy on the machine set side alone won't work for this use case.

> **Review comment (Contributor):** I think we should have a write-up on why this is the case, since it doesn't seem to be immediately obvious.
>
> More specifically, it seems at first glance that the machineset controller would want scale-down criteria similar to the cluster autoscaler's; a scale-down that doesn't take into account disruption budgets, node load, node readiness, particular pod types, etc. doesn't seem like it would be particularly useful.
>
> The difference that jumps to mind initially is the scheduling simulation -- is the issue that the optimal node to delete in terms of load + disruption budget + pod types + etc. might have pods that don't fit elsewhere?

> **Review comment (Contributor):** I'm not sure the machine set should have all this logic. If the user resizes it, some node must be removed, even if doing so violates a PDB, kills a pod that won't be re-created, etc. There may be no right decision, and any heuristic would be arbitrary.
>
> Given the ambiguity, I think it's more correct to put the responsibility for choosing the node on whoever makes the decision to remove it (CA, user, other controller). If no explicit decision is made, the machine set should use some very simple and predictable strategy (remove the newest node, or the oldest node, or something like that). Ideally the API should allow reducing the target size of a machine deployment and removing a specific machine. Some new subresource, maybe?
>
> BTW, it feels like stateful set has the same problem. Without detailed knowledge about the workload it's not possible to make an informed decision about which replica to remove, so a user really should have the option to choose. It feels like Cluster API should do the same thing stateful set does (or the other way round, if stateful set hasn't solved it yet).

> **Review comment (Contributor):** Also, that's just how CA works. It picks a node to delete, not a node group to resize. Unless we guarantee the internal logic is 100% the same between the CA and the machine controller, they will conflict.

> **Review comment (Contributor):**
>
> > Given the ambiguity I think it's more correct to put the responsibility for choosing the node on whoever makes the decision to remove it (CA, user, other controller).
>
> That may be an acceptable answer, but I think we should explicitly say that in the proposal, and enumerate why a bit. I've seen a lot of "it has to work this way because the cluster autoscaler currently works this way" written down, and I think we need to write down more justification beyond that.
>
> > It feels like Cluster API should do the same thing stateful set does (or the other way round, if stateful set hasn't solved it yet).
>
> Yeah, I wonder if the workloads folks have solved that yet. @kubernetes/sig-apps-misc
>
> > Also that's just how CA works.
>
> If we're designing for future plans, maybe we should re-evaluate that. Maybe the CA should be closer to the way the HPA works? (i.e. "have we re-evaluated our choices in light of the cluster API being a thing")

> **Review comment:**
>
> > Maybe the CA should be closer to the way the HPA works?
>
> Various groups of VMs (ASGs on AWS, MIGs on GCP, etc.) already work similarly to the HPA (i.e., the number of replicas depends on a metric's value). So at least in those environments, users can already have this without running a separate component. The problem is that autoscaling Kubernetes nodes this way doesn't work well in most cases. In fact, it causes so much trouble that we've had to put explicit warnings in the UI against enabling the MIG autoscaler for VMs running GKE nodes.
>
> It all boils down to "what does it mean that we need a bigger/smaller cluster". If a metric we use as input is above target but there are no pending pods, adding a new machine will do nothing except unnecessarily increase the cost of running the cluster, because nothing will be scheduled there (although it may indeed bring some metrics down, e.g. average utilization). This is even worse for scale-down: not all nodes run the same workloads, and removing nodes without checking whether the pods running there can be moved elsewhere may result in some of those pods no longer running (possibly taking down entire applications).
>
> It can be argued we just haven't figured out the right metrics for it yet, but even if we could use something like "number of pending pods" as input to scale-up, I've yet to see anything that would work for scale-down. And if we find one, perhaps it can be used with the already existing VM autoscalers?


`Important`: a future implementation of the autoscaling concept through machine sets will not
replace the current cloud-specific implementations. The goal is to lay down a solution
that covers the most common use cases currently implemented by the autoscaler.
It's important to draw a line in the sand so that only the necessary portion of the cloud
autoscaling logic is re-implemented instead of re-inventing the entire wheel.

### Cloud provider interface
The cluster autoscaler defines two interfaces (`NodeGroup` and `CloudProvider`
from the `k8s.io/autoscaler/cluster-autoscaler/cloudprovider` package)
that a cloud provider needs to implement in order to allow the cluster
autoscaler to properly perform scaling operations.

The `CloudProvider` interface allows the autoscaler to:
- operate on top of node groups
- work with pricing models (to estimate reasonable expenses)
- check resource limits (whether the maximum resources per node group are exceeded)

The `NodeGroup` interface allows operating on the node groups themselves:
- change the size of a node group
- list nodes belonging to a node group
- autoprovision a new node group
- generate a node template for a node group (to determine the resource capacity of
  a new node in case the node group has zero size)

In the case of `AWS`, a `NodeGroup` then corresponds to an autoscaling group;
in the case of `Azure`, to a virtual machine scale set; etc. A sketch of the two interfaces is shown below.
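
The following abbreviated Go sketch shows the shape of these two interfaces. The method names follow the upstream `k8s.io/autoscaler/cluster-autoscaler/cloudprovider` package, but the set of methods and some return types are simplified here for illustration; consult the package for the authoritative definitions.

```go
// Abbreviated, self-contained sketch of the two interfaces described above.
// The real definitions live in the k8s.io/autoscaler/cluster-autoscaler/cloudprovider
// package and differ in detail (extra methods, richer return types).
package cloudprovidersketch

import apiv1 "k8s.io/api/core/v1"

// PricingModel and ResourceLimiter are left as opaque placeholders here.
type PricingModel interface{}
type ResourceLimiter struct{}

// CloudProvider is the autoscaler's entry point into a cloud: it enumerates
// node groups, maps nodes to their groups, and exposes pricing and
// resource-limit information.
type CloudProvider interface {
	Name() string
	NodeGroups() []NodeGroup
	NodeGroupForNode(node *apiv1.Node) (NodeGroup, error)
	Pricing() (PricingModel, error)                // cost-aware scaling decisions
	GetResourceLimiter() (*ResourceLimiter, error) // per-cluster resource caps
}

// NodeGroup represents a single scaling group (ASG, scale set, machine set, ...).
type NodeGroup interface {
	Id() string
	MinSize() int
	MaxSize() int
	TargetSize() (int, error)
	IncreaseSize(delta int) error
	DeleteNodes(nodes []*apiv1.Node) error
	Nodes() ([]string, error)
	// TemplateNodeInfo returns a synthetic node used when the group has zero
	// nodes (scale-from-0); simplified here to return a bare Node.
	TemplateNodeInfo() (*apiv1.Node, error)
	// Autoprovisioning hooks; effectively unimplemented today, see the
	// "Code status" section below.
	Exist() bool
	Create() error
	Delete() error
}
```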

### Node template

In case a node group has no nodes, the cluster autoscaler needs a way to generate
a node template for that node group before it can predict how many new resources
will get allocated and thus how many new nodes are needed to scale up.

Each of the mentioned cloud providers has its own implementation for constructing
the node template, with its own list of information pulled from the cloud provider API
(a sketch follows the list below):

* Azure:
  - instance type to determine resource capacity (CPU, GPU, memory)
  - instance tags to construct node labels
  - generic node labels: arch, OS, instance type, region, failure domain, hostname
* AWS:
  - instance type to determine resource capacity (CPU, GPU, memory)
  - instance tags to construct node labels and taints
  - generic node labels: arch, OS, instance type, region, failure domain, hostname
* GCE:
  - instance type to determine resource capacity (CPU, memory)
  - instance tags to construct node labels (e.g. gpu.ResourceNvidiaGPU)
  - maximum number of pods: apiv1.ResourcePods
  - taints (from instance metadata)
  - kubelet system and reserved allocatable
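
As an illustration of what these provider-specific implementations produce, here is a minimal, hypothetical sketch of assembling a template node from generic instance information. The `instanceInfo` struct and the `buildTemplateNode` helper are assumptions made for this example; real implementations pull the equivalent values from the cloud provider API as listed above.

```go
package cloudprovidersketch

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// instanceInfo is a hypothetical, provider-agnostic summary of an instance
// type; real implementations obtain these values from the cloud provider API.
type instanceInfo struct {
	InstanceType string
	Region       string
	Zone         string
	CPU          resource.Quantity
	Memory       resource.Quantity
	MaxPods      int64
	ExtraLabels  map[string]string
	Taints       []apiv1.Taint
}

// buildTemplateNode assembles a v1.Node the autoscaler can feed into its
// scheduling simulation when the node group currently has no nodes.
func buildTemplateNode(groupID string, info instanceInfo) *apiv1.Node {
	labels := map[string]string{
		"kubernetes.io/arch":               "amd64",
		"kubernetes.io/os":                 "linux",
		"node.kubernetes.io/instance-type": info.InstanceType,
		"topology.kubernetes.io/region":    info.Region,
		"topology.kubernetes.io/zone":      info.Zone,
	}
	for k, v := range info.ExtraLabels {
		labels[k] = v
	}

	capacity := apiv1.ResourceList{
		apiv1.ResourceCPU:    info.CPU,
		apiv1.ResourceMemory: info.Memory,
		apiv1.ResourcePods:   *resource.NewQuantity(info.MaxPods, resource.DecimalSI),
	}

	return &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   groupID + "-template",
			Labels: labels,
		},
		Spec: apiv1.NodeSpec{Taints: info.Taints},
		Status: apiv1.NodeStatus{
			Capacity: capacity,
			// A fuller implementation would subtract the kubelet
			// system/kube-reserved resources to compute Allocatable.
			Allocatable: capacity,
		},
	}
}
```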

## Code status

At this point the cluster autoscaler does not support node autoprovisioning (TODO(jchaloup): link to the proposal).
Every implementation assumes every node group already exists in the cloud provider.
Both the `Create` and `Delete` operations return the `ErrAlreadyExist` error, and autoprovisioning is off.

The autoscaler allows node groups either to be enumerated explicitly (via the `--nodes` option)
or to be discovered automatically on the fly (via `--node-group-auto-discovery`).
> **Review comment (Contributor):** `--node-group-auto-discovery` is provider-specific; all related logic is implemented by the cloudprovider. It's optional for a cloudprovider to implement it (obviously the flag doesn't work for providers that don't implement it).


### Integration with machine API

Given that the scale-up/down operation corresponds to increasing/decreasing the number
of nodes in a node group, it is sufficient to "just" change the number of replicas
of the corresponding machine set (see the sketch after the comments below).
> **Review comment (Contributor):** That is only true for scale-up. Scale-down requires the ability to delete a specific node, not just to reduce the number of replicas.

> **Review comment (Member):** Yes, enhancements to the machine set are being discussed in the cluster-api community which would allow a machine set to delete a specific machine while scaling down.
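
To make the mapping concrete, below is a minimal Go sketch of how the node-group operations could be implemented on top of a machine set, as proposed above: scale-up bumps the replica count, while scale-down first marks the specific machines chosen by the autoscaler and then reduces the replicas. The `machineSetClient` interface and the machine-deletion hook are assumptions made for this sketch (the latter corresponds to the not-yet-existing capability discussed in the comments above), not existing cluster-api API.

```go
package cloudprovidersketch

import (
	"context"
	"fmt"
)

// machineSetClient is a hypothetical, minimal client abstraction over the
// cluster-api MachineSet resource; a real implementation would use the
// generated cluster-api clientset or a dynamic client.
type machineSetClient interface {
	GetReplicas(ctx context.Context, namespace, name string) (int32, error)
	SetReplicas(ctx context.Context, namespace, name string, replicas int32) error
	// MarkMachineForDeletion would flag a specific machine so the machine set
	// controller removes it first when replicas are reduced. This capability
	// does not exist yet; it corresponds to cluster-api issue #75 above.
	MarkMachineForDeletion(ctx context.Context, namespace, machineName string) error
}

// machineSetNodeGroup shows how node-group operations could map onto
// MachineSet replica changes.
type machineSetNodeGroup struct {
	client    machineSetClient
	namespace string
	name      string
	minSize   int
	maxSize   int
}

// IncreaseSize implements scale-up: bump the replica count.
func (g *machineSetNodeGroup) IncreaseSize(ctx context.Context, delta int) error {
	current, err := g.client.GetReplicas(ctx, g.namespace, g.name)
	if err != nil {
		return err
	}
	target := int(current) + delta
	if target > g.maxSize {
		return fmt.Errorf("size increase too large: %d > max %d", target, g.maxSize)
	}
	return g.client.SetReplicas(ctx, g.namespace, g.name, int32(target))
}

// DeleteMachines implements scale-down: mark the chosen machines, then reduce
// the replica count so the controller removes exactly those machines.
func (g *machineSetNodeGroup) DeleteMachines(ctx context.Context, machineNames []string) error {
	current, err := g.client.GetReplicas(ctx, g.namespace, g.name)
	if err != nil {
		return err
	}
	target := int(current) - len(machineNames)
	if target < g.minSize {
		return fmt.Errorf("size decrease too large: %d < min %d", target, g.minSize)
	}
	for _, m := range machineNames {
		if err := g.client.MarkMachineForDeletion(ctx, g.namespace, m); err != nil {
			return err
		}
	}
	return g.client.SetReplicas(ctx, g.namespace, g.name, int32(target))
}
```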


#### Node template

The `machine` object holds all cloud-provider-specific information under the
`spec.providerConfig` field. Given that only the actuator can interpret the provider
configuration, and that there can be multiple actuator implementations for the same cloud provider,
all information necessary to render the template must be available outside of the machine's
provider config. At the same time, the `machineset` object works over a set of
`machines` and has no way of communicating with the machine controller.

The node template corresponding to a node group has to be rendered either
on the cluster autoscaler side or inside the machine controller.
In the former case, the machine (outside of the provider config) needs to carry
all the information needed to render the template in a generic form (e.g. through labels),
or the autoscaler has to import the actuator code that can interpret the provider config. In the latter,
the machine controller has to store the rendered node template (free
of cloud provider specifics) in the machine object.

* **Labels**: The label-based approach may lead to data duplication, as one needs to provide the CPU, GPU
  and memory requirements (based on the instance type) to specify the node's allocatable
  resources, plus other information such as region and availability zone (which may
  be represented differently across cloud providers).
  Implemented through labels, the `machine` spec can look like the following
  (a sketch of parsing such a label follows this list):

  ```yaml
  ---
  apiVersion: cluster.k8s.io/v1alpha1
  kind: Machine
  metadata:
    name: node-group-machine
    namespace: application
    labels:
      sigs.k8s.io/cluster-autoscaler-resource-capacity: "cpu:200m,gpu:1,memory:4GB"
      sigs.k8s.io/cluster-autoscaler-region: us-west-1
  ...
  ```

* **Actuator code imports**: Importing the actuator code and pulling all
  the information from `spec.providerConfig` forces the cluster autoscaler
  to choose a single actuator implementation for each cloud provider.
  That can be too restrictive.

* **Rendering inside the machine controller**: From the "I only know what I need to know"
  point of view, this approach encapsulates all the knowledge about the provider
  configuration within the actuator itself. The actuator knows best how to
  properly construct the node template. The node template is then stored
  in the machine object, so any client querying the machine can read
  and consume a template free of any cloud provider specifics. Given that the node template
  can reflect the current state of the underlying machine's instance
  (e.g. instance tags), the node template can be periodically rendered
  and published in the machine's status.
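
As a concrete illustration of the label option, the sketch below parses the capacity label from the example manifest into a Kubernetes resource list. The label key and its `key:value,...` encoding are taken from the example above and are not an agreed-upon convention; note also that real Kubernetes label values cannot contain `:` or `,`, so in practice such information would more likely live in an annotation.

```go
package cloudprovidersketch

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// parseCapacityLabel turns a value such as "cpu:200m,gpu:1,memory:4GB"
// (the encoding used in the example manifest above) into a ResourceList.
// Mapping the "gpu" shorthand onto the NVIDIA device-plugin resource name
// is purely illustrative.
func parseCapacityLabel(value string) (apiv1.ResourceList, error) {
	capacity := apiv1.ResourceList{}
	for _, pair := range strings.Split(value, ",") {
		kv := strings.SplitN(pair, ":", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed capacity entry %q", pair)
		}
		// Kubernetes quantities use suffixes such as "200m" or "4Gi"; the
		// plain "GB" suffix from the example above is normalized here.
		qty, err := resource.ParseQuantity(strings.TrimSuffix(kv[1], "B"))
		if err != nil {
			return nil, fmt.Errorf("invalid quantity %q for %q: %v", kv[1], kv[0], err)
		}
		switch kv[0] {
		case "cpu":
			capacity[apiv1.ResourceCPU] = qty
		case "memory":
			capacity[apiv1.ResourceMemory] = qty
		case "gpu":
			capacity["nvidia.com/gpu"] = qty
		default:
			capacity[apiv1.ResourceName(kv[0])] = qty
		}
	}
	return capacity, nil
}
```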

#### Scaling from 0

In all cases, if it can be assumed that all machines/nodes in the same node group have the same
system requirements, one can use the generic kubelet configuration discovery to get
the kubelet's reserved and system requirements from the first machine in a given
node group (provided there is at least one node in the group).
The same holds for node taints. A sketch of this fallback is shown after the comment below.
> **Review comment (Contributor):** Cluster Autoscaler is based on the assumption that every node in a node group is exactly identical (for example, multi-zonal ASGs are not supported). If at least a single node exists, CA won't try to parse the machine object; it will just make an in-memory copy of an existing node and use that to represent a new node.
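
For node groups that already have at least one node, the fallback described above (and in the comment) amounts to copying an existing node of the group and using the copy as the template. A minimal sketch of that, assuming all nodes in the group are identical:

```go
package cloudprovidersketch

import apiv1 "k8s.io/api/core/v1"

// templateFromExistingNode builds an in-memory template by copying one of the
// group's existing nodes. Per-instance runtime state is cleared so the copy
// represents a fresh node; capacity, allocatable, labels and taints are what
// the scheduling simulation actually needs.
func templateFromExistingNode(existing *apiv1.Node, groupID string) *apiv1.Node {
	template := existing.DeepCopy()
	template.ObjectMeta.Name = groupID + "-template"
	template.ObjectMeta.UID = ""
	template.ObjectMeta.ResourceVersion = ""
	template.Spec.ProviderID = ""
	template.Status.Addresses = nil
	template.Status.Conditions = nil
	return template
}
```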



### User Stories [optional]

TBD

### Implementation Details/Notes/Constraints [optional]

TBD

### Risks and Mitigations

TBD

## Graduation Criteria

TBD

## Implementation History

TBD

## Drawbacks [optional]

TBD

## Alternatives [optional]

TBD

## Infrastructure Needed [optional]

TBD