Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KEP for default Even Pods Spreading #1260

Merged
302 changes: 302 additions & 0 deletions keps/sig-scheduling/20190926-default-even-pod-spreading.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,302 @@
---
title: Default Even Pod Spreading
authors:
- "@alculquicondor"
owning-sig: sig-scheduling
reviewers:
- "@ahg-g"
- "@Huang-Wei"
approvers:
- "@ahg-g"
- "@k82cn"
creation-date: 2019-09-26
last-updated: 2010-09-26
status: provisional
see-also:
- "/keps/sig-aaa/20190221-even-pods-spreading.md"
---

# Default Even Pod Spreading

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Relationship with SelectorSpreadingPriority](#relationship-with-selectorspreadingpriority)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [API](#api)
- [Default rules](#default-rules)
- [How user stories are addressed](#how-user-stories-are-addressed)
- [Implementation Details](#implementation-details)
- [In the metadata/predicates/priorities flow](#in-the-metadatapredicatespriorities-flow)
- [In the scheduler framework](#in-the-scheduler-framework)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

With [Even Pods Spreading](/keps/sig-scheduling/20190221-even-pods-spreading.md),
workload authors can define spreading rules for their loads based on the topology of the clusters.
The spreading rules are defined in the `PodSpec`, thus they are tied to the pod.

We propose the introduction of configurable default spreading constraints, i.e. constraints that
can be defined at the cluster level and are applied to pods that don't explicitly define spreading constraints.
This way, all pods can be spread according to (likely better informed) constraints set by a cluster operator.
Workload authors don't need to know the topology of the cluster they will be running on to have their pods spread.
But if they do, they can still set their own spreading constraints if they have specific needs.

## Motivation

In order for a workload (pod) to use `EvenPodsSpreadPriority`:

1. Authors have to have an idea of the underlying topology.
1. PodSpecs become less portable if their spreading constraints are tailored to a specific topology.

On the other hand, cluster operators know the underlying topology of the cluster, which makes
them suitable to provide default spreading constraints for all workloads in their cluster.

### Goals

- Cluster operators can define default spreading constraints for pods that don't provide any
`pod.spec.topologySpreadConstraints`.
- Workloads are spread with the default constraints if they belong to the same service, replication controller,
replica set or stateful set, and if they don't define `pod.spec.topologySpreadConstraints`.
- Provide a k8s default for `topologySpreadConstraints` that produces a priority equivalent to
`SelectorSpreadPriority`, so that this algorithm can be removed from the default algorithms' provider.

### Non-Goals

- Set defaults for specific namespaces or according to other selectors.
- Removal of `ServiceSpreadingPriority` or `ServiceAntiAffinity` priorities.

## Proposal

### User Stories

#### Story 1

As a cluster operator, I want to set default spreading constraints for workloads in the cluster.
Currently, `SelectorSpreadPriority` provides a canned priority that spreads across nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"nodes and zones", or just "zones"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both. It gives a weight of 1/3 to nodes and 2/3 to zones.

and zones (`failure-domain.beta.kubernetes.io/zone`). However, the nodes in my cluster have
custom topology keys (for physical host, rack, etc.).

#### Story 2

As a workload author, I want to spread the workload in the cluster, but:
(1) I don't know the topology of the cluster I'm running on.
(2) I want to be able to run my PodSpec in different clusters (on-prem and cloud).

### Implementation Details/Notes/Constraints


#### Relationship with SelectorSpreadingPriority

Note that Default `topologySpreadConstraints` has a similar effect to `SelectorSpreadingPriority`.
Given that the latter is not configurable, they could return conflicting priorities, which
may not be the intention of an operator. On the other hand, a proper Default for `topologySpreadConstraints`
could provide the same priority as `SelectorSpreadingPriority`. Thus, there's no need for the
features to co-exist.

K8s will set Default `topologySpreadConstraints` and remove `SelectorSpreadingPriority`
from the k8s providers (`DefaultProvider` and `ClusterAutoscalerProvider`). The set
[default](#default-rules) will have a similar effect.

If an operator sets a Policy, these are the semantics of the presence of `SelectorSpreadingPriority`
and/or `EvenPodsSpreadPriority`:

| SelectorSpreading | EvenPodsSpread | Operator constraints | Valid | Pod spread constraints |
| :---------------: | :------------: | :------------------: | :---: | :---------------------------: |
| N | Y | Not provided | Yes | [k8s default](#default-rules) |
| N | Y | Provided | Yes | provided by operator |
| Y | Y | Not provided | Yes | [k8s default](#default-rules) |
| Y | Y | Provided | No | - |
| N | N | - | Yes | None |
| Y | N | - | No | - |

Selecting `SelectorSpreadingPriority` but not `EvenPodsSpreadPriority` in a policy is an invalid
configuration, because the latter is a requirement for the former.

### Risks and Mitigations

`EvenPodsSpreadPriority` has some overhead and we currently ensure that pods that don't use the
feature get minimally affected. After Default `topologySpreadConstraints` is rolled out,
all pods will run through the algorithms.
However, we should ensure that the running overhead is not significantly higher than
`SelectorSpreadingPriority` with k8s Default `topologySpreadConstraints`.

## Design Details

### API

A new structure called `TopologySpreadConstraint` is introduced to `KubeSchedulerConfiguration`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A new structure called `TopologySpreadConstraint` is introduced to `KubeSchedulerConfiguration`.
`TopologySpreadConstraint` or `DefaultTopologySpreadConstraints`?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is independent from its "Default" nature.


```go
type KubeSchedulerConfiguration struct {
....
// TopologySpreadConstraints defines topology spread constraints to be applied to pods
// that don't define any in `pod.spec.topologySpreadConstraints`. Pod selectors must
// be empty, as they are deduced from the resources that the pod belongs to
// (includes services, replication controllers, replica sets and stateful sets).
// If not specified, the scheduler applies the following default constraints:
// <default rules go here. See next section>
// +optional
TopologySpreadConstraints []corev1.TopologySpreadConstraint
....
}
```

Note the use of `k8s.io/api/core/v1.TopologySpreadConstraint`. During validation,
we verify that selectors are not defined.

### Default rules

These will be the default constraints for the cluster when the operator doesn't provide any:

```yaml
alculquicondor marked this conversation as resolved.
Show resolved Hide resolved
defaultTopologySpreadConstraints:
alculquicondor marked this conversation as resolved.
Show resolved Hide resolved
- maxSkew: 1
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 1
topologyKey: "failure-domain.beta.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
```

### How user stories are addressed

Let's say we have a cluster that has a topology based on physical hosts and racks.
Then, an operator can set the following scheduler configuration:

```yaml
apiVersion: componentconfig/v1alpha1
defaultTopologySpreadConstraints:
- maxSkew: 5
topologyKey: "example.com/topology/physical_host"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 15
topologyKey: "example.com/topology/rack"
whenUnsatisfiable: DoNotSchedule
```

Then, a workload author could have the following `ReplicaSet`:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: replicated_demo
spec:
replicas: 3
selector:
matchLabels:
app: demo
template:
metadata:
labels:
app: demo
spec:
containers:
- name: php-redis
image: example.com/registry/demo:latest
```

Note that the workload author didn't provide spreading constraints in the `pod.spec`.
The following spreading constraints will be derived from the constraints defined in ComponentConfig,
and will be applied at runtime:

```yaml
alculquicondor marked this conversation as resolved.
Show resolved Hide resolved
topologySpreadConstraints:
- maxSkew: 5
TopologyKey: "example.com/topology/physical_host"
WhenUnsatisfiable: ScheduleAnyway
selector:
matchLabels:
app: demo
- maxSkew: 15
TopologyKey: "example.com/topology/rack"
WhenUnsatisfiable: DoNotSchedule
selector:
matchLabels:
app: demo
```

Please note that these constraints are honored internally in the scheduler, but they are NOT
persisted in the PodSpec via API Server.

### Implementation Details

#### In the metadata/predicates/priorities flow

1. Calculate the spreading constraints for the pod as part of the metadata calculation.
Use the constraints provided by the pod or calculate the default ones if they don't provide any.
1. When running the predicates or priorities, use the constraints stored in the metadata.

#### In the scheduler framework

1. Calculate spreading constraints for the pod in the `PreFilter` extension point. Store them
in the `PluginContext`.
1. In the `Filter` and `Score` extension points, use the stored spreading constraints instead of
the ones defined by the pod.

### Test Plan

To ensure this feature to be rolled out in high quality. Following tests are mandatory:

- **Unit Tests**: All core changes must be covered by unit tests.
- **Integration Tests**: One integration test for the default rules and one for custom rules.
- **Benchmark Tests**: A benchmark test that compare the default rules against `SelectorSpreadingPriority`.
The performance should be as close as possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for jumping late into this discussion - but do we believe we can do that efficiently?
I'm asking in the context of large clusters - i.e. 5k nodes.

I want to ensure that we won't introduce a visible regression, because this may become a blocker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to do this as efficiently as the current logic, I don't see why we can't. I agree that it is a blocker if this will be causing performance regressions though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SelectorSpreadingPriority is currently part of the default algorithms that run. Theoretically, we are talking about the same problem, and then it should be possible to devise an algorithm for EvenPodsSpreadPriority with the same complexity.
We will be backing it with performance measurements, as seen in the Test Plan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor perhaps we should explicitly call that out in the KEP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Test Plan is part of the graduation criteria.


### Graduation Criteria

Alpha (v1.17):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should say that this feature is tired to even pods spread, so they will graduate together.


- [ ] Scheduler Component Config API changes.
- [ ] Default, validation and generated code.
- [ ] Priority Implementation.
- [ ] Predicate implementation.
- [ ] Test cases mentioned in the [Test Plan](#test-plan).

## Implementation History

- 2019-09-26: Initial KEP sent out for review.

## Alternatives

- Make the topology keys used in `SelectorSpreadingPriority` configurable.

While this moves the scheduler in the right direction, there are two problems:

1. We can only support one topology key.
1. It makes it hard for pods to override the operator-provided spreading rules.

- Implement a mutating controller that sets defaults.

This approach would likely allow us to provide a more flexible interface that
can set defaults for specific namespaces or with other selectors. However, that
wouldn't allow us to replace `SelectorSpreadingPriority` with
`EvenPodsSpreading`.