diff --git a/keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md b/keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md
index 345529f793c9..d3e6481dfb99 100644
--- a/keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md
+++ b/keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md
@@ -106,58 +106,51 @@ enough storage capacity.
 
 ### Goals
 
-The goal of this KEP is to increase the chance of choosing a node for
-which volume creation will succeed by tracking the currently available
-capacity available through a CSI driver and using that information
-during pod scheduling.
-
-Although the issue is more common with storage that is local to a node
-(LVM, PMEM), it may also occur for storage providers that have
-capacity limits for other topology segments (rack, data center,
-region, etc.). Capacity tracking is meant to be generic enough to
-support all of these cases.
-
-Inline volumes currently do not have a standardized way of specify the
-size; this KEP will introduce such a field and ensure that capacity
-tracking also works for drivers which only support ephemeral inline
-volumes.
-
-For persistent volumes, capacity will be tracked per storage class, so
-additional parameters in a storage class are taken into account. This
-is important because those parameters might have a significant impact on
-how much space a new volume actually needs in the underlying storage
-system (for example, an LVM volume with mirroring needs more space
-than a LVM volume that is striped). For ephemeral volumes there is no
-storage class, so tracking is only done per node and provisioner.
+* Define an API for exposing information about storage that is
+  flexible enough for a variety of use cases and that can be extended
+  later on (see the sketch below).
+
+* The initial goal is to expose capacity information at the semantic
+  level that Kubernetes currently understands, i.e. in a way that
+  Kubernetes can compare capacity against the requested size of
+  volumes. This has to work for local storage, network-attached
+  storage and for drivers where the capacity depends on parameters in
+  the storage class.
+
+* Increase the chance of choosing a node for which volume creation
+  will succeed by tracking the capacity currently available through a
+  CSI driver and using that information during pod scheduling.
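+
+To make these goals concrete, the following sketch shows roughly what
+such an API object could carry. This is an illustration only, not the
+proposed API; all field names are placeholders:
+
+```go
+package storage
+
+import (
+	v1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/api/resource"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// CSIStoragePool sketches an object published for one topology
+// segment of one CSI driver. All fields are illustrative placeholders.
+type CSIStoragePool struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	// DriverName identifies the CSI driver reporting this pool.
+	DriverName string `json:"driverName"`
+
+	// NodeTopology selects the nodes which have access to this pool;
+	// this covers local as well as network-attached storage.
+	NodeTopology *v1.NodeSelector `json:"nodeTopology,omitempty"`
+
+	// StorageClassName is set when the capacity depends on the
+	// parameters of a storage class; it is empty for ephemeral
+	// inline volumes, which have no storage class.
+	StorageClassName string `json:"storageClassName,omitempty"`
+
+	// Capacity is the currently available capacity, in a form that
+	// can be compared against the requested size of a volume.
+	Capacity *resource.Quantity `json:"capacity,omitempty"`
+}
+```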
 
 ### Non-Goals
 
-Only CSI drivers will be supported.
-
-The Kubernetes scheduler could try to anticipate the effect of
-creating multiple volumes concurrently. But this depends on knowledge
-about internal driver details that Kubernetes doesn’t have, so pending
-volume operations are simply ignored when making scheduling decisions.
-
-Because of that and also for other reasons (capacity changed via
-operations outside of Kubernetes, like creating or deleting volumes,
-or expanding the storage), it is expected that pod scheduling may
-still end up with a node from time to time where volume creation then
-fails. Rolling back in this case is complicated and outside of the
-scope of this KEP. For example, a pod might use two persistent
-volumes, of which one was created and the other not, and then it
-wouldn’t be obvious whether the existing volume can or should be
-deleted.
-
-For persistent volumes that get created independently of a pod nothing
-changes: it’s still the responsibility of the CSI driver to decide how
-to create the volume and then communicate back through topology
-information where pods using that volume need to run.
-
-Inline volumes could be extended to reference a storage class, which
-then could be used to handle more complex situations (like the LVM
-mirror vs. striped case) also for inline volumes. But this is outside
-the scope of this KEP.
+* Only CSI drivers will be supported.
+
+* No attempts will be made to model how capacity will be affected by
+  pending volume operations. This would depend on internal driver
+  details that Kubernetes doesn’t have.
+
+* Because of that and also for other reasons (capacity changed via
+  operations outside of Kubernetes, like creating or deleting volumes,
+  or expanding the storage), it is expected that pod scheduling may
+  occasionally end up with a node where volume creation then fails.
+  Rolling back in this case is complicated and outside of the scope of
+  this KEP. For example, a pod might use two persistent volumes, of
+  which one was created and the other not, and then it wouldn’t be
+  obvious whether the existing volume can or should be deleted.
+
+* For persistent volumes that get created independently of a pod
+  nothing changes: it’s still the responsibility of the CSI driver to
+  decide how to create the volume and then communicate back through
+  topology information where pods using that volume need to run.
+  However, a CSI driver may use the capacity information exposed
+  through the proposed API to make its choice.
+
+* Inline volumes could be extended to reference a storage class, which
+  then could be used to handle more complex situations (like the LVM
+  mirror vs. striped case) also for inline volumes. But this is
+  outside the scope of this KEP.
 
 ## Proposal
 
@@ -176,6 +169,48 @@ backed by PMEM and provided by
 that is local to a node and thus the scheduler has to be aware whether
 enough of it is available on a node before assigning a pod to it.
 
+#### Different LVM configurations
+
+A user may want to choose between the higher performance of striped
+local disks and the higher fault tolerance of mirroring or RAID by
+selecting the corresponding storage class parameters of an LVM driver,
+for example [TopoLVM](https://github.com/cybozu-go/topolvm).
+
+The maximum size of the resulting volume then depends on the storage
+class and its parameters.
+
+#### Network-attached storage
+
+In contrast to local storage, network-attached storage can be made
+available on more than just one node. However, for technical reasons
+(a high-speed network for data transfer that only exists inside a
+single data center) or legal reasons (data must only be stored and
+processed in a single jurisdiction), availability may still be limited
+to a subset of the nodes in a cluster.
+
+#### Custom schedulers
+
+For situations that the Kubernetes scheduler does not handle, now or
+in the future, a [scheduler
+extender](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md)
+can influence pod scheduling based on the information exposed via the
+new API, for example as sketched below. The
+[topolvm-scheduler](https://github.com/cybozu-go/topolvm/blob/master/docs/design.md#how-the-scheduler-extension-works)
+currently does this with a driver-specific way of storing capacity
+information.
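+
+For illustration only, the filter endpoint of such an extender could
+look roughly like the sketch below. The structs are minimal stand-ins
+for the extender protocol types, and `hasEnoughCapacity` is a
+hypothetical lookup against the published capacity information:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"net/http"
+)
+
+// Minimal stand-ins for the scheduler extender protocol types.
+type extenderArgs struct {
+	NodeNames *[]string `json:"nodenames,omitempty"`
+}
+
+type extenderFilterResult struct {
+	NodeNames   *[]string         `json:"nodenames,omitempty"`
+	FailedNodes map[string]string `json:"failedNodes,omitempty"`
+}
+
+// hasEnoughCapacity stands in for a lookup of the capacity
+// information published for the given node.
+func hasEnoughCapacity(node string) bool { return true }
+
+// filter keeps the nodes that still have enough capacity and reports
+// the others as failed.
+func filter(w http.ResponseWriter, r *http.Request) {
+	var args extenderArgs
+	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	keep := []string{}
+	failed := map[string]string{}
+	if args.NodeNames != nil {
+		for _, node := range *args.NodeNames {
+			if hasEnoughCapacity(node) {
+				keep = append(keep, node)
+			} else {
+				failed[node] = "insufficient storage capacity"
+			}
+		}
+	}
+	_ = json.NewEncoder(w).Encode(extenderFilterResult{
+		NodeNames:   &keep,
+		FailedNodes: failed,
+	})
+}
+
+func main() {
+	http.HandleFunc("/filter", filter)
+	_ = http.ListenAndServe(":8080", nil)
+}
+```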
+
+#### Operators for applications
+
+Application operators for modern scale-out storage services
+(e.g. MongoDB, ElasticSearch, Kafka, MySQL, PostgreSQL, Minio) may
+want to control the creation of volumes carefully in order to optimize
+availability, durability, performance and cost. For more information
+about this, see the [StoragePool API for Advanced Storage Placement
+KEP](https://github.com/kubernetes/enhancements/pull/1347).
+
+Storage pools as introduced in this KEP enable the creation of such
+operators by providing them with the necessary information about
+storage in the cluster.
 
 ### Caching remaining capacity via the API server
 
@@ -556,12 +591,13 @@ delete `CSIStoragePool` objects:
 - when nodes change (for central provisioning)
 - when storage classes change (for persistent volumes)
 - when volumes were created or deleted (for central provisioning)
-- periodically, to detect changes in the underlying backing store; a
-  CSI spec extension would be necessary to avoid this polling (all cases)
+- when volumes are resized or snapshots are created or deleted (for persistent volumes)
+- periodically, to detect changes in the underlying backing store (all cases)
 
-That last point also covers ephemeral inline volumes, because creating
-or deleting those is not something that external-provisioner gets
-notified of.
+Because the sidecars are currently separate, external-provisioner is
+unaware of resizing and snapshotting. It is also not involved with
+ephemeral inline volumes. The periodic polling will catch up with
+changes caused by those operations.
 
 ### Using capacity information
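+
+At its core, using the capacity information during scheduling comes
+down to comparing the requested size of a volume against the capacity
+reported for a pool. A minimal sketch of that check (illustrative, not
+the actual scheduler code; a real check would also match driver name,
+topology and storage class):
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"k8s.io/apimachinery/pkg/api/resource"
+)
+
+// fits reports whether a volume of the requested size still fits into
+// a pool with the given remaining capacity.
+func fits(requested, available resource.Quantity) bool {
+	return available.Cmp(requested) >= 0
+}
+
+func main() {
+	requested := resource.MustParse("10Gi")
+	available := resource.MustParse("16Gi")
+	fmt.Println(fits(requested, available)) // true: enough capacity left
+}
+```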