storage capacity: address further review feedback
The "user stories" and goals section gets revamped to include more
examples and incorporate the idea behind
kubernetes#1347.

Storage capacity also needs to be updated for snapshotting or
resizing.
pohly committed Dec 13, 2019
1 parent 06064c7 commit 54dc326
Showing 1 changed file with 90 additions and 54 deletions.
@@ -106,58 +106,51 @@ enough storage capacity.

### Goals

* Define an API for exposing information about storage that is
flexible enough for a variety of use cases and that can be extended
later on. A rough sketch follows this list.

* The initial goal is to expose capacity information at the semantic
level that Kubernetes currently understands, i.e. in a way that
Kubernetes can compare capacity against the requested size of
volumes. This has to work for local storage, network-attached
storage and for drivers where the capacity depends on parameters in
the storage class.

* Increase the chance of choosing a node for which volume creation
will succeed by tracking the capacity that is currently available
through a CSI driver and using that information during pod
scheduling.
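
To make the first bullet more concrete, here is a rough sketch in Go
of what such an object could look like. The `CSIStoragePool` name
appears later in this KEP; all fields shown here are illustrative
placeholders, not the final API.

```go
// Hypothetical sketch of a capacity object; field names are
// illustrative placeholders, not the final API.
package v1alpha1

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CSIStoragePool reports how much capacity a CSI driver currently
// has left in one topology segment, optionally per storage class.
type CSIStoragePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// DriverName identifies the CSI driver that reported the capacity.
	DriverName string `json:"driverName"`

	// NodeTopology selects the nodes where this capacity is usable:
	// a single node for local storage, a rack or region for
	// network-attached storage.
	NodeTopology *metav1.LabelSelector `json:"nodeTopology,omitempty"`

	// StorageClassName is empty for ephemeral inline volumes, which
	// have no storage class.
	StorageClassName string `json:"storageClassName,omitempty"`

	// Capacity is the size of the largest volume that currently can
	// be created.
	Capacity *resource.Quantity `json:"capacity,omitempty"`
}
```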

### Non-Goals

* Only CSI drivers will be supported.

* No attempts will be made to model how capacity will be affected by
pending volume operations. This would depend on internal driver
details that Kubernetes doesn’t have.

* Because of that, and also for other reasons (capacity changed via
operations outside of Kubernetes, like creating or deleting volumes
or expanding the storage), pod scheduling may still occasionally
pick a node where volume creation then fails. Rolling back in this
case is complicated and outside the scope of this KEP. For example,
a pod might use two persistent volumes, of which one was created
and the other not, and then it wouldn’t be obvious whether the
existing volume can or should be deleted.

* For persistent volumes that get created independently of a pod
nothing changes: it’s still the responsibility of the CSI driver to
decide how to create the volume and then communicate back through
topology information where pods using that volume need to run.
However, a CSI driver may use the capacity information exposed
through the proposed API to make its choice.

* Inline volumes could be extended to reference a storage class, which
then could be used to handle more complex situations (like the LVM
mirror vs. striped case) for inline volumes as well. But this is
outside the scope of this KEP.

## Proposal

@@ -176,6 +169,48 @@ backed by PMEM and provided by
that is local to a node and thus the scheduler has to be aware whether
enough of it is available on a node before assigning a pod to it.

#### Different LVM configurations

A user may want to choose between the higher performance of local
disks and higher fault tolerance by selecting either striping or
mirroring/RAID in the storage class parameters of an LVM driver,
for example [TopoLVM](https://github.com/cybozu-go/topolvm).

The maximum size of the resulting volume then depends on the storage
class and its parameters.
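
As a sketch of how two such classes could be defined with Go client
types (the driver name and the parameter names are made up for
illustration; a real driver like TopoLVM defines its own parameters):

```go
package example

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Two storage classes for the same hypothetical LVM driver. A
// mirrored volume consumes twice the raw space of a striped one, so
// the maximum volume size that can still be created differs per
// class even though the underlying volume group is the same.
var (
	striped = storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "lvm-striped"},
		Provisioner: "lvm.example.com",
		Parameters:  map[string]string{"type": "striped"},
	}

	mirrored = storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "lvm-mirrored"},
		Provisioner: "lvm.example.com",
		Parameters:  map[string]string{"type": "raid1"},
	}
)
```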

#### Network attached storage

In contrast to local storage, network attached storage can be made
available on more than just one node. However, for technical reasons
(high-speed network for data transfer inside a single data center) or
political reasons (data must only be stored and processed in a single
jurisdiction), availability may still be limited to a subset of the
nodes in a cluster.

#### Custom schedulers

For situations that the Kubernetes scheduler does not handle, now or
in the future, a [scheduler
extender](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md)
can influence pod scheduling based on the information exposed via the
new API. The
[topolvm-scheduler](https://github.com/cybozu-go/topolvm/blob/master/docs/design.md#how-the-scheduler-extension-works)
currently does that with a driver-specific way of storing capacity
information.
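
A minimal sketch of such an extender, assuming simplified stand-ins
for the extender protocol types and a hypothetical
`lookupAvailableCapacity` helper that would read the capacity objects
exposed through the new API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the scheduler extender protocol types.
// The real ExtenderArgs carries the full Pod object; the requested
// volume size would have to be derived from it and is inlined here
// only to keep the sketch short.
type ExtenderArgs struct {
	NodeNames      *[]string `json:"nodenames,omitempty"`
	RequestedBytes int64     `json:"requestedBytes"`
}

type ExtenderFilterResult struct {
	NodeNames   *[]string         `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

// lookupAvailableCapacity is hypothetical: it would read the capacity
// published for the given node from the API server.
func lookupAvailableCapacity(node string) int64 { return 0 }

// filter keeps only the nodes that report enough remaining capacity.
func filter(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := ExtenderFilterResult{FailedNodes: map[string]string{}}
	fit := []string{}
	if args.NodeNames != nil {
		for _, node := range *args.NodeNames {
			if lookupAvailableCapacity(node) >= args.RequestedBytes {
				fit = append(fit, node)
			} else {
				result.FailedNodes[node] = "not enough storage capacity"
			}
		}
	}
	result.NodeNames = &fit
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```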

#### Operators for applications

Application operators for modern scale-out storage services
(e.g. MongoDB, Elasticsearch, Kafka, MySQL, PostgreSQL, MinIO)
may want to control the creation of volumes carefully in order to
optimize availability, durability, performance, and cost. For more
information
about this, see the [StoragePool API for Advanced Storage Placement
KEP](https://github.com/kubernetes/enhancements/pull/1347).

Storage pools as introduced in this KEP enable such operators by
providing them with the necessary information about storage in the
cluster.

### Caching remaining capacity via the API server

@@ -556,12 +591,13 @@ delete `CSIStoragePool` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
- when volumes were created or deleted (for central provisioning)
- when volumes are resized or snapshots are created or deleted (for persistent volumes)
- periodically, to detect changes in the underlying backing store (all cases)

Because the sidecars are separate processes, external-provisioner is
not aware of resizing and snapshotting, and it is also not involved
in the creation or deletion of ephemeral inline volumes. The periodic
polling will catch up with changes caused by those operations.
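
A sketch of such a polling loop, assuming the standard CSI
`GetCapacity` RPC and a hypothetical `updatePool` helper for writing
the result to the API server; the storage class names and parameters
are reused from the hypothetical LVM example above:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// updatePool is hypothetical: it would write the reported capacity
// into the CSIStoragePool object for this storage class and topology
// segment.
func updatePool(storageClass string, availableBytes int64) {}

func main() {
	// Connect to the CSI driver's controller service.
	conn, err := grpc.Dial("unix:///csi/csi.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	client := csi.NewControllerClient(conn)

	for range time.Tick(time.Minute) {
		// One GetCapacity call per storage class, because class
		// parameters (striped vs. mirrored) influence the result.
		for class, params := range map[string]map[string]string{
			"lvm-striped":  {"type": "striped"},
			"lvm-mirrored": {"type": "raid1"},
		} {
			resp, err := client.GetCapacity(context.Background(),
				&csi.GetCapacityRequest{Parameters: params})
			if err != nil {
				log.Printf("GetCapacity for %s: %v", class, err)
				continue
			}
			updatePool(class, resp.AvailableCapacity)
		}
	}
}
```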

### Using capacity information
