
KEP: StoragePool API for Advanced Storage Placement #1347

Closed

Conversation

cdickmann

This KEP aims to extend Kubernetes with a new storage pool concept that enables the underlying storage system to provide more control over placement decisions. These pools are expected to be leveraged by application operators for modern scale-out storage services (e.g. MongoDB, Elasticsearch, Kafka, MySQL, PostgreSQL, MinIO, etc.) in order to optimize availability, durability, performance, and cost.

The document lays out the background for why modern storage services benefit from finer control over placement, explains how software-defined storage (SDS) offers and abstracts such capabilities, and analyzes the gaps in the existing Kubernetes APIs. Based on that, it derives detailed Goals and User Stories, proposes a new StoragePool CRD, and suggests how the existing StorageClass can be used to steer placement to these storage pools.
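For orientation while reading the review threads below, here is a rough sketch of what a StoragePool object could look like, pieced together from the YAML fragments quoted in the comments further down. The field layout and values are illustrative only, not an approved API:

```yaml
# Illustrative sketch, assembled from fragments quoted later in this thread.
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
spec:
  driver: myDriver                 # CSI driver that surfaces this pool
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue  # opaque to everything but the driver
status:
  accessibleNodes:                 # nodes from which the pool is reachable
    - node1                        # node names invented for this example
    - node2
  capacity:
    total: 124676572               # unit assumed to be bytes
```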

@k8s-ci-robot
Contributor

Welcome @cdickmann!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @cdickmann. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2019
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 31, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cdickmann
To complete the pull request process, please assign saad-ali
You can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Oct 31, 2019
@cdickmann cdickmann force-pushed the cdickmann-storagepool-kep branch from fba9e3f to 398a40e on November 1, 2019 09:24
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

-->
Contributor

I think all these comments above can be removed

```yaml
    - node2
  capacity:
    total: 124676572
```
Contributor

Who installs and updates this CRD?

Can API extensions that core Kubernetes components depend on (like the Kubernetes scheduler, should it start to use this information) be provided as a CRD, or do they have to be built into the apiserver?

This is an implementation detail, but it's still something that needs to be decided. We've gone from CRDs to in-tree APIs before (CSINodeInfo, CSIDriverInfo) and that caused quite a bit of extra work - it would be nice to avoid that by picking the right solution from the start.

Contributor

I think this CRD can be installed in a similar way to the snapshot beta CRD, i.e. by the cluster addon manager.

Contributor

Do you have some pointers for me about that? Are there language bindings involved and how does Kubernetes pull those in?

Contributor

We discussed this a bit on the #csi Slack channel:

To me it still looks like core API extensions need to be done in the apiserver.

@xing-yang Nov 12, 2019
Contributor

CSINode was transitioned from a CRD to an in-tree API mainly because kubelet depends on it.

Unless we can prove that we can't use CRDs to achieve our goals, we can't explain to sig-architecture why StoragePool has to be an in-tree API.

Contributor

CSINode was transitioned from a CRD to an in-tree API mainly because kubelet depends on it.

The same applies here once we want to use the capacity information also for scheduling pods because then the Kubernetes controller depends on the capacity API - see #1353.

```yaml
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
```
Contributor

Who chooses the name and how? It's distributed, so there has to be a naming scheme that avoids conflicts.

Author

That's a good question. Do we have precedent for the infrastructure surfacing entities and having to give them unique names? I would assume PV names fall into that category?

Contributor

PVs get unreadable names like pvc-24fcddf6-9178-4fe4-b71f-4417652bca5d. One has to find the right object through a reference elsewhere (PVC.Spec.VolumeName is set to it) or by searching for the right one by its attributes.

Author

OK, yeah, that's likely not a great path to follow :)

Member

@cdickmann - for load balancer services, Kubernetes relies on the Service uid field to name/create resources at the infrastructure provider level; also +1 on PVs.

Contributor

In the "New external controller for StoragePool" section, we mentioned that CSI driver will implement a ControllerListStoragePools() function. The controller will call that CSI function at startup to discover all pools which include names of the pools on the storage systems. We should probably consider appending a uuid to those names to avoid name collision.

```yaml
spec:
  driver: myDriver
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
```
Contributor

If these parameters are opaque to anyone other than the CSI driver itself, how are they going to be used?

Author

The section "Using StorageClass for placement" discusses that applications should copy these into the StorageClass, until a separate KEP automates placement.

Contributor

Sorry, somehow I missed that.

Which binding mode is that StorageClass meant to have? For the intended use case it'll probably be the normal immediate binding mode, right?

With delayed binding mode you would be back at the problem described in #1353: a node might get chosen without Kubernetes knowing whether the node is suitable for the intended StoragePool.

Author

That is correct; let me say it more explicitly. I added: "Most commonly this will lead to a "compute-follows-storage" design, i.e. it won't use WaitForFirstConsumer and instead have the PV be created first, and then have the Pod follow it based on which nodes the volume is accessible from."
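As a concrete illustration of that compute-follows-storage flow, here is a minimal sketch of the kind of StorageClass the KEP has in mind, assuming the opaque parameters were copied verbatim from a StoragePool (the class name is invented for this example):

```yaml
# Sketch only: a StorageClass whose parameters were copied from a StoragePool,
# steering provisioning to that pool.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pool1-class                # invented name
provisioner: myDriver
parameters:
  csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
# Immediate binding rather than WaitForFirstConsumer: the PV is created first,
# and the Pod then follows it to the nodes the volume is accessible from.
volumeBindingMode: Immediate
```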


#### CSI changes

* Add a “StoragePoolSelector map[string]string” field in CreateVolumeRequest.
Contributor

Can you clarify where the external-provisioner gets those values when calling CreateVolumeRequest? The text implies that this comes from the PVC, but then that's another Kubernetes API change that would have to be defined in this KEP.

Contributor

I just noticed that you removed some PVC extension in commit 398a40e. I think that PVC extension needs to be added back. Without it, this section here and the parameters field in the CRD are hard to understand.

You could also remove those, but is there then enough content left in this KEP to understand how the new API is meant to be used?

Author

Indeed, this is left over. I am removing this line. The sections "Usage of StoragePool objects by application" and "Using StorageClass for placement" should make clear how applications can construct a StorageClass from the StoragePool that allows for placement into the pool. The KEP also expresses that it expects future work in this area to further automate/simplify placement.


#### New external controller for StoragePool

CSI will be extended with a new ControllerListStoragePools() API, which returns all storage pools available via this CSI driver. A new external K8s controller uses this API upon CSI startup to learn about all storage pools and publish them into the K8s API server using the new StoragePool CRD.
Contributor

Do you mean a new sidecar? Does that new API go into the controller service, so that the new sidecar would run alongside it?

Note that CSI drivers which only support ephemeral inline volumes don't need to implement a controller service. In #1353 I propose to require that they provide GetCapacity and to run external-provisioner on each node in a mode where it just takes that information and puts it into the capacity object for the driver.

The decision to extend external-provisioner was based on the observation in the Kubernetes-CSI WG that the proliferation of sidecars makes the release process more complicated and wastes resources, because the sidecars typically get deployed together anyway.

Contributor

This needs to be discussed further. One solution is to let external-provisioner handle it. Another is to have a separate storage pool sidecar handle it. A third is to follow the snapshot controller split model and have a common storage pool controller plus a sidecar.

Author

Personally I have no preference and will go with the guidance of the rest of the community.

pohly added a commit to pohly/enhancements that referenced this pull request Nov 5, 2019
As pointed out in
kubernetes#1347, there may be
other attributes than just capacity that need to be tracked (like
failures). Also, there may be different storage pools within a single
node.

Avoiding "capacity" in the name of the data structures and treating it
as just one field in the leafs of the data structure allows future
extensions for those use cases without a major API change. For the
same reason all fields in the leaf are now optional, with reasonable
fallbacks if not set.
@yastij
Member

yastij commented Nov 6, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from yastij November 6, 2019 15:58
```yaml
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
status:
  accessibleNodes:
```
Member

Have we considered a label selector for nodes instead of a list of names? Maybe allow both (though not together)?

Contributor

Oh, somehow I managed to miss this field during my first pass through this KEP. I agree with @andrewsykim: relying exclusively on a list of node names is not going to scale in large clusters for pools which are accessible from all nodes.

In #1353 I propose to use a v1.NodeSelector exactly for this reason.

Contributor

accessibleNodes here is under status, so it should be a list of node names, as the driver should know which nodes this pool is accessible from.

We could have a node selector in the spec.

Member

Ah, good point about status. I don't see accessibleNodes in the spec; is the consumer of the StoragePool API supposed to know that without input from the storage pool API?

Author

All, I don't quite follow the discussion. StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector? It doesn't really know the node labels, right?

But why do we need the node selector in the spec? What would it express? Similar question for Andrew. My expectation is that the consumer (an application) would read the status of the storage pool and hence know the nodes.

Contributor

You are probably thinking of pools which only have a very short list of nodes. That may be valid for the intended purpose of this KEP.

But I am looking at it from the perspective of #1353 where a StoragePool might be accessible from one half of a very large cluster. Then listing all nodes is not going to be very space-efficient. I also don't know how often nodes are added or removed - such changes might then force updating the StoragePool.

I think the ideal API should support both: NodeSelector and lists of node names. Then whoever creates those objects can choose the method that is more suitable.

Contributor

StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector? It doesn't really know the node labels, right?

The CSI driver also doesn't know what the nodes are called in Kubernetes. So whoever calls the CSI driver will have to translate between the naming used by the CSI driver and the naming used by Kubernetes.

Author

OK, so you are suggesting that the controller may do something like add a label to all the nodes in "row 1" and then set up a NodeSelector accordingly. This way, neither the StoragePool nor any Pod that tries to affinitize to those nodes has to work off a huge list. I think that makes sense. I will update the proposal accordingly.
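A rough sketch of what that could end up looking like, assuming the external controller labels every node that can reach the pool and publishes a v1.NodeSelector instead of an exhaustive node list (the status field name and the label key below are made up for illustration):

```yaml
# Hypothetical: nodes in "row 1" get an invented label, and the pool's status
# carries a selector rather than enumerating node names.
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
spec:
  driver: myDriver
status:
  accessibleNodeSelector:          # field name not final, illustration only
    nodeSelectorTerms:
      - matchExpressions:
          - key: example.com/storage-row   # invented label key
            operator: In
            values: ["1"]
```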

@cdickmann
Author

One more try for CLA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 12, 2019
@dims
Member

dims commented Nov 12, 2019

w00t! authorized

@yastij
Member

yastij commented Nov 12, 2019

/cc @thockin

@k8s-ci-robot k8s-ci-robot requested a review from thockin November 12, 2019 12:28
pohly added a commit to pohly/enhancements that referenced this pull request Dec 13, 2019
The "user stories" and goals section gets revamped to include more
examples and incorporate the idea behind
kubernetes#1347.

Storage capacity also needs to be updated for snapshotting or
resizing.
CSI will be extended with a new ControllerListStoragePools() API, which returns all storage pools available via this CSI driver. A new external K8s controller uses this API upon CSI startup to learn about all storage pools and publish them into the K8s API server using the new StoragePool CRD.

#### Usage of StoragePool objects by application
Storage applications (e.g. MongoDB, Kafka, MinIO, etc.), with their own operators (i.e. controllers), consume the data in the StoragePool; it is how they understand topology. For example, let's say we are in User Story 2, i.e. every Node has a number of local drives, each published as a StoragePool. The operator will recognize that they are "local" drives by seeing that only one K8s Node has access to each of them, and it will see how many there are and of what size. Based on the mirroring/erasure-coding scheme of the specific storage service, it can then work out how many PVCs to create, of what size, and on which of these StoragePools. Given that a StoragePool may model something as fragile as a single physical drive, it is a real possibility for a StoragePool to fail or become temporarily inaccessible. The StoragePool therefore also communicates that fact, so the storage service operator can understand when and why volumes are failing, and how to remediate (e.g. by putting an additional PVC on another pool and moving some data around).
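To make the local-drive case above concrete, such an object might look roughly like this (name, node, and size invented for illustration):

```yaml
# Illustrative: one StoragePool per local drive; a single entry in
# accessibleNodes is what tells the operator that the drive is "local".
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: node1-ssd-0                # invented naming scheme
spec:
  driver: myDriver
status:
  accessibleNodes:
    - node1                        # exactly one node, i.e. a local drive
  capacity:
    total: 480103981056            # unit assumed to be bytes (~480 GB)
  # The KEP also expects a pool to report failure/inaccessibility; those fields
  # are not shown in the quoted excerpts, so none are sketched here.
```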


Instead of one StoragePool per local drive, is there a possibility of grouping a set of local drives, which would then make sense as a pool?
If the app operator really needs a StoragePool per local drive, it could instead create a PV out of that local drive and use it in the app Deployment/STS.

@xing-yang
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 24, 2020
@k8s-ci-robot
Contributor

@cdickmann: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-enhancements-verify | 1fe6996 | link | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@xing-yang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 21, 2020
@xing-yang
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 10, 2020
@fejta-bot

Enhancement issues opened in kubernetes/enhancements should never be marked as frozen.
Enhancement Owners can ensure that enhancements stay fresh by consistently updating their states across release cycles.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 10, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 8, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
