KEP: StoragePool API for Advanced Storage Placement #1347
Conversation
Welcome @cdickmann!
Thanks for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA). Please follow the instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple of minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Force-pushed from fba9e3f to 398a40e
```
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website
-->
```
I think all these comments above can be removed
```yaml
    - node2
  capacity:
    total: 124676572
```
Who installs and updates this CRD?
Can API extensions that core Kubernetes components (like the Kubernetes scheduler, should it start to use this information) depend on be a CRD, or do they have to be built into the apiserver?
This is an implementation detail, but it's still something that needs to be decided. We've gone from CRDs to in-tree APIs before (CSINodeInfo, CSIDriverInfo) and that caused quite a bit of extra work - it would be nice to avoid that by picking the right solution from the start.
I think this CRD can be installed in a similar way to the snapshot beta CRD: it can be installed by the cluster addon manager.
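For reference, a minimal sketch of what the CRD manifest to be installed (by an addon manager, an operator, or the driver deployment) might look like; the group and kind come from the example in this KEP, while the open schema and cluster scope are assumptions on my part:

```yaml
# Hypothetical CRD manifest for the proposed StoragePool API.
# Group/kind taken from the KEP example; schema and scope are assumptions.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: storagepools.storagepool.storage.k8s.io
spec:
  group: storagepool.storage.k8s.io
  names:
    kind: StoragePool
    plural: storagepools
    singular: storagepool
  scope: Cluster                     # assumption: cluster-scoped, like CSINode
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # placeholder schema; the real one would spell out spec/status fields
          x-kubernetes-preserve-unknown-fields: true
      subresources:
        status: {}
```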
Do you have some pointers for me about that? Are there language bindings involved and how does Kubernetes pull those in?
We discussed this a bit on the #csi Slack channel:
- the snapshot beta CRD is not used by core Kubernetes, so it's not an example that applies here
- the addon manager that @xing-yang referred to is https://github.com/kubernetes/kubernetes/tree/master/cluster/addons, which is deprecated and not a general solution for installing a CRD
To me it still looks like core API extensions need to be done in apiserver.
The reason that CSINode was transitioned from CRDs to in-tree APIs was mainly because kubelet depends on it.
Unless we can prove that we can't use CRDs to achieve our goals, we can't explain to sig-architecture why StoragePool has to be an in-tree API.
> The reason that CSINode was transitioned from CRDs to in-tree APIs was mainly because kubelet depends on it.
The same applies here once we want to use the capacity information also for scheduling pods because then the Kubernetes controller depends on the capacity API - see #1353.
```yaml
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
```
Who chooses the name and how? It's distributed, so there has to be a naming scheme that avoids conflicts.
That's a good question. Do we have precedent for the infrastructure surfacing entities and having to give them unique names? I would assume PV names fall into that category?
PVs get unreadable names like `pvc-24fcddf6-9178-4fe4-b71f-4417652bca5d`. One has to find the right object through a reference elsewhere (`PVC.Spec.VolumeName` is set to it) or by searching for the right one by its attributes.
OK, yeah, that's likely not a great path to follow :)
@cdickmann - for load balancer services, Kubernetes relies on the Service `uid` field to name/create resources at the infrastructure provider level; also +1 on PVs.
In the "New external controller for StoragePool" section, we mentioned that CSI driver will implement a ControllerListStoragePools() function. The controller will call that CSI function at startup to discover all pools which include names of the pools on the storage systems. We should probably consider appending a uuid to those names to avoid name collision.
```yaml
spec:
  driver: myDriver
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
```
If these parameters are opaque to anyone other than the CSI driver itself, how are they going to be used?
The section "Using StorageClass for placement" discusses that applications should copy these into the StorageClass, until a separate KEP automates placement.
Sorry, somehow I missed that.
Which binding mode is that StorageClass meant to have? For the intended use case it'll probably be the normal immediate binding mode, right?
With delayed binding mode you would be back at the problem described in #1353: a node might get chosen without Kubernetes knowing whether the node is suitable for the intended StoragePool.
That is correct, let me say that more explicitly. I added: "Most commonly this will lead to a "compute-follows-storage" design, i.e. it won't use `WaitForFirstConsumer` and instead have the PV be created first, and then have the Pod follow it based on which nodes the volume is accessible from."
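For illustration, a minimal sketch of such a StorageClass, assuming the driver and parameters from the StoragePool example in this KEP (the class name is made up):

```yaml
# Hypothetical StorageClass constructed by an application/operator from a
# StoragePool object. The parameters are copied verbatim from
# StoragePool.spec.parameters; Kubernetes treats them as opaque and only the
# CSI driver interprets them.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: storagepool1-class          # name chosen by the application (assumption)
provisioner: myDriver               # from StoragePool.spec.driver
parameters:
  csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
# Immediate binding ("compute-follows-storage"): the PV is created first and
# the Pod is then scheduled onto a node from which the volume is accessible.
volumeBindingMode: Immediate
```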
#### CSI changes

* Add a “StoragePoolSelector map[string]string” field in CreateVolumeRequest.
Can you clarify where the `external-provisioner` gets those values when calling `CreateVolumeRequest`? The text implies that this comes from the PVC, but then that's another Kubernetes API change that would have to be defined in this KEP.
I just noticed that you removed some PVC extension in commit 398a40e. I think that PVC extension needs to be added back. Without it, this section here and the `parameters` field in the CRD are hard to understand.
You could also remove those, but would there then be enough content left in this KEP to understand how the new API is meant to be used?
Indeed this is left over. I am removing this line. The sections "Usage of StoragePool objects by application" and "Using StorageClass for placement" should make clear how applications can construct a StorageClass from the StoragePool that allows for placement into the pool. The KEP also expresses that it expects future work in this area to further automate/simplify placement.
#### New external controller for StoragePool

CSI will be extended with a new ControllerListStoragePools() API, which returns all storage pools available via this CSI driver. A new external K8s controller uses this API upon CSI startup to learn about all storage pools and publish them into the K8s API server using the new StoragePool CRD.
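Pieced together from the fragments quoted elsewhere in this review, the object published by such a controller might look roughly like the following; the exact field layout is still under discussion:

```yaml
# Hypothetical StoragePool object as the external controller would publish it,
# assembled from the fragments quoted in this review.
apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
spec:
  driver: myDriver
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
status:
  accessibleNodes:
    - node2
  capacity:
    total: 124676572
```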
Do you mean a new sidecar? Does that new API go into the controller service and thus the new sidecar runs alongside that?
Note that CSI drivers which only support ephemeral inline volumes don't need to implement a controller service. In #1353 I propose to require that they provide `GetCapacity` and to run `external-provisioner` on each node in a mode where it just takes that information and puts it into the capacity object for the driver.
The decision to extend external-provisioner was based on the observation in the Kubernetes-CSI WG that the proliferation of sidecars is making the release process more complicated and a waste of resources because typically they do get deployed together.
This needs to be discussed further. One solution is to let external-provisioner handle it. Another solution is to have a separate storage pool sidecar handle it. The third solution is to follow the snapshot controller split model and have a common storage pool controller and a sidecar.
Personally I have no preference and will go with the guidance of the rest of the community.
As pointed out in kubernetes#1347, there may be other attributes than just capacity that need to be tracked (like failures). Also, there may be different storage pools within a single node. Avoiding "capacity" in the name of the data structures and treating it as just one field in the leafs of the data structure allows future extensions for those use cases without a major API change. For the same reason all fields in the leaf are now optional, with reasonable fallbacks if not set.
/cc
```yaml
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
status:
  accessibleNodes:
```
Have we considered a label selector for nodes instead of a list of names? Maybe allow both (not together though)?
Oh, somehow I managed to miss this field during my first pass through this KEP. I agree with @andrewsykim, relying exclusively on a list of node names is not going to scale for large clusters for pools which are accessible by all nodes.
In #1353 I propose to use a `v1.NodeSelector` exactly for this reason.
`accessibleNodes` here is under `status`, so it should be a list of node names, as the driver should know what nodes this pool is accessible to. We could have a node selector in `spec`.
Ah, good point about status. I don't see `accessibleNodes` in spec; is the consumer of the StoragePool API supposed to know that without input from the storage pool API?
All, I don't quite follow the discussion. StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector, it doesn't really know the node labels, right?
But why do we need the node selector in the spec? What would it express? Similar question for Andrew. My expectation is that the consumer (an application) would read the status of the storage pool and hence know the nodes.
You are probably thinking of pools which only have a very short list of nodes. That may be valid for the intended purpose of this KEP.
But I am looking at it from the perspective of #1353 where a StoragePool might be accessible from one half of a very large cluster. Then listing all nodes is not going to be very space-efficient. I also don't know how often nodes are added or removed - such changes might then force updating the StoragePool.
I think the ideal API should support both: NodeSelector and lists of node names. Then whoever creates those objects can choose the method that is more suitable.
> StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector, it doesn't really know the node labels, right?
The CSI driver also doesn't know what the nodes are called in Kubernetes. So whoever calls the CSI driver will have to translate between the naming used by the CSI driver and the naming used by Kubernetes.
OK, so you are suggesting that the controller may do something like add a label to all the nodes in "row 1" and then set up a NodeSelector accordingly. This way, neither the StoragePool nor any Pod that tries to affinitize to those nodes has to work off a huge list. I think that makes sense. I will update the proposal accordingly.
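For illustration, a rough sketch of how that could look; the field name, its placement under `status`, and the label key are assumptions for discussion, reusing the `v1.NodeSelector` shape:

```yaml
# Hypothetical alternative to listing node names: the external controller
# labels the nodes that can reach the pool and publishes a v1.NodeSelector.
# Field name/placement and the label key are assumptions, not settled API.
status:
  accessibleTopology:
    nodeSelectorTerms:
      - matchExpressions:
          - key: example.com/storage-row    # label applied by the controller
            operator: In
            values:
              - row1
```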
One more try for CLA
w00t!
/cc @thockin
The "user stories" and goals section gets revamped to include more examples and incorporate the idea behind kubernetes#1347. Storage capacity also needs to be updated for snapshotting or resizing.
The "user stories" and goals section gets revamped to include more examples and incorporate the idea behind kubernetes#1347. Storage capacity also needs to be updated for snapshotting or resizing.
The "user stories" and goals section gets revamped to include more examples and incorporate the idea behind kubernetes#1347. Storage capacity also needs to be updated for snapshotting or resizing.
#### Usage of StoragePool objects by application

Storage applications (e.g. MongoDB, Kafka, Minio, etc.), with their own operators (i.e. controllers), consume the data in the StoragePool. It is how they understand topology. For example, let's say we are in User Story 2, i.e. every Node has a bunch of local drives, each published as a StoragePool. The operator will understand that they are "local" drives by seeing that only one K8s Node has access to each of them, and it will see how many there are and of which size. Based on the mirroring/erasure coding scheme of the specific storage service, it can now determine how many PVCs to create, of which size, and on which of these StoragePools. Given that a StoragePool may model something as fragile as a single physical drive, it is a real possibility for a StoragePool to fail or be temporarily inaccessible. The StoragePool hence also communicates that fact, and the storage service operator can understand when and why volumes are failing and how to remediate (e.g. by putting an additional PVC on another pool and moving some data around).
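As a rough illustration (all names made up): an operator that decides to place one replica's data on storagePool1 would create a StorageClass as sketched earlier in this thread and then a PVC such as the following.

```yaml
# Hypothetical PVC created by a storage-service operator for one replica,
# bound to a StorageClass that targets a specific StoragePool (see the
# StorageClass sketch earlier in this thread; names are assumptions).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data-replica-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: storagepool1-class
  resources:
    requests:
      storage: 10Gi
```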
Instead of one StoragePool per local drive, is there a possibility of grouping a set of local drives, which would then make sense as a pool?
If the app operator really needs a StoragePool that maps to a single local drive, it could instead create a PV out of that local drive and use it in the app Deployment/STS.
/ok-to-test
/remove-lifecycle stale
/lifecycle frozen
@fejta-bot: Closed this PR.
This KEP aims to extend Kubernetes with a new storage pool concept which enables the underlying storage system to provide more control over placement decisions. These are expected to be leveraged by application operators for modern scale-out storage services (e.g. MongoDB, ElasticSearch, Kafka, MySQL, PostgreSQL, Minio, etc.) in order to optimize availability, durability, performance and cost.
The document lays out the background for why modern storage services benefit from finer control over placement, explains how SDS offers and abstracts such capabilities, and analyses the gaps in the existing Kubernetes APIs. Based on that, it derives detailed Goals and User Stories, proposes to introduce a new `StoragePool` CRD, and suggests how to use the existing `StorageClass` to steer placement to these storage pools.